Hi Daryl,

Re RAM amount - it depends on your particular index. DB size alone doesn't tell us much: it depends on how you'll analyze/tokenize/index the data, what the term distribution is like, etc.

Re master-slave - look for the Collection Replication page on the Wiki.

Re real-time IM-like presence - perhaps you can do it all in RAM (RAMDirectory), or even InstantiatedIndex (in Lucene contrib). Or perhaps you can post-process results with an "is user X online?" type of lookup in some fast in-memory data structure that's not necessarily Solr/Lucene.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Dev Team <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, September 26, 2008 10:38:39 AM
> Subject: Re: Bunch of questions regarding enterprise configuration
>
> Hi Otis,
> First off, thanks for your complete reply! It certainly has a lot of
> good info in it.
> To address some of the questions you asked, please see below:
>
> On Fri, Sep 26, 2008 at 1:36 AM, Otis Gospodnetic <
> [EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > Your questions don't have simple answers, but here are some quick
> > ones.
> >
> > ----- Original Message ----
> > > I'm new to Solr, and have been reading through documentation
> > > off-and-on for days, but still have some unanswered
> > > basic/fundamental questions that have a huge impact on my
> > > implementation approach.
> > > I am thinking of moving my company's web app's main search engine
> > > over to Solr. My goal is to index 5M user records of a social
> > > networking website (most of which have a free-form text portion,
> > > but the majority of data is non-text) and have complex searches
> > > against those records back in the sub-0.5s range of time. I have
> > > just under 10 application servers each running my web-app, which
> > > is mostly stateless except for things like users' online status.
> >
> > How many servers have you got for running Solr? (assuming you don't
> > intend to put Solr on the same servers as your webapp, as it sounds
> > like each webapp is maxing out its server)
>
> Right now, 0.
> I'm still investigating how to get it working, not yet close
> to estimating load. Once we get things to the testing stage, I'm sure
> we'll have an idea of what kind of production hardware we'll need to
> purchase/reuse/whatever.
>
> > > Forgive me for asking so many in one email; feel free to change
> > > the subject line and reply about individual items. Here are the
> > > questions:
> > >
> > > 1. How to best organize a web-app that normally goes to a
> > > search-db to use Solr instead?
> > > a) Set up independent Solr instance, make app search it just like
> > > it used to search database.
> > > b) Integrate Solr right into app, so that app+solr get deployed
> > > together (this is very possible, as our app is Java). But we run
> > > several instances of the app so we'd be running several Solr
> > > instances too.
> > > c) Set up independent Solr instance + our code (plugins or
> > > whatever?), have web clients request DIRECTLY to the Solr app and
> > > have Solr return search results directly.
> > > d) Other configuration...?
> >
> > a) Set up Solr master + N slaves on a separate set of boxes and
> > access them remotely from your webapp. If your webapp is a Java
> > webapp, use SolrJ. Alternatively, if your webapp servers have enough
> > spare CPU cycles and enough RAM, you could make those same servers
> > your 10 Solr slaves.
>
> I see, thank you.
>
> Out of curiosity, how much RAM are we talking? (My database has about
> 6 gigs of data that we'd want to index for search.) The reason I ask
> is because my webapp servers do have CPU/RAM to spare.
>
> > > 2. How to best handle Enums?
> > > We have a bunch of enumerated data (say, for example, shoe types).
> > > What "fieldType" should we use to index them?
> > > Should I index them as text? If I index "sandals", then when
> > > somebody searches for the keyword "sandals" I'd want the documents
> > > that have shoeType=Sandals (e.g., enum-value of "07") to show up.
> >
> > Sounds like "string" type.
>
> Okay I'll look into it more, thanks.
>
> > > 3. Enums are related, sort-of:
> > > Sometimes our enumerated data is somewhat related. For example (in
> > > the "shoe types" example), let's say we have "sandals", well,
> > > "crocs" are not sandals, but are SORT-OF like sandals, so we'd
> > > like them to match but score lower than an exact sandal match. How
> > > do we do this? (Is this "Changing Similarity" or is that barking
> > > up the wrong tree?)
> >
> > One option is to have a separate sort_of_like field where you stick
> > various sort-of-like "synonyms". If you are using DisMax you can
> > include that sort_of_like field in the config but give it less boost
> > than the "main" field. You could use index-time synonym injection
> > for that sort_of_like field.
>
> Wow, okay... I'll definitely have to do a bit more reading to
> understand what you just said. ;)
>
> > > 4. How to manage "Tags" data?
> > > Users on my site can enter "tags", and we want to be able to build
> > > tag-clouds, follow tag-links, and whatnot. Should I index tags as
> > > just a fieldType of "text"?
> >
> > "text" is fine if you don't want tags to be exact. Assume
> > "photography" and "photo" have the same stem. Do you want a user
> > clicking on "photo" to get items tagged as "photography", too? If
> > so, use text, else consider string. Treat multi-word tags as
> > phrases. Example:
> > http://www.simpy.com/user/otis/tag/%22information+retrieval%22
>
> Hmm... You raise a good question in there. Thanks for that info, I'll
> look into it more.
>
> > > 5. How do I load the data?
> > > Loading all the data from the database (to anything!) takes a big
> > > chunk of time. Should I export it from the database once and then
> > > load it into Solr using CSV?
> >
> > If export is not slow, then upload via CSV should be faster than
> > adding docs to Solr "the usual way".
> > But judging from your question below, you
> > probably don't need the CSV approach.
> >
> > > Follow-up: How would I manage loading this/new data on an ongoing
> > > basis? The site's users are creating data all the time, the bulk
> > > of which is old (i.e. before today; could be bulk loaded), but
> > > after an initial bulk load it's ongoing data. Should I be just
> > > building a huge Solr index on the filesystem and making sure I
> > > don't lose it?
> >
> > Sounds like one-time bulk indexing followed by continuous
> > incremental indexing. You can have 2 masters to make things more
> > fault-tolerant. Or you can store your index on a SAN. Or you can
> > just count on your N Solr slaves acting as the "backup" (replicas)
> > of your index, though they'll always be a little behind the master
> > index.
>
> Okay, that's what I thought.
>
> Where can I learn more about this master-slave configuration for Solr?
>
> > > 6. How do I manage real-time data?
> > > For example, let's say I have users coming online and offline all
> > > the time, and I need to be able to search my set of "online
> > > users". How should I go about this? Can this just be handled
> > > through index updates?
> >
> > Yes, though there is no real-time search in Solr just yet. There is
> > always a bit of delay because of index replication
> > (master=>slaves), index and cache warmups.
>
> Hmm... yeah, for online-data, the "delay" is a huge problem. When
> people come online and they appear offline to others --even within a
> minute-- then they'll log off the site and probably never come back.
> It's the equivalent of logging into an IM system. --Actually not just
> "equivalent", since we do literally have an IM system. It would be
> like logging into, say, MSN-Messenger but not actually "coming online"
> until later due to a delay.
>
> Normally I would think to keep this real-time data separate, but we
> actually do *searches* against the online members as well.
> We also like to sort our
> normal searches using the "online status" as a sorting criterion.
>
> Thanks again very much for your input.
>
> Sincerely,
>
> Daryl.
>
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
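
A rough sketch of the "post-process results against a fast in-memory structure" idea from the top of this reply, in Java since the webapp is Java. Everything here is an assumption for illustration: the class name PresenceRegistry and its methods are hypothetical and not part of Solr, Lucene, or SolrJ, and the search is assumed to return user IDs as strings in relevance order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory presence registry; not a Solr or Lucene class. */
final class PresenceRegistry {
    // Concurrent set so login/logout threads and search threads can share it.
    private final Set<String> online = ConcurrentHashMap.newKeySet();

    void markOnline(String userId)  { online.add(userId); }
    void markOffline(String userId) { online.remove(userId); }
    boolean isOnline(String userId) { return online.contains(userId); }

    /** Keeps only hits whose user is online right now, preserving the incoming order. */
    List<String> onlineOnly(List<String> rankedUserIds) {
        List<String> result = new ArrayList<>();
        for (String id : rankedUserIds) {
            if (isOnline(id)) {
                result.add(id);
            }
        }
        return result;
    }

    /** Moves online users ahead of offline ones; List.sort is stable, so the
     *  incoming (relevance) order is kept within each group. */
    List<String> onlineFirst(List<String> rankedUserIds) {
        List<String> result = new ArrayList<>(rankedUserIds);
        // Key is false for online users, and false sorts before true.
        result.sort(Comparator.comparing((String id) -> !isOnline(id)));
        return result;
    }
}
```

The webapp would call markOnline/markOffline on login and logout, then pass the IDs from a search response through onlineOnly (for "search online users") or onlineFirst (for "sort by online status"). One caveat with this approach: filtering after the search means a page of results can come back with fewer hits than requested, so you may need to over-fetch from Solr.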