Re: Bunch of questions regarding enterprise configuration

Otis Gospodnetic Fri, 26 Sep 2008 07:53:20 -0700

Hi Daryl,

Re RAM amount - depends on your particular index (DB size doesn't help - who 
knows how you'll analyze/tokenize/index data, what term distribution is like, 
etc.)


Re master-slave - look for Collection Replication page on the Wiki

Re real-time IM-like presence - perhaps you can do it all in RAM(Directory), or 
even InstantiatedIndex (in Lucene contrib). Or perhaps you can post-process 
results with an "is user X online?" type of lookup in some fast in-memory 
data structure that's not necessarily Solr/Lucene.
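
A minimal sketch of that post-processing idea, in Python for brevity. Everything here is an assumption for illustration: the names online_users, mark_online, mark_offline, online_first and only_online are hypothetical, and the presence set would be fed by your IM/session layer rather than by Solr. It only re-orders or filters the IDs Solr already returned.

```python
# Hypothetical in-memory presence store, kept outside Solr/Lucene.
# Assumption: the IM/session layer calls mark_online/mark_offline.
online_users = set()


def mark_online(user_id):
    """Record that a user has just come online."""
    online_users.add(user_id)


def mark_offline(user_id):
    """Record that a user has gone offline (no error if unknown)."""
    online_users.discard(user_id)


def online_first(result_ids):
    """Re-order search hits so online users come first.

    sorted() is stable, so the original relevance order is preserved
    within the online group and within the offline group.
    """
    return sorted(result_ids, key=lambda uid: uid not in online_users)


def only_online(result_ids):
    """Keep only the hits whose user is currently online."""
    return [uid for uid in result_ids if uid in online_users]
```

Note that this only sees the hits you fetched, so an "online users only" search may need to request more rows than one page to fill the result list.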


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Dev Team <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, September 26, 2008 10:38:39 AM
> Subject: Re: Bunch of questions regarding enterprise configuration
> 
> Hi Otis,
>      First off, thanks for your complete reply! It certainly has a lot of
> good info in it.
>      To address some of the questions you asked, please see below:
> 
> On Fri, Sep 26, 2008 at 1:36 AM, Otis Gospodnetic <
> [EMAIL PROTECTED]> wrote:
> 
> > Hi,
> >
> > Your questions don't have simple answers, but here are some quick ones.
> >
> >
> >
> >
> > ----- Original Message ----
> > > I'm new to Solr, and have been reading through documentation off-and-on
> > for
> > > days, but still have some unanswered basic/fundamental questions that
> > have a
> > > huge impact on my implementation approach.
> > > I am thinking of moving my company's web app's main search engine over to
> > > Solr. My goal is to index 5M user records of a social networking website
> > > (most of which have a free-form text portion, but the majority of  data
> > is
> > > non-text) and have complex searches against those records back in the
> > > sub-0.5s range of time. I have just under 10  application servers each
> > > running my web-app, which is mostly stateless except for things like
> > users'
> > > online status.
> >
> > How many servers have you got for running Solr? (assuming you don't intend
> > to put Solr on the same servers as your webapp, as it sounds like each
> > webapp is maxing out its server)
> 
> 
> Right now, 0. I'm still investigating how to get it working, not yet close
> to estimating load. Once we get things to the testing stage, I'm sure we'll
> have an idea of what kind of production hardware we'll need to
> purchase/reuse/whatever.
> 
> 
> >
> >
> > > Forgive me for asking so many in one email; feel free to change subject
> > line
> > > and reply about individual items. Here are the questions:
> > >
> > > 1. How to best organize a web-app that normally goes to a search-db to
> > use
> > > Solr instead?
> > > a) Set up independent Solr instance, make app search it just like it used
> > to
> > > search database.
> > > b) Integrate Solr right into app, so that app+solr get deployed together
> > > (this is very possible, as our app is Java). But we run  several
> > instances
> > > of the app so we'd be running several Solr instances too.
> > > c) Set up independent Solr instance + our code (plugins or whatever?),
> > have
> > > web clients request DIRECTLY to the Solr app and have  Solr return search
> > > results directly.
> > > d) Other configuration...?
> >
> > a) Set up Solr master + N slaves on a separate set of boxes and access them
> > remotely from your webapp.  If your webapp is a Java webapp, use SolrJ.
> >  Alternatively, if your webapp servers have enough spare CPU cycles and
> > enough RAM, you could make those same servers your 10 Solr slaves.
> 
> 
> I see, thank you.
> 
> Out of curiosity, how much RAM are we talking? (My database has about 6 gigs
> of data that we'd want to index for search.) The reason I ask is because my
> webapp servers do have CPU/RAM to spare.
> 
> 
> >
> >
> > > 2. How to best handle Enums?
> > > We have a bunch of enumerated data (say, for example, shoe types). What
> > > "fieldType" should we use to index them?
> > > Should I index them as text? If I index "sandals" then if somebody
> > searches
> > > for the keyword "sandals" then the documents that have shoeType=Sandals
> > (eg,
> > > enum-value of "07") I'd want those documents to show up.
> >
> > Sounds like "string" type.
> 
> 
> Okay I'll look into it more, thanks.
> 
> 
> >
> >
> > > 3. Enums are related, sort-of:
> > > Sometimes our enumerated data is somewhat related. For example (in the
> > "shoe
> > > types" example), let's say we have "sandals", well,  "crocs" are not
> > > sandals, but are SORT-OF like sandals, so we'd like them to match but
> > score
> > > lower than an exact sandal match. How do  we do this? (Is this "Changing
> > > Similarity" or is that barking up the wrong tree?)
> >
> > One option is to have a separate sort_of_like field where you stick various
> > sort-of-like "synonyms".  If you are using DisMax you can include that
> > sort_of_like field in the config but give it less boost than the "main"
> > field.  You could use index-time synonym injection for that sort_of_like
> > field.
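
A rough sketch of what the lower-boosted field could look like at query time, in Python. The field names shoe_type and sort_of_like, the boost values, the function name build_dismax_url and the localhost URL are all assumptions for illustration. It also assumes DisMax is selected with defType=dismax and fields are weighted through the qf parameter; some setups select DisMax through a dedicated request handler instead.

```python
from urllib.parse import urlencode


def build_dismax_url(base_url, user_query, main_boost=2.0, related_boost=0.5):
    """Build a select URL that searches both fields with different weights.

    Exact matches in the (hypothetical) shoe_type field count for more than
    matches in the (hypothetical) sort_of_like field, which would hold the
    index-time "synonyms" such as crocs -> sandals.
    """
    params = {
        "defType": "dismax",
        "q": user_query,
        # qf lists the fields to search; field^boost sets each field's weight.
        "qf": "shoe_type^{} sort_of_like^{}".format(main_boost, related_boost),
    }
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr instance.
url = build_dismax_url("http://localhost:8983/solr/select", "sandals")
```

With something like this, a document whose sort_of_like field contains "sandals" should still match, but would be expected to score below a document whose shoe_type is "sandals".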
> 
> 
> Wow, okay... I'll definitely have to do a bit more reading to understand
> what you just said. ;)
> 
> 
> >
> >
> > > 4. How to manage "Tags" data?
> > > Users on my site can enter "tags", and we want to be able to build
> > > tag-clouds, follow tag-links, and whatnot. Should I index tags as just a
> > > fieldType of "text"?
> >
> > "text" is fine if you don't want tags to be exact.  Assume "photography"
> > and "photo" have the same stem.  Do you want a user clicking on "photo" to
> > get items tagged as "photography", too?  If so, use text, else consider
> > string.  Treat multi-word tags as phrases.  Example:
> > http://www.simpy.com/user/otis/tag/%22information+retrieval%22
> 
> 
> Hmm... You raise a good question in there. Thanks for that info, I'll look
> into it more.
> 
> 
> > 
> >
> > > 5. How do I load the data?
> > > Loading all the data from the database (to anything!) takes a big chunk
> > of
> > > time. Should I export it from the database once and then load it into
> > Solr
> > > using CSV?
> >
> > If export is not slow, then upload via CSV should be faster than adding
> > docs to Solr "the usual way".  But judging from your question below, you
> > probably don't need the CSV approach.
> 
> 
> >
> > > Follow-up: How would I manage loading this/new data on an ongoing basis?
> > The
> > > site's users are creating data all the time, the bulk of  which is old
> > (i.e.
> > > before today; could be bulk loaded), but after an initial bulk load it's
> > > ongoing data. Should I be just building  a huge Solr index on the
> > filesystem
> > > and making sure I don't lose it?
> >
> > Sounds like one-time bulk indexing followed by continuous incremental
> > indexing.  You can have 2 masters to make things more fault-tolerant.  Or
> > you can store your index on a SAN.  Or you can just count on your N Solr
> > slaves acting as the "backup" (replicas) of your index, though they'll
> > always be a little behind the master index.
> 
> 
> Okay, that's what I thought.
> 
> Where can I learn more about this master-slave configuration for Solr?
> 
> 
> >
> >
> > > 6. How do I manage real-time data?
> > > For example, let's say I have users coming online and offline all the
> > time,
> > > and I need to be able to search my set of "online  users". How should I
> > go
> > > about this? Can this just be handled through index updates?
> >
> > Yes, though there is no real-time search in Solr just yet.  There is always
> > a bit of delay because of index replication (master=>slaves), index and
> > cache warmups.
> 
> 
> Hmm... yeah, for online-data, the "delay" is a huge problem. When people
> come online and they appear offline to others --even within a minute-- then
> they'll log off the site and probably never come back. It's the equivalent
> of logging into an IM system. --Actually not just "equivalent", since we do
> literally have an IM system. It would be like logging into, say,
> MSN-Messenger but not actually "coming online" until later due to a delay.
> 
> Normally I would think to keep this real-time data separate, but we actually
> do *searches* against the online members as well. We also like to sort our
> normal searches using the "online status" as a sorting criterion.
> 
> Thanks again very much for your input.
> 
> Sincerely,
> 
>      Daryl.
> 
> 
> 
> >
> >
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
