Hi Otis,
     First off, thanks for your complete reply! It certainly has a lot of
good info in it.
     To address some of the questions you asked, please see below:

> Hi,
> Your questions don't have simple answers, but here are some quick ones.
> > I'm new to Solr, and have been reading through documentation off-and-on
> > for days, but still have some unanswered basic/fundamental questions that
> > have a huge impact on my implementation approach.
> > I am thinking of moving my company's web app's main search engine over to
> > Solr. My goal is to index 5M user records of a social networking website
> > (most of which have a free-form text portion, but the majority of data is
> > non-text) and have complex searches against those records back in the
> > sub-0.5s range of time. I have just under 10 application servers each
> > running my web-app, which is mostly stateless except for things like
> > users' online status.
> How many servers have you got for running Solr? (assuming you don't intend
> to put Solr on the same servers as your webapp, as it sounds like each
> webapp is maxing out its server)

Right now, 0. I'm still investigating how to get it working, not yet close
to estimating load. Once we get things to the testing stage, I'm sure we'll
have an idea of what kind of production hardware we'll need to

> > Forgive me for asking so many in one email; feel free to change subject
> > line and reply about individual items. Here are the questions:
> >
> > 1. How to best organize a web-app that normally goes to a search-db to
> > use Solr instead?
> > a) Set up independent Solr instance, make app search it just like it used
> > to search database.
> > b) Integrate Solr right into app, so that app+solr get deployed together
> > (this is very possible, as our app is Java). But we run several instances
> > of the app so we'd be running several Solr instances too.
> > c) Set up independent Solr instance + our code (plugins or whatever?),
> > have web clients request DIRECTLY to the Solr app and have Solr return
> > search results directly.
> > d) Other configuration...?
> a) Set up Solr master + N slaves on a separate set of boxes and access them
> remotely from your webapp.  If your webapp is a Java webapp, use SolrJ.
>  Alternatively, if your webapp servers have enough spare CPU cycles and
> enough RAM, you could make those same servers your 10 Solr slaves.

I see, thank you.
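
To check that I understand option (a), here is a rough sketch of the request
my app would send to a separate Solr box. It uses only the JDK rather than
SolrJ and just builds the URL. The host name, port, and the `gender` field
are made up, and I'm assuming the standard `/select` handler with `q`, `fq`,
and `rows` parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: builds a Solr /select URL from query parameters.
final class SolrSelectUrl {
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }

    // baseUrl is assumed to look like "http://host:8983/solr" (no trailing slash).
    static String build(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(enc(e.getKey()) + "=" + enc(e.getValue()));
        }
        return baseUrl + "/select?" + query;
    }
}
```

Passing the parameters in a LinkedHashMap keeps them in insertion order, which
makes the resulting URL predictable when debugging.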

Out of curiosity, how much RAM are we talking? (My database has about 6 gigs
of data that we'd want to index for search.) The reason I ask is because my
webapp servers do have CPU/RAM to spare.

> > 2. How to best handle Enums?
> > We have a bunch of enumerated data (say, for example, shoe types). What
> > "fieldType" should we use to index them?
> > Should I index them as text? If I index "sandals" and somebody searches
> > for the keyword "sandals", I'd want the documents that have
> > shoeType=Sandals (e.g., enum-value of "07") to show up.
> Sounds like "string" type.

Okay I'll look into it more, thanks.
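
If I read that right, I'd index the human-readable label as an untokenized
"string" value instead of our numeric code, so a keyword search for "sandals"
matches exactly. A rough sketch with a hypothetical `ShoeType` enum on our
side (the "08" code and the lowercase labels are invented for illustration):

```java
// Hypothetical mapping from our stored enum codes to the label we would index.
enum ShoeType {
    SANDALS("07", "sandals"),
    CROCS("08", "crocs"); // code "08" is made up

    final String code;
    final String label;

    ShoeType(String code, String label) {
        this.code = code;
        this.label = label;
    }

    // Looks up the enum constant for a stored code such as "07".
    static ShoeType fromCode(String code) {
        for (ShoeType t : values()) {
            if (t.code.equals(code)) {
                return t;
            }
        }
        throw new IllegalArgumentException("Unknown shoe type code: " + code);
    }
}
```

At index time the app would then send `ShoeType.fromCode("07").label` as the
field value. Since a string field is not analyzed, I assume the query term
would need to match that label exactly, including case.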

> > 3. Enums are related, sort-of:
> > Sometimes our enumerated data is somewhat related. For example (in the
> > "shoe types" example), let's say we have "sandals"; well, "crocs" are not
> > sandals, but are SORT-OF like sandals, so we'd like them to match but
> > score lower than an exact sandal match. How do we do this? (Is this
> > "Changing Similarity" or is that barking up the wrong tree?)
> One option is to have a separate sort_of_like field where you stick various
> sort-of-like "synonyms".  If you are using DisMax you can include that
> sort_of_like field in the config but give it less boost than the "main"
> field.  You could use index-time synonym injection for that sort_of_like
> field.

Wow, okay... I'll definitely have to do a bit more reading to understand
what you just said. ;)
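
From a first pass through the docs, I think it means something like this: a
DisMax request whose `qf` parameter lists both fields, with the sort_of_like
field boosted lower. The sketch below only builds that parameter value. The
field name `shoe_type` and the boost numbers are placeholders of mine, and
I'm assuming `qf` takes space-separated `field^boost` entries.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: formats per-field boosts as a DisMax "qf" value.
final class DisMaxFields {
    static String qf(Map<String, Double> boosts) {
        StringJoiner sj = new StringJoiner(" ");
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            // Each entry becomes field^boost, e.g. shoe_type^2.0
            sj.add(e.getKey() + "^" + e.getValue());
        }
        return sj.toString();
    }
}
```

With `shoe_type` at 2.0 and `sort_of_like` at 0.5, a document whose main field
says "sandals" should outscore one that only lists "sandals" in its
sort_of_like field, if I understand the idea correctly.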

> > 4. How to manage "Tags" data?
> > Users on my site can enter "tags", and we want to be able to build
> > tag-clouds, follow tag-links, and whatnot. Should I index tags as just a
> > fieldType of "text"?
> "text" is fine if you don't want tags to be exact.  Assume "photography"
> and "photo" have the same stem.  Do you want a user clicking on "photo" to
> get items tagged as "photography", too?  If so, use text, else consider
> string.  Treat multi-word tags as phrases.  Example:
> http://www.simpy.com/user/otis/tag/%22information+retrieval%22

Hmm... You raise a good question in there. Thanks for that info, I'll look
into it more.
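
For my own notes, here is how I picture the multi-word case on the query side:
wrap the tag in double quotes so it is searched as a phrase, as in your Simpy
URL. A small sketch, assuming a field I'd call `tag` (hypothetical name):

```java
// Hypothetical helper: turns a user-entered tag into a phrase query on "tag".
final class TagQuery {
    static String forTag(String tag) {
        // Escape backslashes first, then embedded quotes, so the tag stays one phrase.
        String escaped = tag.replace("\\", "\\\\").replace("\"", "\\\"");
        return "tag:\"" + escaped + "\"";
    }
}
```

The result would still need URL-encoding before going into the `q` parameter.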

> > 5. How do I load the data?
> > Loading all the data from the database (to anything!) takes a big chunk
> > of time. Should I export it from the database once and then load it into
> > Solr using CSV?
> If export is not slow, then upload via CSV should be faster than adding
> docs to Solr "the usual way".  But judging from your question below, you
> probably don't need the CSV approach.
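
In case the CSV route does turn out to be useful for the initial bulk load,
here is a sketch of how I'd write one export row. I'm assuming conventional
CSV quoting (fields containing commas, quotes, or line breaks get wrapped in
double quotes, with embedded quotes doubled); the class name and sample
values are invented.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper: formats one database record as a CSV line.
final class CsvRow {
    // Quotes a field only when it contains a character that would break the row.
    static String field(String v) {
        boolean needsQuotes = v.contains(",") || v.contains("\"")
                || v.contains("\n") || v.contains("\r");
        if (!needsQuotes) {
            return v;
        }
        return "\"" + v.replace("\"", "\"\"") + "\"";
    }

    static String row(List<String> values) {
        StringJoiner sj = new StringJoiner(",");
        for (String v : values) {
            sj.add(field(v));
        }
        return sj.toString();
    }
}
```

The free-form text portion of our user records is the part most likely to
contain commas and quotes, so that is where the escaping matters.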

> > Follow-up: How would I manage loading this/new data on an ongoing basis?
> > The site's users are creating data all the time, the bulk of which is old
> > (i.e. before today; could be bulk loaded), but after an initial bulk load
> > it's ongoing data. Should I be just building a huge Solr index on the
> > filesystem and making sure I don't lose it?
> Sounds like one-time bulk indexing followed by continuous incremental
> indexing.  You can have 2 masters to make things more fault-tolerant.  Or
> you can store your index on a SAN.  Or you can just count on your N Solr
> slaves acting as the "backup" (replicas) of your index, though they'll
> always be a little behind the master index.

Okay, that's what I thought.

Where can I learn more about this master-slave configuration for Solr?

> > 6. How do I manage real-time data?
> > For example, let's say I have users coming online and offline all the
> > time, and I need to be able to search my set of "online users". How
> > should I go about this? Can this just be handled through index updates?
> Yes, though there is no real-time search in Solr just yet.  There is always
> a bit of delay because of index replication (master=>slaves), index and
> cache warmups.

Hmm... yeah, for online-data, the "delay" is a huge problem. When people
come online and they appear offline to others --even within a minute-- then
they'll log off the site and probably never come back. It's the equivalent
of logging into an IM system. --Actually not just "equivalent", since we do
literally have an IM system. It would be like logging into, say,
MSN-Messenger but not actually "coming online" until later due to a delay.

Normally I would think to keep this real-time data separate, but we actually
do *searches* against the online members as well. We also like to sort our
normal searches using the "online status" as a sorting criterion.

Thanks again very much for your input.



