Re: moving nontrivial application from lucene to solr

Otis Gospodnetic Mon, 29 Oct 2007 22:03:10 -0800

Hi,
Short answer to your long email: I didn't see anything Lucene-specific in your 
description that would prevent you from using Solr for this.
Figuring out when to open/start a new index would have to be done by your 
application, but with SOLR-215, Solr can now host several indices.  Yes, you 
can minimize all caches and essentially disable them, though that likely won't 
affect your indexing performance, unless you are really low on RAM, in which 
case you'd better run to the store. ;)
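
As a rough illustration of the "when to start a new index" decision that stays in the application, here is a minimal sketch that maps a document's date to a time-bucketed index name. The class name IndexBucket, the method nameFor, and the "index-" prefix are hypothetical and not part of Lucene or Solr; it assumes fixed-length windows measured in days (so calendar-month indexes would need different logic) and uses java.time from a modern JDK.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper (not a Lucene or Solr API): picks the name of the
// time-bucketed index that a document with the given date belongs to.
final class IndexBucket {
    // Fixed reference day from which the windows are counted.
    private static final LocalDate EPOCH = LocalDate.of(1970, 1, 1);

    private IndexBucket() {}

    // daysPerIndex is the window length, e.g. 1 for daily or 5 for 5-day indexes.
    static String nameFor(LocalDate docDate, int daysPerIndex) {
        if (daysPerIndex <= 0) {
            throw new IllegalArgumentException("daysPerIndex must be positive");
        }
        long daysSinceEpoch = ChronoUnit.DAYS.between(EPOCH, docDate);
        // floorDiv keeps the buckets aligned for dates before the epoch too.
        long bucket = Math.floorDiv(daysSinceEpoch, (long) daysPerIndex);
        LocalDate windowStart = EPOCH.plusDays(bucket * daysPerIndex);
        // LocalDate.toString() yields ISO-8601 (yyyy-MM-dd).
        return "index-" + windowStart;
    }
}
```

The application would open or create a new index whenever nameFor returns a name it has not seen before, and route each document to the index whose name matches its date.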


Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: jm <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, October 29, 2007 2:01:07 PM
Subject: moving nontrivial application from lucene to solr


Hello all,

I have a running application based on Lucene 2.2. It may not be the
most typical usage of Lucene; its main Lucene-related features are:

- I use many indexes, tens or hundreds of them (all with the same
field structure), and the number of indexes grows with time.
The indexes are separated on a time-of-document basis, so I can have
monthly indexes, or one every day, every 5 days, etc. I keep them all
under a given directory, which we call an index location (it contains
many indexes).
- When I need more than one process indexing at the same time, I have
to use one index location per process (as writing is exclusive). So
that effectively doubles the number of indexes if I use 2 processes,
or even worse if I want more processes. I know I could merge indexes
but right now I don't do that, too much trouble.
- I am mostly worried about indexing performance; I don't mind if
searching takes longer. To speed up indexing I keep a certain number
of indexes open, etc.
- While searching, I always need all docs (I use a HitCollector
extension). I use a MultiSearcher to search over many indexes.
- I use my custom Analyzer
- documents have 6 fields, all but one not stored. The stored one is
small, around 100 bytes. The others can vary from 1 byte to huge (only
two can be really huge)

I got to know about Solr when my code was already working, so it was
not possible to change. I have played around with Solr, and even built
a smaller project with it. And now I am thinking that porting to Solr
would have many advantages:
- would allow N processes to index at the same time and it would keep
only one index (or more see later). I could get rid of maintaining the
index locations etc. A big plus for me.
- I would reduce my lucene related code.
- deleting docs would work easily; now I have to do some special
treatment to delete docs in some cases because I have to wait until
nobody is writing to the index etc.
- I would take the opportunity to refactor the way I use the documents
to get even more advantages for my application unrelated to
lucene/solr
- might even be faster indexing?

Before committing to the change I would like to ask for the experts'
opinion on the following:
1. For starters I could live with having only a single (potentially
huge) index that is being constantly written into, and to a lesser
extent searched. But in case the index is too big, or I want better
indexing performance, I might want to have two indexes.
One would be smaller, to index all new content (and searching), and
the second one would be for searching only. At some point the small
one would be merged to the big one and emptied to start again.
I would like to know whether this:
-a. is already done in Solr core (or there is a ticket open
with similar functionality)
-b. could be done with some scripting without modifying the Solr source.
-c. would have to be implemented by modifying the Solr source and using
 https://issues.apache.org/jira/browse/SOLR-215 or
https://issues.apache.org/jira/browse/SOLR-255. In this case, which of
the two patches would be more appropriate? I see SOLR-215 is already
committed to trunk but SOLR-255 is not done yet.

2. As I said I am not worried about search performance, my queries are
all batch queries (well, sort of). All I need is maximum indexing
performance, and without too much memory if possible. From what I read
solr caches stuff heavily at various levels to provide fast searches.
As said in the wiki, I can comment out the caching sections in
solrconfig.xml. Will this totally disable all the caching, warming,
etc.? If not, do I need to modify the source?
I would like to do that so less memory is used and I suppose trying to
make search faster is not beneficial for indexing performance.

thanks for your thoughts.
