This is along the lines of what I have tried to get the Lucene community to adopt for a long time.

If you want to take Lucene to the next level, it needs a "server" implementation.

Only with this can you get efficient locking, caching, and transactions, which in turn lead to more efficient indexing and searching.

IMO, the "shared" storage nature of Lucene is its biggest weakness. A lot of changes have been made to improve this, when it probably just needs to be dropped. If you have a network anyway, talking to processes is really no different from talking to shared storage.

On Dec 9, 2007, at 12:04 AM, Doron Cohen wrote:

Grant Ingersoll <[EMAIL PROTECTED]> wrote on 08/12/2007 16:02:31:


On Dec 8, 2007, at 4:51 AM, Michael McCandless wrote:

Sometimes, when something like this comes up, it gives you the
opportunity to take a step back and ask what are the things we
really want Lucene to be going forward (the New Year is good for
this kind of assessment as well).  What are its strengths and
weaknesses?  What can we improve in the short term and what needs
to improve in the longer term?  Maybe it's just that time of year
to send out your Lucene Wish List... :-)

+1

There is still something for us to learn & improve in Lucene, even
if the comparison is necessarily apples/oranges or unfair.

Lucene was listed as not having "Result Excerpt", which isn't really
fair, though it is true you have to pull in contrib/highlighter to
enable it.

Yeah, I noted that mentally, but didn't think it was a big deal since
not everyone wants it.  The other thing is, some of it comes down to
how you structure your content.  I think a lot of people use metadata
fields to provide enough "summary" info about a document.
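
For anyone following along, pulling in contrib/highlighter is only a
couple of calls; here's a rough sketch from memory (treat the exact
signatures as approximate, and the "contents" field name is just an
example):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class SnippetUtil {
  /** Returns the best-scoring excerpt of a stored field for the given query. */
  public static String bestFragment(Query query, String fieldText) throws IOException {
    // Score candidate fragments by how well they match the query terms.
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    // Re-analyze the stored text and return the highest-scoring fragment.
    return highlighter.getBestFragment(new StandardAnalyzer(), "contents", fieldText);
  }
}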



Did it crash on the 10 GB? I thought it said that it just took way
too long (7 times the best or something). Frankly, either case is
suspect. Last summer I indexed about 5 million docs with a total
size of at the *very* least 10 GB on my 3 year old desktop. It
didn't take much more than 8 hours to index and searches were
still lightning fast. Maybe they forgot to give the JVM more than
the default amount of RAM <g>
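
(The kind of tuning I mean is nothing fancy, roughly this; the path
and numbers are invented, and 2.3's setRAMBufferSizeMB is probably
the better knob now:)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    // Run with a bigger heap than the JVM default, e.g.: java -Xmx1024m BulkIndexer
    IndexWriter writer = new IndexWriter("/tmp/bulk-index", new StandardAnalyzer(), true);
    writer.setMaxBufferedDocs(1000);  // flush to disk less often than the default
    writer.setMergeFactor(10);        // the default; higher trades search speed for indexing speed
    // ... writer.addDocument(...) for each doc ...
    writer.close();
  }
}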

The paper just said "ht://Dig and Lucene degraded considerably their
indexing time, and we excluded them from the final comparison".

Maybe Lucene just hit a very large segment merge and the author
incorrectly thought something had gone wrong because the addDocument
call was taking incredibly long?  In which case the new default
ConcurrentMergeScheduler should improve that.  I would expect Lucene
2.3 to now have an advantage in that it makes use of the concurrency
in the hardware, out of the box, whereas the other, older engines are
likely single-threaded.

Yep.
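
For reference, in 2.3 it's on by default, but the scheduler can also
be set explicitly; roughly this (path invented, and don't hold me to
the exact setter name):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;

public class MergeSchedulerDemo {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("/tmp/test-index", new StandardAnalyzer(), true);
    // Run segment merges in background threads so addDocument() calls
    // don't stall behind a large merge.
    writer.setMergeScheduler(new ConcurrentMergeScheduler());
    // ... add documents ...
    writer.close();
  }
}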



I've also thought about creating a simple optional threaded layer on
top of IndexWriter which uses multiple threads to add documents under
the hood.  Such a class would expose all of the methods of
IndexWriter (it would feel just like IndexWriter), except that calls
to add/updateDocument would drop into a queue which multiple threads
(maintained by this class) would pull from and execute.  This would
then let Lucene make use of even more concurrency ... and would save
application writers the "complexity" of having to manage threads
above Lucene.

+1  I have been thinking about this too.  Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.

Would be nice if this layer also took care of searchers/readers
refreshing & warming.
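
To make the idea a bit more concrete, here's a rough, untested sketch
of the kind of wrapper I have in mind (nothing like this exists in
Lucene today; the name ThreadedWriter is made up, it assumes Java 5
for java.util.concurrent, and it only covers addDocument, not
updateDocument or the reader refreshing/warming):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

/** Rough sketch: pulls queued documents into an IndexWriter from N worker threads. */
public class ThreadedWriter {

  private static final Document POISON = new Document();  // shutdown marker

  private final IndexWriter writer;
  private final BlockingQueue<Document> queue = new LinkedBlockingQueue<Document>(1000);
  private final Thread[] workers;

  public ThreadedWriter(IndexWriter writer, int numThreads) {
    this.writer = writer;
    workers = new Thread[numThreads];
    for (int i = 0; i < numThreads; i++) {
      workers[i] = new Thread(new Runnable() {
        public void run() {
          try {
            while (true) {
              Document doc = queue.take();
              if (doc == POISON) break;  // close() enqueues one marker per worker
              ThreadedWriter.this.writer.addDocument(doc);  // IndexWriter is thread-safe
            }
          } catch (Exception e) {
            throw new RuntimeException(e);
          }
        }
      });
      workers[i].start();
    }
  }

  /** Looks like IndexWriter.addDocument(), but only enqueues; the workers do the real add. */
  public void addDocument(Document doc) throws InterruptedException {
    queue.put(doc);
  }

  /** Drains the queue and stops the workers; closing the IndexWriter is left to the caller. */
  public void close() throws InterruptedException {
    for (int i = 0; i < workers.length; i++) {
      queue.put(POISON);
    }
    for (int i = 0; i < workers.length; i++) {
      workers[i].join();
    }
  }
}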


Here are some of my wishes:

1. Better Demo

2. Alternate scoring algorithms (which implies indexing changes too)
that perform at or near the same level as the current ones

+1


3. A way of announcing improvements to interfaces so that we have a
better ability to add methods to interfaces, knowing full well it will
break some people.  The same goes for deprecations.  In this day and
age of agile programming, it seems a bit restrictive to me that we
wait 1+ years (the average time between major releases) to remove what
we consider to be cruft in our code or to add new capabilities to
interfaces.  I would suggest we announce a deprecated method, version
it, mark it with when it is going away (e.g., "This will be removed in
version 2.6"), and then actually remove it in that version.  So, if we
deprecate something in 2.3 we could, assuming consecutively numbered
releases, remove it in 2.5.  This would presumably move things up a
bit, to about the six-month range.  Just a thought...  :-)
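
Concretely, the javadoc note I'm picturing would name the removal
version, something like this (the interface and method names here are
invented, not real Lucene APIs):

public interface ExampleSearch {

  /**
   * @deprecated Deprecated in 2.3; will be removed in 2.5.
   *             Use {@link #newSearch()} instead.
   */
  void oldSearch();

  void newSearch();
}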

-Grant

