Re: O/S Search Comparisons

2007-12-08 Thread Michael McCandless
Sometimes, when something like this comes up, it gives you the
opportunity to take a step back and ask what are the things we
really want Lucene to be going forward (the New Year is good for
this kind of assessment as well).  What are its strengths and
weaknesses?  What can we improve in the short term, and what needs
to improve in the longer term?  Maybe it's just that time of year
to send out your Lucene Wish List... :-)


+1

There is still something for us to learn & improve in Lucene, even if  
the comparison is necessarily apples/oranges or unfair.


Lucene was listed as not having "Result Excerpt", which isn't really
fair, though it is true you have to pull in contrib/highlighter to
enable it.


Did it crash on the 10 GB? I thought it said that it just took way
too long (7 times the best or something). Frankly, either case is
suspect. Last summer I indexed about 5 million docs with a total
size of at the *very* least 10 GB on my 3-year-old desktop. It
didn't take much more than 8 hours to index, and searches were
still lightning fast. Maybe they forgot to give the JVM more than
the default amount of RAM...


The paper just said "ht://Dig and Lucene degraded considerably their  
indexing time, and we excluded them from the final comparison".


Maybe Lucene just hit a very large segment merge, and the author
incorrectly thought something had gone wrong since the addDocument
call was taking incredibly long?  In which case the new default
ConcurrentMergeScheduler should improve that.  I would expect Lucene
2.3 to now have an advantage in that it makes use of concurrency in
the hardware, out of the box, whereas other, older engines are likely
single-threaded.


I've also thought about creating a simple optional threaded layer on
top of IndexWriter which uses multiple threads to add documents
under the hood.  Such a class would expose all of the methods of
IndexWriter (it would feel just like IndexWriter), except calls to
add/updateDocument would drop into a queue which multiple threads
(maintained by this class) would pull from and execute.  This would
then let Lucene make use of even more concurrency ... and it saves
application writers the "complexity" of having to manage threads
above Lucene.
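For what it's worth, here is a rough sketch of what such a layer might look like.  Everything below is hypothetical: the class and method names are made up, and a plain DocSink interface stands in for the real IndexWriter so the sketch stays self-contained.  A real version would delegate to org.apache.lucene.index.IndexWriter and would need to propagate indexing exceptions back to callers.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for the real IndexWriter so this sketch compiles on its own.
interface DocSink {
    void addDocument(String doc);
}

// Hypothetical wrapper: addDocument() drops the doc into a queue, and a
// pool of worker threads pulls from the queue and calls the real writer.
class ThreadedWriter implements DocSink {
    private final DocSink delegate;
    private final ExecutorService pool;

    ThreadedWriter(DocSink delegate, int numThreads) {
        this.delegate = delegate;
        // Bounded queue gives back-pressure if producers outrun the workers;
        // CallerRunsPolicy makes the caller index the doc itself when full.
        this.pool = new ThreadPoolExecutor(numThreads, numThreads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1024),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    public void addDocument(final String doc) {
        pool.execute(() -> delegate.addDocument(doc));
    }

    // Like IndexWriter.close(): block until every queued doc is indexed.
    public void close() {
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The application just calls addDocument from a single thread and the wrapper fans the work out, which is exactly the "no thread management above Lucene" property described above.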


It is also possible the collection size is such that the merge cost
was very high (too high), because LogMergePolicy inadvertently
optimizes every so often.  I.e., for certain "unlucky" ranges of
collection sizes (number of documents "just above" maxBufferedDocs *
powers-of-mergeFactor, in log-space) you will indeed see that the
amortized merge cost is far too high.  This is because LogMergePolicy
is "pay it forward": it pays up front for continuing growth of the
index, vs. paying as-you-go, which would be better.  I opened
LUCENE-854 for this issue a while back, but it's still open.
E.g., KinoSearch's merging doesn't "inadvertently optimize", I think.
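To see the "unlucky sizes" effect concretely, here is a toy cost model.  This is my own simplification, not LogMergePolicy's actual code: assume a segment is flushed every maxBufferedDocs docs, a merge fires whenever the newest mergeFactor segments have equal size, and each merge costs the total number of docs rewritten.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of logarithmic merging (a simplification, NOT Lucene's
// actual LogMergePolicy): returns total docs rewritten by merges.
class MergeCostModel {
    static long totalMergeCost(int numDocs, int maxBufferedDocs, int mergeFactor) {
        List<Long> segments = new ArrayList<>();  // segment sizes, newest last
        long cost = 0;
        for (int flushed = 0; flushed < numDocs; flushed += maxBufferedDocs) {
            segments.add((long) Math.min(maxBufferedDocs, numDocs - flushed));
            // Cascade merges: a level-N merge may trigger a level-N+1 merge.
            while (readyToMerge(segments, mergeFactor)) {
                long mergedSize = 0;
                for (int i = 0; i < mergeFactor; i++) {
                    mergedSize += segments.remove(segments.size() - 1);
                }
                cost += mergedSize;  // every doc in the merge gets rewritten
                segments.add(mergedSize);
            }
        }
        return cost;
    }

    private static boolean readyToMerge(List<Long> segments, int mergeFactor) {
        int n = segments.size();
        if (n < mergeFactor) return false;
        long newest = segments.get(n - 1);
        for (int i = n - mergeFactor; i < n; i++) {
            if (segments.get(i) != newest) return false;
        }
        return true;
    }
}
```

In this model, with maxBufferedDocs=10 and mergeFactor=10, the amortized merge cost per doc at 1,000 docs (just past the maxBufferedDocs * mergeFactor^2 boundary, where the whole-index cascade fires) is roughly double what it is at 910 docs, which is the shape of the "pay it forward" spike described above.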



a) missing something in our defaults setup


I do think we've improved "out of the box defaults" in 2.3, not only  
with the speedups to indexing in LUCENE-843, but also changing the  
default to flushing at 16 MB instead of every 10 documents.  This  
ought to be a sizable improvement for users who just rely on Lucene's  
defaults (which is presumably the vast majority of users).
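For anyone who wants to set (or tune) this explicitly rather than rely on the new default, the knob is IndexWriter.setRAMBufferSizeMB.  The fragment below assumes Lucene 2.3 on the classpath, with dir and analyzer created elsewhere, so it is not runnable as-is:

```java
// Lucene 2.3: flush by RAM usage (the new default is 16 MB) instead of
// the old default of flushing every 10 buffered documents.
IndexWriter writer = new IndexWriter(dir, analyzer, true);
writer.setRAMBufferSizeMB(16.0);
// The pre-2.3 style knob is still there if you prefer doc counts:
// writer.setMaxBufferedDocs(1000);
```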



- Mark

Grant Ingersoll wrote:
All true and good points.  Lucene held up quite nicely in the
search aspect (at least perf.-wise), and I generally don't think
making these kinds of comparisons is all that useful (we call it
apples and oranges in English :-)  ).


What I am trying to get at is: if this paper was just about Lucene
and never mentioned a single other system, what, if anything, can
we take from it that can help us make Lucene better?  I know, for
instance, from my own personal experience, that 2.3 is somewhere
in the range of 3-5+ times faster than 2.2 (which I know is faster
than 1.9).  That being said, the paper clearly states that Lucene
was not capable of doing the WT10g docs because performance
degraded too much.  Now, I know Lucene is pretty darn capable of a
lot of things, and people are using it to do web search, etc. at
very large scales (I have personally talked w/ people doing it).
So, what I worry about is that we are either:

a) missing something in our defaults setup
b) missing something in our docs and our education efforts, or
c) missing some capability in our indexing such that it is
crashing


Now, what is to be done?  It may well be nothing, but I just want
to make sure we are comfortable with that decision, or decide
whether it is worth asking for a volunteer who has access to the
WT10g docs to go have a look and see what happens.  I personally
don't have access to these docs, otherwise I would try it out.
What we don't want is for potential supporters/contributors to
read that paper and say "Lucene isn't for me because of this."



Re: O/S Search Comparisons

2007-12-08 Thread Grant Ingersoll


On Dec 8, 2007, at 4:51 AM, Michael McCandless wrote:

Sometimes, when something like this comes up, it gives you the
opportunity to take a step back and ask what are the things we
really want Lucene to be going forward (the New Year is good for
this kind of assessment as well).  What are its strengths and
weaknesses?  What can we improve in the short term, and what needs
to improve in the longer term?  Maybe it's just that time of year
to send out your Lucene Wish List... :-)


+1

There is still something for us to learn & improve in Lucene, even  
if the comparison is necessarily apples/oranges or unfair.


Lucene was listed as not having "Result Excerpt", which isn't really
fair, though it is true you have to pull in contrib/highlighter to
enable it.


Yeah, I noted that mentally, but didn't think it was a big deal since  
not everyone wants it.  The other thing is, some of it comes down to  
how you structure your content.  I think a lot of people use metadata  
fields to provide enough "summary" info about a document.





Did it crash on the 10 GB? I thought it said that it just took way
too long (7 times the best or something). Frankly, either case is
suspect. Last summer I indexed about 5 million docs with a total
size of at the *very* least 10 GB on my 3-year-old desktop. It
didn't take much more than 8 hours to index, and searches were
still lightning fast. Maybe they forgot to give the JVM more than
the default amount of RAM...


The paper just said "ht://Dig and Lucene degraded considerably their  
indexing time, and we excluded them from the final comparison".


Maybe Lucene just hit a very large segment merge, and the author
incorrectly thought something had gone wrong since the addDocument
call was taking incredibly long?  In which case the new default
ConcurrentMergeScheduler should improve that.  I would expect Lucene
2.3 to now have an advantage in that it makes use of concurrency in
the hardware, out of the box, whereas other, older engines are likely
single-threaded.


Yep.




I've also thought about creating a simple optional threaded layer on
top of IndexWriter which uses multiple threads to add documents
under the hood.  Such a class would expose all of the methods of
IndexWriter (it would feel just like IndexWriter), except calls to
add/updateDocument would drop into a queue which multiple threads
(maintained by this class) would pull from and execute.  This would
then let Lucene make use of even more concurrency ... and it saves
application writers the "complexity" of having to manage threads
above Lucene.


+1  I have been thinking about this too.  Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
things seamless for users, in the sense that they still need to divvy
up the docs on the app side.


Here are some of my wishes:

1. Better Demo

2. Alternate scoring algorithms (which implies indexing too) that  
perform at or near the same level as the current ones


3. A way of announcing improvements to interfaces such that we have a
better ability to add methods to them, knowing full well it will
break some people.  Same goes for deprecated methods.  In this day and
age of agile programming, it seems a bit restrictive to me that we
wait 1+ years (the average time between major releases) to remove what
we consider to be cruft in our code or to add new capabilities to
interfaces.  I would suggest we announce a deprecated method, version
it, mark when it is going away (i.e. "This will be removed in
version 2.6"), and then do so in that version.  So, if we deprecate
something in 2.3, we could, assuming consecutively numbered releases,
remove it in 2.5.  This would presumably move things up a bit to about
the 6-month time range.  Just a thought...  :-)


-Grant





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: O/S Search Comparisons

2007-12-08 Thread Doron Cohen
Grant Ingersoll <[EMAIL PROTECTED]> wrote on 08/12/2007 16:02:31:

> [...]
> > I've also thought about creating a simple optional threaded layer on
> > top of IndexWriter which uses multiple threads to add documents,
> > under the hood.  Such a class would expose all of the methods of
> > IndexWriter (would feel just like IndexWriter), except calls to add/
> > updateDocument would drop into a queue which multiple threads
> > (maintained by this class) would pull from and execute.  This would
> > then let Lucene make use of even more concurrency ... and saves the
> > "complexity" of application writers having to manage threads above
> > Lucene.
>
> +1  I have been thinking about this too.  Solr clearly demonstrates
> the benefits of this kind of approach, although even it doesn't make
> it seamless for users in the sense that they still need to divvy up
> the docs on the app side.

Would be nice if this layer also took care of searchers/readers
refreshing & warming.

>
> Here's some of my wishes:
>
> 1. Better Demo
>
> 2. Alternate scoring algorithms (which implies indexing too) that
> perform at or near the same level as the current ones

+1

> [...]





Re: O/S Search Comparisons

2007-12-08 Thread robert engels
This is along the lines of what I have tried to get the Lucene
community to adopt for a long time.


If you want to take Lucene to the next level, it needs a "server"
implementation.


Only with this can you get efficient locks, caching, and transactions,
which lead to more efficient indexing and searching.


IMO, the "shared" storage nature of Lucene is its biggest weakness.
A lot of changes have been made to improve this, when it probably
just needs to be dropped. If you have a network, it is really no
different to communicate with processes rather than with storage.


On Dec 9, 2007, at 12:04 AM, Doron Cohen wrote:

> [...]
>
> Would be nice if this layer also took care of searchers/readers
> refreshing & warming.



