Re: O/S Search Comparisons

2007-12-18 Thread Grant Ingersoll
My testing experience has shown around 100 to be good for things like
Wikipedia, etc.  That is an interesting point to think about with
regard to paying the cost once the optimize is undertaken, and it may be
worth exploring more.  I also wonder how partial optimizes may help.


The Javadocs say:

   * Determines how often segment indices are merged by addDocument().  With
   * smaller values, less RAM is used while indexing, and searches on
   * unoptimized indices are faster, but indexing speed is slower.  With larger
   * values, more RAM is used during indexing, and while searches on unoptimized
   * indices are slower, indexing is faster.  Thus larger values (> 10) are best
   * for batch index creation, and smaller values (< 10) for indices that are
   * interactively maintained.
   *
   * <p>Note that this method is a convenience method: it
   * just calls mergePolicy.setMergeFactor as long as
   * mergePolicy is an instance of {@link LogMergePolicy}.
   * Otherwise an IllegalArgumentException is thrown.</p>
   *
   * <p>This must never be less than 2.  The default value is 10.

I'd like to append to the last line to say something like:
Empirical testing suggests a maximum value around 100, but this
depends on the collection.  Really large values (> 100) are
discouraged.
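
For illustration, a minimal sketch of the batch-indexing setup being
discussed, assuming the 2.3-era API; the index path, analyzer, field name,
and the value 100 are placeholders, not recommendations beyond what is said
above:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class BatchIndexSketch {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/tmp/index"),
                                           new StandardAnalyzer(), true);
      // A larger mergeFactor favors batch indexing; ~100 was the empirical
      // sweet spot mentioned above, and much larger values are discouraged.
      writer.setMergeFactor(100);
      Document doc = new Document();
      doc.add(new Field("body", "example text", Field.Store.NO, Field.Index.TOKENIZED));
      writer.addDocument(doc);
      // Pay the merge cost once at the end rather than throughout indexing.
      writer.optimize();
      writer.close();
    }
  }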




On Dec 18, 2007, at 12:10 AM, Doron Cohen wrote:


On Dec 18, 2007 2:38 AM, Mark Miller [EMAIL PROTECTED] wrote:


For the data that I normally work with (short articles), I found that
the sweet spot was around 80-120. I actually saw a slight decrease going
above that...not sure if that held forever though. That was testing on
an earlier release (I think 2.1?). However, if you want to test
searching it would seem that you are going to want to optimize the
index. I have always found that whatever I save by changing the merge
factor is paid back when you optimize. I have not scientifically
tested this, but found it to be the case in every speed test I ran. This
is an interesting thing to me for this test. Do you test with a full
optimize for indexing? If you don't, can you really test the search
performance with the advantage of a full optimize? So, if you are going
to optimize, why mess with the merge factor? It may still play a small
role, but at best I think it's a pretty weak lever.



I had a similar experience - set merge factor to ~maxint and optimized
at the end, and felt like it was the same (never measured though).
In fact, with the new concurrent merges, I think it should be faster to
merge on the fly?

(One comment - it is important to set the merge factor back to a reasonable
number before the final optimize, otherwise you hit OutOfMem due to
so many segments being merged at once.)
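
A minimal sketch of that pattern, again with the 2.x IndexWriter API; the
specific values are placeholders (a large value stands in for the "~maxint"
above):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class ResetBeforeOptimize {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/tmp/index"),
                                           new StandardAnalyzer(), true);
      writer.setMergeFactor(100000);  // effectively defer merging while adding docs
      // ... add documents ...

      // Set the merge factor back to a reasonable number before the final
      // optimize, so it does not try to merge a huge number of segments at
      // once and run out of memory (the problem described above).
      writer.setMergeFactor(10);
      writer.optimize();
      writer.close();
    }
  }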



- Mark

Grant Ingersoll wrote:

I did hear back from the authors.  Some of the issues were based on
values chosen for mergeFactor (10,000) I think, but there also seemed
to be some questions about parsing the TREC collection.  It was split
out into individual files, as opposed to trying to stream in the
documents like we do with Wikipedia, so I/O overhead may be an issue.
At the time, 1.9.1 did not have much TREC support, so splitting files
is probably the easiest way to do it.  Their indexing code was based
off the demo and some LIA reading.

They thought they would try Lucene again when 2.3 comes out.  From our
end, I think we need to improve the docs around mergeFactor.  We
generally just say bigger is better, but my understanding is there is
definitely a limit to this (100??  Maybe 1000) so we should probably
suggest that in the docs.  And, of course, I think the new
contrib/benchmark has support for reading TREC (although I don't know
if it handles streaming it) such that I think it shouldn't be a
problem this time around.




Yes it does streaming - TREC compressed files are read with GZIPInputStream
on demand - the next doc's text is read/parsed only when the indexer requests
it, and the indexable document is created; no doc files are created on disk.
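
As an illustration only (not the actual contrib/benchmark code), a sketch of
that streaming approach: read one TREC <DOC>...</DOC> block at a time from
the gzipped file and hand it straight to the indexer, so nothing is written
to disk per document. The field name and the minimal parsing are placeholders:

  import java.io.BufferedReader;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.util.zip.GZIPInputStream;

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class TrecStreamSketch {
    /** Reads one <DOC>...</DOC> block at a time and indexes it directly. */
    public static void indexGzippedTrecFile(IndexWriter writer, String path)
        throws IOException {
      BufferedReader in = new BufferedReader(new InputStreamReader(
          new GZIPInputStream(new FileInputStream(path))));
      try {
        StringBuffer docText = null;
        String line;
        while ((line = in.readLine()) != null) {
          if (line.startsWith("<DOC>")) {
            docText = new StringBuffer();
          } else if (line.startsWith("</DOC>") && docText != null) {
            Document doc = new Document();
            doc.add(new Field("body", docText.toString(),
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);  // parsed on demand, no per-doc files on disk
            docText = null;
          } else if (docText != null) {
            docText.append(line).append('\n');
          }
        }
      } finally {
        in.close();
      }
    }
  }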





At any rate, I think we are for the most part doing the right things.
Anyone have any thoughts on advice about an upper bound for mergeFactor?


Cheers,
Grant


On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:


On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:

+1  I have been thinking about this too.  Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.


Would be nice if this layer also took care of searchers/readers
refreshing & warming.


Solr has well-tested code that provides all this functionality and
more (except for automatically spawning extra indexing threads, which
I agree would be a useful addition).  It does heavily depend on 1.5's
java.util.concurrent package, though.  Many people seem to like using
Solr as an embedded library layer on top of Lucene to do it all
in-process, as 

Re: O/S Search Comparisons

2007-12-17 Thread Grant Ingersoll
I did hear back from the authors.  Some of the issues were based on  
values chosen for mergeFactor (10,000) I think, but there also seemed  
to be some questions about parsing the TREC collection.  It was split  
out into individual files, as opposed to trying to stream in the  
documents like we do with Wikipedia, so I/O overhead may be an issue.   
At the time, 1.9.1 did not have much TREC support, so splitting files  
is probably the easiest way to do it.  Their indexing code was based  
off the demo and some LIA reading.


They thought they would try Lucene again when 2.3 comes out.  From our  
end, I think we need to improve the docs around mergeFactor.  We  
generally just say bigger is better, but my understanding is there is  
definitely a limit to this (100??  Maybe 1000) so we should probably  
suggest that in the docs.  And, of course, I think the new contrib/ 
benchmark has support for reading TREC (although I don't know if it  
handles streaming it) such that I think it shouldn't be a problem this  
time around.


At any rate, I think we are for the most part doing the right things.   
Anyone have any thoughts on advice about an upper bound for mergeFactor?


Cheers,
Grant


On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:


On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:


+1  I have been thinking about this too.  Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.


Would be nice if this layer also took care of searchers/readers
refreshing & warming.


Solr has well-tested code that provides all this functionality and  
more (except for automatically spawning extra indexing threads,  
which I agree would be a useful addition).  It does heavily depend  
on 1.5's java.util.concurrent package, though.  Many people seem to  
like using Solr as an embedded library layer on top of Lucene to do  
it all in-process, as well.


-Mike




--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







Re: O/S Search Comparisons

2007-12-17 Thread Mark Miller
For the data that I normally work with (short articles), I found that 
the sweet spot was around 80-120. I actually saw a slight decrease going 
above that...not sure if that held forever though. That was testing on 
an earlier release  (I think 2.1?). However, if you want to test 
searching it would seem that you are going to want to optimize the 
index. I have always found that whatever I save by changing the merge 
factor is paid back when you optimize. I have not scientifically 
tested this, but found it to be the case in every speed test I ran. This 
is an interesting thing to me for this test. Do you test with a full 
optimize for indexing? If you don't, can you really test the search 
performance with the advantage of a full optimize? So, if you are going 
to optimize, why mess with the merge factor? It may still play a small 
role, but at best I think it's a pretty weak lever.


- Mark

Grant Ingersoll wrote:
I did hear back from the authors.  Some of the issues were based on 
values chosen for mergeFactor (10,000) I think, but there also seemed 
to be some questions about parsing the TREC collection.  It was split 
out into individual files, as opposed to trying to stream in the 
documents like we do with Wikipedia, so I/O overhead may be an issue.  
At the time, 1.9.1 did not have much TREC support, so splitting files 
is probably the easiest way to do it.  Their indexing code was based 
off the demo and some LIA reading.


They thought they would try Lucene again when 2.3 comes out.  From our 
end, I think we need to improve the docs around mergeFactor.  We 
generally just say bigger is better, but my understanding is there is 
definitely a limit to this (100??  Maybe 1000) so we should probably 
suggest that in the docs.  And, of course, I think the new 
contrib/benchmark has support for reading TREC (although I don't know 
if it handles streaming it) such that I think it shouldn't be a 
problem this time around.


At any rate, I think we are for the most part doing the right things.  
Anyone have any thoughts on advice about an upper bound for mergeFactor?


Cheers,
Grant


On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:


On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:


+1  I have been thinking about this too.  Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.


Would be nice if this layer also took care of searchers/readers
refreshing & warming.


Solr has well-tested code that provides all this functionality and 
more (except for automatically spawning extra indexing threads, which 
I agree would be a useful addition).  It does heavily depend on 1.5's 
java.util.concurrent package, though.  Many people seem to like using 
Solr as an embedded library layer on top of Lucene to do it all 
in-process, as well.


-Mike




--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











Re: O/S Search Comparisons

2007-12-17 Thread Doron Cohen
On Dec 18, 2007 2:38 AM, Mark Miller [EMAIL PROTECTED] wrote:

 For the data that I normally work with (short articles), I found that
 the sweet spot was around 80-120. I actually saw a slight decrease going
 above that...not sure if that held forever though. That was testing on
 an earlier release  (I think 2.1?). However, if you want to test
 searching it would seem that you are going to want to optimize the
 index. I have always found that whatever I save by changing the merge
 factor is paid back when you optimize. I have not scientifically
 tested this, but found it to be the case in every speed test I ran. This
 is an interesting thing to me for this test. Do you test with a full
 optimize for indexing? If you don't, can you really test the search
 performance with the advantage of a full optimize? So, if you are going
 to optimize, why mess with the merge factor? It may still play a small
 role, but at best I think it's a pretty weak lever.


I had a similar experience - set merge factor to ~maxint and optimized
at the end, and felt like it was the same (never measured though).
In fact, with the new concurrent merges, I think it should be faster to
merge on the fly?

(One comment - it is important to set the merge factor back to a reasonable
number before the final optimize, otherwise you hit OutOfMem due to
so many segments being merged at once.)


 - Mark

 Grant Ingersoll wrote:
  I did hear back from the authors.  Some of the issues were based on
  values chosen for mergeFactor (10,000) I think, but there also seemed
  to be some questions about parsing the TREC collection.  It was split
  out into individual files, as opposed to trying to stream in the
  documents like we do with Wikipedia, so I/O overhead may be an issue.
  At the time, 1.9.1 did not have much TREC support, so splitting files
  is probably the easiest way to do it.  Their indexing code was based
  off the demo and some LIA reading.
 
  They thought they would try Lucene again when 2.3 comes out.  From our
  end, I think we need to improve the docs around mergeFactor.  We
  generally just say bigger is better, but my understanding is there is
  definitely a limit to this (100??  Maybe 1000) so we should probably
  suggest that in the docs.  And, of course, I think the new
  contrib/benchmark has support for reading TREC (although I don't know
  if it handles streaming it) such that I think it shouldn't be a
  problem this time around.


Yes it does streaming - TREC compressed files are read with GZIPInputStream
on demand - the next doc's text is read/parsed only when the indexer requests
it, and the indexable document is created; no doc files are created on disk.


 
  At any rate, I think we are for the most part doing the right things.
  Anyone have any thoughts on advice about an upper bound for mergeFactor?
 
  Cheers,
  Grant
 
 
  On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:
 
  On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:
 
  +1  I have been thinking about this too.  Solr clearly demonstrates
  the benefits of this kind of approach, although even it doesn't make
  it seamless for users in the sense that they still need to divvy up
  the docs on the app side.
 
  Would be nice if this layer also took care of searchers/readers
  refreshing & warming.
 
  Solr has well-tested code that provides all this functionality and
  more (except for automatically spawning extra indexing threads, which
  I agree would be a useful addition).  It does heavily depend on 1.5's
  java.util.concurrent package, though.  Many people seem to like using
  Solr as an embedded library layer on top of Lucene to do it all
  in-process, as well.
 
  -Mike
 
 
 
  --
  Grant Ingersoll
  http://lucene.grantingersoll.com
 
  Lucene Helpful Hints:
  http://wiki.apache.org/lucene-java/BasicsOfPerformance
  http://wiki.apache.org/lucene-java/LuceneFAQ
 
 
 
 
 
 





Re: O/S Search Comparisons

2007-12-10 Thread Grant Ingersoll


On Dec 7, 2007, at 3:01 PM, Mark Miller wrote:

Yes, and even if they did not use the stock defaults, I would bet  
there would be complaints about what was done wrong at every turn.  
This seems like a very difficult thing to do. How long does it take  
to fully learn how to correctly utilize each search engine for the  
task at hand? I am sure longer than these busy men could possibly  
take. It seems that such a comparison could only be done  
legitimately if experts for each search engine set up the indexing/ 
searching processes. Even then the results seem like they could be  
difficult to measure...eg was each search engine configured so that  
they would only break on spaces for indexing and do nothing else  
special at all? So many small settings and knowledge need to ensure  
each engine is on level ground...


This is why I have called on NIST/TREC to open source their  
collections.  Until then, Lucene and the other O/S search engines will  
be reliant on those contributors who have access to them, which is  
spotty at best.  (And, yes, I know, TREC is not the be all, end all of  
IR evaluations, but it is a common ground for doing research)  See http://www.gossamer-threads.com/lists/lucene/java-dev/52022?search_string=TREC;#52022


-Grant




Re: O/S Search Comparisons

2007-12-10 Thread Mike Klaas

On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:


+1  I have been thinking about this too.  Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.


Would be nice if this layer also took care of searchers/readers
refreshing & warming.


Solr has well-tested code that provides all this functionality and  
more (except for automatically spawning extra indexing threads, which  
I agree would be a useful addition).  It does heavily depend on 1.5's  
java.util.concurrent package, though.  Many people seem to like using  
Solr as an embedded library layer on top of Lucene to do it all in- 
process, as well.


-Mike




Re: O/S Search Comparisons

2007-12-09 Thread Michael McCandless


Well, at some point the answer is use Solr.  I think Lucene should  
stay focused on being a good search library/component, and server  
level capabilities should be handled by Solr or the application layer  
on top of Lucene.


That said, I still think there is a need for a layer that handles/ 
hides threading, does search refreshing/warming, etc., on the Lucene  
side.  Actually, LuceneIndexAccessor (LUCENE-390) is already a step  
in that direction.
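
A rough sketch of what the reader/searcher side of such a layer might look
like (not LuceneIndexAccessor itself); the Directory is assumed to be shared
with the writer, and the warm-up query is a placeholder:

  import java.io.IOException;

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.store.Directory;

  /** Hands out a shared searcher, refreshing and warming it when the index changes. */
  public class SearcherManagerSketch {
    private final Directory dir;
    private IndexReader reader;
    private IndexSearcher searcher;

    public SearcherManagerSketch(Directory dir) throws IOException {
      this.dir = dir;
      this.reader = IndexReader.open(dir);
      this.searcher = new IndexSearcher(reader);
    }

    public synchronized IndexSearcher getSearcher() throws IOException {
      if (!reader.isCurrent()) {
        IndexReader newReader = IndexReader.open(dir);
        IndexSearcher newSearcher = new IndexSearcher(newReader);
        // Warm the new searcher before swapping it in (placeholder query).
        newSearcher.search(new TermQuery(new Term("body", "warmup")));
        // Real code would close the old reader once in-flight searches finish.
        reader = newReader;
        searcher = newSearcher;
      }
      return searcher;
    }
  }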


Mike

On Dec 9, 2007, at 1:29 AM, robert engels wrote:

This is along the lines of what I have tried to get the Lucene  
community to adopt for a long time.


If you want to take Lucene to the next level, it needs a server  
implementation.


Only with this can you get efficient locks, caching, transactions,  
which leads to more efficient indexing and searching.


IMO, the shared storage nature of Lucene is its biggest  
weakness.  A lot of changes have been made to improve this, when it  
probably just needs to be dropped. If you have a network, it is  
really no different to communicate with processes rather than storage.


On Dec 9, 2007, at 12:04 AM, Doron Cohen wrote:


Grant Ingersoll [EMAIL PROTECTED] wrote on 08/12/2007 16:02:31:



On Dec 8, 2007, at 4:51 AM, Michael McCandless wrote:


Sometimes, when something like this comes up, it gives you the
opportunity to take a step back and ask what are the things we
really want Lucene to be going forward (the New Year is good for
this kind of assessment as well)  What are its strengths and
weaknesses?  What can we improve in the short term and what needs
to improve in the longer term?  Maybe it's just that time of year
to send out your Lucene Wish List... :-)


+1

There is still something for us to learn & improve in Lucene, even
if the comparison is necessarily apples/oranges or unfair.

Lucene was listed as not having Result Excerpt which isn't really
fair,  though it is true you have to pull in contrib/highlighter to
enable it.


Yeah, I noted that mentally, but didn't think it was a big deal since
not everyone wants it.  The other thing is, some of it comes down to
how you structure your content.  I think a lot of people use metadata
fields to provide enough summary info about a document.





Did it crash on the 10 GB? I thought it said that it just took way
too long (7 times the best or something). Frankly, either case is
suspect. Last summer I indexed about 5 million docs with a total
size at the *very* least of 10 GB on my 3 year old desktop. It
didn't take much more than 8 hours to index and searches were
still lightning fast. Maybe they forgot to give the JVM more than
the default amount of RAM <g>


The paper just said "ht://Dig and Lucene degraded considerably their
indexing time, and we excluded them from the final comparison."

Maybe Lucene just hit a very large segment merge and the author
incorrectly thought something had gone wrong since the addDocument
call was taking incredibly long?  In which case the new default
ConcurrentMergeScheduler should improve that.  I would expect Lucene
2.3 to now have an advantage in that it makes use of concurrency in
the hardware, out of the box, whereas likely other older engines are
single threaded.


Yep.




I've also thought about creating a simple optional threaded layer on
top of IndexWriter which uses multiple threads to add documents,
under the hood.  Such a class would expose all of the methods of
IndexWriter (would feel just like IndexWriter), except calls to add/
updateDocument would drop into a queue which multiple threads
(maintained by this class) would pull from and execute.  This would
then let Lucene make use of even more concurrency ... and saves the
complexity of application writers having to manage threads above
Lucene.
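
Roughly, the idea could look like the following sketch (a hypothetical
wrapper, not an existing class), using the java.util.concurrent package
mentioned later in the thread:

  import java.io.IOException;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;

  /** Illustrative wrapper that feeds addDocument calls to a pool of worker threads. */
  public class ThreadedWriterSketch {
    private final IndexWriter writer;
    private final ExecutorService pool;

    public ThreadedWriterSketch(IndexWriter writer, int numThreads) {
      this.writer = writer;
      // Worker threads pull queued documents and call the thread-safe IndexWriter.
      this.pool = Executors.newFixedThreadPool(numThreads);
    }

    public void addDocument(final Document doc) {
      pool.execute(new Runnable() {
        public void run() {
          try {
            writer.addDocument(doc);
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        }
      });
    }

    public void close() throws IOException, InterruptedException {
      pool.shutdown();
      pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
      writer.close();
    }
  }

A real version would also want a bounded queue for back-pressure and a way
to report per-document failures back to the caller.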


+1  I have been thinking about this too.  Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.


Would be nice if this layer also took care of searchers/readers
refreshing & warming.



Here's some of my wishes:

1. Better Demo

2. Alternate scoring algorithms (which implies indexing too) that
perform at or near the same level as the current ones


+1



3. A way of announcing improvements to Interfaces such that we have
better ability to add methods to interfaces, knowing full well it will
break some people.  Same goes for deprecated.  In this day and age of
agile programming, it seems a bit restrictive to me that we wait 1+
years (the average time between major releases) to remove what we
consider to be cruft in our code or add new capabilities to
interfaces.  I would suggest we announce a deprecated method, version
it, mark it to when it is going away (i.e. This will be removed in
version 2.6) and then do so in that version.   So, if we deprecate
something in 2.3, we could, assuming consecutive numbered releases,
remove it in 2.5.  This would 

Re: O/S Search Comparisons

2007-12-08 Thread Grant Ingersoll


On Dec 8, 2007, at 4:51 AM, Michael McCandless wrote:

Sometimes, when something like this comes up, it gives you the  
opportunity to take a step back and ask what are the things we  
really want Lucene to be going forward (the New Year is good for  
this kind of assessment as well)  What are its strengths and  
weaknesses?  What can we improve in the short term and what needs  
to improve in the longer term?  Maybe it's just that time of year  
to send out your Lucene Wish List... :-)


+1

There is still something for us to learn & improve in Lucene, even  
if the comparison is necessarily apples/oranges or unfair.


Lucene was listed as not having Result Excerpt which isn't really  
fair,  though it is true you have to pull in contrib/highlighter to  
enable it.


Yeah, I noted that mentally, but didn't think it was a big deal since  
not everyone wants it.  The other thing is, some of it comes down to  
how you structure your content.  I think a lot of people use metadata  
fields to provide enough summary info about a document.





Did it crash on the 10 GB? I thought it said that it just took way  
too long (7 times the best or something). Frankly, either case is  
suspect. Last summer I indexed about 5 million docs with a total  
size at the *very* least of 10 GB on my 3 year old desktop. It  
didn't take much more than 8 hours to index and searches were  
still lightning fast. Maybe they forgot to give the JVM more than  
the default amount of RAM <g>


The paper just said "ht://Dig and Lucene degraded considerably their  
indexing time, and we excluded them from the final comparison."


Maybe Lucene just hit a very large segment merge and the author  
incorrectly thought something had gone wrong since the addDocument  
call was taking incredibly long?  In which case the new default  
ConcurrentMergeScheduler should improve that.  I would expect Lucene  
2.3 to now have an advantage in that it makes use of concurrency in  
the hardware, out of the box, whereas likely other older engines are  
single threaded.


Yep.
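
For reference, a minimal sketch of setting that scheduler explicitly (in 2.3
it is the default, so this is only illustrative; the directory and analyzer
are placeholders):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.ConcurrentMergeScheduler;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class ConcurrentMergeSketch {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/tmp/index"),
                                           new StandardAnalyzer(), true);
      // Run segment merges in background threads so addDocument calls are not
      // blocked for the duration of a very large merge.
      writer.setMergeScheduler(new ConcurrentMergeScheduler());
      // ... add documents ...
      writer.close();
    }
  }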




I've also thought about creating a simple optional threaded layer on  
top of IndexWriter which uses multiple threads to add documents,  
under the hood.  Such a class would expose all of the methods of  
IndexWriter (would feel just like IndexWriter), except calls to add/ 
updateDocument would drop into a queue which multiple threads  
(maintained by this class) would pull from and execute.  This would  
then let Lucene make use of even more concurrency ... and saves the  
complexity of application writers having to manage threads above  
Lucene.


+1  I have been thinking about this too.  Solr clearly demonstrates  
the benefits of this kind of approach, although even it doesn't make  
it seamless for users in the sense that they still need to divvy up  
the docs on the app side.


Here's some of my wishes:

1. Better Demo

2. Alternate scoring algorithms (which implies indexing too) that  
perform at or near the same level as the current ones


3. A way of announcing improvements to Interfaces such that we have  
better ability to add methods to interfaces, knowing full well it will  
break some people.  Same goes for deprecated.  In this day and age of  
agile programming, it seems a bit restrictive to me that we wait 1+  
years (the average time between major releases) to remove what we  
consider to be cruft in our code or add new capabilities to  
interfaces.  I would suggest we announce a deprecated method, version  
it, mark it to when it is going away (i.e. This will be removed in  
version 2.6) and then do so in that version.   So, if we deprecate  
something in 2.3, we could, assuming consecutive numbered releases,  
remove it in 2.5.  This would presumably move things up a bit to about  
the 6 mos. time range.  Just a thought...  :-)


-Grant








Re: O/S Search Comparisons

2007-12-08 Thread Doron Cohen
Grant Ingersoll [EMAIL PROTECTED] wrote on 08/12/2007 16:02:31:


 On Dec 8, 2007, at 4:51 AM, Michael McCandless wrote:

  Sometimes, when something like this comes up, it gives you the
  opportunity to take a step back and ask what are the things we
  really want Lucene to be going forward (the New Year is good for
  this kind of assessment as well)  What are its strengths and
  weaknesses?  What can we improve in the short term and what needs
  to improve in the longer term?  Maybe it's just that time of year
  to send out your Lucene Wish List... :-)
 
  +1
 
  There is still something for us to learn & improve in Lucene, even
  if the comparison is necessarily apples/oranges or unfair.
 
  Lucene was listed as not having Result Excerpt which isn't really
  fair,  though it is true you have to pull in contrib/highlighter to
  enable it.

 Yeah, I noted that mentally, but didn't think it was a big deal since
 not everyone wants it.  The other thing is, some of it comes down to
 how you structure your content.  I think a lot of people use metadata
 fields to provide enough summary info about a document.

 
 
  Did it crash on the 10 GB? I thought it said that it just took way
  too long (7 times the best or something). Frankly, either case is
  suspect. Last summer I indexed about 5 million docs with a total
  size at the *very* least of 10 GB on my 3 year old desktop. It
  didn't take much more than 8 hours to index and searches were
  still lightning fast. Maybe they forgot to give the JVM more than
  the default amount of RAM <g>
 
  The paper just said "ht://Dig and Lucene degraded considerably their
  indexing time, and we excluded them from the final comparison."
 
  Maybe Lucene just hit a very large segment merge and the author
  incorrectly thought something had gone wrong since the addDocument
  call was taking incredibly long?  In which case the new default
  ConcurrentMergeScheduler should improve that.  I would expect Lucene
  2.3 to now have an advantage in that it makes use of concurrency in
  the hardware, out of the box, whereas likely other older engines are
  single threaded.

 Yep.

 
 
  I've also thought about creating a simple optional threaded layer on
  top of IndexWriter which uses multiple threads to add documents,
  under the hood.  Such a class would expose all of the methods of
  IndexWriter (would feel just like IndexWriter), except calls to add/
  updateDocument would drop into a queue which multiple threads
  (maintained by this class) would pull from and execute.  This would
  then let Lucene make use of even more concurrency ... and saves the
  complexity of application writers having to manage threads above
  Lucene.

 +1  I have been thinking about this too.  Solr clearly demonstrates
 the benefits of this kind of approach, although even it doesn't make
 it seamless for users in the sense that they still need to divvy up
 the docs on the app side.

Would be nice if this layer also took care of searchers/readers
refreshing & warming.


 Here's some of my wishes:

 1. Better Demo

 2. Alternate scoring algorithms (which implies indexing too) that
 perform at or near the same level as the current ones

+1


 3. A way of announcing improvements to Interfaces such that we have
 better ability to add methods to interfaces, knowing full well it will
 break some people.  Same goes for deprecated.  In this day and age of
 agile programming, it seems a bit restrictive to me that we wait 1+
 years (the average time between major releases) to remove what we
 consider to be cruft in our code or add new capabilities to
 interfaces.  I would suggest we announce a deprecated method, version
 it, mark it to when it is going away (i.e. This will be removed in
 version 2.6) and then do so in that version.   So, if we deprecate
 something in 2.3, we could, assuming consecutive numbered releases,
 remove it in 2.5.  This would presumably move things up a bit to about
 the 6 mos. time range.  Just a thought...  :-)

 -Grant





Re: O/S Search Comparisons

2007-12-08 Thread robert engels
This is along the lines of what I have tried to get the Lucene  
community to adopt for a long time.


If you want to take Lucene to the next level, it needs a server  
implementation.


Only with this can you get efficient locks, caching, transactions,  
which leads to more efficient indexing and searching.


IMO, the shared storage nature of Lucene is its biggest weakness.   
A lot of changes have been made to improve this, when it probably  
just needs to be dropped. If you have a network, it is really no  
different to communicate with processes rather than storage.


On Dec 9, 2007, at 12:04 AM, Doron Cohen wrote:


Grant Ingersoll [EMAIL PROTECTED] wrote on 08/12/2007 16:02:31:



On Dec 8, 2007, at 4:51 AM, Michael McCandless wrote:


Sometimes, when something like this comes up, it gives you the
opportunity to take a step back and ask what are the things we
really want Lucene to be going forward (the New Year is good for
this kind of assessment as well)  What are its strengths and
weaknesses?  What can we improve in the short term and what needs
to improve in the longer term?  Maybe it's just that time of year
to send out your Lucene Wish List... :-)


+1

There is still something for us to learn & improve in Lucene, even
if the comparison is necessarily apples/oranges or unfair.

Lucene was listed as not having Result Excerpt which isn't really
fair,  though it is true you have to pull in contrib/highlighter to
enable it.


Yeah, I noted that mentally, but didn't think it was a big deal since
not everyone wants it.  The other thing is, some of it comes down to
how you structure your content.  I think a lot of people use metadata
fields to provide enough summary info about a document.





Did it crash on the 10 GB? I thought it said that it just took way
too long (7 times the best or something). Frankly, either case is
suspect. Last summer I indexed about 5 million docs with a total
size at the *very* least of 10 GB on my 3 year old desktop. It
didn't take much more than 8 hours to index and searches were
still lightning fast. Maybe they forgot to give the JVM more than
the default amount of RAM <g>


The paper just said "ht://Dig and Lucene degraded considerably their
indexing time, and we excluded them from the final comparison."

Maybe Lucene just hit a very large segment merge and the author
incorrectly thought something had gone wrong since the addDocument
call was taking incredibly long?  In which case the new default
ConcurrentMergeScheduler should improve that.  I would expect Lucene
2.3 to now have an advantage in that it makes use of concurrency in
the hardware, out of the box, whereas likely other older engines are
single threaded.


Yep.




I've also thought about creating a simple optional threaded layer on
top of IndexWriter which uses multiple threads to add documents,
under the hood.  Such a class would expose all of the methods of
IndexWriter (would feel just like IndexWriter), except calls to add/
updateDocument would drop into a queue which multiple threads
(maintained by this class) would pull from and execute.  This would
then let Lucene make use of even more concurrency ... and saves the
complexity of application writers having to manage threads above
Lucene.


+1  I have been thinking about this too.  Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.


Would be nice if this layer also took care of searchers/readers
refreshing & warming.



Here's some of my wishes:

1. Better Demo

2. Alternate scoring algorithms (which implies indexing too) that
perform at or near the same level as the current ones


+1



3. A way of announcing improvements to Interfaces such that we have
better ability to add methods to interfaces, knowing full well it will
break some people.  Same goes for deprecated.  In this day and age of
agile programming, it seems a bit restrictive to me that we wait 1+
years (the average time between major releases) to remove what we
consider to be cruft in our code or add new capabilities to
interfaces.  I would suggest we announce a deprecated method, version
it, mark it to when it is going away (i.e. This will be removed in
version 2.6) and then do so in that version.   So, if we deprecate
something in 2.3, we could, assuming consecutive numbered releases,
remove it in 2.5.  This would presumably move things up a bit to about
the 6 mos. time range.  Just a thought...  :-)

-Grant










Re: O/S Search Comparisons

2007-12-07 Thread Grant Ingersoll
Yeah, I wasn't too excited over it and I certainly didn't lose any  
sleep over it, but there are some interesting things of note in there  
concerning Lucene, including the claim that it fell over on indexing  
WT10g docs (page 40) and I am always looking for ways to improve  
things.  Overall, I think Lucene held up pretty well in the  
evaluation, and I know how suspect _any_ evaluation is given the  
myriad ways of doing search.  Still, when a well-respected researcher  
in the field says Lucene didn't do so hot in certain areas, I don't  
think we can dismiss them out of hand.   So regardless of the tests  
being right or wrong, they are worth either addressing the failures in  
Lucene or the failures in the test such that we make sure we are  
properly educating our users on how best to use Lucene.


I emailed the authors asking for information on how the test was run  
etc., so we'll see if anything comes of it.


On Dec 7, 2007, at 12:04 PM, robert engels wrote:

I wouldn't get too excited over this. Once again, it does not seem  
the evaluator understands the nature of GC based systems, and the  
memory statistics are quite out of whack. But it is hard to tell  
because there is no data on how memory consumption was actually  
measured.


A far better way of measuring memory consumption is to cap the  
process at different levels (max ram sizes), and compare the  
performance at each level.


There is also the fact that a process takes memory from disk cache, and  
vice versa, that heavily affects search performance, etc.


Since there is no detailed data (that I could find) about system  
configuration, etc. the results are highly suspect.


There is also no mention of performance on multi-processor systems.  
Some systems (like Lucene) pay a penalty to support multi-processing  
(both in Java and Lucene), and only realize this benefit when  
operating in a multi-processor environment.


Based on the sheer speed of XMLSearch and Zettair, those seem likely  
candidates to inspect their design.


On Dec 7, 2007, at 7:03 AM, Grant Ingersoll wrote:


Was wondering if people have seen 
http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf

Has some interesting comparisons.  Obviously, the comparison of  
Lucene indexing is done w/ 1.9 so it probably needs to be done  
again.  Just wondering if people see any opportunities to improve  
Lucene from it.  I am going to try and contact the authors to see  
if I can get what their setup values were (mergeFactor, Analyzer,  
etc.) as I think it would be interesting to run the tests again on  
2.3.


-Grant















Re: O/S Search Comparisons

2007-12-07 Thread Mark Miller
Did it crash on the 10 GB? I thought it said that it just took way too 
long (7 times the best or something). Frankly, either case is suspect. 
Last summer I indexed about 5 million docs with a total size at the 
*very* least of 10 GB on my 3 year old desktop. It didn't take much more 
than 8 hours to index and searches were still lightning fast. Maybe 
they forgot to give the JVM more than the default amount of RAM <g>


- Mark

Grant Ingersoll wrote:
All true and good points.  Lucene held up quite nicely in the search 
aspect (at least perf. wise) and I generally don't think making these 
kinds of comparisons is all that useful (we call it apples and oranges 
in English :-)  ).


What I am trying to get at is if this paper was just about Lucene and 
never mentioned a single other system, what, if anything, can we take 
from it that can help us make Lucene better.   I know, for instance, 
from my own personal experience, that 2.3 is somewhere in the range of 
3-5+ times faster than 2.2 (which I know is faster than 1.9).  That 
being said, the paper clearly states that Lucene was not capable of 
doing the WT10g docs because performance degraded too much.  Now, I 
know Lucene is pretty darn capable of a lot of things and people are 
using it to do web search, etc. at very large scales (I have 
personally talked w/ people doing it).  So, what I worry about is that 
either we are:

a) missing something in our defaults setup
b) missing something in our docs and our education efforts, or
c) we are missing some capability in our indexing such that it is 
crashing


Now, what is to be done?  It may well be nothing, but I just want to 
make sure we are comfortable with that decision or whether it is worth 
asking for a volunteer who has access to the WT10g docs to go have a 
look at it and see what happens.  I personally don't have access to 
these docs, otherwise I would try it out.  What we don't want to 
happen is for potential supporters/contributors to read that paper and 
say Lucene isn't for me because of this.


Sometimes, when something like this comes up, it gives you the 
opportunity to take a step back and ask what are the things we really 
want Lucene to be going forward (the New Year is good for this kind of 
assessment as well)  What are its strengths and weaknesses?  What can 
we improve in the short term and what needs to improve in the longer 
term?  Maybe it's just that time of year to send out your Lucene Wish 
List... :-)


Cheers,
Grant

PS:  Samir, any chance of contributing back your ranking algorithms?  :-)


On Dec 7, 2007, at 5:41 PM, Samir Abdou wrote:


There is an expression in French that says "comparer des pommes et des
poires", which literally means "to compare apples and pears".  That's what
this paper is about. From my point of view, such a comparison would be
interesting only if a cross analysis of different criteria (for example,
retrieval effectiveness (aka search quality), search time, indexing time,
index size, query language, index structure, and so on...) is done.
Comparing different systems based only on one criterion is not
well-grounded.  There is always a kind of trade-off: for example, besides
other parameters (ranking algorithm, frequency statistics, document
structure, etc.), indexing with zettair is much faster than indexing with
lucene, but if we consider searching time lucene is better than zettair. Why?
Because of many reasons, but probably zettair doesn't have the complex document
structure of lucene, besides the ranking algorithm (Okapi BM25 vs. tf-idf).
Some systems compute and store the scores at indexing time, which makes them
faster at searching time but less flexible if you want to change/implement a
new ranking algorithm.

"Still, when a well-respected researcher in the field says Lucene didn't do
so hot in certain areas,"

If we consider the search quality, that's simply not true if we know how to
implement in Lucene popular ranking algorithms such as Okapi BM25 (at least).
I've been working with Lucene for four years now; all experiments of my
thesis have been done using Lucene (with many adaptations to implement the
most recent ranking algorithms, including different language models, divergence
from randomness, etc.).  I also participated in major IR campaigns (NTCIR,
CLEF and TREC) and the results are not bad at all (see
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/CLIR/NTCIR5-OV-CLIR-KishidaK.pdf
for NTCIR-5 or
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/NTCIR6-OVERVIEW.pdf
for NTCIR-6; for CLEF have a look at
http://www.clef-campaign.org/2006/working_notes/workingnotes2006/dinunzioOCLEF2006.pdf,
...)   for other information search the web ;-)

Samir



-----Original Message-----
From: Mark Miller [mailto:[EMAIL PROTECTED]
Sent: Friday, December 7, 2007 21:01
To: java-dev@lucene.apache.org
Subject: Re: O/S Search Comparisons

Yes, and even if they did not use the stock defaults, I would bet

Re: O/S Search Comparisons

2007-12-07 Thread Grant Ingersoll
All true and good points.  Lucene held up quite nicely in the search  
aspect (at least perf. wise) and I generally don't think making these  
kinds of comparisons is all that useful (we call it apples and oranges  
in English :-)  ).


What I am trying to get at is if this paper was just about Lucene and  
never mentioned a single other system, what, if anything, can we take  
from it that can help us make Lucene better.   I know, for instance,  
from my own personal experience, that 2.3 is somewhere in the range of  
3-5+ times faster than 2.2 (which I know is faster than 1.9).  That  
being said, the paper clearly states that Lucene was not capable of  
doing the WT10g docs because performance degraded too much.  Now, I  
know Lucene is pretty darn capable of a lot of things and people are  
using it to do web search, etc. at very large scales (I have  
personally talked w/ people doing it).  So, what I worry about is that  
either we are:

a) missing something in our defaults setup
b) missing something in our docs and our education efforts, or
c) we are missing some capability in our indexing such that it is  
crashing


Now, what is to be done?  It may well be nothing, but I just want to  
make sure we are comfortable with that decision or whether it is worth  
asking for a volunteer who has access to the WT10g docs to go have a  
look at it and see what happens.  I personally don't have access to  
these docs, otherwise I would try it out.  What we don't want to  
happen is for potential supporters/contributors to read that paper and  
say Lucene isn't for me because of this.


Sometimes, when something like this comes up, it gives you the  
opportunity to take a step back and ask what are the things we really  
want Lucene to be going forward (the New Year is good for this kind of  
assessment as well)  What are its strengths and weaknesses?  What can  
we improve in the short term and what needs to improve in the longer  
term?  Maybe it's just that time of year to send out your Lucene Wish  
List... :-)


Cheers,
Grant

PS:  Samir, any chance of contributing back your ranking  
algorithms?  :-)



On Dec 7, 2007, at 5:41 PM, Samir Abdou wrote:


There is an expression in French that says "comparer des pommes et des
poires", which literally means "to compare apples and pears".  That's what
this paper is about. From my point of view, such a comparison would be
interesting only if a cross analysis of different criteria (for example,
retrieval effectiveness (aka search quality), search time, indexing time,
index size, query language, index structure, and so on...) is done.
Comparing different systems based only on one criterion is not
well-grounded.  There is always a kind of trade-off: for example, besides
other parameters (ranking algorithm, frequency statistics, document
structure, etc.), indexing with zettair is much faster than indexing with
lucene, but if we consider searching time lucene is better than zettair. Why?
Because of many reasons, but probably zettair doesn't have the complex document
structure of lucene, besides the ranking algorithm (Okapi BM25 vs. tf-idf).
Some systems compute and store the scores at indexing time, which makes them
faster at searching time but less flexible if you want to change/implement a
new ranking algorithm.

"Still, when a well-respected researcher in the field says Lucene didn't do
so hot in certain areas,"

If we consider the search quality, that's simply not true if we know how to
implement in Lucene popular ranking algorithms such as Okapi BM25 (at least).
I've been working with Lucene for four years now; all experiments of my
thesis have been done using Lucene (with many adaptations to implement the
most recent ranking algorithms, including different language models, divergence
from randomness, etc.).  I also participated in major IR campaigns (NTCIR,
CLEF and TREC) and the results are not bad at all (see
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/CLIR/NTCIR5
-OV-CLIR-KishidaK.pdf for NTCIR-5 or
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/NTCIR6-OVE
RVIEW.pdf for NTCIR-6, for CLEF have a look at
http://www.clef-campaign.org/2006/working_notes/workingnotes2006/dinunzioOCL
EF2006.pdf, ...)   for other information search the web ;-)

Samir



-----Original Message-----
From: Mark Miller [mailto:[EMAIL PROTECTED]
Sent: Friday, December 7, 2007 21:01
To: java-dev@lucene.apache.org
Subject: Re: O/S Search Comparisons

Yes, and even if they did not use the stock defaults, I would bet there
would be complaints about what was done wrong at every turn. This seems
like a very difficult thing to do. How long does it take to fully learn
how to correctly utilize each search engine for the task at hand? I am
sure longer than these busy men could possibly take. It seems that such
a comparison could only be done legitimately if experts for each search
engine set up the indexing/searching

RE: O/S Search Comparisons

2007-12-07 Thread Samir Abdou
There is an expression in French that says "comparer des pommes et des
poires", which literally means "to compare apples and pears".  That's what
this paper is about. From my point of view, such a comparison would be
interesting only if a cross analysis of different criteria (for example,
retrieval effectiveness (aka search quality), search time, indexing time,
index size, query language, index structure, and so on...) is done.
Comparing different systems based only on one criterion is not
well-grounded.  There is always a kind of trade-off: for example, besides
other parameters (ranking algorithm, frequency statistics, document
structure, etc.), indexing with zettair is much faster than indexing with
lucene, but if we consider searching time lucene is better than zettair. Why?
Because of many reasons, but probably zettair doesn't have the complex document
structure of lucene, besides the ranking algorithm (Okapi BM25 vs. tf-idf).
Some systems compute and store the scores at indexing time, which makes them
faster at searching time but less flexible if you want to change/implement a
new ranking algorithm.

"Still, when a well-respected researcher in the field says Lucene didn't do
so hot in certain areas,"

If we consider the search quality, that's simply not true if we know how to
implement in Lucene popular ranking algorithms such as Okapi BM25 (at least).
I've been working with Lucene for four years now; all experiments of my
thesis have been done using Lucene (with many adaptations to implement the
most recent ranking algorithms, including different language models, divergence
from randomness, etc.).  I also participated in major IR campaigns (NTCIR,
CLEF and TREC) and the results are not bad at all (see
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings5/data/CLIR/NTCIR5
-OV-CLIR-KishidaK.pdf for NTCIR-5 or
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings6/NTCIR/NTCIR6-OVE
RVIEW.pdf for NTCIR-6, for CLEF have a look at
http://www.clef-campaign.org/2006/working_notes/workingnotes2006/dinunzioOCL
EF2006.pdf, ...)   for other information search the web ;-)
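
As a rough illustration of the kind of adaptation being described (not
Samir's actual code), a custom Similarity in the 2.x API that replaces
Lucene's default sqrt(tf) with a BM25-style saturating term frequency; K1 is
a placeholder constant, and a full Okapi BM25 would also need the document
length relative to the average document length, which this simple override
does not model:

  import org.apache.lucene.search.DefaultSimilarity;

  /** Illustrative Similarity with a BM25-style saturating tf component. */
  public class Bm25LikeSimilarity extends DefaultSimilarity {
    private static final float K1 = 1.2f;  // placeholder saturation constant

    public float tf(float freq) {
      // Grows toward an asymptote instead of sqrt(freq) as in DefaultSimilarity.
      return freq * (K1 + 1f) / (freq + K1);
    }
  }

It would be plugged in with IndexWriter.setSimilarity(...) and
Searcher.setSimilarity(...) so that indexing-time norms and search-time
scoring stay consistent.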

Samir 


 -----Original Message-----
 From: Mark Miller [mailto:[EMAIL PROTECTED]
 Sent: Friday, December 7, 2007 21:01
 To: java-dev@lucene.apache.org
 Subject: Re: O/S Search Comparisons
 
 Yes, and even if they did not use the stock defaults, I would bet there
 would be complaints about what was done wrong at every turn. This seems
 like a very difficult thing to do. How long does it take to fully learn
 how to correctly utilize each search engine for the task at hand? I am
 sure longer than these busy men could possibly take. It seems that such
 a comparison could only be done legitimately if experts for each search
 engine set up the indexing/searching processes. Even then the results
 seem like they could be difficult to measure...eg was each search
 engine
 configured so that they would only break on spaces for indexing and do
 nothing else special at all? So many small settings and knowledge need
 to ensure each engine is on level ground...
 
 I doubt it will ever happen, but some sort of open source search off
 would be pretty cool <g>. Then each camp could properly configure their
 search engine for each task.
 
 - Mark
 
 Mike Klaas wrote:
  There is a good chance that they were using stock indexing defaults,
  based on:
 
  Lucene:
  "In the present work, the simple applications
  bundled with the library were used to index the collection."
 
  On 7-Dec-07, at 10:27 AM, Grant Ingersoll wrote:
 
  Yeah, I wasn't too excited over it and I certainly didn't lose any
  sleep over it, but there are some interesting things of note in
 there
  concerning Lucene, including the claim that it fell over on indexing
  WT10g docs (page 40) and I am always looking for ways to improve
  things.  Overall, I think Lucene held up pretty well in the
  evaluation, and I know how suspect _any_ evaluation is given the
  myriad ways of doing search.  Still, when a well-respected
 researcher
  in the field says Lucene didn't do so hot in certain areas, I don't
  think we can dismiss them out of hand.   So regardless of the tests
  being right or wrong, they are worth either addressing the failures
  in Lucene or the failures in the test such that we make sure we are
  properly educating our users on how best to use Lucene.
 
  I emailed the authors asking for information on how the test was run
  etc., so we'll see if anything comes of it.
 
  On Dec 7, 2007, at 12:04 PM, robert engels wrote:
 
  I wouldn't get too excited over this. Once again, it does not seem
  the evaluator understands the nature of GC based systems, and the
  memory statistics are quite out of whack. But it is hard to tell
  because there is no data on how memory consumption was actually
  measured.
 
  A far better way of measuring memory consumption is to cap the
  process at different levels (max ram sizes), and compare the
  performance at each level.
 
  There is also fact that a process

Re: O/S Search Comparisons

2007-12-07 Thread Mark Miller
Yes, and even if they did not use the stock defaults, I would bet there 
would be complaints about what was done wrong at every turn. This seems 
like a very difficult thing to do. How long does it take to fully learn 
how to correctly utilize each search engine for the task at hand? I am 
sure longer than these busy men could possibly take. It seems that such 
a comparison could only be done legitimately if experts for each search 
engine set up the indexing/searching processes. Even then the results 
seem like they could be difficult to measure...eg was each search engine 
configured so that they would only break on spaces for indexing and do 
nothing else special at all? So many small settings and knowledge need 
to ensure each engine is on level ground...


I doubt it will ever happen, but some sort of open source search off 
would be pretty cool <g>. Then each camp could properly configure their 
search engine for each task.


- Mark

Mike Klaas wrote:
There is a good chance that they were using stock indexing defaults, 
based on:


Lucene:
"In the present work, the simple applications
bundled with the library were used to index the collection."

On 7-Dec-07, at 10:27 AM, Grant Ingersoll wrote:

Yeah, I wasn't too excited over it and I certainly didn't lose any 
sleep over it, but there are some interesting things of note in there 
concerning Lucene, including the claim that it fell over on indexing 
WT10g docs (page 40) and I am always looking for ways to improve 
things.  Overall, I think Lucene held up pretty well in the 
evaluation, and I know how suspect _any_ evaluation is given the 
myriad ways of doing search.  Still, when a well-respected researcher 
in the field says Lucene didn't do so hot in certain areas, I don't 
think we can dismiss them out of hand.   So regardless of the tests 
being right or wrong, they are worth either addressing the failures 
in Lucene or the failures in the test such that we make sure we are 
properly educating our users on how best to use Lucene.


I emailed the authors asking for information on how the test was run 
etc., so we'll see if anything comes of it.


On Dec 7, 2007, at 12:04 PM, robert engels wrote:

I wouldn't get too excited over this. Once again, it does not seem 
the evaluator understands the nature of GC based systems, and the 
memory statistics are quite out of whack. But it is hard to tell 
because there is no data on how memory consumption was actually 
measured.


A far better way of measuring memory consumption is to cap the 
process at different levels (max ram sizes), and compare the 
performance at each level.


There is also the fact that a process takes memory from disk cache, and 
vice versa, that heavily affects search performance, etc.


Since there is no detailed data (that I could find) about system 
configuration, etc. the results are highly suspect.


There is also no mention of performance on multi-processor systems. 
Some systems (like Lucene) pay a penalty to support multi-processing 
(both in Java and Lucene), and only realize this benefit when 
operating in a multi-processor environment.


Based on the sheer speed of XMLSearch and Zettair, those seem likely 
candidates to inspect their design.


On Dec 7, 2007, at 7:03 AM, Grant Ingersoll wrote:

Was wondering if people have seen 
http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf


Has some interesting comparisons.  Obviously, the comparison of 
Lucene indexing is done w/ 1.9 so it probably needs to be done 
again.  Just wondering if people see any opportunities to improve 
Lucene from it.  I am going to try and contact the authors to see 
if I can get what their setup values were (mergeFactor, Analyzer, 
etc.) as I think it would be interesting to run the tests again on 
2.3.


-Grant























Re: O/S Search Comparisons

2007-12-07 Thread Mike Klaas
There is a good chance that they were using stock indexing defaults,  
based on:


Lucene:
"In the present work, the simple applications
bundled with the library were used to index the collection."

On 7-Dec-07, at 10:27 AM, Grant Ingersoll wrote:

Yeah, I wasn't too excited over it and I certainly didn't lose any  
sleep over it, but there are some interesting things of note in  
there concerning Lucene, including the claim that it fell over on  
indexing WT10g docs (page 40) and I am always looking for ways to  
improve things.  Overall, I think Lucene held up pretty well in the  
evaluation, and I know how suspect _any_ evaluation is given the  
myriad ways of doing search.  Still, when a well-respected  
researcher in the field says Lucene didn't do so hot in certain  
areas, I don't think we can dismiss them out of hand.   So  
regardless of the tests being right or wrong, they are worth either  
addressing the failures in Lucene or the failures in the test such  
that we make sure we are properly educating our users on how best  
to use Lucene.


I emailed the authors asking for information on how the test was  
run etc., so we'll see if anything comes of it.


On Dec 7, 2007, at 12:04 PM, robert engels wrote:

I wouldn't get too excited over this. Once again, it does not seem  
the evaluator understands the nature of GC based systems, and the  
memory statistics are quite out of whack. But it is hard to tell  
because there is no data on how memory consumption was actually  
measured.


A far better way of measuring memory consumption is to cap the  
process at different levels (max ram sizes), and compare the  
performance at each level.


There is also the fact that a process takes memory from disk cache,  
and vice versa, that heavily affects search performance, etc.


Since there is no detailed data (that I could find) about system  
configuration, etc. the results are highly suspect.


There is also no mention of performance on multi-processor  
systems. Some systems (like Lucene) pay a penalty to support multi- 
processing (both in Java and Lucene), and only realize this  
benefit when operating in a multi-processor environment.


Based on the sheer speed of XMLSearch and Zettair, those seem  
likely candidates to inspect their design.


On Dec 7, 2007, at 7:03 AM, Grant Ingersoll wrote:

Was wondering if people have seen http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf


Has some interesting comparisons.  Obviously, the comparison of  
Lucene indexing is done w/ 1.9 so it probably needs to be done  
again.  Just wondering if people see any opportunities to improve  
Lucene from it.  I am going to try and contact the authors to  
see if I can get what their setup values were (mergeFactor,  
Analyzer, etc.) as I think it would be interesting to run the  
tests again on 2.3.


-Grant



 












