I did hear back from the authors. Some of the issues were, I think, due to the value chosen for mergeFactor (10,000), but there also seemed to be some questions about parsing the TREC collection. The collection was split out into individual files, as opposed to streaming in the documents like we do with Wikipedia, so I/O overhead may be an issue. At the time, 1.9.1 did not have much TREC support, so splitting the files was probably the easiest way to do it. Their indexing code was based on the demo and some reading of LIA.
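To make the streaming alternative concrete, here is a minimal sketch of reading documents one at a time from a single concatenated TREC file instead of opening one file per document. It is not the authors' code or anything from contrib/benchmark. The function name stream_trec_docs is made up for illustration, and it assumes the common layout where <DOC> and </DOC> sit on their own lines and each document carries a <DOCNO> element; some TREC collections differ.

```python
import io
import re
from typing import Iterator, TextIO, Tuple

# Assumption: each document carries its id in a <DOCNO>...</DOCNO> element.
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)


def stream_trec_docs(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (docno, raw_body) pairs one at a time from an open TREC file.

    Only the current document is held in memory, so a single large file
    can be read sequentially without one open/close per document.
    """
    buffer = []
    inside = False
    for line in handle:
        stripped = line.strip()
        if stripped == "<DOC>":
            # Start of a new document: reset the buffer.
            inside = True
            buffer = []
        elif stripped == "</DOC>" and inside:
            # End of the document: extract the id and hand it to the caller.
            body = "".join(buffer)
            match = DOCNO_RE.search(body)
            docno = match.group(1) if match else ""
            yield docno, body
            inside = False
        elif inside:
            buffer.append(line)


if __name__ == "__main__":
    sample = (
        "<DOC>\n<DOCNO> A-1 </DOCNO>\n<TEXT>\nfirst body\n</TEXT>\n</DOC>\n"
        "<DOC>\n<DOCNO> B-2 </DOCNO>\n<TEXT>\nsecond body\n</TEXT>\n</DOC>\n"
    )
    for docno, body in stream_trec_docs(io.StringIO(sample)):
        print(docno, len(body))
```

Each yielded pair would then be turned into an index document by the caller; that part is left out because it depends on the indexing code in use.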

They thought they would try Lucene again when 2.3 comes out. From our end, I think we need to improve the docs around mergeFactor. We generally just say bigger is better, but my understanding is that there is definitely a limit to this (100? maybe 1000?), so we should probably suggest that in the docs. And, of course, I think the new contrib/benchmark has support for reading TREC (although I don't know if it handles streaming it), so I think it shouldn't be a problem this time around.

At any rate, I think we are for the most part doing the right things. Does anyone have thoughts on what upper bound we should advise for mergeFactor?
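As a rough way to reason about that upper bound, the sketch below models an idealized logarithmic merge policy: every flush writes one new segment, and whenever mergeFactor segments accumulate at one level they are merged into a single segment at the next level. This is a simplified model, not Lucene's actual merge code. It ignores maxMergeDocs, deletes, and optimize(), and the function name segments_after is made up for illustration. Under these assumptions, the number of segments left on disk is the sum of the digits of the flush count written in base mergeFactor.

```python
def segments_after(flushes: int, merge_factor: int) -> int:
    """Segments remaining after `flushes` flushed segments in the idealized model.

    Each full group of `merge_factor` segments at a level collapses into one
    segment at the next level, so the remainder at every level is what stays
    on disk. The total is the digit sum of `flushes` in base `merge_factor`.
    """
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    if flushes < 0:
        raise ValueError("flushes must be non-negative")
    count = 0
    while flushes > 0:
        flushes, remainder = divmod(flushes, merge_factor)
        count += remainder  # segments left unmerged at this level
    return count


if __name__ == "__main__":
    # Hypothetical workload: 9,999 flushes during one indexing run.
    for merge_factor in (10, 100, 1000, 10000):
        print(merge_factor, segments_after(9999, merge_factor))
```

In this model, 9,999 flushes leave 36 segments with mergeFactor 10, 198 with 100, 1,008 with 1000, and 9,999 with 10,000, because at 10,000 no merge is ever triggered. Since every segment is at least one file on disk and a search has to visit each segment, this is the kind of growth that argues for recommending a bound rather than "bigger is better".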

Cheers,
Grant


On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:

On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:

+1  I have been thinking about this too.  Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.

Would be nice if this layer also took care of searchers/readers
refreshing & warming.

Solr has well-tested code that provides all this functionality and more (except for automatically spawning extra indexing threads, which I agree would be a useful addition). It does depend heavily on Java 1.5's java.util.concurrent package, though. Many people seem to like using Solr as an embedded library layer on top of Lucene to do it all in-process, as well.

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



