I did hear back from the authors. Some of the issues stemmed from the
value chosen for mergeFactor (10,000), I think, but there also seemed
to be some questions about parsing the TREC collection. It was split
out into individual files, as opposed to streaming in the documents
like we do with Wikipedia, so I/O overhead may be an issue.
At the time, 1.9.1 did not have much TREC support, so splitting files
was probably the easiest way to do it. Their indexing code was based
off the demo and some LIA reading.
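
For what it's worth, here is a rough sketch of what streaming the
documents from one large file could look like, rather than opening one
file per document. TrecDocStream and docNo are hypothetical names, not
anything in Lucene or contrib/benchmark, and the sketch assumes each
<DOC> and </DOC> tag sits on its own line, which may not hold for every
TREC collection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper (not part of Lucene): pulls <DOC>...</DOC> records
// one at a time from a single stream instead of one file per document.
// Assumes <DOC> and </DOC> each appear alone on a line.
class TrecDocStream {
    private final BufferedReader in;

    TrecDocStream(Reader reader) {
        this.in = new BufferedReader(reader);
    }

    /** Returns the text between the next <DOC> and </DOC>, or null at end of stream. */
    String next() {
        try {
            String line;
            // Skip anything before the next opening tag.
            while ((line = in.readLine()) != null) {
                if (line.trim().equals("<DOC>")) {
                    break;
                }
            }
            if (line == null) {
                return null; // no more documents
            }
            StringBuilder body = new StringBuilder();
            while ((line = in.readLine()) != null) {
                if (line.trim().equals("</DOC>")) {
                    return body.toString();
                }
                body.append(line).append('\n');
            }
            return null; // truncated record: closing tag never seen
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Extracts the trimmed text between <DOCNO> and </DOCNO>, or null if absent. */
    static String docNo(String doc) {
        int start = doc.indexOf("<DOCNO>");
        int end = doc.indexOf("</DOCNO>");
        if (start < 0 || end < start) {
            return null;
        }
        return doc.substring(start + "<DOCNO>".length(), end).trim();
    }
}
```

Each string returned by next() would then be turned into a Lucene
Document by whatever indexing code is already in place; only one file
handle stays open for the whole collection.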
They thought they would try Lucene again when 2.3 comes out. From our
end, I think we need to improve the docs around mergeFactor. We
generally just say bigger is better, but my understanding is that there
is definitely a limit to this (100? Maybe 1,000?), so we should probably
suggest that in the docs. And, of course, I think the new
contrib/benchmark has support for reading TREC (although I don't know
if it handles streaming it), so I think it shouldn't be a problem this
time around.
At any rate, I think we are for the most part doing the right things.
Does anyone have thoughts or advice on an upper bound for mergeFactor?
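
To make the "there is a limit" point concrete, here is a toy model of
how many segments pile up for a given mergeFactor. It is only a sketch:
MergeFactorModel is a made-up class, it assumes every flush produces one
equal-sized segment and that mergeFactor segments at one level merge
into a single segment at the next level, and it ignores everything else
the real merge logic considers.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only, not Lucene's merge code. Assumes one segment per flush
// and that mergeFactor same-level segments merge into one at the next level.
class MergeFactorModel {

    /** Number of segments left on disk after the given number of flushes. */
    static int segmentsAfter(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        // countPerLevel.get(i) = segments currently sitting at merge level i
        List<Integer> countPerLevel = new ArrayList<Integer>();
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            while (true) {
                if (countPerLevel.size() == level) {
                    countPerLevel.add(0);
                }
                int n = countPerLevel.get(level) + 1;
                if (n < mergeFactor) {
                    countPerLevel.set(level, n);
                    break;
                }
                // Level is full: merge it away and carry one segment upward.
                countPerLevel.set(level, 0);
                level++;
            }
        }
        int total = 0;
        for (int n : countPerLevel) {
            total += n;
        }
        return total;
    }
}
```

In this model, 9,999 flushes leave 36 segments with mergeFactor 10, 198
with mergeFactor 100, and 9,999 with mergeFactor 10,000, since no merge
ever triggers in the last case. Every one of those segments has its own
files to open at search time, which is why a very large value stops
paying off.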
Cheers,
Grant
On Dec 10, 2007, at 2:54 PM, Mike Klaas wrote:
On 8-Dec-07, at 10:04 PM, Doron Cohen wrote:
+1 I have been thinking about this too. Solr clearly demonstrates
the benefits of this kind of approach, although even it doesn't make
it seamless for users in the sense that they still need to divvy up
the docs on the app side.
Would be nice if this layer also took care of searchers/readers
refreshing & warming.
Solr has well-tested code that provides all this functionality and
more (except for automatically spawning extra indexing threads,
which I agree would be a useful addition). It does heavily depend
on Java 1.5's java.util.concurrent package, though. Many people seem
to like using Solr as an embedded library layer on top of Lucene to do
it all in-process, as well.
-Mike
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ