Grant Ingersoll <[EMAIL PROTECTED]> wrote on 08/12/2007 16:02:31:

> On Dec 8, 2007, at 4:51 AM, Michael McCandless wrote:
>
> >>> Sometimes, when something like this comes up, it gives you the
> >>> opportunity to take a step back and ask what are the things we
> >>> really want Lucene to be going forward (the New Year is good for
> >>> this kind of assessment as well). What are its strengths and
> >>> weaknesses? What can we improve in the short term and what needs
> >>> to improve in the longer term? Maybe it's just that time of year
> >>> to send out your Lucene Wish List... :-)
> >
> > +1
> >
> > There is still something for us to learn & improve in Lucene, even
> > if the comparison is necessarily apples/oranges or unfair.
> >
> > Lucene was listed as not having "Result Excerpt" which isn't really
> > fair, though it is true you have to pull in contrib/highlighter to
> > enable it.
>
> Yeah, I noted that mentally, but didn't think it was a big deal since
> not everyone wants it. The other thing is, some of it comes down to
> how you structure your content. I think a lot of people use metadata
> fields to provide enough "summary" info about a document.
>
> >> Did it crash on the 10 GB? I thought it said that it just took way
> >> too long (7 times the best or something). Frankly, either case is
> >> suspect. Last summer I indexed about 5 million docs with a total
> >> size at the *very* least of 10 GB on my 3 year old desktop. It
> >> didn't take much more than 8 hours to index and searches were
> >> still lightning fast. Maybe they forgot to give the JVM more than
> >> the default amount of RAM <g>
> >
> > The paper just said "ht://Dig and Lucene degraded considerably their
> > indexing time, and we excluded them from the final comparison".
> >
> > Maybe Lucene just hit a very large segment merge and the author
> > incorrectly thought something had gone wrong since the addDocument
> > call was taking incredibly long?
> > In which case the new default
> > ConcurrentMergeScheduler should improve that. I would expect Lucene
> > 2.3 to now have an advantage in that it makes use of concurrency in
> > the hardware, out of the box, whereas other older engines are likely
> > single-threaded.
>
> Yep.
>
> > I've also thought about creating a simple optional threaded layer on
> > top of IndexWriter which uses multiple threads to add documents,
> > under the hood. Such a class would expose all of the methods of
> > IndexWriter (would feel just like IndexWriter), except calls to
> > add/updateDocument would drop into a queue which multiple threads
> > (maintained by this class) would pull from and execute. This would
> > then let Lucene make use of even more concurrency ... and save
> > application writers the "complexity" of managing threads above
> > Lucene.
>
> +1 I have been thinking about this too. Solr clearly demonstrates
> the benefits of this kind of approach, although even it doesn't make
> it seamless for users in the sense that they still need to divvy up
> the docs on the app side.
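To make the queue-plus-workers idea above concrete, here is a minimal sketch in plain Java. It is only an illustration under assumptions, not an existing Lucene class: `DocumentSink` and `ThreadedWriter` are hypothetical names, and `DocumentSink` merely stands in for whatever add-document call the real writer exposes. The thread count and queue capacity are arbitrary parameters.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a writer's add-document call; not a Lucene type. */
interface DocumentSink<D> {
    void addDocument(D doc) throws Exception;
}

/** Sketch: callers enqueue documents, worker threads owned by this class add them. */
final class ThreadedWriter<D> implements AutoCloseable {
    private final BlockingQueue<Object> queue;
    private final Object poison = new Object(); // sentinel telling a worker to stop
    private final Thread[] workers;
    private final AtomicReference<Exception> firstError = new AtomicReference<>();

    ThreadedWriter(DocumentSink<D> sink, int numThreads, int queueCapacity) {
        queue = new ArrayBlockingQueue<>(queueCapacity);
        workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(() -> drain(sink), "threaded-writer-" + i);
            workers[i].start();
        }
    }

    /** Blocks while the queue is full, so callers cannot run far ahead of the workers. */
    void addDocument(D doc) throws InterruptedException {
        queue.put(doc);
    }

    @SuppressWarnings("unchecked")
    private void drain(DocumentSink<D> sink) {
        try {
            for (Object item = queue.take(); item != poison; item = queue.take()) {
                try {
                    sink.addDocument((D) item);
                } catch (Exception e) {
                    firstError.compareAndSet(null, e); // keep the first failure, keep draining
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop this worker if interrupted
        }
    }

    /** Waits until all queued documents are processed, then rethrows the first failure. */
    @Override
    public void close() throws Exception {
        for (int i = 0; i < workers.length; i++) {
            queue.put(poison); // one sentinel per worker, queued after all documents
        }
        for (Thread t : workers) {
            t.join();
        }
        Exception e = firstError.get();
        if (e != null) {
            throw e;
        }
    }
}
```

With a real IndexWriter the sink would presumably be a thin wrapper around its addDocument call, relying on IndexWriter being safe to use from several threads. The bounded queue is a design choice in this sketch: it applies back-pressure to the caller rather than buffering an unbounded number of documents in memory. Note that documents are not guaranteed to be added in the order they were queued.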
It would be nice if this layer also took care of refreshing and warming
searchers/readers.

> Here are some of my wishes:
>
> 1. Better Demo
>
> 2. Alternate scoring algorithms (which implies indexing too) that
> perform at or near the same level as the current ones

+1

> 3. A way of announcing improvements to Interfaces such that we have
> better ability to add methods to interfaces, knowing full well it will
> break some people. Same goes for deprecations. In this day and age of
> agile programming, it seems a bit restrictive to me that we wait 1+
> years (the average time between major releases) to remove what we
> consider to be cruft in our code or add new capabilities to
> interfaces. I would suggest we announce a deprecated method, version
> it, mark it with when it is going away (e.g. "This will be removed in
> version 2.6") and then do so in that version. So, if we deprecate
> something in 2.3, we could, assuming consecutive numbered releases,
> remove it in 2.5. This would presumably move things up a bit to about
> the 6 mos. time range. Just a thought... :-)
>
> -Grant