Re: Beyond Lucene 2.0 Index Design

2007-01-12 Thread Paul Elschot
Gentlemen, On Friday 12 January 2007 21:00, Chuck Williams wrote: > > Doug Cutting wrote on 01/12/2007 09:49 AM: > > Marvin Humphrey wrote: > >> Can you show us some code or pseudo-code for a BooleanScorer that > >> would use impact-sorted posting lists? > > > > Another way to interpret this prop

Re: Beyond Lucene 2.0 Index Design

2007-01-12 Thread Chuck Williams
Doug Cutting wrote on 01/12/2007 09:49 AM: > Marvin Humphrey wrote: >> Can you show us some code or pseudo-code for a BooleanScorer that >> would use impact-sorted posting lists? > > Another way to interpret this proposal is index-only: the low-level > indexing APIs should be general enough to per

Re: Beyond Lucene 2.0 Index Design

2007-01-12 Thread Doug Cutting
Marvin Humphrey wrote: Can you show us some code or pseudo-code for a BooleanScorer that would use impact-sorted posting lists? Another way to interpret this proposal is index-only: the low-level indexing APIs should be general enough to permit impact-sorted posting lists, and perhaps an impa

RE: Beyond Lucene 2.0 Index Design

2007-01-12 Thread Dalton, Jeffery
Thanks Grant, I will take a look at this. > -Original Message- > From: Grant Ingersoll [mailto:[EMAIL PROTECTED] > Sent: Thursday, January 11, 2007 8:12 AM > To: java-dev@lucene.apache.org > Subject: Re: Beyond Lucene 2.0 Index Design > > Hi Jeff, > > Wond

RE: Beyond Lucene 2.0 Index Design

2007-01-12 Thread Dalton, Jeffery
y 10, 2007 5:41 PM > To: java-dev@lucene.apache.org > Subject: Re: Beyond Lucene 2.0 Index Design > > I have a couple of questions about the original post of the > new index design: > > (1) Question on the posting list > > > f. > ,],...[docN, freq > >

RE: Beyond Lucene 2.0 Index Design

2007-01-12 Thread Dalton, Jeffery
IL PROTECTED] > Sent: Wednesday, January 10, 2007 5:12 PM > To: java-dev@lucene.apache.org > Subject: Re: Beyond Lucene 2.0 Index Design > > Hi, Jeff, > > I like the idea of impact based scoring. However, could you > elaborate more on why we only need to use single field at

Re: Beyond Lucene 2.0 Index Design

2007-01-11 Thread Marvin Humphrey
On Jan 11, 2007, at 8:37 PM, Ming Lei wrote: But practically, the approximation (as in my original post) should work well enough for large corpus and relevancy-driven retrieval. The saving on disk access for large corpus (implies very long posting list) will be huge by impact-sorted posting

Re: Beyond Lucene 2.0 Index Design

2007-01-11 Thread Ming Lei
Marvin, Several posts back on this thread, I described an impact-sorted posting list algorithm for conjunctive boolean queries. Your concerns about impact sorting in the boolean retrieval model are valid. But practically, the approximation (as in my original post) should work well enough for large corp

Re: Beyond Lucene 2.0 Index Design

2007-01-11 Thread Marvin Humphrey
On Jan 11, 2007, at 2:30 PM, jian chen wrote: It seems to me that the impact-sorted list makes sense if you are trying to do pure vector space based ranking. This is from what I have read from the research papers. They all talk about how to optimize the vector space model using this imp

Re: Beyond Lucene 2.0 Index Design

2007-01-11 Thread jian chen
I have the same question. It seems very hard to do phrase-based queries efficiently. I think most search engines do phrase-based queries, or at least appear to. So, as in Google, the query result must contain all the words the user searched on. It seems to me that the impact-sorted list

Re: Beyond Lucene 2.0 Index Design

2007-01-11 Thread Marvin Humphrey
On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote: e. f. ],...[docN, freq ,]) How do you build an efficient PhraseScorer to work with an impact-sorted posting list? The way PhraseScorer currently works is: find a doc that contains all terms, then see if the terms occur consecutively in
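The two steps Marvin describes (find a doc that contains all terms, then check that the terms occur consecutively) can be sketched roughly as follows. This is only an illustration, not Lucene's PhraseScorer: the class PhraseMatchSketch, its method names, and the in-memory SortedMap postings (doc id mapped to ascending term positions) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;

/** Hypothetical sketch of doc-ordered phrase matching; not Lucene code. */
final class PhraseMatchSketch {

    /**
     * postingsPerTerm.get(i) maps doc id -> ascending positions of the i-th
     * phrase term in that doc (an assumed in-memory stand-in for a posting list).
     * Returns the doc ids, in ascending order, that contain the whole phrase.
     */
    static List<Integer> matchPhrase(List<SortedMap<Integer, int[]>> postingsPerTerm) {
        List<Integer> hits = new ArrayList<>();
        if (postingsPerTerm.isEmpty()) {
            return hits;
        }
        for (int doc : postingsPerTerm.get(0).keySet()) {
            // Step 1: the doc must contain every term of the phrase.
            boolean inAll = true;
            for (SortedMap<Integer, int[]> postings : postingsPerTerm) {
                if (!postings.containsKey(doc)) {
                    inAll = false;
                    break;
                }
            }
            // Step 2: the terms must occur at consecutive positions.
            if (inAll && occursConsecutively(doc, postingsPerTerm)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    /** True if term i occurs at position start + i for some start of term 0. */
    static boolean occursConsecutively(int doc, List<SortedMap<Integer, int[]>> postingsPerTerm) {
        for (int start : postingsPerTerm.get(0).get(doc)) {
            boolean match = true;
            for (int i = 1; i < postingsPerTerm.size() && match; i++) {
                int[] positions = postingsPerTerm.get(i).get(doc);
                match = Arrays.binarySearch(positions, start + i) >= 0;
            }
            if (match) {
                return true;
            }
        }
        return false;
    }
}
```

Step 1 is cheap here because each posting list can be probed by doc id. With lists ordered by impact rather than by doc id, that probe has no cheap equivalent, which is the difficulty the question points at.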

Re: Beyond Lucene 2.0 Index Design

2007-01-11 Thread Grant Ingersoll
Hi Jeff, Wondering if you (and/or others) would be interested in taking a look at https://issues.apache.org/jira/browse/LUCENE-662 and vetting the new interfaces, etc. to see if you could come up w/ a prototype implementation. This would help move along 662 as it would sort out some of t

Re: Beyond Lucene 2.0 Index Design

2007-01-10 Thread Ming Lei
The idea of "impact" and "impact-sorted posting list" should work in practice with the boolean model, by approximation, in the following way:
(1) Index Structure
Inverted-Index : * posting-list: + (sorted by impact) occurrence: position
(2) Retrieval Algorithm for boolean query "a AND b"
set an impa

Re: Beyond Lucene 2.0 Index Design

2007-01-10 Thread Ming Lei
ses). In > > > addition to having fewer posting lists to > examine, > > you often don't need > > > to read to the end of long posting lists when > > processing with a > > > score-at-a-time approach (see Anh/Moffat's > Pruned > > Query Eva

Re: Beyond Lucene 2.0 Index Design

2007-01-10 Thread Ming Lei
> > score-at-a-time approach (see Anh/Moffat's Pruned > Query Evaluation Using > > Pre-Computed Impacts, SIGIR 2006) for details on > one potential > > algorithm. > > > > I'm not quite sure what you mean when mention > leaving them out and >

Re: Beyond Lucene 2.0 Index Design

2007-01-10 Thread Ming Lei
offat's Pruned > Query Evaluation Using > > Pre-Computed Impacts, SIGIR 2006) for details on > one potential > > algorithm. > > > > I'm not quite sure what you mean when mention > leaving them out and > > re-calculating them at merge time. > > >

Re: Beyond Lucene 2.0 Index Design

2007-01-10 Thread jian chen
. - Jeff > -Original Message- > From: Marvin Humphrey [mailto:[EMAIL PROTECTED] > Sent: Tuesday, January 09, 2007 2:58 PM > To: java-dev@lucene.apache.org > Subject: Re: Beyond Lucene 2.0 Index Design > > > On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote: >

Re: Beyond Lucene 2.0 Index Design

2007-01-10 Thread jian chen
when mention leaving them out and re-calculating them at merge time. - Jeff > -Original Message- > From: Marvin Humphrey [mailto:[EMAIL PROTECTED] > Sent: Tuesday, January 09, 2007 2:58 PM > To: java-dev@lucene.apache.org > Subject: Re: Beyond Lucene 2.0 Index Design > > >

RE: Beyond Lucene 2.0 Index Design

2007-01-09 Thread Dalton, Jeffery
ginal Message- > From: Marvin Humphrey [mailto:[EMAIL PROTECTED] > Sent: Tuesday, January 09, 2007 2:58 PM > To: java-dev@lucene.apache.org > Subject: Re: Beyond Lucene 2.0 Index Design > > > On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote: > > > e. > >

RE: Beyond Lucene 2.0 Index Design

2007-01-09 Thread Dalton, Jeffery
rers to perform scoring and intersection. The end product would be a very scalable and flexible solution. - Jeff > -Original Message- > From: Doron Cohen [mailto:[EMAIL PROTECTED] > Sent: Tuesday, January 09, 2007 5:27 PM > To: java-dev@lucene.apache.org > Subject: Re:

Re: Beyond Lucene 2.0 Index Design

2007-01-09 Thread Doron Cohen
Scoring today goes doc-at-a-time - all scorers and term-posting-readers advance together; once a new doc is processed, scoring of previous docs is known and final. This allows maintaining a finite size queue for collecting best hits. Then, for huge collections, having to exhaustively scan all posti
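A minimal sketch of the doc-at-a-time loop Doron describes, assuming doc-ordered posting lists held as plain int arrays and a fixed, made-up score contribution per term. DocAtATimeSketch, Hit, and topK are hypothetical names for this example and do not correspond to Lucene's scorer or hit-queue classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of doc-at-a-time scoring with a bounded hit queue. */
final class DocAtATimeSketch {

    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * postings[t] holds the ascending doc ids of term t; weights[t] is an
     * assumed per-term score contribution. Doc ids must be smaller than
     * Integer.MAX_VALUE, which is used as the "exhausted" sentinel.
     * Returns at most k hits, best score first.
     */
    static List<Hit> topK(int[][] postings, float[] weights, int k) {
        // Min-heap on score: the head is always the weakest hit kept so far.
        PriorityQueue<Hit> queue =
                new PriorityQueue<>((Hit x, Hit y) -> Float.compare(x.score, y.score));
        int[] cursor = new int[postings.length];
        while (true) {
            // All readers advance together: pick the smallest unprocessed doc.
            int doc = Integer.MAX_VALUE;
            for (int t = 0; t < postings.length; t++) {
                if (cursor[t] < postings[t].length) {
                    doc = Math.min(doc, postings[t][cursor[t]]);
                }
            }
            if (doc == Integer.MAX_VALUE) {
                break; // every posting list is exhausted
            }
            float score = 0f;
            for (int t = 0; t < postings.length; t++) {
                if (cursor[t] < postings[t].length && postings[t][cursor[t]] == doc) {
                    score += weights[t];
                    cursor[t]++;
                }
            }
            // The score for this doc is now final, so the queue can stay bounded.
            queue.add(new Hit(doc, score));
            if (queue.size() > k) {
                queue.poll();
            }
        }
        List<Hit> best = new ArrayList<>();
        while (!queue.isEmpty()) {
            best.add(0, queue.poll()); // weakest comes out first, so prepend
        }
        return best;
    }
}
```

Because each doc's score is complete before the next doc is touched, the queue never needs more than k + 1 entries. The cost is that every posting list is read to its end, which is the exhaustive scan the message goes on to mention.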

Re: Beyond Lucene 2.0 Index Design

2007-01-09 Thread Marvin Humphrey
On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote: e. f. ],...[docN, freq ,]) Does the impact have any use after it's used to sort the postings? Can we leave it out of the index format and recalculate at merge-time? Marvin Humphrey Rectangular Research http://www.rectangular.com/ ---

Beyond Lucene 2.0 Index Design

2007-01-09 Thread Dalton, Jeffery
Hi, I wanted to start some discussion about possible future Lucene file / index formats. This extends the discussion of Flexible Lucene Indexing on the wiki: http://wiki.apache.org/jakarta-lucene/FlexibleIndexing Note: Related sources are listed at the end. I would like