Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-29 Thread Michael McCandless
I think that's a good idea for Lucy. Mike On Fri, Mar 26, 2010 at 10:58 AM, Marvin Humphrey wrote: > On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote: >> > Maybe aggressive automatic data-reduction makes more sense in the context >> > of >> > "flexible matching", which is more

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-29 Thread Michael McCandless
On Thu, Mar 25, 2010 at 1:20 PM, Marvin Humphrey wrote: > On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote: >> >> Also, will Lucy store the original stats? >> > >> > These? >> > >> > * Total number of tokens in the field. >> > * Number of unique terms in the field. >> > * D

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-26 Thread Marvin Humphrey
On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote: > > Maybe aggressive automatic data-reduction makes more sense in the context of > > "flexible matching", which is more expansive than "flexible scoring"? > > I think so. Maybe it shouldn't be called a Similarity (which to me > (

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-25 Thread Marvin Humphrey
On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote: > >> Also, will Lucy store the original stats? > > > > These? > > > > * Total number of tokens in the field. > > * Number of unique terms in the field. > > * Doc boost. > > * Field boost. > > Also sum(tf). Robert can gene

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-25 Thread Michael McCandless
On Mon, Mar 22, 2010 at 12:45 PM, Marvin Humphrey wrote: > On Thu, Mar 18, 2010 at 05:16:23AM -0500, Michael McCandless wrote: >> Also, will Lucy store the original stats? > > These? > > * Total number of tokens in the field. > * Number of unique terms in the field. > * Doc boost. > * Fiel

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-22 Thread Marvin Humphrey
On Thu, Mar 18, 2010 at 05:16:23AM -0500, Michael McCandless wrote: > Also, will Lucy store the original stats? These? * Total number of tokens in the field. * Number of unique terms in the field. * Doc boost. * Field boost. That would depend on which Similarity the user specs for

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-18 Thread Michael McCandless
On Mon, Mar 15, 2010 at 7:49 PM, Marvin Humphrey wrote: > On Mon, Mar 15, 2010 at 05:28:33AM -0500, Michael McCandless wrote: >> I mean specifically one should not have to commit to the precise >> scoring model they will use for a given field, when they index that >> field. > > Yeah, I've never se

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-15 Thread Marvin Humphrey
On Mon, Mar 15, 2010 at 05:28:33AM -0500, Michael McCandless wrote: > I mean specifically one should not have to commit to the precise > scoring model they will use for a given field, when they index that > field. Yeah, I've never seen committing to a precise scoring model at index-time via Sim ch

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-15 Thread Robert Muir
>>> But I don't like baking in search concepts at index time... >> > Many scoring models are possible if you store enough stats in the > index. > in general the missing stats seem to fit in two buckets/categories: 1) length normalization pivot: average length in bytes, terms, unique terms 2) term

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-15 Thread Michael McCandless
On Mon, Mar 15, 2010 at 12:03 AM, Marvin Humphrey wrote: > On Sat, Mar 13, 2010 at 06:41:26AM -0500, Michael McCandless wrote: > >> I still don't think similarity should have any bearing during indexing. > > Similarity has always, from day one, affected the contents of the index. This > idea that

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-14 Thread Marvin Humphrey
On Sat, Mar 13, 2010 at 06:41:26AM -0500, Michael McCandless wrote: > I still don't think similarity should have any bearing during indexing. Similarity has always, from day one, affected the contents of the index. This idea that it should be totally divorced from indexing is, in fact, a very si

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-13 Thread Michael McCandless
On Fri, Mar 12, 2010 at 8:31 PM, Marvin Humphrey wrote: > On Thu, Mar 11, 2010 at 05:59:03AM -0500, Michael McCandless wrote: >> > So there would be polymorphism in the decoding phase while we're supplying >> > information the Similarity object needs to make its similarity judgments. >> > However,

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-13 Thread Michael McCandless
On Thu, Mar 11, 2010 at 12:35 PM, Marvin Humphrey wrote: > On Mon, Mar 08, 2010 at 02:10:35PM -0500, Michael McCandless wrote: > >> We ask it to give us a Codec. > > There's a conflict between the segment-wide role of the "Codec" class and its > role as specifier for posting format. > > In some se

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-12 Thread Marvin Humphrey
On Thu, Mar 11, 2010 at 05:59:03AM -0500, Michael McCandless wrote: > > So there would be polymorphism in the decoding phase while we're supplying > > information the Similarity object needs to make its similarity judgments. > > However, that polymorphism would be handled internally -- it wouldn't

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-11 Thread Marvin Humphrey
On Mon, Mar 08, 2010 at 02:10:35PM -0500, Michael McCandless wrote: > We ask it to give us a Codec. There's a conflict between the segment-wide role of the "Codec" class and its role as specifier for posting format. In some sense, you could argue that the "codec" reads/writes the entire index se

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-11 Thread Michael McCandless
On Tue, Mar 9, 2010 at 3:58 PM, Marvin Humphrey wrote: > On Tue, Mar 09, 2010 at 01:18:12PM -0500, Michael McCandless wrote: >> >> >> You said "of course" before but... how in your proposal could one >> >> store all stats for a given field during indexing, but then sometimes >> >> use match-only a

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-09 Thread Marvin Humphrey
On Tue, Mar 09, 2010 at 01:18:12PM -0500, Michael McCandless wrote: > > >> You said "of course" before but... how in your proposal could one > >> store all stats for a given field during indexing, but then sometimes > >> use match-only and sometimes full-scoring when querying against that > >> fie

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-09 Thread Michael McCandless
On Tue, Mar 9, 2010 at 10:03 AM, Marvin Humphrey wrote: > On Tue, Mar 09, 2010 at 05:06:08AM -0500, Michael McCandless wrote: >> > For what it's worth, that's sort of the way KS used to work: >> > Schema/FieldType >> > information was stored entirely in source code. That's changed and now we >>

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-09 Thread Marvin Humphrey
On Tue, Mar 09, 2010 at 05:06:08AM -0500, Michael McCandless wrote: > > For what it's worth, that's sort of the way KS used to work: > > Schema/FieldType > > information was stored entirely in source code. That's changed and now we > > serialize the whole schema including all Analyzers, but sourc

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-09 Thread Michael McCandless
On Mon, Mar 8, 2010 at 9:47 PM, Marvin Humphrey wrote: > On Mon, Mar 08, 2010 at 01:13:53PM -0500, Michael McCandless wrote: >> I think we can actually do so w/o losing Lucene's loose typing if we >> simply peeled out [say] a FieldType class that holds the settings you >> now set on each field (om

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-08 Thread Marvin Humphrey
On Mon, Mar 08, 2010 at 01:13:53PM -0500, Michael McCandless wrote: > I think we can actually do so w/o losing Lucene's loose typing if we > simply peeled out [say] a FieldType class that holds the settings you > now set on each field (omitTFAP, omitNorms, TermVector, Store, > Index), and Field ins

RE: Baby steps towards making Lucene's scoring more flexible...

2010-03-08 Thread Steven A Rowe
On 03/08/2010 at 2:10 PM, Michael McCandless wrote: > On Mon, Mar 8, 2010 at 2:07 PM, Steven A Rowe wrote: > > On 03/08/2010 at 1:57 PM, Steven A Rowe wrote: > > > On 03/08/2010 at 1:13 PM, Michael McCandless wrote: > > > > On Sun, Mar 7, 2010 at 1:21 PM, Marvin Humphrey > > > > wrote: > > > > >

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-08 Thread Michael McCandless
On Mon, Mar 8, 2010 at 2:07 PM, Steven A Rowe wrote: > On 03/08/2010 at 1:57 PM, Steven A Rowe wrote: >> On 03/08/2010 at 1:13 PM, Michael McCandless wrote: >> > On Sun, Mar 7, 2010 at 1:21 PM, Marvin Humphrey >> > wrote: >> > > On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:

RE: Baby steps towards making Lucene's scoring more flexible...

2010-03-08 Thread Steven A Rowe
On 03/08/2010 at 1:57 PM, Steven A Rowe wrote: > On 03/08/2010 at 1:13 PM, Michael McCandless wrote: > > On Sun, Mar 7, 2010 at 1:21 PM, Marvin Humphrey > > wrote: > > > On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote: > > > > > What's the flex API for specifying a custom postin

RE: Baby steps towards making Lucene's scoring more flexible...

2010-03-08 Thread Steven A Rowe
On 03/08/2010 at 1:13 PM, Michael McCandless wrote: > On Sun, Mar 7, 2010 at 1:21 PM, Marvin Humphrey > wrote: > > On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote: > > > > What's the flex API for specifying a custom posting format? > > > > > > You implement a Codecs class, whic

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-08 Thread Michael McCandless
On Sun, Mar 7, 2010 at 1:21 PM, Marvin Humphrey wrote: > On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote: >> It won't encounter an unknown posting format. It's the codec. It >> knows all posting formats by the time it sees it. > > OK, so you're not going to handle this the way

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-07 Thread Marvin Humphrey
On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote: > It won't encounter an unknown posting format. It's the codec. It > knows all posting formats by the time it sees it. OK, so you're not going to handle this the way Lucene handles field types and accept a new codec spec refere

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-06 Thread Michael McCandless
On Fri, Mar 5, 2010 at 1:54 PM, Marvin Humphrey wrote: > On Thu, Mar 04, 2010 at 12:23:38PM -0500, Michael McCandless wrote: >> > In a multi-node search cluster, pre-calculating norms at index-time >> > wouldn't work well without additional communication between nodes to >> > gather corpus-wide st

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-05 Thread Marvin Humphrey
On Thu, Mar 04, 2010 at 12:23:38PM -0500, Michael McCandless wrote: > > In a multi-node search cluster, pre-calculating norms at index-time > > wouldn't work well without additional communication between nodes to > > gather corpus-wide stats. But I suspect the same trick that works > > for IDF in

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-04 Thread Michael McCandless
On Tue, Mar 2, 2010 at 4:12 PM, Marvin Humphrey wrote: > On Tue, Mar 02, 2010 at 05:55:44AM -0500, Michael McCandless wrote: >> The problem is, these scoring models need the avg field length (in >> tokens) across the entire index, to compute the norms. >> >> Ie, you can't do that on writing a sing

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-02 Thread Marvin Humphrey
On Tue, Mar 02, 2010 at 05:55:44AM -0500, Michael McCandless wrote: > The problem is, these scoring models need the avg field length (in > tokens) across the entire index, to compute the norms. > > Ie, you can't do that on writing a single segment. I don't see why not. We can just move everything

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-02 Thread Michael McCandless
On Sun, Feb 28, 2010 at 1:38 PM, Marvin Humphrey wrote: > On Fri, Feb 26, 2010 at 12:50:44PM -0500, Michael McCandless wrote: > >> * Store additional per-doc stats in the index, eg in a custom >> posting list, > > Inline, as in a payload? Of course that can work, but if the data > is common

Re: Baby steps towards making Lucene's scoring more flexible...

2010-02-28 Thread Marvin Humphrey
On Fri, Feb 26, 2010 at 12:50:44PM -0500, Michael McCandless wrote: > * Store additional per-doc stats in the index, eg in a custom > posting list, Inline, as in a payload? Of course that can work, but if the data is common over multiple postings, you pay in space to gain locality. KinoS

Baby steps towards making Lucene's scoring more flexible...

2010-02-26 Thread Michael McCandless
In thinking about & discussing with Robert how to allow Lucene to support other scoring models, eg lnu.ltc, BM25, etc I think a relatively contained set of changes can give us a solid step forward. Something like this: * Store additional per-doc stats in the index, eg in a custom posting
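
A point that recurs in this thread is that models such as BM25 need the average field length (in tokens) across the entire index, which is not known while a single segment is being written. A minimal sketch of one way around that, assuming each segment records only raw counts (document count and total tokens) and the average is derived at search time. The names SegmentStats, average_field_length and bm25_term_score are hypothetical and are not Lucene or Lucy APIs; the formula is the commonly used BM25 form with the usual default parameters k1=1.2 and b=0.75.

```python
import math
from dataclasses import dataclass


@dataclass
class SegmentStats:
    """Raw per-segment counts for one field (hypothetical structure)."""
    doc_count: int      # documents in the segment that have the field
    total_tokens: int   # sum of field lengths, in tokens, over those documents


def average_field_length(segments):
    """Derive the index-wide average field length from per-segment raw counts."""
    docs = sum(s.doc_count for s in segments)
    tokens = sum(s.total_tokens for s in segments)
    return tokens / docs if docs else 0.0


def bm25_term_score(tf, doc_len, avg_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one term in one document with a common BM25 formulation."""
    # Inverse document frequency; the added 1.0 keeps it positive.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization pivots on the index-wide average length.
    norm = 1.0 - b + b * (doc_len / avg_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


segments = [SegmentStats(doc_count=2, total_tokens=10),
            SegmentStats(doc_count=3, total_tokens=40)]
avg_len = average_field_length(segments)
short_doc = bm25_term_score(tf=2, doc_len=5, avg_len=avg_len, doc_freq=1, num_docs=5)
long_doc = bm25_term_score(tf=2, doc_len=20, avg_len=avg_len, doc_freq=1, num_docs=5)
```

Because only raw counts are stored, the same index could in principle be scored with a different model or different parameters without reindexing, which is the trade-off the thread debates.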