I think that's a good idea for Lucy.
Mike
On Fri, Mar 26, 2010 at 10:58 AM, Marvin Humphrey
wrote:
> On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
>> > Maybe aggressive automatic data-reduction makes more sense in the context
>> > of "flexible matching", which is more
On Thu, Mar 25, 2010 at 1:20 PM, Marvin Humphrey wrote:
> On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
>> >> Also, will Lucy store the original stats?
>> >
>> > These?
>> >
>> > * Total number of tokens in the field.
>> > * Number of unique terms in the field.
>> > * D
On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
> > Maybe aggressive automatic data-reduction makes more sense in the context of
> > "flexible matching", which is more expansive than "flexible scoring"?
>
> I think so. Maybe it shouldn't be called a Similarity (which to me
> (
On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
> >> Also, will Lucy store the original stats?
> >
> > These?
> >
> > * Total number of tokens in the field.
> > * Number of unique terms in the field.
> > * Doc boost.
> > * Field boost.
>
> Also sum(tf). Robert can gene
On Mon, Mar 22, 2010 at 12:45 PM, Marvin Humphrey
wrote:
> On Thu, Mar 18, 2010 at 05:16:23AM -0500, Michael McCandless wrote:
>> Also, will Lucy store the original stats?
>
> These?
>
> * Total number of tokens in the field.
> * Number of unique terms in the field.
> * Doc boost.
> * Fiel
On Thu, Mar 18, 2010 at 05:16:23AM -0500, Michael McCandless wrote:
> Also, will Lucy store the original stats?
These?
* Total number of tokens in the field.
* Number of unique terms in the field.
* Doc boost.
* Field boost.
That would depend on which Similarity the user specs for
On Mon, Mar 15, 2010 at 7:49 PM, Marvin Humphrey wrote:
> On Mon, Mar 15, 2010 at 05:28:33AM -0500, Michael McCandless wrote:
>> I mean specifically one should not have to commit to the precise
>> scoring model they will use for a given field, when they index that
>> field.
>
> Yeah, I've never se
On Mon, Mar 15, 2010 at 05:28:33AM -0500, Michael McCandless wrote:
> I mean specifically one should not have to commit to the precise
> scoring model they will use for a given field, when they index that
> field.
Yeah, I've never seen committing to a precise scoring model at index-time via
Sim ch
>>> But I don't like baking in search concepts at index time...
>>
> Many scoring models are possible if you store enough stats in the
> index.
>
in general the missing stats seem to fit in two buckets/categories:
1) length normalization pivot: average length in bytes, terms, unique terms
2) term
On Mon, Mar 15, 2010 at 12:03 AM, Marvin Humphrey
wrote:
> On Sat, Mar 13, 2010 at 06:41:26AM -0500, Michael McCandless wrote:
>
>> I still don't think similarity should have any bearing during indexing.
>
> Similarity has always, from day one, affected the contents of the index. This
> idea that
On Sat, Mar 13, 2010 at 06:41:26AM -0500, Michael McCandless wrote:
> I still don't think similarity should have any bearing during indexing.
Similarity has always, from day one, affected the contents of the index. This
idea that it should be totally divorced from indexing is, in fact, a very
si
On Fri, Mar 12, 2010 at 8:31 PM, Marvin Humphrey wrote:
> On Thu, Mar 11, 2010 at 05:59:03AM -0500, Michael McCandless wrote:
>> > So there would be polymorphism in the decoding phase while we're supplying
>> > information the Similarity object needs to make its similarity judgments.
>> > However,
On Thu, Mar 11, 2010 at 12:35 PM, Marvin Humphrey
wrote:
> On Mon, Mar 08, 2010 at 02:10:35PM -0500, Michael McCandless wrote:
>
>> We ask it to give us a Codec.
>
> There's a conflict between the segment-wide role of the "Codec" class and its
> role as specifier for posting format.
>
> In some se
On Thu, Mar 11, 2010 at 05:59:03AM -0500, Michael McCandless wrote:
> > So there would be polymorphism in the decoding phase while we're supplying
> > information the Similarity object needs to make its similarity judgments.
> > However, that polymorphism would be handled internally -- it wouldn't
On Mon, Mar 08, 2010 at 02:10:35PM -0500, Michael McCandless wrote:
> We ask it to give us a Codec.
There's a conflict between the segment-wide role of the "Codec" class and its
role as specifier for posting format.
In some sense, you could argue that the "codec" reads/writes the entire index
se
On Tue, Mar 9, 2010 at 3:58 PM, Marvin Humphrey wrote:
> On Tue, Mar 09, 2010 at 01:18:12PM -0500, Michael McCandless wrote:
>>
>> >> You said "of course" before but... how in your proposal could one
>> >> store all stats for a given field during indexing, but then sometimes
>> >> use match-only a
On Tue, Mar 09, 2010 at 01:18:12PM -0500, Michael McCandless wrote:
>
> >> You said "of course" before but... how in your proposal could one
> >> store all stats for a given field during indexing, but then sometimes
> >> use match-only and sometimes full-scoring when querying against that
> >> fie
On Tue, Mar 9, 2010 at 10:03 AM, Marvin Humphrey wrote:
> On Tue, Mar 09, 2010 at 05:06:08AM -0500, Michael McCandless wrote:
>> > For what it's worth, that's sort of the way KS used to work:
>> > Schema/FieldType information was stored entirely in source code. That's
>> > changed and now we
>>
On Tue, Mar 09, 2010 at 05:06:08AM -0500, Michael McCandless wrote:
> > For what it's worth, that's sort of the way KS used to work:
> > Schema/FieldType information was stored entirely in source code. That's
> > changed and now we serialize the whole schema including all Analyzers, but sourc
On Mon, Mar 8, 2010 at 9:47 PM, Marvin Humphrey wrote:
> On Mon, Mar 08, 2010 at 01:13:53PM -0500, Michael McCandless wrote:
>> I think we can actually do so w/o losing Lucene's loose typing if we
>> simply peeled out [say] a FieldType class that holds the settings you
>> now set on each field (om
On Mon, Mar 08, 2010 at 01:13:53PM -0500, Michael McCandless wrote:
> I think we can actually do so w/o losing Lucene's loose typing if we
> simply peeled out [say] a FieldType class that holds the settings you
> now set on each field (omitTFAP, omitNorms, TermVector, Store,
> Index), and Field ins
On 03/08/2010 at 2:10 PM, Michael McCandless wrote:
> On Mon, Mar 8, 2010 at 2:07 PM, Steven A Rowe wrote:
> > On 03/08/2010 at 1:57 PM, Steven A Rowe wrote:
> > > On 03/08/2010 at 1:13 PM, Michael McCandless wrote:
> > > > On Sun, Mar 7, 2010 at 1:21 PM, Marvin Humphrey
> > > > wrote:
> > > > >
On Mon, Mar 8, 2010 at 2:07 PM, Steven A Rowe wrote:
> On 03/08/2010 at 1:57 PM, Steven A Rowe wrote:
>> On 03/08/2010 at 1:13 PM, Michael McCandless wrote:
>> > On Sun, Mar 7, 2010 at 1:21 PM, Marvin Humphrey
>> > wrote:
>> > > On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:
On 03/08/2010 at 1:57 PM, Steven A Rowe wrote:
> On 03/08/2010 at 1:13 PM, Michael McCandless wrote:
> > On Sun, Mar 7, 2010 at 1:21 PM, Marvin Humphrey
> > wrote:
> > > On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:
> > > > > What's the flex API for specifying a custom postin
On 03/08/2010 at 1:13 PM, Michael McCandless wrote:
> On Sun, Mar 7, 2010 at 1:21 PM, Marvin Humphrey
> wrote:
> > On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:
> > > > What's the flex API for specifying a custom posting format?
> > >
> > > You implement a Codecs class, whic
On Sun, Mar 7, 2010 at 1:21 PM, Marvin Humphrey wrote:
> On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:
>> It won't encounter an unknown posting format. It's the codec. It
>> knows all posting formats by the time it sees it.
>
> OK, so you're not going to handle this the way
On Sat, Mar 06, 2010 at 05:07:18AM -0500, Michael McCandless wrote:
> It won't encounter an unknown posting format. It's the codec. It
> knows all posting formats by the time it sees it.
OK, so you're not going to handle this the way Lucene handles field types and
accept a new codec spec refere
On Fri, Mar 5, 2010 at 1:54 PM, Marvin Humphrey wrote:
> On Thu, Mar 04, 2010 at 12:23:38PM -0500, Michael McCandless wrote:
>> > In a multi-node search cluster, pre-calculating norms at index-time
>> > wouldn't work well without additional communication between nodes to
>> > gather corpus-wide st
On Thu, Mar 04, 2010 at 12:23:38PM -0500, Michael McCandless wrote:
> > In a multi-node search cluster, pre-calculating norms at index-time
> > wouldn't work well without additional communication between nodes to
> > gather corpus-wide stats. But I suspect the same trick that works
> > for IDF in
On Tue, Mar 2, 2010 at 4:12 PM, Marvin Humphrey wrote:
> On Tue, Mar 02, 2010 at 05:55:44AM -0500, Michael McCandless wrote:
>> The problem is, these scoring models need the avg field length (in
>> tokens) across the entire index, to compute the norms.
>>
>> Ie, you can't do that on writing a sing
On Tue, Mar 02, 2010 at 05:55:44AM -0500, Michael McCandless wrote:
> The problem is, these scoring models need the avg field length (in
> tokens) across the entire index, to compute the norms.
>
> Ie, you can't do that on writing a single segment.
I don't see why not. We can just move everything
On Sun, Feb 28, 2010 at 1:38 PM, Marvin Humphrey wrote:
> On Fri, Feb 26, 2010 at 12:50:44PM -0500, Michael McCandless wrote:
>
>> * Store additional per-doc stats in the index, eg in a custom
>> posting list,
>
> Inline, as in a payload? Of course that can work, but if the data
> is common
On Fri, Feb 26, 2010 at 12:50:44PM -0500, Michael McCandless wrote:
> * Store additional per-doc stats in the index, eg in a custom
> posting list,
Inline, as in a payload? Of course that can work, but if the data is common
over multiple postings, you pay in space to gain locality. KinoS
In thinking about & discussing with Robert how to allow Lucene to
support other scoring models, eg lnu.ltc, BM25, etc I think a
relatively contained set of changes can give us a solid step forward.
Something like this:
* Store additional per-doc stats in the index, eg in a custom
posting