Re: Flexible index format / Payloads Cont'd

Nicolas Lalevée Sun, 08 Oct 2006 08:51:54 -0700

On Saturday 05 August 2006 09:54, Nicolas Lalevée wrote:
> On Thursday 3 August 2006 21:49, Marvin Humphrey wrote:
> > On Jul 31, 2006, at 8:25 AM, Nicolas Lalevée wrote:
> > > That looks good, but there is one restriction: it has to be per
> > > document.
> >
> > Yes, what I laid out was per-document - for each document, the fdx
> > file would keep a file pointer and an integer mapping to a codec.
> >
> > > In fact I was thinking about a more generic version that would
> > > preserve format compatibility, keeping .fdx as is:
> > >
> > > FieldData (.fdt) -->  <DocFieldData>SegSize
> > > DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount
> > >
> > > And the default FieldsDataWriter would be the current one: it would
> > > interpret the RawData as Bits, Value, with
> > > Value --> String | BinaryValue, ...
> > > Then, for my app, I would provide a custom FieldsDataWriter that
> > > does exactly what I want.
> >
> > OK, that's quite similar, but with the info specifying how to
> > deserialize the document stored in fdt rather than fdx.
>
> In fact, you're not obliged to store any "codec" information. If your
> app's data always has the same form, you just store the data and no
> codec info. For my use case, I would skip the bits about
> compressed/binary and store only what I want: a pointer to a type, a
> pointer to a lang, and the value.
> One important note about this design is that the index could only be read
> by my custom reader and written by my custom writer.
>
> > However, I
> > don't think what you're describing makes the field storage in Lucene
> > arbitrarily extensible, since you're just going to override
> > FieldsWriter/FieldsReader rather than modify them so that they can
> > use arbitrary codecs.
>
> If you override FieldsWriter/FieldsReader, then you can put the
> writing/reading code you want, so you implement an arbitrary codec.
>
> > I think what I want to do is turn Lucene into an Object-Oriented
> > Database, or at least have Lucene adopt some characteristics of an
> > ODBMS.  However, I haven't used a real ODBMS and I'm not up on the
> > theory, so I can't say for sure.  I've been doing a little reading
> > here and there on object databases, but I've been extraordinarily
> > busy the last few weeks and haven't been able to study it in depth.
> >
> > The main point is this:
> >
> > Lucene users have diverse needs for what gets stored in the document/
> > field storage.  We've been meeting those needs by assigning more and
> > more bit flags.  That can't continue ad infinitum.  However, we
> > *can* meet everyone's needs by applying a variant of the "Replace
> > Conditionals With Polymorphism" refactoring technique...
> >
> > http://xrl.us/p3kn (Link to www.eli.sdsu.edu)
> >
> > Think of those bit flags as an if-else chain.  Instead of all those
> > conditionals describing all the attributes of the Lucene Document you
> > want to store at that file pointer, we allow you to put whatever kind
> > of serialized object you desire there.  Maybe it's a Lucene
> > Document.  Maybe it's a FrenchDocument.  Maybe it's a
> > RussianDocument.  Maybe it's a wrapped-up jpg.  You choose.
> >
> > Instead of continually adding to the complexity of the
> > deserialization algorithm, we make that deserialization algorithm
> > user-definable.
>
> In fact, this is exactly my point. :-)
>
> If people think it is interesting, I can try to build a prototype.


As some of you may have noticed, I have submitted a patch for customizing
the field data storage; see LUCENE-662. I think I have finished playing
with it.
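
To make the idea from the thread concrete, here is a minimal sketch of a
user-definable field codec. It is not the API of the LUCENE-662 patch: the
names FieldCodec, TypedValue, TypedValueCodec and FieldCodecDemo are all
hypothetical, and the byte layout (two ints, then a string) is only an
assumption that mirrors the "pointer to a type, pointer to a lang, value"
use case quoted above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical interface (not part of Lucene): one implementation per
// storage layout, replacing a growing chain of bit-flag conditionals.
interface FieldCodec<T> {
    void write(DataOutputStream out, T value) throws IOException;
    T read(DataInputStream in) throws IOException;
}

// Hypothetical value type: a pointer to a type, a pointer to a lang,
// and the value itself.
final class TypedValue {
    final int typeId;
    final int langId;
    final String value;

    TypedValue(int typeId, int langId, String value) {
        this.typeId = typeId;
        this.langId = langId;
        this.value = value;
    }
}

// Writes only the data the application needs: no compressed/binary flags.
final class TypedValueCodec implements FieldCodec<TypedValue> {
    @Override
    public void write(DataOutputStream out, TypedValue v) throws IOException {
        out.writeInt(v.typeId);
        out.writeInt(v.langId);
        out.writeUTF(v.value);
    }

    @Override
    public TypedValue read(DataInputStream in) throws IOException {
        int typeId = in.readInt();
        int langId = in.readInt();
        String value = in.readUTF();
        return new TypedValue(typeId, langId, value);
    }
}

final class FieldCodecDemo {
    // Serializes one value to the raw bytes a field data file would hold.
    static <T> byte[] encode(FieldCodec<T> codec, T value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            codec.write(out, value);
        }
        return bytes.toByteArray();
    }

    // Reads the value back using the same codec.
    static <T> T decode(FieldCodec<T> codec, byte[] raw) throws IOException {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(raw))) {
            return codec.read(in);
        }
    }

    public static void main(String[] args) throws IOException {
        FieldCodec<TypedValue> codec = new TypedValueCodec();
        byte[] raw = encode(codec, new TypedValue(3, 7, "bonjour"));
        TypedValue back = decode(codec, raw);
        System.out.println(back.typeId + " " + back.langId + " " + back.value);
    }
}
```

As noted in the quoted discussion, data written this way can only be read
back by a reader that uses the same codec.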

Now I am interested in the other part of Lucene: indexing and payloads.
The critical point for the application I am developing is faceted search.
So far I have implemented it as Solr does, i.e. with filters. The problem
with such a design is that the filters have to be regenerated each time
the index changes. I am starting to think about storing precomputed
filters in the Lucene index: a new set of files, *.fil(ter), would contain
serialized bitsets, so that modifying the index would automatically update
the filters. The filters could be retrieved from the IndexReader, perhaps
loaded only on request and cached. When adding a document, only one new
parameter would be needed, specifying the filter values.
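
A rough sketch of what I have in mind, under stated assumptions: the class
PrecomputedFilters and its methods are hypothetical and not Lucene API, an
in-memory map stands in for the proposed *.fil files, and the serialized
form is simply what java.util.BitSet produces.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Lucene API) of filters kept alongside the index.
final class PrecomputedFilters {
    // Serialized bitsets, standing in for the proposed *.fil files.
    private final Map<String, byte[]> stored = new HashMap<>();
    // Deserialized filters, loaded only on request and then cached.
    private final Map<String, BitSet> cache = new HashMap<>();

    // Adding a document takes one extra parameter: its filter values.
    void addDocument(int docId, String... filterValues) {
        for (String name : filterValues) {
            BitSet bits = get(name);
            bits.set(docId);
            // Re-serializing on every add keeps the sketch short; a real
            // writer would presumably do this once, when flushing.
            stored.put(name, bits.toByteArray());
        }
    }

    // Returns the filter for a value, deserializing it on first access.
    BitSet get(String name) {
        return cache.computeIfAbsent(name, n -> {
            byte[] raw = stored.get(n);
            return raw == null ? new BitSet() : BitSet.valueOf(raw);
        });
    }

    // Returns the serialized form of a filter, or null if none is stored.
    byte[] serialized(String name) {
        return stored.get(name);
    }
}
```

Because the bitsets are updated as documents are added, nothing has to be
regenerated after the index changes. Deletions and segment merges are left
out of this sketch.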

It sounds to me like a good start for adding payload info to Lucene without
breaking the API or the index format.

WDYT ?

Nicolas
