Hi,

> 1) It might be OK to implement retrieving field values separately for a
> document. However, I think from a simplicity point of view, it might be
> better to have the application code do this drudgery. Adding this feature
> could complicate the nice and simple design of Lucene without much benefit.
Yes, it is possible to store document parts in a second index or in a
database. If only a small number of documents at a time must be loaded into
memory, there are good solutions for that with the existing storage model.

> 2) The application could separate a document into several documents, for
> example, one document mainly for indexing, the other documents for storing
> binary values for different fields. Thus, given the relevant doc id, its
> associated binary value for a particular field could be loaded very fast
> with just a disk lookup (looking up the fdx file).

But consider the problem of using the information stored in a field for
sorting purposes (dates, or any numeric attributes of documents). There is
already a special case implemented in Lucene: norms are stored as one byte
per document in a single file per segment. They are designed to be loaded
into memory at once, to completely avoid a disk lookup for every document
to be weighted.

> This way, only the relevant field is loaded into memory rather than all of
> the fields for a doc. There is no change on the Lucene side, only some more
> work for the application code.

I may have missed something, but I don't know how to implement fast custom
sorting without a disk access per document with the current Lucene
interface. Please tell me if I'm completely wrong.

Having the possibility of storing a binary field of fixed length in a
separate file, to be loaded at once into memory for fast access, would
solve the problem (a rough sketch of what I mean is at the end of this
mail). Is it worth the effort? I don't think it would make the interface
that much more complicated. Any comments?

> My view is that a search library (or, in general, any library) should be
> small and efficient; since it is used by a lot of applications, any
> additional feature could potentially impact its robustness or hurt its
> performance.
>
> Any critiques or comments are welcome.
>
> Jian
>
> On 11/15/05, Robert Kirchgessner <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > a discussion in
> >
> > http://issues.apache.org/jira/browse/LUCENE-196
> >
> > might be of interest to you.
> >
> > Did you think about storing the large pieces of documents
> > in a database to reduce the size of the Lucene index?
> >
> > I think there are good reasons to add support for
> > storing fields in separate files:
> >
> > 1. One could define a binary field of fixed length and store it
> > in a separate file. Then load it into memory and have fast
> > access to the field contents.
> >
> > A use case might be: store a calendar date (YYYY-MM-DD)
> > in three bytes, 4 bits for months, 5 bits for days and up to
> > 15 bits for years. If you want to retrieve hits sorted by date
> > you can load the fields file of size (3 * documents in index) bytes
> > and support sorting by date without accessing the hard drive
> > to read dates.
> >
> > 2. One could store document contents in a separate
> > file, and fields of small size like title and some metadata
> > in the way they are stored now. It could speed up access to
> > fields. It would be interesting to know whether you gain
> > significant performance by leaving the big chunks out, i.e.
> > not storing them in the index.
> >
> > In my opinion 1. is the most interesting case: storing some
> > binary fields (dates, prices, lengths, any numeric metrics of
> > documents) would enable *really* fast sorting of hits.
> >
> > Any thoughts about this?
> >
> > Regards,
> >
> > Robert
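Just to make Robert's three-byte date idea concrete, here is a rough sketch
of the bit packing (illustration only; the PackedDate class is made up and
nothing like it exists in Lucene today):

  // Illustration only -- not an existing Lucene class.
  // Packs a date into 24 bits: 15 bits year, 4 bits month, 5 bits day.
  public final class PackedDate {

    public static int pack(int year, int month, int day) {
      return (year << 9) | (month << 5) | day;      // year in the high bits
    }

    public static int year(int packed)  { return (packed >>> 9) & 0x7FFF; }
    public static int month(int packed) { return (packed >>> 5) & 0xF; }
    public static int day(int packed)   { return packed & 0x1F; }

    // Three bytes per document, written to / read from a per-segment file.
    public static void write(java.io.OutputStream out, int packed)
        throws java.io.IOException {
      out.write(packed >>> 16);
      out.write(packed >>> 8);
      out.write(packed);
    }

    public static int read(java.io.InputStream in)
        throws java.io.IOException {
      return (in.read() << 16) | (in.read() << 8) | in.read();
    }
  }

Because the year occupies the high bits, the packed values compare in
chronological order, and the whole column for a 10 million document index
is under 30 MB.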
> > We have a similar problem.
> >
> > On Tuesday, 15 November 2005 at 23:23, Karel Tejnora wrote:
> > > Hi all,
> > > in our testing application we are using Lucene 1.4.3. Thank you guys
> > > for that great job.
> > > We have an index file around 12 GiB, one file (merged). Retrieving
> > > hits takes a nicely small amount of time, but reading the stored
> > > fields takes 10-100 times longer. I think that is because all the
> > > fields are read.
> > > I would like to try implementing the Lucene index files as tables in a
> > > database, with some lazy loading of fields. Searching the web, I have
> > > found only an implementation of store.Directory (bdb), but it only
> > > holds data as binary streams. This technique will not be so helpful
> > > because BLOB operations do not perform fast. On the other hand I will
> > > lose some freedom in document field variability, but I can omit a lot
> > > of the skipping and many open files. Also, IndexWriter could have
> > > document/term locking granularity.
> > > So I think that way leads to extending IndexWriter / IndexReader and
> > > having my own implementation of the index.Segment* classes. Is that
> > > the best way, or am I missing something about how to achieve this?
> > > If it is a bad idea, I will be happy to hear about other possibilities.
> > >
> > > I would also like to join development of Lucene. Are there some
> > > pointers on how to start?
> > >
> > > Thanks for reading this,
> > > and sorry if I made some mistakes.
> > >
> > > Karel
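P.S. Here is the rough sketch I referred to above. It is only meant to
illustrate the idea of a fixed-width field file loaded once per segment;
the class, file name and layout are invented, and this is not an existing
Lucene API. It assumes one 3-byte value per document number, written in
document order, e.g. by the PackedDate helper sketched earlier:

  import java.io.DataInputStream;
  import java.io.FileInputStream;
  import java.io.IOException;

  // Sketch only: one fixed-width (3-byte) value per document, read once
  // into memory so that comparing two hits while sorting never touches
  // the disk.
  public class InMemoryDateColumn {

    private final int[] packed;   // packed date per document, indexed by doc id

    public InMemoryDateColumn(String fieldFile, int numDocs) throws IOException {
      packed = new int[numDocs];
      DataInputStream in = new DataInputStream(new FileInputStream(fieldFile));
      try {
        for (int doc = 0; doc < numDocs; doc++) {
          int b0 = in.readUnsignedByte();
          int b1 = in.readUnsignedByte();
          int b2 = in.readUnsignedByte();
          packed[doc] = (b0 << 16) | (b1 << 8) | b2;
        }
      } finally {
        in.close();
      }
    }

    // Usable from a custom sort comparator; values fit in 24 bits,
    // so the subtraction cannot overflow.
    public int compare(int docA, int docB) {
      return packed[docA] - packed[docB];
    }
  }

Loading this costs one sequential read of 3 * numDocs bytes per segment,
much like the norms file, and after that sorting hits by date is a pure
in-memory operation.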