Hi,

> 1) It might be OK to implement retrieving field values separately for a
> document. However, I think from a simplicity point of view, it might be
> better to have the application code do this drudgery. Adding this feature
> could complicate the nice and simple design of Lucene without much benefit.
Yes, it is possible to store document parts in a second index or in a
database. If only a small number of documents at a time must be loaded into
memory, there are good solutions for that with the existing storage model.

> 2) The application could separate a document into several documents, for
> example, one document mainly for indexing, the other documents for storing
> binary values for different fields. Thus, given the relevant doc id, its
> associated binary value for a particular field could be loaded very fast
> with just a disk lookup (looking up the fdx file).

But consider the problem of using the information stored in a field for
sorting purposes (dates, or any numeric attributes of documents). There is
already a special case implemented in Lucene: norms are stored as one byte
per document in a single file per segment. They are designed to be loaded
into memory at once, to completely avoid a disk lookup for every document
to be weighted.

> This way, only the relevant field is loaded into memory rather than all of
> the fields for a doc. There is no change on the Lucene side, only some more
> work for the application code.

I may have missed something, but I don't know how to implement fast custom
sorting without a disk access per document with the current Lucene
interface. Please tell me if I'm completely wrong.

Having the possibility of storing a binary field of fixed length in a
separate file, to be loaded at once into memory for fast access, would
solve the problem (a rough sketch of what I mean is at the end of this
mail). Is it worth the effort? I don't think it would make the interface
that much more complicated. Any comments?

> My view is that a search library (or, in general, any library) should be
> small and efficient; since it is used by a lot of applications, any
> additional feature could potentially impact its robustness or hurt its
> performance.
>
> Any critiques or comments are welcome.
>
> Jian
>
> On 11/15/05, Robert Kirchgessner <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > a discussion in
> >
> > http://issues.apache.org/jira/browse/LUCENE-196
> >
> > might be of interest to you.
> >
> > Did you think about storing the large pieces of documents
> > in a database to reduce the size of the Lucene index?
> >
> > I think there are good reasons to add support for
> > storing fields in separate files:
> >
> > 1. One could define a binary field of fixed length and store it
> > in a separate file. Then load it into memory and have fast
> > access to the field contents.
> >
> > A use case might be: store a calendar date (YYYY-MM-DD)
> > in three bytes, 4 bits for months, 5 bits for days and up to
> > 15 bits for years. If you want to retrieve hits sorted by date
> > you can load the fields file of size (3 * documents in index) bytes
> > and support sorting by date without accessing the hard drive
> > to read dates.
> >
> > 2. One could store document contents in a separate
> > file, and fields of small size like title and some metadata
> > in the way they are stored now. It could speed up access to
> > fields. It would be interesting to know whether you gain
> > significant performance by leaving the big chunks out, i.e.
> > not storing them in the index.
> >
> > In my opinion 1. is the most interesting case: storing some
> > binary fields (dates, prices, lengths, any numeric metrics of
> > documents) would enable *really* fast sorting of hits.
> >
> > Any thoughts about this?
> >
> > Regards,
> >
> > Robert
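Just to make Robert's three-byte date idea concrete, here is a rough sketch
of the bit packing (illustration only; the PackedDate class is made up and
nothing like it exists in Lucene today):

  // Illustration only -- not an existing Lucene class.
  // Packs a date into 24 bits: 15 bits year, 4 bits month, 5 bits day.
  public final class PackedDate {

    public static int pack(int year, int month, int day) {
      return (year << 9) | (month << 5) | day;      // year in the high bits
    }

    public static int year(int packed)  { return (packed >>> 9) & 0x7FFF; }
    public static int month(int packed) { return (packed >>> 5) & 0xF; }
    public static int day(int packed)   { return packed & 0x1F; }

    // Three bytes per document, written to / read from a per-segment file.
    public static void write(java.io.OutputStream out, int packed)
        throws java.io.IOException {
      out.write(packed >>> 16);
      out.write(packed >>> 8);
      out.write(packed);
    }

    public static int read(java.io.InputStream in)
        throws java.io.IOException {
      return (in.read() << 16) | (in.read() << 8) | in.read();
    }
  }

Because the year occupies the high bits, the packed values compare in
chronological order, and the whole column for a 10 million document index
is under 30 MB.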
> > We have a similar problem.
> >
> > On Tuesday, 15 November 2005 at 23:23, Karel Tejnora wrote:
> > > Hi all,
> > > in our testing application we are using Lucene 1.4.3. Thank you guys
> > > for that great job.
> > > We have an index file around 12 GiB, one file (merged). Retrieving
> > > hits takes a nicely small amount of time, but reading the stored
> > > fields takes 10-100 times longer. I think that is because all the
> > > fields are read.
> > > I would like to try implementing the Lucene index files as tables in a
> > > database, with some lazy loading of fields. Searching the web, I have
> > > found only an implementation of store.Directory (bdb), but it only
> > > holds data as binary streams. This technique will not be so helpful
> > > because BLOB operations do not perform fast. On the other hand I will
> > > lose some freedom in document field variability, but I can omit a lot
> > > of the skipping and many open files. Also, IndexWriter could have
> > > document/term locking granularity.
> > > So I think that way leads to extending IndexWriter / IndexReader and
> > > having my own implementation of the index.Segment* classes. Is that
> > > the best way, or am I missing something about how to achieve this?
> > > If it is a bad idea, I will be happy to hear about other possibilities.
> > >
> > > I would also like to join development of Lucene. Are there some
> > > pointers on how to start?
> > >
> > > Thanks for reading this,
> > > and sorry if I made some mistakes.
> > >
> > > Karel
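P.S. Here is the rough sketch I referred to above. It is only meant to
illustrate the idea of a fixed-width field file loaded once per segment;
the class, file name and layout are invented, and this is not an existing
Lucene API. It assumes one 3-byte value per document number, written in
document order, e.g. by the PackedDate helper sketched earlier:

  import java.io.DataInputStream;
  import java.io.FileInputStream;
  import java.io.IOException;

  // Sketch only: one fixed-width (3-byte) value per document, read once
  // into memory so that comparing two hits while sorting never touches
  // the disk.
  public class InMemoryDateColumn {

    private final int[] packed;   // packed date per document, indexed by doc id

    public InMemoryDateColumn(String fieldFile, int numDocs) throws IOException {
      packed = new int[numDocs];
      DataInputStream in = new DataInputStream(new FileInputStream(fieldFile));
      try {
        for (int doc = 0; doc < numDocs; doc++) {
          int b0 = in.readUnsignedByte();
          int b1 = in.readUnsignedByte();
          int b2 = in.readUnsignedByte();
          packed[doc] = (b0 << 16) | (b1 << 8) | b2;
        }
      } finally {
        in.close();
      }
    }

    // Usable from a custom sort comparator; values fit in 24 bits,
    // so the subtraction cannot overflow.
    public int compare(int docA, int docB) {
      return packed[docA] - packed[docB];
    }
  }

Loading this costs one sequential read of 3 * numDocs bytes per segment,
much like the norms file, and after that sorting hits by date is a pure
in-memory operation.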