[ http://issues.apache.org/jira/browse/LUCENE-196?page=all ]
Otis Gospodnetic resolved LUCENE-196:
-------------------------------------
Resolution: Duplicate
Assign To: (was: Lucene Developers)
Thanks, Christian. I think LUCENE-545 now provides the solution for selective
field loading.
> [PATCH] Added support for segmented field data files and cached directories
> ---------------------------------------------------------------------------
>
> Key: LUCENE-196
> URL: http://issues.apache.org/jira/browse/LUCENE-196
> Project: Lucene - Java
> Type: Improvement
> Components: Index
> Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: All
> Platform: All
> Reporter: Christian Kohlschütter
> Priority: Minor
> Attachments: docStore-patch.txt, docStore-test-patch.txt,
> docStore-test-patch.txt, docStore-test-patch.txt, newDocStore-patch.txt,
> newDocStore-test-patch.txt
>
> Hello,
>
> I would like to contribute the following enhancement, hoping that it would be
> as useful for you as it is for me.
>
> For one of my applications, it was necessary to reprocess the Documents
> returned by a search in a Lucene index according to some Field values (for
> applying an "edit distance" function on unindexed fields, in my case).
>
> Because Lucene has to load every possibly relevant document (*all* fields,
> including the ones that are irrelevant to the algorithm) from disk into
> memory for this operation, doing so is extremely time-consuming.
>
> As far as I can see, there is currently no satisfactory way to improve
> this situation except buffering all data in RAM using a RAMDirectory.
>
> But what if the field data is just too big to fit in RAM?
>
> My patch handles this by splitting the monolithic "*.fdt" field data file
> into several "data store" files: .fdt, .fd1, .fd2 and so on.
>
> These "data store" files are connected as a linked-list which permits you to
> load only the part of the field data that is relevant for the current
> operation.
>
> So, you can load all field data (as in the current implementation), or the
> fields from a specific interval [0;n] of data stores. Store 0 represents the
> data in the ".fdt" file, all data stores with ids > 0 are represented by
> files ".fd1", ".fd2", and so on.
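The naming rule described above can be sketched as follows. This is only an
illustration of the store-id-to-extension mapping as the description states
it; DataStoreFiles, extensionFor, filesUpTo and the segment name "_a" are
hypothetical names that do not come from the patch or from Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the patch or of Lucene.
final class DataStoreFiles {
    private DataStoreFiles() {}

    /** Maps a data store id to its file extension: 0 -> "fdt", k > 0 -> "fd" + k. */
    static String extensionFor(int storeId) {
        if (storeId < 0) {
            throw new IllegalArgumentException("storeId must be >= 0: " + storeId);
        }
        return storeId == 0 ? "fdt" : "fd" + storeId;
    }

    /** Lists the files that cover the data store interval [0;maxStore] of one segment. */
    static List<String> filesUpTo(String segment, int maxStore) {
        List<String> names = new ArrayList<>();
        for (int id = 0; id <= maxStore; id++) {
            names.add(segment + "." + extensionFor(id));
        }
        return names;
    }
}
```

Under these assumptions, DataStoreFiles.filesUpTo("_a", 2) lists "_a.fdt",
"_a.fd1" and "_a.fd2", i.e. the files needed for the interval [0;2].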
>
> In my case, I would then simply cache the ".fdt" (data store 0) file in RAM
> (using a symbolic link to shm-/tmp), but leave all other .fd* files on
> hard disk. The .fdt file only contains the field relevant to my algorithm
> (and therefore remains quite small); all the other fields are stored in the
> rather big ".fd1" file. So, accessing Fields in .fdt requires no disk I/O,
> which speeds things up remarkably.
>
> You can compare this feature with having multiple tables in a relational
> database that are linked with 1..1 cardinality instead of having one big
> table.
>
> My proposed enhancement requires some API additions, which I will try to
> explain now.
>
> To specify the desired data store for a Field, simply call the new method
> "Field setDataStore(int)" (docstore 0 is the default):
> doc.add(Field.Keyword("fieldA", "this is in docstore 0"));
> doc.add(Field.Keyword("fieldB", "this is in docstore 1").setDataStore(1));
>
> In this example, fieldA would be stored in ".fdt"; fieldB in ".fd1".
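To make the chaining in the example above concrete, here is a minimal,
self-contained model of a field that remembers its data store. SketchField and
SketchDocument are simplified stand-ins invented for this illustration; they
are not Lucene's Field and Document classes, and the patch may implement
setDataStore differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-ins for illustration only; NOT Lucene's Field and Document.
final class SketchField {
    final String name;
    final String value;
    private int dataStore = 0; // data store 0 is the default

    SketchField(String name, String value) {
        this.name = name;
        this.value = value;
    }

    /** Sets the data store and returns this field so the call can be chained. */
    SketchField setDataStore(int id) {
        if (id < 0) {
            throw new IllegalArgumentException("data store id must be >= 0: " + id);
        }
        this.dataStore = id;
        return this;
    }

    int getDataStore() {
        return dataStore;
    }
}

final class SketchDocument {
    private final List<SketchField> fields = new ArrayList<>();

    void add(SketchField field) {
        fields.add(field);
    }

    /** Groups field names by the data store they are assigned to, ordered by store id. */
    Map<Integer, List<String>> fieldNamesByStore() {
        Map<Integer, List<String>> byStore = new TreeMap<>();
        for (SketchField f : fields) {
            byStore.computeIfAbsent(f.getDataStore(), k -> new ArrayList<>()).add(f.name);
        }
        return byStore;
    }
}
```

Returning the field from setDataStore is what allows the setter to appear
inside the doc.add(...) call; a void setter would need a temporary variable.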
>
> When you retrieve the Document object (example docId = 123) using an
> IndexReader, you have the following options:
> "indexReader.document(123)" would load all fields from all data stores.
> "indexReader.document(123, 0)" would load only the fields from data store 0.
> "indexReader.document(123, 1)" would explicitly load only the fields from data
> stores 0 and 1.
>
> The method "IndexReader.document(int n, int k)" is defined to fetch all
> fields from all data stores *at least* up to ID k. That way, existing
> IndexReader subclasses do not have to be modified, as I provide an
> overridable method in IndexReader which simply calls document(int n).
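The "at least up to ID k" contract can be sketched as below. The base class
falls back to loading everything, which still satisfies the contract, while a
subclass that knows about data stores can load less. SketchReader,
LegacyReader and SelectiveReader are hypothetical classes written for this
illustration (they are not Lucene's IndexReader), and the array
stores[s][n], holding the field value of document n in store s, is an
assumed in-memory stand-in for the on-disk files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the described contract; not Lucene's IndexReader.
abstract class SketchReader {
    /** Loads the fields of document n from all data stores. */
    abstract List<String> document(int n);

    /** Loads the fields of document n from at least stores 0..k. Default: everything. */
    List<String> document(int n, int k) {
        return document(n);
    }
}

// An existing subclass that is unaware of data stores keeps working unchanged.
final class LegacyReader extends SketchReader {
    private final String[][] stores; // stores[s][n]: value of document n in store s

    LegacyReader(String[][] stores) {
        this.stores = stores;
    }

    @Override
    List<String> document(int n) {
        List<String> fields = new ArrayList<>();
        for (String[] store : stores) {
            fields.add(store[n]);
        }
        return fields;
    }
}

// A store-aware subclass overrides document(n, k) to read only stores 0..k.
final class SelectiveReader extends SketchReader {
    private final String[][] stores;

    SelectiveReader(String[][] stores) {
        this.stores = stores;
    }

    @Override
    List<String> document(int n) {
        return document(n, stores.length - 1);
    }

    @Override
    List<String> document(int n, int k) {
        int last = Math.min(k, stores.length - 1);
        List<String> fields = new ArrayList<>();
        for (int s = 0; s <= last; s++) {
            fields.add(stores[s][n]);
        }
        return fields;
    }
}
```

With two stores, LegacyReader.document(1, 0) still returns the values from
both stores, whereas SelectiveReader.document(1, 0) returns only the value
from store 0. Callers should therefore not assume that fields from stores
above k are absent.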
>
> A more concrete example is attached to this feature request as a
> JUnit-Testcase, as well as the patch itself.
>
> Have fun with it!
>
>
> Best regards,
>
> Christian Kohlschuetter
--
This message is automatically generated by JIRA.