[ https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-4599:
---------------------------------
Attachment: LUCENE-4599.patch
Initial patch. It makes term vectors behave like Lucene 4.1 stored fields: one
index file which is loaded into memory in a memory-efficient way and one data
file that stores the actual term vectors (so 2 files instead of 3 with the
current term vectors impl).
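To illustrate the two-file idea, here is a minimal sketch of an in-memory index that maps a doc ID to the offset of its chunk in the data file. It is not code from the patch: the class name `ChunkIndexSketch`, the plain `int[]`/`long[]` arrays, and the example offsets are all assumptions for illustration (a real implementation would likely pack these values more compactly).

```java
import java.util.Arrays;

/** Hypothetical in-memory index: one entry per chunk of the data file. */
class ChunkIndexSketch {
    private final int[] docBases;       // first doc ID of each chunk, ascending
    private final long[] startPointers; // offset of each chunk in the data file

    ChunkIndexSketch(int[] docBases, long[] startPointers) {
        if (docBases.length != startPointers.length) {
            throw new IllegalArgumentException("one start pointer per chunk expected");
        }
        this.docBases = docBases.clone();
        this.startPointers = startPointers.clone();
    }

    /** Returns the data-file offset of the chunk that contains docID. */
    long startPointer(int docID) {
        int idx = Arrays.binarySearch(docBases, docID);
        if (idx < 0) {
            // Not an exact chunk start: take the chunk just before the insertion point.
            idx = -idx - 2;
        }
        if (idx < 0) {
            throw new IllegalArgumentException("docID precedes the first chunk: " + docID);
        }
        return startPointers[idx];
    }

    public static void main(String[] args) {
        // Made-up values: three chunks starting at docs 0, 128 and 300.
        ChunkIndexSketch index = new ChunkIndexSketch(
            new int[] {0, 128, 300}, new long[] {0L, 4096L, 9000L});
        System.out.println(index.startPointer(200));
    }
}
```

Only the small per-chunk arrays stay in memory; the term vectors themselves are read from the data file at the returned offset.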
All core tests pass except TestIndexWriter.testEmptyDirRollback, which fails
because it expects 3 files for term vectors.
This is only a work in progress. I still need to:
- add tests that try to visit all branches,
- override the default merge(MergeState) implementation.
I've tested this patch against 100000 docs from the 1K Wikipedia dump, and term
vectors were ~20% smaller (I should try against a corpus with bigger docs to
get more relevant results).
If you have ideas for compressing term vectors efficiently, they are welcome!
Currently this patch does nothing crazy and stores terms and positions
sequentially:
{code}
term1 - positions for term1 - offsets for term1 - payloads for term1 - term2 - ...
{code}
Given that many terms are likely to have a frequency of 1, it might be more
efficient to pack the positions/offsets of several terms all together (?)
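The difference between the two layouts can be sketched as follows. This is only an illustration of the idea, not the patch's format: `TermVectorLayoutSketch`, `TermEntry` and the helper methods are hypothetical names, and offsets and payloads are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical comparison of an interleaved and a grouped term vector layout. */
class TermVectorLayoutSketch {

    /** One term of a single document's term vector (hypothetical type). */
    static final class TermEntry {
        final String term;
        final int[] positions; // length equals the term frequency

        TermEntry(String term, int... positions) {
            this.term = term;
            this.positions = positions;
        }
    }

    /** Interleaved layout: term1, freq1, positions of term1, term2, freq2, ... */
    static List<Object> interleaved(List<TermEntry> entries) {
        List<Object> out = new ArrayList<>();
        for (TermEntry e : entries) {
            out.add(e.term);
            out.add(e.positions.length);
            for (int p : e.positions) {
                out.add(p);
            }
        }
        return out;
    }

    /** Grouped layout: the positions of all terms concatenated into one stream. */
    static int[] groupedPositions(List<TermEntry> entries) {
        int total = 0;
        for (TermEntry e : entries) {
            total += e.positions.length;
        }
        int[] all = new int[total];
        int i = 0;
        for (TermEntry e : entries) {
            for (int p : e.positions) {
                all[i++] = p;
            }
        }
        return all;
    }

    /** Bits per value needed to store non-negative ints with a fixed width. */
    static int bitsRequired(int[] values) {
        int max = 0;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    }

    public static void main(String[] args) {
        List<TermEntry> doc = Arrays.asList(
            new TermEntry("compressed", 0),
            new TermEntry("term", 1, 4),
            new TermEntry("vectors", 2));
        int[] grouped = groupedPositions(doc);
        System.out.println("interleaved: " + interleaved(doc));
        System.out.println("grouped positions: " + Arrays.toString(grouped)
            + ", bits per value: " + bitsRequired(grouped));
    }
}
```

With the grouped layout, the positions of many low-frequency terms form a single stream of small integers that can be written with one fixed bit width, instead of many tiny per-term blocks.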
> Compressed term vectors
> -----------------------
>
> Key: LUCENE-4599
> URL: https://issues.apache.org/jira/browse/LUCENE-4599
> Project: Lucene - Core
> Issue Type: Task
> Components: core/codecs, core/termvectors
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: 4.1
>
> Attachments: LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with
> stored fields.