[ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4599:
---------------------------------

    Attachment: LUCENE-4599.patch

Initial patch. It makes term vectors work like Lucene 4.1 stored fields: one 
index file, which is loaded into memory in a compact form, and one data file 
that stores the actual term vectors (so 2 files instead of the 3 used by the 
current term vectors impl).

All core tests pass except TestIndexWriter.testEmptyDirRollback, which fails 
because it expects 3 files for term vectors.

This is still a work in progress. I still need to:
 - add tests that try to visit all branches,
 - override the default merge(MergeState) impl.

I've tested this patch against 100000 docs from the 1K Wikipedia dump, and term 
vectors were ~20% smaller (I should try a corpus with bigger docs to get more 
representative results).

If you have ideas for compressing term vectors efficiently, they are welcome! 
Currently this patch does nothing crazy and stores terms and positions 
sequentially:
{code}
term1 - positions for term1 - offsets for term1 - payloads for term1 - term2 - ...
{code}
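
To make the sequential layout concrete, here is a minimal sketch, not the patch's actual code. It assumes variable-length ints and delta-encoded positions, neither of which is confirmed above, and it leaves out offsets and payloads. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a "term, then its positions" sequential layout.
class SequentialLayoutSketch {

    // Writes a non-negative int using 7 bits per byte; the high bit marks
    // that more bytes follow (an assumed encoding, for illustration only).
    public static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Appends one term: length-prefixed UTF-8 bytes, the frequency, then
    // the positions as deltas from the previous position.
    public static void writeTerm(ByteArrayOutputStream out, String term, int[] positions) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        writeVInt(out, bytes.length);
        out.write(bytes, 0, bytes.length);
        writeVInt(out, positions.length);
        int previous = 0;
        for (int position : positions) {
            writeVInt(out, position - previous);
            previous = position;
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeTerm(out, "lucene", new int[] {3, 17, 42});
        writeTerm(out, "vectors", new int[] {4});
        System.out.println(out.size() + " bytes");
    }
}
```

In this sketch a term that occurs once still pays one byte for its frequency and at least one byte for its position, on top of the term bytes.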

Given that many terms are likely to have a frequency of 1, might it be more 
efficient to pack the positions/offsets of several terms together?
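
One way to read the packing idea, as a rough sketch rather than a proposal for the patch: gather the positions of several terms into one block and store them with a fixed number of bits per value. The names below are hypothetical, and the sketch assumes non-negative positions.

```java
// Hypothetical sketch: fixed-width bit packing of positions taken from
// several terms, instead of one variable-length int per term.
class PackedPositionsSketch {

    // Bits needed to represent every (non-negative) value; at least 1.
    public static int bitsRequired(int[] values) {
        int max = 0;
        for (int v : values) {
            max |= v;
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    }

    // Packs the values back to back, most significant bit first.
    public static byte[] pack(int[] values, int bitsPerValue) {
        byte[] packed = new byte[(values.length * bitsPerValue + 7) / 8];
        int bitIndex = 0;
        for (int value : values) {
            for (int b = bitsPerValue - 1; b >= 0; b--, bitIndex++) {
                if (((value >>> b) & 1) != 0) {
                    packed[bitIndex >>> 3] |= (byte) (0x80 >>> (bitIndex & 7));
                }
            }
        }
        return packed;
    }

    // Reads back the value stored at the given index.
    public static int unpack(byte[] packed, int bitsPerValue, int index) {
        int value = 0;
        int bitIndex = index * bitsPerValue;
        for (int b = 0; b < bitsPerValue; b++, bitIndex++) {
            int bit = (packed[bitIndex >>> 3] >>> (7 - (bitIndex & 7))) & 1;
            value = (value << 1) | bit;
        }
        return value;
    }

    public static void main(String[] args) {
        // Positions of eight terms that each occur once (made-up values).
        int[] positions = {5, 12, 0, 31, 7, 19, 2, 26};
        int bits = bitsRequired(positions);
        byte[] packed = pack(positions, bits);
        System.out.println(bits + " bits per value, " + packed.length + " bytes");
    }
}
```

With these made-up values, eight positions below 32 need 5 bits each and fit in 5 bytes, whereas one variable-length int per term costs at least 8 bytes. Whether real term vectors see a similar gain depends on the position distribution.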
                
> Compressed term vectors
> -----------------------
>
>                 Key: LUCENE-4599
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4599
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs, core/termvectors
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>
>         Attachments: LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with 
> stored fields.

