[jira] [Commented] (LUCENE-4599) Compressed term vectors

Michael McCandless (JIRA) Sat, 08 Dec 2012 08:09:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527180#comment-13527180
 ]


Michael McCandless commented on LUCENE-4599:
--------------------------------------------

bq. Does it make sense to put this in an FST where the key is the term bytes 
and the value is what you're doing now for the positions, offsets, and payloads 
in a byte array? 

That's a neat idea :)  We should [almost] just be able to use 
MemoryPostingsFormat, since it already stores all postings in an FST.

bq. I think an FST would not compress as much as what LZ4 or Deflate can do? But 
maybe it could speed up TermsEnum.seekCeil on large documents so it might be an 
interesting idea regarding random access speed?

Likely it would not compress as well, since LZ4/Deflate are able to share 
common infix fragments too, but FST only shares prefix/suffix.  It'd be 
interesting to test ... but we should explore this (FST-backed 
TermVectorsFormat) in a new issue I think ... this issue seems awesome enough 
already :)
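
To make the point about infix sharing concrete, here is a minimal sketch using only the JDK's java.util.zip.Deflater. The class name, the helper names, and the synthetic terms are assumptions made for illustration; nothing here comes from the patch or from Lucene's codec APIs, and it does not measure an FST at all. It only shows that Deflate shrinks a block of terms whose common fragment sits in the middle of each term.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical illustration, not Lucene code.
final class InfixCompressionSketch {

    /** Returns the Deflate-compressed size of the input, in bytes. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[256];
        int total = 0;
        // Drain the compressor; we only count bytes, we do not keep them.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Builds newline-separated terms with a varying first char and numeric tail but a common infix. */
    static byte[] termsWithSharedInfix(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append((char) ('a' + i % 26)).append("compress").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = termsWithSharedInfix(200);
        System.out.println("raw=" + raw.length + " deflated=" + deflatedSize(raw));
    }
}
```

A real comparison would have to build an FST over the same terms and compare sizes, which is the experiment suggested above for a separate issue.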

bq. Or... can we simply reference the terms by ord (an int) instead of writing 
each term bytes?

Using ords matching the main terms dict is a neat idea too!  It would be much 
more compact ... but, when reading the term vectors, we'd need to resolve-by-ord 
against the main terms dictionary (not all postings formats support that: it's 
optional, and e.g. our default postings format doesn't), which would likely be 
slower than today.
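
The trade-off can be sketched in plain Java. Everything below is hypothetical: the class TermOrdSketch and its sorted String array merely stand in for a segment's main terms dictionary, and no Lucene API is used. Writing an ord is compact, but every read needs a lookup back into the dictionary.

```java
import java.util.Arrays;

// Hypothetical illustration, not Lucene code.
final class TermOrdSketch {

    /** Sorted, de-duplicated terms standing in for the main terms dictionary. */
    private final String[] sortedTerms;

    TermOrdSketch(String... terms) {
        this.sortedTerms = Arrays.stream(terms).distinct().sorted().toArray(String[]::new);
    }

    /** Write side: a term is replaced by its position in sorted order (its ord). */
    int ordOf(String term) {
        int ord = Arrays.binarySearch(sortedTerms, term);
        if (ord < 0) {
            throw new IllegalArgumentException("unknown term: " + term);
        }
        return ord;
    }

    /** Read side: the ord has to be resolved against the dictionary again. */
    String termOf(int ord) {
        return sortedTerms[ord];
    }

    /** Encodes one document's terms as ords instead of term bytes. */
    int[] encode(String... docTerms) {
        int[] ords = new int[docTerms.length];
        for (int i = 0; i < docTerms.length; i++) {
            ords[i] = ordOf(docTerms[i]);
        }
        return ords;
    }

    public static void main(String[] args) {
        TermOrdSketch dict = new TermOrdSketch("lucene", "vectors", "compressed", "term");
        int[] ords = dict.encode("term", "vectors");
        System.out.println(Arrays.toString(ords) + " -> " + dict.termOf(ords[0]));
    }
}
```

In this toy version the lookup is a cheap array access; in a real index the cost depends on whether and how the postings format supports seeking by ord.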

bq. Is that information available somewhere when writing/merging term vectors?

Unfortunately, no.  We only assign ords when it's time to flush the segment ... 
but we write term vectors "live" as we index each document.  If we changed 
that, e.g. by buffering up term vectors until flush, then we could get the 
ords when we wrote them.
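
A minimal sketch of that buffering idea, in plain Java with hypothetical names (BufferedTermVectorsSketch, addDocument and flush are not Lucene APIs): documents are held in memory, and only at flush time, once every term in the segment is known, are ords assigned and the buffered vectors rewritten as ords.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration, not Lucene code.
final class BufferedTermVectorsSketch {

    private final List<String[]> bufferedDocs = new ArrayList<>();
    private final TreeSet<String> segmentTerms = new TreeSet<>();

    /** Indexing time: keep the document's terms instead of writing them "live". */
    void addDocument(String... terms) {
        bufferedDocs.add(terms.clone());
        for (String term : terms) {
            segmentTerms.add(term);
        }
    }

    /** Flush time: ords are now stable, so each buffered document can be written as ords. */
    List<int[]> flush() {
        Map<String, Integer> ordByTerm = new HashMap<>();
        for (String term : segmentTerms) { // TreeSet iterates in sorted order
            ordByTerm.put(term, ordByTerm.size());
        }
        List<int[]> written = new ArrayList<>();
        for (String[] doc : bufferedDocs) {
            int[] ords = new int[doc.length];
            for (int i = 0; i < doc.length; i++) {
                ords[i] = ordByTerm.get(doc[i]);
            }
            written.add(ords);
        }
        bufferedDocs.clear();
        segmentTerms.clear();
        return written;
    }

    public static void main(String[] args) {
        BufferedTermVectorsSketch writer = new BufferedTermVectorsSketch();
        writer.addDocument("b", "c");
        writer.addDocument("a"); // a later document shifts the ords of "b" and "c"
        for (int[] ords : writer.flush()) {
            System.out.println(java.util.Arrays.toString(ords));
        }
    }
}
```

The second document adds a term that sorts before the first document's terms, so the first document's ords (1 and 2) could not have been known when it was indexed. The cost of this approach is the memory held by the buffer until flush.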
                
> Compressed term vectors
> -----------------------
>
>                 Key: LUCENE-4599
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4599
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/codecs, core/termvectors
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 4.1
>
>         Attachments: LUCENE-4599.patch
>
>
> We should have codec-compressed term vectors similarly to what we have with 
> stored fields.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
