[jira] [Updated] (LUCENE-7304) Doc values based block join implementation

Martijn van Groningen (JIRA) Wed, 24 May 2017 06:23:21 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-7304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Martijn van Groningen updated LUCENE-7304:
------------------------------------------
    Attachment: LUCENE-7304.patch

It has been a while, but I had some time to get back to this. I updated the 
patch to account for all the changes that have happened in master so far 
(iterator-based doc values API, two-phase query execution and score supplier).

I ran the same performance test as before. Thanks to doc values compression, 
the offset field now takes 337387 bytes instead of the previous 839592 bytes, 
which is good!

I'm still thinking about other ways of encoding the block of documents. Right 
now the parent document has a doc values field with the offset to its first 
child docid. Instead, each child document could have a doc values field with 
the offset to its parent docid. That way the parent doc can be indexed before 
the child docs.
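
To make the parent-first idea concrete, here is a rough sketch in plain Java. 
It is not taken from the patch and uses no Lucene classes: the class name 
ParentFirstBlockWriter, the list standing in for the doc values field, and the 
choice of storing 0 for a parent are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a per-document "offset to parent" field. */
final class ParentFirstBlockWriter {
    // One entry per docid: 0 for a parent (an assumption of this sketch),
    // otherwise the number of docids back to the parent.
    private final List<Integer> offsetToParent = new ArrayList<>();
    private int currentParent = -1;

    /** Starts a new block. The parent is written before any of its children. */
    int addParent() {
        currentParent = offsetToParent.size();
        offsetToParent.add(0);
        return currentParent;
    }

    /** Appends a child to the current block; its offset is known immediately. */
    int addChild() {
        if (currentParent < 0) {
            throw new IllegalStateException("add a parent first");
        }
        int doc = offsetToParent.size();
        offsetToParent.add(doc - currentParent);
        return doc;
    }

    /** Resolves the parent docid of any doc; a parent resolves to itself. */
    int parentOf(int doc) {
        return doc - offsetToParent.get(doc);
    }

    public static void main(String[] args) {
        ParentFirstBlockWriter w = new ParentFirstBlockWriter();
        w.addParent(); // doc 0
        w.addChild();  // doc 1
        w.addChild();  // doc 2
        w.addParent(); // doc 3
        w.addChild();  // doc 4
        System.out.println(w.parentOf(2));
        System.out.println(w.parentOf(4));
    }
}
```

In this sketch the parent never needs to know how many children follow it, 
so the block can be written in a single forward pass.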



> Doc values based block join implementation
> ------------------------------------------
>
>                 Key: LUCENE-7304
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7304
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Martijn van Groningen
>            Priority: Minor
>         Attachments: LUCENE-5092-20140313.patch, LUCENE-7304-20160531.patch, 
> LUCENE-7304-20160606.patch, LUCENE_7304.patch, LUCENE_7304.patch, 
> LUCENE-7304.patch
>
>
> At query time the block join relies on a bitset to find the previous 
> parent doc while advancing the doc id iterator. On large indices these 
> bitsets can consume large amounts of JVM heap space. Also, because of the 
> way these bitsets are typically set, the 'FixedBitSet' implementation is used.
> The idea I had was to replace the bitset usage with a numeric doc values field 
> that stores offsets. Each child doc stores how many docids it is from its 
> parent doc and each parent stores how many docids it is from its first 
> child. At query time this information can be used to perform the block join.
> I think another benefit of this approach is that external tools can now 
> easily determine whether a doc is part of a block of documents. Perhaps this 
> also helps index time sorting?
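
The offset scheme described in the issue can be sketched without any Lucene 
classes. This is only an illustration, not code from the patch: the class name 
BlockOffsets, the two arrays standing in for the doc values fields, and the -1 
sentinel are assumptions of the example. It assumes the block layout where 
children are indexed immediately before their parent.

```java
import java.util.Arrays;

/** Hypothetical in-memory stand-in for the two per-document offset fields. */
final class BlockOffsets {
    static final int NONE = -1;

    // For a child doc: docids forward to its parent. NONE for parent docs.
    final int[] toParent;
    // For a parent doc: docids back to its first child. NONE for child docs.
    final int[] toFirstChild;

    /** Builds consecutive blocks, each with the given number of children. */
    BlockOffsets(int... childrenPerBlock) {
        int maxDoc = 0;
        for (int n : childrenPerBlock) {
            maxDoc += n + 1; // n children plus one parent
        }
        toParent = new int[maxDoc];
        toFirstChild = new int[maxDoc];
        Arrays.fill(toParent, NONE);
        Arrays.fill(toFirstChild, NONE);

        int doc = 0;
        for (int n : childrenPerBlock) {
            int parent = doc + n; // the parent is the last doc of its block
            for (int child = doc; child < parent; child++) {
                toParent[child] = parent - child;
            }
            toFirstChild[parent] = n;
            doc = parent + 1;
        }
    }

    /** Parent docid of a doc, found without a bitset; a parent maps to itself. */
    int parentOf(int doc) {
        return toParent[doc] == NONE ? doc : doc + toParent[doc];
    }

    /** First child docid of a parent doc. */
    int firstChildOf(int parent) {
        if (toFirstChild[parent] == NONE) {
            throw new IllegalArgumentException("doc " + parent + " is not a parent");
        }
        return parent - toFirstChild[parent];
    }

    public static void main(String[] args) {
        // Two blocks: docs 0-1 are children of doc 2, docs 3-5 are children of doc 6.
        BlockOffsets blocks = new BlockOffsets(2, 3);
        System.out.println(blocks.parentOf(4));
        System.out.println(blocks.firstChildOf(6));
    }
}
```

Each lookup here is a single array read plus an addition or subtraction, 
which is the property that would let the offsets take the place of the 
parent bitset.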



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
