[ 
https://issues.apache.org/jira/browse/SOLR-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294202#comment-17294202
 ] 

Joel Bernstein edited comment on SOLR-15210 at 3/3/21, 1:29 AM:
----------------------------------------------------------------

Let's have the best of both worlds. We can lazily build up a bitset of 
documents to ignore for each worker. We can then apply this bitset before the 
sorting stage.

Here is the basic idea:

1) In the writer thread, hash each key and decide whether the worker should 
send the doc out.
2) When the writer finds a key that shouldn't be sent out, add the docId to an 
ignore bitset for that specific worker.
3) After each run, merge that run's ignore bitset into a cached ignore bitset 
kept per worker.
4) Before performing the sort, turn off every bit in the worker's result set 
that is also set in that worker's cached ignore bitset.

Basically, this lazily builds a set of documents per worker that should NOT be 
sent out. The cache warms over time, so exports get faster as more runs 
complete.
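
The four steps above can be sketched roughly as follows. This is only an 
illustration, not code from Solr: the class WorkerIgnoreCache, its method 
names, and the keysByDocId array (a stand-in for reading the partition key of 
a doc) are all hypothetical, and java.util.BitSet is used instead of Lucene's 
bitset classes so the example stays self-contained. It is not thread-safe and 
it ignores cache invalidation.

```java
import java.util.BitSet;

/** Hypothetical sketch: per-worker cache of docIds that must NOT be exported. */
final class WorkerIgnoreCache {
    private final int numWorkers;
    private final BitSet[] cached; // one lazily warmed ignore bitset per worker

    WorkerIgnoreCache(int numWorkers) {
        this.numWorkers = numWorkers;
        this.cached = new BitSet[numWorkers];
        for (int i = 0; i < numWorkers; i++) {
            cached[i] = new BitSet();
        }
    }

    /** Step 1: hash the key and decide whether this worker should send the doc. */
    boolean belongsTo(int workerId, String key) {
        // String.hashCode is an assumption here; any stable hash would do.
        return Math.floorMod(key.hashCode(), numWorkers) == workerId;
    }

    /** Step 4: before sorting, clear every candidate bit already known to be ignorable. */
    void applyBeforeSort(int workerId, BitSet candidates) {
        candidates.andNot(cached[workerId]);
    }

    /**
     * Steps 1-3: walk the candidates of one run, record rejected docIds,
     * then merge them into the cached bitset. Returns the number of docs sent.
     */
    int writeRun(int workerId, BitSet candidates, String[] keysByDocId) {
        BitSet ignoredThisRun = new BitSet();
        int sent = 0;
        for (int doc = candidates.nextSetBit(0); doc >= 0; doc = candidates.nextSetBit(doc + 1)) {
            if (belongsTo(workerId, keysByDocId[doc])) {
                sent++;                  // this worker owns the doc: write it out
            } else {
                ignoredThisRun.set(doc); // step 2: remember it for later runs
            }
        }
        cached[workerId].or(ignoredThisRun); // step 3: warm the cache
        return sent;
    }
}
```

On the first export nothing is cached, so every matching doc is still sorted 
and hashed. On later exports applyBeforeSort removes the docs already known to 
belong to other workers, so the sort only has to handle this worker's share.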




> ParallelStream should execute hashing & filtering directly in ExportWriter
> --------------------------------------------------------------------------
>
>                 Key: SOLR-15210
>                 URL: https://issues.apache.org/jira/browse/SOLR-15210
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>
> Currently ParallelStream uses {{HashQParserPlugin}} to partition the work 
> based on a hashed value of {{partitionKeys}}. Unfortunately, this filter has 
> a high initial runtime cost because it has to materialize all values of 
> {{partitionKeys}} on each worker in order to calculate their hash and decide 
> whether a particular doc belongs to the worker's partition.
> The alternative approach would be for the worker to collect and sort all 
> documents and only then keep the ones that belong to the current 
> partition, just before they are written out by {{ExportWriter}}. At this 
> point we have to materialize the fields anyway, and we can also benefit from 
> the (minimal) BytesRef caching that the FieldWriters use. On the other hand, 
> we pay the price of sorting all documents, and we also lose the query filter 
> caching that {{HashQParserPlugin}} uses.
> This tradeoff is not obvious and should be investigated to see if it offers 
> better performance.
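
For comparison, the sort-then-filter alternative from the issue description 
might look something like the sketch below. The names SortThenFilter, Doc and 
export are hypothetical, and the in-memory list stands in for whatever 
ExportWriter actually iterates over. Every matching doc is sorted, and the 
hash check happens only at write time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of "sort everything, filter by partition while writing". */
final class SortThenFilter {
    /** Minimal stand-in for a matching document. */
    static final class Doc {
        final int id;
        final String partitionKey;
        final long sortValue;

        Doc(int id, String partitionKey, long sortValue) {
            this.id = id;
            this.partitionKey = partitionKey;
            this.sortValue = sortValue;
        }
    }

    /** Returns the ids this worker would write, in sort order. */
    static List<Integer> export(List<Doc> matches, int workerId, int numWorkers) {
        // Cost paid by this approach: all matches are sorted, not just this worker's.
        List<Doc> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparingLong((Doc d) -> d.sortValue));

        List<Integer> written = new ArrayList<>();
        for (Doc d : sorted) {
            // The partition key is materialized here, at write time.
            if (Math.floorMod(d.partitionKey.hashCode(), numWorkers) == workerId) {
                written.add(d.id);
            }
        }
        return written;
    }
}
```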



