[ 
https://issues.apache.org/jira/browse/SOLR-14470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112099#comment-17112099
 ] 

Andrzej Bialecki commented on SOLR-14470:
-----------------------------------------

For some reason Jira didn't add a link to the PR: 
[https://github.com/apache/lucene-solr/pull/1506]

The implementation simply reuses the streaming API to process documents just 
before they are sent out from /export, and it's purely optional - it's used 
only when the {{expr=}} parameter is specified.
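
As a rough illustration of what such a request could look like (this is not taken from the PR), the sketch below builds an /export URL with and without the optional {{expr=}} parameter. The host, the collection name {{products}}, the field names, and the expression itself are hypothetical; in particular, the use of {{input()}} to refer to the exported documents is an assumption, and the set of expressions actually supported is defined by the PR, not by this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_export_url(base_url, collection, query, sort, fields, expr=None):
    """Build an /export request URL; expr is added only when given."""
    params = {"q": query, "sort": sort, "fl": ",".join(fields)}
    if expr is not None:
        # Optional: without it, /export behaves as a plain export.
        params["expr"] = expr
    return f"{base_url}/{collection}/export?{urlencode(params)}"


# Hypothetical values, for illustration only.
plain_url = build_export_url(
    "http://localhost:8983/solr", "products", "*:*",
    "category asc", ["category", "price"],
)
rollup_url = build_export_url(
    "http://localhost:8983/solr", "products", "*:*",
    "category asc", ["category", "price"],
    # Assumed expression syntax; input() stands for the exported stream.
    expr='rollup(input(),over="category",sum(price))',
)

print(plain_url)
print(parse_qs(urlsplit(rollup_url).query)["expr"][0])
```

The only difference between the two requests is the extra parameter, which matches the point above that the feature is opt-in.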

I had to do some restructuring of {{ExportWriter}}, so the diff may seem large, 
but that was also to increase the reuse of existing methods - the 
actual changes to {{ExportWriter}} that matter are just 20-some lines that hook up 
the special streaming shim ({{ExportWriterStream}}).

> Add streaming expressions to /export handler
> --------------------------------------------
>
>                 Key: SOLR-14470
>                 URL: https://issues.apache.org/jira/browse/SOLR-14470
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Export Writer, streaming expressions
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>
> Many streaming scenarios would greatly benefit from the ability to perform 
> partial rollups (or other transformations) as early as possible, in order to 
> minimize the amount of data that has to be sent from shards to the 
> aggregating node.
> This can be implemented as a subset of streaming expressions that process the 
> data directly inside each local {{ExportHandler}} and output only the 
> records from the resulting stream. 
> Conceptually it would be similar to the way Hadoop {{Combiner}} works. As is 
> the case with {{Combiner}}, because the input data is processed in batches 
> there would be no guarantee that only one record per unique set of sort values would 
> be emitted - in fact, in most cases multiple partial aggregations would be 
> emitted. Still, in many scenarios this would allow reducing the amount of 
> data to be sent by several orders of magnitude.
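
The combiner-like behaviour described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the Solr implementation: the batch size, the {{(key, value)}} record shape, and the function names {{partial_rollups}} and {{merge_partials}} are all hypothetical. It shows why several partial aggregates per key may be emitted by a shard, and why the aggregating node can still compute the exact result.

```python
from itertools import islice


def partial_rollups(records, batch_size):
    """Shard side: aggregate each batch on its own and emit one
    (key, partial_sum, partial_count) tuple per key per batch."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        sums = {}
        for key, value in batch:
            s, c = sums.get(key, (0, 0))
            sums[key] = (s + value, c + 1)
        for key, (s, c) in sums.items():
            yield key, s, c


def merge_partials(partials):
    """Aggregator side: combine partial sums and counts per key."""
    totals = {}
    for key, s, c in partials:
        ts, tc = totals.get(key, (0, 0))
        totals[key] = (ts + s, tc + c)
    return totals


shard_docs = [("a", 1), ("a", 2), ("b", 3), ("a", 4), ("b", 5), ("b", 6)]
partials = list(partial_rollups(shard_docs, batch_size=3))
totals = merge_partials(partials)

# Keys "a" and "b" each appear in both batches, so each is emitted twice.
print(partials)
print(totals)
```

Here six input records shrink to four partial records; with many documents per key and larger batches the reduction is correspondingly larger. Sums and counts are used because they can be merged exactly; an average, for instance, would have to be derived from them at the aggregating node rather than averaged per batch.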



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

