[ 
https://issues.apache.org/jira/browse/SOLR-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012457#comment-15012457
 ] 

Dennis Gove commented on SOLR-8281:
-----------------------------------

I see us needing to make a couple of changes.

*RollupStream*:
Instead of adding just a raw value to the tuple, this should add a nested tuple 
that contains metadata about the metric. That metadata is required to merge 
certain metrics (such as mean, where the merge needs each partial sum and count 
rather than the already-computed average).
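
As a rough sketch of why the metadata matters for mean: the class below is 
hypothetical (MeanPartial and its fields are illustrative names, not part of 
Solr) and just shows that carrying sum and count makes two partial means 
mergeable, whereas averaging the two means would not.

```java
// Hypothetical sketch; not the actual Solr Metric API.
final class MeanPartial {
    final double sum;   // metadata: running sum seen by one worker
    final long count;   // metadata: number of values seen by one worker

    MeanPartial(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // The "raw value" a RollupStream would otherwise emit on its own.
    double mean() {
        return count == 0 ? 0.0 : sum / count;
    }

    // Merging is only possible because sum and count were kept.
    MeanPartial merge(MeanPartial other) {
        return new MeanPartial(sum + other.sum, count + other.count);
    }

    public static void main(String[] args) {
        MeanPartial a = new MeanPartial(10.0, 2);  // mean 5.0
        MeanPartial b = new MeanPartial(50.0, 8);  // mean 6.25
        System.out.println(a.merge(b).mean());     // correct overall mean
        System.out.println((a.mean() + b.mean()) / 2); // naive, differs
    }
}
```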

*MergeRollupStream*:
The constructor will first validate that all tuples in the substreams are 
mergeable. It can do this by asking the substreams which metrics they intend to 
calculate or return. Note that this requires a new function in the TupleStream 
interface whose job is to return all metrics calculated in that stream or its 
substreams.
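
One possible shape for that check, purely as a sketch: the MetricInfo and 
MetricSource interfaces, the mergeable() flag, and the metric names below are 
all assumptions made up for illustration, not existing TupleStream or Metric 
methods.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these types exist in Solr under these names.
final class MergeValidationSketch {
    interface MetricInfo {
        String name();
        boolean mergeable();
    }

    // Stands in for the proposed new TupleStream function: report every
    // metric calculated by this stream or its substreams.
    interface MetricSource {
        List<MetricInfo> metrics();
    }

    // Fail at construction time if any reported metric cannot be merged.
    static void validate(MetricSource substream) {
        for (MetricInfo m : substream.metrics()) {
            if (!m.mergeable()) {
                throw new IllegalArgumentException(
                    "metric is not mergeable: " + m.name());
            }
        }
    }

    // Small helper to build a MetricInfo for the example.
    static MetricInfo metric(final String name, final boolean mergeable) {
        return new MetricInfo() {
            public String name() { return name; }
            public boolean mergeable() { return mergeable; }
        };
    }

    public static void main(String[] args) {
        validate(() -> Arrays.asList(metric("sum(a)", true), metric("mean(b)", true)));
        System.out.println("all metrics mergeable");
    }
}
```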

The read() implementation of this stream will need to read all tuples from the 
substream (most likely a ParallelStream, but it doesn't have to be). Each tuple 
will be added to a map with map[tupleKey] = tuple, where tupleKey is whatever 
defines a unique tuple (i.e., the group by fields). If a tuple with the same 
key already exists in the map, then the existing tuple and the read tuple will 
be merged by calling existingTuple = metric.merge(existingTuple, readTuple) for 
each metric, and the result is put back into the map. The end result is that 
the map contains the merged tuples.
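
The map-based merge described above could be sketched roughly as follows. 
Partial is a hypothetical stand-in for a rolled-up tuple carrying sum/count 
metadata, and the group-by key is simplified to a single String; neither is 
taken from the Solr code base.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the merge step; not the Solr Tuple or Metric API.
final class RollupMergeSketch {
    // Stand-in for a rolled-up tuple emitted by one worker.
    static final class Partial {
        final String key;   // the group by fields, flattened to one key
        final double sum;
        final long count;

        Partial(String key, double sum, long count) {
            this.key = key;
            this.sum = sum;
            this.count = count;
        }
    }

    // Read every worker tuple; merge tuples that share a key.
    static Map<String, Partial> mergeAll(List<Partial> workerTuples) {
        Map<String, Partial> merged = new HashMap<>();
        for (Partial read : workerTuples) {
            // Map.merge inserts when the key is absent, otherwise combines
            // the existing tuple with the one just read.
            merged.merge(read.key, read, (existing, incoming) ->
                new Partial(existing.key,
                            existing.sum + incoming.sum,
                            existing.count + incoming.count));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Partial> m = mergeAll(Arrays.asList(
            new Partial("a", 10.0, 2),
            new Partial("b", 3.0, 1),
            new Partial("a", 50.0, 8)));
        System.out.println(m.get("a").sum + " / " + m.get("a").count);
    }
}
```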

MergeRollupStream::read() will then return the first tuple from the map, and 
each later call the next one. Note that we can use a sorted map, or some other 
way of returning sorted values from a map, so that we can enforce a sort order 
on the tuples read. This also allows us to effectively re-sort the stream into 
an order that is useful for wrapping streams.
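
A minimal sketch of the sorted-map idea, assuming a java.util.TreeMap keyed by 
a String group key and a plain count as the merged value. The class name, the 
add() helper, and the use of null as an end-of-stream marker are assumptions 
for illustration only.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not the Solr TupleStream API.
final class SortedDrainSketch {
    // TreeMap iterates its keys in natural (here lexicographic) order.
    private final TreeMap<String, Long> merged = new TreeMap<>();
    private Iterator<Map.Entry<String, Long>> cursor;

    // Merge a partial count for a group key into the map.
    void add(String groupKey, long count) {
        merged.merge(groupKey, count, Long::sum);
    }

    // Return the next merged entry in key order, or null once exhausted
    // (null stands in for an EOF tuple in this sketch).
    Map.Entry<String, Long> read() {
        if (cursor == null) {
            cursor = merged.entrySet().iterator();
        }
        return cursor.hasNext() ? cursor.next() : null;
    }

    public static void main(String[] args) {
        SortedDrainSketch s = new SortedDrainSketch();
        s.add("b", 1);
        s.add("a", 2);
        s.add("b", 3);
        for (Map.Entry<String, Long> e = s.read(); e != null; e = s.read()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

Passing a Comparator to the TreeMap constructor would give a different sort 
order without changing read().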

I may be leaving something out but I believe this approach (or at least the one 
I've designed in my head) will give us what we need.

An open question is whether MergeRollupStream should return metrics containing 
this metadata or strip the metadata out. I think we should return it 
but am not wedded to that idea.

> Add RollupMergeStream to Streaming API
> --------------------------------------
>
>                 Key: SOLR-8281
>                 URL: https://issues.apache.org/jira/browse/SOLR-8281
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>
> The RollupMergeStream merges the aggregate results emitted by the 
> RollupStream on *worker* nodes.
> This is designed to be used in conjunction with the HashJoinStream to perform 
> rollup Aggregations on the joined Tuples. The HashJoinStream will require the 
> tuples to be partitioned on the Join keys. To avoid needing to repartition on 
> the *group by* fields for the RollupStream, we can perform a merge of the 
> rolled up Tuples coming from the workers.
> The construct would look like this:
> {code}
> mergeRollup (...
>                       parallel (...
>                                     rollup (...
>                                                 hashJoin (
>                                                                   search(...),
>                                                                   search(...),
>                                                                   on="fieldA" 
>                                                 )
>                                      )
>                          )
>                )
> {code}
> The pseudo code above would push the *hashJoin* and *rollup* to the *worker* 
> nodes. The emitted rolled up tuples would be merged by the mergeRollup.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
