[
https://issues.apache.org/jira/browse/SOLR-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012457#comment-15012457
]
Dennis Gove commented on SOLR-8281:
-----------------------------------
I'd see us needing to make a couple of changes.
*RollupStream*:
Instead of just adding a raw value to the tuple, this should add a tuple which
contains the value plus metadata about the metric. Metadata is required to
merge certain metrics: a mean, for example, can only be merged correctly if the
count behind each partial mean is known.
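To illustrate why a mean needs metadata, here is a minimal sketch. MeanState is a hypothetical class made up for this example, not an existing Solr type; it assumes the metadata carried with the metric is the running sum and count.

```java
// Hypothetical sketch: MeanState is not an existing Solr class.
// It carries the metadata (sum and count) needed to merge partial means.
final class MeanState {
    final double sum;
    final long count;

    MeanState(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Merge two partial results by combining their sums and counts.
    // Averaging the two partial means directly would weight them equally,
    // which is wrong whenever the counts differ.
    MeanState merge(MeanState other) {
        return new MeanState(sum + other.sum, count + other.count);
    }

    // The mean is derived on demand from the merged metadata.
    double mean() {
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

With one worker reporting sum=10, count=2 and another reporting sum=30, count=3, the merged mean is 40/5 = 8, whereas averaging the two partial means (5 and 10) would give 7.5.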
*MergeRollupStream*:
The constructor will first validate that all tuples in the substreams are
mergeable. It can do this by asking the substreams for the metrics they intend
to calculate or return. Note that this requires a new function in the
TupleStream interface whose job is to return all metrics calculated in that
stream or its substreams.
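A rough sketch of what that construction-time check could look like. Every name here (MergeableMetric, MetricAwareStream, getMetrics, isMergeable, SimpleMetric, MergeValidation) is hypothetical and only stands in for the proposed addition; none of them exist in the TupleStream interface today.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; none of these are existing Solr types.
interface MergeableMetric {
    String name();
    boolean isMergeable();
}

interface MetricAwareStream {
    // Stand-in for the proposed new function: every metric calculated
    // by this stream or its substreams.
    List<MergeableMetric> getMetrics();
}

// Trivial metric implementation used only for illustration.
final class SimpleMetric implements MergeableMetric {
    private final String name;
    private final boolean mergeable;

    SimpleMetric(String name, boolean mergeable) {
        this.name = name;
        this.mergeable = mergeable;
    }

    public String name() { return name; }
    public boolean isMergeable() { return mergeable; }
}

final class MergeValidation {
    // Fails fast, at construction time, if any metric cannot be merged.
    static void requireMergeable(MetricAwareStream substream) {
        List<String> notMergeable = new ArrayList<>();
        for (MergeableMetric metric : substream.getMetrics()) {
            if (!metric.isMergeable()) {
                notMergeable.add(metric.name());
            }
        }
        if (!notMergeable.isEmpty()) {
            throw new IllegalArgumentException(
                "Metrics cannot be merged: " + notMergeable);
        }
    }
}
```

Failing in the constructor rather than in read() means a bad expression is rejected before any tuples are pulled from the workers.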
The read() implementation of this stream will need to read all tuples from the
substream (most likely a ParallelStream, but it doesn't have to be). Each tuple
will be added to a map with map[tupleKey] = tuple, where tupleKey is whatever
defines a unique tuple (i.e., the group by fields). If the "same" tuple already
exists in the map, then the existing tuple and the read tuple will be merged by
calling existingTuple = metric.merge(existingTuple, readTuple) for each metric,
and the result put back into the map. The end result is that the map contains
the merged tuples.
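The merge-into-a-map step could be sketched roughly as follows. This is a simplification under stated assumptions: tuples are plain java.util.Map instances rather than Solr Tuple objects, RollupMerger is a hypothetical class, and a single hypothetical "count" metric is merged by addition in place of the per-metric metric.merge(...) call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; tuples are plain maps, not Solr Tuple objects.
final class RollupMerger {
    private final List<String> groupByFields;
    private final Map<List<Object>, Map<String, Object>> merged = new HashMap<>();

    RollupMerger(List<String> groupByFields) {
        this.groupByFields = groupByFields;
    }

    // Adds one tuple read from the substream, merging it with any tuple
    // already stored under the same group-by key.
    void add(Map<String, Object> tuple) {
        List<Object> tupleKey = new ArrayList<>();
        for (String field : groupByFields) {
            tupleKey.add(tuple.get(field));
        }
        // Store a copy so the caller's tuple is never mutated.
        merged.merge(tupleKey, new HashMap<>(tuple), RollupMerger::mergeTuples);
    }

    // Stand-in for calling metric.merge(existingTuple, readTuple) per metric;
    // here the only metric is an assumed Long field named "count".
    static Map<String, Object> mergeTuples(Map<String, Object> existing,
                                           Map<String, Object> read) {
        long total = (Long) existing.get("count") + (Long) read.get("count");
        existing.put("count", total);
        return existing;
    }

    Map<List<Object>, Map<String, Object>> result() {
        return merged;
    }
}
```

Using the list of group-by values as the key relies on List.equals and List.hashCode, so two tuples land in the same entry exactly when all of their group-by fields match.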
MergeRollupStream::read() will then return the first tuple from the map, with
later calls returning the remaining tuples in turn. Note that we can use a
sorted map, or some other way of returning sorted values from a map, so that we
can enforce a sort order on the tuples we return. This also allows us to
effectively re-sort the stream into an order that is useful for wrapping
streams.
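One way to get sorted output is a java.util.TreeMap, sketched below. SortedMergedTuples is a hypothetical class; a String group key and a Long merged count stand in for real tuples, and returning null at the end is an assumption of this sketch rather than how the stream would actually signal that it is exhausted.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a String key and Long count stand in for real tuples.
final class SortedMergedTuples {
    // TreeMap keeps entries ordered by key (natural String order here;
    // a Comparator could be passed to the constructor for another order).
    private final TreeMap<String, Long> merged = new TreeMap<>();

    // Merge a partial count into the entry for its group key.
    void add(String groupKey, long count) {
        merged.merge(groupKey, count, Long::sum);
    }

    // Returns and removes the entry with the smallest remaining key,
    // or null once every merged entry has been returned.
    Map.Entry<String, Long> read() {
        return merged.pollFirstEntry();
    }
}
```

Because the order comes from the map rather than from the substream, the merged output can be sorted on whatever a wrapping stream needs, independent of how the workers partitioned or ordered their tuples.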
I may be leaving something out but I believe this approach (or at least the one
I've designed in my head) will give us what we need.
An open question is whether the MergeRollupStream should return metrics
containing this metadata or strip the metadata out. I think we should return it
but am not wedded to that idea.
> Add RollupMergeStream to Streaming API
> --------------------------------------
>
> Key: SOLR-8281
> URL: https://issues.apache.org/jira/browse/SOLR-8281
> Project: Solr
> Issue Type: Bug
> Reporter: Joel Bernstein
> Assignee: Joel Bernstein
>
> The RollupMergeStream merges the aggregate results emitted by the
> RollupStream on *worker* nodes.
> This is designed to be used in conjunction with the HashJoinStream to perform
> rollup Aggregations on the joined Tuples. The HashJoinStream will require the
> tuples to be partitioned on the Join keys. To avoid needing to repartition on
> the *group by* fields for the RollupStream, we can perform a merge of the
> rolled up Tuples coming from the workers.
> The construct would look like this:
> {code}
> mergeRollup (...
>   parallel (...
>     rollup (...
>       hashJoin (
>         search(...),
>         search(...),
>         on="fieldA"
>       )
>     )
>   )
> )
> {code}
> The pseudo code above would push the *hashJoin* and *rollup* to the *worker*
> nodes. The emitted rolled up tuples would be merged by the mergeRollup.
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)