[jira] [Commented] (SOLR-8281) Add RollupMergeStream to Streaming API
[ https://issues.apache.org/jira/browse/SOLR-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012646#comment-15012646 ]

Joel Bernstein commented on SOLR-8281:
--------------------------------------

Thinking about this some more, possibly this is a job for the ReducerStream. We could add Operations to the ReducerStream and have the operations perform the merge.

If we went this route we would scrap the MergeRollupStream and change this ticket to "Add operations to the ReducerStream".

> Add RollupMergeStream to Streaming API
> --------------------------------------
>
>                 Key: SOLR-8281
>                 URL: https://issues.apache.org/jira/browse/SOLR-8281
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>
> The RollupMergeStream merges the aggregate results emitted by the
> RollupStream on *worker* nodes.
> This is designed to be used in conjunction with the HashJoinStream to perform
> rollup Aggregations on the joined Tuples. The HashJoinStream will require the
> tuples to be partitioned on the Join keys. To avoid needing to repartition on
> the *group by* fields for the RollupStream, we can perform a merge of the
> rolled up Tuples coming from the workers.
> The construct would look like this:
> {code}
> mergeRollup (...
>   parallel (...
>     rollup (...
>       hashJoin (
>         search(...),
>         search(...),
>         on="fieldA"
>       )
>     )
>   )
> )
> {code}
> The pseudo code above would push the *hashJoin* and *rollup* to the *worker*
> nodes. The emitted rolled up tuples would be merged by the mergeRollup.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8281) Add RollupMergeStream to Streaming API
[ https://issues.apache.org/jira/browse/SOLR-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012651#comment-15012651 ]

Dennis Gove commented on SOLR-8281:
-----------------------------------

To be honest I think this logic should live in the ParallelStream. As a user of this stream I would expect it to properly merge all workers together, including metrics calculated in those workers.
[jira] [Commented] (SOLR-8281) Add RollupMergeStream to Streaming API
[ https://issues.apache.org/jira/browse/SOLR-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012457#comment-15012457 ]

Dennis Gove commented on SOLR-8281:
-----------------------------------

I'd see us needing to make a couple of changes.

*RollupStream*: Instead of just adding a raw value to the tuple, this should add a tuple itself which contains metadata about the metric. Metadata is required to perform merges on certain metrics (such as mean).

*MergeRollupStream*: The construction of this will first validate that all tuples in substreams are mergeable. It can do this by asking the substreams for the metrics they intend to calculate or return. Note that this requires a new function in the TupleStream interface whose job it is to return all metrics calculated in that stream or its substreams.

The read() implementation of this stream will need to read all tuples from the substream (most likely a ParallelStream, but it doesn't have to be). Each tuple will be added to a map with map[tupleKey] = tuple. tupleKey is whatever defines a unique tuple (i.e., the group by fields). If a "same" tuple exists in the map already, then the existing tuple and the read tuple will be merged by calling existingTuple = metric.merge(existingTuple, readTuple) for each metric and then put back into the map. The end result is that the map contains the merged tuples.

MergeRollupStream::read() will then return the first tuple from the map. Note, we can use a sorted map or some other way to return sorted values from a map so that we can enforce some sort on the read tuples. Also, this allows us to effectively re-sort the stream into something useful for wrapping streams.

I may be leaving something out, but I believe this approach (or at least the one I've designed in my head) will give us what we need. An open question is whether we return metrics containing this metadata from the MergeRollupStream or strip the metadata out. I think we should return it but am not wedded to that idea.
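For illustration only, the map-based merge described in this comment could be sketched roughly as below. This is not code from the ticket or from Solr: MergeRollupSketch, PartialMean and RolledUpTuple are hypothetical names, the group-by key is simplified to a single String, and only one metric (mean) is modeled. The sum and count fields stand in for the per-metric "metadata" the comment says is required to merge means, and a TreeMap stands in for the "sorted map" idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; none of these types exist in Solr. */
class MergeRollupSketch {

    /** A partial mean that keeps sum and count so two partial results can be merged. */
    static final class PartialMean {
        final double sum;
        final long count;

        PartialMean(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        /** Combines two partial means; merging only the final mean values would be wrong. */
        PartialMean merge(PartialMean other) {
            return new PartialMean(sum + other.sum, count + other.count);
        }

        double value() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

    /** A rolled-up tuple as one worker might emit it: group-by key plus a partial metric. */
    static final class RolledUpTuple {
        final String groupKey;
        final PartialMean mean;

        RolledUpTuple(String groupKey, PartialMean mean) {
            this.groupKey = groupKey;
            this.mean = mean;
        }
    }

    /**
     * Reads every worker tuple into a sorted map keyed by the group-by key.
     * When a key is already present, the existing and the newly read tuple are merged.
     * The result is returned in key order.
     */
    static List<RolledUpTuple> merge(List<RolledUpTuple> fromWorkers) {
        Map<String, RolledUpTuple> merged = new TreeMap<>();
        for (RolledUpTuple tuple : fromWorkers) {
            merged.merge(tuple.groupKey, tuple, (existing, read) ->
                    new RolledUpTuple(existing.groupKey, existing.mean.merge(read.mean)));
        }
        return new ArrayList<>(merged.values());
    }

    public static void main(String[] args) {
        // Tuples for group "b" arrive from two different workers.
        List<RolledUpTuple> fromWorkers = Arrays.asList(
                new RolledUpTuple("b", new PartialMean(10.0, 2)),
                new RolledUpTuple("a", new PartialMean(3.0, 1)),
                new RolledUpTuple("b", new PartialMean(20.0, 3)));
        for (RolledUpTuple tuple : merge(fromWorkers)) {
            System.out.println(tuple.groupKey + " mean=" + tuple.mean.value());
        }
    }
}
```

Note that this sketch buffers every merged group in memory before returning anything, which matches the "read all tuples from the substream" step in the comment. Whether the merged output should still carry sum and count is the open question raised at the end of the comment.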
[jira] [Commented] (SOLR-8281) Add RollupMergeStream to Streaming API
[ https://issues.apache.org/jira/browse/SOLR-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012755#comment-15012755 ]

Joel Bernstein commented on SOLR-8281:
--------------------------------------

Early versions of the ParallelStream handled the merging of Rollups, but I pulled it out because I felt this needed more thought.

The nice thing about adding operations to the ReducerStream is that it makes the ReducerStream much more useful. So even if we don't use it to merge Rollups, it's worth doing.

But this construct seems nice:

{code}
reduce (...
  parallel (...
    rollup (...
      hashJoin (
        search(...),
        search(...),
        on="fieldA"
      )
    )
  )
)
{code}

Actually this is even nicer:

{code}
reduce (...
  parallel (...
    reduce (...
      hashJoin (
        search(...),
        search(...),
        on="fieldA"
      )
    )
  )
)
{code}

In this case the ReducerStream replaces the RollupStream. To support this we would need an Operation to roll up the Metrics.