[jira] [Commented] (SOLR-8281) Add RollupMergeStream to Streaming API

2015-11-18 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012646#comment-15012646
 ] 

Joel Bernstein commented on SOLR-8281:
--

Thinking about this some more, possibly this is a job for the ReducerStream. We 
could add Operations to the reducer stream and have the operations perform the 
merge.

In we went this route we would scrap the MergeRollupStream and change this 
ticket to "Add operations to the ReducerStream".

> Add RollupMergeStream to Streaming API
> --
>
> Key: SOLR-8281
> URL: https://issues.apache.org/jira/browse/SOLR-8281
> Project: Solr
>  Issue Type: Bug
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>
> The RollupMergeStream merges the aggregate results emitted by the 
> RollupStream on *worker* nodes.
> This is designed to be used in conjunction with the HashJoinStream to perform 
> rollup Aggregations on the joined Tuples. The HashJoinStream will require the 
> tuples to be partitioned on the Join keys. To avoid needing to repartition on 
> the *group by* fields for the RollupStream, we can perform a merge of the 
> rolled up Tuples coming from the workers.
> The construct would like this:
> {code}
> mergeRollup (...
>   parallel (...
> rollup (...
> hashJoin (
>   search(...),
>   search(...),
>   on="fieldA" 
> )
>  )
>  )
>)
> {code}
> The pseudo code above would push the *hashJoin* and *rollup* to the *worker* 
> nodes. The emitted rolled up tuples would be merged by the mergeRollup.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8281) Add RollupMergeStream to Streaming API

2015-11-18 Thread Dennis Gove (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012651#comment-15012651
 ] 

Dennis Gove commented on SOLR-8281:
---

To be honest I think this logic should live in the ParallelStream. As a user of 
this stream I would expect it to properly merge all workers together, including 
metrics calculated in those workers. 

> Add RollupMergeStream to Streaming API
> --
>
> Key: SOLR-8281
> URL: https://issues.apache.org/jira/browse/SOLR-8281
> Project: Solr
>  Issue Type: Bug
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>
> The RollupMergeStream merges the aggregate results emitted by the 
> RollupStream on *worker* nodes.
> This is designed to be used in conjunction with the HashJoinStream to perform 
> rollup Aggregations on the joined Tuples. The HashJoinStream will require the 
> tuples to be partitioned on the Join keys. To avoid needing to repartition on 
> the *group by* fields for the RollupStream, we can perform a merge of the 
> rolled up Tuples coming from the workers.
> The construct would like this:
> {code}
> mergeRollup (...
>   parallel (...
> rollup (...
> hashJoin (
>   search(...),
>   search(...),
>   on="fieldA" 
> )
>  )
>  )
>)
> {code}
> The pseudo code above would push the *hashJoin* and *rollup* to the *worker* 
> nodes. The emitted rolled up tuples would be merged by the mergeRollup.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8281) Add RollupMergeStream to Streaming API

2015-11-18 Thread Dennis Gove (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012650#comment-15012650
 ] 

Dennis Gove commented on SOLR-8281:
---

To be honest I think this logic should live in the ParallelStream. As a user of 
this stream I would expect it to properly merge all workers together, including 
metrics calculated in those workers. 

> Add RollupMergeStream to Streaming API
> --
>
> Key: SOLR-8281
> URL: https://issues.apache.org/jira/browse/SOLR-8281
> Project: Solr
>  Issue Type: Bug
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>
> The RollupMergeStream merges the aggregate results emitted by the 
> RollupStream on *worker* nodes.
> This is designed to be used in conjunction with the HashJoinStream to perform 
> rollup Aggregations on the joined Tuples. The HashJoinStream will require the 
> tuples to be partitioned on the Join keys. To avoid needing to repartition on 
> the *group by* fields for the RollupStream, we can perform a merge of the 
> rolled up Tuples coming from the workers.
> The construct would like this:
> {code}
> mergeRollup (...
>   parallel (...
> rollup (...
> hashJoin (
>   search(...),
>   search(...),
>   on="fieldA" 
> )
>  )
>  )
>)
> {code}
> The pseudo code above would push the *hashJoin* and *rollup* to the *worker* 
> nodes. The emitted rolled up tuples would be merged by the mergeRollup.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8281) Add RollupMergeStream to Streaming API

2015-11-18 Thread Dennis Gove (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012457#comment-15012457
 ] 

Dennis Gove commented on SOLR-8281:
---

I'd see us needing to make a couple of changes.

*RollupStream*:
Instead of just adding a raw value to the tuple this should add a tuple itself 
which contains metadata about the metric. Metadata is required to perform 
merges on certain metrics (such as mean).

*MergeRollupStream*:
The construction of this will first validate that all tuples in substreams are 
mergable. It can do this by asking the substreams for the metrics it intends to 
calculate or return. Note that this requires a new function in the TupleStream 
interface whose job it is to return all metrics calculated in that stream or 
substreams. 

The read() implementation of this stream will need to read all tuples from the 
substream (most likely a ParallelStream but doesn't have to be). Each tuple 
will be added to a map with map\[tupleKey\] = tuple. tupleKey is whatever 
defines a unique tuple (ie, the group by fields). If a "same" tuple exists in 
the map already then the existing tuple and read tuple will be merged by 
calling existingTuple = metric.merge(existingTuple, readTuple) for each metric 
and then put back into the map. The end result is that the map contains the 
merged tuples.

MergeRollupStream::read() will then return the first tuple from the map. Note, 
we can use a sorted map or some way to return sorted values from a map so that 
we can enforce some sort on the read tuples. Also, this allows us to 
effectively resort the stream to something useful for wrapping streams.

I may be leaving something out but I believe this approach (or at least the one 
I've designed in my head) will give us what we need.

An open question is do we return from the MergeRollupStream metrics containing 
this metadata or should we strip the metadata out? I think we should return it 
but am not wedded to that idea.

> Add RollupMergeStream to Streaming API
> --
>
> Key: SOLR-8281
> URL: https://issues.apache.org/jira/browse/SOLR-8281
> Project: Solr
>  Issue Type: Bug
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>
> The RollupMergeStream merges the aggregate results emitted by the 
> RollupStream on *worker* nodes.
> This is designed to be used in conjunction with the HashJoinStream to perform 
> rollup Aggregations on the joined Tuples. The HashJoinStream will require the 
> tuples to be partitioned on the Join keys. To avoid needing to repartition on 
> the *group by* fields for the RollupStream, we can perform a merge of the 
> rolled up Tuples coming from the workers.
> The construct would like this:
> {code}
> mergeRollup (...
>   parallel (...
> rollup (...
> hashJoin (
>   search(...),
>   search(...),
>   on="fieldA" 
> )
>  )
>  )
>)
> {code}
> The pseudo code above would push the *hashJoin* and *rollup* to the *worker* 
> nodes. The emitted rolled up tuples would be merged by the mergeRollup.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8281) Add RollupMergeStream to Streaming API

2015-11-18 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012755#comment-15012755
 ] 

Joel Bernstein commented on SOLR-8281:
--

Early versions of the ParallelStream handled the merging of Rollups. But I 
pulled it out because I felt this needed more thought.

The nice thing about adding operations to the ReducerStream is that it makes 
the ReducerStream much more useful. So even we don't use to merge Rollups it's 
worth doing.

But this construct seems nice:

{code}
reduce (...
  parallel (...
rollup (...
hashJoin (
  search(...),
  search(...),
  on="fieldA" 
)
 )
 )
   )
{code}

Actually this is even nicer

{code}
reduce  (...
  parallel (...
reduce (...
hashJoin (
  search(...),
  search(...),
  on="fieldA" 
)
 )
 )
   )
{code}

In this case the ReducerStream replaces the RollupStream. 

To support this we would need an Operation to rollup the Metrics.

> Add RollupMergeStream to Streaming API
> --
>
> Key: SOLR-8281
> URL: https://issues.apache.org/jira/browse/SOLR-8281
> Project: Solr
>  Issue Type: Bug
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>
> The RollupMergeStream merges the aggregate results emitted by the 
> RollupStream on *worker* nodes.
> This is designed to be used in conjunction with the HashJoinStream to perform 
> rollup Aggregations on the joined Tuples. The HashJoinStream will require the 
> tuples to be partitioned on the Join keys. To avoid needing to repartition on 
> the *group by* fields for the RollupStream, we can perform a merge of the 
> rolled up Tuples coming from the workers.
> The construct would like this:
> {code}
> mergeRollup (...
>   parallel (...
> rollup (...
> hashJoin (
>   search(...),
>   search(...),
>   on="fieldA" 
> )
>  )
>  )
>)
> {code}
> The pseudo code above would push the *hashJoin* and *rollup* to the *worker* 
> nodes. The emitted rolled up tuples would be merged by the mergeRollup.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org