[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-30 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387237#comment-14387237
 ] 

ASF subversion and git services commented on SOLR-7082:
---

Commit 1670181 from [~joel.bernstein] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1670181 ]

SOLR-7082: Syntactic sugar for metric gathering

> Streaming Aggregation for SolrCloud
> ---
>
> Key: SOLR-7082
> URL: https://issues.apache.org/jira/browse/SOLR-7082
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Joel Bernstein
> Fix For: Trunk, 5.1
>
> Attachments: SOLR-7082.patch, SOLR-7082.patch, SOLR-7082.patch, 
> SOLR-7082.patch, SOLR-7082.patch
>
>
> This issue provides a general purpose streaming aggregation framework for 
> SolrCloud. An overview of how it works can be found at this link:
> http://heliosearch.org/streaming-aggregation-for-solrcloud/
> This functionality allows SolrCloud users to perform operations that were 
> typically done using map/reduce or a parallel computing platform.
> Here is a brief explanation of how the framework works:
> There is a new Solrj *io* package found in: *org.apache.solr.client.solrj.io*
> Key classes:
> *Tuple*: Abstracts a document in a search result as a Map of key/value pairs.
> *TupleStream*: is the base class for all of the streams. Abstracts search 
> results as a stream of Tuples.
> *SolrStream*: connects to a single Solr instance. You call the read() method 
> to iterate over the Tuples.
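The Tuple/TupleStream/read() contract above can be sketched with a self-contained toy (hypothetical minimal types, not the actual org.apache.solr.client.solrj.io classes, which require a running Solr instance):

```java
import java.util.*;

// A Tuple abstracts a document as a map of key/value pairs; a stream
// hands tuples back one read() at a time until an EOF marker tuple.
class Tuple {
    final Map<String, Object> fields;
    Tuple(Map<String, Object> fields) { this.fields = fields; }
    Object get(String key) { return fields.get(key); }
    boolean isEOF() { return Boolean.TRUE.equals(fields.get("EOF")); }
}

interface TupleStream {
    void open();
    Tuple read();   // returns an EOF tuple once exhausted
    void close();
}

// Stands in for SolrStream; here the "search results" are just a list.
class ListStream implements TupleStream {
    private final List<Tuple> tuples;
    private int pos;
    ListStream(List<Tuple> tuples) { this.tuples = tuples; }
    public void open() { pos = 0; }
    public Tuple read() {
        if (pos < tuples.size()) return tuples.get(pos++);
        return new Tuple(Map.of("EOF", true));
    }
    public void close() { }
}

class Demo {
    // The canonical consumption loop: open, read until EOF, close.
    static int countTuples(TupleStream stream) {
        stream.open();
        int n = 0;
        for (Tuple t = stream.read(); !t.isEOF(); t = stream.read()) n++;
        stream.close();
        return n;
    }
}
```

The open/read/close loop is the shape every consumer of these streams follows, whatever the underlying source.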
> *CloudSolrStream*: connects to a SolrCloud collection and merges the results 
> based on the sort param. The merge takes place in CloudSolrStream itself.
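The merge CloudSolrStream is described as performing can be sketched as a k-way merge over shard results that are each already sorted on the sort param (hypothetical data and types, not the actual SolrJ implementation):

```java
import java.util.*;

class ShardMerge {
    // Holds one shard's current head document plus the rest of its results.
    private static class Head {
        final Map<String, Object> doc;
        final Iterator<Map<String, Object>> rest;
        Head(Map<String, Object> doc, Iterator<Map<String, Object>> rest) {
            this.doc = doc;
            this.rest = rest;
        }
    }

    @SuppressWarnings("unchecked")
    static List<Map<String, Object>> merge(List<List<Map<String, Object>>> shards, String key) {
        // Min-heap ordered by the sort key of each shard's current head.
        PriorityQueue<Head> pq = new PriorityQueue<>(
                (a, b) -> ((Comparable<Object>) a.doc.get(key)).compareTo(b.doc.get(key)));
        for (List<Map<String, Object>> shard : shards) {
            Iterator<Map<String, Object>> it = shard.iterator();
            if (it.hasNext()) pq.add(new Head(it.next(), it));
        }
        List<Map<String, Object>> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            Head h = pq.poll();
            out.add(h.doc);
            if (h.rest.hasNext()) pq.add(new Head(h.rest.next(), h.rest));
        }
        return out;
    }

    // Two pre-sorted "shards"; the merge interleaves them by id.
    static List<Object> demo() {
        List<Map<String, Object>> shardA = List.of(
                Map.<String, Object>of("id", 1), Map.<String, Object>of("id", 3));
        List<Map<String, Object>> shardB = List.of(
                Map.<String, Object>of("id", 2), Map.<String, Object>of("id", 4));
        List<Object> ids = new ArrayList<>();
        for (Map<String, Object> doc : merge(List.of(shardA, shardB), "id")) {
            ids.add(doc.get("id"));
        }
        return ids;
    }
}
```

Because each shard's results are pre-sorted, the client only ever compares the current head of each shard, so the merge is streaming and needs no buffering of full result sets.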
> *Decorator Streams*: wrap other streams to perform operations on the streams. 
> Some examples are the UniqueStream, MergeStream and ReducerStream.
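A decorator stream in the UniqueStream style can be sketched as follows (hypothetical minimal types, using Iterator in place of the real open/read/close API): it wraps an underlying stream that is sorted on the "over" field and emits only the first tuple for each value of that field.

```java
import java.util.*;

class UniqueStream implements Iterator<Map<String, Object>> {
    private final Iterator<Map<String, Object>> inner;
    private final String overField;
    private Object lastSeen = new Object();  // sentinel: equal to nothing
    private Map<String, Object> next;

    UniqueStream(Iterator<Map<String, Object>> inner, String overField) {
        this.inner = inner;
        this.overField = overField;
        advance();
    }
    private void advance() {
        next = null;
        while (inner.hasNext()) {
            Map<String, Object> t = inner.next();
            // Sorted input means duplicates are adjacent: skip repeats.
            if (!Objects.equals(t.get(overField), lastSeen)) {
                lastSeen = t.get(overField);
                next = t;
                return;
            }
        }
    }
    public boolean hasNext() { return next != null; }
    public Map<String, Object> next() {
        Map<String, Object> t = next;
        advance();
        return t;
    }

    // One-pass dedup over a stream sorted on the "level" field.
    static List<Object> demo() {
        List<Map<String, Object>> sorted = List.of(
                Map.<String, Object>of("level", "a"),
                Map.<String, Object>of("level", "a"),
                Map.<String, Object>of("level", "b"));
        List<Object> out = new ArrayList<>();
        UniqueStream u = new UniqueStream(sorted.iterator(), "level");
        while (u.hasNext()) out.add(u.next().get("level"));
        return out;
    }
}
```

Note how the decorator never materializes the underlying stream; it only keeps the last value seen, which is why these operations depend on the sort order discussed below.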
> *Going parallel with the ParallelStream and  "Worker Collections"*
> The io package also contains the *ParallelStream*, which wraps a TupleStream 
> and sends it to N worker nodes. The workers are chosen from a SolrCloud 
> collection. These "Worker Collections" don't have to hold any data, they can 
> just be used to execute TupleStreams.
> *The StreamHandler*
> The Worker nodes have a new RequestHandler called the *StreamHandler*. The 
> ParallelStream serializes a TupleStream, before it is opened, and sends it to 
> the StreamHandler on the Worker Nodes.
> The StreamHandler on each Worker node deserializes the TupleStream, opens the 
> stream, iterates the tuples and streams them back to the ParallelStream. The 
> ParallelStream performs the final merge of Metrics and can be wrapped by 
> other Streams to handle the final merged TupleStream.
> *Sorting and Partitioning search results (Shuffling)*
> Each Worker node is shuffled 1/N of the document results. There is a 
> "partitionKeys" parameter that can be included with each TupleStream to 
> ensure that Tuples with the same partitionKeys are shuffled to the same 
> Worker. The actual partitioning is done with a filter query using the 
> HashQParserPlugin. The DocSets from the HashQParserPlugin can be cached in 
> the filter cache which provides extremely high performance hash partitioning. 
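The partitionKeys routing can be sketched in one line (a toy sketch of the idea, not the HashQParserPlugin itself): hash the tuple's partition-key value and take it mod the number of workers, so tuples with equal keys always land on the same worker.

```java
import java.util.Objects;

class HashRoute {
    // floorMod keeps the worker index non-negative even for negative hashes.
    static int worker(Object partitionKeyValue, int numWorkers) {
        return Math.floorMod(Objects.hashCode(partitionKeyValue), numWorkers);
    }
}
```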
> Many of the stream transformations rely on the sort order of the TupleStreams 
> (GroupByStream, MergeJoinStream, UniqueStream, FilterStream, etc.). To 
> accommodate this, the search results can be sorted by specific keys. The 
> "/export" handler can be used to sort entire result sets efficiently.
> By specifying the sort order of the results and the partition keys, documents 
> will be sorted and partitioned inside of the search engine. So when the 
> tuples hit the network they are already sorted, partitioned and headed 
> directly to the correct worker node.
> *Extending The Framework*
> To extend the framework you create new TupleStream Decorators, that gather 
> custom metrics or perform custom stream transformations.
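A custom metric-gathering decorator in the spirit described might look like this (SumStream is a hypothetical name for illustration, not a SolrJ class): it passes tuples through unchanged while accumulating a sum over one field.

```java
import java.util.*;

class SumStream implements Iterator<Map<String, Object>> {
    private final Iterator<Map<String, Object>> inner;
    private final String field;
    private double sum = 0;

    SumStream(Iterator<Map<String, Object>> inner, String field) {
        this.inner = inner;
        this.field = field;
    }
    public boolean hasNext() { return inner.hasNext(); }
    public Map<String, Object> next() {
        Map<String, Object> tuple = inner.next();
        sum += ((Number) tuple.get(field)).doubleValue();  // gather the metric
        return tuple;                                      // pass-through
    }
    double sum() { return sum; }

    // Drain a two-tuple stream and report the gathered metric.
    static double demo() {
        SumStream s = new SumStream(List.of(
                Map.<String, Object>of("price", 2),
                Map.<String, Object>of("price", 3)).iterator(), "price");
        while (s.hasNext()) s.next();
        return s.sum();
    }
}
```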



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-30 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387207#comment-14387207
 ] 

ASF subversion and git services commented on SOLR-7082:
---

Commit 1670176 from [~joel.bernstein] in branch 'dev/trunk'
[ https://svn.apache.org/r1670176 ]

SOLR-7082: Syntactic sugar for metric gathering




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-27 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383752#comment-14383752
 ] 

ASF subversion and git services commented on SOLR-7082:
---

Commit 1669557 from [~joel.bernstein] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1669557 ]

SOLR-7082: Editing Javadoc




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-27 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383749#comment-14383749
 ] 

ASF subversion and git services commented on SOLR-7082:
---

Commit 1669554 from [~joel.bernstein] in branch 'dev/trunk'
[ https://svn.apache.org/r1669554 ]

SOLR-7082: Editing Javadoc




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-26 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381884#comment-14381884
 ] 

ASF subversion and git services commented on SOLR-7082:
---

Commit 1669344 from [~joel.bernstein] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1669344 ]

SOLR-7082: update CHANGES.txt




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-26 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381881#comment-14381881
 ] 

ASF subversion and git services commented on SOLR-7082:
---

Commit 1669343 from [~joel.bernstein] in branch 'dev/trunk'
[ https://svn.apache.org/r1669343 ]

SOLR-7082: update CHANGES.txt




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-25 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380604#comment-14380604
 ] 

ASF subversion and git services commented on SOLR-7082:
---

Commit 1669212 from [~joel.bernstein] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1669212 ]

SOLR-7082 SOLR-7224 SOLR-7225: Streaming Aggregation for SolrCloud




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-25 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380288#comment-14380288
 ] 

Joel Bernstein commented on SOLR-7082:
--

In the latest commit a few stream implementations are removed to focus on a 
core set of foundational streams for the initial release. 

> Streaming Aggregation for SolrCloud
> ---
>
> Key: SOLR-7082
> URL: https://issues.apache.org/jira/browse/SOLR-7082
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Joel Bernstein
> Fix For: Trunk, 5.1
>
> Attachments: SOLR-7082.patch, SOLR-7082.patch, SOLR-7082.patch, 
> SOLR-7082.patch, SOLR-7082.patch
>
>
> This issue provides a general purpose streaming aggregation framework for 
> SolrCloud. An overview of how it works can be found at this link:
> http://heliosearch.org/streaming-aggregation-for-solrcloud/
> This functionality allows SolrCloud users to perform operations that we're 
> typically done using map/reduce or a parallel computing platform.
> Here is a brief explanation of how the framework works:
> There is a new Solrj *io* package found in: *org.apache.solr.client.solrj.io*
> Key classes:
> *Tuple*: Abstracts a document in a search result as a Map of key/value pairs.
> *TupleStream*: is the base class for all of the streams. Abstracts search 
> results as a stream of Tuples.
> *SolrStream*: connects to a single Solr instance. You call the read() method 
> to iterate over the Tuples.
> *CloudSolrStream*: connects to a SolrCloud collection and merges the results 
> based on the sort param. The merge takes place in CloudSolrStream itself.
> *Decorator Streams*: wrap other streams to gather *Metrics* on streams and 
> *transform* the streams. Some examples are the MetricStream, RollupStream, 
> GroupByStream, UniqueStream, MergeJoinStream, HashJoinStream, MergeStream, 
> FilterStream.
> *Going parallel with the ParallelStream and  "Worker Collections"*
> The io package also contains the *ParallelStream*, which wraps a TupleStream 
> and sends it to N worker nodes. The workers are chosen from a SolrCloud 
> collection. These "Worker Collections" don't have to hold any data, they can 
> just be used to execute TupleStreams.
> *The StreamHandler*
> The Worker nodes have a new RequestHandler called the *StreamHandler*. The 
> ParallelStream serializes a TupleStream, before it is opened, and sends it to 
> the StreamHandler on the Worker Nodes.
> The StreamHandler on each Worker node deserializes the TupleStream, opens the 
> stream, iterates the tuples and streams them back to the ParallelStream. The 
> ParallelStream performs the final merge of Metrics and can be wrapped by 
> other Streams to handle the final merged TupleStream.
> *Sorting and Partitioning search results (Shuffling)*
> Each Worker node receives 1/N of the shuffled document results. There is a 
> "partitionKeys" parameter that can be included with each TupleStream to 
> ensure that Tuples with the same partitionKeys are shuffled to the same 
> Worker. The actual partitioning is done with a filter query using the 
> HashQParserPlugin. The DocSets from the HashQParserPlugin can be cached in 
> the filter cache which provides extremely high performance hash partitioning. 
> Many of the stream transformations rely on the sort order of the TupleStreams 
> (GroupByStream, MergeJoinStream, UniqueStream, FilterStream, etc.). To 
> accommodate this, the search results can be sorted by specific keys. The 
> "/export" handler can be used to sort entire result sets efficiently.
> By specifying the sort order of the results and the partition keys, documents 
> will be sorted and partitioned inside of the search engine. So when the 
> tuples hit the network they are already sorted, partitioned, and headed 
> directly to the correct worker node.
> *Extending The Framework*
> To extend the framework, you create new TupleStream decorators that gather 
> custom metrics or perform custom stream transformations.
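The decorator pattern the description walks through can be sketched in plain Java. The classes below are illustrative stand-ins, not the actual org.apache.solr.client.solrj.io classes (names and signatures are simplified, and end-of-stream is signaled with null rather than an EOF Tuple): a minimal TupleStream interface, an in-memory stream standing in for a search result, and a UniqueStream-style decorator that, like the real one, assumes its inner stream is sorted on the de-duplication field.

```java
import java.util.*;

// Conceptual sketch of the TupleStream decorator pattern; NOT the real Solrj classes.
interface TupleStream {
    Map<String, Object> read(); // null signals end-of-stream in this sketch
}

// An in-memory "search result", standing in for SolrStream/CloudSolrStream.
class ListStream implements TupleStream {
    private final Iterator<Map<String, Object>> it;
    ListStream(List<Map<String, Object>> tuples) { this.it = tuples.iterator(); }
    public Map<String, Object> read() { return it.hasNext() ? it.next() : null; }
}

// A decorator in the spirit of UniqueStream: wraps another stream and drops
// consecutive tuples that repeat the de-duplication field. Like the real
// UniqueStream, it relies on the inner stream being sorted on that field.
class UniqueStream implements TupleStream {
    private final TupleStream inner;
    private final String field;
    private Object last;
    UniqueStream(TupleStream inner, String field) { this.inner = inner; this.field = field; }
    public Map<String, Object> read() {
        Map<String, Object> t;
        while ((t = inner.read()) != null) {
            Object v = t.get(field);
            if (!Objects.equals(v, last)) { last = v; return t; }
        }
        return null;
    }
}

public class StreamSketch {
    // Pull every tuple from a stream and collect one field's values.
    static List<Object> drain(TupleStream s, String field) {
        List<Object> out = new ArrayList<>();
        for (Map<String, Object> t = s.read(); t != null; t = s.read()) out.add(t.get(field));
        return out;
    }
    public static void main(String[] args) {
        List<Map<String, Object>> sorted = new ArrayList<>();
        for (String v : new String[]{"a", "a", "b", "c", "c"}) sorted.add(Map.of("id", v));
        System.out.println(drain(new UniqueStream(new ListStream(sorted), "id"), "id")); // [a, b, c]
    }
}
```

The real streams add open()/close() and cluster plumbing, but the composition model is the same: decorators wrap a stream and transform the tuples as read() is called.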



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-25 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380273#comment-14380273
 ] 

ASF subversion and git services commented on SOLR-7082:
---

Commit 1669164 from [~joel.bernstein] in branch 'dev/trunk'
[ https://svn.apache.org/r1669164 ]

SOLR-7082: Streaming Aggregation for SolrCloud




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-10 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355680#comment-14355680
 ] 

Ramkumar Aiyengar commented on SOLR-7082:
-

Missed that, thanks Joel.




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-10 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354747#comment-14354747
 ] 

Joel Bernstein commented on SOLR-7082:
--

The initial set of tests are here:
https://svn.apache.org/viewvc/lucene/dev/trunk/solr/solrj/src/test/org/apache/solr/client/solrj/io/StreamingTest.java

We can also break these out into smaller files.




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-10 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354650#comment-14354650
 ] 

Ramkumar Aiyengar commented on SOLR-7082:
-

Haven't looked at the patch in great detail, but looks like the SolrJ side 
could use a few tests? There's a new package there but with no tests? 




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354080#comment-14354080
 ] 

ASF subversion and git services commented on SOLR-7082:
---

Commit 1665391 from [~joel.bernstein] in branch 'dev/trunk'
[ https://svn.apache.org/r1665391 ]

SOLR-7082: Streaming Aggregation for SolrCloud




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-01 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342723#comment-14342723
 ] 

Yonik Seeley commented on SOLR-7082:


bq. but conceptually and functionally speaking, would you say this is more or 
less the same as ES aggregations?

I don't think so. The Heliosearch JSON Facet API looks a lot more like ES 
aggregations. Streaming aggregation is a more general-purpose distributed 
computation framework.






[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-01 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342720#comment-14342720
 ] 

Joel Bernstein commented on SOLR-7082:
--

I believe this is more closely comparable to technologies that shuffle, like 
Map/Reduce. 
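The shuffle being compared to Map/Reduce here is the "partitionKeys" mechanism: tuples whose partition key hashes to the same slot are always routed to the same worker, so each of N workers sees a disjoint slice of the results. The sketch below illustrates only that routing idea; the hash choice and method names are hypothetical and are not the internals of Solr's HashQParserPlugin.

```java
import java.util.*;

// Minimal illustration of hash partitioning on a partition key: tuples
// sharing a key always land on the same worker slot. Hypothetical names;
// this is not Solr's actual HashQParserPlugin implementation.
public class ShuffleSketch {
    static int workerFor(String partitionKey, int numWorkers) {
        // Math.floorMod keeps the slot non-negative even for negative hash codes.
        return Math.floorMod(partitionKey.hashCode(), numWorkers);
    }

    // Route each key to its worker's bucket, as a shuffle would.
    static List<List<String>> shuffle(List<String> keys, int numWorkers) {
        List<List<String>> slots = new ArrayList<>();
        for (int i = 0; i < numWorkers; i++) slots.add(new ArrayList<>());
        for (String k : keys) slots.get(workerFor(k, numWorkers)).add(k);
        return slots;
    }

    public static void main(String[] args) {
        List<List<String>> slots = shuffle(List.of("user1", "user2", "user1", "user3"), 2);
        System.out.println(slots); // both "user1" tuples land in the same slot
    }
}
```

In Solr the equivalent routing is pushed into the search engine as a filter query, so the partitioned DocSets can be cached in the filter cache rather than recomputed per request.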




[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-01 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342655#comment-14342655
 ] 

Otis Gospodnetic commented on SOLR-7082:


Thanks Joel.  Re 1) -- but conceptually and functionally speaking, would you 
say this is more or less the same as ES aggregations?






[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-03-01 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342293#comment-14342293
 ] 

Joel Bernstein commented on SOLR-7082:
--

Hi Otis,

Sorry about the slow response, just got back from vacation and still catching 
up. I'll be writing more about how Streaming aggregation works this week. Here 
are some thoughts on your questions:

1) This ticket is focused on providing fast streaming Map/Reduce-like 
functionality. Streams can be sorted and partitioned strategically to minimize 
the amount of memory needed to perform aggregations and transformations. It 
should be fairly responsive because it pushes most of the work (record 
selection, sorting, partitioning) into the search engine. So records go 
straight from the search engine to the correct worker node to be reduced. These 
techniques won't be as fast as faceting, but they will support a very wide 
range of use cases.

2) I'm aiming to get this into Solr trunk soon, with an eye towards having this 
ready to go for Solr 5.1.

  





[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-02-23 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333490#comment-14333490
 ] 

Otis Gospodnetic commented on SOLR-7082:


This looks really nice, Joel.  2 questions:
* this looks a lot like ES aggregations.  Have you maybe made any comparisons 
in terms of speed or memory footprint? (ES aggregations love heap)
* is this all going to land in Solr or will some of it remain in Heliosearch?








[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

2015-02-05 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307869#comment-14307869
 ] 

Joel Bernstein commented on SOLR-7082:
--

The initial patch includes a fully operational parallel streaming framework 
with tests.

It's a fairly large patch so I'll be updating this ticket with details about 
the design and code.

> Streaming Aggregation for SolrCloud
> ---
>
> Key: SOLR-7082
> URL: https://issues.apache.org/jira/browse/SOLR-7082
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Affects Versions: 5.1
>Reporter: Joel Bernstein
> Attachments: SOLR-7082.patch
>
>
> This issue provides a general purpose streaming aggregation framework for 
> SolrCloud. An overview of how it works can be found at this link:
> http://heliosearch.org/streaming-aggregation-for-solrcloud/
> This functionality allows SolrCloud users to perform operations that were 
> typically done using map/reduce or a parallel computing platform. 


