[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

Ramkumar Aiyengar (JIRA) Tue, 10 Mar 2015 03:43:35 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354650#comment-14354650
 ]


Ramkumar Aiyengar commented on SOLR-7082:
-----------------------------------------

Haven't looked at the patch in great detail, but looks like the SolrJ side 
could use a few tests? There's a new package there but with no tests? 

> Streaming Aggregation for SolrCloud
> -----------------------------------
>
>                 Key: SOLR-7082
>                 URL: https://issues.apache.org/jira/browse/SOLR-7082
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>            Reporter: Joel Bernstein
>             Fix For: Trunk, 5.1
>
>         Attachments: SOLR-7082.patch, SOLR-7082.patch, SOLR-7082.patch, 
> SOLR-7082.patch, SOLR-7082.patch
>
>
> This issue provides a general purpose streaming aggregation framework for 
> SolrCloud. An overview of how it works can be found at this link:
> http://heliosearch.org/streaming-aggregation-for-solrcloud/
> This functionality allows SolrCloud users to perform operations that we're 
> typically done using map/reduce or a parallel computing platform.
> Here is a brief explanation of how the framework works:
> There is a new Solrj *io* package found in: *org.apache.solr.client.solrj.io*
> Key classes:
> *Tuple*: Abstracts a document in a search result as a Map of key/value pairs.
> *TupleStream*: is the base class for all of the streams. Abstracts search 
> results as a stream of Tuples.
> *SolrStream*: connects to a single Solr instance. You call the read() method 
> to iterate over the Tuples.
> *CloudSolrStream*: connects to a SolrCloud collection and merges the results 
> based on the sort param. The merge takes place in CloudSolrStream itself.
> *Decorator Streams*: wrap other streams to gather *Metrics* on streams and 
> *transform* the streams. Some examples are the MetricStream, RollupStream, 
> GroupByStream, UniqueStream, MergeJoinStream, HashJoinStream, MergeStream, 
> FilterStream.
> *Going parallel with the ParallelStream and  "Worker Collections"*
> The io package also contains the *ParallelStream*, which wraps a TupleStream 
> and sends it to N worker nodes. The workers are chosen from a SolrCloud 
> collection. These "Worker Collections" don't have to hold any data, they can 
> just be used to execute TupleStreams.
> *The StreamHandler*
> The Worker nodes have a new RequestHandler called the *StreamHandler*. The 
> ParallelStream serializes a TupleStream, before it is opened, and sends it to 
> the StreamHandler on the Worker Nodes.
> The StreamHandler on each Worker node deserializes the TupleStream, opens the 
> stream, iterates the tuples and streams them back to the ParallelStream. The 
> ParallelStream performs the final merge of Metrics and can be wrapped by 
> other Streams to handled the final merged TupleStream.
> *Sorting and Partitioning search results (Shuffling)*
> Each Worker node is shuffled 1/N of the document results. There is a 
> "partitionKeys" parameter that can be included with each TupleStream to 
> ensure that Tuples with the same partitionKeys are shuffled to the same 
> Worker. The actual partitioning is done with a filter query using the 
> HashQParserPlugin. The DocSets from the HashQParserPlugin can be cached in 
> the filter cache which provides extremely high performance hash partitioning. 
> Many of the stream transformations rely on the sort order of the TupleStreams 
> (GroupByStream, MergeJoinStream, UniqueStream, FilterStream etc..). To 
> accommodate this the search results can be sorted by specific keys. The 
> "/export" handler can be used to sort entire result sets efficiently.
> By specifying the sort order of the results and the partition keys, documents 
> will be sorted and partitioned inside of the search engine. So when the 
> tuples hit the network they are already sorted, partitioned and headed 
> directly to correct worker node.
> *Extending The Framework*
> To extend the framework you create new TupleStream Decorators, that gather 
> custom metrics or perform custom stream transformations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-7082) Streaming Aggregation for SolrCloud

Reply via email to