[ https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354650#comment-14354650 ]
Ramkumar Aiyengar commented on SOLR-7082: ----------------------------------------- Haven't looked at the patch in great detail, but looks like the SolrJ side could use a few tests? There's a new package there but with no tests? > Streaming Aggregation for SolrCloud > ----------------------------------- > > Key: SOLR-7082 > URL: https://issues.apache.org/jira/browse/SOLR-7082 > Project: Solr > Issue Type: New Feature > Components: SolrCloud > Reporter: Joel Bernstein > Fix For: Trunk, 5.1 > > Attachments: SOLR-7082.patch, SOLR-7082.patch, SOLR-7082.patch, > SOLR-7082.patch, SOLR-7082.patch > > > This issue provides a general purpose streaming aggregation framework for > SolrCloud. An overview of how it works can be found at this link: > http://heliosearch.org/streaming-aggregation-for-solrcloud/ > This functionality allows SolrCloud users to perform operations that we're > typically done using map/reduce or a parallel computing platform. > Here is a brief explanation of how the framework works: > There is a new Solrj *io* package found in: *org.apache.solr.client.solrj.io* > Key classes: > *Tuple*: Abstracts a document in a search result as a Map of key/value pairs. > *TupleStream*: is the base class for all of the streams. Abstracts search > results as a stream of Tuples. > *SolrStream*: connects to a single Solr instance. You call the read() method > to iterate over the Tuples. > *CloudSolrStream*: connects to a SolrCloud collection and merges the results > based on the sort param. The merge takes place in CloudSolrStream itself. > *Decorator Streams*: wrap other streams to gather *Metrics* on streams and > *transform* the streams. Some examples are the MetricStream, RollupStream, > GroupByStream, UniqueStream, MergeJoinStream, HashJoinStream, MergeStream, > FilterStream. > *Going parallel with the ParallelStream and "Worker Collections"* > The io package also contains the *ParallelStream*, which wraps a TupleStream > and sends it to N worker nodes. The workers are chosen from a SolrCloud > collection. These "Worker Collections" don't have to hold any data, they can > just be used to execute TupleStreams. > *The StreamHandler* > The Worker nodes have a new RequestHandler called the *StreamHandler*. The > ParallelStream serializes a TupleStream, before it is opened, and sends it to > the StreamHandler on the Worker Nodes. > The StreamHandler on each Worker node deserializes the TupleStream, opens the > stream, iterates the tuples and streams them back to the ParallelStream. The > ParallelStream performs the final merge of Metrics and can be wrapped by > other Streams to handled the final merged TupleStream. > *Sorting and Partitioning search results (Shuffling)* > Each Worker node is shuffled 1/N of the document results. There is a > "partitionKeys" parameter that can be included with each TupleStream to > ensure that Tuples with the same partitionKeys are shuffled to the same > Worker. The actual partitioning is done with a filter query using the > HashQParserPlugin. The DocSets from the HashQParserPlugin can be cached in > the filter cache which provides extremely high performance hash partitioning. > Many of the stream transformations rely on the sort order of the TupleStreams > (GroupByStream, MergeJoinStream, UniqueStream, FilterStream etc..). To > accommodate this the search results can be sorted by specific keys. The > "/export" handler can be used to sort entire result sets efficiently. > By specifying the sort order of the results and the partition keys, documents > will be sorted and partitioned inside of the search engine. So when the > tuples hit the network they are already sorted, partitioned and headed > directly to correct worker node. > *Extending The Framework* > To extend the framework you create new TupleStream Decorators, that gather > custom metrics or perform custom stream transformations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org