[ https://issues.apache.org/jira/browse/SOLR-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380604#comment-14380604 ]
ASF subversion and git services commented on SOLR-7082: ------------------------------------------------------- Commit 1669212 from [~joel.bernstein] in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1669212 ] SOLR-7082 SOLR-7224 SOLR-7225: Streaming Aggregation for SolrCloud > Streaming Aggregation for SolrCloud > ----------------------------------- > > Key: SOLR-7082 > URL: https://issues.apache.org/jira/browse/SOLR-7082 > Project: Solr > Issue Type: New Feature > Components: SolrCloud > Reporter: Joel Bernstein > Fix For: Trunk, 5.1 > > Attachments: SOLR-7082.patch, SOLR-7082.patch, SOLR-7082.patch, > SOLR-7082.patch, SOLR-7082.patch > > > This issue provides a general purpose streaming aggregation framework for > SolrCloud. An overview of how it works can be found at this link: > http://heliosearch.org/streaming-aggregation-for-solrcloud/ > This functionality allows SolrCloud users to perform operations that we're > typically done using map/reduce or a parallel computing platform. > Here is a brief explanation of how the framework works: > There is a new Solrj *io* package found in: *org.apache.solr.client.solrj.io* > Key classes: > *Tuple*: Abstracts a document in a search result as a Map of key/value pairs. > *TupleStream*: is the base class for all of the streams. Abstracts search > results as a stream of Tuples. > *SolrStream*: connects to a single Solr instance. You call the read() method > to iterate over the Tuples. > *CloudSolrStream*: connects to a SolrCloud collection and merges the results > based on the sort param. The merge takes place in CloudSolrStream itself. > *Decorator Streams*: wrap other streams to perform operations the streams. > Some examples are the UniqueStream, MergeStream and ReducerStream. > *Going parallel with the ParallelStream and "Worker Collections"* > The io package also contains the *ParallelStream*, which wraps a TupleStream > and sends it to N worker nodes. The workers are chosen from a SolrCloud > collection. These "Worker Collections" don't have to hold any data, they can > just be used to execute TupleStreams. > *The StreamHandler* > The Worker nodes have a new RequestHandler called the *StreamHandler*. The > ParallelStream serializes a TupleStream, before it is opened, and sends it to > the StreamHandler on the Worker Nodes. > The StreamHandler on each Worker node deserializes the TupleStream, opens the > stream, iterates the tuples and streams them back to the ParallelStream. The > ParallelStream performs the final merge of Metrics and can be wrapped by > other Streams to handled the final merged TupleStream. > *Sorting and Partitioning search results (Shuffling)* > Each Worker node is shuffled 1/N of the document results. There is a > "partitionKeys" parameter that can be included with each TupleStream to > ensure that Tuples with the same partitionKeys are shuffled to the same > Worker. The actual partitioning is done with a filter query using the > HashQParserPlugin. The DocSets from the HashQParserPlugin can be cached in > the filter cache which provides extremely high performance hash partitioning. > Many of the stream transformations rely on the sort order of the TupleStreams > (GroupByStream, MergeJoinStream, UniqueStream, FilterStream etc..). To > accommodate this the search results can be sorted by specific keys. The > "/export" handler can be used to sort entire result sets efficiently. > By specifying the sort order of the results and the partition keys, documents > will be sorted and partitioned inside of the search engine. So when the > tuples hit the network they are already sorted, partitioned and headed > directly to correct worker node. > *Extending The Framework* > To extend the framework you create new TupleStream Decorators, that gather > custom metrics or perform custom stream transformations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org