[
https://issues.apache.org/jira/browse/SOLR-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Horatiu Lazu updated SOLR-12216:
--------------------------------
Priority: Trivial (was: Minor)
> Add support for cross-cloud join
> ---------------------------------
>
> Key: SOLR-12216
> URL: https://issues.apache.org/jira/browse/SOLR-12216
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: search
> Reporter: Horatiu Lazu
> Priority: Trivial
>
> This patch proposes the idea of extending the capabilities of the
> built-in join to allow joining across SolrClouds. Similar to streaming's
> search function, the user can directly specify the zkHost of the other
> SolrCloud and the rest of the syntax (from, to, fromIndex) can remain the
> same. This join would be triggered when the zkHost parameter is specified,
> containing the address of the other SolrCloud. It could also be packaged as
> a separate plugin.
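
For illustration, the proposed syntax could look like the string this sketch builds. It only assembles the query text, following the existing join's local-params form. The field names, collection name, and ZooKeeper address are hypothetical, and quoting the zkHost value is an assumption, not something the patch specifies.

```java
class CrossCloudJoinQuery {
    // Builds a join query in Solr local-params form. from, to and fromIndex follow
    // the existing join syntax; zkHost is the proposed addition (assumed to be
    // passed as a quoted local param, which the patch does not confirm).
    static String build(String from, String to, String fromIndex, String zkHost, String query) {
        StringBuilder sb = new StringBuilder("{!join");
        sb.append(" from=").append(from);
        sb.append(" to=").append(to);
        sb.append(" fromIndex=").append(fromIndex);
        if (zkHost != null && !zkHost.isEmpty()) {
            // Only present for a cross-cloud join; without it the string is a regular join.
            sb.append(" zkHost=\"").append(zkHost).append('"');
        }
        return sb.append('}').append(query).toString();
    }

    public static void main(String[] args) {
        // Hypothetical names: manu_id, id, manufacturers, zk1.example.com:2181
        System.out.println(build("manu_id", "id", "manufacturers", "zk1.example.com:2181", "name:acme"));
    }
}
```

Leaving zkHost out yields an ordinary join query, which matches the idea that the cross-cloud path is triggered only when the parameter is present.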
>
> In my testing, my current implementation is on average 4.5x faster than an
> equivalent streaming expression intersecting from two search queries, one of
> which streams from another collection on another SolrCloud.
> h5. How it works
> Similar to the existing join, I created a QParser, but this join works as a
> post-filter. The join first populates a hash set with the join-field values
> from the “from” index (i.e., the index that’s not the one we’re running the
> query from). To obtain the values, it establishes a connection with the other
> SolrCloud using SolrJ through the ZooKeeper address specified, and then uses
> a custom request handler that performs the query on the “from” index and
> returns an array of strings containing the field values. Then, on the “to”
> index, it iterates through the array, which is sent as JavaBin, and adds each
> element to the hash set. After that, we iterate through the NumericDocList
> for the “to” core’s join field, and if a value within the NumericDocList is
> found in our hash set, we collect the document inside the DelegatingCollector.
> This allows for joining across sharded collections as well.
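
The core of the steps above is a hash-set membership filter, which can be sketched with plain Java collections. This is not the patch's code: it uses no Solr classes, the arrays stand in for the values fetched from the "from" side and for the "to" side's per-document join-field values, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HashSetJoinSketch {
    // fromValues: join-field values returned for the "from" index (assumed already fetched).
    // toValues[docId]: join-field value of each candidate document on the "to" side,
    //                  or null when the document has no value for the field.
    // Returns the ids of the "to" documents that would be collected.
    static List<Integer> join(String[] fromValues, String[] toValues) {
        // Step 1: populate the hash set from the "from" side; duplicates collapse.
        Set<String> fromSet = new HashSet<>();
        for (String value : fromValues) {
            fromSet.add(value);
        }
        // Step 2: walk the "to" side and keep documents whose value is in the set.
        List<Integer> collected = new ArrayList<>();
        for (int docId = 0; docId < toValues.length; docId++) {
            String value = toValues[docId];
            if (value != null && fromSet.contains(value)) {
                collected.add(docId);
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        String[] fromValues = {"a", "c", "c"};
        String[] toValues = {"a", "b", null, "c"};
        System.out.println(join(fromValues, toValues)); // prints [0, 3]
    }
}
```

Because the set lookup is constant time on average, the filter costs roughly one pass over each side, at the price of holding every "from" value in memory.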
> h5. How I benchmarked
> I created a web app that first reloads the collections, then sends 25 AJAX
> requests at once to the Solr endpoint, with varying result sizes (between 127
> and 690,000 search results), and records the results. After all
> responses are returned, the collection is reloaded, and the equivalent
> streaming expressions are tested. This process is repeated 15 times, and the
> average of the results is taken.
> Note: The first two requests are not counted in the statistics, because they
> “warm up” the collection. For reference, after bouncing Solr and executing at
> least one query, it takes on average ~890ms to join two collections with
> about 690,000 results, while it takes ~4.5 seconds using streaming
> expressions.
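
The averaging with warm-up exclusion can be sketched as follows. This is an assumption about how the statistics were computed, not the benchmark's actual code: it assumes the skipped requests are the first two of a run, and the latencies in the example are placeholder numbers, not measurements.

```java
class WarmupAverageSketch {
    // Averages latencies in milliseconds, ignoring the first warmupCount entries.
    static double averageAfterWarmup(long[] latenciesMs, int warmupCount) {
        if (warmupCount < 0 || latenciesMs.length <= warmupCount) {
            throw new IllegalArgumentException("need more samples than warm-up requests");
        }
        long sum = 0;
        for (int i = warmupCount; i < latenciesMs.length; i++) {
            sum += latenciesMs[i];
        }
        return (double) sum / (latenciesMs.length - warmupCount);
    }

    public static void main(String[] args) {
        // Placeholder latencies for one run; the first two stand for warm-up requests.
        long[] run = {500, 300, 100, 120, 110};
        System.out.println(averageAfterWarmup(run, 2)); // prints 110.0
    }
}
```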
>
> I have written unit tests as well. I would appreciate some comments
> on this. Thank you.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)