[ https://issues.apache.org/jira/browse/SOLR-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ishan Chattopadhyaya updated SOLR-12216:
----------------------------------------
    Priority: Major  (was: Trivial)

> Add support for cross-cloud join
> ---------------------------------
>
>                 Key: SOLR-12216
>                 URL: https://issues.apache.org/jira/browse/SOLR-12216
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: search
>            Reporter: Horatiu Lazu
>            Priority: Major
>
> This patch proposes extending the built-in join to allow joining across
> SolrClouds. Similar to streaming's search function, the user can directly
> specify the zkHost of the other SolrCloud, and the rest of the syntax (from,
> to, fromIndex) can remain the same. This join would be triggered when the
> zkHost parameter is specified, containing the address of the other SolrCloud
> cluster. It could also be packaged as a separate plugin.
>
> In my testing, my current implementation is on average 4.5x faster than an
> equivalent streaming expression intersecting two search queries, one of
> which streams from another collection on another SolrCloud.
> h5. How it works
> Similar to the existing join, I created a QParser, but this join works as a
> post-filter. The join first populates a hash set containing the join field
> values from the “from” index (i.e., the index that’s not the one we’re
> running the query from). To obtain the values, it establishes a connection
> with the other SolrCloud using SolrJ through the ZooKeeper address specified,
> and then uses a custom request handler that performs the query on the “from”
> index and returns an array of strings containing the field values. Then, on
> the “to” index, it iterates through the array sent as JavaBin and adds each
> entry to the hash set. After that, we iterate through the NumericDocList for
> the “to” core’s join field, and if a value within the NumericDocList is
> found in our hash set, we collect it inside the DelegatingCollector.
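
A minimal sketch of the two phases described above, in plain Java with no
Solr or SolrJ dependency. It is not the code from the patch: the class and
method names (HashSetJoinSketch, buildFromSet, postFilter) are hypothetical,
the remote fetch from the other SolrCloud is replaced by a hard-coded list,
and the per-document join values on the "to" side are modeled as an array
indexed by doc id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashSetJoinSketch {

    // Phase 1: load the join values returned for the "from" index into a
    // hash set. In the proposal these arrive from the remote cluster; here
    // they are simply passed in (assumption for illustration).
    static Set<String> buildFromSet(List<String> fromValues) {
        return new HashSet<>(fromValues);
    }

    // Phase 2: walk the "to" side and keep only the doc ids whose join
    // value is present in the set. toJoinValues[docId] is that document's
    // join value, or null if the document has none.
    static List<Integer> postFilter(String[] toJoinValues, Set<String> fromSet) {
        List<Integer> collected = new ArrayList<>();
        for (int docId = 0; docId < toJoinValues.length; docId++) {
            String value = toJoinValues[docId];
            if (value != null && fromSet.contains(value)) {
                collected.add(docId); // stands in for delegating to the collector
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        Set<String> fromSet = buildFromSet(Arrays.asList("a", "b", "b"));
        String[] toJoinValues = {"a", "x", null, "b"};
        System.out.println(postFilter(toJoinValues, fromSet)); // prints [0, 3]
    }
}
```

Each lookup is a constant-time set membership test on average, so the cost of
the filter step grows with the number of "to" documents checked plus the
number of "from" values loaded into the set.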
> This allows for joining across sharded collections as well.
> h5. How I benchmarked
> I created a web app that first reloads the collections, then sends 25 AJAX
> requests at once to the Solr endpoint with varying result sizes (between 127
> and 690,000 search results), and records the results. After all responses
> are returned, the collections are reloaded, and the equivalent streaming
> expressions are tested. This process is repeated 15 times, and the average
> of the results is taken.
> Note: The first two requests are not counted in the statistics, because they
> “warm up” the collection. For reference, after bouncing Solr and executing
> at least one query, joining two collections with about 690,000 results takes
> ~890 ms on average, while it takes ~4.5 seconds using streaming expressions.
>
> I have written unit tests as well. I would appreciate some comments on this.
> Thank you.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)