[ 
https://issues.apache.org/jira/browse/SOLR-7090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205740#comment-15205740
 ] 

Erick Erickson commented on SOLR-7090:
--------------------------------------

Disclaimer: I only skimmed this patch and the patch for SOLR-7341, so take this 
with a grain of salt.

Both of these seem, from my limited review, to form a query against the "from"
collection, return some kind of representation of the matched docs, and then
apply those to the "to" query. What I'm wondering is whether this is really the
right way to go for these kinds of operations, or whether the Streaming
Aggregation process is better.

My concern is mostly that there's a fair bit of complexity here, and I'm very 
suspicious of the performance across large Solr collections, especially for the 
"from" collections.

I'd be reluctant to see this functionality go into Solr without some 
performance numbers. Since we're now regularly seeing Solr used with very large
corpora, I have to ask whether this is complexity we want to add (and then
support). I'd at least like to see what kinds of use-cases are solved by this 
functionality that aren't handled by Streaming Aggregation and/or whether we 
could implement this functionality with Streaming Aggregation instead.

The discussion changes if there are use-cases this functionality supports that
we can't implement with a Streaming Aggregation solution. I'd just like to see
them enumerated before we jump in with both feet.



> Cross collection join
> ---------------------
>
>                 Key: SOLR-7090
>                 URL: https://issues.apache.org/jira/browse/SOLR-7090
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Ishan Chattopadhyaya
>             Fix For: 5.2, master
>
>         Attachments: SOLR-7090-fulljoin.patch, SOLR-7090.patch
>
>
> Although SOLR-4905 supports joins across collections in Cloud mode, there are
> limitations: (i) the secondary collection must be replicated at each node
> where the primary collection has a replica, and (ii) the secondary collection
> must be singly sharded.
> This issue explores ideas/possibilities of cross collection joins, even 
> across nodes. This will be helpful for users who wish to maintain boosts or
> signals in a secondary, more frequently updated collection, and perform a
> query-time join of these boosts/signals with results from the primary
> collection.


