Re: Trying to understand cross-collection-join routing/hashing choices and behavior

Mikhail Khludnev Sun, 22 Jan 2023 11:17:14 -0800

Up^.

Hello!
Was there an answer?
Thanks


On Wed, Dec 21, 2022 at 9:38 PM Zack Kendall <[email protected]>
wrote:

> I'm trying to understand the cross-collection JOIN
> <
> https://solr.apache.org/guide/solr/latest/query-guide/join-query-parser.html#cross-collection-join
> >
> documentation,
> behavior, choices, and viability.
>
> *# Terminology language choice*
>
> """routerField - If the documents are routed to shards using the
> CompositeID router by the join field, then that field name should be
> specified in the configuration here. This will allow the parser to optimize
> the resulting HashRange query."""
>
> """routed - If true, the cross collection join query will use each shard’s
> hash range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but
> it depends on the local collection being routed by the to field. If this
> parameter is not specified, the cross collection join query will try to
> determine the correct value automatically."""
>
> *Question 1*: Why overload terminology like "route" when, AFAICT, these
> parameters do NOT route? Based on my reading of the code, all they do is add a
> hash_range fq parameter to the remote join query request. Filtering results
> is not routing, so this fosters confusion. Is there reasoning behind this
> or just happenstance?
>
> *# Implied vs Actual behavior*
>
> My reading of the code base is this: the hash_range parameter is always
> populated with the "fromField" value. The routerField is only used to check
> against the "toField" for equality to enable the hash_range parameter
> usage, and this is only done as a fallback if "routed" is not set.
>
> It's a little strange to me that "routerField" is not used as a router
> field, or even as a hash field. It is only used as a flag for "if a query
> is joining to THIS field then use hash_range filter on the fromField" (or
> at least that's how I read the code).
>
> *Question 2:* Is my reading of the code correct? Can we try to update the
> documentation to be more explicit about this?
>
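
The reading described above can be sketched in a few lines of Python. This only illustrates the behavior as the message describes it and is not code taken from Solr; the function names `use_hash_range` and `remote_hash_filter` and the shape of the returned dictionary are hypothetical.

```python
from typing import Optional, Tuple


def use_hash_range(routed: Optional[bool],
                   router_field: Optional[str],
                   to_field: str) -> bool:
    """Decide whether a hash-range filter is added to the remote request.

    Hypothetical helper: an explicit `routed` value wins; otherwise the
    decision falls back to comparing routerField with the join's to field.
    """
    if routed is not None:
        return routed
    return router_field is not None and router_field == to_field


def remote_hash_filter(from_field: str,
                       shard_range: Tuple[int, int],
                       enabled: bool) -> Optional[dict]:
    """Describe the filter: it always hashes the from field, never routerField."""
    if not enabled:
        return None
    lower, upper = shard_range
    return {"hash_field": from_field, "lower": lower, "upper": upper}


# routerField only acts as a flag; the hashed field is still the from field.
enabled = use_hash_range(routed=None, router_field="uploadedBy", to_field="uploadedBy")
print(remote_hash_filter("entity_id", (0x0000, 0x7FFF), enabled))
```

Note that `router_field` never appears in the resulting filter, which is the point the question raises.
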
>
> *# Routing*
>
> *Question 3:* Is there a reason why actual routing was not used? I'm not
> familiar with the Solr code base, but it seems like it'd be nicer to
> use existing routing behavior in this context instead of querying
> all and filtering results. This seems like it would need 2 things: First,
> the _route_ value from the current "local" request, and second, either the
> local client (like how solrj does) or the remote "/export" handler would
> need to recognize and handle this parameter. Is that obviously doable or
> not doable? Trying to understand why that approach wasn't taken originally.
>
>
> *# Hashing*
>
> Here is the behavior touted in the docs for HashRangeQueryParser
> <
> https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#hash-range-query-parser
> >
> .
> """In the cross collection join case, the hash range query parser is used
> to ensure that each shard only gets the set of join keys that would end up
> on that shard. This query parser uses the MurmurHash3_x86_32. This is the
> same as the default hashing for the default composite ID router in Solr."""
>
> The documentation mentions "CompositeID router", which we know is based on
> prefixes (split on "!") being hashed and routed with the first/top 16 bits
> of info (with the later 16 bits provided by the rest of the doc "id" on
> inserts).
>
> The CrossCollectionJoinQuery uses 16 bits from the current/local shard
> range, which seems fine and good. However, the HashRangeQuery appears to
> hash the entire field
> <
> https://github.com/apache/solr/blob/26195c82493422cb9d6d4bdf9d4452046e7b3f67/solr/core/src/java/org/apache/solr/search/join/HashRangeQuery.java#L116-L117
> >.
> So I'm struggling to understand how this would work, especially since the
> join field and the "route" field are sourced from the same value. Either
> the join field is a compositeId in which case the HashRangeQuery code
> appears to be invalid, as it would not hash "A!B" the same as the actual
> router would hash "A", or the join field is not a compositeId in which case
> for it to work it would have to be the exact value as the actual
> compositeId prefix field something like this doc: {"id":"A!B",
> "myJoinField": "A"}. (Or maybe using "router.field=myJoinField" works
> without the compositeId/"!" format?). And if the join field is not a
> compositeId, then the only thing you could join on is the broad category
> tenant/product/etc that is used as the compositeId prefix, which would
> severely limit the use-case of the plugin, preventing joins on something
> more akin to record-ids/foreign-keys, and only allowing you to narrow down
> the results by what you know ahead of time to cram into the "v=" query
> field.
>
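
The mismatch described above can be illustrated with a small Python sketch. It assumes a simplified two-part composite id ("prefix!rest") in which the prefix hash supplies the top 16 bits and the rest supplies the lower 16 bits, as the message describes; it ignores other id forms the real router may support. The helper names are hypothetical, and the hash is a plain MurmurHash3 x86 32-bit implementation with seed 0 that returns unsigned values (Java would treat them as signed ints).

```python
def _rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmurhash3_x86_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as an unsigned 32-bit integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    full = n - (n % 4)
    for i in range(0, full, 4):  # body: 4-byte little-endian blocks
        k = int.from_bytes(data[i:i + 4], "little")
        k = _rotl32((k * c1) & 0xFFFFFFFF, 15)
        k = (k * c2) & 0xFFFFFFFF
        h = _rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    tail = data[full:]
    if tail:  # remaining 1 to 3 bytes
        k = int.from_bytes(tail, "little")
        k = _rotl32((k * c1) & 0xFFFFFFFF, 15)
        h ^= (k * c2) & 0xFFFFFFFF
    h ^= n  # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def whole_field_hash(value: str) -> int:
    """Hash the entire field value, as the hash-range filter appears to do."""
    return murmurhash3_x86_32(value.encode("utf-8"))


def composite_route_hash(doc_id: str) -> int:
    """Simplified composite-id hash: prefix gives the top 16 bits, rest the low 16."""
    prefix, sep, rest = doc_id.partition("!")
    if not sep:
        return whole_field_hash(doc_id)
    hi = whole_field_hash(prefix) & 0xFFFF0000
    lo = whole_field_hash(rest) & 0x0000FFFF
    return hi | lo


for label, value in [("route hash of id 'A!B'", composite_route_hash("A!B")),
                     ("whole-field hash of 'A!B'", whole_field_hash("A!B")),
                     ("whole-field hash of 'A'", whole_field_hash("A"))]:
    print(f"{label}: 0x{value:08x}")
```

Under these assumptions, the route hash of "A!B" shares its top 16 bits with the whole-field hash of "A", while the whole-field hash of "A!B" is an unrelated value. That is why a join field holding only the prefix value can line up with shard ranges and a join field holding the full composite id generally cannot.
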
> *Question 4:* Not a specific question so much as "am I onto something here
> or am I missing something and off base?"
>
> Actually reading through the test code, now I see that my hypothesized "it
> could only work if router key and join field are the same value" is
> actually what is tested. The data is set up
> <
> https://github.com/apache/solr/blob/a18f5b3c7cf2ce3f4d1cd11288e82ba0f48f7dfd/solr/core/src/test/org/apache/solr/search/join/CrossCollectionJoinQueryTest.java#L128-L130
> > with
> product_id as the compositeId prefix. Then all the test queries
> <
> https://github.com/apache/solr/blob/a18f5b3c7cf2ce3f4d1cd11288e82ba0f48f7dfd/solr/core/src/test/org/apache/solr/search/join/CrossCollectionJoinQueryTest.java#L166-L217
> >
> are joins on another field with the same product_id value. So that explains
> how it can work.
>
> *# Alternative Use-Case*
> While I'm here I guess I'll fill in the use-case I was hoping for based on
> how we currently do local joins. We want to have two collections which both
> route on the same tenantId, whereas our join is on more of a foreign-key,
> as seen below.
>
> // Collection-1
> {
>     "id": "tenantId!abc",
>     "entity": "userUpload",
>     "entity_id": "abc",
>     "uploadedBy": "123"
> }
>
> // Collection-2
> {
>     "id": "tenantId!123",
>     "entity": "user",
>     "entity_id": "123",
>     "user_groups": ["xyz",...]
> }
>
> // Query Collection-1, join example adapted to crossCollection. This will
> // include user-upload documents that were uploaded by a user in group xyz.
> {!join method="crossCollection"
>   fromIndex="Collection-2" // remote
>   from="entity_id"  // remote
>   to="uploadedBy" // local
>   v="user_groups:xyz" // remote search filter
> }
>
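
The annotated query above is not valid as written, since the "//" comments are not part of the local-params syntax. A minimal Python sketch of assembling the same join as a single parameter value follows. The helper name `cross_collection_join` is hypothetical, and sending the join as an `fq` alongside `q=*:*` is an assumption made for illustration; the join parameters themselves are the ones from the example.

```python
from urllib.parse import urlencode


def cross_collection_join(from_index: str, from_field: str,
                          to_field: str, remote_query: str) -> str:
    """Build the {!join ...} local-params string for a cross-collection join."""
    return ('{!join method="crossCollection" '
            f'fromIndex="{from_index}" from="{from_field}" to="{to_field}" '
            f'v="{remote_query}"}}')


join = cross_collection_join(
    from_index="Collection-2",     # remote collection
    from_field="entity_id",        # remote join field
    to_field="uploadedBy",         # local join field
    remote_query="user_groups:xyz",  # remote search filter
)

# Assumption: the join is applied as a filter query on Collection-1.
params = urlencode({"q": "*:*", "fq": join})
print(join)
print(params)
```
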
> This join query works locally and we wish it would work remotely,
> cross-collection, but it appears incompatible with the current
> routing/hashing behavior of the plugin.
>
> At this point I have worked through it enough that I understand how it
> currently works, and even rereading the docs it kinda makes more sense now
> like the information was there the whole time, but I think this is still
> worth raising for awareness and discussion. I don't currently have the
> need/time to update the plugin to expand its behavior. But I might be able
> to update the documentation to make it more clear so that others don't go
> through the same rollercoaster and deep dive that I've gone through.
>
> Thanks a bunch for any assistance or information regarding this!
>
> - Zack
>


-- 
Sincerely yours
Mikhail Khludnev
