abstractdog commented on PR #6317:
URL: https://github.com/apache/hive/pull/6317#issuecomment-3932359415

   > @abstractdog and @ayushtkn, I wanted to follow up properly on both points 
raised here.
   > 
   > First, @abstractdog, thank you for correcting me on how `HashSet` works! I 
genuinely didn't realize it always computes `hashCode()` first before even 
getting to `equals()`. I was wrong to claim the Set check was "mostly just 
comparing memory addresses," and I really appreciate you taking the time to 
explain that clearly.
   > 
   > Regarding the self-join safety concern, I decided to actually debug this 
locally. I attached a debugger to a test run, put a breakpoint inside 
`configureJobConf`, and inspected the `aliasToPartnInfo` map while executing a 
self-join query:
   > 
   > ```sql
   > SELECT * FROM test t1 JOIN test t2 USING(a);
   > ```
   > 
   > When I expanded `aliasToPartnInfo` in the debugger, I could see two 
entries: one for alias `t1` and one for alias `t2`. Both `PartitionDesc` objects 
had their `tableDesc` field pointing to the same `@` object identity number in 
the debugger, confirming they are the same Java object instance in memory.
   > 
   > So, my original safety argument was wrong! I thought that a self-join 
might produce two distinct `TableDesc` instances with different column 
configurations, but that's not what happens. Hive reuses the exact same 
`TableDesc` instance for all aliases of the same underlying table.
   > 
   > Because of this, `Set<TableDesc>` and `Set<String>` behave identically in 
this scenario: both deduplicate correctly without skipping anything.
   > 
   > I am more than happy to switch to using `Set<String>` via 
`tableDesc.getTableName()` as you suggested. It is definitely lighter, and the 
behavior is exactly the same. I'll update the patch right away.
   
   Thanks a lot @hemanthumashankar0511 for the detailed analysis, I really 
appreciate it!
   I'd like to ask you to confirm one more scenario, or clarify something: 
you're using `tableDesc.getTableName()`. Does it return the fully qualified 
name, or just the table name? What if two tables with the same name are joined 
from different databases, like `db1.a JOIN db2.a`? There is a chance it's not a 
problem because they fall into separate `MapWork`s, but it's still worth a 
quick check. Thanks in advance!
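
The `db1.a JOIN db2.a` concern above can be sketched in isolation. The snippet below is only an illustration under stated assumptions: `TableRef`, `shortName()`, `qualifiedName()` and `countConfigured` are hypothetical stand-ins, not Hive classes or methods, and it says nothing about what `TableDesc.getTableName()` actually returns. It only shows how a `Set<String>` dedup key behaves in each case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a table descriptor; this is NOT Hive's TableDesc.
final class TableRef {
    final String db;
    final String table;

    TableRef(String db, String table) {
        this.db = db;
        this.table = table;
    }

    // Assumed "just the table name" variant of the key.
    String shortName() {
        return table;
    }

    // Assumed "fully qualified" variant of the key.
    String qualifiedName() {
        return db + "." + table;
    }
}

final class DedupKeyDemo {
    // Counts how many tables get "configured" when a table is skipped
    // whenever its key has already been seen.
    static int countConfigured(List<TableRef> tables, boolean useQualifiedName) {
        Set<String> seen = new HashSet<>();
        int configured = 0;
        for (TableRef t : tables) {
            String key = useQualifiedName ? t.qualifiedName() : t.shortName();
            // Set.add returns false if the key was already present.
            if (seen.add(key)) {
                configured++;
            }
        }
        return configured;
    }

    public static void main(String[] args) {
        List<TableRef> tables = Arrays.asList(
                new TableRef("db1", "a"),
                new TableRef("db2", "a"));

        // Unqualified key: db2.a collides with db1.a and is skipped.
        System.out.println(countConfigured(tables, false)); // prints 1
        // Qualified key: both tables are treated as distinct.
        System.out.println(countConfigured(tables, true));  // prints 2
    }
}
```

If the key turns out to be unqualified and both tables can land in the same `MapWork`, the second table would be skipped in this model; with a qualified key, or with separate `MapWork`s, the collision does not arise.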


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

