Hey all, I’m on Hive 1.2.2 at work and I found some unfavorable behavior on one of my joins and I wanted to see what you all think. Below is an example: https://github.secureserver.net/gist/rkleck/258a9a7b3dd3c915f94a53234e422a1a
WITH string_key_setup AS ( SELECT CAST('1234 ' AS STRING) AS my_key UNION ALL SELECT CAST('1234' AS STRING) AS my_key ) , group_setup AS ( SELECT my_key AS my_key FROM string_key_setup GROUP BY my_key ) , string_key AS ( SELECT CAST('1234' AS STRING) AS my_key ) , integer_key AS ( SELECT CAST('1234' AS INT) AS my_key ) SELECT 'String To String Join' AS join_type , COUNT(1) AS num_rows FROM string_key t1 JOIN group_setup t2 ON t1.my_key = t2.my_key UNION ALL SELECT 'Integer To String Join' AS join_type , COUNT(1) AS num_rows FROM integer_key t1 JOIN group_setup t2 ON t1.my_key = t2.my_key ; This query returns the following: join_type num_rows Integer To String Join 2 String To String Join 1 I feel it’s unfavorable because the GROUP BY is not TRIMming the extra space around the string, but when we do a join against an integer it does trim the space. This query should not produce multiple rows. In my opinion, there should be another check when comparing string to int to make sure the size of the string and integer are the same (so in this example the row with key ‘1234 ‘ will be filtered out). Furthermore, the original query that produces these dupes has MANY more joins. I could fix by CASTing/TRIMing, but it would require me to know all the data types for the columns in the tables involved in the join (and maybe casting a string to int will lose some rows and you can’t TRIM an INT). Thoughts? Ryan Kleck Data Engineer IV Advanced Analytics 480-505-8800 xt. 4024 This email message and any attachments hereto are intended for use only by its intended recipient(s) and may contain confidential information. If you have received this email in error, please immediately notify the sender and permanently delete the original and any copy of this message and its attachments.