[ https://issues.apache.org/jira/browse/DRILL-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chun Chang closed DRILL-2803. ----------------------------- Assignee: Chun Chang (was: Jacques Nadeau) verified fix {code} 0: jdbc:drill:schema=dfs.drillTestDirAdvanced> select t.id, t.nul, hash64(t.nul, hash64(t.id)) as hash_value from `complex.json` t where t.nul is null order by t.id limit 10; +------------+------------+------------+ | id | nul | hash_value | +------------+------------+------------+ | 2 | null | 2087691676675520095 | | 7 | null | 1124679897442190205 | | 8 | null | 6873016763837711824 | | 9 | null | 5689101538750744972 | | 14 | null | 3054377820775748426 | | 15 | null | 7945380393500328936 | | 18 | null | 3527854738750367914 | | 20 | null | 3939724015981009335 | | 23 | null | 6367888492325306020 | | 25 | null | 1786113389955931412 | +------------+------------+------------+ 10 rows selected (13.679 seconds) 0: jdbc:drill:schema=dfs.drillTestDirAdvanced> select * from sys.version; +------------+----------------+-------------+-------------+------------+ | commit_id | commit_message | commit_time | build_email | build_time | +------------+----------------+-------------+-------------+------------+ | 57a96d200e12c0efcad3f3ca9d935c42647234b1 | DRILL-2083: Fix bug in merging receiver | 27.04.2015 @ 17:12:13 EDT | Unknown | 27.04.2015 @ 23:19:52 EDT | +------------+----------------+-------------+-------------+------------+ {code} add a new test case - complex320.q > Severe skew due to null values in columns even when other columns are non-null > ------------------------------------------------------------------------------ > > Key: DRILL-2803 > URL: https://issues.apache.org/jira/browse/DRILL-2803 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill > Affects Versions: 0.8.0 > Reporter: Aman Sinha > Assignee: Chun Chang > Fix For: 0.9.0 > > Attachments: DRILL-2803.patch > > > If you have 2 columns that are hashed (either for distribution or for hash > based operators) and one of those columns has lots of null values, it can > result in substantial skew even if the other column has non-null values. > In the following query the combined hash value of 2 columns is 0 even when 1 > column is non-null. The reason is that if the starting value is null (for > cr_reason_sk all values are null in the above query), it does not matter what > seed is passed in. The hash function treats the second parameter as a seed > and not as a combiner, so it gets ignored. > {code} > select cr_call_center_sk, cr_reason_sk, hash64(cr_reason_sk, > hash64(cr_call_center_sk)) as hash_value from catalog_returns where > cr_reason_sk is null and cr_call_center_sk is not null limit 10; > +-------------------+--------------+------------+ > | cr_call_center_sk | cr_reason_sk | hash_value | > +-------------------+--------------+------------+ > | 1 | null | 0 | > | 1 | null | 0 | > | 4 | null | 0 | > | 1 | null | 0 | > | 4 | null | 0 | > | 2 | null | 0 | > | 2 | null | 0 | > | 2 | null | 0 | > | 2 | null | 0 | > | 2 | null | 0 | > +-------------------+--------------+------------+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)