[ https://issues.apache.org/jira/browse/HIVE-13697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Oleksiy Sayankin updated HIVE-13697:
------------------------------------
    Status: Patch Available  (was: In Progress)

ROOT-CAUSE:
toLowerCase() operator while getting skewed values from AST Node in BaseSemanticAnalyzer. Hence Skewed Values are stored lower case only.

{code}
hive> desc formatted testskew2;
OK
# col_name              data_type               comment

id                      int
a                       string

# Detailed Table Information
Database:               default
Owner:                  hdfs
CreateTime:             Thu May 12 18:37:20 EEST 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs:/user/hive/warehouse/testskew2
Table Type:             MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime   1463067440

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Stored As SubDirectories:       Yes
Skewed Columns:         [a]
Skewed Values:          [[aus], [us]]   <---- !!! ERROR !!!
Storage Desc Params:
        serialization.format    1
{code}

SOLUTION:
Remove unnecessary toLowerCase() operator.

> ListBucketing feature does not support uppercase string.
> --------------------------------------------------------
>
>                 Key: HIVE-13697
>                 URL: https://issues.apache.org/jira/browse/HIVE-13697
>             Project: Hive
>          Issue Type: Bug
>          Components: Database/Schema
>    Affects Versions: 1.2.1
>         Environment: 1.2.1
>            Reporter: Hao Zhu
>            Assignee: Oleksiy Sayankin
>            Priority: Critical
>         Attachments: HIVE-13697.1.patch
>
>
> This is the feature:
> https://cwiki.apache.org/confluence/display/Hive/ListBucketing
> 1. Good example:
> {code}
> CREATE TABLE testskew (id INT, a STRING)
> SKEWED BY (a) ON ('abc', 'xyz') STORED AS DIRECTORIES;
> set hive.mapred.supports.subdirectories=true;
> set mapred.input.dir.recursive=true;
> INSERT OVERWRITE TABLE testskew
> SELECT 123,'abc' FROM dual
> union all
> SELECT 123,'xyz' FROM dual
> union all
> SELECT 123,'others' FROM dual;
> {code}
> {code}
> # hadoop fs -ls /user/hive/warehouse/testskew
> Found 3 items
> drwxrwxrwx   - mapr mapr          1 2016-05-05 14:56 /user/hive/warehouse/testskew/HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME
> drwxrwxrwx   - mapr mapr          1 2016-05-05 14:56 /user/hive/warehouse/testskew/a=abc
> drwxrwxrwx   - mapr mapr          1 2016-05-05 14:56 /user/hive/warehouse/testskew/a=xyz
> {code}
> This is good, because both "abc" and "xyz" directories got created.
> 2. Bad example -- This is the issue
> {code}
> CREATE TABLE testskew2 (id INT, a STRING)
> SKEWED BY (a) ON ('aus', 'US') STORED AS DIRECTORIES;
> set hive.mapred.supports.subdirectories=true;
> set mapred.input.dir.recursive=true;
> INSERT OVERWRITE TABLE testskew2
> SELECT 123, 'aus' FROM dual
> union all
> SELECT 123, 'US' FROM dual
> union all
> SELECT 123, 'others' FROM dual;
> {code}
> You can see, only "aus" directory got created...
> {code}
> # hadoop fs -ls /user/hive/warehouse/testskew2
> Found 2 items
> drwxrwxrwx   - mapr mapr          1 2016-05-05 15:11 /user/hive/warehouse/testskew2/HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME
> drwxrwxrwx   - mapr mapr          1 2016-05-05 15:11 /user/hive/warehouse/testskew2/a=aus
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
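
For readers who want to see why lowercasing the declared skewed values loses the 'US' directory, the root cause described in this update can be modelled in a few lines. This is only a sketch in Python, not Hive's Java code: parse_skewed_values, target_directory and directories_created are hypothetical helpers, and the assumption that a row value is matched against the stored skewed values by exact, case-sensitive comparison is inferred from the directory listings in the issue description.

```python
# Name of the catch-all directory, as it appears in the "hadoop fs -ls" output.
DEFAULT_DIR = "HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME"


def parse_skewed_values(literals, lowercase):
    """Hypothetical stand-in for reading skewed values out of the DDL.

    lowercase=True models the reported behaviour (values stored lower case);
    lowercase=False models the proposed fix (values stored as written).
    """
    return [v.lower() if lowercase else v for v in literals]


def target_directory(column, value, skewed_values):
    """Assumed routing rule: exact, case-sensitive match, else the default dir."""
    if value in skewed_values:
        return f"{column}={value}"
    return DEFAULT_DIR


def directories_created(column, row_values, skewed_values):
    """Distinct subdirectories that the inserted rows would end up in."""
    return sorted({target_directory(column, v, skewed_values) for v in row_values})


declared = ["aus", "US"]        # SKEWED BY (a) ON ('aus', 'US')
rows = ["aus", "US", "others"]  # values of column a in the INSERT

buggy = directories_created("a", rows, parse_skewed_values(declared, lowercase=True))
fixed = directories_created("a", rows, parse_skewed_values(declared, lowercase=False))

print(buggy)
print(fixed)
```

Under these assumptions the lowercased list is ['aus', 'us'], so the row value 'US' matches nothing and falls into the default directory, which is consistent with the two-item listing for testskew2. With the values kept as written, 'US' gets its own a=US directory.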