[ https://issues.apache.org/jira/browse/HIVE-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deepak Jaiswal updated HIVE-18350:
----------------------------------
    Attachment: HIVE-18350.14.patch

> load data should rename files consistent with insert statements
> ---------------------------------------------------------------
>
>                 Key: HIVE-18350
>                 URL: https://issues.apache.org/jira/browse/HIVE-18350
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Deepak Jaiswal
>            Assignee: Deepak Jaiswal
>            Priority: Major
>         Attachments: HIVE-18350.1.patch, HIVE-18350.10.patch,
> HIVE-18350.11.patch, HIVE-18350.12.patch, HIVE-18350.13.patch,
> HIVE-18350.14.patch, HIVE-18350.2.patch, HIVE-18350.3.patch,
> HIVE-18350.4.patch, HIVE-18350.5.patch, HIVE-18350.6.patch,
> HIVE-18350.7.patch, HIVE-18350.8.patch, HIVE-18350.9.patch
>
>
> Insert statements create files whose names end with 0000_0, 0001_0, etc.
> However, load data keeps the input file name. This results in an
> inconsistent naming convention, which makes SMB joins difficult in some
> scenarios and may cause trouble for other types of queries in the future.
> We need a consistent naming convention.
> For a non-bucketed table, Hive renames all the files regardless of how they
> were named by the user.
> For a bucketed table, Hive relies on the user to name the files to match
> the bucket in non-strict mode. Hive assumes that all the data in a file
> belongs to the same bucket. In strict mode, loading a bucketed table is
> disabled.
> This will likely affect most of the tests that load data, which is a
> significant change, so the work is further divided into two subtasks for a
> smoother merge.
> For existing tables in a customer database, reloading bucketed tables is
> recommended. Otherwise, if the customer runs an SMB join and there is a
> bucket for which there is no split, incorrect results are possible.
> However, this is not a regression, as it would happen even without the
> patch.
> With this patch, however, and after reloading the data, the results should
> be correct.
> For non-bucketed tables and external tables, there is no difference in
> behavior and reloading data is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)