Igor Kabiljo created HIVE-3541:
----------------------------------

             Summary: Allow keeping the bucket order while streaming bucketed table
                 Key: HIVE-3541
                 URL: https://issues.apache.org/jira/browse/HIVE-3541
             Project: Hive
          Issue Type: Improvement
            Reporter: Igor Kabiljo
            Priority: Minor


If we have a bucketed table, for example table_a with columns col_key and 
col_value (bucketed on col_key), and we need to create a new derived bucketed 
table (for example, by SELECT col_key, col_value*2 FROM table_a), it would be 
fastest to do this in a single streaming map-only job. 

By specifying:
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
we can make sure that each input bucket is read by exactly one mapper, and 
that each mapper writes exactly one output file. With:
SET hive.merge.mapfiles = false;
SET hive.merge.mapredfiles = false;
SET hive.enforce.bucketing = false;
we can make sure those files are inserted as-is into the output table. 
But with these settings the bucket order is not kept, so the resulting table 
is not bucketed correctly.
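
For reference, the whole scenario can be put together roughly as follows. This 
is only a sketch: the column types, the bucket count of 32, and the target 
table name table_b are assumptions made for illustration and are not part of 
the original report.

```sql
-- Source table, bucketed on col_key (types and bucket count are assumed).
CREATE TABLE table_a (col_key INT, col_value INT)
CLUSTERED BY (col_key) INTO 32 BUCKETS;

-- Derived table with the same bucketing (table_b is a hypothetical name).
CREATE TABLE table_b (col_key INT, col_value INT)
CLUSTERED BY (col_key) INTO 32 BUCKETS;

-- One mapper per input bucket file.
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Do not merge the mapper output files.
SET hive.merge.mapfiles = false;
SET hive.merge.mapredfiles = false;

-- Do not add a reduce stage to enforce bucketing; keep the job map-only.
SET hive.enforce.bucketing = false;

-- Map-only streaming insert. The output files are moved into table_b as-is,
-- but their order is not guaranteed to match the input bucket order.
INSERT OVERWRITE TABLE table_b
SELECT col_key, col_value * 2 FROM table_a;
```

The job produces one output file per input bucket, but nothing ties the i-th 
output file to the i-th input bucket, which is the problem described above.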

