[ https://issues.apache.org/jira/browse/HIVE-23889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161891#comment-17161891 ]
László Bodor edited comment on HIVE-23889 at 7/21/21, 12:03 PM:
----------------------------------------------------------------

This has been solved as part of HIVE-22538:
https://github.com/apache/hive/commit/964f08ae733b037c6e58dfb4ed149ccad2d3ddc0#diff-bb969e858664d98848960a801fd58b5cR579

I'm closing this as a duplicate, but we can use this ticket for "tracking" the fix of the schema issues:
{code}
-        OrcFile.WriterOptions wo = OrcFile.writerOptions(this.options.getConfiguration())
-            .inspector(rowInspector)
-            .callback(new OrcRecordUpdater.KeyIndexBuilder("testEmpty"));
-        OrcFile.createWriter(path, wo).close();
+        OrcFile.createWriter(path, writerOptions).close();
{code}

The patch is attached as [^HIVE-23889.01.patch]


> Empty bucket files are inserted with invalid schema after HIVE-21784
> --------------------------------------------------------------------
>
>                 Key: HIVE-23889
>                 URL: https://issues.apache.org/jira/browse/HIVE-23889
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: HIVE-23889.01.patch
>
>
> HIVE-21784 uses a new WriterOptions instead of the field in OrcRecordUpdater:
> https://github.com/apache/hive/commit/f62379ba279f41b843fcd5f3d4a107b6fcd04dec#diff-bb969e858664d98848960a801fd58b5cR580-R583
> In this scenario, the overwrite creates an empty bucket file, which is
> fine, as that was the intention of that patch, but it creates the file
> with an invalid schema:
> {code}
> CREATE TABLE test_table (
>    cda_id             int,
>    cda_run_id         varchar(255),
>    cda_load_ts        timestamp,
>    global_party_id    string)
> PARTITIONED BY (
>    cda_date           int,
>    cda_job_name       varchar(12))
> CLUSTERED BY (cda_id)
> INTO 2 BUCKETS
> STORED AS ORC;
> INSERT OVERWRITE TABLE test_table PARTITION (cda_date = 20200601 ,
> cda_job_name = 'core_base')
> SELECT 1 as cda_id,'cda_run_id' as cda_run_id, NULL as cda_load_ts,
> 'global_party_id' global_party_id
> UNION ALL
> SELECT 2 as cda_id,'cda_run_id' as cda_run_id, NULL as cda_load_ts,
> 'global_party_id' global_party_id;
> ALTER TABLE test_table ADD COLUMNS (group_id string) CASCADE ;
> INSERT OVERWRITE TABLE test_table PARTITION (cda_date = 20200601 ,
> cda_job_name = 'core_base')
> SELECT 1 as cda_id,'cda_run_id' as cda_run_id, NULL as cda_load_ts,
> 'global_party_id' global_party_id, 'group_id' as group_id;
> {code}
> Because of HIVE-21784, the new empty bucket_00000 shows this schema in the
> orc dump:
> {code}
> Type:
> struct<_col0:int,_col1:varchar(255),_col2:timestamp,_col3:string,_col4:string>
> {code}
> instead of:
> {code}
> Type:
> struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<cda_id:int,cda_run_id:varchar(255),cda_load_ts:timestamp,global_party_id:string,group_id:string>>
> {code}
> This could lead to problems later, when Hive tries to look into the file
> during split generation.
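
The difference between the two orc dump types quoted in the description can be checked mechanically. What follows is only a minimal sketch in plain Java that works on the type strings as printed in this ticket. The names `AcidSchemaCheck`, `topLevelFieldNames` and `looksLikeAcidSchema` are hypothetical and are not part of Hive or ORC, and the list of expected top-level fields is simply copied from the expected type in the description, not taken from Hive's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Hive or ORC API: compares the top-level field
// names of a struct type string against the wrapper fields shown in the
// "expected" orc dump quoted in this ticket.
class AcidSchemaCheck {

    // Assumption: field list copied from the expected type in the description.
    static final List<String> EXPECTED_FIELDS = Arrays.asList(
        "operation", "originalTransaction", "bucket",
        "rowId", "currentTransaction", "row");

    // Returns the names of the top-level fields of a "struct<...>" type string.
    static List<String> topLevelFieldNames(String type) {
        String t = type.trim();
        if (!t.startsWith("struct<") || !t.endsWith(">")) {
            throw new IllegalArgumentException("not a struct type: " + type);
        }
        String body = t.substring("struct<".length(), t.length() - 1);
        List<String> names = new ArrayList<>();
        int depth = 0; // nesting level inside <...> or (...)
        int start = 0; // start index of the current top-level field
        for (int i = 0; i <= body.length(); i++) {
            // Treat the end of the string as a final separator.
            char c = i < body.length() ? body.charAt(i) : ',';
            if (c == '<' || c == '(') {
                depth++;
            } else if (c == '>' || c == ')') {
                depth--;
            } else if (c == ',' && depth == 0) {
                String field = body.substring(start, i);
                int colon = field.indexOf(':');
                if (colon > 0) {
                    names.add(field.substring(0, colon).trim());
                }
                start = i + 1;
            }
        }
        return names;
    }

    // True when the top-level fields match the expected wrapper fields exactly.
    static boolean looksLikeAcidSchema(String type) {
        return topLevelFieldNames(type).equals(EXPECTED_FIELDS);
    }

    public static void main(String[] args) {
        String emptyBucket =
            "struct<_col0:int,_col1:varchar(255),_col2:timestamp,_col3:string,_col4:string>";
        String expected =
            "struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,"
            + "currentTransaction:bigint,row:struct<cda_id:int,cda_run_id:varchar(255),"
            + "cda_load_ts:timestamp,global_party_id:string,group_id:string>>";
        System.out.println(looksLikeAcidSchema(emptyBucket)); // prints false
        System.out.println(looksLikeAcidSchema(expected));    // prints true
    }
}
```

A check like this only compares names in a dumped type string; it does not read ORC files, so it is a way to spot the symptom in dump output and not a substitute for the fix in the attached patch.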