[jira] [Commented] (HIVE-10016) Remove duplicated Hive table schema parsing in DataWritableReadSupport
[ https://issues.apache.org/jira/browse/HIVE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495723#comment-14495723 ]

Dong Chen commented on HIVE-10016:

Thanks for working on the branch, [~spena]!

I am uploading the patch, but I ran into a problem. When I rebase the latest patch, 'HIVE-10016.patch' (which targets trunk), onto the 'parquet' branch, a merge conflict occurs, because the branch is about one month behind trunk.

Which do you prefer?
- Sync the branch with trunk first, and then update the patch. (If so, I will rebase the latest patch after the branch is synced.)
- Merge all the patches first, and then sync with trunk and resolve the conflicts together. (If so, patch 'HIVE-10016.1-parquet.patch' is OK to commit now.)

Remove duplicated Hive table schema parsing in DataWritableReadSupport

Key: HIVE-10016
URL: https://issues.apache.org/jira/browse/HIVE-10016
Project: Hive
Issue Type: Sub-task
Reporter: Dong Chen
Assignee: Dong Chen
Attachments: HIVE-10016-parquet.patch, HIVE-10016.1-parquet.patch, HIVE-10016.patch

In {{DataWritableReadSupport.init()}}, the table schema is created and its string form is set in conf. When the {{ParquetRecordReaderWrapper}} is constructed, the schema is fetched from conf and parsed several times. We could remove this repeated schema parsing and speed up getRecordReader a bit.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
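The idea in the issue description can be illustrated with a minimal Java sketch. It does not use the real Hive or Parquet classes: `Schema`, `parse`, `ReadSupport`, the `getHiveTableSchema()` getter, and the conf key are hypothetical stand-ins, and only the field name `hiveTableSchema` comes from this thread. The sketch counts how often the schema string is parsed when each consumer reads it back from conf, compared with when the object built in `init()` is reused.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins only; none of these are Hive or Parquet classes.
class SchemaReuseSketch {
    /** Stand-in for a parsed schema object. */
    static final class Schema {
        private final String text;
        Schema(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    /** Number of times the (assumed costly) parse step has run. */
    static int parseCalls = 0;

    /** Stand-in for parsing a schema from its string form. */
    static Schema parse(String text) {
        parseCalls++;
        return new Schema(text);
    }

    /** Stand-in for the read-support object that builds the table schema. */
    static final class ReadSupport {
        static final String SCHEMA_KEY = "sketch.table.schema"; // hypothetical key
        private Schema hiveTableSchema;

        /** Builds the schema object and also publishes its string form in conf. */
        void init(Map<String, String> conf) {
            hiveTableSchema = new Schema("message table_schema { optional int32 id; }");
            conf.put(SCHEMA_KEY, hiveTableSchema.toString());
        }

        /** Hypothetical getter that lets callers reuse the already built object. */
        Schema getHiveTableSchema() { return hiveTableSchema; }
    }

    /** Returns how many parses happen for the given number of schema consumers. */
    static int countParses(boolean reuse, int consumers) {
        parseCalls = 0;
        Map<String, String> conf = new HashMap<>();
        ReadSupport readSupport = new ReadSupport();
        readSupport.init(conf);
        for (int i = 0; i < consumers; i++) {
            Schema schema = reuse
                ? readSupport.getHiveTableSchema()            // no parsing needed
                : parse(conf.get(ReadSupport.SCHEMA_KEY));    // re-parse every time
            if (schema == null) {
                throw new IllegalStateException("schema missing");
            }
        }
        return parseCalls;
    }

    public static void main(String[] args) {
        System.out.println("re-parse from conf: " + countParses(false, 3) + " parses");
        System.out.println("reuse built object: " + countParses(true, 3) + " parses");
    }
}
```

How much time this saves in practice depends on the schema size and on how many places read it; the thread itself only claims a small improvement.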
[jira] [Commented] (HIVE-10016) Remove duplicated Hive table schema parsing in DataWritableReadSupport
[ https://issues.apache.org/jira/browse/HIVE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494897#comment-14494897 ]

Sergio Peña commented on HIVE-10016:

[~dongc] Could you upload the patch that belongs to the 'parquet' branch so that I can commit it to parquet? Thanks.
[jira] [Commented] (HIVE-10016) Remove duplicated Hive table schema parsing in DataWritableReadSupport
[ https://issues.apache.org/jira/browse/HIVE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486835#comment-14486835 ]

Dong Chen commented on HIVE-10016:

The failed tests are not related to this patch. The patch has been rebased onto trunk and is ready to go.
[jira] [Commented] (HIVE-10016) Remove duplicated Hive table schema parsing in DataWritableReadSupport
[ https://issues.apache.org/jira/browse/HIVE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486617#comment-14486617 ]

Hive QA commented on HIVE-10016:

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12723827/HIVE-10016.patch

{color:red}ERROR:{color} -1 due to 15 failed/errored test(s), 8665 tests executed
*Failed tests:*
{noformat}
TestMinimrCliDriver-bucketmapjoin6.q-constprog_partitioner.q-infer_bucket_sort_dyn_part.q-and-1-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-external_table_with_space_in_location_path.q-infer_bucket_sort_merge.q-auto_sortmerge_join_16.q-and-1-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-groupby2.q-import_exported_table.q-bucketizedhiveinputformat.q-and-1-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-index_bitmap3.q-stats_counter_partitioned.q-temp_table_external.q-and-1-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-infer_bucket_sort_map_operators.q-join1.q-bucketmapjoin7.q-and-1-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-infer_bucket_sort_num_buckets.q-disable_merge_for_bucketing.q-uber_reduce.q-and-1-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-infer_bucket_sort_reducers_power_two.q-scriptfile1.q-scriptfile1_win.q-and-1-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-leftsemijoin_mr.q-load_hdfs_file_with_space_in_the_name.q-root_dir_external_table.q-and-1-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-list_bucket_dml_10.q-bucket_num_reducers.q-bucket6.q-and-1-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-load_fs2.q-file_with_header_footer.q-ql_rewrite_gbtoidx_cbo_1.q-and-1-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-parallel_orderby.q-reduce_deduplicate.q-ql_rewrite_gbtoidx_cbo_2.q-and-1-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-ql_rewrite_gbtoidx.q-smb_mapjoin_8.q - did not produce a TEST-*.xml file
TestMinimrCliDriver-schemeAuthority2.q-bucket4.q-input16_cc.q-and-1-more - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_acid
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_parquet_join
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3336/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/3336/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-3336/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 15 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12723827 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-10016) Remove duplicated Hive table schema parsing in DataWritableReadSupport
[ https://issues.apache.org/jira/browse/HIVE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376026#comment-14376026 ]

Sergio Peña commented on HIVE-10016:

+1. This is good, [~dongc]. Thanks for the patch.
[jira] [Commented] (HIVE-10016) Remove duplicated Hive table schema parsing in DataWritableReadSupport
[ https://issues.apache.org/jira/browse/HIVE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371002#comment-14371002 ]

Dong Chen commented on HIVE-10016:

Thanks for your review, [~Ferd]!
Yes, Parquet has a new instance there. The ReadSupport instance on the Hive side only provides some information needed to create the ParquetRecordReaderWrapper.
[jira] [Commented] (HIVE-10016) Remove duplicated Hive table schema parsing in DataWritableReadSupport
[ https://issues.apache.org/jira/browse/HIVE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371448#comment-14371448 ]

Sergio Peña commented on HIVE-10016:

Looks good, [~dongc]. Just a couple of small comments:

- In DataWritableRecordConverter.java
  Could you remove the imports that are no longer used?
  * import parquet.schema.MessageTypeParser;
  * import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;

- In DataWritableReadSupport.java
  I think the 'MessageType tableSchema' variable is not needed. What if we just assign the value to hiveTableSchema and use that variable in the rest of the block?

      MessageType tableSchema = new MessageType(TABLE_SCHEMA, typeListTable);
      hiveTableSchema = tableSchema;

  Could it be:

      hiveTableSchema = new MessageType(TABLE_SCHEMA, typeListTable);
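The second review suggestion can be tried out in isolation with the small Java sketch below. It is only an illustration: `MessageType` here is a stand-in class rather than `parquet.schema.MessageType`, the field list is a `List<String>` for simplicity, and the value of `TABLE_SCHEMA` is an assumption, since the thread shows only the constant's name. Both methods leave `hiveTableSchema` in the same state; the second one simply drops the intermediate local variable.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins only; not the actual Hive or Parquet classes.
class DirectAssignmentSketch {
    /** Stand-in for a message type; holds just a name and a field list. */
    static final class MessageType {
        final String name;
        final List<String> fields;
        MessageType(String name, List<String> fields) {
            this.name = name;
            this.fields = fields;
        }
    }

    // Assumed value; the review comment shows only the constant's name.
    static final String TABLE_SCHEMA = "table_schema";

    MessageType hiveTableSchema;

    /** Form quoted in the review: a local variable that is copied right away. */
    void assignViaLocal(List<String> typeListTable) {
        MessageType tableSchema = new MessageType(TABLE_SCHEMA, typeListTable);
        hiveTableSchema = tableSchema;
    }

    /** Suggested form: assign to the field directly. */
    void assignDirectly(List<String> typeListTable) {
        hiveTableSchema = new MessageType(TABLE_SCHEMA, typeListTable);
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("id", "name");

        DirectAssignmentSketch viaLocal = new DirectAssignmentSketch();
        viaLocal.assignViaLocal(columns);

        DirectAssignmentSketch direct = new DirectAssignmentSketch();
        direct.assignDirectly(columns);

        // Both forms produce a schema with the same name and fields.
        System.out.println(viaLocal.hiveTableSchema.name.equals(direct.hiveTableSchema.name));
        System.out.println(viaLocal.hiveTableSchema.fields.equals(direct.hiveTableSchema.fields));
    }
}
```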