[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033713#comment-17033713 ]

Sahil Takiar commented on HIVE-14165:
-------------------------------------

Marking as unassigned as I am no longer working on this. IIRC this speedup only applies to very simple queries - e.g. select / project queries.

> Remove Hive file listing during split computation
> -------------------------------------------------
>
>                 Key: HIVE-14165
>                 URL: https://issues.apache.org/jira/browse/HIVE-14165
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Abdullah Yousufi
>            Priority: Major
>         Attachments: HIVE-14165.02.patch, HIVE-14165.03.patch, HIVE-14165.04.patch, HIVE-14165.05.patch, HIVE-14165.06.patch, HIVE-14165.07.patch, HIVE-14165.patch
>
> The Hive-side listing in FetchOperator.java is unnecessary, since Hadoop's FileInputFormat.java will list the files during split computation anyway to determine their size. One way to remove this is to catch the InvalidInputException thrown by FileInputFormat#getSplits() on the Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
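The fix the description proposes - skip the upfront listing and instead treat the "input path does not exist" failure from split computation as an empty result - can be sketched in plain Java. This is a self-contained illustration of the control flow only: `InvalidInputException` and `getSplits` here are local stand-ins for illustration, not Hadoop's actual classes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SplitSketch {
    // Local stand-in for Hadoop's InvalidInputException (illustration only).
    static class InvalidInputException extends Exception {
        InvalidInputException(String msg) { super(msg); }
    }

    // Stand-in for split computation: it lists files itself, the way
    // FileInputFormat#getSplits does, and fails when there is no input.
    static List<String> getSplits(List<String> existingFiles) throws InvalidInputException {
        if (existingFiles.isEmpty()) {
            throw new InvalidInputException("Input path does not exist");
        }
        return existingFiles; // one split per file, for simplicity
    }

    // Proposed flow: no pre-listing on the Hive side; a missing/empty
    // input path is converted into "no splits" instead of an error.
    static List<String> computeSplits(List<String> files) {
        try {
            return getSplits(files);
        } catch (InvalidInputException e) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(computeSplits(Arrays.asList("part-00000", "part-00001")).size()); // prints 2
        System.out.println(computeSplits(Collections.<String>emptyList()).size());           // prints 0
    }
}
```

The point of the sketch is that only one listing happens (inside split computation), whereas the old code listed once in FetchOperator and then again inside getSplits.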
[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033607#comment-17033607 ]

Steve Loughran commented on HIVE-14165:
---------------------------------------

What is the current status of this? Is it a de facto WONTFIX? Or is someone keeping the patch up to date?
[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804833#comment-16804833 ]

t oo commented on HIVE-14165:
-----------------------------

Gentle ping.
[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15941966#comment-15941966 ]

Pengcheng Xiong commented on HIVE-14165:
----------------------------------------

Hello, I am deferring this to Hive 3.0 as we are going to cut the first RC and it is not marked as a blocker. Please feel free to commit to the branch if this can be resolved before the release.
[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937749#comment-15937749 ]

Hive QA commented on HIVE-14165:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12860035/HIVE-14165.07.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.
{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 10509 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[comments] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts] (batchId=29)
org.apache.hive.hcatalog.pig.TestOrcHCatLoader.testReadMissingPartitionBasicNeg (batchId=175)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/4304/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/4304/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-4304/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12860035 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15806158#comment-15806158 ]

Vihang Karajgaonkar commented on HIVE-14165:
--------------------------------------------

Thanks for the patch [~stakiar]. It seems the previous implementation ignored zero-length files when computing the splits, while FileInputFormat.getSplits() creates an empty split for each zero-length file. I am not sure how this impacts execution; it may be worthwhile to test. Also, if needed, you could ignore the empty splits before adding them to {{FetchInputFormatSplit[] inputSplit}}.
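The filtering suggested above could look like the following self-contained sketch. `SplitInfo` is a hypothetical stand-in for Hive's `FetchInputFormatSplit`, which carries far more than a path and a length; only the zero-length check matters here.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptySplitFilter {
    // Hypothetical stand-in for a split: just a path and a byte length.
    static class SplitInfo {
        final String path;
        final long length;
        SplitInfo(String path, long length) { this.path = path; this.length = length; }
    }

    // Keep only splits that cover at least one byte, mirroring the old
    // behavior of skipping zero-length files during the Hive-side listing.
    static SplitInfo[] dropEmptySplits(SplitInfo[] splits) {
        List<SplitInfo> nonEmpty = new ArrayList<>();
        for (SplitInfo s : splits) {
            if (s.length > 0) {
                nonEmpty.add(s);
            }
        }
        return nonEmpty.toArray(new SplitInfo[0]);
    }

    public static void main(String[] args) {
        SplitInfo[] in = {
            new SplitInfo("part-0", 1024),
            new SplitInfo("_SUCCESS", 0),   // zero-length marker file
            new SplitInfo("part-1", 2048)
        };
        System.out.println(dropEmptySplits(in).length); // prints 2
    }
}
```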
[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767455#comment-15767455 ]

Sahil Takiar commented on HIVE-14165:
-------------------------------------

[~poeppt] just attached an RB. I agree we shouldn't make backwards-incompatible changes to Hive. Let me know what you think of the RB.

There are some alternatives to this approach though:
* The file listing could be done in the background, by a dedicated thread
* Listing could be done eagerly rather than lazily, so that the file listing does not block the fetch operator

This would offer a good speedup, but would require the same number of metadata operations against S3.
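The first alternative - handing the listing to a dedicated thread so the fetch operator only blocks when it actually needs the result - might be sketched like this. Self-contained illustration only: `listFiles` is a hypothetical stand-in for the slow S3 directory listing, and the paths are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BackgroundListing {
    // Hypothetical stand-in for the slow (e.g. S3) directory listing.
    static List<String> listFiles(String dir) {
        return Arrays.asList(dir + "/part-00000", dir + "/part-00001");
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Kick off the listing eagerly, before the fetch operator needs it...
        Future<List<String>> pending =
            pool.submit(() -> listFiles("s3a://bucket/table/dt=2016-12-20"));
        // ...do other setup work here, then block only once splits are required.
        List<String> files = pending.get();
        System.out.println(files.size()); // prints 2
        pool.shutdown();
    }
}
```

As the comment notes, this hides the latency behind other work but issues exactly the same number of S3 metadata calls; it is an overlap optimization, not an elimination.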
[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767035#comment-15767035 ]

Thomas Poepping commented on HIVE-14165:
----------------------------------------

Hi Sahil,

When you update the patch, can you create a new ReviewBoard submission?

WRT the InputFormat issue, my feeling is that we should steer away from backwards-incompatible changes. Is there no way we can avoid the backwards-incompatible change, but still avoid the unnecessary list?

I will be able to provide more targeted feedback once the RB submission has been updated.
[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15766615#comment-15766615 ]

Hive QA commented on HIVE-14165:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12844199/HIVE-14165.06.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.
{color:red}ERROR:{color} -1 due to 15 failed/errored test(s), 10825 tests executed

*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=234)
TestVectorizedColumnReaderBase - did not produce a TEST-*.xml file (likely timed out) (batchId=251)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dbtxnmgr_showlocks] (batchId=71)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts] (batchId=29)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[str_to_map] (batchId=58)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[transform_ppr2] (batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[stats_based_fetch_decision] (batchId=151)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exchange_partition_neg_incomplete_partition] (batchId=84)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_00_unsupported_schema] (batchId=85)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query36] (batchId=222)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query70] (batchId=222)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query86] (batchId=222)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vector_count_distinct] (batchId=105)
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadMissingPartitionBasicNeg[3] (batchId=171)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2671/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2671/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2671/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 15 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12844199 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15766220#comment-15766220 ]

Hive QA commented on HIVE-14165:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12844177/HIVE-14165.05.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.
{color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 10825 tests executed

*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=234)
TestVectorizedColumnReaderBase - did not produce a TEST-*.xml file (likely timed out) (batchId=251)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dbtxnmgr_showlocks] (batchId=71)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts] (batchId=29)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[str_to_map] (batchId=58)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[transform_ppr2] (batchId=135)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[stats_based_fetch_decision] (batchId=151)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exchange_partition_neg_incomplete_partition] (batchId=84)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_00_unsupported_schema] (batchId=85)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query36] (batchId=222)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query70] (batchId=222)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query86] (batchId=222)
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadMissingPartitionBasicNeg[3] (batchId=171)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/2667/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/2667/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-2667/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 14 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12844177 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765927#comment-15765927 ]

Hive QA commented on HIVE-14165:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12844165/HIVE-14165.04.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.
{color:red}ERROR:{color} -1 due to 74 failed/errored test(s), 10825 tests executed

*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=234)
TestVectorizedColumnReaderBase - did not produce a TEST-*.xml file (likely timed out) (batchId=251)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[authorization_1_sql_std] (batchId=40)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_const] (batchId=16)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[concat_op] (batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dbtxnmgr_showlocks] (batchId=71)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_precision2] (batchId=47)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dynpart_sort_opt_bucketing] (batchId=78)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts] (batchId=29)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[float_equality] (batchId=24)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[interval_alt] (batchId=3)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[reset_conf] (batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[select_dummy_source] (batchId=21)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[str_to_map] (batchId=58)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamp_date_only] (batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamp_literal] (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_add_months] (batchId=58)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_decrypt] (batchId=49)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_encrypt] (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftleft] (batchId=18)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftright] (batchId=70)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftrightunsigned] (batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bround] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_cbrt] (batchId=75)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_chr] (batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_crc32] (batchId=2)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_current_database] (batchId=61)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_add] (batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_format] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_sub] (batchId=2)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_decode] (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_factorial] (batchId=76)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_format_number] (batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_from_utc_timestamp] (batchId=76)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_get_json_object] (batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_last_day] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_levenshtein] (batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask] (batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_first_n] (batchId=70)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_hash] (batchId=26)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_last_n] (batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_first_n] (batchId=4)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_last_n] (batchId=51)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_md5] (batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_months_between] (batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_nullif] (batchId=77)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_quarter] (batchId=41)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_replace] (batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha1] (batchId=6)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha2] (batchId=11)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_soundex] (batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_substring_index]
[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765737#comment-15765737 ]

Sahil Takiar commented on HIVE-14165:
-------------------------------------

Assigning to myself as [~ayousufi] is no longer working on this issue.

I played around with this patch and found a similar speedup for a simple {{select * from s3_partitioned_table}} query, where {{s3_partitioned_table}} has 500 partitions, all stored on S3 (each partition contains a CSV file of ~80 KB). Performance improves by about 2x.

The only problem I see with this patch is that it is technically a backwards-incompatible change. Hive allows any custom {{InputFormat}} to be registered for a table, or for a partition. Before this patch, Hive guaranteed that the {{Path}} set in {{mapred.input.dir}} would always exist and would always contain files of non-zero length. After this patch, the given {{Path}} may not exist, or may just be empty. This patch adds handling for {{FileInputFormat}}s, but given that a user can register any custom {{InputFormat}} with a table, it's possible some user queries may break. I'm not sure how much of an issue this is; technically the {{InputFormat}} API makes no claim about whether a given {{Path}} should exist or be non-empty.

Also need to add some tests for this patch.
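The hazard described above can be made concrete: once the input path may be missing or empty, any split-computation code written against the old guarantee has to check both conditions itself. Below is a minimal self-contained sketch using plain java.nio, not Hive's actual code; `countInputFiles` is a hypothetical stand-in for a custom getSplits-style method.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class DefensiveListing {
    // Under the old behavior the input dir was guaranteed to exist and hold
    // non-empty files; a defensive implementation must now verify both.
    static long countInputFiles(String inputDir) throws IOException {
        Path dir = Paths.get(inputDir);
        if (!Files.isDirectory(dir)) {
            return 0; // the path may no longer exist after the patch
        }
        try (Stream<Path> entries = Files.list(dir)) {
            // may legitimately be 0 now: the directory can be empty
            return entries.filter(Files::isRegularFile).count();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countInputFiles("/nonexistent/input/dir")); // prints 0
    }
}
```

A custom InputFormat that skips these checks and assumes at least one non-empty file is exactly the kind of code the patch could break.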
[jira] [Commented] (HIVE-14165) Remove Hive file listing during split computation
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429389#comment-15429389 ] Hive QA commented on HIVE-14165:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12824670/HIVE-14165.03.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.
{color:red}ERROR:{color} -1 due to 61 failed/errored test(s), 10470 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_mapjoin]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[authorization_1_sql_std]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_const]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_precision2]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dynpart_sort_opt_bucketing]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[float_equality]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[select_dummy_source]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamp_literal]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_add_months]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_decrypt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_encrypt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftleft]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftright]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftrightunsigned]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bround]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_cbrt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_chr]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_crc32]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_current_database]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_add]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_format]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_sub]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_decode]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_factorial]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_format_number]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_from_utc_timestamp]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_get_json_object]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_last_day]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_levenshtein]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_first_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_hash]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_last_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_first_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_last_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_md5]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_months_between]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_quarter]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_replace]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha1]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha2]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_soundex]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_substring_index]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_to_utc_timestamp]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_trunc]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_version]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[hybridgrace_hashjoin_1]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[hybridgrace_hashjoin_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[load_dyn_part1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[load_dyn_part2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[select_dummy_source]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[transform_ppr1]
org.apache.hive.beeline.TestBeeLineWithArgs.testConnectionWithURLParams
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadMissingPartitionBasicNeg[3]
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testConnectionSchemaAPIs
org.apache.hive.jdbc.TestJdbcWithMiniLlap.testLlapInputFormatEndToEnd
org.apache.hive.jdbc.TestJdbcWithMiniLlap.testNonAsciiStrings
org.apache.hive.service.cli.operation.TestOperationLoggingLay
{noformat}
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428725#comment-15428725 ] Hive QA commented on HIVE-14165:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12824588/HIVE-14165.02.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.
{color:red}ERROR:{color} -1 due to 62 failed/errored test(s), 10441 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[authorization_1_sql_std]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_const]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_precision2]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dynpart_sort_opt_bucketing]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[float_equality]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[select_dummy_source]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamp_literal]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_add_months]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_decrypt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_encrypt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftleft]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftright]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftrightunsigned]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bround]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_cbrt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_chr]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_crc32]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_current_database]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_add]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_format]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_sub]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_decode]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_factorial]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_format_number]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_from_utc_timestamp]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_get_json_object]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_last_day]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_levenshtein]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_first_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_hash]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_last_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_first_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_last_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_md5]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_months_between]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_quarter]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_replace]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha1]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha2]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_soundex]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_substring_index]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_to_utc_timestamp]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_trunc]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_version]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[hybridgrace_hashjoin_1]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[hybridgrace_hashjoin_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[load_dyn_part1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[select_dummy_source]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[transform_ppr1]
org.apache.hive.beeline.TestBeeLineWithArgs.testConnectionWithURLParams
org.apache.hive.beeline.TestBeeLineWithArgs.testEmbeddedBeelineOutputs
org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadMissingPartitionBasicNeg[3]
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testConnectionSchemaAPIs
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testSelectThriftSerializeInTasks
org.apache.hive.jdbc.TestJdbcWithMiniLlap.testLlapInputFormatEndToEnd
org.apach
{noformat}
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427148#comment-15427148 ] Steve Loughran commented on HIVE-14165:

The faster listStatus only applies to a recursive listing; if you are listing a single directory, it takes the same time as before.

> Remove Hive file listing during split computation
> -------------------------------------------------
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
> Issue Type: Sub-task
> Affects Versions: 2.1.0
> Reporter: Abdullah Yousufi
> Assignee: Abdullah Yousufi
> Attachments: HIVE-14165.patch
>
> The Hive-side listing in FetchOperator.java is unnecessary, since Hadoop's FileInputFormat.java will list the files during split computation anyway to determine their size. One way to remove this is to catch the InvalidInputException thrown by FileInputFormat#getSplits() on the Hive side instead of doing the file listing beforehand.
> For S3 select queries on partitioned tables, this results in a 2x speedup.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
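The pattern the issue description proposes can be sketched in a few lines. This is a simplified, self-contained illustration only: the `getSplits` and `InvalidInputException` below are stand-ins for the real `org.apache.hadoop.mapred` classes, and the "filesystem" is just a map from path to file names.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Sketch of the idea in the description: instead of Hive listing files up
// front to check that a path is non-empty, call getSplits() directly and
// treat an "invalid input" failure as the empty-input case. The classes
// here are simplified stand-ins, not the real Hadoop API.
class SplitFetchSketch {
    static class InvalidInputException extends RuntimeException {}

    // Stand-in for FileInputFormat#getSplits: the real one lists the input
    // paths itself (to size the splits) and throws when nothing matches.
    static List<String> getSplits(Map<String, List<String>> fs, String path) {
        List<String> files = fs.get(path);
        if (files == null || files.isEmpty()) throw new InvalidInputException();
        return files; // pretend each file becomes exactly one split
    }

    // Before: Hive performed its own listing to decide whether to call
    // getSplits. After: call it unconditionally and map failure to "no splits",
    // so only one listing (the one inside getSplits) ever happens.
    static List<String> splitsOrEmpty(Map<String, List<String>> fs, String path) {
        try {
            return getSplits(fs, path);
        } catch (InvalidInputException e) {
            return Collections.emptyList();
        }
    }
}
```

The win on S3 is that each avoided listing is one or more remote HTTP round trips, not a cheap local metadata call.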
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426846 ] Abdullah Yousufi commented on HIVE-14165:

I believe when Hive calls getSplits() it's actually using {{org.apache.hadoop.mapred.FileInputFormat}}. And is the updated listStatus faster in the non-recursive case as well? If not, I don't think it makes sense to pass the recursive flag as true, since Hive is only interested in the files at the top level of the path: it currently calls getSplits() for each partition. However, if Hive were changed to call getSplits() on the root directory in the partitioned case, then listStatus(recursive) would make sense. I decided against that change because I was not sure how best to handle partition elimination. For example, if the query selects a single partition from a table, a recursive listStatus on the root directory would be slower than a plain listStatus on that single partition.

Also, Qubole mentions the following, which may be something to pursue in the future:

{quote}
"we modified split computation to invoke listing at the level of the parent directory. This call returns all files (and their sizes) in all subdirectories in blocks of 1000. Some subdirectories and files may not be of interest to the job/query, e.g. partition elimination may have eliminated some of them. We take advantage of the fact that file listing is in lexicographic order and perform a modified merge join of the list of files and list of directories of interest."
{quote}

When you mentioned earlier that Hadoop grabs 5000 objects at a time, does that include files in subdirectories?
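The merge join Qubole describes above can be sketched as a single pass over two sorted lists: the flat, lexicographically ordered file listing of the table root, and the sorted set of partition prefixes that survived elimination. This is an illustrative pure-Java sketch, not Qubole's or Hive's actual code; paths are plain strings and partitions are directory prefixes.

```java
import java.util.ArrayList;
import java.util.List;

// Modified merge join of a sorted recursive file listing against the sorted
// partition directories of interest: one pass, no per-partition list calls.
class PartitionMergeJoin {
    // files: full sorted listing, e.g. "ds=2016-01-01/part-00000"
    // parts: sorted surviving partition prefixes, e.g. "ds=2016-01-01/"
    static List<String> filesOfInterest(List<String> files, List<String> parts) {
        List<String> out = new ArrayList<>();
        int f = 0, p = 0;
        while (f < files.size() && p < parts.size()) {
            String file = files.get(f), prefix = parts.get(p);
            if (file.startsWith(prefix)) {
                out.add(file);                  // file lives in a wanted partition
                f++;
            } else if (file.compareTo(prefix) < 0) {
                f++;                            // file sorts before current partition: skip it
            } else {
                p++;                            // past this partition: advance to the next prefix
            }
        }
        return out;
    }
}
```

The point of the lexicographic-order trick is that eliminated partitions are skipped without issuing any further listing calls; the cost is one paged listing of the whole table root, which only pays off when most partitions survive elimination.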
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426708#comment-15426708 ] Hive QA commented on HIVE-14165:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12824278/HIVE-14165.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.
{color:red}ERROR:{color} -1 due to 63 failed/errored test(s), 10426 tests executed

*Failed tests:*
{noformat}
TestVectorTimestampExpressions - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[authorization_1_sql_std]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_const]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_precision2]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[dynpart_sort_opt_bucketing]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[float_equality]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[select_dummy_source]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamp_literal]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_add_months]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_decrypt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_aes_encrypt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftleft]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftright]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bitwise_shiftrightunsigned]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_bround]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_cbrt]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_chr]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_crc32]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_current_database]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_add]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_format]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_date_sub]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_decode]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_factorial]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_format_number]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_from_utc_timestamp]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_get_json_object]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_last_day]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_levenshtein]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_first_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_hash]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_last_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_first_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_mask_show_last_n]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_md5]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_months_between]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_quarter]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_replace]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha1]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_sha2]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_soundex]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_substring_index]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_to_utc_timestamp]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_trunc]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_version]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[hybridgrace_hashjoin_1]
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[hybridgrace_hashjoin_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[load_dyn_part1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[load_dyn_part2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[select_dummy_source]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[transform_ppr1]
org.apache.hive.beeline.TestBeeLineWithArgs.testConnectionWithURLParams
org.apache.hive.beeline.TestBeeLineWithArgs.testEmbeddedBeelineOutputs
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadMissingPartitionBasicNeg[3]
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testConnectionSchemaAPIs
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testSelectThriftSerializeInTasks
org.apache.hive.jdbc.TestJdbcWithMiniLlap.testLlapInp
{noformat}
[ https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426139#comment-15426139 ] Steve Loughran commented on HIVE-14165:

Which {{FileInputFormat}} are you using? If it is {{org.apache.hadoop.mapreduce.lib.input.FileInputFormat}}, we could look at switching it to {{listStatus(recursive)}} and pick up the HADOOP-13208 speedup.
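A back-of-envelope model shows why the recursive listing discussed in this thread matters on S3: listing per partition costs one remote call per partition, while a recursive listing at the table root pages through all files in fixed-size batches (S3 returns keys in pages, commonly 1000 per request). The model and numbers are illustrative only, not measurements from Hive or HADOOP-13208.

```java
// Rough cost model, counting remote list requests only: with many small
// partitions, per-partition listing scales with the partition count, while
// paged recursive listing scales with total file count / page size.
class ListCallModel {
    // Non-recursive scheme: one listStatus per partition directory.
    static int perPartitionCalls(int partitions) {
        return partitions;
    }

    // Recursive scheme: paged flat listing of the whole table root.
    static int recursiveCalls(int totalFiles, int pageSize) {
        return (totalFiles + pageSize - 1) / pageSize; // ceiling division
    }
}
```

For example, 500 partitions with ~10 files each would be 500 list calls per-partition, but only ceil(5000/1000) = 5 paged calls recursively; the trade-off reverses when partition elimination leaves only a handful of partitions, as Abdullah notes above.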