[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258751#comment-14258751 ]

Brock Noland commented on HIVE-9153:

Nice, I see the perf improvement but I don't get the changes to {{Utilities.getBaseWork}}?

Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

Key: HIVE-9153
URL: https://issues.apache.org/jira/browse/HIVE-9153
Project: Hive
Issue Type: Sub-task
Components: Spark
Affects Versions: spark-branch
Reporter: Brock Noland
Assignee: Rui Li
Attachments: HIVE-9153.1-spark.patch, screenshot.PNG

The default InputFormat is {{CombineHiveInputFormat}} and thus HOS uses this. However, Tez uses {{HiveInputFormat}}. Since tasks are relatively cheap in Spark, it might make sense for us to use {{HiveInputFormat}} as well. We should evaluate this on a query which has many input splits such as {{select count(\*) from store_sales where something is not null}}.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258760#comment-14258760 ]

Hive QA commented on HIVE-9153:

{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689107/HIVE-9153.1-spark.patch

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 7255 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample_islocalmode_hook
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_cast_constant
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_admin_almighty1
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_multi_single_reducer
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_windowing
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/589/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/589/console
Test logs: http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-SPARK-Build-589/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.
ATTACHMENT ID: 12689107 - PreCommit-HIVE-SPARK-Build
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258840#comment-14258840 ]

Hive QA commented on HIVE-9153:

{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689126/HIVE-9153.1-spark.patch

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 7255 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample_islocalmode_hook
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_cast_constant
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_admin_almighty1
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_windowing
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/590/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/590/console
Test logs: http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-SPARK-Build-590/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.
ATTACHMENT ID: 12689126 - PreCommit-HIVE-SPARK-Build
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258841#comment-14258841 ]

Xuefu Zhang commented on HIVE-9153:

Re: the {{Utilities.getBaseWork()}} changes, I suppose Rui is trying to clean up some redundant (useless) code. The changed code would be equivalent to the old one if {{name}} is the full path of the plan file on HDFS for non-local mode, which is very likely but needs to be confirmed.
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258885#comment-14258885 ]

Rui Li commented on HIVE-9153:

Hi [~brocknoland] and [~xuefuz], sorry, maybe I was being confusing. The patch here is to reduce the calls to {{Utilities.getBaseWork()}}, which is quite similar to HIVE-9127. The change to {{Utilities.getBaseWork()}} itself just removes redundant code:
{code}
Path localPath;
if (conf.getBoolean("mapreduce.task.uberized", false) && name.equals(REDUCE_PLAN_NAME)) {
  localPath = new Path(name);
} else if (ShimLoader.getHadoopShims().isLocalMode(conf)) {
  localPath = path;
} else {
  LOG.info("***non-local mode***");
  localPath = new Path(name);
}
localPath = path;
LOG.info("local path = " + localPath);
{code}
The if-else seems unnecessary because {{localPath = path}} is assigned unconditionally afterwards, which makes {{localPath}} itself redundant too. But I can revert this change if you feel uncertain about it. BTW, the patch should be a trunk patch; I'll upload a trunk version to test again.
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258897#comment-14258897 ]

Hive QA commented on HIVE-9153:

{color:red}Overall{color}: -1 no tests executed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689137/HIVE-9153.2.patch

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2195/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2195/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2195/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n /usr/java/jdk1.7.0_45-cloudera ]]
+ export JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ export PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-maven-3.0.5/bin:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-maven-3.0.5/bin:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ cd /data/hive-ptest/working/
+ tee /data/hive-ptest/logs/PreCommit-HIVE-TRUNK-Build-2195/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ svn = \s\v\n ]]
+ [[ -n '' ]]
+ [[ -d apache-svn-trunk-source ]]
+ [[ ! -d apache-svn-trunk-source/.svn ]]
+ [[ ! -d apache-svn-trunk-source ]]
+ cd apache-svn-trunk-source
+ svn revert -R .
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_9.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_4.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_12.q.out'
Reverted 'ql/src/test/results/clientpositive/stats_list_bucket.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_8.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_11.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_5.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_13.q.out'
Reverted 'ql/src/test/results/clientpositive/partitions_json.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_2.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_10.q.out'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_5.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_11.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_13.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_9.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_2.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_4.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_10.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_12.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_8.q'
Reverted 'ql/src/test/queries/clientpositive/stats_list_bucket.q'
Reverted 'ql/src/java/org/apache/hadoop/hive/ql/metadata/formatting/MapBuilder.java'
++ egrep -v '^X|^Performing status on external'
++ awk '{print $2}'
++ svn status --no-ignore
+ rm -rf target datanucleus.log ant/target shims/target shims/0.20S/target shims/0.23/target shims/aggregator/target shims/common/target shims/scheduler/target packaging/target hbase-handler/target testutils/target jdbc/target metastore/target itests/target itests/hcatalog-unit/target itests/test-serde/target itests/qtest/target itests/hive-unit-hadoop2/target itests/hive-minikdc/target itests/hive-unit/target itests/custom-serde/target itests/util/target hcatalog/target hcatalog/core/target hcatalog/streaming/target hcatalog/server-extensions/target hcatalog/hcatalog-pig-adapter/target hcatalog/webhcat/svr/target hcatalog/webhcat/java-client/target accumulo-handler/target hwi/target common/target common/src/gen service/target contrib/target serde/target beeline/target odbc/target cli/target ql/dependency-reduced-pom.xml ql/target ql/src/test/results/clientpositive/list_bucket_dml_9.q.java1.8.out ql/src/test/results/clientpositive/list_bucket_dml_13.q.java1.8.out
{noformat}
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258909#comment-14258909 ]

Rui Li commented on HIVE-9153:

The strange thing is that {{Utilities}} is different in trunk and the Spark branch, but it seems we have merged all the commits from trunk.
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258928#comment-14258928 ]

Hive QA commented on HIVE-9153:

{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689146/HIVE-9153.3.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 6722 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_list_bucket_dml_10
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2197/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2197/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2197/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12689146 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255650#comment-14255650 ]

Rui Li commented on HIVE-9153:

[~xuefuz] - I was wrong about turning off delay scheduling. Actually you can set {{spark.locality.wait}} to 0 to turn it off. I tried that, and parallelism doesn't drop during execution now. Besides, I found that Tez uses its own property to control the size of combined splits: {{tez.grouping.max-size}}, which defaults to 1G, while {{mapreduce.input.fileinputformat.split.maxsize}} defaults to less than 256M (the two properties differ a little in that {{mapreduce.input.fileinputformat.split.maxsize}} is more like a target size and {{tez.grouping.max-size}} is an upper bound, but they have a similar effect when the data size is big). So I changed {{mapreduce.input.fileinputformat.split.maxsize}} to 1G as well, and Spark now spawns 317 mappers for the previous test (332 for Tez). Spark finishes the query in 155s with these new settings.
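As a sketch, the two settings described above could be applied per-session from the Hive CLI or Beeline; the property names come from the comment, while the concrete values below (0 and 1G expressed in bytes) are illustrative:

```sql
-- Turn off Spark's delay scheduling so parallelism doesn't drop mid-query
set spark.locality.wait=0;
-- Raise the combined-split target size to roughly match Tez's
-- tez.grouping.max-size default of 1G (value is in bytes)
set mapreduce.input.fileinputformat.split.maxsize=1073741824;
```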
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256082#comment-14256082 ]

Xuefu Zhang commented on HIVE-9153:

Thanks, [~lirui]. It seems that the smaller number of splits helped. However, this might be just one minor improvement, and there seem to be other factors to research, such as locality (HIVE-8722).
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253164#comment-14253164 ]

Rui Li commented on HIVE-9153:

I investigated a bit why {{CombineHiveInputFormat.getLocations}} might return null. {{CombineFileInputFormat}} first tries to combine blocks within each node to generate splits. These splits will have that single node as their preferred location. There's also a target size for each split ({{mapreduce.input.fileinputformat.split.maxsize}}), and {{CombineFileInputFormat}} will try to make sure each split reaches that size. Therefore, if the blocks left on a node don't add up to that size, they may be further combined at the rack level:
{code}
// haven't created any split on this machine. so its ok to add a
// smaller one for parallelism. Otherwise group it in the rack for
// balanced size
// create an input split and add it to the splits array
..
// Put the unplaced blocks back into the pool for later rack-allocation.
for (OneBlockInfo oneblock : validBlocks) {
  blockToNodes.put(oneblock, oneblock.hosts);
}
{code}
At the rack level, the preferred locations consist of all nodes in that rack. Since my cluster doesn't have a rack-to-node mapping, the preferred locations are null. Such tasks may slow down the query, but they should only make up a small portion of the total tasks. I'll look at how Tez combines the blocks.
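The node-then-rack combining behavior described above can be modeled with a small sketch. This is an illustrative toy, not the real Hadoop {{CombineFileInputFormat}}: the block sizes, node names, and the {{MAX_SPLIT_SIZE}} threshold are made up, and only the parts relevant here are modeled (node-local splits, the "smaller one for parallelism" case, and rack-level leftovers ending up with no preferred locations when there is no rack-to-node mapping):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of CombineFileInputFormat's node-first split combining.
public class CombineSketch {
    static final long MAX_SPLIT_SIZE = 256L; // stand-in for split.maxsize

    public static class Split {
        public final long size;
        public final List<String> locations; // null models "no preferred locations"
        Split(long size, List<String> locations) {
            this.size = size;
            this.locations = locations;
        }
    }

    public static List<Split> combine(Map<String, List<Long>> blocksPerNode) {
        List<Split> splits = new ArrayList<Split>();
        List<Long> rackPool = new ArrayList<Long>();
        for (Map.Entry<String, List<Long>> e : blocksPerNode.entrySet()) {
            long acc = 0;
            boolean madeSplitHere = false;
            List<Long> pending = new ArrayList<Long>();
            for (long b : e.getValue()) {
                acc += b;
                pending.add(b);
                if (acc >= MAX_SPLIT_SIZE) {
                    // Node-level split: preferred location is this single node.
                    splits.add(new Split(acc, Arrays.asList(e.getKey())));
                    acc = 0;
                    pending.clear();
                    madeSplitHere = true;
                }
            }
            if (!pending.isEmpty()) {
                if (madeSplitHere) {
                    // Leftovers go back to the pool for rack-level combining.
                    rackPool.addAll(pending);
                } else {
                    // "haven't created any split on this machine. so its ok
                    // to add a smaller one for parallelism."
                    splits.add(new Split(acc, Arrays.asList(e.getKey())));
                }
            }
        }
        // Rack-level combining: with no rack-to-node mapping configured,
        // the resulting split ends up with no preferred locations.
        long rest = 0;
        for (long b : rackPool) rest += b;
        if (rest > 0) splits.add(new Split(rest, null));
        return splits;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> blocks = new LinkedHashMap<String, List<Long>>();
        blocks.put("node1", Arrays.asList(128L, 128L, 64L)); // 256 -> node split, 64 left
        blocks.put("node2", Arrays.asList(100L));            // smaller split for parallelism
        for (Split s : combine(blocks)) {
            System.out.println(s.size + " @ " + s.locations);
        }
    }
}
```

In this toy run, node1's leftover 64 ends up in a rack-level split with null locations, which is exactly the case that can slow down Spark tasks scheduled without locality information.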
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251407#comment-14251407 ]

Rui Li commented on HIVE-9153:

I used our cluster B to test this. Results show that CombineHiveInputFormat still performs much better than HiveInputFormat for Spark. The test query is:
{code}
select count(*) from store_sales where ss_sold_date_sk is not null;
{code}
With CombineHiveInputFormat, Spark spawns 1252 mappers and the query finishes in about 180s, while HiveInputFormat requires 13559 mappers and the query finishes in about 700s. I didn't find out why Tez uses HiveInputFormat as the default. But for Tez, HiveInputFormat spawns 332 mappers while CombineHiveInputFormat spawns 1252, so I think Tez has its own way of combining the splits. With 332 mappers, Tez finishes the query in about 90s, and with 1252 mappers it takes about 120s.
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251597#comment-14251597 ]

Rui Li commented on HIVE-9153:

Judging from the results, I think fewer mappers can improve overall performance, which is true for both Spark and Tez. The problem is why Spark is 60s slower than Tez with the same number of mappers. One possible reason is that we don't have data locality with CombineHiveInputFormat, which is tracked by HIVE-8722. I also noticed that parallelism drops during execution (I'll attach a screenshot later). This may be due to Spark's delay scheduling mechanism, which attempts to schedule tasks with some locality first.
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251755#comment-14251755 ]

Xuefu Zhang commented on HIVE-9153:

Thanks for the findings, [~lirui]. I heard that the Spark snapshot we are using is 2X slower than the previous version. This might explain the slowness. Also, I think both the number of mappers and locality matter for speed, but the two may collide with each other. For instance, if we have more executors than mappers, it's desirable to have more map tasks. However, doing so might hurt locality because some mappers might read remotely. On the other hand, if there are more mappers than executors, then fewer mappers will help the speed. Anyway, it would be good to find out how Tez generates splits using HiveInputFormat. Also, we should fix HIVE-8722. Is there a way to disable Spark's delay scheduling to try out?
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252762#comment-14252762 ]

Rui Li commented on HIVE-9153:

Hi [~xuefuz] - if the Spark cluster is the same as the Hadoop cluster, i.e. each executor is also a datanode, the Spark task scheduler usually does a good job of making sure all mappers have some locality (on the condition, of course, that the mappers do specify a preferred location). In that case, more mappers won't hurt data locality.

bq. Is there a way to disable Spark's delay scheduling to try out?

The Spark task scheduler divides tasks into multiple lists according to locality level and attempts to launch tasks at the highest locality level when an executor offers resources. It may also wait some time before scheduling tasks at a lower level. I don't think there's a switch to turn it off. Actually, I'm not 100% sure it's delay scheduling causing the issue. If none of our tasks have a preferred location, the delay should happen at start-up (waiting for the allowed locality level to drop) but not during execution. I'll look more into this.
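The delay-scheduling idea described above can be sketched as follows. This is a deliberately simplified model, not Spark's actual {{TaskSetManager}} logic (Spark resets its wait timer per level, among other differences); the locality level names mirror Spark's, the 3s default for {{spark.locality.wait}} is real, but the linear fallback below is invented for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of delay scheduling: prefer the highest locality level,
// and only accept lower levels after waiting long enough.
public class DelaySchedulingSketch {
    // Locality levels from best to worst, as in Spark.
    static final List<String> LEVELS =
        Arrays.asList("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY");

    // Returns the worst locality level the scheduler is currently willing
    // to accept, given how long it has been waiting and the per-level wait
    // (what spark.locality.wait controls; 0 disables the delay entirely).
    static String allowedLevel(long waitedMs, long localityWaitMs) {
        if (localityWaitMs <= 0) return "ANY"; // locality.wait=0: no delay
        int idx = (int) Math.min(waitedMs / localityWaitMs, LEVELS.size() - 1);
        return LEVELS.get(idx);
    }

    public static void main(String[] args) {
        long wait = 3000; // spark.locality.wait defaults to 3s
        System.out.println(allowedLevel(0, wait));     // fresh offer: best level only
        System.out.println(allowedLevel(4000, wait));  // waited past one level
        System.out.println(allowedLevel(10000, wait)); // waited long enough for ANY
        System.out.println(allowedLevel(0, 0));        // delay disabled
    }
}
```

This also shows why setting {{spark.locality.wait}} to 0, as mentioned earlier in the thread, effectively turns the mechanism off: every offer is immediately allowed at the ANY level.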
[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252874#comment-14252874 ]

Rui Li commented on HIVE-9153:

I think we actually can get location info with {{CombineHiveInputFormat}}. I verified this by running some tests on my machine, and most tasks have NODE_LOCAL locality (some have PROCESS_LOCAL because {{CombineHiveInputFormat.getLocations}} returns null). BTW, Spark prints PROCESS_LOCAL for tasks that have no preferred locations, which may be a bug:
{code}
if (TaskLocality.isAllowed(maxLocality, TaskLocality.NO_PREF)) {
  // Look for noPref tasks after NODE_LOCAL for minimize cross-rack traffic
  for (index <- dequeueTaskFromList(execId, pendingTasksWithNoPrefs)) {
    return Some((index, TaskLocality.PROCESS_LOCAL, false))
  }
}
{code}