[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-25 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258751#comment-14258751
 ] 

Brock Noland commented on HIVE-9153:


Nice, I see the perf improvement, but I don't follow the changes to 
{{Utilities.getBaseWork}}. Could you explain them?

 Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]
 -

 Key: HIVE-9153
 URL: https://issues.apache.org/jira/browse/HIVE-9153
 Project: Hive
  Issue Type: Sub-task
  Components: Spark
Affects Versions: spark-branch
Reporter: Brock Noland
Assignee: Rui Li
 Attachments: HIVE-9153.1-spark.patch, screenshot.PNG


 The default InputFormat is {{CombineHiveInputFormat}} and thus HOS uses this. 
 However, Tez uses {{HiveInputFormat}}. Since tasks are relatively cheap in 
 Spark, it might make sense for us to use {{HiveInputFormat}} as well. We 
 should evaluate this on a query which has many input splits such as {{select 
 count(\*) from store_sales where something is not null}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-25 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258760#comment-14258760
 ] 

Hive QA commented on HIVE-9153:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689107/HIVE-9153.1-spark.patch

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 7255 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample_islocalmode_hook
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_cast_constant
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_admin_almighty1
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_groupby_multi_single_reducer
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_windowing
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/589/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/589/console
Test logs: 
http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-SPARK-Build-589/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12689107 - PreCommit-HIVE-SPARK-Build



[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-25 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258840#comment-14258840
 ] 

Hive QA commented on HIVE-9153:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689126/HIVE-9153.1-spark.patch

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 7255 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample_islocalmode_hook
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_cast_constant
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_authorization_admin_almighty1
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_windowing
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/590/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-SPARK-Build/590/console
Test logs: 
http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-SPARK-Build-590/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12689126 - PreCommit-HIVE-SPARK-Build



[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-25 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258841#comment-14258841
 ] 

Xuefu Zhang commented on HIVE-9153:
---

Re: the {{Utilities.getBaseWork()}} changes, I suppose Rui is trying to clean 
up some redundant (useless) code. The changed code would be equivalent to the 
old one if {{name}} is the full path of the plan file on HDFS in non-local 
mode, which is very likely but needs to be confirmed.



[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-25 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258885#comment-14258885
 ] 

Rui Li commented on HIVE-9153:
--

Hi [~brocknoland] and [~xuefuz],

Sorry, maybe I was unclear. The patch here reduces the number of calls to 
{{Utilities.getBaseWork()}}, which is quite similar to HIVE-9127. The changes 
to {{Utilities.getBaseWork()}} itself just remove redundant code:
{code}
Path localPath;
if (conf.getBoolean("mapreduce.task.uberized", false)
    && name.equals(REDUCE_PLAN_NAME)) {
  localPath = new Path(name);
} else if (ShimLoader.getHadoopShims().isLocalMode(conf)) {
  localPath = path;
} else {
  LOG.info("***non-local mode***");
  localPath = new Path(name);
}
localPath = path;
LOG.info("local path = " + localPath);
{code}
It seems the if-else chain is unnecessary because {{localPath = path}} is 
assigned unconditionally afterwards, which makes {{localPath}} redundant too. 
But I can revert this change if you feel uncertain about it.
BTW, the patch should be a trunk patch. I'll upload a trunk version to test 
again.
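
To illustrate why the branches look redundant, here is a minimal, self-contained 
sketch. It is not the Hive code: plain strings stand in for Hadoop's {{Path}}, 
the two boolean parameters are hypothetical stand-ins for the uberized and 
local-mode checks, and the value of {{REDUCE_PLAN_NAME}} is assumed for 
illustration. It also assumes those checks have no side effects.

```java
// Sketch only: Strings stand in for org.apache.hadoop.fs.Path, and the boolean
// parameters are hypothetical stand-ins for the configuration checks.
class LocalPathSketch {
  static final String REDUCE_PLAN_NAME = "reduce.xml"; // assumed value, illustration only

  // Same shape as the old code: three branches followed by an overwrite.
  static String oldLocalPath(boolean uberized, boolean localMode, String name, String path) {
    String localPath;
    if (uberized && name.equals(REDUCE_PLAN_NAME)) {
      localPath = name;
    } else if (localMode) {
      localPath = path;
    } else {
      localPath = name;
    }
    localPath = path; // unconditional overwrite: the branch result is never used
    return localPath;
  }

  // Simplified form: the result is always path.
  static String newLocalPath(String path) {
    return path;
  }

  public static void main(String[] args) {
    boolean[] flags = {false, true};
    for (boolean uberized : flags) {
      for (boolean localMode : flags) {
        String oldResult = oldLocalPath(uberized, localMode, "reduce.xml", "/tmp/plan/reduce.xml");
        // Prints whether both versions agree for this flag combination.
        System.out.println(oldResult.equals(newLocalPath("/tmp/plan/reduce.xml")));
      }
    }
  }
}
```

Under those assumptions the old and simplified forms agree for every 
combination of the two flags, which is the sense in which the if-else is dead 
code.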



[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-25 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258897#comment-14258897
 ] 

Hive QA commented on HIVE-9153:
---



{color:red}Overall{color}: -1 no tests executed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689137/HIVE-9153.2.patch

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2195/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2195/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2195/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit 
status 1 and output '+ [[ -n /usr/java/jdk1.7.0_45-cloudera ]]
+ export JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ JAVA_HOME=/usr/java/jdk1.7.0_45-cloudera
+ export 
PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-maven-3.0.5/bin:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ 
PATH=/usr/java/jdk1.7.0_45-cloudera/bin/:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-maven-3.0.5/bin:/usr/local/apache-maven-3.0.5/bin:/usr/java/jdk1.7.0_45-cloudera/bin:/usr/local/apache-ant-1.9.1/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hiveptest/bin
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost 
-Dhttp.proxyPort=3128'
+ M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost 
-Dhttp.proxyPort=3128'
+ cd /data/hive-ptest/working/
+ tee /data/hive-ptest/logs/PreCommit-HIVE-TRUNK-Build-2195/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ svn = \s\v\n ]]
+ [[ -n '' ]]
+ [[ -d apache-svn-trunk-source ]]
+ [[ ! -d apache-svn-trunk-source/.svn ]]
+ [[ ! -d apache-svn-trunk-source ]]
+ cd apache-svn-trunk-source
+ svn revert -R .
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_9.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_4.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_12.q.out'
Reverted 'ql/src/test/results/clientpositive/stats_list_bucket.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_8.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_11.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_5.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_13.q.out'
Reverted 'ql/src/test/results/clientpositive/partitions_json.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_2.q.out'
Reverted 'ql/src/test/results/clientpositive/list_bucket_dml_10.q.out'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_5.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_11.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_13.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_9.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_2.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_4.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_10.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_12.q'
Reverted 'ql/src/test/queries/clientpositive/list_bucket_dml_8.q'
Reverted 'ql/src/test/queries/clientpositive/stats_list_bucket.q'
Reverted 
'ql/src/java/org/apache/hadoop/hive/ql/metadata/formatting/MapBuilder.java'
++ egrep -v '^X|^Performing status on external'
++ awk '{print $2}'
++ svn status --no-ignore
+ rm -rf target datanucleus.log ant/target shims/target shims/0.20S/target 
shims/0.23/target shims/aggregator/target shims/common/target 
shims/scheduler/target packaging/target hbase-handler/target testutils/target 
jdbc/target metastore/target itests/target itests/hcatalog-unit/target 
itests/test-serde/target itests/qtest/target itests/hive-unit-hadoop2/target 
itests/hive-minikdc/target itests/hive-unit/target itests/custom-serde/target 
itests/util/target hcatalog/target hcatalog/core/target 
hcatalog/streaming/target hcatalog/server-extensions/target 
hcatalog/hcatalog-pig-adapter/target hcatalog/webhcat/svr/target 
hcatalog/webhcat/java-client/target accumulo-handler/target hwi/target 
common/target common/src/gen service/target contrib/target serde/target 
beeline/target odbc/target cli/target ql/dependency-reduced-pom.xml ql/target 
ql/src/test/results/clientpositive/list_bucket_dml_9.q.java1.8.out 
ql/src/test/results/clientpositive/list_bucket_dml_13.q.java1.8.out 

[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-25 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258909#comment-14258909
 ] 

Rui Li commented on HIVE-9153:
--

The strange thing is that {{Utilities}} is different in trunk and the spark 
branch, even though it seems we have merged all the commits from trunk.



[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-25 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258928#comment-14258928
 ] 

Hive QA commented on HIVE-9153:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689146/HIVE-9153.3.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 6722 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_list_bucket_dml_10
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2197/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2197/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2197/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12689146 - PreCommit-HIVE-TRUNK-Build



[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-22 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255650#comment-14255650
 ] 

Rui Li commented on HIVE-9153:
--

[~xuefuz] - I was wrong about not being able to turn off delay scheduling. 
Actually you can set {{spark.locality.wait}} to 0 to turn it off. I tried that 
and parallelism no longer drops during execution.
Besides, I found that Tez uses its own property to control the size of combined 
splits: {{tez.grouping.max-size}}, which defaults to 1G, while 
{{mapreduce.input.fileinputformat.split.maxsize}} defaults to less than 256M 
(the two properties differ slightly in that 
{{mapreduce.input.fileinputformat.split.maxsize}} is more like a target size 
and {{tez.grouping.max-size}} is an upper bound, but they have a similar effect 
when the data size is big). So I changed 
{{mapreduce.input.fileinputformat.split.maxsize}} to 1G as well, and Spark now 
spawns 317 mappers for the previous test (332 for Tez). Spark finishes the 
query in 155s with these new settings.
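
For anyone who wants to reproduce this, the two settings can be written down 
as a small sketch. The property names come from the comment above. The rest is 
an assumption for illustration: 1G is read as 1024^3 bytes, the helper names 
are hypothetical, and the values are shown as session-level {{set}} statements.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: collects the two tuned properties and renders them as
// "set key=value;" lines. Class and method names are hypothetical.
class TunedSettingsSketch {
  // Assumption: "1G" is interpreted as 1024^3 bytes.
  static final long ONE_GIB = 1024L * 1024L * 1024L;

  static Map<String, String> tunedSettings() {
    Map<String, String> settings = new LinkedHashMap<>(); // keeps insertion order
    settings.put("spark.locality.wait", "0"); // no waiting for a better locality level
    settings.put("mapreduce.input.fileinputformat.split.maxsize", Long.toString(ONE_GIB));
    return settings;
  }

  static String asSetStatements(Map<String, String> settings) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, String> entry : settings.entrySet()) {
      sb.append("set ").append(entry.getKey()).append('=')
        .append(entry.getValue()).append(";\n");
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.print(asSetStatements(tunedSettings()));
  }
}
```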



[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-22 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256082#comment-14256082
 ] 

Xuefu Zhang commented on HIVE-9153:
---

Thanks, [~lirui]. It seems that the smaller number of splits helped. However, 
this might be just one minor improvement, and there seem to be other factors to 
research, such as locality (HIVE-8722).



[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-19 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253164#comment-14253164
 ] 

Rui Li commented on HIVE-9153:
--

I investigated why {{CombineHiveInputFormat.getLocations}} might return null.
{{CombineFileInputFormat}} first tries to combine blocks within each node to 
generate splits. These splits have that single node as their preferred 
location. There's also a target size for each split 
({{mapreduce.input.fileinputformat.split.maxsize}}), and 
{{CombineFileInputFormat}} tries to make sure each split reaches that size. 
Therefore, if the blocks left on a node don't add up to that size, they might 
be further combined at the rack level:
{code}
// haven't created any split on this machine. so its ok to add a
// smaller one for parallelism. Otherwise group it in the rack for
// balanced size create an input split and add it to the splits
// array
   ..
// Put the unplaced blocks back into the pool for later
// rack-allocation.
for (OneBlockInfo oneblock : validBlocks) {
  blockToNodes.put(oneblock, oneblock.hosts);
}
{code}
At the rack level, the preferred locations consist of all nodes in that rack. 
Since my cluster doesn't have a rack-to-node mapping, the preferred locations 
are null. Such tasks may slow down the query, but they should only make up a 
small portion of the total tasks.
I'll look at how Tez combines the blocks.
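
The two-pass behaviour described above can be sketched roughly as follows. 
This is a simplified model, not the Hadoop implementation: every block is 
assumed to live on exactly one node, the "smaller split for parallelism" case 
from the quoted comment is left out, and the second pass stands in for rack 
level on a cluster with no rack-to-node mapping, so its splits carry no 
preferred location. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Simplified model of node-level, then rack-level, block combining.
class CombineSketch {
  // A combined split: total size plus preferred hosts (empty = no preference).
  static final class Split {
    final long size;
    final List<String> locations;

    Split(long size, List<String> locations) {
      this.size = size;
      this.locations = locations;
    }
  }

  // blocksByNode maps a node name to the sizes of the blocks stored on it.
  static List<Split> combine(Map<String, List<Long>> blocksByNode, long maxSize) {
    List<Split> splits = new ArrayList<>();
    List<Long> leftovers = new ArrayList<>();

    // Pass 1 (node level): emit a split once a node's blocks reach maxSize.
    for (Map.Entry<String, List<Long>> entry : blocksByNode.entrySet()) {
      long current = 0;
      List<Long> pending = new ArrayList<>();
      for (long block : entry.getValue()) {
        pending.add(block);
        current += block;
        if (current >= maxSize) {
          splits.add(new Split(current, Collections.singletonList(entry.getKey())));
          current = 0;
          pending.clear();
        }
      }
      leftovers.addAll(pending); // not enough left on this node for another split
    }

    // Pass 2 (stand-in for rack level): combine leftovers across nodes.
    // Without rack information these splits have no preferred location.
    long current = 0;
    for (long block : leftovers) {
      current += block;
      if (current >= maxSize) {
        splits.add(new Split(current, Collections.<String>emptyList()));
        current = 0;
      }
    }
    if (current > 0) {
      splits.add(new Split(current, Collections.<String>emptyList()));
    }
    return splits;
  }
}
```

For example, with blocks of 60, 60 and 30 on one node, a single block of 40 on 
another, and a maximum size of 100, the sketch yields one split of 120 pinned 
to the first node and one split of 70 with no preferred location. That second 
kind of split is the one that ends up without locality.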



[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-18 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251407#comment-14251407
 ] 

Rui Li commented on HIVE-9153:
--

I used our cluster B to test this. Results show that CombineHiveInputFormat 
still performs much better than HiveInputFormat for Spark. The test query is 
{code}select count(*) from store_sales where ss_sold_date_sk is not null;{code}
With CombineHiveInputFormat, Spark spawns 1252 mappers and the query finishes 
in about 180s, while HiveInputFormat requires 13559 mappers and the query 
finishes in about 700s.
I couldn't find why Tez uses HiveInputFormat by default. But for Tez, 
HiveInputFormat spawns 332 mappers while CombineHiveInputFormat spawns 1252, so 
I think Tez has its own way of combining the splits. With 332 mappers, Tez 
finishes the query in about 90s; with 1252 mappers, it takes about 120s.



[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-18 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251597#comment-14251597
 ] 

Rui Li commented on HIVE-9153:
--

Judging from the results, I think fewer mappers improve overall performance, 
which is true for both Spark and Tez. The open question is why Spark is 60s 
slower than Tez with the same number of mappers.
One possible reason is that we don't have data locality with 
CombineHiveInputFormat, which is tracked by HIVE-8722.
I also noticed that the parallelism drops during execution (I'll attach a 
screenshot later). This may be due to Spark's delay scheduling mechanism, which 
attempts to schedule tasks with some locality first.



[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-18 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251755#comment-14251755
 ] 

Xuefu Zhang commented on HIVE-9153:
---

Thanks for the findings, [~lirui]. I heard that the Spark snapshot we are using 
is 2X slower than the previous version. This might explain the slowness. Also, 
I think both the number of mappers and locality matter for speed, but the two 
may conflict with each other. For instance, if we have more executors than 
mappers, it's desirable to have more map tasks. However, doing so might impact 
locality because some mappers might read remotely. On the other hand, if there 
are more mappers than executors, then fewer mappers will help the speed.

Anyway, it would be good to find out how Tez generates splits using 
HiveInputFormat. Also, we should fix HIVE-8722. Is there a way to disable 
Spark's delayed schedule to try out?
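
The mappers-versus-executors trade-off above can be made concrete with a rough 
model. The sketch below is only an illustration: it assumes every task takes 
one slot and that tasks run in full "waves", and the slot count used in the 
example is hypothetical, not a number from our cluster.

```java
// Rough model: how many scheduling waves a set of tasks needs on a fixed
// number of task slots, and how many slots sit idle when tasks < slots.
class TaskWavesSketch {
  // Number of waves needed to run all tasks, i.e. ceil(tasks / slots).
  static long waves(long tasks, long slots) {
    if (tasks < 0 || slots <= 0) {
      throw new IllegalArgumentException("tasks must be >= 0 and slots > 0");
    }
    return (tasks + slots - 1) / slots;
  }

  // Slots left unused when there are fewer tasks than slots.
  static long idleSlots(long tasks, long slots) {
    return Math.max(0, slots - tasks);
  }

  public static void main(String[] args) {
    long slots = 100; // hypothetical number of task slots
    System.out.println(waves(1252, slots));
    System.out.println(waves(13559, slots));
    System.out.println(idleSlots(50, slots));
  }
}
```

With idle slots, adding mappers raises parallelism. Once there are many more 
mappers than slots, each extra wave mostly adds per-task overhead.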



[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-18 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252762#comment-14252762
 ] 

Rui Li commented on HIVE-9153:
--

Hi [~xuefuz] - if the Spark cluster is the same as the Hadoop cluster, i.e. 
each executor is also a datanode, the Spark task scheduler usually does a good 
job of making sure all mappers have some locality (provided, of course, that 
the mappers do specify a preferred location). In such a case, more mappers 
won't impact data locality.
bq. Is there a way to disable Spark's delayed schedule to try out?
The Spark task scheduler divides tasks into multiple lists according to 
locality level and attempts to launch tasks with the highest locality level 
when an executor offers resources. It may also wait some time before scheduling 
tasks at a lower level. I don't think there's a switch to turn it off. Actually 
I'm not 100% sure it's delay scheduling causing the issue. If none of our tasks 
have a preferred location, the delay may happen at start-up (waiting for the 
allowed locality level to drop) but not during execution. I'll look more into 
this.
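
As a rough illustration of the waiting behaviour described above, the sketch 
below models how the allowed locality level relaxes over time. It is a 
simplified model and not Spark's scheduler code: it uses a single wait value 
for all levels, a reduced list of levels, and hypothetical names throughout.

```java
// Simplified model of delay scheduling: the longer no task has been launched,
// the further the allowed locality level is relaxed.
class DelaySchedulingSketch {
  // Ordered from most local to least local (reduced list for illustration).
  static final String[] LEVELS = {"PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"};

  // elapsedMs: time since the last task launch; waitMs: wait per level.
  static String allowedLevel(long elapsedMs, long waitMs) {
    int index = 0;
    long remaining = elapsedMs;
    // Each level "costs" one wait period before the next level is allowed.
    while (index < LEVELS.length - 1 && remaining >= waitMs) {
      remaining -= waitMs;
      index++;
    }
    return LEVELS[index];
  }

  public static void main(String[] args) {
    long waitMs = 3000; // illustrative wait per level
    System.out.println(allowedLevel(0, waitMs));
    System.out.println(allowedLevel(6500, waitMs));
    System.out.println(allowedLevel(0, 0));
  }
}
```

In this model a wait of 0 relaxes the allowed level to {{ANY}} immediately, 
which is consistent with the finding elsewhere in this thread that setting 
{{spark.locality.wait}} to 0 turns the delay off.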



[jira] [Commented] (HIVE-9153) Evaluate CombineHiveInputFormat versus HiveInputFormat [Spark Branch]

2014-12-18 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252874#comment-14252874
 ] 

Rui Li commented on HIVE-9153:
--

I think we actually can get location info with {{CombineHiveInputFormat}}. I 
verified this by running some tests on my machine, and most tasks have 
NODE_LOCAL locality (some have PROCESS_LOCAL because 
{{CombineHiveInputFormat.getLocations}} returns null).
BTW, Spark prints PROCESS_LOCAL for tasks that have no preferred locations, 
which may be a bug:
{code}
if (TaskLocality.isAllowed(maxLocality, TaskLocality.NO_PREF)) {
  // Look for noPref tasks after NODE_LOCAL for minimize cross-rack traffic
  for (index <- dequeueTaskFromList(execId, pendingTasksWithNoPrefs)) {
    return Some((index, TaskLocality.PROCESS_LOCAL, false))
  }
}
{code}
