[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-02-07 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16355472#comment-16355472
 ] 

Rui Li commented on HIVE-18368:
---

+1. Thanks [~stakiar] for the update.

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, HIVE-18368.4.patch, Job Ids.png, 
> Stage DAG 1.png, Stage DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-02-06 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354056#comment-16354056
 ] 

Hive QA commented on HIVE-18368:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12909368/HIVE-18368.4.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 24 failed/errored test(s), 12974 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_queries]
 (batchId=240)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_hook] 
(batchId=13)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[row__id] (batchId=79)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_move_tbl]
 (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] 
(batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucket_map_join_tez1]
 (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=167)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] 
(batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast]
 (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[resourceplan]
 (batchId=164)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] 
(batchId=161)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] 
(batchId=122)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut 
(batchId=221)
org.apache.hadoop.hive.metastore.client.TestTablesCreateDropAlterTruncate.testAlterTableNullStorageDescriptorInNew[Embedded]
 (batchId=206)
org.apache.hadoop.hive.metastore.client.TestTablesList.testListTableNamesByFilterNullDatabase[Embedded]
 (batchId=206)
org.apache.hadoop.hive.ql.TestTxnNoBuckets.testCTAS (batchId=280)
org.apache.hadoop.hive.ql.TestTxnNoBucketsVectorized.testCTAS (batchId=280)
org.apache.hadoop.hive.ql.exec.TestOperators.testNoConditionalTaskSizeForLlap 
(batchId=282)
org.apache.hadoop.hive.ql.io.TestDruidRecordWriter.testWrite (batchId=256)
org.apache.hive.beeline.cli.TestHiveCli.testNoErrorDB (batchId=188)
org.apache.hive.jdbc.TestSSL.testConnectionMismatch (batchId=234)
org.apache.hive.jdbc.TestSSL.testConnectionWrongCertCN (batchId=234)
org.apache.hive.jdbc.TestSSL.testMetastoreConnectionWrongCertCN (batchId=234)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/9049/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/9049/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-9049/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 24 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12909368 - PreCommit-HIVE-Build

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, HIVE-18368.4.patch, Job Ids.png, 
> Stage DAG 1.png, Stage DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-02-06 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16353976#comment-16353976
 ] 

Hive QA commented on HIVE-18368:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
28s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
51s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
38s{color} | {color:red} ql: The patch generated 1 new + 47 unchanged - 5 fixed 
= 48 total (was 52) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
12s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 14m  9s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / 443b10b |
| Default Java | 1.8.0_111 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-9049/yetus/diff-checkstyle-ql.txt
 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-9049/yetus.txt |
| Powered by | Apache Yetushttp://yetus.apache.org |


This message was automatically generated.



> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, HIVE-18368.4.patch, Job Ids.png, 
> Stage DAG 1.png, Stage DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-02-05 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16353439#comment-16353439
 ] 

Sahil Takiar commented on HIVE-18368:
-

[~lirui] sorry for the delay. Updated the RB, addressed comments.

[~xuefuz] true, but the UI also displays the edge-type so it should be pretty 
easy to figure out what tran object is being used.

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, HIVE-18368.4.patch, Job Ids.png, 
> Stage DAG 1.png, Stage DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-26 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340806#comment-16340806
 ] 

Rui Li commented on HIVE-18368:
---

Looks good to me overall. Left some minor comments on RB.

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, Job Ids.png, Stage DAG 1.png, Stage 
> DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-24 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338668#comment-16338668
 ] 

Xuefu Zhang commented on HIVE-18368:


[~stakiar] It may not be meaningful to all users, but could be additional info 
for Hive developers for diagnosis. I feel it's probably better than just 
repeating the same info. However, I don't have a strong opinion about this.

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, Job Ids.png, Stage DAG 1.png, Stage 
> DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-24 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338446#comment-16338446
 ] 

Sahil Takiar commented on HIVE-18368:
-

[~xuefuz] thanks for the comments

{quote} As to the duplication, is it possible that we name the call site 
differently so it is less confusing, such as "In ReduceTran" {quote} Yeah, we 
could, but I'm not sure its very useful for users. Ideally, they shouldn't need 
to understand what a ReduceTran is. The call site is also displayed on the 
Completed-Stages.png page, so I think its useful to have it set to something 
like {{Reducer 2}}

{quote} One thing unclear to me is the reason we changed the test case. {quote} 
Whoops, I'll remove that.

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, Job Ids.png, Stage DAG 1.png, Stage 
> DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-24 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338395#comment-16338395
 ] 

Xuefu Zhang commented on HIVE-18368:


Thanks, [~stakiar]. 

As to the duplication, is it possible that we name the call site differently so 
it is less confusing, such as "In ReduceTran"

The code looks fine, though I didn't get too much into the details. I will let 
[~lirui] share his comments.

One thing unclear to me is the reason we changed the test case.

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, Job Ids.png, Stage DAG 1.png, Stage 
> DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-24 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338053#comment-16338053
 ] 

Sahil Takiar commented on HIVE-18368:
-

[~xuefuz], [~lirui] thoughts on the updated patch? The latest screenshots are 
attached.

There is one weird thing I wasn't able to fix. You'll notice that the 
Stage-DAG.png images look like they have duplicate info - e.g.:

{code}
Reducer 2 (400) [7]
Reducer 2
{code}

The first line is the RDD name, the second like is the RDD call site. Living 
with this duplicate metadata is necessary to get the results in 
Completed-Stages.png

Spark distinguishes between RDD names and RDD call sites. By default, the RDD 
call site is what line of code created the RDD. However, the call site can be 
overwritten for each RDD.

In Spark, each stage is described by the call site of the final RDD in the 
stage (e.g. what you see in Completed-Stages.png).

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, Job Ids.png, Stage DAG 1.png, Stage 
> DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-22 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335253#comment-16335253
 ] 

Sahil Takiar commented on HIVE-18368:
-

Attached updated patch along with new screenshots. 

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, Job Ids.png, Stage DAG 1.png, Stage 
> DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-11 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322561#comment-16322561
 ] 

Sahil Takiar commented on HIVE-18368:
-

{quote} Can we get rid of code reference such as at 
repartitionAndSortWithinPartitions at SortByShuffler.java:57. they don't seem 
useful. {quote} I agree they aren't useful for HoS users. At first I thought 
removing them wasn't possible, but there may be a way to do this, which would 
be pretty cool. Working on a fix, may take another week to figure out, the APIs 
aren't really documented.

{quote} Can you clarify what's the format of an RDD specification as shown in 
each line of the output. {quote} Take {{Reducer 5 (SORT, 1) ShuffledRDD\[24\] 
at sortByKey at SortByShuffler.java:51 \[\]}} as an example:

* {{Reducer 5}} is the Hive {{BaseWork}} name
* {{SORT}} is the edge type (taken from the {{SparkEdgeProperty{{)
* {{(1)}} is the number of partitions for the stage (taken from the 
{{SparkEdgeProperty}})
* {{ShuffledRDD}} is the RDD type
* {{\[24\]}} is the RDD id
* {{sortByKey}} is the RDD transformation that created this RDD
* {{SortByShuffler.java:51}} is the line number that created this RDD
* {{\[\]}} I'm not sure what this is exactly

{quote} We can Skip SparkTran entirely, but need to have a clear mapping from 
Work to RDD {quote} Yes, this is the main goal of this patch. An easy way to 
map {{BaseWork}} objects to {{RDD}}.

{quote} Why the num of partitions of MapInput is 0 {quote} Thats just because 
the job I ran didn't have any data in the underlying tables.

{quote} It seems confusing to have 2 RDDs having the same work name {quote} 
Yes, I can play with the names a bit so its clearer. I'm not sure if 
{{ShuffleTran}} is the best name, all the {{Tran}} objects are internal 
implementation details of HoS that end users probably don't need to know about 
(another reason why I removed the {{Tran}} graph).

I'll continue working on addressing the comments in the RB too. Hope to have an 
updated patch sometime next week.

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-18368.1.patch, HIVE-18368.2.patch, Spark UI - Named 
> RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-10 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319913#comment-16319913
 ] 

Rui Li commented on HIVE-18368:
---

Hi [~stakiar], two questions regarding the screenshot:
# Why the num of partitions of MapInput is 0?
# It seems confusing to have 2 RDDs having the same work name, e.g. "Reducer 
3", "Map 11". Can we name the shuffled RDD as "ShuffleTran", and the Hadoop RDD 
as "MapInput"?

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-18368.1.patch, HIVE-18368.2.patch, Spark UI - Named 
> RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319245#comment-16319245
 ] 

Xuefu Zhang commented on HIVE-18368:


Hi [~stakiar], thanks for working on this. I think this is very useful. I 
haven't looked at the patch, but I have a couple of high-level questions:

1. Can we get rid of code reference such as {{at 
repartitionAndSortWithinPartitions at SortByShuffler.java:57}}. they don't seem 
useful.
2. Can you clarify what's the format of an RDD specification as shown in each 
line of the output. Besides the code reference, I'm not entirely sure what 
other elements means. For instance, I see many "[]" out there.
3. We have several internal object graphs, from Work graph, to SparkTran, and 
to RDD. We can Skip SparkTran entirely, but need to have a clear mapping from 
Work to RDD. Maybe reading the patch will give me the idea.


> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-18368.1.patch, HIVE-18368.2.patch, Spark UI - Named 
> RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-09 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318889#comment-16318889
 ] 

Sahil Takiar commented on HIVE-18368:
-

[~lirui], [~xuefuz] could you take a look? I think it will help improve the 
debug-ability of HoS, especially when using the Spark Web UI. 
[This|https://issues.apache.org/jira/browse/HIVE-18368?focusedCommentId=16312109&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16312109]
 comment contains a list of what changes this patch contains.

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-18368.1.patch, HIVE-18368.2.patch, Spark UI - Named 
> RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-06 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314716#comment-16314716
 ] 

Hive QA commented on HIVE-18368:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12904862/HIVE-18368.2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 18 failed/errored test(s), 11549 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_join25] (batchId=72)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_hook] 
(batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=35)
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver[hbase_binary_storage_queries]
 (batchId=99)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_2]
 (batchId=151)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2]
 (batchId=156)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=164)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] 
(batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast]
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] 
(batchId=159)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part]
 (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] 
(batchId=120)
org.apache.hadoop.hive.metastore.TestEmbeddedHiveMetaStore.testTransactionalValidation
 (batchId=213)
org.apache.hadoop.hive.ql.io.TestDruidRecordWriter.testWrite (batchId=253)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints 
(batchId=225)
org.apache.hive.jdbc.TestSSL.testConnectionMismatch (batchId=231)
org.apache.hive.jdbc.TestSSL.testConnectionWrongCertCN (batchId=231)
org.apache.hive.jdbc.TestSSL.testMetastoreConnectionWrongCertCN (batchId=231)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8483/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8483/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8483/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 18 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12904862 - PreCommit-HIVE-Build

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-18368.1.patch, HIVE-18368.2.patch, Spark UI - Named 
> RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-06 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314696#comment-16314696
 ] 

Hive QA commented on HIVE-18368:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 
29s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
32s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
32s{color} | {color:green} ql: The patch generated 0 new + 54 unchanged - 9 
fixed = 54 total (was 63) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
12s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 12m 54s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / a6b88d9 |
| Default Java | 1.8.0_111 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-8483/yetus.txt |
| Powered by | Apache Yetushttp://yetus.apache.org |


This message was automatically generated.



> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-18368.1.patch, HIVE-18368.2.patch, Spark UI - Named 
> RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-05 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313998#comment-16313998
 ] 

Sahil Takiar commented on HIVE-18368:
-

[~chengxiang li] I saw you added the RDD graph in HIVE-10550, [~chinnalalam] I 
saw you added the SparkPlan graph in HIVE-8858. Could you take a look at this 
patch? - RB: https://reviews.apache.org/r/64996/

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-18368.1.patch, HIVE-18368.2.patch, Spark UI - Named 
> RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-04 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312522#comment-16312522
 ] 

Hive QA commented on HIVE-18368:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12904682/HIVE-18368.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 18 failed/errored test(s), 11547 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_join25] (batchId=72)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=35)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] 
(batchId=150)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_2]
 (batchId=151)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2]
 (batchId=156)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=164)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] 
(batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast]
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] 
(batchId=159)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part]
 (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[stats_aggregator_error_1]
 (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] 
(batchId=120)
org.apache.hadoop.hive.metastore.TestEmbeddedHiveMetaStore.testTransactionalValidation
 (batchId=213)
org.apache.hadoop.hive.ql.io.TestDruidRecordWriter.testWrite (batchId=253)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints 
(batchId=225)
org.apache.hive.jdbc.TestSSL.testConnectionMismatch (batchId=231)
org.apache.hive.jdbc.TestSSL.testConnectionWrongCertCN (batchId=231)
org.apache.hive.jdbc.TestSSL.testMetastoreConnectionWrongCertCN (batchId=231)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8452/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8452/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8452/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 18 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12904682 - PreCommit-HIVE-Build

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-18368.1.patch, Spark UI - Named RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-04 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312478#comment-16312478
 ] 

Hive QA commented on HIVE-18368:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 
45s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
29s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
32s{color} | {color:red} ql: The patch generated 1 new + 54 unchanged - 9 fixed 
= 55 total (was 63) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
15s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 13m  3s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / 20c9a39 |
| Default Java | 1.8.0_111 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-8452/yetus/diff-checkstyle-ql.txt
 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-8452/yetus.txt |
| Powered by | Apache Yetushttp://yetus.apache.org |


This message was automatically generated.



> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-18368.1.patch, Spark UI - Named RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-04 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312109#comment-16312109
 ] 

Sahil Takiar commented on HIVE-18368:
-

* Spark provides a nice RDD graph via {{RDD#toDebugString}} - I replaced the 
{{SparkPlan#logSparkPlan}} and {{SparkUtilities#rddGraphToString}} with this 
graph. It includes all the info from both of these graphs + more info. It's 
very similar to the info that is showed in the Spark Web UI. An example is 
below.
* Added explicit names for each RDD; the name is derived from the name of the 
{{BaseWork}} that corresponds to the RDD, along with the {{SparkEdgeProperty}} 
(if there is one). The example below shows this in detail.
** The nice thing about adding explicit names is that they show up in Spark Web 
UI too, which can be very useful for mapping a Hive Explain Plan to the Spark 
RDD DAG
** The name includes the number of partitions for the RDD as well as whether or 
not the RDD is cached
* I originally wanted to find a way to display this in the {{EXPLAIN EXTENDED}} 
output, but for now that may be a bit difficult, because the {{SparkPlan}} is 
only generated in the {{RemoteDriver}} - its probably possible to generate the 
{{SparkPlan}} somewhere in the {{ExplainTask}}, but I'll save that for a later 
JIRA
* The Spark RDD Graph is printed at INFO level, which I think should help with 
debugging
* I've attached a screenshot of what the the Spark Web UI looks like with named 
RDDs

Spark RDD Graph:

{code}
(1) Reducer 5 (1) MapPartitionsRDD[25] at mapPartitionsToPair at 
ReduceTran.java:41 []
 |  Reducer 5 (SORT, 1) ShuffledRDD[24] at sortByKey at SortByShuffler.java:51 
[]
 +-(166) Reducer 4 (166) MapPartitionsRDD[23] at mapPartitionsToPair at 
ReduceTran.java:41 []
 |   Reducer 4 (PARTITION-LEVEL SORT, 166) ShuffledRDD[22] at 
repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
 +-(328) UnionRDD (328) UnionRDD[21] at union at SparkPlan.java:70 []
 |   Reducer 3 (328) MapPartitionsRDD[19] at mapPartitionsToPair at 
ReduceTran.java:41 []
 |   Reducer 3 (PARTITION-LEVEL SORT, 328) ShuffledRDD[18] at 
repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
 +-(874) UnionRDD (874) UnionRDD[17] at union at SparkPlan.java:70 []
 |   UnionRDD (874) UnionRDD[16] at union at SparkPlan.java:70 []
 |   Reducer 2 (437) MapPartitionsRDD[11] at mapPartitionsToPair at 
ReduceTran.java:41 []
 |   Reducer 2 (GROUP, 437) MapPartitionsRDD[10] at groupByKey at 
GroupByShuffler.java:31 []
 |   ShuffledRDD[9] at groupByKey at GroupByShuffler.java:31 []
 +-(0) Map 1 (0) MapPartitionsRDD[8] at mapPartitionsToPair at 
MapTran.java:41 []
|  Map 1 (store_sales, 0) HadoopRDD[4] at hadoopRDD at 
SparkPlanGenerator.java:203 []
 |   Reducer 8 (437) MapPartitionsRDD[14] at mapPartitionsToPair at 
ReduceTran.java:41 []
 |   Reducer 8 (GROUP PARTITION-LEVEL SORT, 437) ShuffledRDD[13] at 
repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
 +-(0) Map 7 (0) MapPartitionsRDD[12] at mapPartitionsToPair at 
MapTran.java:41 []
|  Map 7 (store_sales, 0) HadoopRDD[5] at hadoopRDD at 
SparkPlanGenerator.java:203 []
 |   Map 10 (0) MapPartitionsRDD[15] at mapPartitionsToPair at 
MapTran.java:41 []
 |   Map 10 (store, 0) HadoopRDD[6] at hadoopRDD at 
SparkPlanGenerator.java:203 []
 |   Map 11 (0) MapPartitionsRDD[20] at mapPartitionsToPair at 
MapTran.java:41 []
 |   Map 11 (item, 0) HadoopRDD[7] at hadoopRDD at 
SparkPlanGenerator.java:203 []
{code}

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: Spark UI - Named RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)