[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16355472#comment-16355472 ]

Rui Li commented on HIVE-18368:

+1. Thanks [~stakiar] for the update.

> Improve Spark Debug RDD Graph
> -----------------------------
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch,
> HIVE-18368.2.patch, HIVE-18368.3.patch, HIVE-18368.4.patch, Job Ids.png,
> Stage DAG 1.png, Stage DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between
> different {{SparkTran}}, what shuffle types are used, and what trans are
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects,
> RDDs, and BaseWorks. Edges should include information about number of
> partitions, shuffle types, Spark operations used, etc.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354056#comment-16354056 ]

Hive QA commented on HIVE-18368:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12909368/HIVE-18368.4.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 24 failed/errored test(s), 12974 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_queries] (batchId=240)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_hook] (batchId=13)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[row__id] (batchId=79)
org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_move_tbl] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucket_map_join_tez1] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast] (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[resourceplan] (batchId=164)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] (batchId=161)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] (batchId=122)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut (batchId=221)
org.apache.hadoop.hive.metastore.client.TestTablesCreateDropAlterTruncate.testAlterTableNullStorageDescriptorInNew[Embedded] (batchId=206)
org.apache.hadoop.hive.metastore.client.TestTablesList.testListTableNamesByFilterNullDatabase[Embedded] (batchId=206)
org.apache.hadoop.hive.ql.TestTxnNoBuckets.testCTAS (batchId=280)
org.apache.hadoop.hive.ql.TestTxnNoBucketsVectorized.testCTAS (batchId=280)
org.apache.hadoop.hive.ql.exec.TestOperators.testNoConditionalTaskSizeForLlap (batchId=282)
org.apache.hadoop.hive.ql.io.TestDruidRecordWriter.testWrite (batchId=256)
org.apache.hive.beeline.cli.TestHiveCli.testNoErrorDB (batchId=188)
org.apache.hive.jdbc.TestSSL.testConnectionMismatch (batchId=234)
org.apache.hive.jdbc.TestSSL.testConnectionWrongCertCN (batchId=234)
org.apache.hive.jdbc.TestSSL.testMetastoreConnectionWrongCertCN (batchId=234)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/9049/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/9049/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-9049/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 24 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12909368 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16353976#comment-16353976 ]

Hive QA commented on HIVE-18368:

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 28s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 38s{color} | {color:red} ql: The patch generated 1 new + 47 unchanged - 5 fixed = 48 total (was 52) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 14m 9s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / 443b10b |
| Default Java | 1.8.0_111 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-9049/yetus/diff-checkstyle-ql.txt |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-9049/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16353439#comment-16353439 ]

Sahil Takiar commented on HIVE-18368:

[~lirui] sorry for the delay. Updated the RB, addressed comments.

[~xuefuz] true, but the UI also displays the edge-type so it should be pretty easy to figure out what tran object is being used.
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340806#comment-16340806 ]

Rui Li commented on HIVE-18368:

Looks good to me overall. Left some minor comments on RB.
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338668#comment-16338668 ]

Xuefu Zhang commented on HIVE-18368:

[~stakiar] It may not be meaningful to all users, but could be additional info for Hive developers for diagnosis. I feel it's probably better than just repeating the same info. However, I don't have a strong opinion about this.
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338446#comment-16338446 ]

Sahil Takiar commented on HIVE-18368:

[~xuefuz] thanks for the comments.

{quote}
As to the duplication, is it possible that we name the call site differently so it is less confusing, such as "In ReduceTran"?
{quote}

Yeah, we could, but I'm not sure it's very useful for users. Ideally, they shouldn't need to understand what a ReduceTran is. The call site is also displayed on the Completed-Stages.png page, so I think it's useful to have it set to something like {{Reducer 2}}.

{quote}
One thing unclear to me is the reason we changed the test case.
{quote}

Whoops, I'll remove that.
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338395#comment-16338395 ]

Xuefu Zhang commented on HIVE-18368:

Thanks, [~stakiar]. As to the duplication, is it possible that we name the call site differently so it is less confusing, such as "In ReduceTran"?

The code looks fine, though I didn't get too much into the details. I will let [~lirui] share his comments.

One thing unclear to me is the reason we changed the test case.
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338053#comment-16338053 ]

Sahil Takiar commented on HIVE-18368:

[~xuefuz], [~lirui] thoughts on the updated patch? The latest screenshots are attached.

There is one weird thing I wasn't able to fix. You'll notice that the Stage-DAG.png images look like they have duplicate info, e.g.:

{code}
Reducer 2 (400) [7]
Reducer 2
{code}

The first line is the RDD name, the second line is the RDD call site. Living with this duplicate metadata is necessary to get the results in Completed-Stages.png.

Spark distinguishes between RDD names and RDD call sites. By default, the RDD call site is the line of code that created the RDD. However, the call site can be overwritten for each RDD. In Spark, each stage is described by the call site of the final RDD in the stage (e.g. what you see in Completed-Stages.png).
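The distinction between RDD names and RDD call sites in the comment above can be illustrated with a small stand-alone model. This is only a sketch: the `Rdd` class and `stageDescription` helper are hypothetical and are not Spark or Hive classes, and the values are taken from the examples quoted in this thread. In Spark's Java API the corresponding knobs are, as far as I know, `JavaRDD.setName` and `JavaSparkContext.setCallSite`; whether the patch uses exactly those calls is not stated here.

```java
import java.util.Arrays;
import java.util.List;

public class CallSiteModel {
    /** Hypothetical stand-in for an RDD; it only carries the fields discussed above. */
    public static final class Rdd {
        public final int id;
        public final String name;
        public final String callSite;

        public Rdd(int id, String name, String callSite) {
            this.id = id;
            this.name = name;
            this.callSite = callSite;
        }

        /** Two-line node label: "name [id]" on the first line, call site on the second. */
        public String dagLabel() {
            return name + " [" + id + "]\n" + callSite;
        }
    }

    /** A stage is described by the call site of the final RDD in the stage. */
    public static String stageDescription(List<Rdd> rddsInStage) {
        return rddsInStage.get(rddsInStage.size() - 1).callSite;
    }

    public static void main(String[] args) {
        // Default call site: the line of code that created the RDD.
        Rdd withDefault = new Rdd(7, "Reducer 2 (400)", "sortByKey at SortByShuffler.java:51");
        // Overridden call site: set to the work name, which duplicates part of the RDD name.
        Rdd overridden = new Rdd(7, "Reducer 2 (400)", "Reducer 2");

        System.out.println(stageDescription(Arrays.asList(withDefault)));
        System.out.println(stageDescription(Arrays.asList(overridden)));
        System.out.println(overridden.dagLabel());
    }
}
```

With the default call site, the stage description is a code reference; with the override, it is the work name, at the cost of the repeated text in the DAG node label.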
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335253#comment-16335253 ]

Sahil Takiar commented on HIVE-18368:

Attached updated patch along with new screenshots.
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322561#comment-16322561 ]

Sahil Takiar commented on HIVE-18368:

{quote}
Can we get rid of code references such as at repartitionAndSortWithinPartitions at SortByShuffler.java:57? They don't seem useful.
{quote}

I agree they aren't useful for HoS users. At first I thought removing them wasn't possible, but there may be a way to do this, which would be pretty cool. Working on a fix; it may take another week to figure out, since the APIs aren't really documented.

{quote}
Can you clarify what's the format of an RDD specification as shown in each line of the output?
{quote}

Take {{Reducer 5 (SORT, 1) ShuffledRDD\[24\] at sortByKey at SortByShuffler.java:51 \[\]}} as an example:
* {{Reducer 5}} is the Hive {{BaseWork}} name
* {{SORT}} is the edge type (taken from the {{SparkEdgeProperty}})
* {{1}} is the number of partitions for the stage (taken from the {{SparkEdgeProperty}})
* {{ShuffledRDD}} is the RDD type
* {{\[24\]}} is the RDD id
* {{sortByKey}} is the RDD transformation that created this RDD
* {{SortByShuffler.java:51}} is the source file and line number of the code that created this RDD
* {{\[\]}} I'm not sure what this is exactly

{quote}
We can skip SparkTran entirely, but need to have a clear mapping from Work to RDD
{quote}

Yes, this is the main goal of this patch: an easy way to map {{BaseWork}} objects to RDDs.

{quote}
Why the num of partitions of MapInput is 0
{quote}

That's just because the job I ran didn't have any data in the underlying tables.

{quote}
It seems confusing to have 2 RDDs having the same work name
{quote}

Yes, I can play with the names a bit so it's clearer. I'm not sure if {{ShuffleTran}} is the best name; all the {{Tran}} objects are internal implementation details of HoS that end users probably don't need to know about (another reason why I removed the {{Tran}} graph).

I'll continue working on addressing the comments in the RB too. Hope to have an updated patch sometime next week.
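The field breakdown in the comment above can be turned into a small parser. The following is a minimal sketch, not code from the patch: the class name `RddDebugLine` and the regular expression are assumptions derived from the single example line quoted in the comment, so other lines of the real output may not match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses one RDD line of the debug output (layout assumed from the one example in this thread). */
public class RddDebugLine {
    // Assumed layout: "<work> (<edge>, <partitions>) <RddType>[<id>] at <op> at <File>:<line> [...]"
    private static final Pattern LINE = Pattern.compile(
        "^(.+?) \\(([^,)]+), (\\d+)\\) (\\w+)\\[(\\d+)\\] at (\\w+) at ([\\w.]+):(\\d+) \\[.*\\]$");

    public final String workName;       // Hive BaseWork name, e.g. "Reducer 5"
    public final String edgeType;       // edge type, e.g. "SORT"
    public final int numPartitions;     // number of partitions for the stage
    public final String rddType;        // RDD type, e.g. "ShuffledRDD"
    public final int rddId;             // RDD id
    public final String transformation; // RDD transformation that created the RDD
    public final String sourceFile;     // source file of the creating code
    public final int sourceLine;        // line number of the creating code

    private RddDebugLine(Matcher m) {
        workName = m.group(1);
        edgeType = m.group(2);
        numPartitions = Integer.parseInt(m.group(3));
        rddType = m.group(4);
        rddId = Integer.parseInt(m.group(5));
        transformation = m.group(6);
        sourceFile = m.group(7);
        sourceLine = Integer.parseInt(m.group(8));
    }

    public static RddDebugLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized RDD line: " + line);
        }
        return new RddDebugLine(m);
    }

    public static void main(String[] args) {
        RddDebugLine r = parse(
            "Reducer 5 (SORT, 1) ShuffledRDD[24] at sortByKey at SortByShuffler.java:51 []");
        System.out.println(r.workName + " -> " + r.rddType + "[" + r.rddId + "], edge "
            + r.edgeType + ", " + r.numPartitions + " partition(s)");
    }
}
```

The trailing `[]` is matched but not captured, since its meaning is left open in the comment.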
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319913#comment-16319913 ]

Rui Li commented on HIVE-18368:

Hi [~stakiar], two questions regarding the screenshot:
# Why the num of partitions of MapInput is 0?
# It seems confusing to have 2 RDDs having the same work name, e.g. "Reducer 3", "Map 11". Can we name the shuffled RDD as "ShuffleTran", and the Hadoop RDD as "MapInput"?
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319245#comment-16319245 ]

Xuefu Zhang commented on HIVE-18368:

Hi [~stakiar], thanks for working on this. I think this is very useful. I haven't looked at the patch, but I have a couple of high-level questions:

1. Can we get rid of code references such as {{at repartitionAndSortWithinPartitions at SortByShuffler.java:57}}? They don't seem useful.
2. Can you clarify what's the format of an RDD specification as shown in each line of the output? Besides the code reference, I'm not entirely sure what the other elements mean. For instance, I see many "[]" out there.
3. We have several internal object graphs, from Work graph, to SparkTran, and to RDD. We can skip SparkTran entirely, but need to have a clear mapping from Work to RDD. Maybe reading the patch will give me the idea.
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318889#comment-16318889 ]

Sahil Takiar commented on HIVE-18368:

[~lirui], [~xuefuz] could you take a look? I think it will help improve the debug-ability of HoS, especially when using the Spark Web UI.

[This|https://issues.apache.org/jira/browse/HIVE-18368?focusedCommentId=16312109&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16312109] comment contains a list of what changes this patch contains.
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314716#comment-16314716 ]

Hive QA commented on HIVE-18368:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12904862/HIVE-18368.2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 18 failed/errored test(s), 11549 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_join25] (batchId=72)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_hook] (batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=35)
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver[hbase_binary_storage_queries] (batchId=99)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_2] (batchId=151)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2] (batchId=156)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata] (batchId=164)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast] (batchId=159)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] (batchId=159)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part] (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] (batchId=120)
org.apache.hadoop.hive.metastore.TestEmbeddedHiveMetaStore.testTransactionalValidation (batchId=213)
org.apache.hadoop.hive.ql.io.TestDruidRecordWriter.testWrite (batchId=253)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints (batchId=225)
org.apache.hive.jdbc.TestSSL.testConnectionMismatch (batchId=231)
org.apache.hive.jdbc.TestSSL.testConnectionWrongCertCN (batchId=231)
org.apache.hive.jdbc.TestSSL.testMetastoreConnectionWrongCertCN (batchId=231)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8483/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8483/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8483/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 18 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12904862 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314696#comment-16314696 ]
Hive QA commented on HIVE-18368:
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 29s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 56s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s{color} | {color:green} ql: The patch generated 0 new + 54 unchanged - 9 fixed = 54 total (was 63) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 12m 54s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / a6b88d9 |
| Default Java | 1.8.0_111 |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-8483/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |
This message was automatically generated.
> Improve Spark Debug RDD Graph > - > > Key: HIVE-18368 > URL: https://issues.apache.org/jira/browse/HIVE-18368 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Sahil Takiar >Assignee: Sahil Takiar > Attachments: HIVE-18368.1.patch, HIVE-18368.2.patch, Spark UI - Named > RDDs.png > > > The {{SparkPlan}} class does some logging to show the mapping between > different {{SparkTran}}, what shuffle types are used, and what trans are > cached. However, there is room for improvement. > When debug logging is enabled the RDD graph is logged, but there isn't much > information printed about each RDD. > We should combine both of the graphs and improve them. We could even make the > Spark Plan graph part of the {{EXPLAIN EXTENDED}} output. > Ideally, the final graph shows a clear relationship between Tran objects, > RDDs, and BaseWorks. 
Edge should include information about number of > partitions, shuffle types, Spark operations used, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313998#comment-16313998 ]
Sahil Takiar commented on HIVE-18368:
-
[~chengxiang li] I saw you added the RDD graph in HIVE-10550; [~chinnalalam] I saw you added the SparkPlan graph in HIVE-8858. Could you both take a look at this patch?
- RB: https://reviews.apache.org/r/64996/
> Improve Spark Debug RDD Graph > - > > Key: HIVE-18368 > URL: https://issues.apache.org/jira/browse/HIVE-18368 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Sahil Takiar >Assignee: Sahil Takiar > Attachments: HIVE-18368.1.patch, HIVE-18368.2.patch, Spark UI - Named > RDDs.png > > > The {{SparkPlan}} class does some logging to show the mapping between > different {{SparkTran}}, what shuffle types are used, and what trans are > cached. However, there is room for improvement. > When debug logging is enabled the RDD graph is logged, but there isn't much > information printed about each RDD. > We should combine both of the graphs and improve them. We could even make the > Spark Plan graph part of the {{EXPLAIN EXTENDED}} output. > Ideally, the final graph shows a clear relationship between Tran objects, > RDDs, and BaseWorks. Edge should include information about number of > partitions, shuffle types, Spark operations used, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312522#comment-16312522 ]
Hive QA commented on HIVE-18368:
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12904682/HIVE-18368.1.patch
{color:red}ERROR:{color} -1 due to no test(s) being added or modified.
{color:red}ERROR:{color} -1 due to 18 failed/errored test(s), 11547 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_join25] (batchId=72)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=35)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] (batchId=150)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_2] (batchId=151)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2] (batchId=156)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata] (batchId=164)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast] (batchId=159)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] (batchId=159)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part] (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[stats_aggregator_error_1] (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] (batchId=120)
org.apache.hadoop.hive.metastore.TestEmbeddedHiveMetaStore.testTransactionalValidation (batchId=213)
org.apache.hadoop.hive.ql.io.TestDruidRecordWriter.testWrite (batchId=253)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints (batchId=225)
org.apache.hive.jdbc.TestSSL.testConnectionMismatch (batchId=231)
org.apache.hive.jdbc.TestSSL.testConnectionWrongCertCN (batchId=231)
org.apache.hive.jdbc.TestSSL.testMetastoreConnectionWrongCertCN (batchId=231)
{noformat}
Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8452/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8452/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8452/
Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 18 tests failed
{noformat}
This message is automatically generated.
ATTACHMENT ID: 12904682 - PreCommit-HIVE-Build
> Improve Spark Debug RDD Graph > - > > Key: HIVE-18368 > URL: https://issues.apache.org/jira/browse/HIVE-18368 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Sahil Takiar >Assignee: Sahil Takiar > Attachments: HIVE-18368.1.patch, Spark UI - Named RDDs.png > > > The {{SparkPlan}} class does some logging to show the mapping between > different {{SparkTran}}, what shuffle types are used, and what trans are > cached. However, there is room for improvement. > When debug logging is enabled the RDD graph is logged, but there isn't much > information printed about each RDD. > We should combine both of the graphs and improve them. We could even make the > Spark Plan graph part of the {{EXPLAIN EXTENDED}} output. > Ideally, the final graph shows a clear relationship between Tran objects, > RDDs, and BaseWorks. Edge should include information about number of > partitions, shuffle types, Spark operations used, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312478#comment-16312478 ]
Hive QA commented on HIVE-18368:
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 45s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 56s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 29s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 32s{color} | {color:red} ql: The patch generated 1 new + 54 unchanged - 9 fixed = 55 total (was 63) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 15s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 13m 3s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / 20c9a39 |
| Default Java | 1.8.0_111 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-8452/yetus/diff-checkstyle-ql.txt |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-8452/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |
This message was automatically generated.
> Improve Spark Debug RDD Graph > - > > Key: HIVE-18368 > URL: https://issues.apache.org/jira/browse/HIVE-18368 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Sahil Takiar >Assignee: Sahil Takiar > Attachments: HIVE-18368.1.patch, Spark UI - Named RDDs.png > > > The {{SparkPlan}} class does some logging to show the mapping between > different {{SparkTran}}, what shuffle types are used, and what trans are > cached. However, there is room for improvement. > When debug logging is enabled the RDD graph is logged, but there isn't much > information printed about each RDD. > We should combine both of the graphs and improve them. We could even make the > Spark Plan graph part of the {{EXPLAIN EXTENDED}} output. 
> Ideally, the final graph shows a clear relationship between Tran objects, > RDDs, and BaseWorks. Edge should include information about number of > partitions, shuffle types, Spark operations used, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph
[ https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312109#comment-16312109 ]
Sahil Takiar commented on HIVE-18368:
-
* Spark provides a nice RDD graph via {{RDD#toDebugString}} - I replaced {{SparkPlan#logSparkPlan}} and {{SparkUtilities#rddGraphToString}} with this graph. It includes all the info from both of those graphs plus more, and is very similar to the info that is shown in the Spark Web UI. An example is below.
* Added explicit names for each RDD; the name is derived from the name of the {{BaseWork}} that corresponds to the RDD, along with the {{SparkEdgeProperty}} (if there is one). The example below shows this in detail.
** The nice thing about adding explicit names is that they show up in the Spark Web UI too, which can be very useful for mapping a Hive Explain Plan to the Spark RDD DAG
** The name includes the number of partitions for the RDD as well as whether or not the RDD is cached
* I originally wanted to find a way to display this in the {{EXPLAIN EXTENDED}} output, but for now that may be a bit difficult, because the {{SparkPlan}} is only generated in the {{RemoteDriver}} - it's probably possible to generate the {{SparkPlan}} somewhere in the {{ExplainTask}}, but I'll save that for a later JIRA
* The Spark RDD Graph is printed at INFO level, which I think should help with debugging
* I've attached a screenshot of what the Spark Web UI looks like with named RDDs
Spark RDD Graph:
{code}
(1) Reducer 5 (1) MapPartitionsRDD[25] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 5 (SORT, 1) ShuffledRDD[24] at sortByKey at SortByShuffler.java:51 []
+-(166) Reducer 4 (166) MapPartitionsRDD[23] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 4 (PARTITION-LEVEL SORT, 166) ShuffledRDD[22] at repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
+-(328) UnionRDD (328) UnionRDD[21] at union at SparkPlan.java:70 []
| Reducer 3 (328) MapPartitionsRDD[19] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 3 (PARTITION-LEVEL SORT, 328) ShuffledRDD[18] at repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
+-(874) UnionRDD (874) UnionRDD[17] at union at SparkPlan.java:70 []
| UnionRDD (874) UnionRDD[16] at union at SparkPlan.java:70 []
| Reducer 2 (437) MapPartitionsRDD[11] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 2 (GROUP, 437) MapPartitionsRDD[10] at groupByKey at GroupByShuffler.java:31 []
| ShuffledRDD[9] at groupByKey at GroupByShuffler.java:31 []
+-(0) Map 1 (0) MapPartitionsRDD[8] at mapPartitionsToPair at MapTran.java:41 []
| Map 1 (store_sales, 0) HadoopRDD[4] at hadoopRDD at SparkPlanGenerator.java:203 []
| Reducer 8 (437) MapPartitionsRDD[14] at mapPartitionsToPair at ReduceTran.java:41 []
| Reducer 8 (GROUP PARTITION-LEVEL SORT, 437) ShuffledRDD[13] at repartitionAndSortWithinPartitions at SortByShuffler.java:57 []
+-(0) Map 7 (0) MapPartitionsRDD[12] at mapPartitionsToPair at MapTran.java:41 []
| Map 7 (store_sales, 0) HadoopRDD[5] at hadoopRDD at SparkPlanGenerator.java:203 []
| Map 10 (0) MapPartitionsRDD[15] at mapPartitionsToPair at MapTran.java:41 []
| Map 10 (store, 0) HadoopRDD[6] at hadoopRDD at SparkPlanGenerator.java:203 []
| Map 11 (0) MapPartitionsRDD[20] at mapPartitionsToPair at MapTran.java:41 []
| Map 11 (item, 0) HadoopRDD[7] at hadoopRDD at SparkPlanGenerator.java:203 []
{code}
> Improve Spark Debug RDD Graph > - > > Key: HIVE-18368 > URL: https://issues.apache.org/jira/browse/HIVE-18368 > Project: Hive > Issue Type: Sub-task > Components: Spark >Reporter: Sahil Takiar >Assignee: Sahil Takiar > Attachments: Spark UI - Named RDDs.png > > > The {{SparkPlan}} class does some logging to show the mapping between > different {{SparkTran}}, what shuffle types are used, and what trans are > cached. However, there is room for improvement. 
> When debug logging is enabled the RDD graph is logged, but there isn't much > information printed about each RDD. > We should combine both of the graphs and improve them. We could even make the > Spark Plan graph part of the {{EXPLAIN EXTENDED}} output. > Ideally, the final graph shows a clear relationship between Tran objects, > RDDs, and BaseWorks. Edge should include information about number of > partitions, shuffle types, Spark operations used, etc. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
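The RDD naming scheme described in the comment above (work name, optional edge or table detail, partition count) can be illustrated with a small sketch. This is not code from the HIVE-18368 patch: the class {{RddNameSketch}} and the method {{rddName}} are hypothetical, and the label layout is only inferred from the sample graph (for example "Reducer 4 (PARTITION-LEVEL SORT, 166)"). The sample output does not show what a cached RDD's label looks like, so the sketch leaves that part out.

```java
// Hypothetical helper, NOT taken from the HIVE-18368 patch.
// It only mimics the label layout visible in the sample RDD graph.
final class RddNameSketch {

  private RddNameSketch() {
  }

  // Builds a label such as "Reducer 4 (PARTITION-LEVEL SORT, 166)".
  // workName:      name of the BaseWork, e.g. "Map 1" or "Reducer 5"
  // detail:        shuffle type or input table; null or empty if there is none
  // numPartitions: partition count of the RDD
  static String rddName(String workName, String detail, int numPartitions) {
    StringBuilder sb = new StringBuilder(workName).append(" (");
    if (detail != null && !detail.isEmpty()) {
      sb.append(detail).append(", ");
    }
    return sb.append(numPartitions).append(')').toString();
  }

  public static void main(String[] args) {
    // Labels copied from the sample graph, used here as inputs only.
    System.out.println(rddName("Reducer 5", "SORT", 1));
    System.out.println(rddName("Reducer 5", null, 1));
    System.out.println(rddName("Map 1", "store_sales", 0));
  }
}
```

In a real Spark job a label like this would presumably be attached with {{setName}} on the RDD, after which it appears both in {{toDebugString}} output and in the Spark Web UI; how the patch actually wires this up is not shown in this thread.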