[GitHub] spark issue #22353: [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22353 thanks, merging to master/2.4/2.3! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22353 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22353 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95993/ Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22353 **[Test build #95993 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95993/testReport)** for PR 22353 at commit [`0340fa6`](https://github.com/apache/spark/commit/0340fa648a17384a039ee484de9ce91a0129b260). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22353 **[Test build #95993 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95993/testReport)** for PR 22353 at commit [`0340fa6`](https://github.com/apache/spark/commit/0340fa648a17384a039ee484de9ce91a0129b260).
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22353 ok to test
Github user LantaoJin commented on the issue: https://github.com/apache/spark/pull/22353 ping @cloud-fan
Github user LantaoJin commented on the issue: https://github.com/apache/spark/pull/22353 Thank you @cloud-fan for the reminder. We've handled the dropped-message case. Agreed, I will push an updated commit tomorrow.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22353 So you need a way to reliably report some extra information, like the file path, in the event logs, but don't want to show it in the UI as it may be too long. Basically, we shouldn't put such information in the event logs if it's not used in the UI, and we should build a new mechanism to make Spark easier to analyze. Also keep in mind that event logs are not reliable: Spark may drop some events if the event bus is too busy. I'm ok with adding it back to the event logs since it was there before, but please don't add `metadata` to `SparkPlan`; we can pattern match on `FileSourceScanExec` in `SparkPlanInfo.fromSparkPlan` instead.
Github user LantaoJin commented on the issue: https://github.com/apache/spark/pull/22353 Spark driver logs are scattered across various client nodes and depend on the log4j configuration. In a big company it is hard to collect them all, and I think they are better suited to debugging than to analysis.
Github user LantaoJin commented on the issue: https://github.com/apache/spark/pull/22353 > Although the event log is in JSON format, it's mostly for internal usage, to be loaded by the history server and used to build the Spark UI. AFAIK, more and more projects replay event logs to analyze jobs offline, especially platform/infra teams in big companies. Dr-elephant doesn't read the event log; instead it queries the SHS for information, which causes many problems such as compatibility and data accuracy. At eBay we are building a system similar to Dr-elephant but much more powerful. One use case for this system is building data lineage and monitoring the input/output paths and data size of each application. Unlike Apache Atlas, which needs to attach a Spark listener to the Spark runtime, we chose to replay the event log to build all the context we need. Before 2.3, we could get this information from the `metadata` field in the SQLExecutionStart event. It has since been removed, so I hope this PR can add it back. More broadly, this would open up more uses for the event log than just the SHS.
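For readers unfamiliar with the replay approach described in this comment, here is a minimal sketch of what reading `metadata` back out of an event log could look like. It is not code from the PR. It assumes the log is a text file with one JSON object per line, that the SQL start event is named `org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart`, and that each plan node under `sparkPlanInfo` carries `nodeName`, `children` and `metadata` fields; all of these names should be checked against logs from the Spark version in use. The helper names `iter_plan_nodes` and `scan_metadata` are hypothetical.

```python
import json

# Assumed event name; verify it against an event log from your Spark version.
SQL_START = "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"


def iter_plan_nodes(plan_info):
    """Yield every node of a sparkPlanInfo tree, parent before children."""
    yield plan_info
    for child in plan_info.get("children") or []:
        yield from iter_plan_nodes(child)


def scan_metadata(event_log_lines):
    """Return (executionId, nodeName, metadata) for plan nodes with non-empty metadata."""
    found = []
    for line in event_log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Skip lines that are not valid JSON, e.g. a truncated final line.
            continue
        if not isinstance(event, dict) or event.get("Event") != SQL_START:
            continue
        for node in iter_plan_nodes(event.get("sparkPlanInfo") or {}):
            metadata = node.get("metadata") or {}
            if metadata:
                found.append((event.get("executionId"), node.get("nodeName"), metadata))
    return found
```

Because `scan_metadata` accepts any iterable of lines, an open file handle for an uncompressed log can be passed directly. Since Spark may drop events when the event bus is busy, a lineage tool built this way should treat a missing start event as expected rather than as an error.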
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22353 Although the event log is in JSON format, it's mostly for internal usage, to be loaded by the history server and used to build the Spark UI. For compatibility, we only focus on making the history server able to load event logs from different Spark versions, not on the event log itself. In the end it's still a log. Metadata is a hack which I really hate to add back. Can you describe your use case in more detail? Let's see if we can solve it with the Spark driver log.