[GitHub] spark issue #22353: [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump...

2018-09-12 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22353
  
thanks, merging to master/2.4/2.3!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22353: [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump...

2018-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22353
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22353: [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump...

2018-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22353
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95993/


---




[GitHub] spark issue #22353: [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump...

2018-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22353
  
**[Test build #95993 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95993/testReport)** for PR 22353 at commit [`0340fa6`](https://github.com/apache/spark/commit/0340fa648a17384a039ee484de9ce91a0129b260).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22353: [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump...

2018-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22353
  
**[Test build #95993 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95993/testReport)** for PR 22353 at commit [`0340fa6`](https://github.com/apache/spark/commit/0340fa648a17384a039ee484de9ce91a0129b260).


---




[GitHub] spark issue #22353: [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump...

2018-09-12 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22353
  
ok to test


---




[GitHub] spark issue #22353: [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump...

2018-09-11 Thread LantaoJin
Github user LantaoJin commented on the issue:

https://github.com/apache/spark/pull/22353
  
ping @cloud-fan 


---




[GitHub] spark issue #22353: [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump...

2018-09-11 Thread LantaoJin
Github user LantaoJin commented on the issue:

https://github.com/apache/spark/pull/22353
  
Thank you @cloud-fan for the reminder. We already handle the dropped-event 
case. Agreed, I will push an updated commit tomorrow.


---




[GitHub] spark issue #22353: [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump...

2018-09-11 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22353
  
So you need a way to reliably report some extra information, such as the file 
path, in the event logs, but don't want to show it in the UI because it may be 
too long.

Basically we shouldn't put such information in the event logs if it's not 
used in the UI, and we should build a new mechanism to make Spark easier to 
analyze. Also keep in mind that event logs are not reliable: Spark may drop 
some events if the event bus is too busy.

I'm ok to add it back to the event logs since it was there before, but 
please don't add `metadata` to `SparkPlan`. Instead, we can pattern match the 
`FileSourceScanExec` in `SparkPlanInfo.fromSparkPlan`.
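
To illustrate the shape of that suggestion, here is a minimal sketch in Python rather than Spark's Scala. It is not the code from the PR: the classes below are hypothetical stand-ins that only borrow the names `SparkPlan`, `FileSourceScanExec` and `SparkPlanInfo`, and `from_spark_plan` is an assumed analogue of `SparkPlanInfo.fromSparkPlan`. The point is that only the converter knows about metadata, so the generic plan class needs no new field.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SparkPlan:
    """Stand-in for a generic physical plan node; it has no metadata field."""
    node_name: str
    children: List["SparkPlan"] = field(default_factory=list)


@dataclass
class FileSourceScanExec(SparkPlan):
    """Stand-in for the file scan node, the only node that carries metadata."""
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class SparkPlanInfo:
    """Stand-in for the serializable summary written to the event log."""
    node_name: str
    children: List["SparkPlanInfo"]
    metadata: Dict[str, str]


def from_spark_plan(plan: SparkPlan) -> SparkPlanInfo:
    # Match on the concrete node type instead of adding metadata to every plan.
    if isinstance(plan, FileSourceScanExec):
        metadata = dict(plan.metadata)
    else:
        metadata = {}
    children = [from_spark_plan(child) for child in plan.children]
    return SparkPlanInfo(plan.node_name, children, metadata)
```

With this shape, a non-scan node always yields an empty metadata map, while a scan node passes its own map through unchanged.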


---




[GitHub] spark issue #22353: [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump...

2018-09-11 Thread LantaoJin
Github user LantaoJin commented on the issue:

https://github.com/apache/spark/pull/22353
  
Spark driver logs are scattered across various client nodes and depend 
on the log4j configs. In a big company it is hard to collect them all, and I 
think they are better suited to debugging than to analysis.


---




[GitHub] spark issue #22353: [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump...

2018-09-11 Thread LantaoJin
Github user LantaoJin commented on the issue:

https://github.com/apache/spark/pull/22353
  
> Although the event log is in JSON format, it's mostly for internal usage, to 
> be loaded by the history server and used to build the Spark UI.

AFAIK, more and more projects replay the event log to analyze jobs 
offline, especially platform/infra teams in big companies. Dr. Elephant 
doesn't read the event log; instead it queries the SHS for information, which 
causes many problems such as compatibility and data accuracy. At eBay we are 
building a system similar to Dr. Elephant but much more powerful. One use case 
of this system is building data lineage and monitoring the input/output paths 
and data sizes of each application. Unlike Apache Atlas, which needs to attach 
a Spark listener to the Spark runtime, we chose to replay the event log to 
build all the context we need. Before 2.3, we could get this information from 
the `metadata` field in the SQLExecutionStart event. It has since been removed, 
so I hope this PR can add it back. Beyond that, it would open up more uses for 
the event log than just the SHS.
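
The replay approach described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not eBay's system: it assumes the event log holds one JSON object per line, that SQL start events use the name `org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart`, and that plan nodes expose `nodeName`, `children` and `metadata` keys. The function names and the sample path are hypothetical, so check the field names against logs from your own Spark version.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed event name; verify it against a real event log.
SQL_START = "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"


def iter_plan_nodes(node: Dict) -> Iterator[Dict]:
    """Yield a plan node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_plan_nodes(child)


def scan_metadata(lines: Iterable[str]) -> List[Tuple[object, str, Dict]]:
    """Collect (executionId, nodeName, metadata) for nodes with non-empty metadata."""
    found = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Event") != SQL_START:
            continue
        for node in iter_plan_nodes(event.get("sparkPlanInfo", {})):
            metadata = node.get("metadata") or {}
            if metadata:
                found.append((event.get("executionId"), node.get("nodeName"), metadata))
    return found


if __name__ == "__main__":
    # Hypothetical one-event log, built in memory for illustration only.
    sample = json.dumps({
        "Event": SQL_START,
        "executionId": 0,
        "sparkPlanInfo": {
            "nodeName": "Project",
            "children": [{
                "nodeName": "FileScan parquet",
                "children": [],
                "metadata": {"Location": "hypothetical/input/path"},
            }],
            "metadata": {},
        },
    })
    for execution_id, node_name, metadata in scan_metadata([sample]):
        print(execution_id, node_name, metadata)
```

Because Spark may drop events when the event bus is busy, a consumer like this should treat a missing start event as a gap in the data rather than as proof that no query ran.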


---




[GitHub] spark issue #22353: [SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump...

2018-09-11 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22353
  
Although the event log is in JSON format, it's mostly for internal usage, to 
be loaded by the history server and used to build the Spark UI. For 
compatibility, we only focus on making the history server able to load event 
logs from different Spark versions, not on the event log format itself. In the 
end it's still a log.

Metadata is a hack that I would really hate to add back. Can you describe 
your use case in more detail? Let's see if we can solve it with the Spark 
driver log.



---
