[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76864451
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28201/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76864438
  
[Test build #28201 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28201/consoleFull) for PR 4821 at commit [`8511141`](https://github.com/apache/spark/commit/8511141f6b769655acc2b0c1de33a8401c8f7133).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4821





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76860619
  
[Test build #28198 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28198/consoleFull) for PR 4821 at commit [`fdae14c`](https://github.com/apache/spark/commit/fdae14c86f660047b50020b40a782b842182336c).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76860628
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28198/





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76855580
  
Okay, I took a close look through this and it LGTM.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76850938
  
[Test build #28201 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28201/consoleFull) for PR 4821 at commit [`8511141`](https://github.com/apache/spark/commit/8511141f6b769655acc2b0c1de33a8401c8f7133).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25649521
  
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -574,6 +583,11 @@ private[spark] object JsonProtocol {
     SparkListenerExecutorRemoved(time, executorId, reason)
   }
 
+  def logStartFromJson(json: JValue): SparkListenerLogStart = {
+    val version = (json \ "Spark Version").extract[String]
--- End diff --

(edit): misunderstood what you meant. I'll rename the variable, NBD.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76848440
  
LGTM.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25648636
  
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -574,6 +583,11 @@ private[spark] object JsonProtocol {
     SparkListenerExecutorRemoved(time, executorId, reason)
   }
 
+  def logStartFromJson(json: JValue): SparkListenerLogStart = {
+    val version = (json \ "Spark Version").extract[String]
--- End diff --

but then it's not consistent with other similar JSONs...





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25648384
  
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -574,6 +583,11 @@ private[spark] object JsonProtocol {
     SparkListenerExecutorRemoved(time, executorId, reason)
   }
 
+  def logStartFromJson(json: JValue): SparkListenerLogStart = {
+    val version = (json \ "Spark Version").extract[String]
--- End diff --

minor - but maybe call this `sparkVersion` so it's clear that this isn't a 
version for the logging format (since that doesn't have its own versioning).





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76846594
  
[Test build #28198 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28198/consoleFull) for PR 4821 at commit [`fdae14c`](https://github.com/apache/spark/commit/fdae14c86f660047b50020b40a782b842182336c).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25647830
  
--- Diff: core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala ---
@@ -89,16 +105,18 @@ class FsHistoryProviderSuite extends FunSuite with BeforeAndAfter with Matchers
 
     val list = provider.getListing().toSeq
     list should not be (null)
-    list.size should be (4)
-    list.count(e => e.completed) should be (2)
+    list.size should be (5)
+    list.count(_.completed) should be (3)
 
     list(0) should be (ApplicationHistoryInfo(newAppComplete.getName(), "new-app-complete", 1L, 4L,
       newAppComplete.lastModified(), "test", true))
-    list(1) should be (ApplicationHistoryInfo(oldAppComplete.getName(), "old-app-complete", 2L, 3L,
+    list(1) should be (ApplicationHistoryInfo(newAppCompressedComplete.getName(),
--- End diff --

It could be that you're using a different filesystem than the Jenkins machine, and the two return things in a different order.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25647694
  
--- Diff: core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala ---
@@ -89,16 +105,18 @@ class FsHistoryProviderSuite extends FunSuite with BeforeAndAfter with Matchers
 
     val list = provider.getListing().toSeq
     list should not be (null)
-    list.size should be (4)
-    list.count(e => e.completed) should be (2)
+    list.size should be (5)
+    list.count(_.completed) should be (3)
 
     list(0) should be (ApplicationHistoryInfo(newAppComplete.getName(), "new-app-complete", 1L, 4L,
       newAppComplete.lastModified(), "test", true))
-    list(1) should be (ApplicationHistoryInfo(oldAppComplete.getName(), "old-app-complete", 2L, 3L,
+    list(1) should be (ApplicationHistoryInfo(newAppCompressedComplete.getName(),
--- End diff --

yeah, I just realized this independently. I fixed this in my latest commit.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25647732
  
--- Diff: core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala ---
@@ -89,16 +105,18 @@ class FsHistoryProviderSuite extends FunSuite with BeforeAndAfter with Matchers
 
     val list = provider.getListing().toSeq
     list should not be (null)
-    list.size should be (4)
-    list.count(e => e.completed) should be (2)
+    list.size should be (5)
+    list.count(_.completed) should be (3)
 
     list(0) should be (ApplicationHistoryInfo(newAppComplete.getName(), "new-app-complete", 1L, 4L,
       newAppComplete.lastModified(), "test", true))
-    list(1) should be (ApplicationHistoryInfo(oldAppComplete.getName(), "old-app-complete", 2L, 3L,
+    list(1) should be (ApplicationHistoryInfo(newAppCompressedComplete.getName(),
--- End diff --

however that still doesn't explain why it passes locally but fails remotely.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25641491
  
--- Diff: core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala ---
@@ -89,16 +105,18 @@ class FsHistoryProviderSuite extends FunSuite with BeforeAndAfter with Matchers
 
     val list = provider.getListing().toSeq
     list should not be (null)
-    list.size should be (4)
-    list.count(e => e.completed) should be (2)
+    list.size should be (5)
+    list.count(_.completed) should be (3)
 
     list(0) should be (ApplicationHistoryInfo(newAppComplete.getName(), "new-app-complete", 1L, 4L,
       newAppComplete.lastModified(), "test", true))
-    list(1) should be (ApplicationHistoryInfo(oldAppComplete.getName(), "old-app-complete", 2L, 3L,
+    list(1) should be (ApplicationHistoryInfo(newAppCompressedComplete.getName(),
--- End diff --

I think your problem is that entries `1` and `2` both have the same start and end time, so their sort order is non-deterministic. Also, `newAppCompressedComplete` has end time `4` in the code a few lines above. So it seems your test code has a couple of issues; this is not really flakiness.
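A quick sketch of the non-determinism being described (the `AppInfo` case class and values below are made up for illustration, not the actual `ApplicationHistoryInfo`): when two entries share the same sort key, sorting by that key alone leaves their relative order dependent on the input order, while adding a secondary key makes the listing deterministic.

```scala
// Hypothetical stand-in for ApplicationHistoryInfo; only the fields that
// matter for ordering are modeled.
case class AppInfo(id: String, startTime: Long, endTime: Long)

object SortSketch {
  // Two apps with identical start and end times, as in the flaky test.
  val apps = Seq(AppInfo("b-app", 2L, 3L), AppInfo("a-app", 2L, 3L))

  // Sorting by end time alone cannot distinguish the two entries, so their
  // relative order depends on the underlying listing order. Adding the id
  // as a tie-breaker removes the ambiguity.
  val deterministic: Seq[AppInfo] = apps.sortBy(a => (-a.endTime, a.id))
}
```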





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76831455
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28182/





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76831439
  
[Test build #28182 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28182/consoleFull) for PR 4821 at commit [`654883d`](https://github.com/apache/spark/commit/654883dfbd65de162455601c69108dab4e354f7d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76821354
  
>  flakiness of FsHistoryProviderSuite

Do you have any theories for why this would be related to the log's 
modification time? The only place where it's used in FsHistoryProvider is for 
filtering entries in `checkForLogs`, and it's already a conservative check (it 
uses `>=` so that it picks up any new logs that show up with the same mod time 
as the last check, at the expense of re-parsing logs that haven't changed).
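A minimal sketch of the conservative check being described (the names below are stand-ins, not the actual `FsHistoryProvider` code): filtering with `>=` against the last-seen modification time may re-parse an unchanged log, but it never misses a log that arrived within the same timestamp granularity as the previous check.

```scala
object ModTimeFilter {
  private var lastModTime = 0L

  // Pick up every log whose mod time is >= the last check. The >= is
  // deliberate: a log written in the same timestamp tick as the previous
  // scan is still seen, at the cost of re-parsing unchanged logs.
  def newLogs(logs: Seq[(String, Long)]): Seq[String] = {
    val picked = logs.collect { case (name, mtime) if mtime >= lastModTime => name }
    if (logs.nonEmpty) lastModTime = logs.map(_._2).max
    picked
  }
}
```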





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76815718
  
[Test build #28182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28182/consoleFull) for PR 4821 at commit [`654883d`](https://github.com/apache/spark/commit/654883dfbd65de162455601c69108dab4e354f7d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76815711
  
By the way I gained first-hand experience of the flakiness of 
`FsHistoryProviderSuite` from this patch. The tests passed locally but failed 
here, and I suspect that it has something to do with the modification time not 
varying enough between the file writes, though this is something we should fix 
separately.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25628566
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -71,6 +74,21 @@ private[spark] object CompressionCodec {
       s"Consider setting $configKey=$FALLBACK_COMPRESSION_CODEC"))
   }
 
+  /**
+   * Return the short version of the given codec name.
+   * If it is already a short name, just return it.
+   */
+  def getShortName(codecName: String): String = {
+    if (shortCompressionCodecNames.contains(codecName)) {
+      codecName
+    } else {
+      shortCompressionCodecNames
+        .collect { case (k, v) if v == codecName => k }
+        .headOption
--- End diff --

thanks, that's exactly what I want
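For readers following along, a self-contained sketch of the reverse-lookup pattern in the diff above (the map contents here are illustrative; Spark's actual `shortCompressionCodecNames` lives in `CompressionCodec`, and the error handling is an assumption):

```scala
object CodecNames {
  // Illustrative short-name -> class-name map, mirroring the shape of
  // Spark's shortCompressionCodecNames.
  val shortCompressionCodecNames: Map[String, String] = Map(
    "lz4" -> "org.apache.spark.io.LZ4CompressionCodec",
    "lzf" -> "org.apache.spark.io.LZFCompressionCodec",
    "snappy" -> "org.apache.spark.io.SnappyCompressionCodec")

  /**
   * Return the short version of the given codec name.
   * If it is already a short name, just return it.
   */
  def getShortName(codecName: String): String = {
    if (shortCompressionCodecNames.contains(codecName)) {
      codecName
    } else {
      // Reverse lookup: find the short key whose class name matches.
      shortCompressionCodecNames
        .collect { case (k, v) if v == codecName => k }
        .headOption
        .getOrElse(throw new IllegalArgumentException(s"No short name for codec $codecName"))
    }
  }
}
```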





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25628525
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -111,19 +119,17 @@ private[spark] class EventLoggingListener(
       hadoopDataStream.get
     }
 
-    val compressionCodec =
-      if (shouldCompress) {
-        Some(CompressionCodec.createCodec(sparkConf))
-      } else {
-        None
-      }
-
-    fileSystem.setPermission(path, LOG_FILE_PERMISSIONS)
-    val logStream = initEventLog(new BufferedOutputStream(dstream, outputBufferSize),
-      compressionCodec)
-    writer = Some(new PrintWriter(logStream))
-
-    logInfo("Logging events to %s".format(logPath))
+    try {
+      val cstream = compressionCodec.map(_.compressedOutputStream(dstream)).getOrElse(dstream)
+      val bstream = new BufferedOutputStream(cstream, outputBufferSize)
+      fileSystem.setPermission(path, LOG_FILE_PERMISSIONS)
+      writer = Some(new PrintWriter(bstream))
--- End diff --

again, I'm just trying to preserve functionality here. If it's a real issue 
I would prefer to fix it separately, but in this patch I'm trying to minimize 
the scope of the changes.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76787929
  
[Test build #28180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28180/consoleFull) for PR 4821 at commit [`7f537cd`](https://github.com/apache/spark/commit/7f537cda6e68f922889f5b95e2869ec770f038a8).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76785685
  
@pwendell yes, that's what I'm trying to say. Even today, logs generated by 
Spark 1.0, 1.1 and 1.2 (and soon 1.3) are not exactly the same - events have 
different properties. The metadata allows the parsing code to know what to 
expect.

(That is not a problem internally in Spark because the code handles all 
added/changed event properties using optionals and things like that, but 
there's no way to know whether we'll make a change in the future that prevents 
code like that from working without knowing the version of the data.)





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25628232
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -111,19 +119,17 @@ private[spark] class EventLoggingListener(
       hadoopDataStream.get
     }
 
-    val compressionCodec =
-      if (shouldCompress) {
-        Some(CompressionCodec.createCodec(sparkConf))
-      } else {
-        None
-      }
-
-    fileSystem.setPermission(path, LOG_FILE_PERMISSIONS)
-    val logStream = initEventLog(new BufferedOutputStream(dstream, outputBufferSize),
-      compressionCodec)
-    writer = Some(new PrintWriter(logStream))
-
-    logInfo("Logging events to %s".format(logPath))
+    try {
+      val cstream = compressionCodec.map(_.compressedOutputStream(dstream)).getOrElse(dstream)
+      val bstream = new BufferedOutputStream(cstream, outputBufferSize)
+      fileSystem.setPermission(path, LOG_FILE_PERMISSIONS)
+      writer = Some(new PrintWriter(bstream))
--- End diff --

Ah, sorry, there doesn't seem to be one for `(OutputStream, String)`. Sigh. 
Do you need a PrintWriter here? Can you use a regular `Writer` instead?
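To illustrate the alternative being suggested (a sketch, not the actual patch code): wrapping the buffered stream in an `OutputStreamWriter` with an explicit charset avoids relying on the platform-default encoding that `new PrintWriter(OutputStream)` uses. The `ByteArrayOutputStream` below is a stand-in for the real file/Hadoop output stream.

```scala
import java.io.{BufferedOutputStream, ByteArrayOutputStream, OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets

object WriterSketch {
  // Stand-in for the event log's underlying output stream.
  val out = new ByteArrayOutputStream()
  val bstream = new BufferedOutputStream(out)

  // Explicit UTF-8 instead of the platform default encoding.
  val writer = new PrintWriter(new OutputStreamWriter(bstream, StandardCharsets.UTF_8))

  def writeEvent(json: String): Unit = {
    writer.println(json)
    writer.flush() // flushes through the buffered stream to `out`
  }
}
```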





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76782633
  
@vanzin so just to be clear - you are anticipating a future case where we 
need to read the version to correctly parse the logs? Is that the argument for 
it? I am only making a simple observation that here we write something and 
never read it.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25626361
  
--- Diff: core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala ---
@@ -45,18 +44,26 @@ class FsHistoryProviderSuite extends FunSuite with BeforeAndAfter with Matchers
     Utils.deleteRecursively(testDir)
   }
 
+  /** Create a fake log file using the new log format used in Spark 1.3+ */
+  private def newLogFile(appId: String, inProgress: Boolean = false): File = {
--- End diff --

Ok. I'm just preserving functionality here but it's not too expensive to 
add tests for that too.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76780546
  
@pwendell the header is needed because it contains potentially useful 
information for the code parsing the logs. For example, now it contains the 
Spark version, which might be needed to tell the parsing code which properties 
to expect in the logs.

The original version of the change (the one that got rid of the directories 
and used a single file) encoded all metadata in the file name. The feedback was 
that it was ugly (long, cryptic file names) and brittle, since if you change 
the file name, you lose that information. I agree with that and thus the header 
was born.

Now we're back to encoding metadata in the file name. A simple extension is 
not too bad, though, especially since you can probably figure out the 
compression codec by looking at the first few bytes of the file. But the 
header still provides useful information.
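
The "first few bytes" idea might look like the sketch below. Note the magic prefixes here are illustrative assumptions, not verified constants from the LZF or Snappy libraries; check each codec's actual framing format before relying on them.

```scala
// Sketch: infer a compression codec from a log file's first bytes.
// The magic prefixes are placeholders for illustration only.
object CodecSniffer {
  private val magics: Seq[(Array[Byte], String)] = Seq(
    "ZV".getBytes("UTF-8") -> "lzf",                       // assumed LZF chunk prefix
    "\u0082SNAPPY".getBytes("ISO-8859-1") -> "snappy")     // assumed Snappy header

  /** Returns the short codec name, or None for an uncompressed/unknown stream. */
  def sniff(head: Array[Byte]): Option[String] =
    magics.collectFirst {
      case (magic, name)
          if head.length >= magic.length && head.take(magic.length).sameElements(magic) =>
        name
    }
}
```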

So I'm a little worried that the latest patch removes the metadata 
completely, especially since it's common for the first event of the log to 
*not* be the one that contains the Spark version 
(`SparkListenerEnvironmentUpdate`?), and instead be 
`SparkListenerBlockManagerAdded`.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25626186
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -111,19 +119,17 @@ private[spark] class EventLoggingListener(
 hadoopDataStream.get
   }
 
-val compressionCodec =
-  if (shouldCompress) {
-Some(CompressionCodec.createCodec(sparkConf))
-  } else {
-None
-  }
-
-fileSystem.setPermission(path, LOG_FILE_PERMISSIONS)
-val logStream = initEventLog(new BufferedOutputStream(dstream, 
outputBufferSize),
-  compressionCodec)
-writer = Some(new PrintWriter(logStream))
-
-logInfo("Logging events to %s".format(logPath))
+try {
+  val cstream = 
compressionCodec.map(_.compressedOutputStream(dstream)).getOrElse(dstream)
+  val bstream = new BufferedOutputStream(cstream, outputBufferSize)
+  fileSystem.setPermission(path, LOG_FILE_PERMISSIONS)
+  writer = Some(new PrintWriter(bstream))
--- End diff --


http://docs.oracle.com/javase/6/docs/api/java/io/PrintWriter.html#PrintWriter(java.lang.String,%20java.lang.String)

Without that, you're writing data in the "default platform encoding".





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25625447
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
 ---
@@ -45,18 +44,26 @@ class FsHistoryProviderSuite extends FunSuite with 
BeforeAndAfter with Matchers
 Utils.deleteRecursively(testDir)
   }
 
+  /** Create a fake log file using the new log format used in Spark 1.3+ */
+  private def newLogFile(appId: String, inProgress: Boolean = false): File 
= {
--- End diff --

Hmm... this method doesn't account for the compression codec. Maybe it 
should, and we should add a compressed, new-style event log to the "Parse new 
and old application logs" test.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25625518
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -111,19 +119,17 @@ private[spark] class EventLoggingListener(
 hadoopDataStream.get
   }
 
-val compressionCodec =
-  if (shouldCompress) {
-Some(CompressionCodec.createCodec(sparkConf))
-  } else {
-None
-  }
-
-fileSystem.setPermission(path, LOG_FILE_PERMISSIONS)
-val logStream = initEventLog(new BufferedOutputStream(dstream, 
outputBufferSize),
-  compressionCodec)
-writer = Some(new PrintWriter(logStream))
-
-logInfo("Logging events to %s".format(logPath))
+try {
+  val cstream = 
compressionCodec.map(_.compressedOutputStream(dstream)).getOrElse(dstream)
+  val bstream = new BufferedOutputStream(cstream, outputBufferSize)
+  fileSystem.setPermission(path, LOG_FILE_PERMISSIONS)
+  writer = Some(new PrintWriter(bstream))
--- End diff --

that signature doesn't exist. Also this was moved from L124 so it's 
probably fine.
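
For reference, one way to get an explicit encoding without that constructor is to wrap the stream in an `OutputStreamWriter`. This is a sketch of the idea, not the code that was merged; the sample event line is made up.

```scala
import java.io.{ByteArrayOutputStream, OutputStreamWriter, PrintWriter}

// PrintWriter has no (OutputStream, charset-name) constructor, but wrapping
// the stream in an OutputStreamWriter pins the encoding explicitly instead
// of falling back to the platform default.
val out = new ByteArrayOutputStream()
val writer = new PrintWriter(new OutputStreamWriter(out, "UTF-8"))
writer.println("{\"Event\":\"SparkListenerApplicationStart\"}")
writer.flush()
val line = new String(out.toByteArray, "UTF-8")
```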





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25625161
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
 ---
@@ -45,18 +44,26 @@ class FsHistoryProviderSuite extends FunSuite with 
BeforeAndAfter with Matchers
 Utils.deleteRecursively(testDir)
   }
 
+  /** Create a fake log file using the new log format used in Spark 1.3+ */
+  private def newLogFile(appId: String, inProgress: Boolean = false): File 
= {
+val ip = if (inProgress) EventLoggingListener.IN_PROGRESS else ""
+val logUri = EventLoggingListener.getLogPath(testDir.getAbsolutePath, 
appId)
+val logPath = new URI(logUri).getPath + ip
+new File(logPath)
+  }
+
   test("Parse new and old application logs") {
 val provider = new FsHistoryProvider(createTestConf())
 
 // Write a new-style application log.
-val newAppComplete = new File(testDir, "new1")
+val newAppComplete = newLogFile("new1", inProgress = false)
--- End diff --

If you're going to pass the default value explicitly, might be better to 
not make it an optional parameter.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25625032
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -279,52 +253,17 @@ private[spark] object EventLoggingListener extends 
Logging {
 }
 
 val in = new BufferedInputStream(fs.open(log))
-// Read a single line from the input stream without buffering.
-// We cannot use BufferedReader because we must avoid reading
-// beyond the end of the header, after which the content of the
-// file may be compressed.
-def readLine(): String = {
-  val bytes = new ByteArrayOutputStream()
-  var next = in.read()
-  var count = 0
-  while (next != '\n') {
-if (next == -1) {
-  throw new IOException("Unexpected end of file.")
-}
-bytes.write(next)
-count = count + 1
-if (count > MAX_HEADER_LINE_LENGTH) {
-  throw new IOException("Maximum header line length exceeded.")
-}
-next = in.read()
-  }
-  new String(bytes.toByteArray(), Charsets.UTF_8)
+
+// Compression codec is encoded as an extension, e.g. app_123.lzf
+// Since we sanitize the app ID to not include periods, it is safe to 
split on it
+val logName = log.getName.replaceAll(IN_PROGRESS, "")
--- End diff --

`replaceAll` takes a regex, so you need to `Pattern.quote(IN_PROGRESS)`. In 
fact, it might be better to use `stripSuffix`.
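
A quick illustration of the difference (the `.inprogress` suffix mirrors Spark's `IN_PROGRESS` constant; the app names here are made up):

```scala
import java.util.regex.Pattern

val IN_PROGRESS = ".inprogress"

// replaceAll treats its pattern as a regex: the leading "." matches any
// character, so unrelated names can be mangled.
val mangled = "app-1Xinprogress".replaceAll(IN_PROGRESS, "")        // becomes "app-1"

// Quoting the pattern, or using stripSuffix (a literal, end-anchored
// comparison), only removes the intended suffix.
val quoted  = "app-1.inprogress".replaceAll(Pattern.quote(IN_PROGRESS), "")
val literal = "app-1.inprogress".stripSuffix(IN_PROGRESS)
val intact  = "app-1Xinprogress".stripSuffix(IN_PROGRESS)           // unchanged
```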





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25624653
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -111,19 +119,17 @@ private[spark] class EventLoggingListener(
 hadoopDataStream.get
   }
 
-val compressionCodec =
-  if (shouldCompress) {
-Some(CompressionCodec.createCodec(sparkConf))
-  } else {
-None
-  }
-
-fileSystem.setPermission(path, LOG_FILE_PERMISSIONS)
-val logStream = initEventLog(new BufferedOutputStream(dstream, 
outputBufferSize),
-  compressionCodec)
-writer = Some(new PrintWriter(logStream))
-
-logInfo("Logging events to %s".format(logPath))
+try {
+  val cstream = 
compressionCodec.map(_.compressedOutputStream(dstream)).getOrElse(dstream)
+  val bstream = new BufferedOutputStream(cstream, outputBufferSize)
+  fileSystem.setPermission(path, LOG_FILE_PERMISSIONS)
+  writer = Some(new PrintWriter(bstream))
--- End diff --

`new PrintWriter(bstream, "UTF-8")`





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25624492
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -71,6 +74,21 @@ private[spark] object CompressionCodec {
   s"Consider setting $configKey=$FALLBACK_COMPRESSION_CODEC"))
   }
 
+  /**
+   * Return the short version of the given codec name.
+   * If it is already a short name, just return it.
+   */
+  def getShortName(codecName: String): String = {
+if (shortCompressionCodecNames.contains(codecName)) {
+  codecName
+} else {
+  shortCompressionCodecNames
+.collect { case (k, v) if v == codecName => k }
+.headOption
--- End diff --

`collectFirst`?
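
The suggestion amounts to something like this sketch (the map contents are abbreviated from Spark's `shortCompressionCodecNames`):

```scala
// Forward map from short names to codec class names, as in CompressionCodec.
val shortCompressionCodecNames = Map(
  "lzf" -> "org.apache.spark.io.LZFCompressionCodec",
  "snappy" -> "org.apache.spark.io.SnappyCompressionCodec")

// collectFirst stops at the first matching entry, instead of building the
// intermediate collection that collect(...).headOption does.
def getShortName(codecName: String): String =
  if (shortCompressionCodecNames.contains(codecName)) codecName
  else shortCompressionCodecNames
    .collectFirst { case (short, full) if full == codecName => short }
    .getOrElse(sys.error(s"No short name for codec $codecName"))
```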





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25624317
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -320,30 +320,24 @@ private[history] class FsHistoryProvider(conf: 
SparkConf) extends ApplicationHis
* log file (along with other metadata files), which is the case for 
directories generated by
* the code in previous releases.
*
-   * @return 2-tuple of (input stream of the events, version of Spark 
which wrote the log)
+   * @return input stream that holds one JSON serialized event per line
--- End diff --

nit: one JSON *record*





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76673355
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28159/
Test PASSed.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76673346
  
  [Test build #28159 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28159/consoleFull)
 for   PR 4821 at commit 
[`7d6aa61`](https://github.com/apache/spark/commit/7d6aa61c59a784d6f62fee94c9d7f4c6e0f501f9).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76669704
  
  [Test build #28158 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28158/consoleFull)
 for   PR 4821 at commit 
[`59abee9`](https://github.com/apache/spark/commit/59abee953d785f8d6b4f0320eb5c5e91715c35a3).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class MatrixFactorizationModel(JavaModelWrapper, Saveable, 
JavaLoader):`
  * `class Saveable(object):`
  * `class Loader(object):`
  * `class JavaLoader(Loader):`
  * `java_class = ".".join([java_package, cls.__name__])`






[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76669712
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28158/
Test PASSed.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76668874
  
  [Test build #28157 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28157/consoleFull)
 for   PR 4821 at commit 
[`40e1ff6`](https://github.com/apache/spark/commit/40e1ff6398be8a55fe834e7d9c705d4ea31a520a).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76668879
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28157/
Test PASSed.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-7392
  
  [Test build #28159 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28159/consoleFull)
 for   PR 4821 at commit 
[`7d6aa61`](https://github.com/apache/spark/commit/7d6aa61c59a784d6f62fee94c9d7f4c6e0f501f9).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76663956
  
  [Test build #28158 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28158/consoleFull)
 for   PR 4821 at commit 
[`59abee9`](https://github.com/apache/spark/commit/59abee953d785f8d6b4f0320eb5c5e91715c35a3).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76663334
  
As of the latest commit I have removed the Spark version and the metadata. 
Please have a look.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25579983
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/ReplayListenerBus.scala ---
@@ -39,10 +39,9 @@ private[spark] class ReplayListenerBus extends 
SparkListenerBus with Logging {
* error is thrown by this method.
*
* @param logData Stream containing event log data.
-   * @param version Spark version that generated the events.
* @param sourceName Filename (or other source identifier) from whence 
@logData is being read
*/
-  def replay(logData: InputStream, version: String, sourceName: String) {
+  def replay(logData: InputStream, sourceName: String): Unit = {
--- End diff --

note that changing this signature was necessary, since the new format can 
no longer provide a Spark version that makes sense.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25579955
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -322,28 +322,22 @@ private[history] class FsHistoryProvider(conf: 
SparkConf) extends ApplicationHis
*
--- End diff --

Actually I guess this handles multiple versions. My bad, fine to leave as is





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25579899
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -322,28 +322,22 @@ private[history] class FsHistoryProvider(conf: 
SparkConf) extends ApplicationHis
*
* @return 2-tuple of (input stream of the events, version of Spark 
which wrote the log)
--- End diff --

This is outdated now





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25579888
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -322,28 +322,22 @@ private[history] class FsHistoryProvider(conf: 
SparkConf) extends ApplicationHis
*
--- End diff --

This doc might be better if it said "in Spark 1.2.X and earlier" instead of 
"in previous releases".





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76663276
  
  [Test build #28157 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28157/consoleFull)
 for   PR 4821 at commit 
[`40e1ff6`](https://github.com/apache/spark/commit/40e1ff6398be8a55fe834e7d9c705d4ea31a520a).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25579333
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/ApplicationDescription.scala ---
@@ -23,7 +23,9 @@ private[spark] class ApplicationDescription(
 val memoryPerSlave: Int,
 val command: Command,
 var appUiUrl: String,
-val eventLogDir: Option[String] = None)
+val sparkVersion: String,
--- End diff --

just adding an extra field seems like a solution that involves a smaller 
change, but let me know if you'd like me to change it.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25577300
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/ApplicationDescription.scala ---
@@ -23,7 +23,9 @@ private[spark] class ApplicationDescription(
 val memoryPerSlave: Int,
 val command: Command,
 var appUiUrl: String,
-val eventLogDir: Option[String] = None)
+val sparkVersion: String,
--- End diff --

It's because at this point the `SparkContext` doesn't have an App ID yet. 
If we want to pass the name of the file we'll need to have some placeholder 
(e.g. `{{APP_ID}}_lzf`).





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25577271
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -217,57 +223,67 @@ private[spark] object EventLoggingListener extends 
Logging {
   /**
* Write metadata about the event log to the given stream.
*
-   * The header is a serialized version of a map, except it does not use Java serialization to
-   * avoid incompatibilities between different JDKs. It writes one map entry per line, in
-   * "key=value" format.
-   *
-   * The very last entry in the header is the `HEADER_END_MARKER` marker, so that the parsing code
-   * can know when to stop.
-   *
-   * The format needs to be kept in sync with the openEventLog() method below. Also, it cannot
-   * change in new Spark versions without some other way of detecting the change (like some
-   * metadata encoded in the file name).
+   * The header is a single line of JSON in the beginning of the file. Note that this
+   * assumes all metadata necessary to parse the log is also included in the file name.
+   * The format needs to be kept in sync with the `openEventLog()` method below. Also, it
+   * cannot change in new Spark versions without some other way of detecting the change.
*
-   * @param logStream Raw output stream to the even log file.
+   * @param logStream Raw output stream to the event log file.
* @param compressionCodec Optional compression codec to use.
-   * @return A stream where to write event log data. This may be a wrapper around the original
+   * @return A stream to which event log data is written. This may be a wrapper around the original
* stream (for example, when compression is enabled).
*/
   def initEventLog(
   logStream: OutputStream,
   compressionCodec: Option[CompressionCodec]): OutputStream = {
-val meta = mutable.HashMap(("version" -> SPARK_VERSION))
+val metadata = new mutable.HashMap[String, String]
+// Some of these metadata are already encoded in the file name
+// Here we include them again within the file itself for completeness
+metadata += ("Event" -> Utils.getFormattedClassName(SparkListenerMetadataIdentifier))
+metadata += (SPARK_VERSION_KEY -> SPARK_VERSION)
 compressionCodec.foreach { codec =>
-  meta += ("compressionCodec" -> codec.getClass().getName())
+  metadata += (COMPRESSION_CODEC_KEY -> codec.getClass.getCanonicalName)
 }
-
-def write(entry: String) = {
-  val bytes = entry.getBytes(Charsets.UTF_8)
-  if (bytes.length > MAX_HEADER_LINE_LENGTH) {
-throw new IOException(s"Header entry too long: ${entry}")
-  }
-  logStream.write(bytes, 0, bytes.length)
+val metadataJson = compact(render(JsonProtocol.mapToJson(metadata)))
+val metadataBytes = (metadataJson + "\n").getBytes(Charsets.UTF_8)
+if (metadataBytes.length > MAX_HEADER_LINE_LENGTH) {
+  throw new IOException(s"Event log metadata too long: $metadataJson")
 }
-
-meta.foreach { case (k, v) => write(s"$k=$v\n") }
-write(s"$HEADER_END_MARKER\n")
-compressionCodec.map(_.compressedOutputStream(logStream)).getOrElse(logStream)
+logStream.write(metadataBytes, 0, metadataBytes.length)
+logStream
   }
 
   /**
* Return a file-system-safe path to the log file for the given application.
*
+   * Note that because we currently only create a single log file for each application,
+   * we must encode all the information needed to parse this event log in the file name
+   * instead of within the file itself. Otherwise, if the file is compressed, for instance,
+   * we won't know which codec to use to decompress the metadata.
+   *
* @param logBaseDir Directory where the log file will be written.
* @param appId A unique app ID.
+   * @param compressionCodecName Name of the compression codec used to compress the contents
+   * of the log, or None if compression is not enabled.
* @return A path which consists of file-system-safe characters.
*/
-  def getLogPath(logBaseDir: String, appId: String): String = {
-val name = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", "_").toLowerCase
-Utils.resolveURI(logBaseDir) + "/" + name.stripSuffix("/")
+  def getLogPath(
+  logBaseDir: String,
+  appId: String,
+  compressionCodecName: Option[String] = None): String = {
+val sanitizedAppId = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", "_").toLowerCase
+// e.g. EVENT_LOG_app_123_SPARK_VERSION_1.3.1
--- End diff --

you
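
To make the header change in the diff above concrete: the file now starts with one plain-text JSON line instead of `key=value` pairs terminated by `HEADER_END_MARKER`. A minimal stand-alone reader might look like the Python sketch below. The key strings (`"Spark Version"`, `"Compression Codec"`) are assumptions standing in for the real `SPARK_VERSION_KEY` and `COMPRESSION_CODEC_KEY` constants, and the length cap is an assumed value mirroring the write-side `MAX_HEADER_LINE_LENGTH` check.

```python
import io
import json

# Assumed key names; the actual values of SPARK_VERSION_KEY and
# COMPRESSION_CODEC_KEY live in EventLoggingListener and may differ.
SPARK_VERSION_KEY = "Spark Version"
COMPRESSION_CODEC_KEY = "Compression Codec"
MAX_HEADER_LINE_LENGTH = 4096  # assumed cap, mirroring the write-side check

def read_event_log_header(stream):
    """Read the single-line JSON metadata header at the top of an event log."""
    line = stream.readline(MAX_HEADER_LINE_LENGTH).decode("utf-8")
    metadata = json.loads(line)
    return {
        "spark_version": metadata.get(SPARK_VERSION_KEY),
        # None means the rest of the file is uncompressed
        "codec": metadata.get(COMPRESSION_CODEC_KEY),
    }

# The header line stays uncompressed even when the events that follow are
# compressed, which is why the codec must be discoverable up front.
log = io.BytesIO(
    b'{"Event": "SparkListenerMetadataIdentifier", '
    b'"Spark Version": "1.3.0", '
    b'"Compression Codec": "org.apache.spark.io.LZFCompressionCodec"}\n'
)
header = read_event_log_header(log)
```

The write side's length check (the `IOException` in the diff) is what lets a reader bound its `readline` with the same cap.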

[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76641955
  
I took a pass on this with some feedback. Overall, it would be good to 
really minimize the scope of the changes since this is so late in the game. 
There is some clean-up and renaming etc that would be best just left out of the 
patch.

The main thing I'm wondering is why we need this header at all. It doesn't 
even ever get used by our own replay - we just ignore it. It seems like it was 
added for the purpose of conveying the compression codec to bootstrap replaying 
the file, however just having an extension seems like a better, much more 
standard way of doing that. The only argument I see for it is that the header 
could be used in the future to encode things that are necessary for proper 
replay of the logs. However, in that case I don't see why we can't just add it 
later if and when those things occur.

I guess I don't see a good argument against a straw man of just not having 
the header. Curious to hear thoughts from @andrewor14 and @vanzin.
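
The extension-based alternative suggested here can be sketched as follows. The extension-to-codec mapping is purely hypothetical (the thread does not settle on actual names), though the `.inprogress` suffix is the real `IN_PROGRESS` constant from `EventLoggingListener`.

```python
import posixpath

# Hypothetical extensions; the discussion does not fix the actual names.
CODEC_BY_EXTENSION = {
    ".lzf": "org.apache.spark.io.LZFCompressionCodec",
    ".snappy": "org.apache.spark.io.SnappyCompressionCodec",
    "": None,  # no extension means the log is uncompressed
}

def codec_for_log(path):
    """Choose a decompression codec from the log file name alone,
    with no in-file header required."""
    name = posixpath.basename(path)
    if name.endswith(".inprogress"):  # IN_PROGRESS suffix for live logs
        name = name[: -len(".inprogress")]
    _, ext = posixpath.splitext(name)
    try:
        return CODEC_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"Unknown event log extension: {ext!r}")
```

This is the "more standard" route the comment argues for: the bootstrap information lives entirely in the name, and nothing in the file needs to be readable before a codec is chosen.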


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25574126
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala ---
@@ -99,6 +99,12 @@ case class SparkListenerExecutorRemoved(time: Long, 
executorId: String, reason:
   extends SparkListenerEvent
 
 /**
+ * A special dummy event used to identify the metadata header in event 
logs.
+ * This is not actually posted anywhere.
+ */
+private[spark] case object SparkListenerMetadataIdentifier extends 
SparkListenerEvent
--- End diff --

Since this is never used, why don't we instead just ignore this event type 
when reading from JSON (i.e. allow sparkEventFromJson to return None). It seems 
a bit strange to add this event but it contains nothing that's ever used.
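
The alternative proposed here (have the reader return None for unrecognized event types rather than define a dummy event) can be sketched like this; `KNOWN_EVENTS` is an illustrative subset, not the real dispatch table in `JsonProtocol.sparkEventFromJson`.

```python
import json

# Illustrative subset; the real reader dispatches on many more event types.
KNOWN_EVENTS = {"SparkListenerApplicationStart", "SparkListenerApplicationEnd"}

def spark_event_from_json(line):
    """Parse one event-log line, returning None for unrecognized event
    types so the metadata header is silently skipped during replay."""
    event = json.loads(line)
    return event if event.get("Event") in KNOWN_EVENTS else None

lines = [
    '{"Event": "SparkListenerMetadataIdentifier", "Spark Version": "1.3.0"}',
    '{"Event": "SparkListenerApplicationStart"}',
]
events = [e for e in map(spark_event_from_json, lines) if e is not None]
```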





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25573931
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/ApplicationDescription.scala ---
@@ -23,7 +23,9 @@ private[spark] class ApplicationDescription(
 val memoryPerSlave: Int,
 val command: Command,
 var appUiUrl: String,
-val eventLogDir: Option[String] = None)
+val sparkVersion: String,
--- End diff --

What about just passing a single option `eventLogFile` here (i.e. the full 
path to the log file). The master only uses the eventLogDir, sparkVersion, 
eventLogCodec to reconstruct the filename. Reconstructing it in two places 
seems a little brittle.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25573801
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -217,57 +223,67 @@ private[spark] object EventLoggingListener extends 
Logging {
   /**
* Write metadata about the event log to the given stream.
*
-   * The header is a serialized version of a map, except it does not use 
Java serialization to
-   * avoid incompatibilities between different JDKs. It writes one map 
entry per line, in
-   * "key=value" format.
-   *
-   * The very last entry in the header is the `HEADER_END_MARKER` marker, 
so that the parsing code
-   * can know when to stop.
-   *
-   * The format needs to be kept in sync with the openEventLog() method 
below. Also, it cannot
-   * change in new Spark versions without some other way of detecting the 
change (like some
-   * metadata encoded in the file name).
+   * The header is a single line of JSON in the beginning of the file. 
Note that this
+   * assumes all metadata necessary to parse the log is also included in 
the file name.
+   * The format needs to be kept in sync with the `openEventLog()` method 
below. Also, it
+   * cannot change in new Spark versions without some other way of 
detecting the change.
*
-   * @param logStream Raw output stream to the even log file.
+   * @param logStream Raw output stream to the event log file.
* @param compressionCodec Optional compression codec to use.
-   * @return A stream where to write event log data. This may be a wrapper 
around the original
+   * @return A stream to which event log data is written. This may be a 
wrapper around the original
* stream (for example, when compression is enabled).
*/
   def initEventLog(
   logStream: OutputStream,
   compressionCodec: Option[CompressionCodec]): OutputStream = {
-val meta = mutable.HashMap(("version" -> SPARK_VERSION))
+val metadata = new mutable.HashMap[String, String]
+// Some of these metadata are already encoded in the file name
+// Here we include them again within the file itself for completeness
+metadata += ("Event" -> 
Utils.getFormattedClassName(SparkListenerMetadataIdentifier))
+metadata += (SPARK_VERSION_KEY -> SPARK_VERSION)
 compressionCodec.foreach { codec =>
-  meta += ("compressionCodec" -> codec.getClass().getName())
+  metadata += (COMPRESSION_CODEC_KEY -> 
codec.getClass.getCanonicalName)
 }
-
-def write(entry: String) = {
-  val bytes = entry.getBytes(Charsets.UTF_8)
-  if (bytes.length > MAX_HEADER_LINE_LENGTH) {
-throw new IOException(s"Header entry too long: ${entry}")
-  }
-  logStream.write(bytes, 0, bytes.length)
+val metadataJson = compact(render(JsonProtocol.mapToJson(metadata)))
+val metadataBytes = (metadataJson + "\n").getBytes(Charsets.UTF_8)
+if (metadataBytes.length > MAX_HEADER_LINE_LENGTH) {
+  throw new IOException(s"Event log metadata too long: $metadataJson")
 }
-
-meta.foreach { case (k, v) => write(s"$k=$v\n") }
-write(s"$HEADER_END_MARKER\n")
-
compressionCodec.map(_.compressedOutputStream(logStream)).getOrElse(logStream)
+logStream.write(metadataBytes, 0, metadataBytes.length)
+logStream
   }
 
   /**
* Return a file-system-safe path to the log file for the given 
application.
*
+   * Note that because we currently only create a single log file for each 
application,
+   * we must encode all the information needed to parse this event log in 
the file name
+   * instead of within the file itself. Otherwise, if the file is 
compressed, for instance,
+   * we won't know which codec to use to decompress the metadata.
+   *
* @param logBaseDir Directory where the log file will be written.
* @param appId A unique app ID.
+   * @param compressionCodecName Name of the compression codec used to 
compress the contents
+   * of the log, or None if compression is not 
enabled.
* @return A path which consists of file-system-safe characters.
*/
-  def getLogPath(logBaseDir: String, appId: String): String = {
-val name = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", 
"_").toLowerCase
-Utils.resolveURI(logBaseDir) + "/" + name.stripSuffix("/")
+  def getLogPath(
+  logBaseDir: String,
+  appId: String,
+  compressionCodecName: Option[String] = None): String = {
+val sanitizedAppId = appId.replaceAll("[ :/]", 
"-").replaceAll("[${}'\"]", "_").toLowerCase
+// e.g. EVENT_LOG_app_123_SPARK_VERSION_1.3.1
+// e.g. EVENT_LOG_ {...} 

[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-03-01 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25573776
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -217,57 +223,67 @@ private[spark] object EventLoggingListener extends 
Logging {
   /**
* Write metadata about the event log to the given stream.
*
-   * The header is a serialized version of a map, except it does not use 
Java serialization to
-   * avoid incompatibilities between different JDKs. It writes one map 
entry per line, in
-   * "key=value" format.
-   *
-   * The very last entry in the header is the `HEADER_END_MARKER` marker, 
so that the parsing code
-   * can know when to stop.
-   *
-   * The format needs to be kept in sync with the openEventLog() method 
below. Also, it cannot
-   * change in new Spark versions without some other way of detecting the 
change (like some
-   * metadata encoded in the file name).
+   * The header is a single line of JSON in the beginning of the file. 
Note that this
+   * assumes all metadata necessary to parse the log is also included in 
the file name.
+   * The format needs to be kept in sync with the `openEventLog()` method 
below. Also, it
+   * cannot change in new Spark versions without some other way of 
detecting the change.
*
-   * @param logStream Raw output stream to the even log file.
+   * @param logStream Raw output stream to the event log file.
* @param compressionCodec Optional compression codec to use.
-   * @return A stream where to write event log data. This may be a wrapper 
around the original
+   * @return A stream to which event log data is written. This may be a 
wrapper around the original
* stream (for example, when compression is enabled).
*/
   def initEventLog(
   logStream: OutputStream,
   compressionCodec: Option[CompressionCodec]): OutputStream = {
-val meta = mutable.HashMap(("version" -> SPARK_VERSION))
+val metadata = new mutable.HashMap[String, String]
+// Some of these metadata are already encoded in the file name
+// Here we include them again within the file itself for completeness
+metadata += ("Event" -> 
Utils.getFormattedClassName(SparkListenerMetadataIdentifier))
+metadata += (SPARK_VERSION_KEY -> SPARK_VERSION)
 compressionCodec.foreach { codec =>
-  meta += ("compressionCodec" -> codec.getClass().getName())
+  metadata += (COMPRESSION_CODEC_KEY -> 
codec.getClass.getCanonicalName)
 }
-
-def write(entry: String) = {
-  val bytes = entry.getBytes(Charsets.UTF_8)
-  if (bytes.length > MAX_HEADER_LINE_LENGTH) {
-throw new IOException(s"Header entry too long: ${entry}")
-  }
-  logStream.write(bytes, 0, bytes.length)
+val metadataJson = compact(render(JsonProtocol.mapToJson(metadata)))
+val metadataBytes = (metadataJson + "\n").getBytes(Charsets.UTF_8)
+if (metadataBytes.length > MAX_HEADER_LINE_LENGTH) {
+  throw new IOException(s"Event log metadata too long: $metadataJson")
 }
-
-meta.foreach { case (k, v) => write(s"$k=$v\n") }
-write(s"$HEADER_END_MARKER\n")
-
compressionCodec.map(_.compressedOutputStream(logStream)).getOrElse(logStream)
+logStream.write(metadataBytes, 0, metadataBytes.length)
+logStream
   }
 
   /**
* Return a file-system-safe path to the log file for the given 
application.
*
+   * Note that because we currently only create a single log file for each 
application,
+   * we must encode all the information needed to parse this event log in 
the file name
+   * instead of within the file itself. Otherwise, if the file is 
compressed, for instance,
+   * we won't know which codec to use to decompress the metadata.
+   *
* @param logBaseDir Directory where the log file will be written.
* @param appId A unique app ID.
+   * @param compressionCodecName Name of the compression codec used to 
compress the contents
+   * of the log, or None if compression is not 
enabled.
* @return A path which consists of file-system-safe characters.
*/
-  def getLogPath(logBaseDir: String, appId: String): String = {
-val name = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", 
"_").toLowerCase
-Utils.resolveURI(logBaseDir) + "/" + name.stripSuffix("/")
+  def getLogPath(
+  logBaseDir: String,
+  appId: String,
+  compressionCodecName: Option[String] = None): String = {
+val sanitizedAppId = appId.replaceAll("[ :/]", 
"-").replaceAll("[${}'\"]", "_").toLowerCase
+// e.g. EVENT_LOG_app_123_SPARK_VERSION_1.3.1
--- End diff --

I don

[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76507612
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28103/
Test PASSed.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76507606
  
  [Test build #28103 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28103/consoleFull)
 for   PR 4821 at commit 
[`519e51a`](https://github.com/apache/spark/commit/519e51a958b40d193327e85b659e1df767041f55).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76505881
  
  [Test build #28102 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28102/consoleFull)
 for   PR 4821 at commit 
[`ef69276`](https://github.com/apache/spark/commit/ef692768db319d3159ce9522d625cede3505e161).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76505885
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28102/
Test FAILed.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76503962
  
  [Test build #28103 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28103/consoleFull)
 for   PR 4821 at commit 
[`519e51a`](https://github.com/apache/spark/commit/519e51a958b40d193327e85b659e1df767041f55).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25552458
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -110,17 +117,12 @@ private[spark] class EventLoggingListener(
 hadoopDataStream = Some(fileSystem.create(path))
 hadoopDataStream.get
   }
-
-val compressionCodec =
-  if (shouldCompress) {
-Some(CompressionCodec.createCodec(sparkConf))
-  } else {
-None
-  }
+val cstream = 
compressionCodec.map(_.compressedOutputStream(dstream)).getOrElse(dstream)
--- End diff --

that's fine. I fixed this in my latest commit.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25552366
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -217,53 +219,60 @@ private[spark] object EventLoggingListener extends 
Logging {
   /**
* Write metadata about the event log to the given stream.
*
-   * The header is a serialized version of a map, except it does not use 
Java serialization to
-   * avoid incompatibilities between different JDKs. It writes one map 
entry per line, in
-   * "key=value" format.
-   *
-   * The very last entry in the header is the `HEADER_END_MARKER` marker, 
so that the parsing code
-   * can know when to stop.
+   * The header is a single line of JSON in the beginning of the file. 
Note that this
+   * assumes all metadata necessary to parse the log is also included in 
the file name.
+   * The format needs to be kept in sync with the `openEventLog()` method 
below. Also, it
+   * cannot change in new Spark versions without some other way of 
detecting the change.
*
-   * The format needs to be kept in sync with the openEventLog() method 
below. Also, it cannot
-   * change in new Spark versions without some other way of detecting the 
change (like some
-   * metadata encoded in the file name).
-   *
-   * @param logStream Raw output stream to the even log file.
+   * @param logStream Raw output stream to the event log file.
* @param compressionCodec Optional compression codec to use.
-   * @return A stream where to write event log data. This may be a wrapper 
around the original
+   * @return A stream to which event log data is written. This may be a 
wrapper around the original
* stream (for example, when compression is enabled).
*/
   def initEventLog(
   logStream: OutputStream,
   compressionCodec: Option[CompressionCodec]): OutputStream = {
-val meta = mutable.HashMap(("version" -> SPARK_VERSION))
+val metadata = new mutable.HashMap[String, String]
+// Some of these metadata are already encoded in the file name
+// Here we include them again within the file itself for completeness
+metadata += ("Event" -> 
Utils.getFormattedClassName(SparkListenerMetadataIdentifier))
+metadata += (SPARK_VERSION_KEY -> SPARK_VERSION)
 compressionCodec.foreach { codec =>
-  meta += ("compressionCodec" -> codec.getClass().getName())
+  metadata += (COMPRESSION_CODEC_KEY -> 
codec.getClass.getCanonicalName)
 }
-
-def write(entry: String) = {
-  val bytes = entry.getBytes(Charsets.UTF_8)
-  if (bytes.length > MAX_HEADER_LINE_LENGTH) {
-throw new IOException(s"Header entry too long: ${entry}")
-  }
-  logStream.write(bytes, 0, bytes.length)
+val metadataJson = compact(render(JsonProtocol.mapToJson(metadata)))
+val metadataBytes = (metadataJson + "\n").getBytes(Charsets.UTF_8)
+if (metadataBytes.length > MAX_HEADER_LINE_LENGTH) {
+  throw new IOException(s"Event log metadata too long: $metadataJson")
 }
-
-meta.foreach { case (k, v) => write(s"$k=$v\n") }
-write(s"$HEADER_END_MARKER\n")
-
compressionCodec.map(_.compressedOutputStream(logStream)).getOrElse(logStream)
+logStream.write(metadataBytes, 0, metadataBytes.length)
+logStream
   }
 
   /**
* Return a file-system-safe path to the log file for the given 
application.
*
+   * Note that because we currently only create a single log file for each 
application,
+   * we must encode all the information needed to parse this event log in 
the file name
+   * instead of within the file itself. Otherwise, if the file is 
compressed, for instance,
+   * we won't know which codec to use to decompress the metadata.
+   *
* @param logBaseDir Directory where the log file will be written.
* @param appId A unique app ID.
+   * @param compressionCodecName Name of the compression codec used to 
compress the contents
+   * of the log, or None if compression is not 
enabled.
* @return A path which consists of file-system-safe characters.
*/
-  def getLogPath(logBaseDir: String, appId: String): String = {
-val name = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", 
"_").toLowerCase
-Utils.resolveURI(logBaseDir) + "/" + name.stripSuffix("/")
+  def getLogPath(
+  logBaseDir: String,
+  appId: String,
+  compressionCodecName: Option[String]): String = {
+val sanitizedAppId = appId.replaceAll("[ :/]", 
"-").replaceAll("[${}'\"]", "_").toLowerCase
+// e.g. EVENT_LOG_app_123_SPARK_VERSION_1.3.1
+// e.g. EVENT_LOG_ {...} 
_COM

[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76502888
  
  [Test build #28102 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28102/consoleFull)
 for   PR 4821 at commit 
[`ef69276`](https://github.com/apache/spark/commit/ef692768db319d3159ce9522d625cede3505e161).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25552104
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -217,53 +219,60 @@ private[spark] object EventLoggingListener extends 
Logging {
   /**
* Write metadata about the event log to the given stream.
*
-   * The header is a serialized version of a map, except it does not use 
Java serialization to
-   * avoid incompatibilities between different JDKs. It writes one map 
entry per line, in
-   * "key=value" format.
-   *
-   * The very last entry in the header is the `HEADER_END_MARKER` marker, 
so that the parsing code
-   * can know when to stop.
+   * The header is a single line of JSON in the beginning of the file. 
Note that this
+   * assumes all metadata necessary to parse the log is also included in 
the file name.
+   * The format needs to be kept in sync with the `openEventLog()` method 
below. Also, it
+   * cannot change in new Spark versions without some other way of 
detecting the change.
*
-   * The format needs to be kept in sync with the openEventLog() method 
below. Also, it cannot
-   * change in new Spark versions without some other way of detecting the 
change (like some
-   * metadata encoded in the file name).
-   *
-   * @param logStream Raw output stream to the even log file.
+   * @param logStream Raw output stream to the event log file.
* @param compressionCodec Optional compression codec to use.
-   * @return A stream where to write event log data. This may be a wrapper 
around the original
+   * @return A stream to which event log data is written. This may be a 
wrapper around the original
* stream (for example, when compression is enabled).
*/
   def initEventLog(
   logStream: OutputStream,
   compressionCodec: Option[CompressionCodec]): OutputStream = {
-val meta = mutable.HashMap(("version" -> SPARK_VERSION))
+val metadata = new mutable.HashMap[String, String]
+// Some of these metadata are already encoded in the file name
+// Here we include them again within the file itself for completeness
+metadata += ("Event" -> 
Utils.getFormattedClassName(SparkListenerMetadataIdentifier))
+metadata += (SPARK_VERSION_KEY -> SPARK_VERSION)
 compressionCodec.foreach { codec =>
-  meta += ("compressionCodec" -> codec.getClass().getName())
+  metadata += (COMPRESSION_CODEC_KEY -> 
codec.getClass.getCanonicalName)
 }
-
-def write(entry: String) = {
-  val bytes = entry.getBytes(Charsets.UTF_8)
-  if (bytes.length > MAX_HEADER_LINE_LENGTH) {
-throw new IOException(s"Header entry too long: ${entry}")
-  }
-  logStream.write(bytes, 0, bytes.length)
+val metadataJson = compact(render(JsonProtocol.mapToJson(metadata)))
+val metadataBytes = (metadataJson + "\n").getBytes(Charsets.UTF_8)
+if (metadataBytes.length > MAX_HEADER_LINE_LENGTH) {
+  throw new IOException(s"Event log metadata too long: $metadataJson")
 }
-
-meta.foreach { case (k, v) => write(s"$k=$v\n") }
-write(s"$HEADER_END_MARKER\n")
-
compressionCodec.map(_.compressedOutputStream(logStream)).getOrElse(logStream)
+logStream.write(metadataBytes, 0, metadataBytes.length)
+logStream
   }
 
   /**
* Return a file-system-safe path to the log file for the given 
application.
*
+   * Note that because we currently only create a single log file for each 
application,
+   * we must encode all the information needed to parse this event log in 
the file name
+   * instead of within the file itself. Otherwise, if the file is 
compressed, for instance,
+   * we won't know which codec to use to decompress the metadata.
+   *
* @param logBaseDir Directory where the log file will be written.
* @param appId A unique app ID.
+   * @param compressionCodecName Name of the compression codec used to 
compress the contents
+   * of the log, or None if compression is not 
enabled.
* @return A path which consists of file-system-safe characters.
*/
-  def getLogPath(logBaseDir: String, appId: String): String = {
-val name = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", 
"_").toLowerCase
-Utils.resolveURI(logBaseDir) + "/" + name.stripSuffix("/")
+  def getLogPath(
+  logBaseDir: String,
+  appId: String,
+  compressionCodecName: Option[String]): String = {
+val sanitizedAppId = appId.replaceAll("[ :/]", 
"-").replaceAll("[${}'\"]", "_").toLowerCase
+// e.g. EVENT_LOG_app_123_SPARK_VERSION_1.3.1
+// e.g. EVENT_LOG_ {...} 
_COMPRES

[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25552069
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -202,11 +204,11 @@ private[spark] object EventLoggingListener extends 
Logging {
   val IN_PROGRESS = ".inprogress"
   val DEFAULT_LOG_DIR = "/tmp/spark-events"
 
-  private val LOG_FILE_PERMISSIONS = new 
FsPermission(Integer.parseInt("770", 8).toShort)
+  val EVENT_LOG_KEY = "EVENT_LOG"
--- End diff --

that's fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25552048
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -217,53 +219,60 @@ private[spark] object EventLoggingListener extends 
Logging {
   /**
* Write metadata about the event log to the given stream.
*
-   * The header is a serialized version of a map, except it does not use 
Java serialization to
-   * avoid incompatibilities between different JDKs. It writes one map 
entry per line, in
-   * "key=value" format.
-   *
-   * The very last entry in the header is the `HEADER_END_MARKER` marker, 
so that the parsing code
-   * can know when to stop.
+   * The header is a single line of JSON in the beginning of the file. 
Note that this
+   * assumes all metadata necessary to parse the log is also included in 
the file name.
+   * The format needs to be kept in sync with the `openEventLog()` method 
below. Also, it
+   * cannot change in new Spark versions without some other way of 
detecting the change.
*
-   * The format needs to be kept in sync with the openEventLog() method 
below. Also, it cannot
-   * change in new Spark versions without some other way of detecting the 
change (like some
-   * metadata encoded in the file name).
-   *
-   * @param logStream Raw output stream to the even log file.
+   * @param logStream Raw output stream to the event log file.
* @param compressionCodec Optional compression codec to use.
-   * @return A stream where to write event log data. This may be a wrapper 
around the original
+   * @return A stream to which event log data is written. This may be a 
wrapper around the original
* stream (for example, when compression is enabled).
*/
   def initEventLog(
   logStream: OutputStream,
   compressionCodec: Option[CompressionCodec]): OutputStream = {
-val meta = mutable.HashMap(("version" -> SPARK_VERSION))
+val metadata = new mutable.HashMap[String, String]
+// Some of these metadata are already encoded in the file name
+// Here we include them again within the file itself for completeness
+metadata += ("Event" -> 
Utils.getFormattedClassName(SparkListenerMetadataIdentifier))
+metadata += (SPARK_VERSION_KEY -> SPARK_VERSION)
 compressionCodec.foreach { codec =>
-  meta += ("compressionCodec" -> codec.getClass().getName())
+  metadata += (COMPRESSION_CODEC_KEY -> 
codec.getClass.getCanonicalName)
 }
-
-def write(entry: String) = {
-  val bytes = entry.getBytes(Charsets.UTF_8)
-  if (bytes.length > MAX_HEADER_LINE_LENGTH) {
-throw new IOException(s"Header entry too long: ${entry}")
-  }
-  logStream.write(bytes, 0, bytes.length)
+val metadataJson = compact(render(JsonProtocol.mapToJson(metadata)))
+val metadataBytes = (metadataJson + "\n").getBytes(Charsets.UTF_8)
+if (metadataBytes.length > MAX_HEADER_LINE_LENGTH) {
+  throw new IOException(s"Event log metadata too long: $metadataJson")
 }
-
-meta.foreach { case (k, v) => write(s"$k=$v\n") }
-write(s"$HEADER_END_MARKER\n")
-
compressionCodec.map(_.compressedOutputStream(logStream)).getOrElse(logStream)
+logStream.write(metadataBytes, 0, metadataBytes.length)
+logStream
   }
 
   /**
* Return a file-system-safe path to the log file for the given 
application.
*
+   * Note that because we currently only create a single log file for each 
application,
+   * we must encode all the information needed to parse this event log in 
the file name
+   * instead of within the file itself. Otherwise, if the file is 
compressed, for instance,
+   * we won't know which codec to use to decompress the metadata.
+   *
* @param logBaseDir Directory where the log file will be written.
* @param appId A unique app ID.
+   * @param compressionCodecName Name of the compression codec used to 
compress the contents
+   * of the log, or None if compression is not 
enabled.
* @return A path which consists of file-system-safe characters.
*/
-  def getLogPath(logBaseDir: String, appId: String): String = {
-val name = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", 
"_").toLowerCase
-Utils.resolveURI(logBaseDir) + "/" + name.stripSuffix("/")
+  def getLogPath(
+  logBaseDir: String,
+  appId: String,
+  compressionCodecName: Option[String]): String = {
+val sanitizedAppId = appId.replaceAll("[ :/]", 
"-").replaceAll("[${}'\"]", "_").toLowerCase
+// e.g. EVENT_LOG_app_123_SPARK_VERSION_1.3.1
+// e.g. EVENT_LOG_ {...} 
_COM

[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76497543
  
  [Test build #28098 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28098/consoleFull)
 for   PR 4821 at commit 
[`8db5a06`](https://github.com/apache/spark/commit/8db5a06d108d8a2ddb8460e48e3509f46cc4fc2f).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76497550
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28098/
Test FAILed.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76496914
  
Looks reasonable, will wait for tests.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25549402
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -217,53 +219,60 @@ private[spark] object EventLoggingListener extends 
Logging {
   /**
* Write metadata about the event log to the given stream.
*
-   * The header is a serialized version of a map, except it does not use 
Java serialization to
-   * avoid incompatibilities between different JDKs. It writes one map 
entry per line, in
-   * "key=value" format.
-   *
-   * The very last entry in the header is the `HEADER_END_MARKER` marker, 
so that the parsing code
-   * can know when to stop.
+   * The header is a single line of JSON in the beginning of the file. 
Note that this
+   * assumes all metadata necessary to parse the log is also included in 
the file name.
+   * The format needs to be kept in sync with the `openEventLog()` method 
below. Also, it
+   * cannot change in new Spark versions without some other way of 
detecting the change.
*
-   * The format needs to be kept in sync with the openEventLog() method 
below. Also, it cannot
-   * change in new Spark versions without some other way of detecting the 
change (like some
-   * metadata encoded in the file name).
-   *
-   * @param logStream Raw output stream to the even log file.
+   * @param logStream Raw output stream to the event log file.
* @param compressionCodec Optional compression codec to use.
-   * @return A stream where to write event log data. This may be a wrapper 
around the original
+   * @return A stream to which event log data is written. This may be a 
wrapper around the original
* stream (for example, when compression is enabled).
*/
   def initEventLog(
   logStream: OutputStream,
   compressionCodec: Option[CompressionCodec]): OutputStream = {
-val meta = mutable.HashMap(("version" -> SPARK_VERSION))
+val metadata = new mutable.HashMap[String, String]
+// Some of these metadata are already encoded in the file name
+// Here we include them again within the file itself for completeness
+metadata += ("Event" -> 
Utils.getFormattedClassName(SparkListenerMetadataIdentifier))
+metadata += (SPARK_VERSION_KEY -> SPARK_VERSION)
 compressionCodec.foreach { codec =>
-  meta += ("compressionCodec" -> codec.getClass().getName())
+  metadata += (COMPRESSION_CODEC_KEY -> 
codec.getClass.getCanonicalName)
 }
-
-def write(entry: String) = {
-  val bytes = entry.getBytes(Charsets.UTF_8)
-  if (bytes.length > MAX_HEADER_LINE_LENGTH) {
-throw new IOException(s"Header entry too long: ${entry}")
-  }
-  logStream.write(bytes, 0, bytes.length)
+val metadataJson = compact(render(JsonProtocol.mapToJson(metadata)))
+val metadataBytes = (metadataJson + "\n").getBytes(Charsets.UTF_8)
+if (metadataBytes.length > MAX_HEADER_LINE_LENGTH) {
+  throw new IOException(s"Event log metadata too long: $metadataJson")
 }
-
-meta.foreach { case (k, v) => write(s"$k=$v\n") }
-write(s"$HEADER_END_MARKER\n")
-
compressionCodec.map(_.compressedOutputStream(logStream)).getOrElse(logStream)
+logStream.write(metadataBytes, 0, metadataBytes.length)
+logStream
   }
 
   /**
* Return a file-system-safe path to the log file for the given 
application.
*
+   * Note that because we currently only create a single log file for each 
application,
+   * we must encode all the information needed to parse this event log in 
the file name
+   * instead of within the file itself. Otherwise, if the file is 
compressed, for instance,
+   * we won't know which codec to use to decompress the metadata.
+   *
* @param logBaseDir Directory where the log file will be written.
* @param appId A unique app ID.
+   * @param compressionCodecName Name of the compression codec used to 
compress the contents
+   * of the log, or None if compression is not 
enabled.
* @return A path which consists of file-system-safe characters.
*/
-  def getLogPath(logBaseDir: String, appId: String): String = {
-val name = appId.replaceAll("[ :/]", "-").replaceAll("[${}'\"]", 
"_").toLowerCase
-Utils.resolveURI(logBaseDir) + "/" + name.stripSuffix("/")
+  def getLogPath(
+  logBaseDir: String,
+  appId: String,
+  compressionCodecName: Option[String]): String = {
+val sanitizedAppId = appId.replaceAll("[ :/]", 
"-").replaceAll("[${}'\"]", "_").toLowerCase
+// e.g. EVENT_LOG_app_123_SPARK_VERSION_1.3.1
+// e.g. EVENT_LOG_ {...} 
_COMPRES

[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25549234
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -202,11 +204,11 @@ private[spark] object EventLoggingListener extends 
Logging {
   val IN_PROGRESS = ".inprogress"
   val DEFAULT_LOG_DIR = "/tmp/spark-events"
 
-  private val LOG_FILE_PERMISSIONS = new 
FsPermission(Integer.parseInt("770", 8).toShort)
+  val EVENT_LOG_KEY = "EVENT_LOG"
--- End diff --

I think the file name would be less of a mouthful without this prefix. But 
not a big deal.
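The prefix under discussion is part of the file-name encoding scheme from the diff: because the log is a single (possibly compressed) file, the app ID, Spark version, and codec all get packed into the path itself. A rough sketch of that scheme in Python, for illustration; the sanitization regexes mirror the `replaceAll` calls in `getLogPath`, but the exact separators and key names here are assumptions, not the final Spark format.

```python
import re


def sanitize_app_id(app_id):
    # Mirror getLogPath's replaceAll calls: spaces, colons, and slashes
    # become dashes; shell-unfriendly characters become underscores.
    s = re.sub(r"[ :/]", "-", app_id)
    s = re.sub(r"[${}'\"]", "_", s)
    return s.lower()


def build_log_name(app_id, spark_version, codec_name=None):
    # e.g. EVENT_LOG_app_123_SPARK_VERSION_1.3.1 (see the comment in the diff);
    # a _COMPRESSION_CODEC_<name> suffix is appended only when compression is on.
    name = "EVENT_LOG_%s_SPARK_VERSION_%s" % (
        sanitize_app_id(app_id), spark_version)
    if codec_name is not None:
        name += "_COMPRESSION_CODEC_%s" % codec_name
    return name
```

This makes the trade-off in the javadoc concrete: the reader can pick the right decompression codec from the name alone, before opening the file.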





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4821#discussion_r25549140
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -110,17 +117,12 @@ private[spark] class EventLoggingListener(
 hadoopDataStream = Some(fileSystem.create(path))
 hadoopDataStream.get
   }
-
-val compressionCodec =
-  if (shouldCompress) {
-Some(CompressionCodec.createCodec(sparkConf))
-  } else {
-None
-  }
+val cstream = 
compressionCodec.map(_.compressedOutputStream(dstream)).getOrElse(dstream)
--- End diff --

I know this is not part of this change, but it feels like this whole method 
needs a try..finally to avoid leaking an open `dstream` when some exception is 
thrown.
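The leak being described is easy to sketch: if wrapping the raw stream in a compressed stream throws, the already-opened raw stream is never closed. A minimal illustration of the close-on-failure pattern in Python; `open_raw` and `wrap` are hypothetical stand-ins for the Hadoop file-create and compression-codec calls, not real Spark APIs.

```python
def open_event_log_stream(open_raw, wrap=None):
    """Open the raw stream, then wrap it; close the raw stream if wrapping fails."""
    raw = open_raw()
    try:
        return wrap(raw) if wrap is not None else raw
    except Exception:
        # Without this, an exception in wrap() would leak the open raw stream.
        raw.close()
        raise
```

In Scala the same shape would be a `try`/`catch` around the `compressedOutputStream` call that closes `dstream` before rethrowing.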





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76494032
  
Tests will fail to compile; I'm fixing this in the meantime.





[GitHub] spark pull request: [SPARK-6066] Make event log format easier to p...

2015-02-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4821#issuecomment-76493745
  
  [Test build #28098 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28098/consoleFull)
 for   PR 4821 at commit 
[`8db5a06`](https://github.com/apache/spark/commit/8db5a06d108d8a2ddb8460e48e3509f46cc4fc2f).
 * This patch merges cleanly.

