[ https://issues.apache.org/jira/browse/SPARK-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340958#comment-14340958 ]
Marcelo Vanzin edited comment on SPARK-6066 at 2/27/15 10:43 PM:
-----------------------------------------------------------------

Nothing wrong, I just don't see how it's much better. The user trying to read it externally still needs to know that if the file has a certain extension, he needs to use a particular compression codec. And he still needs to understand that the first line, even though it's JSON, is not actually an event but a header, and he needs to understand the contents of that header. (Right now I don't think there's anything particularly interesting there, but at some point there might be; e.g. the Spark version might become important to help understand the rest of the file.)

A library would make all that transparent to this user. Basically something like "java.util.zip.ZipFile", where instead of bytes you have a collection of "ZipEntries" (here you'd have a collection of "SparkListenerEvent").

No strong opinion one way or another, I just think the library is nicer for the end user and more flexible in the long run.
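To make the library idea in the comment above concrete, here is a minimal sketch of what such a reader could look like from an external tool's point of view. It is not Spark's API: the class name `EventLogReader`, the `CODECS_BY_EXTENSION` table, and the use of gzip/bz2 are all assumptions chosen so the example runs with the Python standard library (Spark's own codecs and extensions are not covered here). It also assumes the layout discussed in the comment, i.e. one JSON object per line, with the first line being a header rather than an event.

```python
import bz2
import gzip
import json
import os

# Hypothetical extension -> opener table. These codecs are stand-ins that
# ship with Python; they are NOT the codecs Spark itself uses.
CODECS_BY_EXTENSION = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
}


class EventLogReader:
    """Hides codec selection and the header line from the caller.

    Assumes one JSON object per line, where the first line is a header
    (metadata) and every following non-empty line is an event.
    """

    def __init__(self, path):
        extension = os.path.splitext(path)[1]
        # Fall back to plain text when the extension is not recognized.
        opener = CODECS_BY_EXTENSION.get(extension, open)
        self._file = opener(path, "rt", encoding="utf-8")
        first_line = self._file.readline()
        # The header is parsed once and exposed separately from the events.
        self.header = json.loads(first_line) if first_line.strip() else {}

    def __iter__(self):
        # Yield each remaining non-empty line as a parsed event (a dict).
        for line in self._file:
            line = line.strip()
            if line:
                yield json.loads(line)

    def close(self):
        self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
```

A caller would then write something like `with EventLogReader(path) as log:`, inspect `log.header`, and loop over `log` to get events, without needing to know which codec was used or that the first line is special. That is the same shape as "ZipFile" handing out entries instead of raw bytes.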
> Metadata in event log makes it very difficult for external libraries to parse event log
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-6066
>                 URL: https://issues.apache.org/jira/browse/SPARK-6066
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>            Reporter: Kay Ousterhout
>            Assignee: Andrew Or
>            Priority: Blocker
>
> The fix for SPARK-2261 added a line at the beginning of the event log that
> encodes metadata. This line makes it much more difficult to parse the event
> logs from external libraries (like
> https://github.com/kayousterhout/trace-analysis, which is used by folks at
> Berkeley) because:
> (1) The metadata is not written as JSON, unlike the rest of the file
> (2) More annoyingly, if the file is compressed, the metadata is not
> compressed. This has a few side-effects: first, someone can't just use the
> command line to uncompress the file and then look at the logs, because the
> file is in this weird half-compressed format; and second, now external tools
> that parse these logs also need to deal with this weird format.
> We should fix this before the 1.3 release, because otherwise we'll have to
> add a bunch more backward-compatibility code to handle this weird format!