Re: SequenceFile and object reuse
Hey Jeff, in addition to what Sandy said, there are two more reasons that this might not be as bad as it seems; I may be incorrect in my understanding though.

First, the "additional step" you're referring to is not likely to be adding any overhead; the "extra map" is really just materializing the data once (as opposed to zero times), which is what you want (assuming your access pattern couldn't be reformulated in the way Sandy described, i.e. where all the objects in a partition don't need to be in memory at the same time).

Secondly, even if this were an "extra" map step, it wouldn't add any extra stages to a given pipeline, being a "narrow" dependency, so it would likely be low-cost anyway.

Let me know if any of the above seems incorrect, thanks!

On Thu, Nov 19, 2015 at 12:41 AM Sandy Ryza wrote:

> Hi Jeff,
>
> Many access patterns simply take the result of hadoopFile and use it to create some other object, and thus have no need for each input record to refer to a different object. In those cases, the current API is more performant than an alternative that would create an object for each record, because it avoids the unnecessary overhead of creating Java objects. As you've pointed out, this is at the expense of making the code more verbose when caching.
>
> -Sandy
>
> On Fri, Nov 13, 2015 at 10:29 AM, jeff saremi wrote:
>
>> So we tried reading a SequenceFile in Spark and realized that all our records have ended up becoming the same. Then one of us found this:
>>
>> Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function.
>>
>> Is there anyone that can shed some light on this bizarre behavior and the decisions behind it? And I also would like to know if anyone's able to read a binary file and not incur the additional map() as suggested by the above? What format did you use?
>>
>> thanks
>> Jeff
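To make the "extra map" concrete, here's a minimal sketch of copying (or converting) the reused Writables before caching; the path and key/value types below are just placeholders for illustration:

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.spark.SparkContext

    // Hadoop's RecordReader hands back the same Writable instances for every record,
    // so make per-record copies (or convert to immutable types) before caching/sorting.
    def loadAndCache(sc: SparkContext) = {
      val raw = sc.sequenceFile("hdfs:///path/to/data", classOf[Text], classOf[IntWritable])

      // Option 1: fresh Writable copies per record.
      val copied = raw.map { case (k, v) => (new Text(k), new IntWritable(v.get)) }

      // Option 2 (often simpler): convert to plain Scala types.
      val converted = raw.map { case (k, v) => (k.toString, v.get) }

      converted.cache()
    }

Either way, that map is the one-time materialization described above, not an additional pass over the data.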
Spree: a live-updating web UI for Spark
Probably relevant to people on this list: on Friday I released a clone of the Spark web UI built using Meteor https://www.meteor.com/ so that everything updates in real-time, saving you from endlessly refreshing the page while jobs are running :) It can also serve as the UI for running as well as completed applications, so you don't have to mess with the separate history-server process if you don't want to.

*This blog post* http://www.hammerlab.org/2015/07/25/spree-58-a-live-updating-web-ui-for-spark/ and *the github repo* https://github.com/hammerlab/spree have lots of information on how to use it.

It has two sub-components, JsonRelay https://github.com/hammerlab/spark-json-relay and Slim https://github.com/hammerlab/slim; the former sends SparkListenerEvents https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L33 out of Spark via a network socket, while the latter receives those events and writes various stats about them to Mongo (like an external JobProgressListener https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala that persists info to a database). You might find them to offer a better way of storing information about running and completed Spark applications than the event log files that Spark uses, and they can be used with or without the real-time web UI.

Give them a try if they sound useful to you, and let me know if you have questions or comments!

-Ryan
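If you want to try wiring it into an app, the hookup is essentially just registering an extra listener at startup; a rough sketch follows (the listener class name below is an assumption for illustration only; check the JsonRelay README for the exact name and for how to point it at the host/port Slim listens on):

    import org.apache.spark.SparkConf

    // spark.extraListeners (Spark 1.3+) registers additional SparkListeners by class name,
    // which is how a relay like this can see SparkListenerEvents without patching Spark.
    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.extraListeners", "org.apache.spark.JsonRelay") // assumed class name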
Re: Re: spark 1.3.1 jars in repo1.maven.org
Thanks so much Shixiong! This is great.

On Tue, Jun 2, 2015 at 8:26 PM Shixiong Zhu zsxw...@gmail.com wrote:

Ryan - I sent a PR to fix your issue: https://github.com/apache/spark/pull/6599

Edward - I have no idea why the following error happened. ContextCleaner doesn't use any Hadoop API. Could you try Spark 1.3.0? It's supposed to support both hadoop 1 and hadoop 2.

* Exception in thread Spark Context Cleaner java.lang.NoClassDefFoundError: 0
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)

Best Regards,
Shixiong Zhu

2015-06-03 0:08 GMT+08:00 Ryan Williams ryan.blake.willi...@gmail.com:

I think this is causing issues upgrading ADAM https://github.com/bigdatagenomics/adam to Spark 1.3.1 (cf. adam#690 https://github.com/bigdatagenomics/adam/pull/690#issuecomment-107769383); attempting to build against Hadoop 1.0.4 yields errors like:

2015-06-02 15:57:44 ERROR Executor:96 - Exception in task 0.0 in stage 0.0 (TID 0)
*java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected*
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

2015-06-02 15:57:44 WARN TaskSetManager:71 - Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

TaskAttemptContext is a class in Hadoop 1.0.4, but an interface in Hadoop 2; Spark 1.3.1 expects the interface but is getting the class. It sounds like, while I *can* depend on Spark 1.3.1 and Hadoop 1.0.4, I then need to hope that I don't exercise certain Spark code paths that run afoul of differences between Hadoop 1 and 2; does that seem correct? Thanks!

On Wed, May 20, 2015 at 1:52 PM Sean Owen so...@cloudera.com wrote:

I don't think any of those problems are related to Hadoop. Have you looked at userClassPathFirst settings?

On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson ejsa...@gmail.com wrote:

Hi Sean and Ted,

Thanks for your replies.
I don't have our current problems nicely written up as good questions yet. I'm still sorting out classpath issues, etc. In case it is of help, I'm seeing:

* Exception in thread Spark Context Cleaner java.lang.NoClassDefFoundError: 0
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)
* We've been having clashing dependencies between a colleague and me because of the aforementioned classpath issue
* The clashing dependencies are also causing issues with which Jetty libraries are available in the classloader from Spark without clashing with existing libraries we have.

More anon,
Cheers,
Edward

Original Message
Subject: Re: spark 1.3.1 jars in repo1.maven.org
Date: 2015-05-20 00:38
From: Sean Owen so...@cloudera.com
To: Edward Sargisson esa...@pobox.com
Cc: user user@spark.apache.org

Yes, the published artifacts can only refer to one version of anything (OK, modulo publishing a large number of variants under classifiers). You aren't intended to rely on Spark's transitive dependencies for anything. Compiling against the Spark API has no relation to what version of Hadoop it binds against, because it's not part of any API. You can even mark the Spark dependency as provided in your build and get all the Spark/Hadoop bindings at runtime from your cluster.
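For what it's worth, the "mark it as provided" suggestion looks roughly like this in an sbt build (version number illustrative):

    // build.sbt sketch: compile against the published Spark API but don't bundle it;
    // the cluster's own Spark (and its Hadoop bindings) supply everything at runtime.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" % "provided"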
Re: Re: spark 1.3.1 jars in repo1.maven.org
I think this is causing issues upgrading ADAM https://github.com/bigdatagenomics/adam to Spark 1.3.1 (cf. adam#690 https://github.com/bigdatagenomics/adam/pull/690#issuecomment-107769383); attempting to build against Hadoop 1.0.4 yields errors like:

2015-06-02 15:57:44 ERROR Executor:96 - Exception in task 0.0 in stage 0.0 (TID 0)
*java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected*
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

2015-06-02 15:57:44 WARN TaskSetManager:71 - Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

TaskAttemptContext is a class in Hadoop 1.0.4, but an interface in Hadoop 2; Spark 1.3.1 expects the interface but is getting the class. It sounds like, while I *can* depend on Spark 1.3.1 and Hadoop 1.0.4, I then need to hope that I don't exercise certain Spark code paths that run afoul of differences between Hadoop 1 and 2; does that seem correct? Thanks!

On Wed, May 20, 2015 at 1:52 PM Sean Owen so...@cloudera.com wrote:

I don't think any of those problems are related to Hadoop. Have you looked at userClassPathFirst settings?

On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson ejsa...@gmail.com wrote:

Hi Sean and Ted,

Thanks for your replies. I don't have our current problems nicely written up as good questions yet. I'm still sorting out classpath issues, etc. In case it is of help, I'm seeing:

* Exception in thread Spark Context Cleaner java.lang.NoClassDefFoundError: 0
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)
* We've been having clashing dependencies between a colleague and me because of the aforementioned classpath issue
* The clashing dependencies are also causing issues with which Jetty libraries are available in the classloader from Spark without clashing with existing libraries we have.
More anon,
Cheers,
Edward

Original Message
Subject: Re: spark 1.3.1 jars in repo1.maven.org
Date: 2015-05-20 00:38
From: Sean Owen so...@cloudera.com
To: Edward Sargisson esa...@pobox.com
Cc: user user@spark.apache.org

Yes, the published artifacts can only refer to one version of anything (OK, modulo publishing a large number of variants under classifiers). You aren't intended to rely on Spark's transitive dependencies for anything. Compiling against the Spark API has no relation to what version of Hadoop it binds against, because it's not part of any API. You can even mark the Spark dependency as provided in your build and get all the Spark/Hadoop bindings at runtime from your cluster. What problem are you experiencing?

On Wed, May 20, 2015 at 2:17 AM, Edward Sargisson esa...@pobox.com wrote:

Hi,

I'd like to confirm an observation I've just made: specifically, that Spark is only available in repo1.maven.org for one Hadoop variant.

The Spark source can be compiled against a number of different Hadoops using profiles. Yay. However, the Spark jars in repo1.maven.org appear to be compiled against one specific Hadoop and no other differentiation is made. (I can see a difference with hadoop-client being 2.2.0 in repo1.maven.org and 1.0.4 in the version I compiled locally.)

The implication here is that if you have a pom file asking for spark-core_2.10 version 1.3.1 then Maven will only give you a Hadoop 2
Monitoring Spark with Graphite and Grafana
If anyone is curious to try exporting Spark metrics to Graphite, I just published a post about my experience doing that, building dashboards in Grafana http://grafana.org/, and using them to monitor Spark jobs: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/

Code for generating Grafana dashboards tailored to the metrics emitted by Spark is here: https://github.com/hammerlab/grafana-spark-dashboards.

If anyone else is interested in working on expanding MetricsSystem to make this sort of thing more useful, let me know; I've been working on it a fair amount and have a bunch of ideas about where it should go.

Thanks,

-Ryan
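For reference, the Graphite side of the hookup is just a few lines in conf/metrics.properties, along the lines of the following (host, port, and prefix are placeholders; see Spark's metrics.properties.template for the full set of options):

    # Send all instances' metrics to Graphite.
    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.example.com
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds
    *.sink.graphite.prefix=spark

    # Optionally export JVM metrics from each instance as well.
    master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource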
Re: Data Loss - Spark streaming
TD's portion seems to start at 27:24: http://youtu.be/jcJq3ZalXD8?t=27m24s

On Tue Dec 16 2014 at 7:13:43 AM Gerard Maas gerard.m...@gmail.com wrote:

Hi Jeniba,

The second part of this meetup recording has a very good answer to your question. TD explains the current behavior and the ongoing work in Spark Streaming to fix HA. https://www.youtube.com/watch?v=jcJq3ZalXD8

-kr, Gerard.

On Tue, Dec 16, 2014 at 11:32 AM, Jeniba Johnson jeniba.john...@lntinfotech.com wrote:

Hi,

I need a clarification: while running the streaming examples, suppose the batch interval is set to 5 minutes. After collecting data from the input source (Flume) and processing for those 5 minutes, what will happen to the data that is still flowing continuously from the input source into Spark Streaming? Will that data be stored somewhere, or will it be lost? If the latter, what is the solution to capture each and every record without any loss in Spark Streaming?

Awaiting your kind reply.

Regards,
Jeniba Johnson
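For anyone looking for the concrete knobs in the meantime: with receiver-based inputs like Flume, the receiver buffers data continuously and the batch interval only controls how often that buffer is turned into a batch, so nothing is dropped between batches in normal operation; the loss scenarios TD discusses are around receiver/driver failure. A rough sketch of the relevant settings (Spark 1.2+; the app name, checkpoint path, and interval below are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    // Checkpointing plus the receiver write-ahead log means data a receiver has already
    // stored can be replayed after a driver restart instead of being lost.
    val conf = new SparkConf()
      .setAppName("flume-ingest")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Minutes(5))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")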
Re: scalac crash when compiling DataTypeConversions.scala
I heard from one person offline who regularly builds Spark on OSX and Linux and they felt like they only ever saw this error on OSX; if anyone can confirm whether they've seen it on Linux, that would be good to know.

Stephen: good to know re: profiles/options. I don't think changing them is a necessary condition as I believe I've run into it without doing that, but any set of steps to reproduce this would be welcome so that we could escalate to Typesafe as appropriate.

On Sun, Oct 26, 2014 at 11:46 PM, Stephen Boesch java...@gmail.com wrote:

Yes it is necessary to do a mvn clean when encountering this issue. Typically you would have changed one or more of the profiles/options - which leads to this occurring.

2014-10-22 22:00 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com:

I started building Spark / running Spark tests this weekend and on maybe 5-10 occasions have run into a compiler crash while compiling DataTypeConversions.scala. Here https://gist.github.com/ryan-williams/7673d7da928570907f4d is a full gist of an innocuous test command (mvn test -Dsuites='*KafkaStreamSuite') exhibiting this behavior. Problem starts on L512 https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L512 and there’s a final stack trace at the bottom https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L671.

mvn clean or ./sbt/sbt clean “fix” it (I believe I’ve observed the issue while compiling with each tool), but are annoying/time-consuming to do, obvs, and it’s happening pretty frequently for me when doing only small numbers of incremental compiles punctuated by e.g. checking out different git commits.

Have other people seen this? This post http://apache-spark-user-list.1001560.n3.nabble.com/spark-github-source-build-error-td10532.html on this list is basically the same error, but in TestSQLContext.scala, and this SO post http://stackoverflow.com/questions/25211071/compilation-errors-in-spark-datatypeconversions-scala-on-intellij-when-using-m claims to be hitting it when trying to build in intellij.

It seems likely to be a bug in scalac; would finding a consistent repro case and filing it somewhere be useful?
scalac crash when compiling DataTypeConversions.scala
I started building Spark / running Spark tests this weekend and on maybe 5-10 occasions have run into a compiler crash while compiling DataTypeConversions.scala. Here https://gist.github.com/ryan-williams/7673d7da928570907f4d is a full gist of an innocuous test command (mvn test -Dsuites='*KafkaStreamSuite') exhibiting this behavior. Problem starts on L512 https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L512 and there’s a final stack trace at the bottom https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L671.

mvn clean or ./sbt/sbt clean “fix” it (I believe I’ve observed the issue while compiling with each tool), but are annoying/time-consuming to do, obvs, and it’s happening pretty frequently for me when doing only small numbers of incremental compiles punctuated by e.g. checking out different git commits.

Have other people seen this? This post http://apache-spark-user-list.1001560.n3.nabble.com/spark-github-source-build-error-td10532.html on this list is basically the same error, but in TestSQLContext.scala, and this SO post http://stackoverflow.com/questions/25211071/compilation-errors-in-spark-datatypeconversions-scala-on-intellij-when-using-m claims to be hitting it when trying to build in intellij.

It seems likely to be a bug in scalac; would finding a consistent repro case and filing it somewhere be useful?