Re: SequenceFile and object reuse

2015-11-18 Thread Ryan Williams
Hey Jeff, in addition to what Sandy said, there are two more reasons that
this might not be as bad as it seems; I may be incorrect in my
understanding though.

First, the "additional step" you're referring to is unlikely to add any
overhead; the "extra map" is really just materializing the data once
(as opposed to zero times), which is what you want (assuming your access
pattern can't be reformulated in the way Sandy described, i.e. so that all
the objects in a partition don't need to be in memory at the same time).

Second, even if this were an "extra" map step, it wouldn't add any extra
stages to a given pipeline, since it's a "narrow" dependency, so it would
likely be low-cost anyway.
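
For concreteness, here is a minimal sketch of the copy-then-cache pattern,
assuming an existing SparkContext sc and LongWritable/Text records (adjust to
your key/value types):

  import org.apache.hadoop.io.{LongWritable, Text}

  // Hadoop's RecordReader reuses the same Writable instances for every record,
  // so read the file and immediately copy each record into fresh objects.
  val raw = sc.sequenceFile("hdfs:///path/to/data", classOf[LongWritable], classOf[Text])

  // The "extra map" in question: materialize plain Scala values before caching,
  // so the cached RDD doesn't hold many references to one reused object.
  val copied = raw.map { case (k, v) => (k.get, v.toString) }.cache()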

Let me know if any of the above seems incorrect, thanks!

On Thu, Nov 19, 2015 at 12:41 AM Sandy Ryza  wrote:

> Hi Jeff,
>
> Many access patterns simply take the result of hadoopFile and use it to
> create some other object, and thus have no need for each input record to
> refer to a different object.  In those cases, the current API is more
> performant than an alternative that would create an object for each record,
> because it avoids the unnecessary overhead of creating Java objects.  As
> you've pointed out, this is at the expense of making the code more verbose
> when caching.
>
> -Sandy
>
> On Fri, Nov 13, 2015 at 10:29 AM, jeff saremi 
> wrote:
>
>> So we tried reading a SequenceFile in Spark and realized that all our
>> records had ended up being the same.
>> Then one of us found this:
>>
>> Note: Because Hadoop's RecordReader class re-uses the same Writable
>> object for each record, directly caching the returned RDD or directly
>> passing it to an aggregation or shuffle operation will create many
>> references to the same object. If you plan to directly cache, sort, or
>> aggregate Hadoop writable objects, you should first copy them using a map
>> function.
>>
>> Is there anyone who can shed some light on this bizarre behavior and the
>> decisions behind it?
>> I'd also like to know whether anyone has been able to read a binary file
>> without incurring the additional map() suggested above. What format did
>> you use?
>>
>> thanks
>> Jeff
>>
>
>


Spree: a live-updating web UI for Spark

2015-07-27 Thread Ryan Williams
Probably relevant to people on this list: on Friday I released a clone of
the Spark web UI built using Meteor https://www.meteor.com/ so that
everything updates in real-time, saving you from endlessly refreshing the
page while jobs are running :) It can also serve as the UI for running as
well as completed applications, so you don't have to mess with the separate
history-server process if you don't want to.

*This blog post*
http://www.hammerlab.org/2015/07/25/spree-58-a-live-updating-web-ui-for-spark/
and
*the github repo* https://github.com/hammerlab/spree have lots of
information on how to use it.

It has two sub-components, JsonRelay
https://github.com/hammerlab/spark-json-relay and Slim
https://github.com/hammerlab/slim; the former sends SparkListenerEvents
https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L33
out of Spark via a network socket, while the latter receives those events
and writes various stats about them to Mongo (like an external
JobProgressListener
https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala
that persists info to a database). You might find them to offer a better
way of storing information about running and completed Spark applications
than the event log files that Spark uses, and they can be used with or
without the real-time web UI.
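
If you're curious how the relay plugs in: it's built on Spark's SparkListener
interface. Below is only a toy sketch of a listener (a task counter), to show
the shape of the API; it is not JsonRelay's actual implementation, which
serializes the full events and writes them over a socket:

  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

  // Toy example: count finished tasks as events arrive.
  class CountingListener extends SparkListener {
    @volatile var tasksEnded = 0
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      tasksEnded += 1
    }
  }

A listener like this can be registered on a SparkContext with
sc.addSparkListener(new CountingListener), or via the spark.extraListeners
configuration (Spark 1.3+).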

Give them a try if they sound useful to you, and let me know if you have
questions or comments!

-Ryan


Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
Thanks so much Shixiong! This is great.

On Tue, Jun 2, 2015 at 8:26 PM Shixiong Zhu zsxw...@gmail.com wrote:

 Ryan - I sent a PR to fix your issue:
 https://github.com/apache/spark/pull/6599

 Edward - I have no idea why the following error happened. ContextCleaner
 doesn't use any Hadoop API. Could you try Spark 1.3.0? It's supposed to
 support both hadoop 1 and hadoop 2.


 * Exception in thread Spark Context Cleaner
 java.lang.NoClassDefFoundError: 0
 at
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)


 Best Regards,
 Shixiong Zhu

 2015-06-03 0:08 GMT+08:00 Ryan Williams ryan.blake.willi...@gmail.com:

 I think this is causing issues upgrading ADAM
 https://github.com/bigdatagenomics/adam to Spark 1.3.1 (cf. adam#690
 https://github.com/bigdatagenomics/adam/pull/690#issuecomment-107769383);
 attempting to build against Hadoop 1.0.4 yields errors like:

 2015-06-02 15:57:44 ERROR Executor:96 - Exception in task 0.0 in stage
 0.0 (TID 0)
 *java.lang.IncompatibleClassChangeError: Found class
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected*
 at
 org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
 at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
 at
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
 at
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 2015-06-02 15:57:44 WARN  TaskSetManager:71 - Lost task 0.0 in stage 0.0
 (TID 0, localhost): java.lang.IncompatibleClassChangeError: Found class
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
 at
 org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
 at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
 at
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
 at
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

 TaskAttemptContext is a class in Hadoop 1.0.4, but an interface in Hadoop
 2; Spark 1.3.1 expects the interface but is getting the class.

 It sounds like, while I *can* depend on Spark 1.3.1 and Hadoop 1.0.4, I
 then need to hope that I don't exercise certain Spark code paths that run
 afoul of differences between Hadoop 1 and 2; does that seem correct?

 Thanks!

 On Wed, May 20, 2015 at 1:52 PM Sean Owen so...@cloudera.com wrote:

 I don't think any of those problems are related to Hadoop. Have you
 looked at userClassPathFirst settings?
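
(For reference, a minimal sketch of those settings, using the Spark 1.3+
configuration names; older releases used a different, since-deprecated key,
so check the docs for the version you're on:

  val conf = new org.apache.spark.SparkConf()
    .set("spark.driver.userClassPathFirst", "true")   // prefer the app's classes on the driver
    .set("spark.executor.userClassPathFirst", "true") // and on the executors

The same flags can also be passed to spark-submit with --conf.)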

 On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson ejsa...@gmail.com
 wrote:

 Hi Sean and Ted,
 Thanks for your replies.

 I don't have our current problems nicely written up as good questions
 yet. I'm still sorting out classpath issues, etc.
 In case it is of help, I'm seeing:
 * Exception in thread Spark Context Cleaner
 java.lang.NoClassDefFoundError: 0
 at
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)
 * We've been having clashing dependencies between a colleague and me
 because of the aforementioned classpath issue
 * The clashing dependencies are also causing issues around which Jetty
 libraries are available in the classloader from Spark and whether they clash
 with existing libraries we have.

 More anon,

 Cheers,
 Edward



  Original Message 
 Subject: Re: spark 1.3.1 jars in repo1.maven.org
 Date: 2015-05-20 00:38
 From: Sean Owen so...@cloudera.com
 To: Edward Sargisson esa...@pobox.com
 Cc: user user@spark.apache.org


 Yes, the published artifacts can only refer to one version of anything
 (OK, modulo publishing a large number of variants under classifiers).

 You aren't intended to rely on Spark's transitive dependencies for
 anything. Compiling against the Spark API has no relation to what
 version of Hadoop it binds against because it's not part of any API.
 You mark the Spark dependency even as provided in your build and get
 all the Spark/Hadoop bindings at runtime from your cluster

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
I think this is causing issues upgrading ADAM
https://github.com/bigdatagenomics/adam to Spark 1.3.1 (cf. adam#690
https://github.com/bigdatagenomics/adam/pull/690#issuecomment-107769383);
attempting to build against Hadoop 1.0.4 yields errors like:

2015-06-02 15:57:44 ERROR Executor:96 - Exception in task 0.0 in stage 0.0
(TID 0)
*java.lang.IncompatibleClassChangeError: Found class
org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected*
at
org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2015-06-02 15:57:44 WARN  TaskSetManager:71 - Lost task 0.0 in stage 0.0
(TID 0, localhost): java.lang.IncompatibleClassChangeError: Found class
org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
at
org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

TaskAttemptContext is a class in Hadoop 1.0.4, but an interface in Hadoop
2; Spark 1.3.1 expects the interface but is getting the class.

It sounds like, while I *can* depend on Spark 1.3.1 and Hadoop 1.0.4, I
then need to hope that I don't exercise certain Spark code paths that run
afoul of differences between Hadoop 1 and 2; does that seem correct?

Thanks!

On Wed, May 20, 2015 at 1:52 PM Sean Owen so...@cloudera.com wrote:

 I don't think any of those problems are related to Hadoop. Have you looked
 at userClassPathFirst settings?

 On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson ejsa...@gmail.com
 wrote:

 Hi Sean and Ted,
 Thanks for your replies.

 I don't have our current problems nicely written up as good questions
 yet. I'm still sorting out classpath issues, etc.
 In case it is of help, I'm seeing:
 * Exception in thread Spark Context Cleaner
 java.lang.NoClassDefFoundError: 0
 at
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)
 * We've been having clashing dependencies between a colleague and me
 because of the aforementioned classpath issue
 * The clashing dependencies are also causing issues around which Jetty
 libraries are available in the classloader from Spark and whether they clash
 with existing libraries we have.

 More anon,

 Cheers,
 Edward



  Original Message 
 Subject: Re: spark 1.3.1 jars in repo1.maven.org
 Date: 2015-05-20 00:38
 From: Sean Owen so...@cloudera.com
 To: Edward Sargisson esa...@pobox.com
 Cc: user user@spark.apache.org


 Yes, the published artifacts can only refer to one version of anything
 (OK, modulo publishing a large number of variants under classifiers).

 You aren't intended to rely on Spark's transitive dependencies for
 anything. Compiling against the Spark API has no relation to what
 version of Hadoop it binds against because it's not part of any API.
 You mark the Spark dependency even as provided in your build and get
 all the Spark/Hadoop bindings at runtime from your cluster.
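
 For reference, a minimal sbt sketch of that arrangement (the Maven equivalent
 marks the same artifacts with scope "provided"); the version numbers here are
 just illustrative:

   // build.sbt (sketch)
   libraryDependencies ++= Seq(
     // Compile against the Spark API, but take Spark itself from the cluster at runtime.
     "org.apache.spark" %% "spark-core" % "1.3.1" % "provided",
     // If you use Hadoop APIs directly, line this up with the cluster's Hadoop version.
     "org.apache.hadoop" % "hadoop-client" % "2.2.0" % "provided"
   )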

 What problem are you experiencing?


 On Wed, May 20, 2015 at 2:17 AM, Edward Sargisson esa...@pobox.com
 wrote:

 Hi,
 I'd like to confirm an observation I've just made: specifically, that Spark
 is only available in repo1.maven.org for one Hadoop variant.

 The Spark source can be compiled against a number of different Hadoops
 using
 profiles. Yay.
 However, the spark jars in repo1.maven.org appear to be compiled against
 one
 specific Hadoop and no other differentiation is made. (I can see a
 difference with hadoop-client being 2.2.0 in repo1.maven.org and 1.0.4 in
 the version I compiled locally).

 The implication here is that if you have a pom file asking for
 spark-core_2.10 version 1.3.1 then Maven will only give you an Hadoop 2
 

Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Ryan Williams
If anyone is curious to try exporting Spark metrics to Graphite, I just
published a post about my experience doing that, building dashboards in
Grafana http://grafana.org/, and using them to monitor Spark jobs:
http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/

Code for generating Grafana dashboards tailored to the metrics emitted by
Spark is here: https://github.com/hammerlab/grafana-spark-dashboards.
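
If anyone wants a starting point, the sink is configured in
conf/metrics.properties; a minimal sketch (host, port, and prefix are
placeholders for your own Graphite setup) looks roughly like:

  *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
  *.sink.graphite.host=graphite.example.com
  *.sink.graphite.port=2003
  *.sink.graphite.period=10
  *.sink.graphite.unit=seconds
  *.sink.graphite.prefix=spark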

If anyone else is interested in working on expanding MetricsSystem to make
this sort of thing more useful, let me know; I've been working on it a fair
amount and have a bunch of ideas about where it should go.

Thanks,

-Ryan


Re: Data Loss - Spark streaming

2014-12-16 Thread Ryan Williams
TD's portion seems to start at 27:24: http://youtu.be/jcJq3ZalXD8?t=27m24s
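
For anyone finding this later: with receiver-based input like Flume, data that
arrives while a batch is being processed is buffered by the receiver and
included in the next batch; the loss scenarios the talk covers are, as I
understand it, about failures, and (as of Spark 1.2) the receiver write-ahead
log is the main guard against those. A rough sketch of enabling it, with a
5-minute batch interval as in the question below (the exact property name is
worth checking against your Spark version):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Minutes, StreamingContext}

  val conf = new SparkConf()
    .setAppName("flume-ingest-sketch")
    // Persist received-but-unprocessed blocks to a write-ahead log so they
    // survive receiver/driver failures (Spark 1.2+).
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")

  val ssc = new StreamingContext(conf, Minutes(5))
  ssc.checkpoint("hdfs:///tmp/streaming-checkpoints") // the WAL lives under the checkpoint dir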

On Tue Dec 16 2014 at 7:13:43 AM Gerard Maas gerard.m...@gmail.com wrote:

 Hi Jeniba,

 The second part of this meetup recording has a very good answer to your
 question.  TD explains the current behavior and the on-going work in Spark
 Streaming to fix HA.
 https://www.youtube.com/watch?v=jcJq3ZalXD8


 -kr, Gerard.

 On Tue, Dec 16, 2014 at 11:32 AM, Jeniba Johnson 
 jeniba.john...@lntinfotech.com wrote:

 Hi,

 I need a clarification. While running the streaming examples, suppose the
 batch interval is set to 5 minutes, and the data collected from the input
 source (Flume) is being processed for those 5 minutes.
 What happens to the data that continues to flow from the input source into
 Spark Streaming during that time? Will that data be stored somewhere, or
 will it be lost?
 And if so, what is the solution for capturing every record without any loss
 in Spark Streaming?

 Awaiting for your kind reply.


 Regards,
 Jeniba Johnson


 




Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread Ryan Williams
I heard from one person offline who regularly builds Spark on OSX and Linux
and they felt like they only ever saw this error on OSX; if anyone can
confirm whether they've seen it on Linux, that would be good to know.

Stephen: good to know re: profiles/options. I don't think changing them is
a necessary condition as I believe I've run into it without doing that, but
any set of steps to reproduce this would be welcome so that we could
escalate to Typesafe as appropriate.

On Sun, Oct 26, 2014 at 11:46 PM, Stephen Boesch java...@gmail.com wrote:

 Yes it is necessary to do a mvn clean when encountering this issue.
 Typically you would have changed one or more of the profiles/options -
 which leads to this occurring.

 2014-10-22 22:00 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com:

 I started building Spark / running Spark tests this weekend and on maybe
 5-10 occasions have run into a compiler crash while compiling
 DataTypeConversions.scala.

 Here https://gist.github.com/ryan-williams/7673d7da928570907f4d is a
 full gist of an innocuous test command (mvn test
 -Dsuites='*KafkaStreamSuite') exhibiting this behavior. Problem starts
 on L512
 https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L512
 and there’s a final stack trace at the bottom
 https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L671
 .

 mvn clean or ./sbt/sbt clean “fix” it (I believe I’ve observed the issue
 while compiling with each tool), but are annoying/time-consuming to do,
 obvs, and it’s happening pretty frequently for me when doing only small
 numbers of incremental compiles punctuated by e.g. checking out different
 git commits.

 Have other people seen this? This post
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-github-source-build-error-td10532.html
 on this list is basically the same error, but in TestSQLContext.scala,
 and this SO post
 http://stackoverflow.com/questions/25211071/compilation-errors-in-spark-datatypeconversions-scala-on-intellij-when-using-m
 claims to be hitting it when trying to build in IntelliJ.

 It seems likely to be a bug in scalac; would finding a consistent repro
 case and filing it somewhere be useful?





scalac crash when compiling DataTypeConversions.scala

2014-10-22 Thread Ryan Williams
I started building Spark / running Spark tests this weekend and on maybe
5-10 occasions have run into a compiler crash while compiling
DataTypeConversions.scala.

Here https://gist.github.com/ryan-williams/7673d7da928570907f4d is a full
gist of an innocuous test command (mvn test -Dsuites='*KafkaStreamSuite')
exhibiting this behavior. Problem starts on L512
https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L512
and there’s a final stack trace at the bottom
https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L671
.

mvn clean or ./sbt/sbt clean “fix” it (I believe I’ve observed the issue
while compiling with each tool), but are annoying/time-consuming to do,
obvs, and it’s happening pretty frequently for me when doing only small
numbers of incremental compiles punctuated by e.g. checking out different
git commits.

Have other people seen this? This post
http://apache-spark-user-list.1001560.n3.nabble.com/spark-github-source-build-error-td10532.html
on this list is basically the same error, but in TestSQLContext.scala, and this
SO post
http://stackoverflow.com/questions/25211071/compilation-errors-in-spark-datatypeconversions-scala-on-intellij-when-using-m
claims to be hitting it when trying to build in IntelliJ.

It seems likely to be a bug in scalac; would finding a consistent repro
case and filing it somewhere be useful?