Re: SequenceFile and object reuse

2015-11-18 Thread Ryan Williams
Hey Jeff, in addition to what Sandy said, there are two more reasons that
this might not be as bad as it seems; I may be incorrect in my
understanding though.

First, the "additional step" you're referring to is not likely to be adding
any overhead; the "extra map" is really just materializing the data once
(as opposed to zero times), which is what you want (assuming your access
pattern couldn't be reformulated in the way Sandy described, i.e. where all
the objects in a partition don't need to be in memory at the same time).

Second, even if this were an "extra" map step, it wouldn't add any extra
stages to a given pipeline, since it's a "narrow" dependency, so it would
likely be low-cost anyway.
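
As a concrete illustration of the copy-via-map pattern the docs recommend,
here's a minimal sketch, assuming a SequenceFile of (LongWritable, Text)
pairs; the path and types are placeholders for whatever your file actually
contains:

    import org.apache.hadoop.io.{LongWritable, Text}

    // sc is a SparkContext (e.g. the one in spark-shell).
    val raw = sc.sequenceFile("hdfs:///path/to/input",
      classOf[LongWritable], classOf[Text])

    // Hadoop's RecordReader reuses the same Writable instances, so copy the
    // contents into fresh JVM objects before caching or shuffling:
    val safe = raw.map { case (k, v) => (k.get, v.toString) }
    safe.cache()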

Let me know if any of the above seems incorrect, thanks!

On Thu, Nov 19, 2015 at 12:41 AM Sandy Ryza  wrote:

> Hi Jeff,
>
> Many access patterns simply take the result of hadoopFile and use it to
> create some other object, and thus have no need for each input record to
> refer to a different object.  In those cases, the current API is more
> performant than an alternative that would create an object for each record,
> because it avoids the unnecessary overhead of creating Java objects.  As
> you've pointed out, this is at the expense of making the code more verbose
> when caching.
>
> -Sandy
>
> On Fri, Nov 13, 2015 at 10:29 AM, jeff saremi 
> wrote:
>
>> So we tried reading a SequenceFile in Spark and realized that all our
>> records ended up being the same.
>> Then one of us found this:
>>
>> Note: Because Hadoop's RecordReader class re-uses the same Writable
>> object for each record, directly caching the returned RDD or directly
>> passing it to an aggregation or shuffle operation will create many
>> references to the same object. If you plan to directly cache, sort, or
>> aggregate Hadoop writable objects, you should first copy them using a map
>> function.
>>
>> Is there anyone who can shed some light on this bizarre behavior and the
>> decisions behind it?
>> I would also like to know if anyone has been able to read a binary file
>> without incurring the additional map() suggested above. What format did
>> you use?
>>
>> thanks
>> Jeff
>>
>
>


Spree: a live-updating web UI for Spark

2015-07-27 Thread Ryan Williams
Probably relevant to people on this list: on Friday I released a clone of
the Spark web UI built using Meteor, so that everything updates in
real-time, saving you from endlessly refreshing the page while jobs are
running :) It can also serve as the UI for running as well as completed
applications, so you don't have to mess with the separate history-server
process if you don't want to.

*This blog post* and *the github repo* have lots of information on how to
use it.

It has two sub-components, JsonRelay and Slim; the former sends
SparkListenerEvents out of Spark via a network socket, while the latter
receives those events and writes various stats about them to Mongo (like an
external JobProgressListener that persists info to a database). You might
find them to offer a better way of storing information about running and
completed Spark applications than the event log files that Spark uses, and
they can be used with or without the real-time web UI.
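
For a sense of what "an external JobProgressListener" boils down to, here's
a rough sketch of that kind of listener; the SparkListener callbacks are the
real Spark API, but the forwarding sink below is just a placeholder, not
JsonRelay's actual wire format:

    import org.apache.spark.scheduler.{SparkListener,
      SparkListenerJobStart, SparkListenerStageCompleted}

    // Forwards a couple of event types to an arbitrary sink function.
    class ForwardingListener(sink: String => Unit) extends SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        sink(s"job ${jobStart.jobId} started, stages: ${jobStart.stageIds.mkString(",")}")
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
        sink(s"stage ${stage.stageInfo.stageId} done, ${stage.stageInfo.numTasks} tasks")
    }

    // Register it on a running SparkContext:
    // sc.addSparkListener(new ForwardingListener(line => println(line)))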

Give them a try if they sound useful to you, and let me know if you have
questions or comments!

-Ryan


Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
Thanks so much Shixiong! This is great.

On Tue, Jun 2, 2015 at 8:26 PM Shixiong Zhu  wrote:

> Ryan - I sent a PR to fix your issue:
> https://github.com/apache/spark/pull/6599
>
> Edward - I have no idea why the following error happened. "ContextCleaner"
> doesn't use any Hadoop API. Could you try Spark 1.3.0? It's supposed to
> support both hadoop 1 and hadoop 2.
>
>
> * "Exception in thread "Spark Context Cleaner"
> java.lang.NoClassDefFoundError: 0
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)"
>
>
> Best Regards,
> Shixiong Zhu
>
> 2015-06-03 0:08 GMT+08:00 Ryan Williams :
>
>> I think this is causing issues upgrading ADAM
>> <https://github.com/bigdatagenomics/adam> to Spark 1.3.1 (cf. adam#690
>> <https://github.com/bigdatagenomics/adam/pull/690#issuecomment-107769383>);
>> attempting to build against Hadoop 1.0.4 yields errors like:
>>
>> 2015-06-02 15:57:44 ERROR Executor:96 - Exception in task 0.0 in stage
>> 0.0 (TID 0)
>> *java.lang.IncompatibleClassChangeError: Found class
>> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected*
>> at
>> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
>> at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 2015-06-02 15:57:44 WARN  TaskSetManager:71 - Lost task 0.0 in stage 0.0
>> (TID 0, localhost): java.lang.IncompatibleClassChangeError: Found class
>> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
>> at
>> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
>> at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> TaskAttemptContext is a class in Hadoop 1.0.4, but an interface in Hadoop
>> 2; Spark 1.3.1 expects the interface but is getting the class.
>>
>> It sounds like, while I *can* depend on Spark 1.3.1 and Hadoop 1.0.4, I
>> then need to hope that I don't exercise certain Spark code paths that run
>> afoul of differences between Hadoop 1 and 2; does that seem correct?
>>
>> Thanks!
>>
>> On Wed, May 20, 2015 at 1:52 PM Sean Owen  wrote:
>>
>>> I don't think any of those problems are related to Hadoop. Have you
>>> looked at userClassPathFirst settings?
>>>
>>> On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson 
>>> wrote:
>>>
>>>> Hi Sean and Ted,
>>>> Thanks for your replies.
>>>>
>>>> I don't have our current problems nicely written up as good questions
>>>> yet. I'm still sorting out classpath issues, etc.
>>>> In case it is of help, I'm seeing:
>>>> * "Exception in thread "Spark Context Cleaner"
>>>> java.lang.NoClassDefFoundError: 0
>>>> at
>>>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)"
>>>> * We've been having clashing dependencies between a colleague and me
>>>> because of the aforementioned classpath issue
>>>> * The clashing dependencies are also causing issues with what jetty
>>>> libraries are available in the classloader from Spark and don't clash with
>>>> existing libraries we have.

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
I think this is causing issues upgrading ADAM to Spark 1.3.1 (cf. adam#690);
attempting to build against Hadoop 1.0.4 yields errors like:

2015-06-02 15:57:44 ERROR Executor:96 - Exception in task 0.0 in stage 0.0
(TID 0)
*java.lang.IncompatibleClassChangeError: Found class
org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected*
at
org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2015-06-02 15:57:44 WARN  TaskSetManager:71 - Lost task 0.0 in stage 0.0
(TID 0, localhost): java.lang.IncompatibleClassChangeError: Found class
org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
at
org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:95)
at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

TaskAttemptContext is a class in Hadoop 1.0.4, but an interface in Hadoop
2; Spark 1.3.1 expects the interface but is getting the class.

It sounds like, while I *can* depend on Spark 1.3.1 and Hadoop 1.0.4, I
then need to hope that I don't exercise certain Spark code paths that run
afoul of differences between Hadoop 1 and 2; does that seem correct?
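
For what it's worth, the pattern Sean suggests below -- marking Spark
"provided" and pinning your own hadoop-client -- looks roughly like this in
sbt (the versions are just the ones from this thread, not a recommendation):

    // build.sbt sketch
    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % "1.3.1" % "provided",
      "org.apache.hadoop" %  "hadoop-client" % "1.0.4"
    )

Whether that sidesteps the TaskAttemptContext mismatch presumably depends on
which Hadoop the Spark assembly on your cluster was actually built against.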

Thanks!

On Wed, May 20, 2015 at 1:52 PM Sean Owen  wrote:

> I don't think any of those problems are related to Hadoop. Have you looked
> at userClassPathFirst settings?
>
> On Wed, May 20, 2015 at 6:46 PM, Edward Sargisson 
> wrote:
>
>> Hi Sean and Ted,
>> Thanks for your replies.
>>
>> I don't have our current problems nicely written up as good questions
>> yet. I'm still sorting out classpath issues, etc.
>> In case it is of help, I'm seeing:
>> * "Exception in thread "Spark Context Cleaner"
>> java.lang.NoClassDefFoundError: 0
>> at
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:149)"
>> * We've been having clashing dependencies between a colleague and me
>> because of the aforementioned classpath issue
>> * The clashing dependencies are also causing issues with what jetty
>> libraries are available in the classloader from Spark and don't clash with
>> existing libraries we have.
>>
>> More anon,
>>
>> Cheers,
>> Edward
>>
>>
>>
>>  Original Message 
>>  Subject: Re: spark 1.3.1 jars in repo1.maven.org Date: 2015-05-20 00:38
>> From: Sean Owen  To: Edward Sargisson <
>> esa...@pobox.com> Cc: user 
>>
>>
>> Yes, the published artifacts can only refer to one version of anything
>> (OK, modulo publishing a large number of variants under classifiers).
>>
>> You aren't intended to rely on Spark's transitive dependencies for
>> anything. Compiling against the Spark API has no relation to what
>> version of Hadoop it binds against because it's not part of any API.
>> You can even mark the Spark dependency as "provided" in your build and
>> get all the Spark/Hadoop bindings at runtime from your cluster.
>>
>> What problem are you experiencing?
>>
>>
>> On Wed, May 20, 2015 at 2:17 AM, Edward Sargisson 
>> wrote:
>>
>> Hi,
>> I'd like to confirm an observation I've just made. Specifically that spark
>> is only available in repo1.maven.org for one Hadoop variant.
>>
>> The Spark source can be compiled against a number of different Hadoops
>> using
>> profiles. Yay.
>> However, the spark jars in repo1.maven.org appear to be compiled against
>> one
>> specific Hadoop and no other differentiation is made. (I can see a
>> difference with hadoop-client being 2.2.0 in repo1.maven.org and 1.0.4 in
>> the version I compiled locally).
>>
>> The implication here is that if you have a pom file asking for
>> spark-core_2.10 version 1.3.1 then you transitively get the Hadoop 2.2.0
>> client.

Re: bitten by spark.yarn.executor.memoryOverhead

2015-03-02 Thread Ryan Williams
For reference, the initial version of #3525 (still open) made this fraction
a configurable value, but consensus went against that being desirable, so I
removed it and marked SPARK-4665 as "won't fix".

My team wasted a lot of time on this failure mode as well and has settled
into passing "--conf spark.yarn.executor.memoryOverhead=1024" to most jobs
(that works out to 10-20% of --executor-memory, depending on the job).

I agree that learning about this the hard way is a weak part of the
Spark-on-YARN onboarding experience.

The fact that our instinct here is to increase the 0.07 minimum instead of
the alternate 384MB minimum seems like evidence that the fraction is what we
should allow people to configure, instead of the absolute amount that is
currently configurable.
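
For anyone following along, the default being discussed works out to roughly
the following; this just mirrors the max(7%, 384MB) behavior described
above, it isn't lifted from Spark's source:

    // Sketch of the default overhead computation under discussion.
    def defaultOverheadMB(executorMemoryMB: Int): Int =
      math.max((0.07 * executorMemoryMB).toInt, 384)

    // e.g. defaultOverheadMB(16 * 1024) == 1146, i.e. ~1.1GB for a 16GB
    // executor -- in the same ballpark as the 1024MB we pass explicitly.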

Finally, do we feel confident that 0.1 is "always" enough?


On Sat, Feb 28, 2015 at 4:51 PM Corey Nolet  wrote:

> Thanks for taking this on Ted!
>
> On Sat, Feb 28, 2015 at 4:17 PM, Ted Yu  wrote:
>
>> I have created SPARK-6085 with pull request:
>> https://github.com/apache/spark/pull/4836
>>
>> Cheers
>>
>> On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet  wrote:
>>
>>> +1 to a better default as well.
>>>
>>> We were working fine until we ran against a real dataset which was much
>>> larger than the test dataset we were using locally. It took me a couple of
>>> days and digging through many logs to figure out this value was what was
>>> causing the problem.
>>>
>>> On Sat, Feb 28, 2015 at 11:38 AM, Ted Yu  wrote:
>>>
 Having good out-of-box experience is desirable.

 +1 on increasing the default.


 On Sat, Feb 28, 2015 at 8:27 AM, Sean Owen  wrote:

> There was a recent discussion about whether to increase or indeed make
> configurable this kind of default fraction. I believe the suggestion
> there too was that 9-10% is a safer default.
>
> Advanced users can lower the resulting overhead value; it may still
> have to be increased in some cases, but a fatter default may make this
> kind of surprise less frequent.
>
> I'd support increasing the default; any other thoughts?
>
> On Sat, Feb 28, 2015 at 3:34 PM, Koert Kuipers 
> wrote:
> > hey,
> > running my first map-red like (meaning disk-to-disk, avoiding in
> memory
> > RDDs) computation in spark on yarn i immediately got bitten by a too
> low
> > spark.yarn.executor.memoryOverhead. however it took me about an hour
> to find
> > out this was the cause. at first i observed failing shuffles leading
> to
> > restarting of tasks, then i realized this was because executors
> could not be
> > reached, then i noticed in containers got shut down and reallocated
> in
> > resourcemanager logs (no mention of errors, it seemed the containers
> > finished their business and shut down successfully), and finally i
> found the
> > reason in nodemanager logs.
> >
> > i dont think this is a pleasant first experience. i realize
> > spark.yarn.executor.memoryOverhead needs to be set differently from
> > situation to situation. but shouldnt the default be a somewhat
> higher value
> > so that these errors are unlikely, and then the experts that are
> willing to
> > deal with these errors can tune it lower? so why not make the
> default 10%
> > instead of 7%? that gives something that works in most situations
> out of the
> > box (at the cost of being a little wasteful). it worked for me.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>
>


Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Ryan Williams
If anyone is curious to try exporting Spark metrics to Graphite, I just
published a post about my experience doing that, building dashboards in
Grafana, and using them to monitor Spark jobs:
http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
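
For reference, Spark's built-in GraphiteSink is configured via
conf/metrics.properties, roughly like this (host, port, and prefix below are
placeholders for your own setup):

    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.example.com
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds
    *.sink.graphite.prefix=spark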

Code for generating Grafana dashboards tailored to the metrics emitted by
Spark is here: https://github.com/hammerlab/grafana-spark-dashboards.

If anyone else is interested in working on expanding MetricsSystem to make
this sort of thing more useful, let me know; I've been working on it a fair
amount and have a bunch of ideas about where it should go.

Thanks,

-Ryan


Re: Data Loss - Spark streaming

2014-12-16 Thread Ryan Williams
TD's portion seems to start at 27:24: http://youtu.be/jcJq3ZalXD8?t=27m24s

On Tue Dec 16 2014 at 7:13:43 AM Gerard Maas  wrote:

> Hi Jeniba,
>
> The second part of this meetup recording has a very good answer to your
> question.  TD explains the current behavior and the on-going work in Spark
> Streaming to fix HA.
> https://www.youtube.com/watch?v=jcJq3ZalXD8
>
>
> -kr, Gerard.
>
> On Tue, Dec 16, 2014 at 11:32 AM, Jeniba Johnson <
> jeniba.john...@lntinfotech.com> wrote:
>>
>> Hi,
>>
>> I need a clarification: while running the streaming examples, suppose the
>> batch interval is set to 5 minutes, and data collected from the input
>> source (Flume) is processed for those 5 minutes.
>> What will happen to the data that keeps flowing in from the input source to
>> Spark Streaming in the meantime? Will that data be stored somewhere, or
>> will it be lost?
>> And what is the solution for capturing every record without any loss in
>> Spark Streaming?
>>
>> Awaiting your kind reply.
>>
>>
>> Regards,
>> Jeniba Johnson
>>
>>
>> 
>> The contents of this e-mail and any attachment(s) may contain
>> confidential or privileged information for the intended recipient(s).
>> Unintended recipients are prohibited from taking action on the basis of
>> information in this e-mail and using or disseminating the information, and
>> must notify the sender and delete it from their system. L&T Infotech will
>> not accept responsibility or liability for the accuracy or completeness of,
>> or the presence of any virus or disabling code in this e-mail"
>>
>


FileNotFoundException in appcache shuffle files

2014-10-28 Thread Ryan Williams
My job is failing with the following error:

14/10/29 02:59:14 WARN scheduler.TaskSetManager: Lost task 1543.0 in stage
3.0 (TID 6266, demeter-csmau08-19.demeter.hpc.mssm.edu):
java.io.FileNotFoundException:
/data/05/dfs/dn/yarn/nm/usercache/willir31/appcache/application_1413512480649_0108/spark-local-20141028214722-43f1/26/shuffle_0_312_0.index
(No such file or directory)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.<init>(FileOutputStream.java:221)

org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)

org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)

org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:733)

org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:732)
scala.collection.Iterator$class.foreach(Iterator.scala:727)

org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:790)

org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:732)

org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:728)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:728)

org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:56)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)


I get 4 of those on task 1543 before the job aborts. Interspersed in the 4
task-1543 failures are a few instances of this failure on another task. Here
is the entire App Master stdout dump [1] (~2MB; stack traces towards the
bottom, of course). I am running {Spark 1.1, Hadoop 2.3.0}.

Here's a summary of the RDD manipulations I've done up to the point of
failure:

   - val A = [read a file in 1419 shards]
      - the file is 177GB compressed but ends up being ~5TB uncompressed /
        hydrated into scala objects (I think; see below for more discussion
        on this point).
      - some relevant Spark options:
         - spark.default.parallelism=2000
         - --master yarn-client
         - --executor-memory 50g
         - --driver-memory 10g
         - --num-executors 100
         - --executor-cores 4


   - A.repartition(3000)
      - 3000 was chosen in an attempt to mitigate shuffle-disk-spillage
        that previous job attempts with 1000 or 1419 shards were mired in


   - A.persist()


   - A.count()  // succeeds
      - screenshot of web UI with stats: http://cl.ly/image/3e130w3J1B2v
      - I don't know why each task reports "8 TB" of "Input"; that metric
        seems like it is always ludicrously high and I don't pay attention to
        it typically.
      - Each task shuffle-writes 3.5GB, for a total of 4.9TB
         - Does that mean that 4.9TB is the uncompressed size of the file
           that A was read from?
         - 4.9TB is pretty close to the total amount of memory I've
           configured the job to use: (50GB/executor) * (100 executors) ~= 5TB.
         - Is that a coincidence, or are my executors shuffle-writing an
           amount equal to all of their memory for some reason?


   - val B = A.groupBy(...).filter(_._2.size == 2).map(_._2).flatMap(x =>
     x).persist()
      - my expectation is that ~all elements pass the filter step, so B should
        be ~equal to A, just to give a sense of the expected memory blowup.


   - B.count()
  - this *fails* while executing .groupBy(...) above
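
Condensed into code, the above looks roughly like this; sc.textFile and the
grouping key are stand-ins for the job's actual input format and key
function:

    val A = sc.textFile("hdfs:///path/to/input")   // ~1419 shards, 177GB compressed
      .repartition(3000)                           // mitigate shuffle spillage
      .persist()
    A.count()                                      // succeeds

    val B = A
      .groupBy(line => line.take(8))               // placeholder key function
      .filter(_._2.size == 2)
      .map(_._2)
      .flatMap(x => x)
      .persist()
    B.count()                                      // fails during the groupBy shuffle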


I've found a few discussions of issues whose manifestations look *like*
this, but nothing that is obviously the same issue. The closest hit I've
seen is "Stage failure in BlockManager..." [2] on this list on 8/20; some
key excerpts:

   - "likely due to a bug in shuffle file consolidation"
   - "hopefully fixed in 1.1 with this patch:
     https://github.com/apache/spark/commit/78f2af582286b81e6dc9fa9d455ed2b369d933bd"
   - 78f2af5 [3] implements pieces of #1609 [4], on which mridulm has a comment

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread Ryan Williams
I heard from one person offline who regularly builds Spark on OSX and Linux
and they felt like they only ever saw this error on OSX; if anyone can
confirm whether they've seen it on Linux, that would be good to know.

Stephen: good to know re: profiles/options. I don't think changing them is
a necessary condition as I believe I've run into it without doing that, but
any set of steps to reproduce this would be welcome so that we could
escalate to Typesafe as appropriate.

On Sun, Oct 26, 2014 at 11:46 PM, Stephen Boesch  wrote:

> Yes it is necessary to do a mvn clean when encountering this issue.
> Typically you would have changed one or more of the profiles/options -
> which leads to this occurring.
>
> 2014-10-22 22:00 GMT-07:00 Ryan Williams :
>
> I started building Spark / running Spark tests this weekend and on maybe
>> 5-10 occasions have run into a compiler crash while compiling
>> DataTypeConversions.scala.
>>
>> Here <https://gist.github.com/ryan-williams/7673d7da928570907f4d> is a
>> full gist of an innocuous test command (mvn test
>> -Dsuites='*KafkaStreamSuite') exhibiting this behavior. Problem starts
>> on L512
>> <https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L512>
>> and there’s a final stack trace at the bottom
>> <https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L671>
>> .
>>
>> mvn clean or ./sbt/sbt clean “fix” it (I believe I’ve observed the issue
>> while compiling with each tool), but they are annoying/time-consuming to do,
>> obvs, and it’s happening pretty frequently for me when doing only small
>> numbers of incremental compiles punctuated by e.g. checking out different
>> git commits.
>>
>> Have other people seen this? This post
>> <http://apache-spark-user-list.1001560.n3.nabble.com/spark-github-source-build-error-td10532.html>
>> on this list is basically the same error, but in TestSQLContext.scala
>> and this SO post
>> <http://stackoverflow.com/questions/25211071/compilation-errors-in-spark-datatypeconversions-scala-on-intellij-when-using-m>
>> claims to be hitting it when trying to build in intellij.
>>
>> It seems likely to be a bug in scalac; would finding a consistent repro
>> case and filing it somewhere be useful?
>>
>
>


scalac crash when compiling DataTypeConversions.scala

2014-10-22 Thread Ryan Williams
I started building Spark / running Spark tests this weekend and on maybe
5-10 occasions have run into a compiler crash while compiling
DataTypeConversions.scala.

Here <https://gist.github.com/ryan-williams/7673d7da928570907f4d> is a full
gist of an innocuous test command (mvn test -Dsuites='*KafkaStreamSuite')
exhibiting this behavior. Problem starts on L512
<https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L512>
and there’s a final stack trace at the bottom
<https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L671>
.

mvn clean or ./sbt/sbt clean “fix” it (I believe I’ve observed the issue
while compiling with each tool), but they are annoying/time-consuming to do,
obvs, and it’s happening pretty frequently for me when doing only small
numbers of incremental compiles punctuated by e.g. checking out different
git commits.

Have other people seen this? This post
<http://apache-spark-user-list.1001560.n3.nabble.com/spark-github-source-build-error-td10532.html>
on this list is basically the same error, but in TestSQLContext.scala and this
SO post
<http://stackoverflow.com/questions/25211071/compilation-errors-in-spark-datatypeconversions-scala-on-intellij-when-using-m>
claims to be hitting it when trying to build in intellij.

It seems likely to be a bug in scalac; would finding a consistent repro
case and filing it somewhere be useful?