better compression codecs for shuffle blocks?

2014-07-14 Thread Reynold Xin
Hi Spark devs, I was looking into the memory usage of shuffle, and one annoying thing about the default compression codec (LZF) is that the implementation we use allocates buffers pretty generously. I did a simple experiment and found that creating 1000 LZFOutputStreams allocated 198,976,424 bytes (~190 MB) …
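A minimal sketch of that kind of experiment, assuming the com.ning compress-lzf library that Spark's LZF codec wraps is on the classpath; the Runtime-based measurement here is a rough illustration, not the exact methodology used:

    import java.io.ByteArrayOutputStream
    import com.ning.compress.lzf.LZFOutputStream

    object LzfBufferFootprint {
      def main(args: Array[String]): Unit = {
        val rt = Runtime.getRuntime
        System.gc()
        val before = rt.totalMemory() - rt.freeMemory()
        // Hold references so each stream's internal buffers stay reachable.
        val streams = (1 to 1000).map(_ => new LZFOutputStream(new ByteArrayOutputStream()))
        System.gc()
        val after = rt.totalMemory() - rt.freeMemory()
        println(s"${streams.size} streams retain ~${(after - before) / (1024 * 1024)} MB")
      }
    }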

Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Patrick Wendell
> 1. The first error I met is the different SerializationVersionUID in ExecuterStatus > I resolved it by explicitly declaring SerializationVersionUID in ExecuterStatus.scala and recompiling branch-0.1-jdbc. I don't think there is a class in Spark named ExecuterStatus (sic) ... or ExecutorStatus.

Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Nan Zhu
Ah, sorry, sorry. It's ExecutorState under the deploy package. On Monday, July 14, 2014, Patrick Wendell wrote: > > 1. The first error I met is the different SerializationVersionUID in ExecuterStatus > > I resolved it by explicitly declaring SerializationVersionUID in ExecuterStatus.scala and recompiling branch-0.1-jdbc …
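For readers unfamiliar with the mechanism: in Scala a serialization identifier is pinned with the @SerialVersionUID annotation (the "SerializationVersionUID" in the thread presumably refers to this). A generic sketch with a hypothetical class, not the actual Spark one:

    // Pin the serial version UID so builds compiled at different times
    // remain serialization-compatible across the cluster.
    @SerialVersionUID(1L)
    class WorkerStatus(val state: String) extends Serializable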

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Mridul Muralidharan
We tried a lower block size for LZF, but it barfed all over the place. Snappy was the way to go for our jobs. Regards, Mridul On Mon, Jul 14, 2014 at 12:31 PM, Reynold Xin wrote: > Hi Spark devs, I was looking into the memory usage of shuffle, and one annoying thing about the default comp…

Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Will Benton
Hi all, I've been evaluating YourKit and would like to profile the heap and CPU usage of certain tests from the Spark test suite. In particular, I'm very interested in tracking heap usage by allocation site. Unfortunately, I get a lot of crashes running Spark tests with profiling …

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Matei Zaharia
I haven't seen issues using the JVM's own tools (jstack, jmap, hprof and such), so maybe there's a problem in YourKit or in your release of the JVM. Otherwise I'd suggest increasing the heap size of the unit tests a bit (you can do this in the SBT build file). Maybe they are very close to full …
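As a sketch of that suggestion, a heap bump in an SBT 0.13-era build might look like the following; the exact values and where this lives in Spark's build file are assumptions:

    // Fork test JVMs and give them a larger heap (values are illustrative).
    fork in Test := true
    javaOptions in Test ++= Seq("-Xmx4g", "-XX:MaxPermSize=512m")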

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Michael Armbrust
Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus easy to remove, and I would like Catalyst to be usable outside of Spark. A pull request to make this possible would be welcome. Ideally, we'd …

Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Cody Koeninger
Hi all, just wanted to give a heads up that we're seeing a reproducible deadlock with Spark 1.0.1 on Hadoop 2.3.0-mr1-cdh5.0.2. If JIRA is a better place for this, apologies in advance; figured talking about it on the mailing list was friendlier than randomly (re)opening JIRA tickets. I know Gary had …

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Matei Zaharia
Yeah, I'd just add a spark-util that has these things. Matei On Jul 14, 2014, at 1:04 PM, Michael Armbrust wrote: > Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus easy to remove, and I …

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Cody, this jstack seems truncated; would you mind giving the entire stack trace? For the second thread, for instance, we can't see where the lock is being acquired. - Patrick On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger wrote: > Hi all, just wanted to give a heads up that we're seeing a …

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Will Benton
Thanks, Matei; I have also had some success with jmap and friends and will probably just stick with them! best, wb - Original Message - > From: "Matei Zaharia" > To: dev@spark.apache.org > Sent: Monday, July 14, 2014 1:02:04 PM > Subject: Re: Profiling Spark tests with YourKit (or something else) …

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Aaron Davidson
The full jstack would still be useful, but our current working theory is that Configuration#loadDefaults goes through every Configuration object that was ever created (via Configuration.REGISTRY) and locks each one, thus introducing a dependency from new Configurations to old …
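A minimal sketch of the kind of mitigation this theory points toward (funneling Configuration construction through one global lock so the new-conf-to-old-conf locking can't interleave across threads); this only illustrates the idea and is not the actual patch under discussion:

    import org.apache.hadoop.conf.Configuration

    object ConfFactory {
      // One global lock: only a single Configuration is ever being constructed
      // (and thus locking older registry entries) at a time.
      private val confLock = new Object
      def newConf(): Configuration = confLock.synchronized { new Configuration() }
    }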

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Aaron Davidson
Out of curiosity, what problems are you seeing with Utils.getCallSite? On Mon, Jul 14, 2014 at 2:59 PM, Will Benton wrote: > Thanks, Matei; I have also had some success with jmap and friends and will probably just stick with them! …

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Nishkam Ravi
Hi Aaron, I'm not sure if synchronizing on an arbitrary lock object would help. I suspect we will start seeing the ConcurrentModificationException again. The right fix has gone into Hadoop through HADOOP-10456. Unfortunately, I don't have any bright ideas on how to synchronize this at the Spark level …

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Stephen Haberman
Just a comment from the peanut gallery, but these buffers are a real PITA for us as well. Probably 75% of our non-user-error job failures are related to them. Just naively, what about not doing compression on the fly? E.g., during the shuffle, just write straight to disk, uncompressed? For us, we …

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Sandy Ryza
Stephen, often the shuffle is bound by writes to disk, so even if disks have enough space to store the uncompressed data, the shuffle can complete faster by writing less data. Reynold, this isn't a big help in the short term, but if we switch to a sort-based shuffle, we'll only need a single LZFOutputStream …

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Matei Zaharia
You can actually turn off shuffle compression by setting spark.shuffle.compress to false. Try that out; there will still be some buffers for the various OutputStreams, but they should be smaller. Matei On Jul 14, 2014, at 3:30 PM, Stephen Haberman wrote: > Just a comment from the peanut gallery …
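For reference, a sketch of applying that setting through the 1.0-era SparkConf API; the app name is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("uncompressed-shuffle-test") // hypothetical app name
      .set("spark.shuffle.compress", "false")  // skip compressing shuffle outputs
    val sc = new SparkContext(conf)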

Change when loading/storing String data using Parquet

2014-07-14 Thread Michael Armbrust
I just wanted to send out a quick note about a change in the handling of strings when loading / storing data using Parquet and Spark SQL. Before, Spark SQL did not support binary data in Parquet, so all binary blobs were implicitly treated as Strings. 9fe693…
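For anyone with existing files written under the old behavior, Spark SQL has a compatibility flag for reading Parquet BINARY columns back as strings; the flag name spark.sql.parquet.binaryAsString is taken from later Spark SQL documentation and is an assumption for this exact commit:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
    // Treat Parquet BINARY columns as strings, matching the old behavior.
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")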

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Nishkam, Aaron's fix should prevent two concurrent accesses to getJobConf (and the Hadoop code therein). But if there is code elsewhere that tries to mutate the configuration, then I could see how we might still have the ConcurrentModificationException. I looked at your patch for HADOOP-10456 …

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Gary Malouf
We use the Hadoop configuration inside our code executing on Spark, as we need to list out files in the path. Maybe that is why it is exposed for us. On Mon, Jul 14, 2014 at 6:57 PM, Patrick Wendell wrote: > Hey Nishkam, Aaron's fix should prevent two concurrent accesses to getJobConf …

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Reynold Xin
Copying Jon here since he worked on the LZF library at Ning. Jon - any comments on this topic? On Mon, Jul 14, 2014 at 3:54 PM, Matei Zaharia wrote: > You can actually turn off shuffle compression by setting spark.shuffle.compress to false. Try that out; there will still be some buffers for …

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Nishkam Ravi
Hi Patrick, I'm not aware of another place where the access happens, but it's possible that it does. The original fix synchronized on the broadcastConf object and someone reported the same exception. On Mon, Jul 14, 2014 at 3:57 PM, Patrick Wendell wrote: > Hey Nishkam, Aaron's fix should prevent …

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
I observed a deadlock here when using the AvroInputFormat as well. The short of the issue is that there's one Configuration object per JVM, but multiple threads, one for each task. If each thread attempts to add a configuration option to the Configuration object at once, you get issues because HashMap …
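A toy illustration of the race Andrew describes: several task threads mutating the single shared per-JVM Configuration at once (its options live in structures that are not safe for concurrent mutation), which, depending on the Hadoop version, can corrupt state or surface as ConcurrentModificationException. The key names here are made up:

    import org.apache.hadoop.conf.Configuration

    object ConfRace {
      def main(args: Array[String]): Unit = {
        val shared = new Configuration() // stands in for the one per-JVM conf
        val threads = (1 to 2).map { id =>
          new Thread(new Runnable {
            def run(): Unit = {
              // Concurrent set() calls mutate the same underlying map.
              for (i <- 1 to 100000) shared.set(s"illustrative.key.$id.$i", "v")
            }
          })
        }
        threads.foreach(_.start())
        threads.foreach(_.join())
      }
    }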

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Andrew and Gary, would you guys be able to test https://github.com/apache/spark/pull/1409/files and see if it solves your problem? - Patrick On Mon, Jul 14, 2014 at 4:18 PM, Andrew Ash wrote: > I observed a deadlock here when using the AvroInputFormat as well. The short of the issue is that …

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Gary Malouf
We'll try to run a build tomorrow AM. On Mon, Jul 14, 2014 at 7:22 PM, Patrick Wendell wrote: > Andrew and Gary, would you guys be able to test https://github.com/apache/spark/pull/1409/files and see if it solves your problem? …

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Aaron Davidson
The patch won't solve the problem where two threads try to add a configuration option at the same time, but I think there is currently an issue where two threads can try to initialize the Configuration at the same time and still run into a ConcurrentModificationException. This at least reduces …

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Davies Liu
Maybe we could try LZ4 [1], which has better performance and a smaller footprint than LZF and Snappy. In fast scan mode, the performance is 1.5-2x higher than LZF [2], but the memory used is 10x smaller than LZF (16 KB vs 190 KB). [1] https://github.com/jpountz/lz4-java [2] http://ning.github.io/jvm-compr…
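A sketch of what adopting that suggestion could look like with lz4-java's block stream; the 16 KB block size mirrors the footprint cited above, and the data written is a placeholder:

    import java.io.ByteArrayOutputStream
    import net.jpountz.lz4.LZ4BlockOutputStream

    val sink = new ByteArrayOutputStream()
    val out = new LZ4BlockOutputStream(sink, 16 * 1024) // 16 KB compression blocks
    out.write("illustrative shuffle bytes".getBytes("UTF-8"))
    out.close()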

SBT gen-idea doesn't work well after merging SPARK-1776

2014-07-14 Thread DB Tsai
I have a clean clone of the Spark master repository, and I generated the IntelliJ project file by sbt gen-idea as usual. There are two issues we have after merging SPARK-1776 (read dependencies from Maven). 1) After SPARK-1776, sbt gen-idea will download the dependencies from the internet even when those jars are …

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Jon Hartlaub
Is the held memory due to just instantiating the LZFOutputStream? If so, I'm surprised and I consider that a bug. I suspect the held memory may be due to a SoftReference; memory will be released with enough memory pressure. Finally, is it necessary to keep 1000 (or more) decoders active? Would …

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Aaron Davidson
One of the core problems here is the number of open streams we have, which is (# cores * # reduce partitions), which can easily climb into the tens of thousands for large jobs. This is a more general problem that we are planning on fixing for our largest shuffles, as even moderate buffer sizes can …
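To make that multiplication concrete with hypothetical numbers: an executor with 16 cores writing 2,000 reduce partitions holds 16 x 2,000 = 32,000 streams open at once; at roughly 200 KB of buffer per LZF stream (per the ~190 MB for 1,000 streams measured at the top of the thread), that is on the order of 6 GB of buffer space alone.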

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Will Benton
- Original Message - > From: "Aaron Davidson" > To: dev@spark.apache.org > Sent: Monday, July 14, 2014 5:21:10 PM > Subject: Re: Profiling Spark tests with YourKit (or something else) > Out of curiosity, what problems are you seeing with Utils.getCallSite? Aaron, if I enable call site …

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Aaron Davidson
Would you mind filing a JIRA for this? That does sound like something bogus happening on the JVM/YourKit level, but this sort of diagnosis is sufficiently important that we should be resilient against it. On Mon, Jul 14, 2014 at 6:01 PM, Will Benton wrote: > - Original Message - …

ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Nicholas Chammas
Just launched an EC2 cluster from git hash 9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD accessing data in S3 yields the following error output. I understand that NoClassDefFoundError errors may mean something in the deployment was messed up. Is that correct? When I launch a …

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread scwf
Hi Cody, I met this issue days before and I posted a PR for it (https://github.com/apache/spark/pull/1385). It's very strange: if I synchronize on conf it will deadlock, but it is OK when I synchronize on initLocalJobConfFuncOpt. Here's the entire jstack output. On Mon, Jul 14, 2014 at 4:44 PM, P…

Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Nan Zhu
I resolved the issue by setting up an internal Maven repository to contain the Spark 1.0.1 jar compiled from branch-0.1-jdbc, and replacing the dependency on the central repository with our own repository. I believe there should be some more lightweight way. Best, -- Nan Zhu On Monday, July 14, …

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Will Benton
Sure thing: https://issues.apache.org/jira/browse/SPARK-2486 https://github.com/apache/spark/pull/1413 best, wb - Original Message - > From: "Aaron Davidson" > To: dev@spark.apache.org > Sent: Monday, July 14, 2014 8:38:16 PM > Subject: Re: Profiling Spark tests with YourKit (or something else) …

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
I'm not sure either of those PRs will fix the concurrent adds to Configuration issue I observed. I've got a stack trace and writeup I'll share in an hour or two (traveling today). On Jul 14, 2014 9:50 PM, "scwf" wrote: > Hi Cody, I met this issue days before and I posted a PR for it …

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Andrew, is your issue also a regression from 1.0.0 to 1.0.1? The immediate priority is addressing regressions between these two releases. On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash wrote: > I'm not sure either of those PRs will fix the concurrent adds to Configuration issue I observed. I've got …

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Patrick Wendell
Adding new build modules is pretty high overhead, so if this is a case where a small amount of duplicated code could get rid of the dependency, that could also be a good short-term option. - Patrick On Mon, Jul 14, 2014 at 2:15 PM, Matei Zaharia wrote: > Yeah, I'd just add a spark-util that has …

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
I don't believe mine is a regression. But it is related to thread safety on Hadoop Configuration objects. Should I start a new thread? On Jul 15, 2014 12:55 AM, "Patrick Wendell" wrote: > Andrew, is your issue also a regression from 1.0.0 to 1.0.1? The immediate priority is addressing regressions …

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Aaron Davidson
This one is typically due to a mismatch between the Hadoop versions -- i.e., Spark is compiled against 1.0.4 but is running with 2.3.0 in the classpath, or something like that. Not certain why you're seeing this with spark-ec2, but I'm assuming this is related to the issues you posted in a separate …
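One quick diagnostic along those lines (a generic check, not something the thread prescribes) is to print the Hadoop version that actually made it onto the classpath:

    import org.apache.hadoop.util.VersionInfo
    // Reports the Hadoop version of the hadoop-common jar being loaded.
    println(s"Hadoop version on classpath: ${VersionInfo.getVersion}")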

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Shivaram Venkataraman
My guess is that this is related to https://issues.apache.org/jira/browse/SPARK-2471, where the S3 library gets excluded from the SBT assembly jar. I am not sure if the assembly jar used in EC2 is generated using SBT, though. Shivaram On Mon, Jul 14, 2014 at 10:02 PM, Aaron Davidson wrote: > This one is typically due to a mismatch …

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Patrick Wendell
Yeah - this is likely caused by SPARK-2471. On Mon, Jul 14, 2014 at 10:11 PM, Shivaram Venkataraman wrote: > My guess is that this is related to https://issues.apache.org/jira/browse/SPARK-2471, where the S3 library gets excluded from the SBT assembly jar. I am not sure if the assembly jar used …

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Andrew, yeah, that would be preferable. Definitely worth investigating both, but the regression is more pressing at the moment. - Patrick On Mon, Jul 14, 2014 at 10:02 PM, Andrew Ash wrote: > I don't believe mine is a regression. But it is related to thread safety on Hadoop Configuration objects …

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Nicholas Chammas
Okie doke -- added myself as a watcher on that issue. On a related note, what are the thoughts on automatically spinning up/down EC2 clusters and running tests against them? It would probably be way too cumbersome to do that for every build, but perhaps on some schedule it could help validate that …

Re: SBT gen-idea doesn't work well after merging SPARK-1776

2014-07-14 Thread Patrick Wendell
Thanks DB, feel free to file sub-JIRAs under https://issues.apache.org/jira/browse/SPARK-2487. I've been importing the Maven build into IntelliJ; it might be worth trying that as well to see if it works. - Patrick On Mon, Jul 14, 2014 at 4:53 PM, DB Tsai wrote: > I have a clean clone of the Spark master repository …