I agree with Andrew.... Every time I underestimate the RAM requirement.... my hand calculations are always way less than what the JVM actually allocates...
But I guess I will understand the Scala/JVM optimizations as I go through more pain....

On Thu, Apr 10, 2014 at 9:24 AM, Andrew Ash <and...@andrewash.com> wrote:

> The biggest issue I've come across is that the cluster is somewhat unstable when under memory pressure. Meaning that if you attempt to persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll often still get OOMs. I had to carefully modify some of the space tuning parameters and GC settings to get some jobs to even finish.
>
> The other issue I've observed is that if you group on a key that is highly skewed, with a few massively common keys and a long tail of rare keys, the one massive key can be too big for a single machine and again cause OOMs.
>
> I'm hopeful that off-heap caching (Tachyon) could fix some of these issues.
>
> Just my personal experience, but I've observed significant improvements in stability since even the 0.7.x days, so I'm confident that things will continue to get better as long as people report what they're seeing so it can get fixed.
>
> Andrew
>
> On Thu, Apr 10, 2014 at 4:08 PM, Alex Boisvert <alex.boisv...@gmail.com> wrote:
>
>> I'll provide answers from our own experience at Bizo. We've been using Spark for 1+ year now and have found it generally better than previous approaches (Hadoop + Hive mostly).
>>
>> On Thu, Apr 10, 2014 at 7:11 AM, Andras Nemeth <andras.nem...@lynxanalytics.com> wrote:
>>
>>> I. Is it too much magic? Lots of things "just work right" in Spark and it's extremely convenient and efficient when it indeed works. But should we be worried that customization is hard if the built-in behavior is not quite right for us? Are we to expect hard-to-track-down issues originating from the black box behind the magic?
>>
>> I think it goes back to understanding Spark's architecture, its design constraints and the problems it explicitly set out to address. If the solution to your problems can be easily formulated in terms of the map/reduce model, then it's a good choice. You'll want your "customizations" to go with (not against) the grain of the architecture.
>>
>>> II. Is it mature enough? E.g. we've created a pull request <https://github.com/apache/spark/pull/181> which fixes a problem that we were very surprised no one ever stumbled upon before. So that's why I'm asking: is Spark already being used in professional settings? Can one already trust it to be reasonably bug-free and reliable?
>>
>> There are lots of ways to use Spark, and not all of the features are necessarily at the same level of maturity. For instance, we put all the jars on the main classpath, so we've never run into the issue your pull request addresses.
>>
>> We definitely use and rely on Spark on a professional basis. We have 5+ Spark jobs running nightly on Amazon's EMR, slicing through GBs of data. Once we got them working with the proper configuration settings, they have been running reliably since.
>>
>> I would characterize our use of Spark as a "better Hadoop", in the sense that we use it for batch processing only, no streaming yet. We're happy it performs better than Hadoop, but we don't require/rely on its memory caching features. In fact, for most of our jobs it would simplify our lives if Spark didn't cache so many things in memory, since it would make configuration/tuning a lot simpler and jobs would run successfully on the first try instead of having to tweak things (# of partitions and such).
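As a reference point for the space tuning Andrew mentions, here is a minimal sketch (in Scala) of persisting with MEMORY_AND_DISK while shrinking the storage and shuffle memory fractions. The property values, the input path and the key parsing are made up for illustration - they are not the settings Andrew actually used, and the memoryFraction properties apply to Spark versions of this era.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Illustrative values only; tune against your own heap sizes and workload.
    val conf = new SparkConf()
      .setAppName("persist-under-memory-pressure")
      // Fraction of the heap reserved for cached RDD blocks
      // (smaller leaves more room for task execution).
      .set("spark.storage.memoryFraction", "0.4")
      // Fraction of the heap used by in-memory shuffle buffers.
      .set("spark.shuffle.memoryFraction", "0.2")
    val sc = new SparkContext(conf)

    // MEMORY_AND_DISK spills partitions that don't fit in memory to local
    // disk instead of recomputing them.
    val pairs = sc.textFile("hdfs:///some/large/input")   // hypothetical path
      .map(line => (line.split('\t')(0), line))
      .persist(StorageLevel.MEMORY_AND_DISK)
    println(pairs.count())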
>>> So, to the concrete issues. Sorry for the long mail, and let me know if I should break this out into more threads or if there is some other way to have this discussion...
>>>
>>> 1. Memory management
>>> The general direction of these questions is whether it's possible to take RDD-caching-related memory management more into our own hands, as LRU eviction is nice most of the time but can be very suboptimal in some of our use cases.
>>> A. Somehow prioritize cached RDDs, e.g. mark some as "essential" that one really wants to keep. I'm fine with going down in flames if I mark too much data essential.
>>> B. Memory "reflection": can you programmatically get the memory size of a cached RDD and the memory sizes available in total/per executor? If we could do this, we could indirectly avoid automatic evictions of things we might really want to keep in memory.
>>> C. Evictions caused by RDD partitions on the driver. I had a setup with huge worker memory and smallish memory on the driver JVM. To my surprise, the system started to cache RDD partitions on the driver as well. As the driver ran out of memory I started to see evictions while there was still plenty of space on the workers. This resulted in lengthy recomputations. Can this be avoided somehow?
>>> D. Broadcasts. Is it possible to get rid of a broadcast manually, without waiting for LRU eviction to take care of it? Can you tell the size of a broadcast programmatically?
>>>
>>> 2. Akka lost connections
>>> We have quite often experienced lost executors due to Akka exceptions - mostly connection lost or similar. It seems to happen when an executor gets extremely busy with some CPU-intensive work. Our hypothesis is that the Akka network threads get starved and the executor fails to respond within the timeout limits. Is this plausible? If yes, what can we do about it?
>>
>> We've seen these as well. In our case, increasing the Akka timeouts and frame size helped a lot.
>>
>> e.g. spark.akka.{timeout, askTimeout, lookupTimeout, frameSize}
>>
>>> In general, these are scary errors in the sense that they come from the very core of the framework, and it's hard to link them to something we do in our own code, and thus hard to find a fix. So a question more for the community: how often do you end up scratching your head about cases where the Spark magic doesn't work perfectly?
>>
>> For us, this happens most often for jobs processing TBs of data (instead of GBs)... which is frustrating of course, because these jobs cost a lot more in $$$ + time to run/debug/diagnose than smaller jobs.
>>
>> It means we have to comb the logs to understand what happened, interpret stack traces, dump memory / object allocations, read the Spark source to formulate hypotheses about what went wrong, and then trial + error our way to a configuration that works. Again, if Spark had better defaults and a more conservative execution model (relying less on in-memory caching of RDDs and associated metadata, on keeping large communication buffers on the heap, etc.), it would definitely simplify our lives.
>>
>> (Though I recognize that others might use Spark very differently, and that these defaults and conservative behavior might not please everybody.)
>>
>> Hopefully this is the kind of feedback you were looking for...
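To make Alex's spark.akka.* suggestion concrete, here is a minimal sketch of setting those four properties on a SparkConf, plus the closest thing I know of to the "memory reflection" asked about in 1.B. The numbers are placeholders rather than recommended values, and the inspection calls are worth double-checking against your exact Spark version.

    import org.apache.spark.{SparkConf, SparkContext}

    // The four properties Alex lists; timeouts are in seconds, frameSize in MB.
    val conf = new SparkConf()
      .setAppName("akka-tuning-sketch")
      .set("spark.akka.timeout", "300")
      .set("spark.akka.askTimeout", "120")
      .set("spark.akka.lookupTimeout", "120")
      .set("spark.akka.frameSize", "64")
    val sc = new SparkContext(conf)

    // Rough "memory reflection" from the driver: per-executor cache capacity
    // and per-RDD cached sizes.
    sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remaining)) =>
      println(s"$executor: $remaining of $maxMem bytes free for caching")
    }
    sc.getRDDStorageInfo.foreach { info =>
      println(s"RDD ${info.id}: ${info.memSize} bytes in memory, " +
        s"${info.numCachedPartitions}/${info.numPartitions} partitions cached")
    }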
>>> 3. Recalculation of cached RDDs
>>> I see the following scenario happening. I load two RDDs A and B from disk, cache them, and then run some jobs on them, at the very least a count on each. After these jobs are done, I see on the storage panel that 100% of these RDDs are cached in memory.
>>>
>>> Then I create a third RDD C, built from A and B by multiple joins and maps, also cache it, and start a job on C. When I do this, I still see A and B completely cached and also see C slowly getting more and more cached. This is all fine and good, but in the meantime I see stages running in the UI that point to the code used to load A and B. How is this possible? Am I misunderstanding how cached RDDs should behave?
>>>
>>> And again the general question - how can one debug such issues?
>>>
>>> 4. Shuffle on disk
>>> Is it true - I couldn't find it in the official docs, but did see it mentioned in various threads - that shuffle _always_ hits disk? (Disregarding OS caches.) Why is this the case? Are you planning to add an option to do shuffles in memory, or are there some intrinsic reasons for this to be impossible?
>>>
>>> Sorry again for the giant mail, and thanks for any insights!
>>>
>>> Andras
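For anyone who wants to poke at question 3, below is a self-contained sketch of that workflow. The paths, the comma-separated key parsing and the trivial join are made up; it reproduces the shape of the job, not necessarily the behavior Andras saw. One hedged guess: the join still needs shuffle map stages over A and B, and in the UI those stages carry the call site of the code that created A and B, so they can show up even when the cached partitions are read instead of the input files.

    import org.apache.spark.{SparkConf, SparkContext}
    // Needed for the pair-RDD implicits (join, mapValues) on pre-1.3 Spark.
    import org.apache.spark.SparkContext._

    object CachedJoinScenario {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cached-join-scenario"))

        // Hypothetical inputs, keyed on the first comma-separated field.
        val a = sc.textFile("hdfs:///data/a").map(l => (l.split(',')(0), l)).cache()
        val b = sc.textFile("hdfs:///data/b").map(l => (l.split(',')(0), l)).cache()

        // Materialize A and B; the storage tab should now show them fully cached.
        a.count()
        b.count()

        // C depends only on A and B.
        val c = a.join(b).mapValues { case (la, lb) => la.length + lb.length }.cache()
        c.count()

        sc.stop()
      }
    }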