Do you have the link to the Cloudera comment?

Sent from Windows Mail

From: Dean Wampler
Sent: Thursday, April 10, 2014 7:39 AM
To: Spark Users
Cc: Daniel Darabos, Andras Barjak

Spark has been endorsed by Cloudera as the successor to MapReduce. That says a 
lot...

On Thu, Apr 10, 2014 at 10:11 AM, Andras Nemeth 
<andras.nem...@lynxanalytics.com> wrote:

Hello Spark Users,

With Spark's recent graduation to a top-level project (congrats, btw!), this is
maybe a well-timed question. :)

We are at the very beginning of a large-scale big data project, and after two
months of exploration work we'd like to settle on the technologies to use, roll
up our sleeves and start building the system.

Spark is one of the front-runners for our technology choice.

My question, in essence, is whether that's a good idea, or whether Spark is
still too 'experimental' to bet our lives (well, the project's life) on.

The benefits of choosing Spark are numerous and, I guess, all too obvious for
this audience - e.g. we love its powerful abstractions, the ease of development
and the potential to use a single system for both serving and manipulating huge
amounts of data.

This email aims to ask about the risks. I list the concrete issues we've
encountered below, but basically my concern boils down to two philosophical
points:

I. Is it too much magic? Lots of things "just work right" in Spark, and it's
extremely convenient and efficient when it indeed works. But should we be
worried that customization is hard if the built-in behavior is not quite right
for us? Should we expect hard-to-track-down issues originating from the black
box behind the magic?

II. Is it mature enough? E.g. we've created a pull request which fixes a
problem that we were very surprised no one had ever stumbled upon before. So
that's why I'm asking: is Spark already being used in professional settings?
Can one already trust it to be reasonably bug-free and reliable?

I know I'm asking a biased audience, but that's fine, as I want to be 
convinced. :)

So, to the concrete issues. Sorry for the long mail, and let me know if I 
should break this out into more threads or if there is some other way to have 
this discussion...

1. Memory management


The general direction of these questions is whether it's possible to take
RDD-caching-related memory management more into our own hands, as LRU eviction
is nice most of the time but can be very suboptimal in some of our use cases.

A. Somehow prioritize cached RDDs, e.g. mark some as "essential" that one
really wants to keep. I'm fine with going down in flames if I mark too much
data as essential.

B. Memory "reflection": can you programmatically get the memory size of a
cached RDD and the memory available in total/per executor? If we could do this,
we could indirectly avoid automatic eviction of things we really want to keep
in memory (a sketch of what we mean for points A and B follows this list).

C. Evictions caused by RDD partitions on the driver. I had a setup with huge
worker memory and smallish memory on the driver JVM. To my surprise, the system
started to cache RDD partitions on the driver as well. As the driver ran out of
memory, I started to see evictions while there was still plenty of space on the
workers. This resulted in lengthy recomputations. Can this be avoided somehow?

D. Broadcasts. Is it possible to get rid of a broadcast manually, without
waiting for LRU eviction to take care of it? Can you tell the size of a
broadcast programmatically?
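
To make points A and B concrete, here is roughly what we have in mind. This is
just a sketch against the 0.9/1.0-era API as we understand it - the HDFS path
is made up, and getRDDStorageInfo / getExecutorMemoryStatus are simply the
calls we found; please correct me if these are the wrong tools:

import org.apache.spark.storage.StorageLevel

// `sc` is the SparkContext from spark-shell; the path is just a placeholder.
val essential = sc.textFile("hdfs:///data/essential")

// (A) The closest thing to "prioritizing" we know of: persist important data
//     with a level that falls back to disk instead of being silently evicted
//     and recomputed later.
essential.persist(StorageLevel.MEMORY_AND_DISK)
essential.count()   // materialize the cache

// (B) Memory "reflection": per-RDD cached sizes and per-executor memory status.
sc.getRDDStorageInfo.foreach { info =>
  println("RDD " + info.id + ": " + info.memSize + " bytes in memory, " +
    info.numCachedPartitions + "/" + info.numPartitions + " partitions cached")
}
sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remaining)) =>
  println(executor + ": " + remaining + " of " + maxMem + " bytes free for caching")
}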

2. Akka lost connections


We have quite often experienced lost executors due to Akka exceptions - mostly
lost connections or similar. It seems to happen when an executor gets extremely
busy with some CPU-intensive work. Our hypothesis is that the Akka network
threads get starved and the executor fails to respond within the timeout
limits. Is this plausible? If yes, what can we do about it?
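
For reference, these are the knobs we have been experimenting with so far. The
property names and defaults are taken from the 0.9/1.0 configuration docs, so
treat them as our assumptions rather than a recommendation:

import org.apache.spark.{SparkConf, SparkContext}

// Just raising the timeouts - a workaround attempt, not a fix for the root cause.
val conf = new SparkConf()
  .setAppName("akka-timeout-experiment")
  .set("spark.akka.timeout", "300")    // node-to-node communication timeout in seconds (default 100)
  .set("spark.akka.frameSize", "50")   // maximum Akka message size in MB (default 10)
val sc = new SparkContext(conf)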

In general, these are scary errors in the sense that they come from the very
core of the framework, they are hard to link to anything we do in our own code,
and thus hard to fix. So a question more for the community: how often do you
end up scratching your head about cases where the Spark magic doesn't work
perfectly?

3. Recalculation of cached RDDs


I see the following scenario happening. I load two RDDs, A and B, from disk,
cache them and then run some jobs on them, at the very least a count on each.
After these jobs are done, the storage panel shows that 100% of these RDDs are
cached in memory.

Then I create a third RDD C, built from A and B via multiple joins and maps,
cache it as well and start a job on C. While this runs I still see A and B
completely cached, and C slowly getting more and more cached. This is all fine
and good, but in the meantime I see stages running on the UI that point to the
code used to load A and B. How is this possible? Am I misunderstanding how
cached RDDs should behave?
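
In code, the scenario is roughly this (the paths and the parsing are made up,
the real pipeline is more involved):

import org.apache.spark.SparkContext._   // pair-RDD functions; implicit in spark-shell

// `sc` is the SparkContext from spark-shell.
val a = sc.textFile("hdfs:///data/a").map(line => (line.split("\t")(0), line)).cache()
val b = sc.textFile("hdfs:///data/b").map(line => (line.split("\t")(0), line)).cache()
a.count()
b.count()   // the storage panel now shows A and B 100% cached

val c = a.join(b).mapValues { case (x, y) => x + "|" + y }.cache()
c.count()   // while this runs, the UI shows stages pointing at the textFile/map lines above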

And again the general question - how can one debug such issues?

4. Shuffle on disk

Is it true - I couldn't find it in the official docs, but did see it mentioned
in various threads - that the shuffle _always_ hits disk? (Disregarding OS
caches.) Why is this the case? Are you planning to add an option to do the
shuffle in memory, or are there intrinsic reasons why this is impossible?
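
To be precise about what we mean, a tiny example below. Our understanding of
where the data ends up is in the comments and may well be wrong - that is part
of the question:

import org.apache.spark.SparkContext._   // pair-RDD functions; implicit in spark-shell

// `sc` is the SparkContext from spark-shell.
val pairs = sc.parallelize(1 to 1000000).map(i => (i % 100, 1))
val counts = pairs.reduceByKey(_ + _)    // triggers a shuffle stage
counts.count()
// As far as we can tell, the map-side output of this shuffle is written to
// shuffle_* files under spark.local.dir on the executors before the reduce
// side fetches it, even when everything would comfortably fit in memory.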

Sorry again for the giant mail, and thanks for any insights!

Andras

-- 

Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com

http://polyglotprogramming.com
