Here are several good ones: https://www.google.com/search?q=cloudera+spark&oq=cloudera+spark&aqs=chrome..69i57j69i65l3j69i60l2.4439j0j7&sourceid=chrome&espv=2&es_sm=119&ie=UTF-8
On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira <ianferre...@hotmail.com>wrote: > Do you have the link to the Cloudera comment? > > Sent from Windows Mail > > *From:* Dean Wampler <deanwamp...@gmail.com> > *Sent:* Thursday, April 10, 2014 7:39 AM > *To:* Spark Users <user@spark.apache.org> > *Cc:* Daniel Darabos <daniel.dara...@lynxanalytics.com>, Andras > Barjak<andras.bar...@lynxanalytics.com> > > Spark has been endorsed by Cloudera as the successor to MapReduce. That > says a lot... > > > On Thu, Apr 10, 2014 at 10:11 AM, Andras Nemeth < > andras.nem...@lynxanalytics.com> wrote: > >> Hello Spark Users, >> >> With the recent graduation of Spark to a top level project (grats, btw!), >> maybe a well timed question. :) >> >> We are at the very beginning of a large scale big data project and after >> two months of exploration work we'd like to settle on the technologies to >> use, roll up our sleeves and start to build the system. >> >> Spark is one of the forerunners for our technology choice. >> >> My question in essence is whether it's a good idea or is Spark too >> 'experimental' just yet to bet our lives (well, the project's life) on it. >> >> The benefits of choosing Spark are numerous and I guess all too obvious >> for this audience - e.g. we love its powerful abstraction, ease of >> development and the potential for using a single system for serving and >> manipulating huge amount of data. >> >> This email aims to ask about the risks. I enlist concrete issues we've >> encountered below, but basically my concern boils down to two philosophical >> points: >> I. Is it too much magic? Lots of things "just work right" in Spark and >> it's extremely convenient and efficient when it indeed works. But should we >> be worried that customization is hard if the built in behavior is not quite >> right for us? Are we to expect hard to track down issues originating from >> the black box behind the magic? >> II. Is it mature enough? E.g. we've created a pull >> request<https://github.com/apache/spark/pull/181>which fixes a problem that >> we were very surprised no one ever stumbled upon >> before. So that's why I'm asking: is Spark being already used in >> professional settings? Can one already trust it being reasonably bug free >> and reliable? >> >> I know I'm asking a biased audience, but that's fine, as I want to be >> convinced. :) >> >> So, to the concrete issues. Sorry for the long mail, and let me know if I >> should break this out into more threads or if there is some other way to >> have this discussion... >> >> 1. Memory management >> The general direction of these questions is whether it's possible to take >> RDD caching related memory management more into our own hands as LRU >> eviction is nice most of the time but can be very suboptimal in some of our >> use cases. >> A. Somehow prioritize cached RDDs, E.g. mark some "essential" that one >> really wants to keep. I'm fine with going down in flames if I mark too much >> data essential. >> B. Memory "reflection": can you pragmatically get the memory size of a >> cached rdd and memory sizes available in total/per executor? If we could do >> this we could indirectly avoid automatic evictions of things we might >> really want to keep in memory. >> C. Evictions caused by RDD partitions on the driver. I had a setup with >> huge worker memory and smallish memory on the driver JVM. To my surprise, >> the system started to cache RDD partitions on the driver as well. As the >> driver ran out of memory I started to see evictions while there were still >> plenty of space on workers. This resulted in lengthy recomputations. Can >> this be avoided somehow? >> D. Broadcasts. Is it possible to get rid of a broadcast manually, without >> waiting for the LRU eviction taking care of it? Can you tell the size of a >> broadcast programmatically? >> >> >> 2. Akka lost connections >> We have quite often experienced lost executors due to akka exceptions - >> mostly connection lost or similar. It seems to happen when an executor gets >> extremely busy with some CPU intensive work. Our hypothesis is that akka >> network threads get starved and the executor fails to respond within >> timeout limits. Is this plausible? If yes, what can we do with it? >> >> In general, these are scary errors in the sense that they come from the >> very core of the framework and it's hard to link it to something we do in >> our own code, and thus hard to find a fix. So a question more for the >> community: how often do you end up scratching your head about cases where >> spark magic doesn't work perfectly? >> >> >> 3. Recalculation of cached rdds >> I see the following scenario happening. I load two RDDs A,B from disk, >> cache them and then do some jobs on them, at the very least a count on >> each. After these jobs are done I see on the storage panel that 100% of >> these RDDs are cached in memory. >> >> Then I create a third RDD C which is created by multiple joins and maps >> from A and B, also cache it and start a job on C. When I do this I still >> see A and B completely cached and also see C slowly getting more and more >> cached. This is all fine and good, but in the meanwhile I see stages >> running on the UI that point to code which is used to load A and B. How is >> this possible? Am I misunderstanding how cached RDDs should behave? >> >> And again the general question - how can one debug such issues? >> >> 4. Shuffle on disk >> Is it true - I couldn't find it in official docs, but did see this >> mentioned in various threads - that shuffle _always_ hits disk? >> (Disregarding OS caches.) Why is this the case? Are you planning to add a >> function to do shuffle in memory or are there some intrinsic reasons for >> this to be impossible? >> >> >> Sorry again for the giant mail, and thanks for any insights! >> >> Andras >> >> >> > > > -- > Dean Wampler, Ph.D. > Typesafe > @deanwampler > http://typesafe.com > http://polyglotprogramming.com > -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com