Re: Spark - ready for prime time?

Dean Wampler Thu, 10 Apr 2014 07:48:12 -0700

Here are several good ones:

https://www.google.com/search?q=cloudera+spark&oq=cloudera+spark&aqs=chrome..69i57j69i65l3j69i60l2.4439j0j7&sourceid=chrome&espv=2&es_sm=119&ie=UTF-8




On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira <ianferre...@hotmail.com>wrote:

>  Do you have the link to the Cloudera comment?
>
> Sent from Windows Mail
>
> *From:* Dean Wampler <deanwamp...@gmail.com>
> *Sent:* Thursday, April 10, 2014 7:39 AM
> *To:* Spark Users <user@spark.apache.org>
> *Cc:* Daniel Darabos <daniel.dara...@lynxanalytics.com>, Andras 
> Barjak<andras.bar...@lynxanalytics.com>
>
> Spark has been endorsed by Cloudera as the successor to MapReduce. That
> says a lot...
>
>
> On Thu, Apr 10, 2014 at 10:11 AM, Andras Nemeth <
> andras.nem...@lynxanalytics.com> wrote:
>
>> Hello Spark Users,
>>
>> With the recent graduation of Spark to a top level project (grats, btw!),
>> maybe a well timed question. :)
>>
>> We are at the very beginning of a large scale big data project and after
>> two months of exploration work we'd like to settle on the technologies to
>> use, roll up our sleeves and start to build the system.
>>
>> Spark is one of the forerunners for our technology choice.
>>
>> My question in essence is whether it's a good idea or is Spark too
>> 'experimental' just yet to bet our lives (well, the project's life) on it.
>>
>> The benefits of choosing Spark are numerous and I guess all too obvious
>> for this audience - e.g. we love its powerful abstraction, ease of
>> development and the potential for using a single system for serving and
>> manipulating huge amount of data.
>>
>> This email aims to ask about the risks. I enlist concrete issues we've
>> encountered below, but basically my concern boils down to two philosophical
>> points:
>> I. Is it too much magic? Lots of things "just work right" in Spark and
>> it's extremely convenient and efficient when it indeed works. But should we
>> be worried that customization is hard if the built in behavior is not quite
>> right for us? Are we to expect hard to track down issues originating from
>> the black box behind the magic?
>> II. Is it mature enough? E.g. we've created a pull 
>> request<https://github.com/apache/spark/pull/181>which fixes a problem that 
>> we were very surprised no one ever stumbled upon
>> before. So that's why I'm asking: is Spark being already used in
>> professional settings? Can one already trust it being reasonably bug free
>> and reliable?
>>
>> I know I'm asking a biased audience, but that's fine, as I want to be
>> convinced. :)
>>
>> So, to the concrete issues. Sorry for the long mail, and let me know if I
>> should break this out into more threads or if there is some other way to
>> have this discussion...
>>
>> 1. Memory management
>> The general direction of these questions is whether it's possible to take
>> RDD caching related memory management more into our own hands as LRU
>> eviction is nice most of the time but can be very suboptimal in some of our
>> use cases.
>> A. Somehow prioritize cached RDDs, E.g. mark some "essential" that one
>> really wants to keep. I'm fine with going down in flames if I mark too much
>> data essential.
>> B. Memory "reflection": can you pragmatically get the memory size of a
>> cached rdd and memory sizes available in total/per executor? If we could do
>> this we could indirectly avoid automatic evictions of things we might
>> really want to keep in memory.
>> C. Evictions caused by RDD partitions on the driver. I had a setup with
>> huge worker memory and smallish memory on the driver JVM. To my surprise,
>> the system started to cache RDD partitions on the driver as well. As the
>> driver ran out of memory I started to see evictions while there were still
>> plenty of space on workers. This resulted in lengthy recomputations. Can
>> this be avoided somehow?
>> D. Broadcasts. Is it possible to get rid of a broadcast manually, without
>> waiting for the LRU eviction taking care of it? Can you tell the size of a
>> broadcast programmatically?
>>
>>
>> 2. Akka lost connections
>> We have quite often experienced lost executors due to akka exceptions -
>> mostly connection lost or similar. It seems to happen when an executor gets
>> extremely busy with some CPU intensive work. Our hypothesis is that akka
>> network threads get starved and the executor fails to respond within
>> timeout limits. Is this plausible? If yes, what can we do with it?
>>
>> In general, these are scary errors in the sense that they come from the
>> very core of the framework and it's hard to link it to something we do in
>> our own code, and thus hard to find a fix. So a question more for the
>> community: how often do you end up scratching your head about cases where
>> spark magic doesn't work perfectly?
>>
>>
>> 3. Recalculation of cached rdds
>> I see the following scenario happening. I load two RDDs A,B from disk,
>> cache them and then do some jobs on them, at the very least a count on
>> each. After these jobs are done I see on the storage panel that 100% of
>> these RDDs are cached in memory.
>>
>> Then I create a third RDD C which is created by multiple joins and maps
>> from A and B, also cache it and start a job on C. When I do this I still
>> see A and B completely cached and also see C slowly getting more and more
>> cached. This is all fine and good, but in the meanwhile I see stages
>> running on the UI that point to code which is used to load A and B. How is
>> this possible? Am I misunderstanding how cached RDDs should behave?
>>
>> And again the general question - how can one debug such issues?
>>
>> 4. Shuffle on disk
>> Is it true - I couldn't find it in official docs, but did see this
>> mentioned in various threads - that shuffle _always_ hits disk?
>> (Disregarding OS caches.) Why is this the case? Are you planning to add a
>> function to do shuffle in memory or are there some intrinsic reasons for
>> this to be impossible?
>>
>>
>> Sorry again for the giant mail, and thanks for any insights!
>>
>> Andras
>>
>>
>>
>
>
> --
> Dean Wampler, Ph.D.
> Typesafe
> @deanwampler
> http://typesafe.com
> http://polyglotprogramming.com
>



-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com

Re: Spark - ready for prime time?

Reply via email to