I think we're missing the point a bit. Everything was actually flowing
through smoothly and in a reasonable time, until it reached the last two
tasks (out of over a thousand in the final stage alone), at which point it
just fell into a coma. Not so much as a cranky message in the logs.

I don't know *why* that happened. Maybe it isn't the overall amount of
data, but something I'm doing wrong with my flow. In any case, improvements
to diagnostic info would probably be helpful.

I look forward to the next release. :-)


On Tue, Jul 8, 2014 at 3:47 PM, Reynold Xin <r...@databricks.com> wrote:

> Not sure exactly what is happening but perhaps there are ways to
> restructure your program for it to work better. Spark is definitely able to
> handle much, much larger workloads.
>
> I've personally run a workload that shuffled 300 TB of data. I've also run
> something that shuffled 5 TB/node and filled my disks so full that the
> file system was close to breaking.
>
> We can definitely do a better job in Spark of outputting more meaningful
> diagnostics and of being more robust with partitions of data that don't fit
> in memory, though. A lot of the work in the next few releases will be on
> that.
>
>
>
> On Tue, Jul 8, 2014 at 10:04 AM, Surendranauth Hiraman <
> suren.hira...@velos.io> wrote:
>
>> I'll respond for Dan.
>>
>> Our test dataset was a total of 10 GB of input data (full production
>> dataset for this particular dataflow would be 60 GB roughly).
>>
>> I'm not sure what the size of the final output data was, but I think it
>> was on the order of 20 GB for the given 10 GB of input data. Also, I can
>> say that when we were experimenting with persist(DISK_ONLY), the size of
>> all RDDs on disk was around 200 GB, which gives a sense of the overall
>> transient data volume when nothing is persisted.
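>>
>> (For reference, the DISK_ONLY persistence we were experimenting with looks
>> roughly like the following -- a minimal sketch with made-up names and paths,
>> not our actual flow:)
>>
>>     import org.apache.spark.{SparkConf, SparkContext}
>>     import org.apache.spark.storage.StorageLevel
>>
>>     // master is supplied by spark-submit; app name is just a label
>>     val sc = new SparkContext(new SparkConf().setAppName("disk-only-sketch"))
>>
>>     // keep the intermediate RDD on local disk rather than in memory
>>     val intermediate = sc.textFile("hdfs:///data/input")
>>       .map(_.split("\t"))
>>       .persist(StorageLevel.DISK_ONLY)
>>
>>     intermediate.count()  // materialize it so later stages reuse the on-disk copy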
>>
>> In terms of our test cluster, we had 15 nodes. Each node had 24 cores and
>> ran 2 workers. Each executor got 14 GB of memory.
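>>
>> (Roughly, that layout corresponds to something like this in each node's
>> conf/spark-env.sh on a standalone cluster -- a sketch, not our exact
>> config:)
>>
>>     # two standalone workers per node, splitting the 24 cores between them
>>     SPARK_WORKER_INSTANCES=2
>>     SPARK_WORKER_CORES=12
>>     SPARK_WORKER_MEMORY=14g   # memory each worker can hand to its executors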
>>
>> -Suren
>>
>>
>>
>> On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey <kevin.mar...@oracle.com>
>> wrote:
>>
>>>  When you say "large data sets", how large?
>>> Thanks
>>>
>>>
>>> On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
>>>
>>> From a development perspective, I vastly prefer Spark to MapReduce.
>>> The MapReduce API is very constrained; Spark's API feels much more natural
>>> to me. Testing and local development are also very easy - creating a local
>>> Spark context is trivial, and it reads local files. For your unit tests you
>>> can just have them create a local context and execute your flow with some
>>> test data. Even better, you can do ad-hoc work in the Spark shell, and if
>>> you want that in your production code it will look exactly the same.
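>>>
>>> (For example, a local context for a unit test is only a couple of lines --
>>> a minimal sketch, with a made-up test file and assertion:)
>>>
>>>     import org.apache.spark.{SparkConf, SparkContext}
>>>
>>>     // "local[2]" runs everything in-process with 2 threads -- no cluster needed
>>>     val sc = new SparkContext(
>>>       new SparkConf().setMaster("local[2]").setAppName("flow-test"))
>>>
>>>     val words = sc.textFile("src/test/resources/sample.txt")
>>>       .flatMap(_.split("\\s+"))
>>>     assert(words.count() > 0)  // run the flow on the small local file
>>>
>>>     sc.stop()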
>>>
>>>  Unfortunately, the picture isn't so rosy when it gets to production.
>>> In my experience, Spark simply doesn't scale to the volumes that MapReduce
>>> will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
>>> would be better, but I haven't had the opportunity to try them. I find jobs
>>> tend to just hang forever for no apparent reason on large data sets (but
>>> smaller than what I push through MapReduce).
>>>
>>>  I am hopeful the situation will improve - Spark is developing quickly
>>> - but if you have large amounts of data you should proceed with caution.
>>>
>>> Keep in mind there are some frameworks for Hadoop which can hide the
>>> ugly MapReduce API behind something very similar in form to Spark's; e.g.,
>>> Apache Crunch. So you might consider those as well.
>>>
>>>  (Note: the above is with Spark 1.0.0.)
>>>
>>>
>>>
>>> On Mon, Jul 7, 2014 at 11:07 AM, <santosh.viswanat...@accenture.com>
>>> wrote:
>>>
>>>>  Hello Experts,
>>>>
>>>> I am doing a comparative study on the below:
>>>>
>>>> Spark vs Impala
>>>>
>>>> Spark vs MapReduce. Is it worth migrating from an existing MR
>>>> implementation to Spark?
>>>>
>>>> Please share your thoughts and expertise.
>>>>
>>>> Thanks,
>>>> Santosh
>>>>
>>>
>>>
>>>
>>> --
>>>  Daniel Siegmann, Software Developer
>>> Velos
>>>  Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>>> E: daniel.siegm...@velos.io W: www.velos.io
>>>
>>>
>>>
>>
>>
>> --
>>
>> SUREN HIRAMAN, VP TECHNOLOGY
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR
>> NEW YORK, NY 10001
>> O: (917) 525-2466 ext. 105
>> F: 646.349.4063
>> E: suren.hiraman@velos.io
>> W: www.velos.io
>>
>>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
