How wide are the rows of data, either the raw input data or any generated
intermediate data?

We are at a loss as to why our flow doesn't complete. We've been banging our
heads against it for a few weeks.

-Suren



On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey <kevin.mar...@oracle.com>
wrote:

>  Nothing particularly custom.  We've tested with small (4 node)
> development clusters, single-node pseudoclusters, and bigger, using
> plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master,
> Spark local, Spark Yarn (client and cluster) modes, with total memory
> resources ranging from 4GB to 256GB+.
>
> K
>
>
>
> On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote:
>
> To clarify, we are not persisting to disk. That was just one of the
> experiments we did because of some issues we had along the way.
>
>  At this time, we are NOT using persist but cannot get the flow to
> complete in Standalone Cluster mode. We do not have a YARN-capable cluster
> at this time.
>
>  We agree with what you're saying. Your results are what we were hoping
> for and expecting. :-)  Unfortunately we still haven't gotten the flow to
> run end to end on this relatively small dataset.
>
>  It must be something related to our cluster, standalone mode, or our flow,
> but as far as we can tell, we are not doing anything unusual.
>
>  Did you do any custom configuration? Any advice would be appreciated.
>
>  -Suren
>
>
>
>
> On Tue, Jul 8, 2014 at 1:54 PM, Kevin Markey <kevin.mar...@oracle.com>
> wrote:
>
>>  It seems to me that you're not taking full advantage of the lazy
>> evaluation, especially persisting to disk only.  While it might be true
>> that the cumulative size of the RDDs looks like it's 300GB, only a small
>> portion of that should be resident at any one time.  We've evaluated data
>> sets much greater than 10GB in Spark using the Spark master and Spark with
>> Yarn (cluster -- formerly standalone -- mode).  Nice thing about using Yarn
>> is that it reports the actual memory *demand*, not just the memory
>> requested for driver and workers.  Processing a 60GB data set through
>> thousands of stages in a rather complex set of analytics and
>> transformations consumed a total cluster resource (divided among all
>> workers and driver) of only 9GB.  We were somewhat startled at first by
>> this result, thinking that it would be much greater, but realized that it
>> is a consequence of Spark's lazy evaluation model.  This is even with
>> several intermediate computations being cached as input to multiple
>> evaluation paths.
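>>
>> A minimal sketch (Scala, Spark 1.x) of that point, with toy data standing
>> in for a real job: transformations are lazy and only build a lineage, an
>> action triggers the actual work, and cache() marks an intermediate result
>> that feeds multiple evaluation paths.
>>
>>   import org.apache.spark.{SparkConf, SparkContext}
>>   import org.apache.spark.SparkContext._   // PairRDDFunctions in 1.x
>>
>>   val sc = new SparkContext(
>>     new SparkConf().setAppName("lazy-sketch").setMaster("local[2]"))
>>
>>   val base     = sc.parallelize(1 to 1000000)             // nothing computed yet
>>   val filtered = base.filter(_ % 2 == 0)                  // still lazy
>>   val shared   = filtered.map(i => (i % 100, i)).cache()  // reused below
>>
>>   // Only these actions trigger computation; the cached RDD is built once
>>   // and reused, while the non-cached intermediates are pipelined through
>>   // and never fully resident at the same time.
>>   val sums   = shared.reduceByKey(_ + _).count()
>>   val sample = shared.take(10)
>>
>>   sc.stop()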
>>
>> Good luck.
>>
>> Kevin
>>
>>
>>
>> On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:
>>
>> I'll respond for Dan.
>>
>>  Our test dataset was a total of 10 GB of input data (full production
>> dataset for this particular dataflow would be 60 GB roughly).
>>
>>  I'm not sure what the size of the final output data was, but I think it
>> was on the order of 20 GB for the given 10 GB of input data. Also, I can
>> say that when we were experimenting with persist(DISK_ONLY), the size of
>> all RDDs on disk was around 200 GB, which gives a sense of the overall
>> transient memory usage when nothing is persisted.
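>>
>>  For reference, a minimal sketch (Scala, Spark 1.x) of that persist
>> experiment, with toy data standing in for our actual flow:
>>
>>   import org.apache.spark.{SparkConf, SparkContext}
>>   import org.apache.spark.storage.StorageLevel
>>
>>   val sc = new SparkContext(
>>     new SparkConf().setAppName("persist-sketch").setMaster("local[2]"))
>>
>>   // Spill the intermediate RDD to local disk instead of holding it in
>>   // memory; in the real flow, the total size of RDDs persisted this way
>>   // was the ~200 GB figure above.
>>   val intermediate = sc.parallelize(1 to 1000000).map(i => i.toString * 10)
>>   intermediate.persist(StorageLevel.DISK_ONLY)
>>
>>   val rowCount = intermediate.count()   // materializes the RDD on disk
>>   println("rows: " + rowCount)
>>
>>   sc.stop()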
>>
>>  In terms of our test cluster, we had 15 nodes. Each node had 24 cores and
>> ran 2 workers. Each executor got 14 GB of memory.
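>>
>>  Roughly, that layout corresponds to something like the following sketch
>> (not our exact configuration; host names and the core split are
>> illustrative):
>>
>>   // conf/spark-env.sh on each worker node (standalone mode):
>>   //   SPARK_WORKER_INSTANCES=2    # two workers per node
>>   //   SPARK_WORKER_CORES=12       # 24 cores split across the two workers
>>   //   SPARK_WORKER_MEMORY=14g     # memory each worker can hand out
>>
>>   // Driver-side settings (Scala):
>>   import org.apache.spark.SparkConf
>>
>>   val conf = new SparkConf()
>>     .setAppName("flow")
>>     .setMaster("spark://master-host:7077")   // hypothetical master URL
>>     .set("spark.executor.memory", "14g")     // 14 GB per executor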
>>
>>  -Suren
>>
>>
>>
>> On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey <kevin.mar...@oracle.com>
>> wrote:
>>
>>>  When you say "large data sets", how large?
>>> Thanks
>>>
>>>
>>> On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
>>>
>>>  From a development perspective, I vastly prefer Spark to MapReduce.
>>> The MapReduce API is very constrained; Spark's API feels much more natural
>>> to me. Testing and local development are also very easy - creating a local
>>> Spark context is trivial, and it reads local files. For your unit tests you
>>> can just have them create a local context and execute your flow with some
>>> test data. Even better, you can do ad-hoc work in the Spark shell, and if
>>> you want to move that into production code it will look exactly the same.
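>>>
>>>  For example, a minimal sketch (Scala, Spark 1.x) of a unit test that runs
>>> a toy flow against a local context -- the flow itself is a stand-in:
>>>
>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>   import org.apache.spark.SparkContext._   // PairRDDFunctions in 1.x
>>>
>>>   // The "flow" under test: a word count, standing in for real logic.
>>>   def wordCount(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
>>>     sc.parallelize(lines)
>>>       .flatMap(_.split("\\s+"))
>>>       .map(word => (word, 1))
>>>       .reduceByKey(_ + _)
>>>       .collectAsMap()
>>>       .toMap
>>>
>>>   // In the test: create a local context, run the flow, check the result.
>>>   val sc = new SparkContext(
>>>     new SparkConf().setAppName("test").setMaster("local[2]"))
>>>   try {
>>>     assert(wordCount(sc, Seq("a b a")) == Map("a" -> 2, "b" -> 1))
>>>   } finally {
>>>     sc.stop()
>>>   }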
>>>
>>>  Unfortunately, the picture isn't so rosy when it gets to production.
>>> In my experience, Spark simply doesn't scale to the volumes that MapReduce
>>> will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
>>> would be better, but I haven't had the opportunity to try them. I find jobs
>>> tend to just hang forever for no apparent reason on large data sets (but
>>> smaller than what I push through MapReduce).
>>>
>>>  I am hopeful the situation will improve - Spark is developing quickly
>>> - but if you have large amounts of data you should proceed with caution.
>>>
>>>  Keep in mind there are some frameworks for Hadoop which can hide the
>>> ugly MapReduce API behind something very similar in form to Spark's API,
>>> e.g. Apache Crunch. So you might consider those as well.
>>>
>>>  (Note: the above is with Spark 1.0.0.)
>>>
>>>
>>>
>>> On Mon, Jul 7, 2014 at 11:07 AM, <santosh.viswanat...@accenture.com>
>>> wrote:
>>>
>>>>  Hello Experts,
>>>>
>>>>
>>>>
>>>> I am doing some comparative study on the below:
>>>>
>>>> Spark vs Impala
>>>> Spark vs MapReduce. Is it worth migrating from an existing MR
>>>> implementation to Spark?
>>>>
>>>> Please share your thoughts and expertise.
>>>>
>>>> Thanks,
>>>> Santosh
>>>
>>>
>>>
>>> --
>>>  Daniel Siegmann, Software Developer
>>> Velos
>>>  Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>>> E: daniel.siegm...@velos.io W: www.velos.io
>>>
>>>
>>>


-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io
