Hi Reynold,

Nice! What Spark configuration parameters did you use to get your job to run successfully on a large dataset? My job is failing on 1TB of input data (uncompressed) on a 4-node cluster (64GB of memory per node). There are no OutOfMemory errors, just lost executors.
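To make the question concrete, here is the sort of tuning I mean. This is a rough sketch only, based on the Spark 0.9 configuration docs; the values are placeholder guesses for my 64GB nodes, not settings I know to work:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of settings for a 4-node cluster with 64GB per node.
// All values are illustrative guesses, not recommendations.
val conf = new SparkConf()
  .setAppName("large-input-job")
  // Leave headroom for the OS instead of claiming all 64GB; an executor
  // that exhausts physical memory can be killed by the OS and show up as
  // "lost" without ever throwing a Java OutOfMemoryError.
  .set("spark.executor.memory", "48g")
  // Shrink the cache's share of the heap so shuffles have more room.
  .set("spark.storage.memoryFraction", "0.3")
  // Consolidate intermediate shuffle files on large shuffles.
  .set("spark.shuffle.consolidateFiles", "true")
  // More, smaller partitions so no single task overwhelms an executor.
  .set("spark.default.parallelism", "1000")
  // Longer Akka timeout (in seconds) so executors stalled in GC aren't
  // declared dead prematurely.
  .set("spark.akka.timeout", "200")

val sc = new SparkContext(conf)

In particular, I'm wondering whether my lost executors are JVMs being killed after over-committing node memory, which is why the sketch leaves spark.executor.memory well under 64GB.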
Thanks,
Soila

On Mar 20, 2014 11:29 AM, "Reynold Xin" <r...@databricks.com> wrote:

> I'm not really at liberty to discuss details of the job. It involves some
> expensive aggregated statistics, and took 10 hours to complete (mostly
> bottlenecked by network & I/O).
>
> On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman <
> suren.hira...@velos.io> wrote:
>
>> Reynold,
>>
>> How complex was that job (I guess in terms of number of transforms and
>> actions), and how long did that take to process?
>>
>> -Suren
>>
>> On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>> > Actually we just ran a job with 70TB+ compressed data on 28 worker
>> > nodes - I didn't count the size of the uncompressed data, but I am
>> > guessing it is somewhere between 200TB and 700TB.
>> >
>> > On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani <us...@platfora.com> wrote:
>> >
>> > > All,
>> > > What is the largest input data set y'all have come across that has
>> > > been successfully processed in production using Spark? Ballpark?
>>
>> --
>> SUREN HIRAMAN, VP TECHNOLOGY
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR
>> NEW YORK, NY 10001
>> O: (917) 525-2466 ext. 105
>> F: 646.349.4063
>> E: suren.hiraman@velos.io
>> W: www.velos.io