Hi Reynold,

Nice! What Spark configuration parameters did you use to get your job to run successfully on a large dataset? My job is failing on 1TB of input data (uncompressed) on a 4-node cluster (64GB of memory per node). There are no OutOfMemory errors, just lost executors.
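To make the question concrete, here is the sort of tuning I mean. This is a rough sketch only, based on the Spark 0.9 configuration docs; the values are placeholder guesses for my 64GB nodes, not settings I know to work:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of settings for a 4-node cluster with 64GB per node.
// All values are illustrative guesses, not recommendations.
val conf = new SparkConf()
  .setAppName("large-input-job")
  // Leave headroom for the OS instead of claiming all 64GB; an executor
  // that exhausts physical memory can be killed by the OS and show up as
  // "lost" without ever throwing a Java OutOfMemoryError.
  .set("spark.executor.memory", "48g")
  // Shrink the cache's share of the heap so shuffles have more room.
  .set("spark.storage.memoryFraction", "0.3")
  // Consolidate intermediate shuffle files on large shuffles.
  .set("spark.shuffle.consolidateFiles", "true")
  // More, smaller partitions so no single task overwhelms an executor.
  .set("spark.default.parallelism", "1000")
  // Longer Akka timeout (in seconds) so executors stalled in GC aren't
  // declared dead prematurely.
  .set("spark.akka.timeout", "200")

val sc = new SparkContext(conf)

In particular, I'm wondering whether my lost executors are JVMs being killed after over-committing node memory, which is why the sketch leaves spark.executor.memory well under 64GB.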
Thanks,
Soila

On Mar 20, 2014 11:29 AM, "Reynold Xin" <r...@databricks.com> wrote:

> I'm not really at liberty to discuss details of the job. It involves some
> expensive aggregated statistics, and took 10 hours to complete (mostly
> bottlenecked by network & I/O).
>
> On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman <
> suren.hira...@velos.io> wrote:
>
>> Reynold,
>>
>> How complex was that job (I guess in terms of number of transforms and
>> actions), and how long did that take to process?
>>
>> -Suren
>>
>> On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>> > Actually we just ran a job with 70TB+ compressed data on 28 worker
>> > nodes - I didn't count the size of the uncompressed data, but I am
>> > guessing it is somewhere between 200TB and 700TB.
>> >
>> > On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani <us...@platfora.com> wrote:
>> >
>> > > All,
>> > > What is the largest input data set y'all have come across that has
>> > > been successfully processed in production using Spark? Ballpark?
>>
>> --
>> SUREN HIRAMAN, VP TECHNOLOGY
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR
>> NEW YORK, NY 10001
>> O: (917) 525-2466 ext. 105
>> F: 646.349.4063
>> E: suren.hiraman@velos.io
>> W: www.velos.io