Re: Largest input data set observed for Spark.

2014-03-22 Thread Usman Ghani
I am having similar issues with much smaller data sets. I am using the Spark
EC2 scripts to launch clusters, but I almost always end up with straggling
executors that take over a node's CPU and memory and never finish.
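A common mitigation for stragglers like these is Spark's speculative execution, which re-launches unusually slow tasks on other executors. A minimal sketch in Scala, assuming a 0.9-era SparkConf; the app name and threshold values below are illustrative, not settings anyone in this thread reported using:

    import org.apache.spark.{SparkConf, SparkContext}

    // Enable speculative execution so unusually slow tasks get re-launched
    // on other nodes; all values here are illustrative.
    val conf = new SparkConf()
      .setAppName("straggler-mitigation")          // hypothetical app name
      .set("spark.speculation", "true")            // off by default
      .set("spark.speculation.quantile", "0.75")   // check only after 75% of tasks finish
      .set("spark.speculation.multiplier", "1.5")  // re-launch tasks 1.5x slower than the median
    val sc = new SparkContext(conf)

Note that speculation only helps when the slowness comes from the node (bad hardware, contention); if a task is slow because of data skew, its re-launched copy will be just as slow.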



On Thu, Mar 20, 2014 at 1:54 PM, Soila Pertet Kavulya skavu...@gmail.com wrote:

 Hi Reynold,

 Nice! What Spark configuration parameters did you use to get your job to
 run successfully on a large dataset? My job is failing on 1TB of input data
 (uncompressed) on a 4-node cluster (64GB memory per node). No OutOfMemory
 errors, just lost executors.
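Lost executors with no OutOfMemoryError often mean the executor JVM grew past the node's physical memory and was killed by the OS, or stalled in GC long enough to miss heartbeats. A minimal sketch of the knobs usually tuned first, assuming 0.9-era property names; the values are illustrative guesses for a 64GB node, not settings confirmed in this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Leave headroom between the executor heap and physical memory, and
    // spread the work across more, smaller partitions. Values are illustrative.
    val conf = new SparkConf()
      .setAppName("large-input-tuning")            // hypothetical app name
      .set("spark.executor.memory", "48g")         // well under the 64GB physical limit
      .set("spark.storage.memoryFraction", "0.4")  // shrink the RDD cache to leave room for shuffle
      .set("spark.default.parallelism", "512")     // smaller partitions cut per-task memory
    val sc = new SparkContext(conf)

Raising parallelism so each task touches less data is often more effective than simply raising heap sizes.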

 Thanks,

 Soila
 On Mar 20, 2014 11:29 AM, Reynold Xin r...@databricks.com wrote:

 I'm not really at liberty to discuss details of the job. It involves some
 expensive aggregated statistics, and took 10 hours to complete (mostly
 bottlenecked by network and I/O).





 On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman suren.hira...@velos.io wrote:

 Reynold,

 How complex was that job (I guess in terms of the number of transforms and
 actions), and how long did it take to process?

 -Suren



 On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin r...@databricks.com wrote:

  Actually we just ran a job with 70TB+ compressed data on 28 worker nodes -
  I didn't count the size of the uncompressed data, but I am guessing it is
  somewhere between 200TB and 700TB.
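(For scale, that guess corresponds to an assumed compression ratio of roughly 3x to 10x: 70TB x 3 = 210TB and 70TB x 10 = 700TB.)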
 
 
 
  On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani us...@platfora.com wrote:
 
   All,
   What is the largest input data set y'all have come across that has been
   successfully processed in production using Spark? Ballpark?
  
 



 --

 SUREN HIRAMAN, VP TECHNOLOGY
 Velos
 Accelerating Machine Learning

 440 NINTH AVENUE, 11TH FLOOR
 NEW YORK, NY 10001
 O: (917) 525-2466 ext. 105
 F: 646.349.4063
 E: suren.hiraman@velos.io
 W: www.velos.io




