Losing executors due to memory problems

2016-08-11 Thread Muttineni, Vinay
Hello, I have a spark job that basically reads data from two tables into two DataFrames, which are subsequently converted to RDDs. I then join them based on a common key. Each table is about 10 TB in size, but after filtering, the two RDDs are about 500GB each. I have 800 executors with 8GB
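A minimal sketch of the kind of job described, assuming hypothetical table names, a placeholder "status" filter, and "id" as the common key (none of these come from the original post):

import org.apache.spark.sql.SparkSession

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JoinSketch").getOrCreate()

    // Hypothetical tables and filter; the real job reads two ~10 TB tables.
    val left  = spark.table("table_a").filter("status = 'ACTIVE'")
    val right = spark.table("table_b").filter("status = 'ACTIVE'")

    // Key each side by the common join column before dropping down to RDDs.
    val leftByKey  = left.rdd.map(r => (r.getAs[String]("id"), r))
    val rightByKey = right.rdd.map(r => (r.getAs[String]("id"), r))

    // The shuffle for this join is where executor memory pressure usually
    // shows up; a higher partition count spreads the 500GB-per-side load.
    val joined = leftByKey.join(rightByKey, 2000)
    joined.take(5).foreach(println)

    spark.stop()
  }
}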

Requested array size exceeds VM limit Error

2015-02-05 Thread Muttineni, Vinay
Hi, I have a 170GB tab-delimited data set which I am converting into the RDD[LabeledPoint] format. I am then taking a 60% sample of this data set to be used for training a GBT model. I got the Size exceeds Integer.MAX_VALUE error, which I fixed by repartitioning the data set to 1000
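A rough sketch of the pipeline as described, assuming a hypothetical HDFS path and a layout with the label in the first tab-separated field; the 1000-partition count and 60% sample come from the post, everything else is illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

object GbtSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GbtSketch"))

    // Hypothetical input path and layout: label first, features after it.
    val data = sc.textFile("hdfs:///path/to/input")
      .map { line =>
        val fields = line.split("\t")
        LabeledPoint(fields.head.toDouble, Vectors.dense(fields.tail.map(_.toDouble)))
      }
      // Keeping each partition well under 2 GB avoids the
      // "Size exceeds Integer.MAX_VALUE" limit on a single block.
      .repartition(1000)

    // 60% sample without replacement, fixed seed for reproducibility.
    val training = data.sample(false, 0.6, 42L)

    // Illustrative settings; tune numIterations etc. for the real job.
    val boostingStrategy = BoostingStrategy.defaultParams("Classification")
    boostingStrategy.numIterations = 50
    val model = GradientBoostedTrees.train(training, boostingStrategy)

    println(s"Trained an ensemble of ${model.numTrees} trees")
    sc.stop()
  }
}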

Optimal Partition Strategy

2014-09-25 Thread Muttineni, Vinay
Hello, A bit of background: I have a dataset with about 200 million records and around 10 columns. The size of this dataset is around 1.5TB, and it is split into around 600 files. When I read this dataset using sparkContext, it creates around 3000 partitions by default if I do not specify the
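A small sketch of the two knobs usually discussed for this, with a hypothetical input path; note that the second argument to textFile is only a minimum partition count, since the actual number is driven by the input splits:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionSketch"))

    // Ask for at least 6000 partitions at load time (a lower bound only).
    val records = sc.textFile("hdfs:///path/to/dataset", 6000)
    println(s"Partitions after load: ${records.partitions.length}")

    // Reshape explicitly afterwards: repartition shuffles to an exact count,
    // coalesce merges down to fewer partitions without a full shuffle.
    val reshaped = records.repartition(6000)
    println(s"Partitions after repartition: ${reshaped.partitions.length}")

    sc.stop()
  }
}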

RE: Basic Scala and Spark questions

2014-06-24 Thread Muttineni, Vinay
Hello Tilak, 1. I get a "not found: type RDD" error. Can someone please tell me which jars I need to add as external jars and what I should add under the import statements so that this error will go away. Do you not see any issues with the import statements? Add the
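For reference, "not found: type RDD" is normally resolved by having spark-core on the compile classpath and importing the RDD type explicitly; a minimal self-contained example (names are illustrative):

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object RddImportSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddImportSketch"))

    // The RDD type lives in org.apache.spark.rdd, so the import above is
    // what makes an explicit RDD[Int] annotation compile.
    val numbers: RDD[Int] = sc.parallelize(1 to 10)
    println(numbers.sum())

    sc.stop()
  }
}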

Better way to use a large data set?

2014-06-20 Thread Muttineni, Vinay
Hi All, I have an 8 million row, 500 column data set, which is derived by reading a text file and doing a filter and flatMap operation to weed out some anomalies. Now, I have a process which has to run through all 500 columns, do a couple of map, reduce, forEach operations on the data set and return some
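One common shape for this kind of job, sketched with a hypothetical tab-delimited input and a simple per-column mean: persist the derived data set once so the 500 per-column passes do not recompute the whole lineage each time.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheSketch"))

    // Hypothetical parse: split each line into fields, drop malformed rows.
    val rows = sc.textFile("hdfs:///path/to/input")
      .map(_.split("\t"))
      .filter(_.length == 500)

    // Persist once so every per-column pass reuses the parsed data instead
    // of re-reading and re-filtering the text file 500 times.
    rows.persist(StorageLevel.MEMORY_AND_DISK)

    // Assuming numeric columns purely for illustration.
    val columnMeans = (0 until 500).map { i =>
      val col = rows.map(fields => fields(i).toDouble)
      (i, col.sum() / col.count())
    }
    columnMeans.take(5).foreach(println)

    rows.unpersist()
    sc.stop()
  }
}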

java.lang.OutOfMemoryError with saveAsTextFile

2014-06-18 Thread Muttineni, Vinay
Hi, I have a 5 million record, 300 column data set. I am running a spark job in yarn-cluster mode, with the following args: --driver-memory 11G --executor-memory 11G --executor-cores 16 --num-executors 500. The spark job replaces all categorical variables with some integers. I am getting the below
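A hedged sketch of one way such a job is often written, with hypothetical paths and the categorical value assumed to be in the first column; the broadcast lookup and the repartition before the save are the points that usually matter for memory:

import org.apache.spark.{SparkConf, SparkContext}

object EncodeAndSaveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("EncodeAndSaveSketch"))

    // Hypothetical input: tab-delimited text, categorical value in column 0.
    val rows = sc.textFile("hdfs:///path/to/input").map(_.split("\t"))

    // Build a small value -> integer code map for the categorical column and
    // broadcast it, so the lookup happens executor-side without a shuffle.
    val codes = rows.map(_(0)).distinct().collect().zipWithIndex.toMap
    val codesBc = sc.broadcast(codes)

    val encoded = rows.map { fields =>
      (codesBc.value(fields(0)).toString +: fields.tail).mkString("\t")
    }

    // Spreading the output over more partitions keeps each write task's
    // buffers small, which is one lever against OutOfMemoryError on save.
    encoded.repartition(2000).saveAsTextFile("hdfs:///path/to/output")

    sc.stop()
  }
}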

Custom Accumulator: Type Mismatch Error

2014-05-24 Thread Muttineni, Vinay
Hello, I have been trying to implement a custom accumulator as below: import org.apache.spark._ class VectorNew1(val data: Array[Double]) {} implicit object VectorAP extends AccumulatorParam[VectorNew1] { def zero(v: VectorNew1) = new VectorNew1(new
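The excerpt cuts the accumulator code off mid-definition, so as a point of comparison, here is a complete, hedged sketch of the old Spark 1.x AccumulatorParam pattern for an array-of-doubles wrapper (everything beyond the names shown in the excerpt is an assumption):

import org.apache.spark.{AccumulatorParam, SparkConf, SparkContext}

object AccumulatorSketch {
  // Plain wrapper around an array of doubles, made Serializable so it can
  // be shipped inside closures.
  class VectorNew1(val data: Array[Double]) extends Serializable

  // AccumulatorParam needs both zero() and addInPlace(); a missing or
  // differently-typed addInPlace is a frequent cause of type-mismatch errors.
  implicit object VectorAP extends AccumulatorParam[VectorNew1] {
    def zero(v: VectorNew1): VectorNew1 =
      new VectorNew1(new Array[Double](v.data.length))

    def addInPlace(a: VectorNew1, b: VectorNew1): VectorNew1 = {
      for (i <- a.data.indices) a.data(i) += b.data(i)
      a
    }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AccumulatorSketch"))

    // With the implicit VectorAP in scope, sc.accumulator picks it up.
    val acc = sc.accumulator(new VectorNew1(Array(0.0, 0.0, 0.0)))
    sc.parallelize(Seq(1.0, 2.0, 3.0)).foreach { x =>
      acc += new VectorNew1(Array(x, x, x))
    }
    println(acc.value.data.mkString(","))

    sc.stop()
  }
}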