Hello,
I have a Spark job that basically reads data from two tables into two
DataFrames, which are subsequently converted to RDDs. I then join them on a
common key.
Each table is about 10 TB in size, but after filtering, the two RDDs are about
500 GB each.
I have 800 executors with 8GB
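A minimal sketch of that pattern, assuming a HiveContext and hypothetical table names, key column, and filter predicates:

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// read each table as a DataFrame and filter first, so only ~500 GB per side reaches the shuffle
val left = sqlContext.table("table_a").filter("event_date >= '2016-01-01'")
  .rdd.map(row => (row.getString(0), row))   // key by the common join column
val right = sqlContext.table("table_b").filter("event_date >= '2016-01-01'")
  .rdd.map(row => (row.getString(0), row))

val joined = left.join(right)                // pair-RDD join on the common key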
Hi,
I have a 170 GB tab-delimited data set which I am converting into the
RDD[LabeledPoint] format. I am then taking a 60% sample of this data set to be
used for training a GBT model.
I got the Size exceeds Integer.MAX_VALUE error, which I fixed by repartitioning
the data set to 1000
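A rough sketch of that pipeline, assuming the label is the first tab-separated field and using a hypothetical input path:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

// parse the tab-delimited file into RDD[LabeledPoint]
val points = sc.textFile("/data/input.tsv")
  .map(_.split("\t"))
  .map(f => LabeledPoint(f(0).toDouble, Vectors.dense(f.tail.map(_.toDouble))))

// 60% training sample; repartitioning keeps each shuffle block under the 2 GB limit
val training = points.sample(withReplacement = false, fraction = 0.6, seed = 42L)
  .repartition(1000)

val model = GradientBoostedTrees.train(training, BoostingStrategy.defaultParams("Classification"))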
Hello,
A bit of a background.
I have a dataset with about 200 million records and around 10 columns. This
dataset is around 1.5 TB and is split into around 600 files.
When I read this dataset using sparkContext, it creates around 3000 partitions
by default if I do not specify the
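The default partition count typically comes from the input splits of the 600 files; a minPartitions hint at read time, or an explicit repartition afterwards, overrides it. A sketch with a hypothetical path and counts:

// ask for more partitions at read time ...
val raw = sc.textFile("/data/big_dataset/*", minPartitions = 6000)

// ... or reshuffle after the read
val repartitioned = sc.textFile("/data/big_dataset/*").repartition(6000)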
Hello Tilak,
1. I get a "Not found: type RDD" error. Can someone please tell me which jars
I need to add as external jars and what I should add under the import
statements so that this error will go away.
Do you not see any issues with the import statements?
Add the
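For reference, the RDD type lives in org.apache.spark.rdd and is provided by the spark-core jar, so the usual fix looks something like the following (the Spark version shown is only a placeholder):

// imports needed for RDD and the implicit pair-RDD functions
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// build.sbt, if using sbt rather than adding the jar manually:
// libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"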
Hi All,
I have an 8 million row, 500 column data set, which is derived by reading a text
file and doing filter and flatMap operations to weed out some anomalies.
Now I have a process which has to run through all 500 columns, do a couple of
map, reduce, and forEach operations on the data set, and return some
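A hedged sketch of that column-wise loop, with a hypothetical parse step and a per-column sum standing in for the real map/reduce work; caching the cleaned data set matters because it is traversed once per column:

import org.apache.spark.rdd.RDD

// the 8M x 500 data set after the filter/flatMap cleanup
val rows: RDD[Array[Double]] = sc.textFile("/data/wide.tsv")
  .map(_.split("\t"))
  .filter(_.length == 500)            // weed out malformed records
  .map(_.map(_.toDouble))
  .cache()                            // reused by every per-column pass

// one map + reduce per column (500 small jobs over the cached data)
val columnSums = (0 until 500).map(i => rows.map(_(i)).reduce(_ + _))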
Hi,
I have a 5 million record, 300 column data set.
I am running a Spark job in yarn-cluster mode with the following args:
--driver-memory 11G --executor-memory 11G --executor-cores 16 --num-executors
500
The Spark job replaces all categorical variables with integers.
I am getting the below
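One common way to do that replacement, sketched here with a hypothetical input path and column position, is to build a value-to-integer dictionary per categorical column and broadcast it:

import org.apache.spark.rdd.RDD

val rows: RDD[Array[String]] = sc.textFile("/data/input.tsv").map(_.split("\t"))

// index the distinct values of one categorical column (repeat per column)
val colIdx = 3
val dict = rows.map(_(colIdx)).distinct().zipWithIndex().collectAsMap()
val bcDict = sc.broadcast(dict)

// replace each category with its integer code
val encoded = rows.map(r => r.updated(colIdx, bcDict.value(r(colIdx)).toString))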
Hello,
I have been trying to implement a custom accumulator as below
import org.apache.spark._

class VectorNew1(val data: Array[Double]) {}

implicit object VectorAP extends AccumulatorParam[VectorNew1] {
  // a zero vector of the same length as the one being accumulated
  def zero(v: VectorNew1) = new VectorNew1(new Array[Double](v.data.length))
  // element-wise merge, as AccumulatorParam also requires addInPlace
  def addInPlace(v1: VectorNew1, v2: VectorNew1) =
    new VectorNew1(v1.data.zip(v2.data).map(t => t._1 + t._2))
}
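A usage sketch for the accumulator above, assuming a hypothetical vector length of 10:

val vecAcc = sc.accumulator(new VectorNew1(new Array[Double](10)))
sc.parallelize(1 to 100).foreach(n => vecAcc += new VectorNew1(Array.fill(10)(n.toDouble)))
println(vecAcc.value.data.mkString(","))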