In R, it's easy to split a data set into training, cross-validation, and test sets. Is there something like this in spark.ml? I am using Python for now.

My real problem is that I want to randomly select a relatively small data set to do some initial data exploration. It's not clear to me how, using Spark, I could create a random sample from a large data set. I would prefer to sample without replacement.
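
To make it concrete, this is roughly what I was hoping to write with the PySpark DataFrame API (the file path, fraction, and seed are just placeholders, and I have not verified this end to end):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exploration-sample").getOrCreate()

# Load the full data set (path is a placeholder)
df = spark.read.csv("hdfs:///data/big_dataset.csv", header=True, inferSchema=True)

# Pull roughly 1% of the rows, without replacement, for initial exploration
small = df.sample(withReplacement=False, fraction=0.01, seed=42)

# Small enough to pull back to the driver and poke at locally
small_pdf = small.toPandas()
```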

I have not tried SparkR yet. I assume I would not be able to use the caret package with Spark ML.

Kind regards

Andy

```{R}
library(caret)

# 70/30 train/test split, stratified on the outcome variable
inTrain <- createDataPartition(y = csv$classe, p = 0.7, list = FALSE)
trainSetDF <- csv[inTrain, ]
testSetDF  <- csv[-inTrain, ]
```
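
For comparison, this is roughly what I imagine the PySpark equivalent would look like, assuming randomSplit behaves the way I think it does (the path and split proportions are placeholders, and unlike createDataPartition it would not stratify on a label column):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("train-cv-test-split").getOrCreate()
df = spark.read.csv("hdfs:///data/big_dataset.csv", header=True, inferSchema=True)

# Rough analogue of the caret split above: 60/20/20 train / cross-validation / test
train_df, cv_df, test_df = df.randomSplit([0.6, 0.2, 0.2], seed=42)
```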


