I do not believe the order of points in a distributed RDD is
guaranteed in any way. For a simple test, you can always add a last
column that is an id (make it a Double and put it in the feature
vector). Printing the RDD back will not give you the points in file
order. If
you don't want to go that
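
A rough sketch of the id-column test suggested above, in plain Python
with no Spark involved. The helper name parse_with_id is made up for
illustration, and the shuffle only stands in for whatever order a
distributed collection happens to return; it is an assumption, not how
Spark actually orders anything.

```python
import random

def parse_with_id(lines):
    """Parse 'label,f1,f2,...' rows and append each row's file index as a last feature."""
    points = []
    for idx, line in enumerate(lines):
        values = [float(v) for v in line.strip().split(",")]
        label, features = values[0], values[1:]
        # Hypothetical id column: the original line number, stored as a float
        features.append(float(idx))
        points.append((label, features))
    return points

# The four sample rows from the question below
lines = [
    "1,5.1,3.5,1.4,0.2",
    "1,4.9,3,1.4,0.2",
    "1,4.7,3.2,1.3,0.2",
    "1,4.6,3.1,1.5,0.2",
]

points = parse_with_id(lines)

# Stand-in for an arbitrary return order (assumption, see note above)
shuffled = points[:]
random.Random(0).shuffle(shuffled)

# Sorting on the id feature recovers file order
restored = sorted(shuffled, key=lambda p: p[1][-1])
```

The id here is only a diagnostic for checking order; it would
presumably be dropped again before training on the real features.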
Hi,
I have a CSV data file, which I have organized in the following format to
be read as a LabeledPoint (following the example in
mllib/data/sample_tree_data.csv):
1,5.1,3.5,1.4,0.2
1,4.9,3,1.4,0.2
1,4.7,3.2,1.3,0.2
1,4.6,3.1,1.5,0.2
The first column is the binary label (1 or 0) and the remainin