Hi All,

I'm trying a simple K-Means example as per the website:

val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))

but I'm trying to write a Java-based validation method first, so that
lines with missing values are omitted or replaced with 0.

public RDD<Vector> prepareKMeans(JavaRDD<String> data) {
    JavaRDD<Vector> words = data.flatMap(new FlatMapFunction<String, Vector>() {
        public Iterable<Vector> call(String s) {
            String[] split = s.split(",");
            ArrayList<Vector> add = new ArrayList<Vector>();
            if (split.length != 2) {
                add.add(Vectors.dense(0, 0));
            } else {
                add.add(Vectors.dense(Double.parseDouble(split[0]),
                                      Double.parseDouble(split[1])));
            }

            return add;
        }
    });

    return words.rdd();
}

When I then call it from Scala:

val parsedData=dc.prepareKMeans(data);
val p=parsedData.collect();

I get:

Exception in thread "main" java.lang.ClassCastException:
[Ljava.lang.Object; cannot be cast to
[Lorg.apache.spark.mllib.linalg.Vector;

Why is the class tag Object rather than Vector?

1) How do I get this working correctly using the Java validation example
above? or
2) How can I modify
   val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
   so that a line is ignored when s.split yields fewer than 2 fields? or
3) Is there a better way to do input validation first?
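
To make question 3 more concrete, this is roughly the kind of validation I have in mind, written as a plain Java helper with no Spark dependency so it can be tested on its own. It is only a sketch: LineValidator and parseLine are hypothetical names I made up, not part of Spark or MLlib, and I am assuming that a line with the wrong number of fields or a non-numeric field should be dropped rather than replaced with zeros.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not a Spark or MLlib class): validates one CSV line.
final class LineValidator {

    // Returns a one-element list holding the parsed values, or an empty
    // list if the line is null, has the wrong number of fields, or contains
    // a field that is not a number. The list shape mirrors what a
    // flatMap-style function would return: zero or one result per input.
    static List<double[]> parseLine(String line, int expectedFields) {
        if (line == null) {
            return Collections.emptyList();
        }
        // A limit of -1 keeps trailing empty fields, so "1.0," has two fields.
        String[] parts = line.split(",", -1);
        if (parts.length != expectedFields) {
            return Collections.emptyList();
        }
        double[] values = new double[expectedFields];
        for (int i = 0; i < parts.length; i++) {
            try {
                values[i] = Double.parseDouble(parts[i].trim());
            } catch (NumberFormatException e) {
                // Missing or non-numeric field: skip the whole line.
                return Collections.emptyList();
            }
        }
        List<double[]> result = new ArrayList<double[]>();
        result.add(values);
        return result;
    }
}
```

The idea would be to call parseLine from inside the flatMap function and wrap each surviving double[] with Vectors.dense, so that invalid lines simply produce no vector. I have not tried this against the cluster yet.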

Using Spark and MLlib:
libraryDependencies += "org.apache.spark" % "spark-core_2.10" %  "1.2.0"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.2.0"

Many thanks in advance
Dev