How about:

data.map(s => s.split(","))
  .filter(_.length > 1)
  .map(good_entry => Vectors.dense(good_entry(0).toDouble, good_entry(1).toDouble))
(Full disclosure: I didn't actually run this.) But after the first map you
should have an RDD[Array[String]]; then you'd discard everything shorter
than 2 and convert the rest to dense vectors. In fact, if you're expecting
length exactly 2, you might want to filter on _.length == 2 instead... see
the sketch below.
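For example, something like this (again untested; it assumes data is an
RDD[String] and that rows with non-numeric fields should simply be dropped
rather than replaced with zeros):

import scala.util.Try
import org.apache.spark.mllib.linalg.Vectors

val parsedData = data
  .map(_.split(","))
  .filter(_.length == 2)  // keep only rows with exactly two fields
  .flatMap { fields =>
    // Try the parse; drop the row entirely if either field isn't a valid double
    Try(Vectors.dense(fields(0).toDouble, fields(1).toDouble)).toOption
  }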


On Thu, Jan 8, 2015 at 10:58 AM, Devl Devel <devl.developm...@gmail.com>
wrote:

> Hi All,
>
> I'm trying a simple K-Means example as per the website:
>
> val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
>
> but first I'm trying to write a Java-based validation method so that
> missing values are omitted or replaced with 0.
>
> public RDD<Vector> prepareKMeans(JavaRDD<String> data) {
>     JavaRDD<Vector> words = data.flatMap(new FlatMapFunction<String, Vector>() {
>         public Iterable<Vector> call(String s) {
>             String[] split = s.split(",");
>             ArrayList<Vector> add = new ArrayList<Vector>();
>             if (split.length != 2) {
>                 // malformed line: replace with a zero vector
>                 add.add(Vectors.dense(0, 0));
>             } else {
>                 add.add(Vectors.dense(Double.parseDouble(split[0]),
>                         Double.parseDouble(split[1])));
>             }
>             return add;
>         }
>     });
>
>     return words.rdd();
> }
>
> When I then call it from Scala:
>
> val parsedData = dc.prepareKMeans(data)
> val p = parsedData.collect()
>
> I get Exception in thread "main" java.lang.ClassCastException:
> [Ljava.lang.Object; cannot be cast to
> [Lorg.apache.spark.mllib.linalg.Vector;
>
> Why is the class tag Object rather than Vector?
>
> 1) How do I get this working correctly using the Java validation example
> above, or
> 2) How can I modify val parsedData = data.map(s =>
> Vectors.dense(s.split(',').map(_.toDouble))) so that the line is ignored
> when s.split returns fewer than 2 fields, or
> 3) Is there a better way to do input validation first?
>
> Using Spark and MLlib:
> libraryDependencies += "org.apache.spark" % "spark-core_2.10" %  "1.2.0"
> libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.2.0"
>
> Many thanks in advance
> Dev
>
