Thanks for the suggestion. Can anyone offer advice on the ClassCastException when going from Java to Scala? Why does calling JavaRDD.rdd() and then collect() result in this exception?
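For what it's worth, the JVM behavior behind the error can be reproduced without Spark. collect() builds its result array from the RDD's class tag, and the RDD obtained through JavaRDD.rdd() carries an erased (Object) class tag, so the array handed back is an Object[]. On the JVM an Object[] can never be cast to a more specific array type, even when every element has that type — a minimal plain-JDK sketch (class name is mine):

```java
import java.util.Arrays;

public class ArrayCastDemo {
    public static void main(String[] args) {
        // An Object[] that happens to hold only Strings still has
        // runtime component type Object, not String.
        Object[] boxed = new Object[] { "1.0", "2.0" };

        try {
            // Same failure mode as casting [Ljava.lang.Object; to [LVector;
            String[] bad = (String[]) boxed;
            System.out.println("cast succeeded: " + Arrays.toString(bad));
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as expected");
        }

        // Copying into an array with the correct component type does work.
        String[] good = Arrays.copyOf(boxed, boxed.length, String[].class);
        System.out.println("copied: " + Arrays.toString(good));
    }
}
```

So the exception is not about the Vector elements themselves; it is the runtime type of the backing array. Keeping the map in Scala (where the compiler supplies a ClassTag[Vector]) avoids the problem entirely.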
On Thu, Jan 8, 2015 at 4:13 PM, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
> How about
>
>     data.map(s => s.split(",")).filter(_.length > 1)
>         .map(good_entry => Vectors.dense(good_entry(0).toDouble, good_entry(1).toDouble))
>
> (full disclosure, I didn't actually run this). But after the first map you
> should have an RDD[Array[String]], then you'd discard everything shorter
> than 2, and convert the rest to dense vectors. In fact, if you're expecting
> length exactly 2, you might want to filter == 2...
>
> On Thu, Jan 8, 2015 at 10:58 AM, Devl Devel <devl.developm...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I'm trying a simple K-Means example as per the website:
>>
>>     val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
>>
>> but I'm trying to write a Java-based validation method first so that
>> missing values are omitted or replaced with 0.
>>
>>     public RDD<Vector> prepareKMeans(JavaRDD<String> data) {
>>         JavaRDD<Vector> words = data.flatMap(new FlatMapFunction<String, Vector>() {
>>             public Iterable<Vector> call(String s) {
>>                 String[] split = s.split(",");
>>                 ArrayList<Vector> add = new ArrayList<Vector>();
>>                 if (split.length != 2) {
>>                     add.add(Vectors.dense(0, 0));
>>                 } else {
>>                     add.add(Vectors.dense(Double.parseDouble(split[0]),
>>                             Double.parseDouble(split[1])));
>>                 }
>>                 return add;
>>             }
>>         });
>>         return words.rdd();
>>     }
>>
>> When I then call from Scala:
>>
>>     val parsedData = dc.prepareKMeans(data)
>>     val p = parsedData.collect()
>>
>> I get:
>>
>>     Exception in thread "main" java.lang.ClassCastException:
>>     [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
>>
>> Why is the class tag Object rather than Vector?
>>
>> 1) How do I get this working correctly using the Java validation example
>>    above, or
>> 2) How can I modify
>>        val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
>>    so that when s.split size < 2 I ignore the line?
>> or
>> 3) Is there a better way to do input validation first?
>>
>> Using Spark and MLlib:
>>
>>     libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.0"
>>     libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.2.0"
>>
>> Many thanks in advance
>> Dev
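On (2)/(3): the per-line validation is plain string handling and can be developed and unit-tested without Spark, then dropped into a flatMap (returning an empty or one-element collection) so malformed lines are skipped rather than padded. A sketch of that logic — the `parseLine` helper and class name are mine, and dropping bad lines (instead of emitting Vectors.dense(0, 0)) is a deliberate choice, since zero-padding sentinel points would distort the K-Means clusters:

```java
import java.util.Optional;

public class LineValidator {
    /** Parse a CSV line into a two-element double[]; empty if the line is malformed. */
    static Optional<double[]> parseLine(String line) {
        String[] parts = line.split(",");
        if (parts.length != 2) {
            return Optional.empty(); // wrong field count: skip the line
        }
        try {
            return Optional.of(new double[] {
                Double.parseDouble(parts[0]),
                Double.parseDouble(parts[1])
            });
        } catch (NumberFormatException e) {
            return Optional.empty(); // non-numeric field: skip the line too
        }
    }

    public static void main(String[] args) {
        System.out.println(parseLine("1.5,2.5").isPresent()); // well-formed line kept
        System.out.println(parseLine("1.5").isPresent());     // too few fields: dropped
        System.out.println(parseLine("a,b").isPresent());     // non-numeric: dropped
    }
}
```

Inside the Java flatMap you would return a singleton list when parseLine succeeds and an empty list otherwise; wrapping the two doubles in Vectors.dense at that point keeps the rest of the pipeline unchanged. The NumberFormatException branch also covers a case the length check alone misses.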