Hi Joseph,

Thanks for the suggestion; however, retag is a private method, and when I call it from Scala:
val retaggedInput = parsedData.retag(classOf[Vector])

I get: "Symbol retag is inaccessible from this place". However, I can do the retag from Java, and it then works in Scala:

return words.rdd().retag(Vector.class);

Dev

On Thu, Jan 8, 2015 at 9:35 PM, Joseph Bradley <jos...@databricks.com> wrote:

> I believe you're running into an erasure issue which we found in
> DecisionTree too. Check out:
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala#L134
>
> That retags RDDs which were created from Java to prevent the exception
> you're running into.
>
> Hope this helps!
> Joseph
>
> On Thu, Jan 8, 2015 at 12:48 PM, Devl Devel <devl.developm...@gmail.com>
> wrote:
>
>> Thanks for the suggestion. Can anyone offer any advice on the
>> ClassCastException when going from Java to Scala? Why does JavaRDD.rdd()
>> followed by a collect() result in this exception?
>>
>> On Thu, Jan 8, 2015 at 4:13 PM, Yana Kadiyska <yana.kadiy...@gmail.com>
>> wrote:
>>
>> > How about
>> >
>> > data.map(s => s.split(","))
>> >   .filter(_.length > 1)
>> >   .map(good_entry => Vectors.dense(good_entry(0).toDouble, good_entry(1).toDouble))
>> >
>> > (Full disclosure: I didn't actually run this.) But after the first map
>> > you should have an RDD[Array[String]]; then you'd discard everything
>> > shorter than 2 and convert the rest to dense vectors. In fact, if you're
>> > expecting length exactly 2, you might want to filter == 2.
>> >
>> > On Thu, Jan 8, 2015 at 10:58 AM, Devl Devel <devl.developm...@gmail.com>
>> > wrote:
>> >
>> >> Hi All,
>> >>
>> >> I'm trying a simple K-Means example as per the website:
>> >>
>> >> val parsedData = data.map(s =>
>> >>   Vectors.dense(s.split(',').map(_.toDouble)))
>> >>
>> >> but I'm trying to write a Java-based validation method first, so that
>> >> missing values are omitted or replaced with 0.
>> >>
>> >> public RDD<Vector> prepareKMeans(JavaRDD<String> data) {
>> >>     JavaRDD<Vector> words = data.flatMap(new FlatMapFunction<String, Vector>() {
>> >>         public Iterable<Vector> call(String s) {
>> >>             String[] split = s.split(",");
>> >>             ArrayList<Vector> add = new ArrayList<Vector>();
>> >>             if (split.length != 2) {
>> >>                 add.add(Vectors.dense(0, 0));
>> >>             } else {
>> >>                 add.add(Vectors.dense(Double.parseDouble(split[0]),
>> >>                         Double.parseDouble(split[1])));
>> >>             }
>> >>             return add;
>> >>         }
>> >>     });
>> >>
>> >>     return words.rdd();
>> >> }
>> >>
>> >> When I then call from Scala:
>> >>
>> >> val parsedData = dc.prepareKMeans(data);
>> >> val p = parsedData.collect();
>> >>
>> >> I get: Exception in thread "main" java.lang.ClassCastException:
>> >> [Ljava.lang.Object; cannot be cast to
>> >> [Lorg.apache.spark.mllib.linalg.Vector;
>> >>
>> >> Why is the class tag Object rather than Vector?
>> >>
>> >> 1) How do I get this working correctly using the Java validation
>> >> example above? Or
>> >> 2) How can I modify val parsedData = data.map(s =>
>> >> Vectors.dense(s.split(',').map(_.toDouble))) so that when the s.split
>> >> size is < 2 I ignore the line? Or
>> >> 3) Is there a better way to do input validation first?
>> >>
>> >> Using Spark and MLlib:
>> >> libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.0"
>> >> libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.2.0"
>> >>
>> >> Many thanks in advance,
>> >> Dev
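For question (2) in the thread, the filter-then-parse idea can be sketched in plain Java so it runs without a SparkContext. This is an assumption-laden stand-in, not the Spark API itself: the hypothetical class name KMeansInputValidation is mine, double[] stands in for org.apache.spark.mllib.linalg.Vector, and a List replaces the RDD; the same skip-malformed-lines logic would sit inside a filter/map over a JavaRDD<String>.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: validates CSV lines before K-Means parsing, dropping any
// line that does not split into exactly two fields (instead of padding
// with zeros as in the prepareKMeans method above).
public class KMeansInputValidation {
    static List<double[]> parse(List<String> lines) {
        List<double[]> rows = new ArrayList<double[]>();
        for (String line : lines) {
            String[] split = line.split(",");
            if (split.length != 2) {
                continue;  // ignore malformed lines rather than emitting (0, 0)
            }
            rows.add(new double[] {
                Double.parseDouble(split[0]),
                Double.parseDouble(split[1])
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        List<double[]> rows = parse(Arrays.asList("1.0,2.0", "badline", "3.0,4.0"));
        for (double[] r : rows) {
            System.out.println(r[0] + "," + r[1]);
        }
    }
}
```

Dropping bad lines (rather than substituting zeros) avoids injecting artificial points at the origin, which would otherwise pull a K-Means centroid toward (0, 0).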