(After asking around:) retag() is private[spark] in Scala, but Java ignores the private[X] modifier, making retag (unintentionally) public when called from Java.
Currently, your solution of retagging from Java is the best hack I can think of. It may take a bit of engineering to create a proper fix for the long term.

Joseph

On Fri, Jan 9, 2015 at 2:41 AM, Devl Devel <devl.developm...@gmail.com> wrote:

> Hi Joseph
>
> Thanks for the suggestion, however retag is a private method, and when I
> call in Scala:
>
>     val retaggedInput = parsedData.retag(classOf[Vector])
>
> I get:
>
>     Symbol retag is inaccessible from this place
>
> However, I can do this from Java, and it works in Scala:
>
>     return words.rdd().retag(Vector.class);
>
> Dev
>
> On Thu, Jan 8, 2015 at 9:35 PM, Joseph Bradley <jos...@databricks.com> wrote:
>
>> I believe you're running into an erasure issue which we found in
>> DecisionTree too. Check out:
>>
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala#L134
>>
>> That retags RDDs which were created from Java, to prevent the exception
>> you're running into.
>>
>> Hope this helps!
>> Joseph
>>
>> On Thu, Jan 8, 2015 at 12:48 PM, Devl Devel <devl.developm...@gmail.com> wrote:
>>
>>> Thanks for the suggestion. Can anyone offer any advice on the
>>> ClassCastException going from Java to Scala? Why do JavaRDD.rdd() and
>>> then a collect() result in this exception?
>>>
>>> On Thu, Jan 8, 2015 at 4:13 PM, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
>>>
>>>> How about
>>>>
>>>>     data.map(s => s.split(","))
>>>>         .filter(_.length > 1)
>>>>         .map(good_entry => Vectors.dense(good_entry(0).toDouble, good_entry(1).toDouble))
>>>>
>>>> (Full disclosure: I didn't actually run this.) But after the first map
>>>> you should have an RDD[Array[String]]; then you'd discard everything
>>>> shorter than 2 and convert the rest to dense vectors... In fact, if
>>>> you're expecting length exactly 2, you might want to filter == 2...
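The suggestion above, written out as compilable Scala (a sketch only, not run here; `data` is assumed to be the RDD[String] from the original question):

    import org.apache.spark.mllib.linalg.Vectors

    val parsedData = data
      .map(_.split(","))                // RDD[Array[String]]
      .filter(_.length > 1)             // drop lines with fewer than two fields
      .map(a => Vectors.dense(a(0).toDouble, a(1).toDouble))

Note the two fixes relative to the sketch above: Scala arrays are indexed with parentheses (a(0), not a[0]), and String.toDouble replaces Java's Double.parseDouble.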
>>>> On Thu, Jan 8, 2015 at 10:58 AM, Devl Devel <devl.developm...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm trying a simple K-Means example as per the website:
>>>>>
>>>>>     val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
>>>>>
>>>>> but I'm trying to write a Java-based validation method first, so that
>>>>> missing values are omitted or replaced with 0.
>>>>>
>>>>>     public RDD<Vector> prepareKMeans(JavaRDD<String> data) {
>>>>>         JavaRDD<Vector> words = data.flatMap(new FlatMapFunction<String, Vector>() {
>>>>>             public Iterable<Vector> call(String s) {
>>>>>                 String[] split = s.split(",");
>>>>>                 ArrayList<Vector> add = new ArrayList<Vector>();
>>>>>                 if (split.length != 2) {
>>>>>                     add.add(Vectors.dense(0, 0));
>>>>>                 } else {
>>>>>                     add.add(Vectors.dense(Double.parseDouble(split[0]),
>>>>>                                           Double.parseDouble(split[1])));
>>>>>                 }
>>>>>                 return add;
>>>>>             }
>>>>>         });
>>>>>
>>>>>         return words.rdd();
>>>>>     }
>>>>>
>>>>> When I then call from Scala:
>>>>>
>>>>>     val parsedData = dc.prepareKMeans(data);
>>>>>     val p = parsedData.collect();
>>>>>
>>>>> I get: Exception in thread "main" java.lang.ClassCastException:
>>>>> [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
>>>>>
>>>>> Why is the class tag Object rather than Vector?
>>>>>
>>>>> 1) How do I get this working correctly using the Java validation
>>>>>    example above? or
>>>>> 2) How can I modify val parsedData = data.map(s =>
>>>>>    Vectors.dense(s.split(',').map(_.toDouble))) so that when the
>>>>>    s.split size is < 2 I ignore the line? or
>>>>> 3) Is there a better way to do input validation first?
>>>>> Using Spark and MLlib:
>>>>>
>>>>>     libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.0"
>>>>>     libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.2.0"
>>>>>
>>>>> Many thanks in advance
>>>>> Dev
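Pulling the thread together: the ClassCastException arises only because the Java-built RDD carries a ClassTag of Object, so doing the validation entirely on the Scala side needs no retag() at all. A sketch (the helper name parse and the use of Try are illustrative, not from the thread; lines that are short or non-numeric are dropped):

    import scala.util.Try
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Some(vector) for a well-formed "x,y" line, None otherwise.
    def parse(line: String): Option[Vector] = {
      val fields = line.split(",")
      if (fields.length != 2) None
      else Try(Vectors.dense(fields(0).toDouble, fields(1).toDouble)).toOption
    }

    val parsedData = data.flatMap(parse(_).toList)

flatMap over the Option (via toList) keeps good vectors and silently discards bad lines; returning Some(Vectors.dense(0, 0)) instead of None would mimic the zero-filling behaviour of the Java version.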