(After asking around:) retag() is private[spark] in Scala, but Java ignores the private[X] modifier, making retag (unintentionally) public when called from Java.
Currently, your solution of retagging from Java is the best hack I can think of. It may take a bit of engineering to create a proper fix for the long term.

Joseph

On Fri, Jan 9, 2015 at 2:41 AM, Devl Devel <devl.developm...@gmail.com> wrote:

> Hi Joseph
>
> Thanks for the suggestion, however retag is a private method, and when I
> call in Scala:
>
>     val retaggedInput = parsedData.retag(classOf[Vector])
>
> I get:
>
>     Symbol retag is inaccessible from this place
>
> However, I can do this from Java, and it works in Scala:
>
>     return words.rdd().retag(Vector.class);
>
> Dev
>
> On Thu, Jan 8, 2015 at 9:35 PM, Joseph Bradley <jos...@databricks.com> wrote:
>
>> I believe you're running into an erasure issue which we found in
>> DecisionTree too. Check out:
>>
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala#L134
>>
>> That retags RDDs which were created from Java, to prevent the exception
>> you're running into.
>>
>> Hope this helps!
>> Joseph
>>
>> On Thu, Jan 8, 2015 at 12:48 PM, Devl Devel <devl.developm...@gmail.com> wrote:
>>
>>> Thanks for the suggestion. Can anyone offer any advice on the
>>> ClassCastException going from Java to Scala? Why do JavaRDD.rdd() and
>>> then a collect() result in this exception?
>>>
>>> On Thu, Jan 8, 2015 at 4:13 PM, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
>>>
>>>> How about
>>>>
>>>>     data.map(s => s.split(","))
>>>>         .filter(_.length > 1)
>>>>         .map(good_entry => Vectors.dense(good_entry(0).toDouble, good_entry(1).toDouble))
>>>>
>>>> (Full disclosure: I didn't actually run this.) But after the first map
>>>> you should have an RDD[Array[String]]; then you'd discard everything
>>>> shorter than 2 and convert the rest to dense vectors... In fact, if
>>>> you're expecting length exactly 2, you might want to filter == 2...
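The suggestion above, written out as compilable Scala (a sketch only, not run here; `data` is assumed to be the RDD[String] from the original question):

    import org.apache.spark.mllib.linalg.Vectors

    val parsedData = data
      .map(_.split(","))                // RDD[Array[String]]
      .filter(_.length > 1)             // drop lines with fewer than two fields
      .map(a => Vectors.dense(a(0).toDouble, a(1).toDouble))

Note the two fixes relative to the sketch above: Scala arrays are indexed with parentheses (a(0), not a[0]), and String.toDouble replaces Java's Double.parseDouble.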
>>>> On Thu, Jan 8, 2015 at 10:58 AM, Devl Devel <devl.developm...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm trying a simple K-Means example as per the website:
>>>>>
>>>>>     val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
>>>>>
>>>>> but I'm trying to write a Java-based validation method first, so that
>>>>> missing values are omitted or replaced with 0.
>>>>>
>>>>>     public RDD<Vector> prepareKMeans(JavaRDD<String> data) {
>>>>>         JavaRDD<Vector> words = data.flatMap(new FlatMapFunction<String, Vector>() {
>>>>>             public Iterable<Vector> call(String s) {
>>>>>                 String[] split = s.split(",");
>>>>>                 ArrayList<Vector> add = new ArrayList<Vector>();
>>>>>                 if (split.length != 2) {
>>>>>                     add.add(Vectors.dense(0, 0));
>>>>>                 } else {
>>>>>                     add.add(Vectors.dense(Double.parseDouble(split[0]),
>>>>>                                           Double.parseDouble(split[1])));
>>>>>                 }
>>>>>                 return add;
>>>>>             }
>>>>>         });
>>>>>
>>>>>         return words.rdd();
>>>>>     }
>>>>>
>>>>> When I then call from Scala:
>>>>>
>>>>>     val parsedData = dc.prepareKMeans(data);
>>>>>     val p = parsedData.collect();
>>>>>
>>>>> I get: Exception in thread "main" java.lang.ClassCastException:
>>>>> [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;
>>>>>
>>>>> Why is the class tag Object rather than Vector?
>>>>>
>>>>> 1) How do I get this working correctly using the Java validation
>>>>>    example above? or
>>>>> 2) How can I modify val parsedData = data.map(s =>
>>>>>    Vectors.dense(s.split(',').map(_.toDouble))) so that when the
>>>>>    s.split size is < 2 I ignore the line? or
>>>>> 3) Is there a better way to do input validation first?
>>>>> Using Spark and MLlib:
>>>>>
>>>>>     libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.0"
>>>>>     libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.2.0"
>>>>>
>>>>> Many thanks in advance
>>>>> Dev
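Pulling the thread together: the ClassCastException arises only because the Java-built RDD carries a ClassTag of Object, so doing the validation entirely on the Scala side needs no retag() at all. A sketch (the helper name parse and the use of Try are illustrative, not from the thread; lines that are short or non-numeric are dropped):

    import scala.util.Try
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Some(vector) for a well-formed "x,y" line, None otherwise.
    def parse(line: String): Option[Vector] = {
      val fields = line.split(",")
      if (fields.length != 2) None
      else Try(Vectors.dense(fields(0).toDouble, fields(1).toDouble)).toOption
    }

    val parsedData = data.flatMap(parse(_).toList)

flatMap over the Option (via toList) keeps good vectors and silently discards bad lines; returning Some(Vectors.dense(0, 0)) instead of None would mimic the zero-filling behaviour of the Java version.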