MatrixFactorizationModel serialization
I am trying to persist a MatrixFactorizationModel (from the Collaborative Filtering example) and use it in another script to evaluate/apply it. This is the exception I get when I try to use a deserialized model instance:

Exception in thread "main" java.lang.NullPointerException
    at org.apache.spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1.apply$mcVI$sp(CoGroupedRDD.scala:103)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:101)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.MappedValuesRDD.getPartitions(MappedValuesRDD.scala:26)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.FlatMappedValuesRDD.getPartitions(FlatMappedValuesRDD.scala:26)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
    at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
    at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
    at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
    at java.util.TimSort.sort(TimSort.java:189)
    at java.util.TimSort.sort(TimSort.java:173)
    at java.util.Arrays.sort(Arrays.java:659)
    at scala.collection.SeqLike$class.sorted(SeqLike.scala:615)
    at scala.collection.AbstractSeq.sorted(Seq.scala:40)
    at scala.collection.SeqLike$class.sortBy(SeqLike.scala:594)
    at scala.collection.AbstractSeq.sortBy(Seq.scala:40)
    at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
    at org.apache.spark.rdd.PairRDDFunctions.join(PairRDDFunctions.scala:536)
    at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:57)
    ...

Is this model serializable at all? I noticed it has two RDDs inside (user and product features).

Thanks,

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
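The NPE on a deserialized model is consistent with the model's RDD fields losing their runtime state: plain Java serialization restores the object graph, but an RDD's reference to its SparkContext is not serialized, so it comes back null and the first partition lookup dereferences it. A minimal pure-Java analogue of that failure mode (no Spark involved; all class names here are illustrative):

```java
import java.io.*;

// Stand-in for an object that, like an RDD, holds a reference to
// runtime state (the SparkContext) that is deliberately not serialized.
class Handle implements Serializable {
    // transient: skipped by serialization, comes back as null
    transient Runtime runtime = Runtime.getRuntime();

    int partitions() {
        // NPEs after deserialization, because runtime is null
        return runtime.availableProcessors();
    }
}

public class TransientDemo {
    static Handle roundTrip(Handle h) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(h);
        oos.flush();
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        return (Handle) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        Handle original = new Handle();
        original.partitions();              // works: runtime is set
        Handle copy = roundTrip(original);  // survives serialization...
        try {
            copy.partitions();              // ...but blows up here
        } catch (NullPointerException e) {
            System.out.println("NPE after deserialization, as in the stack trace");
        }
    }
}
```

A practical workaround on Spark 1.1 would therefore be to persist the model's userFeatures and productFeatures RDDs themselves (e.g. with saveAsObjectFile) and rebuild the model, or re-implement predict as a join over those features, against a live SparkContext in the second script, rather than Java-serializing the model object.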
MLlib - Naive Bayes Java example bug
Hi,

I noticed a bug in the sample Java code on the MLlib - Naive Bayes docs page: http://spark.apache.org/docs/1.1.0/mllib-naive-bayes.html

In the filter:

double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    @Override public Boolean call(Tuple2<Double, Double> pl) {
        return pl._1() == pl._2();
    }
}).count() / test.count();

it compares the Double objects by reference, whereas it should compare their values:

double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    @Override public Boolean call(Tuple2<Double, Double> pl) {
        return pl._1().doubleValue() == pl._2().doubleValue();
    }
}).count() / test.count();

The Java version's accuracy is always 0.0; the Scala code outputs the correct value, 1.0.

Thanks,
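The root cause is standard Java boxing: `==` on two Double objects compares references, and (unlike small Integer values) boxed doubles are never cached, so two numerically equal values are distinct objects. A quick stand-alone check, no Spark needed:

```java
public class BoxedDoubleDemo {
    public static void main(String[] args) {
        Double prediction = Double.valueOf(1.0);
        Double label = Double.valueOf(1.0);

        // Reference comparison: two distinct boxed objects, so this is false
        System.out.println(prediction == label);

        // Value comparisons: both true
        System.out.println(prediction.doubleValue() == label.doubleValue());
        System.out.println(prediction.equals(label));
    }
}
```

This is why the filter matches nothing and the reported accuracy is 0.0: every prediction/label pair fails the reference comparison even when the values agree.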
saveAsHadoopFile into avro format
What is the right way of saving a PairRDD in Avro output format? GraphArray extends SpecificRecord, etc. I have the following Java RDD:

JavaPairRDD<GraphArray, NullWritable> pairRDD = ...

and want to save it in Avro format:

org.apache.hadoop.mapred.JobConf jc = new org.apache.hadoop.mapred.JobConf();
org.apache.avro.mapred.AvroJob.setOutputSchema(jc, GraphArray.getClassSchema());
org.apache.avro.mapred.AvroOutputFormat.setOutputPath(jc, new Path(outURI));
pairRDD.saveAsHadoopDataset(jc);

The code above throws:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.NullWritable

I also tried wrapping the keys and values in the AvroKey and AvroValue classes, respectively. What am I doing wrong? Should I use a JavaRDD (list) instead and try a custom serializer?

Thanks,
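The exception itself is generic: Hadoop Writable types such as NullWritable implement Writable but not java.io.Serializable, so any code path that pushes one through Java serialization (for example, a Spark task closure that captures a Writable instance) fails exactly like this. A minimal demonstration of the exception class using a plain non-Serializable stand-in (Hadoop itself is not required; the class name is illustrative):

```java
import java.io.*;

// Plays the role of NullWritable: an ordinary class that does
// not implement java.io.Serializable.
class NotSerializableKey {
}

public class NotSerializableDemo {
    public static void main(String[] args) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
        try {
            out.writeObject(new NotSerializableKey());
        } catch (NotSerializableException e) {
            // Same exception class Spark reports when a task has to
            // Java-serialize a Writable
            System.out.println("java.io.NotSerializableException: " + e.getMessage());
        }
    }
}
```

So the usual directions to investigate (stated as suggestions, not a confirmed fix for this exact job) are: keep Writable instances out of anything Spark must serialize, let the Hadoop output format itself write the records (saveAsHadoopFile with the AvroOutputFormat class configured on the JobConf) so the Writables are created on the executors rather than shipped, or configure a serializer such as Kryo that can handle non-Serializable classes.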