MatrixFactorizationModel serialization

2014-11-07 Thread Dariusz Kobylarz
I am trying to persist a MatrixFactorizationModel (from the Collaborative Filtering
example) and use it in another script to evaluate/apply it.

This is the exception I get when I try to use a deserialized model instance:

Exception in thread "main" java.lang.NullPointerException
    at org.apache.spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1.apply$mcVI$sp(CoGroupedRDD.scala:103)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:101)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.MappedValuesRDD.getPartitions(MappedValuesRDD.scala:26)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.FlatMappedValuesRDD.getPartitions(FlatMappedValuesRDD.scala:26)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
    at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
    at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
    at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
    at java.util.TimSort.sort(TimSort.java:189)
    at java.util.TimSort.sort(TimSort.java:173)
    at java.util.Arrays.sort(Arrays.java:659)
    at scala.collection.SeqLike$class.sorted(SeqLike.scala:615)
    at scala.collection.AbstractSeq.sorted(Seq.scala:40)
    at scala.collection.SeqLike$class.sortBy(SeqLike.scala:594)
    at scala.collection.AbstractSeq.sortBy(Seq.scala:40)
    at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
    at org.apache.spark.rdd.PairRDDFunctions.join(PairRDDFunctions.scala:536)
    at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:57)
...

Is this model serializable at all? I noticed it has two RDDs inside
(user & product features).
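
My guess is that Java-serializing the model also serializes its two RDD handles,
which still point at the SparkContext of the training run; once that context is
gone, predict() fails as above. The workaround I am considering is to persist the
factor RDDs themselves and redo the prediction by hand (predict() is just the dot
product of the matching user and product factor vectors). An untested sketch
against the 1.1 Java API; the paths, variable names, and the loadFeatures helper
are mine:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// In the training script: save the factor matrices, not the model object.
// userFeatures()/productFeatures() are RDD[(Int, Array[Double])] on the
// Scala side; from Java they surface as Tuple2<Object, double[]>.
model.userFeatures().toJavaRDD().saveAsObjectFile("hdfs:///tmp/als/userFeatures");
model.productFeatures().toJavaRDD().saveAsObjectFile("hdfs:///tmp/als/productFeatures");

// In the scoring script: reload with the live JavaSparkContext and re-key.
static JavaPairRDD<Integer, double[]> loadFeatures(JavaSparkContext sc, String path) {
  JavaRDD<Tuple2<Object, double[]>> raw = sc.objectFile(path);
  return raw.mapToPair(new PairFunction<Tuple2<Object, double[]>, Integer, double[]>() {
    @Override public Tuple2<Integer, double[]> call(Tuple2<Object, double[]> t) {
      return new Tuple2<Integer, double[]>((Integer) t._1(), t._2());
    }
  });
}

JavaPairRDD<Integer, double[]> users = loadFeatures(sc, "hdfs:///tmp/als/userFeatures");
JavaPairRDD<Integer, double[]> products = loadFeatures(sc, "hdfs:///tmp/als/productFeatures");

// Score one (user, product) pair by hand.
int userId = 1, productId = 2;  // example ids
double[] uVec = users.lookup(userId).get(0);
double[] pVec = products.lookup(productId).get(0);
double prediction = 0.0;
for (int i = 0; i < uVec.length; i++) {
  prediction += uVec[i] * pVec[i];
}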


Thanks,






MLlib - Naive Bayes Java example bug

2014-11-03 Thread Dariusz Kobylarz

Hi,
I noticed a bug in the sample Java code on the MLlib - Naive Bayes docs page:
http://spark.apache.org/docs/1.1.0/mllib-naive-bayes.html

In the filter:

double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    @Override public Boolean call(Tuple2<Double, Double> pl) {
      return pl._1() == pl._2();
    }
  }).count() / test.count();

it compares the Double objects by reference, whereas it should compare their values:

double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    @Override public Boolean call(Tuple2<Double, Double> pl) {
      return pl._1().doubleValue() == pl._2().doubleValue();
    }
  }).count() / test.count();

With the reference comparison, the Java version's accuracy is always 0.0; the
Scala code outputs the correct value, 1.0.
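
For anyone hitting this: == on two Double objects compares references, and unlike
Integer, Double.valueOf() does not cache any boxed values, so the boxed prediction
and label are never the same object on a standard JVM. A standalone demonstration:

// Demonstrates the boxing pitfall; no Spark required.
public class BoxedDoubleEquality {
  public static void main(String[] args) {
    Double a = 1.0;  // autoboxed via Double.valueOf -> one object
    Double b = 1.0;  // a second, distinct object
    System.out.println(a == b);                              // false: reference comparison
    System.out.println(a.doubleValue() == b.doubleValue());  // true: value comparison
    System.out.println(a.equals(b));                         // true: equals() compares values too
  }
}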

Thanks,






saveAsHadoopFile into avro format

2014-09-08 Thread Dariusz Kobylarz
What is the right way to save a PairRDD in the Avro output format?
(GraphArray extends SpecificRecord, etc.)

I have the following Java RDD:
JavaPairRDD<GraphArray, NullWritable> pairRDD = ...
and want to save it in Avro format:

org.apache.hadoop.mapred.JobConf jc = new org.apache.hadoop.mapred.JobConf();
org.apache.avro.mapred.AvroJob.setOutputSchema(jc, GraphArray.getClassSchema());
org.apache.avro.mapred.AvroOutputFormat.setOutputPath(jc, new Path(outURI));
pairRDD.saveAsHadoopDataset(jc);

The code above throws:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.NullWritable


I also tried wrapping the keys and values in the AvroKey and AvroValue classes,
respectively.


What am I doing wrong? Should I use a plain JavaRDD instead and try a
custom serializer?
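
One thing I plan to try next, in case the problem is the NullWritable instances
being Java-serialized inside the task: wrap each record in an AvroKey, recreate
the NullWritable on the executors, and write through the new-API
AvroKeyOutputFormat. An untested sketch, assuming the hadoop2 flavour of
avro-mapred is on the classpath:

import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// Drop the driver-side NullWritables and rebuild the pairs on the executors,
// wrapping each GraphArray in an AvroKey as the new-API format expects.
JavaRDD<GraphArray> records = pairRDD.keys();
JavaPairRDD<AvroKey<GraphArray>, NullWritable> avroPairs = records.mapToPair(
    new PairFunction<GraphArray, AvroKey<GraphArray>, NullWritable>() {
      @Override
      public Tuple2<AvroKey<GraphArray>, NullWritable> call(GraphArray g) {
        return new Tuple2<AvroKey<GraphArray>, NullWritable>(
            new AvroKey<GraphArray>(g), NullWritable.get());
      }
    });

// Declare the writer schema on a new-API Job, then write the Avro files.
Job job = Job.getInstance();
AvroJob.setOutputKeySchema(job, GraphArray.getClassSchema());
avroPairs.saveAsNewAPIHadoopFile(outURI, AvroKey.class, NullWritable.class,
    AvroKeyOutputFormat.class, job.getConfiguration());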


Thanks,


