Re: RowSimilarity : NotSerializableException
I think you have to create a SparkDistributedContext, which has Mahout-specific Kryo serialization and adds the Mahout jars. If you let Mahout create the Spark context it's simpler:

    implicit val mc = mahoutSparkContext(masterUrl = "local", appName = "SparkExample")

As I recall the sc will then be implicit through a conversion in the Mahout Spark package, but if I'm wrong and you still get errors, create the sc with:

    val sc = sdc2sc(mc)

This is from memory; your debugger may provide better help. I suspect the error below comes from not having the Kryo serialization classes configured properly.

On May 7, 2016, at 3:05 AM, Rohit Jain wrote:

Hello everyone,

I want to run the Spark RowSimilarity recommender on data obtained from MongoDB. For this purpose I've written the code below, which takes input from Mongo and converts it to an RDD of objects. This is passed to IndexedDatasetSpark, which is then passed to SimilarityAnalysis.rowSimilarityIDS.

    import org.apache.hadoop.conf.Configuration
    import org.apache.mahout.math.cf.SimilarityAnalysis
    import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}
    import org.bson.BSONObject
    import com.mongodb.hadoop.MongoInputFormat

    object SparkExample extends App {
      val mongoConfig = new Configuration()
      mongoConfig.set("mongo.input.uri",
        "mongodb://my_mongo_ip:27017/db.collection")

      val sparkConf = new SparkConf()
      val sc = new SparkContext("local", "SparkExample", sparkConf)

      // Read the collection as an RDD of (id, BSON document) pairs.
      val documents: RDD[(Object, BSONObject)] = sc.newAPIHadoopRDD(
        mongoConfig,
        classOf[MongoInputFormat],
        classOf[Object],
        classOf[BSONObject]
      )

      // (product_id, space-separated attribute tokens)
      val new_doc: RDD[(String, String)] = documents.map(
        doc1 => (
          doc1._2.get("product_id").toString,
          doc1._2.get("product_attribute_value").toString
            .replace("[ \"", "").replace("\"]", "")
            .split("\" , \"")
            .map(value => value.toLowerCase.replace(" ", "-"))
            .mkString(" ")
        )
      )

      val myIDs = IndexedDatasetSpark(new_doc)(sc)

      SimilarityAnalysis.rowSimilarityIDS(myIDs)
        .dfsWrite("hdfs://myhadoop:9000/myfile", readWriteSchema)
    }

After running the code I am getting this error:

    java.io.NotSerializableException: org.apache.mahout.math.DenseVector
    Serialization stack:
      - object not serializable (class: org.apache.mahout.math.DenseVector, value: {3:1.0,8:1.0,10:1.0})
      - field (class: scala.Some, name: x, type: class java.lang.Object)
      - object (class scala.Some, Some({3:1.0,8:1.0,10:1.0}))
      at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
      at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
      at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:240)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

Please help me with this.

--
Thanks & Regards,

*Rohit Jain*
Web developer | Consultant
Mob +91 8097283931
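A minimal sketch of the setup described above, assuming the `mahoutSparkContext` helper and the `sdc2sc` conversion live in the `org.apache.mahout.sparkbindings` package object (as in the 0.11/0.12-era Spark bindings; check your version):

```scala
import org.apache.spark.SparkContext
import org.apache.mahout.sparkbindings._

// mahoutSparkContext builds the SparkContext itself, registering Mahout's
// Kryo serializers and adding the Mahout jars -- the pieces the
// JavaSerializer stack trace above says are missing.
implicit val mc = mahoutSparkContext(masterUrl = "local", appName = "SparkExample")

// If the implicit SparkDistributedContext -> SparkContext conversion is not
// picked up where a plain SparkContext is needed, unwrap explicitly:
val sc: SparkContext = sdc2sc(mc)
```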
Re: RowSimilarity : NotSerializableException
Yes, we did figure out this problem and realised that instead of SparkContext I have to use mahoutSparkContext.

On Sun, May 8, 2016 at 4:26 AM, Pat Ferrel wrote:
> I think you have to create a SparkDistributedContext, which has Mahout
> specific Kryo serialization and adds Mahout jars. If you let Mahout create
> the Spark context it's simpler.

--
Thanks & Regards,

*Rohit Jain*
Web developer | Consultant
Mob +91 8097283931
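Applied to the program quoted earlier in the thread, the fix amounts to dropping the hand-built SparkContext; a sketch, assuming the helpers in the `org.apache.mahout.sparkbindings` package object:

```scala
import org.apache.spark.SparkContext
import org.apache.mahout.sparkbindings._

// Before (plain JavaSerializer, no Mahout Kryo registrations):
//   val sparkConf = new SparkConf()
//   val sc = new SparkContext("local", "SparkExample", sparkConf)

// After: let Mahout create the context with its serializers configured.
implicit val mc = mahoutSparkContext(masterUrl = "local", appName = "SparkExample")
val sc: SparkContext = sdc2sc(mc)  // for APIs that want a plain SparkContext

// The rest of the pipeline (newAPIHadoopRDD, IndexedDatasetSpark,
// SimilarityAnalysis.rowSimilarityIDS) can then use sc / mc as before.
```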
Re: RowSimilarity : NotSerializableException
You can also wrap a Mahout context around an existing Spark session (aka context).

On Sat, May 7, 2016 at 9:41 PM, Rohit Jain wrote:
> Yes, we did figure out this problem and realised that instead of
> SparkContext I have to use mahoutSparkContext.
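A sketch of that wrapping approach, with the caveat that the existing context must already have Kryo configured on its SparkConf; the registrator class name below is an assumption based on the sparkbindings source, so check it against your Mahout version:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.mahout.sparkbindings._

// Wrapping does not retrofit serializer settings onto a live context, so
// set them on the SparkConf before the SparkContext is created.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
       "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
val sc = new SparkContext("local", "SparkExample", conf)

// Wrap the existing SparkContext for Mahout; sc2sdc is the implicit
// conversion in the sparkbindings package object, used explicitly here.
implicit val mc = sc2sdc(sc)
```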