Re: SequenceFileRDDFunctions cannot be used output of spark package
Hi Sonal,

There are no custom objects in saveRDD; it is of type RDD[(String, String)].

Thanks,
Pradeep
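For reference, a minimal sketch (illustrative names, untested) of the shape Pradeep describes: an RDD[(String, String)] holds only plain JVM Strings, which are serializable, and with the SparkContext._ implicits in scope such a pair RDD can even be written directly as a SequenceFile of Text keys and values.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    // A stand-in for saveRDD: plain String keys and values only.
    def buildSaveRDD(sc: SparkContext): org.apache.spark.rdd.RDD[(String, String)] =
      sc.parallelize(Seq(("id-1", "payload-1"), ("id-2", "payload-2")))

    // Strings convert to Text via the implicits above, so no custom
    // Writable wrappers are needed:
    //   buildSaveRDD(sc).saveAsSequenceFile("/tmp/save-rdd")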
Re: SequenceFileRDDFunctions cannot be used output of spark package
Hi Aureliano,

I followed this thread to create a custom saveAsObjectFile. The following is the code:

    new org.apache.spark.rdd.SequenceFileRDDFunctions[NullWritable, BytesWritable](
      saveRDD
        .mapPartitions(iter => iter.grouped(10).map(_.toArray))
        .map(x => (NullWritable.get(), new BytesWritable(serialize(x))))
    ).saveAsSequenceFile(objFiles)

But I get the following error when it is executed:

    org.apache.spark.SparkException: Job aborted: Task not serializable:
    java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:794)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:737)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:569)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Any idea about this error? Or is there anything wrong in that line of code?

Thanks,
Pradeep
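A common cause of this particular NotSerializableException is a closure that captures its enclosing instance: if serialize is a method on a class that also holds a JobConf (or other Hadoop state), the map closure drags the whole instance, JobConf included, into the task. A minimal sketch of that fix, untested and assuming Spark 0.9-era APIs, with a hypothetical SerializeHelper object standing in for the serialize() above so the closure captures nothing non-serializable:

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    import org.apache.hadoop.io.{BytesWritable, NullWritable}
    import org.apache.spark.rdd.{RDD, SequenceFileRDDFunctions}

    object SerializeHelper extends Serializable {
      // Plain Java serialization of a batch of records (hypothetical
      // helper, standing in for serialize() in the post above).
      def serialize[T](value: T): Array[Byte] = {
        val bytes = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(bytes)
        out.writeObject(value)
        out.close()
        bytes.toByteArray
      }
    }

    // saveRDD and objFiles are assumed to be as in the original post.
    def saveAsObjectFileBatched(saveRDD: RDD[(String, String)], objFiles: String): Unit = {
      val packed = saveRDD
        .mapPartitions(iter => iter.grouped(10).map(_.toArray))
        .map(x => (NullWritable.get(), new BytesWritable(SerializeHelper.serialize(x))))
      new SequenceFileRDDFunctions(packed).saveAsSequenceFile(objFiles)
    }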
Re: SequenceFileRDDFunctions cannot be used output of spark package
What does your saveRDD contain? If you are using custom objects, they should be serializable.

Best Regards,
Sonal
Nube Technologies
http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal

On Sat, Mar 29, 2014 at 12:02 AM, pradeeps8 <srinivasa.prad...@gmail.com> wrote:
> Hi Aureliano,
>
> I followed this thread to create a custom saveAsObjectFile. [...]
>
> Any idea about this error? Or is there anything wrong in that line of code?
>
> Thanks,
> Pradeep
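A quick sketch of the serializability point, with hypothetical record types: anything Spark ships between driver and executors (or writes through the default Java serializer) must be Serializable, otherwise the job aborts with a NotSerializableException like the one above.

    // Case classes extend Serializable automatically, so they are safe
    // to use inside RDDs:
    case class GoodRecord(key: String, value: String)

    // A plain class without Serializable blows up as soon as Spark
    // tries to serialize an instance of it:
    class BadRecord(val key: String, val value: String)

    // rdd.map(s => GoodRecord(s, s))      // fine
    // rdd.map(s => new BadRecord(s, s))   // NotSerializableException at runtime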
Re: SequenceFileRDDFunctions cannot be used output of spark package
Matei,

It turns out that saveAsObjectFile(), saveAsSequenceFile(), and saveAsHadoopFile() currently do not pick up the Hadoop settings, as Aureliano found out in this post:
http://apache-spark-user-list.1001560.n3.nabble.com/Turning-kryo-on-does-not-decrease-binary-output-tp212p249.html

Deenar
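For concreteness, a sketch of the global Hadoop compression settings being discussed, which per this thread the saveAs*File methods reportedly ignore (property names are the classic mapred.* keys of that era):

    import org.apache.spark.SparkContext

    def configureCompression(sc: SparkContext): Unit = {
      val conf = sc.hadoopConfiguration
      // Global output-compression settings that one would expect
      // saveAsSequenceFile()/saveAsHadoopFile() to honour:
      conf.set("mapred.output.compress", "true")
      conf.set("mapred.output.compression.type", "BLOCK")
      conf.set("mapred.output.compression.codec",
               "org.apache.hadoop.io.compress.DefaultCodec")
    }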
Re: SequenceFileRDDFunctions cannot be used output of spark package
I think you bumped the wrong thread. As I mentioned in the other thread: saveAsHadoopFile only applies compression when the codec is available, and it does not seem to respect the global Hadoop compression properties. I'm not sure whether this is a feature or a bug in Spark. If it is a feature, the docs should make it clear that the mapred.output.compression.* properties are read-only.

On Sat, Mar 22, 2014 at 12:20 AM, deenar.toraskar <deenar.toras...@db.com> wrote:
> Matei,
>
> It turns out that saveAsObjectFile(), saveAsSequenceFile(), and
> saveAsHadoopFile() currently do not pick up the Hadoop settings, as
> Aureliano found out in this post:
> http://apache-spark-user-list.1001560.n3.nabble.com/Turning-kryo-on-does-not-decrease-binary-output-tp212p249.html
>
> Deenar
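The workaround implied above is to pass the codec to the save call explicitly rather than relying on the global properties. A minimal sketch, untested and assuming the codec overload of saveAsHadoopFile available in Spark of that era; paths and data are illustrative:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.hadoop.mapred.SequenceFileOutputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    def saveCompressed(sc: SparkContext): Unit = {
      val pairs = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
        .map { case (k, v) => (new Text(k), new Text(v)) }
      // The explicit codec argument is what actually triggers compression:
      pairs.saveAsHadoopFile("/tmp/pairs-seq",
        classOf[Text], classOf[Text],
        classOf[SequenceFileOutputFormat[Text, Text]],
        classOf[GzipCodec])
    }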