Re: SPARK LIMITATION - more than one case class is not allowed !!
Tobias,

Understood, and thanks for the quick resolution of the problem.

Thanks,
~Rahul
Re: SPARK LIMITATION - more than one case class is not allowed !!
On Fri, Dec 5, 2014 at 7:12 AM, Tobias Pfeiffer t...@preferred.jp wrote:

> Rahul,
>
> On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote:
>> I have done so; that's why Spark is able to load the object file [e.g. person_obj] and has maintained its serialVersionUID [person_obj]. Next time, when I try to load another object file [e.g. office_obj], I think Spark is matching its serialVersionUID against the previous one [person_obj] and giving a mismatch error. In my first post, I gave statements which can be executed easily to replicate this issue.
>
> Can you post the Scala source for your case classes? I have tried the following in spark-shell:
>
>     case class Dog(name: String)
>     case class Cat(age: Int)
>     val dogs = sc.parallelize(Dog("foo") :: Dog("bar") :: Nil)
>     val cats = sc.parallelize(Cat(1) :: Cat(2) :: Nil)
>     dogs.saveAsObjectFile("test_dogs")
>     cats.saveAsObjectFile("test_cats")
>
> This gives two directories test_dogs/ and test_cats/. Then I restarted spark-shell and entered:
>
>     case class Dog(name: String)
>     case class Cat(age: Int)
>     val dogs = sc.objectFile("test_dogs")
>     val cats = sc.objectFile("test_cats")
>
> I don't get an exception, but:
>
>     dogs: org.apache.spark.rdd.RDD[Nothing] = FlatMappedRDD[1] at objectFile at <console>:12

You need to specify the type of the RDD; the compiler does not know what is in test_dogs:

    val dogs = sc.objectFile[Dog]("test_dogs")
    val cats = sc.objectFile[Cat]("test_cats")

It's an easy mistake to make... I wonder if an assertion could be implemented that makes sure the type parameter is present.
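To see the inference at work, here is a minimal plain-Scala sketch (no Spark needed; load is a hypothetical stand-in for objectFile): when a generic method is called without an explicit type argument and nothing in the value arguments constrains T, the compiler infers Nothing.

    // Hypothetical stand-in for sc.objectFile, to show the inference:
    def load[T](path: String): List[T] = List.empty[T]

    val xs = load("test_dogs")         // inferred as List[Nothing]
    val ys = load[String]("test_dogs") // List[String], as intended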
Re: SPARK LIMITATION - more than one case class is not allowed !!
> It's an easy mistake to make... I wonder if an assertion could be implemented that makes sure the type parameter is present.

We could use the NotNothing pattern (http://blog.evilmonkeylabs.com/2012/05/31/Forcing_Compiler_Nothing_checks/), but I wonder if it would just make the method signature very confusing for the average user...
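For reference, a minimal sketch of that pattern (adapted loosely from the linked post; typedObjectFile is a hypothetical wrapper, not part of Spark's API). Two equally specific implicits exist for Nothing, so implicit resolution becomes ambiguous exactly when T would be inferred as Nothing, and the call fails to compile:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    sealed trait NotNothing[A]
    object NotNothing {
      implicit def good[A]: NotNothing[A] = new NotNothing[A] {}
      // Two implicits for Nothing tie during resolution, so any call that
      // would infer A = Nothing fails with "ambiguous implicit values".
      implicit def ambig1: NotNothing[Nothing] = sys.error("unreachable")
      implicit def ambig2: NotNothing[Nothing] = sys.error("unreachable")
    }

    // Hypothetical wrapper that refuses an inferred Nothing:
    def typedObjectFile[T: ClassTag](sc: SparkContext, path: String)(
        implicit ev: NotNothing[T]): RDD[T] =
      sc.objectFile[T](path)

    // typedObjectFile(sc, "test_dogs")      // does not compile: T is Nothing
    // typedObjectFile[Dog](sc, "test_dogs") // compiles as intended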
Re: SPARK LIMITATION - more than one case class is not allowed !!
On Fri, Dec 5, 2014 at 12:53 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote:

> Is it a limitation that Spark does not support more than one case class at a time?

What do you mean? I do not have the slightest idea what you *could* possibly mean by "supporting a case class".

Tobias
Re: SPARK LIMITATION - more than one case class is not allowed !!
Hi Tobias,

Thanks for your response.

I have created object files [person_obj, office_obj] from csv files [person_csv, office_csv] using case classes [person, office] with the saveAsObjectFile API. Now I have restarted spark-shell and loaded the object files using the objectFile API.

*Once any one object class is loaded successfully, the rest of the object classes give a serialization error.*

So my understanding is that more than one case class is not allowed. I hope I have been able to clarify myself.

Regards,
Rahul
Re: SPARK LIMITATION - more than one case class is not allowed !!
Rahul,

On Fri, Dec 5, 2014 at 1:29 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote:

> I have created object files [person_obj, office_obj] from csv files [person_csv, office_csv] using case classes [person, office] with the saveAsObjectFile API. Now I have restarted spark-shell and loaded the object files using the objectFile API.
>
> *Once any one object class is loaded successfully, the rest of the object classes give a serialization error.*

I have not used saveAsObjectFile, but I think that if you define your case classes in the spark-shell, serialize the objects, and then restart the spark-shell, the *classes* (structure, names, etc.) will no longer be known to the JVM. So if you try to restore the *objects* from a file, the JVM may fail to restore them, because there is no class it could create objects of. Just a guess. Try to write a Scala program, compile it, and see if it still fails when executed.

Tobias
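P.S.: A minimal sketch of such a compiled test (names are made up for illustration; build with sbt and run via spark-submit):

    import org.apache.spark.{SparkConf, SparkContext}

    case class Person(id: Int, name: String)

    object ObjectFileTest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ObjectFileTest"))
        // Save, then reload; in a compiled program the class definition is
        // always on the classpath, unlike in a restarted REPL session.
        sc.parallelize(Seq(Person(1, "a"), Person(2, "b"))).saveAsObjectFile("person_obj")
        val restored = sc.objectFile[Person]("person_obj")
        println(restored.count())
        sc.stop()
      }
    }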
Re: SPARK LIMITATION - more than one case class is not allowed !!
Tobias,

Thanks for the quick reply. Definitely, after a restart the case classes need to be defined again. I have done so; that's why Spark is able to load the object file [e.g. person_obj] and has maintained its serialVersionUID [person_obj]. Next time, when I try to load another object file [e.g. office_obj], I think Spark is matching its serialVersionUID against the previous one [person_obj] and giving a mismatch error.

In my first post, I gave statements which can be executed easily to replicate this issue.

Thanks,
~Rahul
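P.S.: If a serialVersionUID mismatch is really the cause, I assume pinning the UID explicitly on the case classes would work around it, e.g.:

    // Scala's @SerialVersionUID keeps the UID stable across redefinitions:
    @SerialVersionUID(1L)
    case class person(id: Int, name: String, fathername: String, officeid: Int)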
Re: SPARK LIMITATION - more than one case class is not allowed !!
Rahul,

On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote:

> I have done so; that's why Spark is able to load the object file [e.g. person_obj] and has maintained its serialVersionUID [person_obj]. Next time, when I try to load another object file [e.g. office_obj], I think Spark is matching its serialVersionUID against the previous one [person_obj] and giving a mismatch error. In my first post, I gave statements which can be executed easily to replicate this issue.

Can you post the Scala source for your case classes? I have tried the following in spark-shell:

    case class Dog(name: String)
    case class Cat(age: Int)
    val dogs = sc.parallelize(Dog("foo") :: Dog("bar") :: Nil)
    val cats = sc.parallelize(Cat(1) :: Cat(2) :: Nil)
    dogs.saveAsObjectFile("test_dogs")
    cats.saveAsObjectFile("test_cats")

This gives two directories test_dogs/ and test_cats/. Then I restarted spark-shell and entered:

    case class Dog(name: String)
    case class Cat(age: Int)
    val dogs = sc.objectFile("test_dogs")
    val cats = sc.objectFile("test_cats")

I don't get an exception, but:

    dogs: org.apache.spark.rdd.RDD[Nothing] = FlatMappedRDD[1] at objectFile at <console>:12

Trying to access the elements of the RDD gave:

    scala> dogs.collect()
    14/12/05 15:08:58 INFO FileInputFormat: Total input paths to process : 8
    ...
    org.apache.spark.SparkDriverExecutionException: Execution error
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:980)
        ...
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    Caused by: java.lang.ArrayStoreException: [Ljava.lang.Object;
        at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:88)
        at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1129)
        ...
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:976)
        ... 10 more

So even in the simplest of cases, this doesn't work for me in the spark-shell, but with a different error. I guess we need to see more of your code to help.

Tobias
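P.S.: I assume the ArrayStoreException is a symptom of the RDD[Nothing] above: collect() allocates its result array from the RDD's ClassTag, and with T = Nothing the JVM array cannot store any actual element. A rough plain-Scala illustration (assuming the Scala 2.10-era runtime behavior this thread runs on):

    import scala.reflect.classTag

    // classTag[Nothing] allocates an array of scala.runtime.Nothing$ ...
    val arr = classTag[Nothing].newArray(1)
    // ... into which no real object can be stored:
    arr.asInstanceOf[Array[AnyRef]](0) = "element" // java.lang.ArrayStoreException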
Re: SPARK LIMITATION - more than one case class is not allowed !!
Tobias,

Find the csv and scala files attached; below are the steps:

1. Copy the csv files into the current directory.
2. Open spark-shell from this directory.
3. Run the one_scala file, which will create object files from the csv files in the current directory.
4. Restart spark-shell.
5. a. Run the two_scala file; while running, it gives an error during loading of office_obj.
   b. If we edit the two_scala file to the contents below, it instead gives an error during loading of person_obj:

    case class person(id: Int, name: String, fathername: String, officeid: Int)
    case class office(id: Int, name: String, landmark: String, areacode: String)
    sc.objectFile[office]("office_obj").count
    sc.objectFile[person]("person_obj").count

Regards,
Rahul

sample.gz http://apache-spark-user-list.1001560.n3.nabble.com/file/n20435/sample.gz
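P.S.: For reference, the unedited two_scala loads in the opposite order, i.e. (a reconstruction; the actual file is in the attached sample.gz):

    case class person(id: Int, name: String, fathername: String, officeid: Int)
    case class office(id: Int, name: String, landmark: String, areacode: String)
    sc.objectFile[person]("person_obj").count
    sc.objectFile[office]("office_obj").count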
Re: SPARK LIMITATION - more than one case class is not allowed !!
Rahul,

On Fri, Dec 5, 2014 at 3:51 PM, Rahul Bindlish rahul.bindl...@nectechnologies.in wrote:

> 1. Copy the csv files into the current directory.
> 2. Open spark-shell from this directory.
> 3. Run the one_scala file, which will create object files from the csv files in the current directory.
> 4. Restart spark-shell.
> 5. a. Run the two_scala file; while running, it gives an error during loading of office_obj.
>    b. If we edit the two_scala file to the contents below, it instead gives an error during loading of person_obj:
>
>     case class person(id: Int, name: String, fathername: String, officeid: Int)
>     case class office(id: Int, name: String, landmark: String, areacode: String)
>     sc.objectFile[office]("office_obj").count
>     sc.objectFile[person]("person_obj").count

One piece of good news: I can reproduce the error you see. Another piece of good news: I can tell you how to fix it. In your one_scala file, define all case classes *before* you use saveAsObjectFile() for the first time. With

    case class person(id: Int, name: String, fathername: String, officeid: Int)
    case class office(id: Int, name: String, landmark: String, areacode: String)
    val baseperson = sc.textFile("person_csv")
    baseperson.saveAsObjectFile("person_obj")
    val baseoffice = sc.textFile("office_csv")
    baseoffice.saveAsObjectFile("office_obj")

I can deserialize the obj files (in any order). The bad news is: I have no idea about the reason for this. I blame it on the REPL/shell and assume it would not happen for a compiled application.

Tobias
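P.S.: The snippet above saves the raw text lines; if one_scala actually parses the csv into the case classes first, it would look something like this (I am guessing at the csv layout; adjust the split/field indices to the real files):

    case class person(id: Int, name: String, fathername: String, officeid: Int)
    case class office(id: Int, name: String, landmark: String, areacode: String)

    // Parse each csv line into a case class instance, then serialize the RDD.
    sc.textFile("person_csv").map { line =>
      val f = line.split(",")
      person(f(0).toInt, f(1), f(2), f(3).toInt)
    }.saveAsObjectFile("person_obj")

    sc.textFile("office_csv").map { line =>
      val f = line.split(",")
      office(f(0).toInt, f(1), f(2), f(3))
    }.saveAsObjectFile("office_obj")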