Hi Anoop,

I don't see the exception you mentioned in the link. I can use spark-avro to
read the sample file users.avro in Spark successfully. Do you have the details
of the union issue?
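For reference, here is a minimal sketch of the kind of read I tried from the
PySpark shell (the spark-avro version and the path to users.avro below are
illustrative; adjust them to your install):

  # start the shell with the spark-avro package on the classpath, e.g.
  #   pyspark --packages com.databricks:spark-avro_2.10:2.0.1
  # (2.0.1 is an illustrative version for Spark 1.6.x)

  # load the Avro file into a DataFrame via the spark-avro data source
  df = sqlContext.read.format("com.databricks.spark.avro") \
      .load("examples/src/main/resources/users.avro")

  df.printSchema()  # inspect the schema spark-avro derived
  df.first()

  # convert to an RDD of Rows afterwards if you really need an RDD
  rdd = df.rdd
  print(rdd.first())

If your schema contains a union of multiple non-null complex types, that is
exactly where spark-avro is restrictive, so a concrete example of the schema
would help.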
On Sat, Feb 27, 2016 at 10:05 AM, Anoop Shiralige <anoop.shiral...@gmail.com> wrote:

> Hi Jeff,
>
> Thank you for looking into the post.
>
> I had explored the spark-avro option earlier. Since we have a union of
> multiple complex data types in our Avro schema, we couldn't use it.
> A couple of things I tried:
>
> -
> https://stackoverflow.com/questions/31261376/how-to-read-pyspark-avro-file-and-extract-the-values
>   : "Spark Exception : Unions may only consist of concrete type and null"
> - Use of DataFrame/Dataset: serialization problem.
>
> For now, I got it working by modifying AvroConversionUtils to address the
> union of multiple data types.
>
> Thanks,
> AnoopShiralige
>
>
> On Thu, Feb 25, 2016 at 7:25 AM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Avro Record is not supported by the pickler; you need to create a custom
>> pickler for it. But I don't think it is worth doing that. Actually, you
>> can use the spark-avro package to load Avro data and then convert it to
>> an RDD if necessary.
>>
>> https://github.com/databricks/spark-avro
>>
>>
>> On Thu, Feb 11, 2016 at 10:38 PM, Anoop Shiralige
>> <anoop.shiral...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am working with Spark 1.6.0, specifically the PySpark shell. I have a
>>> JavaRDD[org.apache.avro.GenericRecord] which I have converted to a
>>> Python RDD in the following way:
>>>
>>> javaRDD = sc._jvm.java.package.loadJson("path to data", sc._jsc)
>>> javaPython = sc._jvm.SerDe.javaToPython(javaRDD)
>>> from pyspark.rdd import RDD
>>> pythonRDD = RDD(javaPython, sc)
>>>
>>> pythonRDD.first()
>>>
>>> However, every time I call collect() or first() on pythonRDD, I get the
>>> following error:
>>>
>>> 16/02/11 06:19:19 ERROR python.PythonRunner: Python worker exited
>>> unexpectedly (crashed)
>>> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>>>   File "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
>>>     command = pickleSer._read_with_length(infile)
>>>   File "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py", line 156, in _read_with_length
>>>     length = read_int(stream)
>>>   File "/disk2/spark6/spark-1.6.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int
>>>     raise EOFError
>>> EOFError
>>>
>>>     at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>>>     at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
>>>     at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>>>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     at java.lang.Thread.run(Thread.java:744)
>>> Caused by: net.razorvine.pickle.PickleException: couldn't pickle object of type class org.apache.avro.generic.GenericData$Record
>>>     at net.razorvine.pickle.Pickler.save(Pickler.java:142)
>>>     at net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:493)
>>>     at net.razorvine.pickle.Pickler.dispatch(Pickler.java:205)
>>>     at net.razorvine.pickle.Pickler.save(Pickler.java:137)
>>>     at net.razorvine.pickle.Pickler.dump(Pickler.java:107)
>>>     at net.razorvine.pickle.Pickler.dumps(Pickler.java:92)
>>>     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:121)
>>>     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:110)
>>>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:110)
>>>     at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
>>>     at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
>>>     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
>>>     at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
>>>
>>> Thanks for your time,
>>> AnoopShiralige
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-couldn-t-pickle-object-of-type-class-T-tp26204.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang


--
Best Regards

Jeff Zhang