On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
> What version of Spark are you using?

Probably HDP's Spark 1.4.1; that's what the jars in my install say, and the
welcome message in the pyspark console agrees.

> Are there any other traces of exceptions anywhere?

No other exceptions that I can find. YARN apparently doesn't want to
aggregate Spark's logs.

> Are all your Spark nodes set up to point to the same phoenix-client-spark
> JAR?

Yes, as far as I can tell... though come to think of it, is that jar shipped
to the workers by the Spark runtime, or is it loaded locally on each host? I
only changed spark-defaults.conf on the client machine, the machine from
which I submitted the job.

Thanks for taking a look, Josh!
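For reference, the spark-defaults.conf entries in question amount to
something like the sketch below (the install path is hypothetical). One
relevant detail: extraClassPath entries are resolved as local filesystem
paths on each node rather than shipped with the job, so the JAR needs to be
present at the same path on the driver and on every worker.

    # Hypothetical path -- the JAR must exist locally on the driver and on
    # every worker, since extraClassPath entries are not shipped with the job.
    spark.driver.extraClassPath    /usr/hdp/current/phoenix-client/phoenix-4.6.0-client-spark.jar
    spark.executor.extraClassPath  /usr/hdp/current/phoenix-client/phoenix-4.6.0-client-spark.jar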
> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>
>> Hi guys,
>>
>> I'm doing my best to follow along with [0], but I'm hitting some
>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My Phoenix
>> build is much newer: basically the 4.6 branch plus PHOENIX-2503 and
>> PHOENIX-2568. I'm using pyspark for now.
>>
>> I've added phoenix-$VERSION-client-spark.jar to both
>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>> This appears to basically work: I see PhoenixInputFormat in the logs, and
>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>> around, this is commonly a classpath problem. Any ideas as to specifically
>> which jars are needed? Or better still, how can I debug this issue myself?
>> Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath gives me a
>> VerifyError about a netty method version mismatch. Indeed, I see two netty
>> versions in that lib directory...
>>
>> Thanks a lot,
>> -n
>>
>> [0]: http://phoenix.apache.org/phoenix_spark.html
>> [1]:
>>
>> java.lang.IllegalStateException: unread block data
>>   at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>   at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>   at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>   at java.lang.Thread.run(Thread.java:745)
>>
>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <jamestay...@apache.org> wrote:
>>
>>> Thanks for remembering about the docs, Josh.
>>>
>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>>
>>>> Just an update for anyone interested: PHOENIX-2503 was just committed
>>>> for 4.7.0, and the docs have been updated to include these samples for
>>>> PySpark users.
>>>>
>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>
>>>> Josh
>>>>
>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>>>
>>>>> Hey Nick,
>>>>>
>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>> resolved. With the Spark DataFrame support, all the necessary glue is
>>>>> there for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>> overriding the com.fasterxml.jackson JARs), you can do something like:
>>>>>
>>>>> df = sqlContext.read \
>>>>>     .format("org.apache.phoenix.spark") \
>>>>>     .option("table", "TABLE1") \
>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>     .load()
>>>>>
>>>>> And:
>>>>>
>>>>> df.write \
>>>>>     .format("org.apache.phoenix.spark") \
>>>>>     .mode("overwrite") \
>>>>>     .option("table", "TABLE1") \
>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>     .save()
>>>>>
>>>>> Yes, this should be added to the documentation. I hadn't actually
>>>>> tried this till just now. :)
>>>>>
>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>>>>>
>>>>>> Heya,
>>>>>>
>>>>>> Does anyone have experience using the phoenix-spark integration from
>>>>>> pyspark instead of Scala? Folks prefer Python around here...
>>>>>>
>>>>>> I did find this example [0] of using HBaseOutputFormat from pyspark;
>>>>>> I haven't tried extending it for Phoenix. Maybe someone with more
>>>>>> experience in pyspark knows better? It would be a great addition to our
>>>>>> documentation.
>>>>>>
>>>>>> Thanks,
>>>>>> Nick
>>>>>>
>>>>>> [0]: https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
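For context, the linked hbase_outputformat.py example [0] condenses to
roughly the sketch below. The host, table name, and row data are
placeholders, and the two converter classes ship with the Spark examples
JAR (not Spark core), so they have to be on the classpath:

    from pyspark import SparkContext

    sc = SparkContext(appName="HBaseOutputFormatSketch")

    host = "localhost"  # placeholder ZooKeeper quorum
    table = "test"      # placeholder table; assumes column family 'f1' exists

    # Hadoop job configuration routing the write through HBase's
    # TableOutputFormat.
    conf = {
        "hbase.zookeeper.quorum": host,
        "hbase.mapred.outputtable": table,
        "mapreduce.outputformat.class":
            "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class":
            "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class":
            "org.apache.hadoop.io.Writable",
    }

    # Converters from the Spark examples project translate Python values
    # into HBase's ImmutableBytesWritable and Put types.
    keyConv = ("org.apache.spark.examples.pythonconverters."
               "StringToImmutableBytesWritableConverter")
    valueConv = ("org.apache.spark.examples.pythonconverters."
                 "StringListToPutConverter")

    # Each record is (row key, [row key, column family, qualifier, value]).
    rows = [("row1", ["row1", "f1", "q1", "value1"])]
    sc.parallelize(rows).saveAsNewAPIHadoopDataset(
        conf=conf, keyConverter=keyConv, valueConverter=valueConv)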