On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
> What version of Spark are you using?

Probably HDP's Spark 1.4.1; that's what the jars in my install say, and the
welcome message in the pyspark console agrees.

> Are there any other traces of exceptions anywhere?

No other exceptions that I can find. YARN apparently doesn't want to
aggregate Spark's logs.

> Are all your Spark nodes set up to point to the same phoenix-client-spark
> JAR?

Yes, as far as I can tell... though come to think of it, is that jar shipped
to the workers by the Spark runtime, or is it loaded locally on each host? I
only changed spark-defaults.conf on the client machine, the machine from
which I submitted the job.

Thanks for taking a look, Josh!
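For reference, the spark-defaults.conf entries in question amount to
something like the sketch below (the install path is hypothetical). One
relevant detail: extraClassPath entries are resolved as local filesystem
paths on each node rather than shipped with the job, so the JAR needs to be
present at the same path on the driver and on every worker.

    # Hypothetical path -- the JAR must exist locally on the driver and on
    # every worker, since extraClassPath entries are not shipped with the job.
    spark.driver.extraClassPath    /usr/hdp/current/phoenix-client/phoenix-4.6.0-client-spark.jar
    spark.executor.extraClassPath  /usr/hdp/current/phoenix-client/phoenix-4.6.0-client-spark.jar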
> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>
>> Hi guys,
>>
>> I'm doing my best to follow along with [0], but I'm hitting some
>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My Phoenix
>> build is much newer: basically the 4.6 branch plus PHOENIX-2503 and
>> PHOENIX-2568. I'm using pyspark for now.
>>
>> I've added phoenix-$VERSION-client-spark.jar to both
>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>> This appears to basically work: I see PhoenixInputFormat in the logs, and
>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>> around, this is commonly a classpath problem. Any ideas as to specifically
>> which jars are needed? Or better still, how can I debug this issue myself?
>> Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath gives me a
>> VerifyError about a netty method version mismatch. Indeed, I see two netty
>> versions in that lib directory...
>>
>> Thanks a lot,
>> -n
>>
>> [0]: http://phoenix.apache.org/phoenix_spark.html
>> [1]:
>>
>> java.lang.IllegalStateException: unread block data
>>   at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>   at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>   at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>   at java.lang.Thread.run(Thread.java:745)
>>
>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <jamestay...@apache.org> wrote:
>>
>>> Thanks for remembering about the docs, Josh.
>>>
>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>>
>>>> Just an update for anyone interested: PHOENIX-2503 was just committed
>>>> for 4.7.0, and the docs have been updated to include these samples for
>>>> PySpark users.
>>>>
>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>
>>>> Josh
>>>>
>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>>>>
>>>>> Hey Nick,
>>>>>
>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>> resolved. With the Spark DataFrame support, all the necessary glue is
>>>>> there for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>> overriding the com.fasterxml.jackson JARs), you can do something like:
>>>>>
>>>>> df = sqlContext.read \
>>>>>     .format("org.apache.phoenix.spark") \
>>>>>     .option("table", "TABLE1") \
>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>     .load()
>>>>>
>>>>> And:
>>>>>
>>>>> df.write \
>>>>>     .format("org.apache.phoenix.spark") \
>>>>>     .mode("overwrite") \
>>>>>     .option("table", "TABLE1") \
>>>>>     .option("zkUrl", "localhost:63512") \
>>>>>     .save()
>>>>>
>>>>> Yes, this should be added to the documentation. I hadn't actually
>>>>> tried this till just now. :)
>>>>>
>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>>>>>
>>>>>> Heya,
>>>>>>
>>>>>> Does anyone have experience using the phoenix-spark integration from
>>>>>> pyspark instead of Scala? Folks prefer Python around here...
>>>>>>
>>>>>> I did find this example [0] of using HBaseOutputFormat from pyspark;
>>>>>> I haven't tried extending it for Phoenix. Maybe someone with more
>>>>>> experience in pyspark knows better? It would be a great addition to our
>>>>>> documentation.
>>>>>>
>>>>>> Thanks,
>>>>>> Nick
>>>>>>
>>>>>> [0]: https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
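For context, the linked hbase_outputformat.py example [0] condenses to
roughly the sketch below. The host, table name, and row data are
placeholders, and the two converter classes ship with the Spark examples
JAR (not Spark core), so they have to be on the classpath:

    from pyspark import SparkContext

    sc = SparkContext(appName="HBaseOutputFormatSketch")

    host = "localhost"  # placeholder ZooKeeper quorum
    table = "test"      # placeholder table; assumes column family 'f1' exists

    # Hadoop job configuration routing the write through HBase's
    # TableOutputFormat.
    conf = {
        "hbase.zookeeper.quorum": host,
        "hbase.mapred.outputtable": table,
        "mapreduce.outputformat.class":
            "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class":
            "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class":
            "org.apache.hadoop.io.Writable",
    }

    # Converters from the Spark examples project translate Python values
    # into HBase's ImmutableBytesWritable and Put types.
    keyConv = ("org.apache.spark.examples.pythonconverters."
               "StringToImmutableBytesWritableConverter")
    valueConv = ("org.apache.spark.examples.pythonconverters."
                 "StringListToPutConverter")

    # Each record is (row key, [row key, column family, qualifier, value]).
    rows = [("row1", ["row1", "f1", "q1", "value1"])]
    sc.parallelize(rows).saveAsNewAPIHadoopDataset(
        conf=conf, keyConverter=keyConv, valueConverter=valueConv)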