Sadly, the phoenix-client-spark JAR needs to be installed onto each Spark
worker (for now). The executor config tells each Spark worker to look for
that file and add it to its classpath, so once you have it installed you'll
probably need to restart all the Spark workers.

I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
/usr/hdp/current/phoenix-client/, but any location that every worker can
consistently see is fine.
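
For reference, the spark-defaults.conf entries on each node end up looking
roughly like the sketch below; the path is just wherever you've dropped the
JAR, and $VERSION is your Phoenix version:

  spark.executor.extraClassPath /usr/hdp/current/phoenix-client/phoenix-$VERSION-client-spark.jar
  spark.driver.extraClassPath   /usr/hdp/current/phoenix-client/phoenix-$VERSION-client-spark.jar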

One day we'll be able to have Spark ship the JAR around and use it without
this classpath nonsense, but we need to do some extra work on the Phoenix
side to make sure that Phoenix's calls to DriverManager actually go through
Spark's weird wrapper version of it.
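
(For what it's worth, vanilla Spark can already ship a JAR out to the
executors via spark-submit's --jars flag, roughly as sketched below; it's
the DriverManager issue above that keeps this from working for the Phoenix
client JAR today, so treat it as where this is headed rather than a working
recipe. The script name is just a placeholder.)

  spark-submit --jars /path/to/phoenix-$VERSION-client-spark.jar your_pyspark_job.py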

On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <ndimi...@apache.org> wrote:

> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
>
>> What version of Spark are you using?
>>
>
> Probably HDP's Spark 1.4.1; that's what the jars in my install say, and
> the welcome message in the pyspark console agrees.
>
> Are there any other traces of exceptions anywhere?
>>
>
> No other exceptions that I can find. YARN apparently doesn't want to
> aggregate spark's logs.
>
>
>> Are all your Spark nodes set up to point to the same phoenix-client-spark
>> JAR?
>>
>
> Yes, as far as I can tell... though come to think of it, is that jar
> shipped by the spark runtime to workers, or is it loaded locally on each
> host? I only changed spark-defaults.conf on the client machine, the machine
> from which I submitted the job.
>
> Thanks for taking a look Josh!
>
> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <ndimi...@apache.org> wrote:
>>
>>> Hi guys,
>>>
>>> I'm doing my best to follow along with [0], but I'm hitting some
>>> stumbling blocks. I'm running with HDP 2.3 for HBase and Spark. My phoenix
>>> build is much newer, basically 4.6-branch + PHOENIX-2503, PHOENIX-2568. I'm
>>> using pyspark for now.
>>>
>>> I've added phoenix-$VERSION-client-spark.jar to both
>>> spark.executor.extraClassPath and spark.driver.extraClassPath. This allows
>>> me to use sqlContext.read to define a DataFrame against a Phoenix table.
>>> This appears to basically work, as I see PhoenixInputFormat in the logs and
>>> df.printSchema() shows me what I expect. However, when I try df.take(5), I
>>> get "IllegalStateException: unread block data" [1] from the workers. Poking
>>> around, this is commonly a problem with classpath. Any ideas as to
>>> specifically which jars are needed? Or better still, how to debug this
>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to the classpath
>>> gives me a VerifyError about netty method version mismatch. Indeed I see
>>> two netty versions in that lib directory...
>>>
>>> Thanks a lot,
>>> -n
>>>
>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>> [1]:
>>>
>>> java.lang.IllegalStateException: unread block data
>>>   at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>   at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>   at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>   at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <jamestay...@apache.org>
>>> wrote:
>>>
>>>> Thanks for remembering about the docs, Josh.
>>>>
>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <jmaho...@gmail.com>
>>>> wrote:
>>>>
>>>>> Just an update for anyone interested, PHOENIX-2503 was just committed
>>>>> for 4.7.0 and the docs have been updated to include these samples for
>>>>> PySpark users.
>>>>>
>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>
>>>>> Josh
>>>>>
>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <jmaho...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey Nick,
>>>>>>
>>>>>> I think this used to work, and will again once PHOENIX-2503 gets
>>>>>> resolved. With the Spark DataFrame support, all the necessary glue is there
>>>>>> for Phoenix and pyspark to play nice. With that client JAR (or by
>>>>>> overriding the com.fasterxml.jackson JARS), you can do something like:
>>>>>>
>>>>>> df = sqlContext.read \
>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>   .option("table", "TABLE1") \
>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>   .load()
>>>>>>
>>>>>> And
>>>>>>
>>>>>> df.write \
>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>   .mode("overwrite") \
>>>>>>   .option("table", "TABLE1") \
>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>   .save()
>>>>>>
>>>>>>
>>>>>> Yes, this should be added to the documentation. I hadn't actually
>>>>>> tried this till just now. :)
>>>>>>
>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <ndimi...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Heya,
>>>>>>>
>>>>>>> Has anyone any experience using phoenix-spark integration from
>>>>>>> pyspark instead of scala? Folks prefer python around here...
>>>>>>>
>>>>>>> I did find this example [0] of using HBaseOutputFormat from pyspark, but
>>>>>>> haven't tried extending it for Phoenix. Maybe someone with more experience
>>>>>>> in pyspark knows better? Would be a great addition to our documentation.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nick
>>>>>>>
>>>>>>> [0]:
>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
