Hey Nick,

Hopefully PHOENIX-2599 works out for you, but it's entirely possible that some
other issues will crop up. My usage patterns are probably a bit unique, in that
I generally do a few JDBC UPSERT SELECT aggregations from my "hot" tables to
intermediate ones before Spark ever gets involved. That likely hides an entire
class of bugs, like the kind PHOENIX-2599 fixes.
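
By that I mean something roughly like the following, with made-up table and
column names, run over JDBC before any Spark job touches those tables:

UPSERT INTO DAILY_ROLLUP (EVENT_DAY, METRIC, TOTAL)
SELECT TRUNC(EVENT_TIME, 'DAY'), METRIC, SUM(VAL)
FROM RAW_EVENTS
GROUP BY TRUNC(EVENT_TIME, 'DAY'), METRIC;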

However, as you're likely aware, the Spark integration (like the Pig one) is a
pretty thin wrapper over the regular MR integration, so hopefully there's not
too much unexplored territory there in terms of bulk loading and saving.

Good luck!

Josh

On Sat, Jan 23, 2016 at 9:12 PM, Nick Dimiduk <ndimi...@apache.org> wrote:

> That looks about right. I was unaware of this patch; thanks!
>
> On Fri, Jan 22, 2016 at 5:10 PM, James Taylor <jamestay...@apache.org>
> wrote:
>
>> FYI, Nick - do you know about Josh's fix for PHOENIX-2599? Does that help
>> here?
>>
>> On Fri, Jan 22, 2016 at 4:32 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>
>>> On Thu, Jan 21, 2016 at 7:36 AM, Josh Mahonin <jmaho...@gmail.com>
>>> wrote:
>>>
>>>> Amazing, good work.
>>>>
>>>
>>> All I did was consume your code and configure my cluster. Thanks though
>>> :)
>>>
>>>> FWIW, I've got a support case in with Hortonworks to get the
>>>> phoenix-spark integration working out of the box. Assuming it gets
>>>> resolved, that'll hopefully help keep these classpath-hell issues to a
>>>> minimum going forward.
>>>>
>>>
>>> That's fine for HWX customers, but it doesn't solve the general problem
>>> community-side. Unless, of course, we assume the only Phoenix users are on
>>> HDP.
>>>
>>>> Interesting point re: PHOENIX-2535. Spark does offer builds for specific
>>>> Hadoop versions, and also no Hadoop at all (with the assumption you'll
>>>> provide the necessary JARs). Phoenix is pretty tightly coupled with its own
>>>> HBase (and by extension, Hadoop) versions though... do you think it would
>>>> be possible to work around (2) if you locally added the HDP Maven repo and
>>>> adjusted versions accordingly? I've had some success with that in other
>>>> projects, though as I recall when I tried it with Phoenix I ran into a snag
>>>> trying to resolve some private transitive dependency of Hadoop.
>>>>
>>>
>>> I could specify the Hadoop version and build Phoenix locally to sidestep the
>>> issue. That would work for me because I happen to be packaging my own Phoenix
>>> this week, but it doesn't help for the Apache releases. Phoenix _is_ tightly
>>> coupled to HBase versions, but I don't think HBase versions are that tightly
>>> coupled to Hadoop versions. We build HBase 1.x releases against a long list
>>> of Hadoop releases as part of our usual build, and I think what HBase
>>> consumes of the Hadoop APIs is pretty limited.
>>>
>>>> Now that you have Spark working, you can start hitting real bugs! If you
>>>> haven't backported the full patchset, you might want to take a look at the
>>>> phoenix-spark history [1], there's been a lot of churn there, especially
>>>> with regards to the DataFrame API.
>>>>
>>>
>>> Yeah, speaking of which... :)
>>>
>>> I find this integration is basically unusable when the underlying HBase
>>> table partitions are in flux; i.e., when you have data being loaded
>>> concurrently with queries. Spark RDDs assume stable partitions; multiple
>>> queries against a single DataFrame, or even a single query/DataFrame with
>>> lots of underlying splits and not enough workers, will inevitably fail
>>> with a ConcurrentModificationException like [0].
>>>
>>> I think in the HBase/MR integration, we define split points for the job as
>>> region boundaries initially, but then we're just using the regular client
>>> API, so repartitions are handled transparently by the client layer. I need
>>> to dig into this Phoenix/Spark stuff to see if we can do anything similar
>>> here.
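>>>
>>> One mitigation I've been toying with (totally unverified) is to materialize
>>> the Phoenix scan once and reuse the cached partitions, so later actions don't
>>> re-plan the query against a shifting set of regions. A rough pyspark sketch,
>>> with made-up table and quorum names:
>>>
>>> # read the Phoenix table once and pin the result in executor memory
>>> df = sqlContext.read \
>>>     .format("org.apache.phoenix.spark") \
>>>     .option("table", "EVENTS") \
>>>     .option("zkUrl", "zk1:2181") \
>>>     .load() \
>>>     .persist()
>>> df.count()  # force the one-time scan; subsequent queries hit the cache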
>>>
>>> Thanks again Josh,
>>> -n
>>>
>>> [0]:
>>>
>>> java.lang.RuntimeException: java.util.ConcurrentModificationException
>>> at
>>> org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(PhoenixInputFormat.java:125)
>>> at
>>> org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(PhoenixInputFormat.java:69)
>>> at
>>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:131)
>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>>> at org.apache.phoenix.spark.PhoenixRDD.compute(PhoenixRDD.scala:57)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.util.ConcurrentModificationException
>>> at java.util.Hashtable$Enumerator.next(Hashtable.java:1367)
>>> at org.apache.hadoop.conf.Configuration.iterator(Configuration.java:2154)
>>> at
>>> org.apache.phoenix.util.PropertiesUtil.extractProperties(PropertiesUtil.java:52)
>>> at
>>> org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:60)
>>> at
>>> org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:45)
>>> at
>>> org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:278)
>>> at
>>> org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectStatement(PhoenixConfigurationUtil.java:306)
>>> at
>>> org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(PhoenixInputFormat.java:113)
>>> ... 32 more
>>>
>>>> On Thu, Jan 21, 2016 at 12:41 AM, Nick Dimiduk <ndimi...@apache.org>
>>>> wrote:
>>>>
>>>>> I finally got to the bottom of things. There were two issues at play
>>>>> in my particular environment.
>>>>>
>>>>> 1. An Ambari bug [0] meant my spark-defaults.conf file was garbage. I
>>>>> hardly thought of it when I hit the issue with MR job submission; its
>>>>> impact on Spark was much more subtle.
>>>>>
>>>>> 2. YARN client version mismatch (Phoenix is compiled against Apache 2.5.1
>>>>> while my cluster is running HDP's 2.7.1 build), per my earlier email. Once
>>>>> I'd worked around (1), I was able to work around (2) by setting
>>>>> "/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar"
>>>>> for both spark.driver.extraClassPath and spark.executor.extraClassPath.
>>>>> Per the linked thread above, I believe this places the correct YARN client
>>>>> version first in the classpath.
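>>>>>
>>>>> For the archives, the resulting spark-defaults.conf entries look roughly
>>>>> like this (the paths are specific to my HDP install):
>>>>>
>>>>> spark.driver.extraClassPath /usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>> spark.executor.extraClassPath /usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar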
>>>>>
>>>>> With the above in place, I'm able to submit work against the Phoenix
>>>>> tables to the YARN cluster. Success!
>>>>>
>>>>> Ironically enough, I would not have been able to work around (2) if we
>>>>> had PHOENIX-2535 in place. Food for thought in tackling that issue. It may
>>>>> be worthwhile to ship an uberclient jar that is entirely free of Hadoop
>>>>> (or HBase) classes. I believe Spark does this for Hadoop with its builds
>>>>> as well.
>>>>>
>>>>> Thanks again for your help here Josh! I really appreciate it.
>>>>>
>>>>> [0]: https://issues.apache.org/jira/browse/AMBARI-14751
>>>>>
>>>>> On Wed, Jan 20, 2016 at 2:23 PM, Nick Dimiduk <ndimi...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Well, I spoke too soon. It's working, but in local mode only. When I
>>>>>> invoke `pyspark --master yarn` (or yarn-client), the submitted application
>>>>>> goes from ACCEPTED to FAILED, with a NumberFormatException [0] in my
>>>>>> container log. Now that Phoenix is on my classpath, I'm suspicious that
>>>>>> the versions of the YARN client libraries are incompatible. I found an old
>>>>>> thread [1] with the same stack trace I'm seeing and a similar conclusion.
>>>>>> I tried setting spark.driver.extraClassPath and
>>>>>> spark.executor.extraClassPath to
>>>>>> /usr/hdp/current/hadoop-yarn-client:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>>> but that appears to have no impact.
>>>>>>
>>>>>> [0]:
>>>>>> 16/01/20 22:03:45 ERROR yarn.ApplicationMaster: Uncaught exception:
>>>>>> java.lang.IllegalArgumentException: Invalid ContainerId:
>>>>>> container_e07_1452901320122_0042_01_000001
>>>>>> at
>>>>>> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:182)
>>>>>> at
>>>>>> org.apache.spark.deploy.yarn.YarnRMClient.getAttemptId(YarnRMClient.scala:93)
>>>>>> at
>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:85)
>>>>>> at
>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:574)
>>>>>> at
>>>>>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>>>>>> at
>>>>>> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>> at
>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>>>>>> at
>>>>>> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
>>>>>> at
>>>>>> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:572)
>>>>>> at
>>>>>> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:599)
>>>>>> at
>>>>>> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
>>>>>> Caused by: java.lang.NumberFormatException: For input string: "e07"
>>>>>> at
>>>>>> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>>>>>> at java.lang.Long.parseLong(Long.java:589)
>>>>>> at java.lang.Long.parseLong(Long.java:631)
>>>>>> at
>>>>>> org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
>>>>>> at
>>>>>> org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
>>>>>> ... 12 more
>>>>>>
>>>>>> [1]:
>>>>>> http://mail-archives.us.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAAqMD1jSEvfyw9oUBymhZukm7f+WTDVZ8E6Zp3L4a9OBJ-hz=a...@mail.gmail.com%3E
>>>>>>
>>>>>> On Wed, Jan 20, 2016 at 1:29 PM, Josh Mahonin <jmaho...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> That's great to hear. Looking forward to the doc patch!
>>>>>>>
>>>>>>> On Wed, Jan 20, 2016 at 3:43 PM, Nick Dimiduk <ndimi...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Josh -- I deployed my updated phoenix build across the cluster, added
>>>>>>>> the phoenix-client-spark.jar to the configs on the whole cluster, and
>>>>>>>> basic dataframe access is now working. Let me see about updating the
>>>>>>>> docs page to be clearer; I'll send a patch by you for review.
>>>>>>>>
>>>>>>>> Thanks a lot for the help!
>>>>>>>> -n
>>>>>>>>
>>>>>>>> On Tue, Jan 19, 2016 at 5:59 PM, Josh Mahonin <jmaho...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Right, this cluster I just tested on is HDP 2.3.4, so it's Spark
>>>>>>>>> on YARN as well. I suppose the JAR is probably shipped by YARN, 
>>>>>>>>> though I
>>>>>>>>> don't see any logging saying it, so I'm not certain how the nuts and 
>>>>>>>>> bolts
>>>>>>>>> of that work. By explicitly setting the classpath, we're bypassing 
>>>>>>>>> Spark's
>>>>>>>>> native JAR broadcast though.
>>>>>>>>>
>>>>>>>>> Taking a quick look at the config in Ambari (which ships the
>>>>>>>>> config to each node after saving), in 'Custom spark-defaults' I have 
>>>>>>>>> the
>>>>>>>>> following:
>>>>>>>>>
>>>>>>>>> spark.driver.extraClassPath ->
>>>>>>>>> /etc/hbase/conf:/usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>>>>>> spark.executor.extraClassPath ->
>>>>>>>>> /usr/hdp/current/phoenix-client/phoenix-client-spark.jar
>>>>>>>>>
>>>>>>>>> I'm not sure if the /etc/hbase/conf is necessarily needed, but I
>>>>>>>>> think that gets the Ambari generated hbase-site.xml in the classpath. 
>>>>>>>>> Each
>>>>>>>>> node has the custom phoenix-client-spark.jar installed to that same 
>>>>>>>>> path as
>>>>>>>>> well.
>>>>>>>>>
>>>>>>>>> I can pop into regular spark-shell and load RDDs/DataFrames using:
>>>>>>>>> /usr/hdp/current/spark-client/bin/spark-shell --master yarn-client
>>>>>>>>>
>>>>>>>>> or pyspark via:
>>>>>>>>> /usr/hdp/current/spark-client/bin/pyspark
>>>>>>>>>
>>>>>>>>> I also do this as the Ambari-created 'spark' user; I think there
>>>>>>>>> was some fun HDFS permission issue otherwise.
>>>>>>>>>
>>>>>>>>> On Tue, Jan 19, 2016 at 8:23 PM, Nick Dimiduk <ndimi...@apache.org
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> I'm using Spark on YARN, not Spark standalone. YARN NodeManagers are
>>>>>>>>>> colocated with RegionServers; all the hosts have everything. There
>>>>>>>>>> are no Spark workers to restart. You're sure it's not shipped by the
>>>>>>>>>> YARN runtime?
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 19, 2016 at 5:07 PM, Josh Mahonin <jmaho...@gmail.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> Sadly, it needs to be installed onto each Spark worker (for
>>>>>>>>>>> now). The executor config tells each Spark worker to look for that 
>>>>>>>>>>> file to
>>>>>>>>>>> add to its classpath, so once you have it installed, you'll 
>>>>>>>>>>> probably need
>>>>>>>>>>> to restart all the Spark workers.
>>>>>>>>>>>
>>>>>>>>>>> I co-locate Spark and HBase/Phoenix nodes, so I just drop it in
>>>>>>>>>>> /usr/hdp/current/phoenix-client/, but anywhere that each worker can
>>>>>>>>>>> consistently see is fine.
>>>>>>>>>>>
>>>>>>>>>>> One day we'll be able to have Spark ship the JAR around and use
>>>>>>>>>>> it without this classpath nonsense, but we need to do some extra 
>>>>>>>>>>> work on
>>>>>>>>>>> the Phoenix side to make sure that Phoenix's calls to DriverManager
>>>>>>>>>>> actually go through Spark's weird wrapper version of it.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 19, 2016 at 7:36 PM, Nick Dimiduk <
>>>>>>>>>>> ndimi...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 19, 2016 at 4:17 PM, Josh Mahonin <
>>>>>>>>>>>> jmaho...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> What version of Spark are you using?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Probably HDP's Spark 1.4.1; that's what the jars in my install
>>>>>>>>>>>> say, and the welcome message in the pyspark console agrees.
>>>>>>>>>>>>
>>>>>>>>>>>> Are there any other traces of exceptions anywhere?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> No other exceptions that I can find. YARN apparently doesn't
>>>>>>>>>>>> want to aggregate spark's logs.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Are all your Spark nodes set up to point to the same
>>>>>>>>>>>>> phoenix-client-spark JAR?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, as far as I can tell... though come to think of it, is
>>>>>>>>>>>> that jar shipped by the spark runtime to workers, or is it loaded 
>>>>>>>>>>>> locally
>>>>>>>>>>>> on each host? I only changed spark-defaults.conf on the client 
>>>>>>>>>>>> machine, the
>>>>>>>>>>>> machine from which I submitted the job.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for taking a look Josh!
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 19, 2016 at 5:02 PM, Nick Dimiduk <
>>>>>>>>>>>>> ndimi...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm doing my best to follow along with [0], but I'm hitting
>>>>>>>>>>>>>> some stumbling blocks. I'm running with HDP 2.3 for HBase and 
>>>>>>>>>>>>>> Spark. My
>>>>>>>>>>>>>> phoenix build is much newer, basically 4.6-branch + PHOENIX-2503,
>>>>>>>>>>>>>> PHOENIX-2568. I'm using pyspark for now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've added phoenix-$VERSION-client-spark.jar to both
>>>>>>>>>>>>>> spark.executor.extraClassPath and spark.driver.extraClassPath. 
>>>>>>>>>>>>>> This allows
>>>>>>>>>>>>>> me to use sqlContext.read to define a DataFrame against a 
>>>>>>>>>>>>>> Phoenix table.
>>>>>>>>>>>>>> This appears to basically work, as I see PhoenixInputFormat in 
>>>>>>>>>>>>>> the logs and
>>>>>>>>>>>>>> df.printSchema() shows me what I expect. However, when I try 
>>>>>>>>>>>>>> df.take(5), I
>>>>>>>>>>>>>> get "IllegalStateException: unread block data" [1] from the 
>>>>>>>>>>>>>> workers. Poking
>>>>>>>>>>>>>> around, this is commonly a problem with classpath. Any ideas as 
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> specifically which jars are needed? Or better still, how to 
>>>>>>>>>>>>>> debug this
>>>>>>>>>>>>>> issue myself. Adding "/usr/hdp/current/hbase-client/lib/*" to 
>>>>>>>>>>>>>> the classpath
>>>>>>>>>>>>>> gives me a VerifyError about netty method version mismatch. 
>>>>>>>>>>>>>> Indeed I see
>>>>>>>>>>>>>> two netty versions in that lib directory...
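>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For reference, the sequence I'm running is roughly this (table and
>>>>>>>>>>>>>> quorum names changed):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>>>>>     .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>>>     .option("table", "MY_TABLE") \
>>>>>>>>>>>>>>     .option("zkUrl", "zk-host:2181") \
>>>>>>>>>>>>>>     .load()
>>>>>>>>>>>>>> df.printSchema()  # works; shows the expected columns
>>>>>>>>>>>>>> df.take(5)        # fails on the workers with [1]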
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>>>> -n
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [0]: http://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>>>>> [1]:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> java.lang.IllegalStateException: unread block data
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:69)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:95)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Dec 21, 2015 at 8:33 AM, James Taylor <
>>>>>>>>>>>>>> jamestay...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for remembering about the docs, Josh.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Dec 21, 2015 at 8:27 AM, Josh Mahonin <
>>>>>>>>>>>>>>> jmaho...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just an update for anyone interested, PHOENIX-2503 was just
>>>>>>>>>>>>>>>> committed for 4.7.0 and the docs have been updated to include 
>>>>>>>>>>>>>>>> these samples
>>>>>>>>>>>>>>>> for PySpark users.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://phoenix.apache.org/phoenix_spark.html
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Josh
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Dec 10, 2015 at 1:20 PM, Josh Mahonin <
>>>>>>>>>>>>>>>> jmaho...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hey Nick,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think this used to work, and will again once
>>>>>>>>>>>>>>>>> PHOENIX-2503 gets resolved. With the Spark DataFrame support, 
>>>>>>>>>>>>>>>>> all the
>>>>>>>>>>>>>>>>> necessary glue is there for Phoenix and pyspark to play nice. 
>>>>>>>>>>>>>>>>> With that
>>>>>>>>>>>>>>>>> client JAR (or by overriding the com.fasterxml.jackson JARS), 
>>>>>>>>>>>>>>>>> you can do
>>>>>>>>>>>>>>>>> something like:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> df = sqlContext.read \
>>>>>>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>>>>>>   .load()
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> And
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> df.write \
>>>>>>>>>>>>>>>>>   .format("org.apache.phoenix.spark") \
>>>>>>>>>>>>>>>>>   .mode("overwrite") \
>>>>>>>>>>>>>>>>>   .option("table", "TABLE1") \
>>>>>>>>>>>>>>>>>   .option("zkUrl", "localhost:63512") \
>>>>>>>>>>>>>>>>>   .save()
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, this should be added to the documentation. I hadn't
>>>>>>>>>>>>>>>>> actually tried this till just now. :)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Dec 9, 2015 at 6:39 PM, Nick Dimiduk <
>>>>>>>>>>>>>>>>> ndimi...@apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Heya,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Has anyone any experience using phoenix-spark integration
>>>>>>>>>>>>>>>>>> from pyspark instead of scala? Folks prefer python around 
>>>>>>>>>>>>>>>>>> here...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I did find this example [0] of using HBaseOutputFormat
>>>>>>>>>>>>>>>>>> from pyspark, haven't tried extending it for phoenix. Maybe 
>>>>>>>>>>>>>>>>>> someone with
>>>>>>>>>>>>>>>>>> more experience in pyspark knows better? Would be a great 
>>>>>>>>>>>>>>>>>> addition to our
>>>>>>>>>>>>>>>>>> documentation.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [0]:
>>>>>>>>>>>>>>>>>> https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_outputformat.py
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
