Re: [HELP:]Save Spark Dataframe in Phoenix Table

2016-04-08 Thread Josh Mahonin
Hi Divya,

That's strange. Are you able to post a snippet of your code to look at? And
are you sure that you're saving the dataframes as per the docs (
https://phoenix.apache.org/phoenix_spark.html)?
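
For reference, the documented DataFrame save path looks roughly like the sketch
below (a minimal example, assuming an existing DataFrame df, a Phoenix table
OUTPUT_TABLE whose columns match the DataFrame's, and a reachable ZooKeeper
quorum; all names are placeholders):

  import org.apache.spark.sql.SaveMode

  // Write the DataFrame out through the phoenix-spark DataSource.
  // No explicit schema is supplied; columns are matched by name against
  // the existing Phoenix table.
  df.save("org.apache.phoenix.spark", SaveMode.Overwrite,
    Map("table" -> "OUTPUT_TABLE", "zkUrl" -> "zookeeper-host:2181"))

On Spark 1.4+ the equivalent df.write.format("org.apache.phoenix.spark") form
should work as well.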

Depending on your HDP version, it may or may not actually have
phoenix-spark support. Double-check that your Spark configuration is setup
with the right worker/driver classpath settings, and that the Phoenix JARs
contain the necessary phoenix-spark classes
(e.g. org.apache.phoenix.spark.PhoenixRelation). If not, I suggest
following up with Hortonworks.

Josh



On Fri, Apr 8, 2016 at 1:22 AM, Divya Gehlot 
wrote:

> Hi,
> I have a Hortonworks Hadoop cluster with the following configuration:
> Spark 1.5.2
> HBase 1.1.x
> Phoenix 4.4
>
> I am able to connect to Phoenix through a JDBC connection and to read the
> Phoenix tables.
> But while writing the data back to a Phoenix table,
> I am getting the below error:
>
> org.apache.spark.sql.AnalysisException:
> org.apache.phoenix.spark.DefaultSource does not allow user-specified
> schemas.;
>
> Can anybody help in resolving the above error, or suggest any other way of
> saving Spark DataFrames to Phoenix?
>
> Would really appreciate the help.
>
> Thanks,
> Divya
>


Re: About Spark On Hbase

2015-12-15 Thread Josh Mahonin
And as yet another option, there is
https://phoenix.apache.org/phoenix_spark.html

It does, however, require that you are also using Phoenix in conjunction with
HBase.
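
As a rough sketch of what that integration looks like per the docs above
(assuming an existing SparkContext sc; table, column, and ZooKeeper names are
placeholders):

  import org.apache.phoenix.spark._

  // Load a Phoenix table as an RDD of column-name -> value maps
  val rdd = sc.phoenixTableAsRDD(
    "TABLE1", Seq("ID", "COL1"), zkUrl = Some("phoenix-server:2181"))

  // Save an RDD of tuples back to a Phoenix table
  sc.parallelize(Seq((1L, "test"), (2L, "test2")))
    .saveToPhoenix("TABLE1", Seq("ID", "COL1"), zkUrl = Some("phoenix-server:2181"))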

On Tue, Dec 15, 2015 at 4:16 PM, Ted Yu  wrote:

> There is also
> http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
>
> FYI
>
> On Tue, Dec 15, 2015 at 11:51 AM, Zhan Zhang 
> wrote:
>
>> If you want DataFrame support, you can refer to
>> https://github.com/zhzhan/shc, which I am working on integrating into
>> upstream HBase alongside the existing support.
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>> On Dec 15, 2015, at 4:34 AM, censj  wrote:
>>
>>
>> Hi, *fight fate*,
>> Inside the bulkPut() function, can I first use a Get to read a value, and
>> then put that value to HBase?
>>
>>
>> On Dec 9, 2015, at 16:02, censj wrote:
>>
>> Thank you! I know
>>
>> On Dec 9, 2015, at 15:59, fightf...@163.com wrote:
>>
>> If you are using Maven, you can add the Cloudera Maven repo to the
>> repositories in your pom.xml
>> and add the spark-hbase dependency.
>> I just found this:
>> http://spark-packages.org/package/nerdammer/spark-hbase-connector
>> As Feng Dongyu recommended, you can try this as well, but I have no
>> experience of using it.
>>
>>
>> --
>> fightf...@163.com
>>
>>
>> *From:* censj
>> *Date:* 2015-12-09 15:44
>> *To:* fightf...@163.com
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: About Spark On Hbase
>> So, how do I get this jar? I use sbt to package the project, and I did not find it in the sbt libraries.
>>
>> On Dec 9, 2015, at 15:42, fightf...@163.com wrote:
>>
>> I don't think it really needs the CDH components. Just use the API.
>>
>> --
>> fightf...@163.com
>>
>>
>> *From:* censj
>> *Date:* 2015-12-09 15:31
>> *To:* fightf...@163.com
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: About Spark On Hbase
>> But this is dependent on CDH. I have not installed CDH.
>>
>> On Dec 9, 2015, at 15:18, fightf...@163.com wrote:
>>
>> Actually you can refer to https://github.com/cloudera-labs/SparkOnHBase
>> Also, HBASE-13992 
>> already integrates that feature into the HBase side, but
>> that feature has not been released.
>>
>> Best,
>> Sun.
>>
>> --
>> fightf...@163.com
>>
>>
>> *From:* censj 
>> *Date:* 2015-12-09 15:04
>> *To:* user@spark.apache.org
>> *Subject:* About Spark On Hbase
>> Hi all,
>>  I am using Spark now, but I have not found an open-source project for
>> operating on HBase from Spark. Can anyone point me to one?
>>
>>
>>
>>
>>
>


Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-11 Thread Josh Mahonin
Hi Jeroen,

No problem. I think there's some magic involved with how the Spark
classloader(s) works, especially with regards to the HBase dependencies. I
know there's probably a more light-weight solution that doesn't require
customizing the Spark setup, but that's the most straight-forward way I've
found that works.

Looking again at the docs, I thought I had a PR that mentioned the
SPARK_CLASSPATH, but either I'm dreaming it or it got dropped on the floor.
I'll search around for it today.

Thanks for the StackOverflow heads up, but feel free to update your post
with the resolution, maybe with a GMane link to the thread?

Good luck,

Josh

On Thu, Jun 11, 2015 at 2:38 AM, Jeroen Vlek j.v...@anchormen.nl wrote:

 Hi Josh,

 That worked! Thank you so much! (I can't believe it was something so
 obvious
 ;) )

 If you care about such a thing you could answer my question here for
 bounty:

 http://stackoverflow.com/questions/30639659/apache-phoenix-4-3-1-and-4-4-0-hbase-0-98-on-spark-1-3-1-classnotfoundexceptio

 Have a great day!

 Cheers,
 Jeroen

 On Wednesday 10 June 2015 08:58:02 Josh Mahonin wrote:
  Hi Jeroen,
 
  Rather than bundle the Phoenix client JAR with your app, are you able to
  include it in a static location either in the SPARK_CLASSPATH, or set the
  conf values below (I use SPARK_CLASSPATH myself, though it's deprecated):
 
spark.driver.extraClassPath
spark.executor.extraClassPath
 
  Josh
 
  On Wed, Jun 10, 2015 at 4:11 AM, Jeroen Vlek j.v...@anchormen.nl
 wrote:
   Hi Josh,
  
   Thank you for your effort. Looking at your code, I feel that mine is
   semantically the same, except written in Java. The dependencies in the
   pom.xml
   all have the scope provided. The job is submitted as follows:
  
    $ rm spark.log && MASTER=spark://maprdemo:7077 \
      /opt/mapr/spark/spark-1.3.1/bin/spark-submit --jars \
      /home/mapr/projects/customer/lib/spark-streaming-kafka_2.10-1.3.1.jar,/home/mapr/projects/customer/lib/kafka_2.10-0.8.1.1.jar,/home/mapr/projects/customer/lib/zkclient-0.3.jar,/home/mapr/projects/customer/lib/metrics-core-3.1.0.jar,/home/mapr/projects/customer/lib/metrics-core-2.2.0.jar,lib/spark-sql_2.10-1.3.1.jar,/opt/mapr/phoenix/phoenix-4.4.0-HBase-0.98-bin/phoenix-4.4.0-HBase-0.98-client.jar \
      --class nl.work.kafkastreamconsumer.phoenix.KafkaPhoenixConnector \
      KafkaStreamConsumer.jar maprdemo:5181 0 topic jdbc:phoenix:maprdemo:5181 true
  
   The spark-defaults.conf is reverted back to its defaults (i.e. no
   userClassPathFirst). In the catch-block of the Phoenix connection
 buildup
   the
   class path is printed by recursively iterating over the class loaders.
 The
   first one already prints the phoenix-client jar [1]. It's also very
   unlikely to
   be a bug in Spark or Phoenix, if your proof-of-concept just works.
  
   So if the JAR that contains the offending class is known by the class
   loader,
   then that might indicate that there's a second JAR providing the same
   class
   but with a different version, right?
   Yet, the only Phoenix JAR on the whole class path hierarchy is the
   aforementioned phoenix-client JAR. Furthermore, I googled the class in
   question, ClientRpcControllerFactory, and it really only exists in the
   Phoenix
   project. We're not talking about some low-level AOP Alliance stuff
 here ;)
  
   Maybe I'm missing some fundamental class loading knowledge, in that
 case
   I'd
   be very happy to be enlightened. This all seems very strange.
  
   Cheers,
   Jeroen
  
   [1]
  
    [file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./spark-streaming-kafka_2.10-1.3.1.jar,
    file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./kafka_2.10-0.8.1.1.jar,
    file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./zkclient-0.3.jar,
    file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./phoenix-4.4.0-HBase-0.98-client.jar,
    file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./spark-sql_2.10-1.3.1.jar,
    file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./metrics-core-3.1.0.jar,
    file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./KafkaStreamConsumer.jar,
    file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./metrics-core-2.2.0.jar]
  
   On Tuesday, June 09, 2015 11:18:08 AM Josh Mahonin wrote:
This may or may not be helpful for your classpath issues, but I
 wanted
to
verify that basic functionality worked, so I made a sample app here:
   
https://github.com/jmahonin/spark-streaming-phoenix
   
This consumes events off a Kafka topic using spark streaming, and
 writes
out event counts to Phoenix using the new phoenix-spark
 functionality:
http://phoenix.apache.org/phoenix_spark.html
   
It's definitely overkill, and would probably be more efficient to use
the
JDBC driver directly, but it serves as a proof-of-concept

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-10 Thread Josh Mahonin
Hi Jeroen,

Rather than bundle the Phoenix client JAR with your app, are you able to
include it in a static location either in the SPARK_CLASSPATH, or set the
conf values below (I use SPARK_CLASSPATH myself, though it's deprecated):

  spark.driver.extraClassPath
  spark.executor.extraClassPath
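
For example, in spark-defaults.conf, something along these lines (the jar path
is just an illustration, taken from the submit command in this thread; adjust
it to where the Phoenix client JAR actually lives on your nodes):

  spark.driver.extraClassPath    /opt/mapr/phoenix/phoenix-4.4.0-HBase-0.98-bin/phoenix-4.4.0-HBase-0.98-client.jar
  spark.executor.extraClassPath  /opt/mapr/phoenix/phoenix-4.4.0-HBase-0.98-bin/phoenix-4.4.0-HBase-0.98-client.jar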

Josh

On Wed, Jun 10, 2015 at 4:11 AM, Jeroen Vlek j.v...@anchormen.nl wrote:

 Hi Josh,

 Thank you for your effort. Looking at your code, I feel that mine is
 semantically the same, except written in Java. The dependencies in the
 pom.xml
 all have the scope provided. The job is submitted as follows:

 $ rm spark.log && MASTER=spark://maprdemo:7077 \
   /opt/mapr/spark/spark-1.3.1/bin/spark-submit --jars \
   /home/mapr/projects/customer/lib/spark-streaming-kafka_2.10-1.3.1.jar,/home/mapr/projects/customer/lib/kafka_2.10-0.8.1.1.jar,/home/mapr/projects/customer/lib/zkclient-0.3.jar,/home/mapr/projects/customer/lib/metrics-core-3.1.0.jar,/home/mapr/projects/customer/lib/metrics-core-2.2.0.jar,lib/spark-sql_2.10-1.3.1.jar,/opt/mapr/phoenix/phoenix-4.4.0-HBase-0.98-bin/phoenix-4.4.0-HBase-0.98-client.jar \
   --class nl.work.kafkastreamconsumer.phoenix.KafkaPhoenixConnector \
   KafkaStreamConsumer.jar maprdemo:5181 0 topic jdbc:phoenix:maprdemo:5181 true

 The spark-defaults.conf is reverted back to its defaults (i.e. no
 userClassPathFirst). In the catch-block of the Phoenix connection buildup
 the
 class path is printed by recursively iterating over the class loaders. The
 first one already prints the phoenix-client jar [1]. It's also very
 unlikely to
 be a bug in Spark or Phoenix, if your proof-of-concept just works.

 So if the JAR that contains the offending class is known by the class
 loader,
 then that might indicate that there's a second JAR providing the same class
 but with a different version, right?
 Yet, the only Phoenix JAR on the whole class path hierarchy is the
 aforementioned phoenix-client JAR. Furthermore, I googled the class in
 question, ClientRpcControllerFactory, and it really only exists in the
 Phoenix
 project. We're not talking about some low-level AOP Alliance stuff here ;)

 Maybe I'm missing some fundamental class loading knowledge, in that case
 I'd
 be very happy to be enlightened. This all seems very strange.

 Cheers,
 Jeroen

 [1]
 [file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./spark-streaming-kafka_2.10-1.3.1.jar,
 file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./kafka_2.10-0.8.1.1.jar,
 file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./zkclient-0.3.jar,
 file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./phoenix-4.4.0-HBase-0.98-client.jar,
 file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./spark-sql_2.10-1.3.1.jar,
 file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./metrics-core-3.1.0.jar,
 file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./KafkaStreamConsumer.jar,
 file:/opt/mapr/spark/spark-1.3.1/tmp/app-20150610010512-0001/0/./metrics-core-2.2.0.jar]


 On Tuesday, June 09, 2015 11:18:08 AM Josh Mahonin wrote:
  This may or may not be helpful for your classpath issues, but I wanted to
  verify that basic functionality worked, so I made a sample app here:
 
  https://github.com/jmahonin/spark-streaming-phoenix
 
  This consumes events off a Kafka topic using spark streaming, and writes
  out event counts to Phoenix using the new phoenix-spark functionality:
  http://phoenix.apache.org/phoenix_spark.html
 
  It's definitely overkill, and would probably be more efficient to use the
  JDBC driver directly, but it serves as a proof-of-concept.
 
  I've only tested this in local mode. To convert it to a full jobs JAR, I
  suspect that keeping all of the spark and phoenix dependencies marked as
  'provided', and including the Phoenix client JAR in the Spark classpath
  would work as well.
 
  Good luck,
 
  Josh
 
  On Tue, Jun 9, 2015 at 4:40 AM, Jeroen Vlek j.v...@work.nl wrote:
   Hi,
  
   I posted a question with regards to Phoenix and Spark Streaming on
   StackOverflow [1]. Please find a copy of the question to this email
 below
   the
   first stack trace. I also already contacted the Phoenix mailing list
 and
   tried
   the suggestion of setting spark.driver.userClassPathFirst.
 Unfortunately
   that
   only pushed me further into the dependency hell, which I tried to
 resolve
   until I hit a wall with an UnsatisfiedLinkError on Snappy.
  
   What I am trying to achieve: To save a stream from Kafka into
   Phoenix/Hbase
   via Spark Streaming. I'm using MapR as a platform and the original
   exception
   happens both on a 3-node cluster, as on the MapR Sandbox (a VM for
   experimentation), in YARN and stand-alone mode. Further experimentation
   (like
   the saveAsNewHadoopApiFile below), was done only on the sandbox in
   standalone
   mode.
  
   Phoenix only supports Spark from 4.4.0 onwards, but I thought I could
   use a naive implementation

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-09 Thread Josh Mahonin
This may or may not be helpful for your classpath issues, but I wanted to
verify that basic functionality worked, so I made a sample app here:

https://github.com/jmahonin/spark-streaming-phoenix

This consumes events off a Kafka topic using spark streaming, and writes
out event counts to Phoenix using the new phoenix-spark functionality:
http://phoenix.apache.org/phoenix_spark.html

It's definitely overkill, and would probably be more efficient to use the
JDBC driver directly, but it serves as a proof-of-concept.

I've only tested this in local mode. To convert it to a full jobs JAR, I
suspect that keeping all of the spark and phoenix dependencies marked as
'provided', and including the Phoenix client JAR in the Spark classpath
would work as well.
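
As a sketch of what the 'provided' scope looks like in the pom.xml (versions
taken from this thread; purely illustrative):

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.3.1</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.phoenix</groupId>
    <artifactId>phoenix-core</artifactId>
    <version>4.4.0-HBase-0.98</version>
    <scope>provided</scope>
  </dependency>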

Good luck,

Josh

On Tue, Jun 9, 2015 at 4:40 AM, Jeroen Vlek j.v...@anchormen.nl wrote:

 Hi,

 I posted a question with regards to Phoenix and Spark Streaming on
 StackOverflow [1]. Please find a copy of the question to this email below
 the
 first stack trace. I also already contacted the Phoenix mailing list and
 tried
 the suggestion of setting spark.driver.userClassPathFirst. Unfortunately
 that
 only pushed me further into the dependency hell, which I tried to resolve
 until I hit a wall with an UnsatisfiedLinkError on Snappy.

 What I am trying to achieve: to save a stream from Kafka into Phoenix/HBase
 via Spark Streaming. I'm using MapR as a platform, and the original exception
 happens both on a 3-node cluster and on the MapR Sandbox (a VM for
 experimentation), in YARN and standalone mode. Further experimentation (like
 the saveAsNewHadoopApiFile below) was done only on the sandbox in standalone
 mode.

 Phoenix only supports Spark from 4.4.0 onwards, but I thought I could
 use a naive implementation that creates a new connection for
 every RDD from the DStream in 4.3.1.  This resulted in the
 ClassNotFoundException described in [1], so I switched to 4.4.0.

 Unfortunately the saveToPhoenix method is only available in Scala. So I did
 find the suggestion to try it via the saveAsNewHadoopApiFile method [2]
 and an
 example implementation [3], which I adapted to my own needs.

 However, 4.4.0 + saveAsNewHadoopApiFile raises the same
 ClassNotFoundException, just a slightly different stack trace:

   java.lang.RuntimeException: java.sql.SQLException: ERROR 103
 (08004): Unable to establish connection.
 at

 org.apache.phoenix.mapreduce.PhoenixOutputFormat.getRecordWriter(PhoenixOutputFormat.java:58)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:995)
 at

 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.sql.SQLException: ERROR 103 (08004): Unable to
 establish connection.
 at

 org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:386)
 at

 org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:145)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:288)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl.access$300(ConnectionQueryServicesImpl.java:171)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1881)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1860)
 at

 org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:77)
 at

 org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:1860)
 at

 org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:162)
 at

 org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.connect(PhoenixEmbeddedDriver.java:131)
 at
 org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:133)
 at
 java.sql.DriverManager.getConnection(DriverManager.java:571)
 at
 java.sql.DriverManager.getConnection(DriverManager.java:187)
 at

 org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(ConnectionUtil.java:92)
 at

 org.apache.phoenix.mapreduce.util.ConnectionUtil.getOutputConnection(ConnectionUtil.java:80)
 at

 org.apache.phoenix.mapreduce.util.ConnectionUtil.getOutputConnection(ConnectionUtil.java:68)
 at

 org.apache.phoenix.mapreduce.PhoenixRecordWriter.<init>(PhoenixRecordWriter.java:49)
 

Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-24 Thread Josh Mahonin
Hi Alaa Ali,

That's right, when using the PhoenixInputFormat, you can do simple 'WHERE'
clauses and then perform any aggregate functions you'd like from within
Spark. Any aggregations you run won't be quite as fast as running the
native Phoenix queries, but once the data is available as an RDD you can also do a
lot more with it than just the Phoenix functions provide.
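
For example, once the rows are available as an RDD (however they were loaded),
a GROUP BY-style count can be mimicked on the Spark side. A minimal sketch,
where rowsRDD is a hypothetical RDD of (eventType, eventTime) pairs:

  import org.apache.spark.SparkContext._  // pair-RDD functions (reduceByKey)

  // Mimic SELECT eventType, COUNT(*) ... GROUP BY eventType
  val countsByType = rowsRDD
    .map { case (eventType, _) => (eventType, 1L) }
    .reduceByKey(_ + _)
  countsByType.collect().foreach(println)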

I don't know if this works with PySpark or not, but assuming the
'newHadoopRDD' functionality works for other input formats, it should work
for Phoenix as well.

Josh

On Fri, Nov 21, 2014 at 5:12 PM, Alaa Ali contact.a...@gmail.com wrote:

 Awesome, thanks Josh, I missed that previous post of yours! But your code
 snippet shows a select statement, so what I can do is just run a simple
 select with a where clause if I want to, and then run my data processing on
 the RDD to mimic the aggregation I want to do with SQL, right? Also,
 another question, I still haven't tried this out, but I'll actually be
 using this with PySpark, so I'm guessing the PhoenixPigConfiguration and
 newHadoopRDD can be defined in PySpark as well?

 Regards,
 Alaa Ali

 On Fri, Nov 21, 2014 at 4:34 PM, Josh Mahonin jmaho...@interset.com
 wrote:

 Hi Alaa Ali,

 In order for Spark to split the JDBC query in parallel, it expects an
 upper and lower bound for your input data, as well as a number of
 partitions so that it can split the query across multiple tasks.

 For example, depending on your data distribution, you could set an upper
 and lower bound on your timestamp range, and spark should be able to create
 new sub-queries to split up the data.

 Another option is to load up the whole table using the PhoenixInputFormat
 as a NewHadoopRDD. It doesn't yet support many of Phoenix's aggregate
 functions, but it does let you load up whole tables as RDDs.

 I've previously posted example code here:

 http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAJ6CGtA1DoTdadRtT5M0+75rXTyQgu5gexT+uLccw_8Ppzyt=q...@mail.gmail.com%3E

 There's also an example library implementation here, although I haven't
 had a chance to test it yet:
 https://github.com/simplymeasured/phoenix-spark

 Josh

 On Fri, Nov 21, 2014 at 4:14 PM, Alaa Ali contact.a...@gmail.com wrote:

 I want to run queries on Apache Phoenix which has a JDBC driver. The
 query that I want to run is:

 select ts,ename from random_data_date limit 10

 But I'm having issues with the JdbcRDD upper and lowerBound parameters
 (that I don't actually understand).

 Here's what I have so far:

 import org.apache.spark.rdd.JdbcRDD
 import java.sql.{Connection, DriverManager, ResultSet}

  val url = "jdbc:phoenix:zookeeper"
  val sql = "select ts,ename from random_data_date limit ?"
  val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url), sql,
    5, 10, 2, r => r.getString("ts") + ", " + r.getString("ename"))

 But this doesn't work because the sql expression that the JdbcRDD
 expects has to have two ?s to represent the lower and upper bound.

 How can I run my query through the JdbcRDD?

 Regards,
 Alaa Ali






Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-21 Thread Josh Mahonin
Hi Alaa Ali,

In order for Spark to split the JDBC query in parallel, it expects an upper
and lower bound for your input data, as well as a number of partitions so
that it can split the query across multiple tasks.

For example, depending on your data distribution, you could set an upper
and lower bound on your timestamp range, and spark should be able to create
new sub-queries to split up the data.
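
A rough sketch of what that could look like for your table (the bounding
column, here a hypothetical numeric ID, and the bound values are placeholders):

  import java.sql.DriverManager
  import org.apache.spark.rdd.JdbcRDD

  val url = "jdbc:phoenix:zookeeper"
  // Spark fills the two ?s with each partition's lower/upper bound
  val sql = "SELECT ts, ename FROM random_data_date WHERE id >= ? AND id <= ?"
  val myRDD = new JdbcRDD(sc,
    () => DriverManager.getConnection(url),
    sql,
    1, 1000, 4, // lowerBound, upperBound, numPartitions
    r => r.getString("ts") + ", " + r.getString("ename"))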

Another option is to load up the whole table using the PhoenixInputFormat
as a NewHadoopRDD. It doesn't yet support many of Phoenix's aggregate
functions, but it does let you load up whole tables as RDDs.

I've previously posted example code here:
http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAJ6CGtA1DoTdadRtT5M0+75rXTyQgu5gexT+uLccw_8Ppzyt=q...@mail.gmail.com%3E

There's also an example library implementation here, although I haven't had
a chance to test it yet:
https://github.com/simplymeasured/phoenix-spark

Josh

On Fri, Nov 21, 2014 at 4:14 PM, Alaa Ali contact.a...@gmail.com wrote:

 I want to run queries on Apache Phoenix which has a JDBC driver. The query
 that I want to run is:

 select ts,ename from random_data_date limit 10

 But I'm having issues with the JdbcRDD upper and lowerBound parameters
 (that I don't actually understand).

 Here's what I have so far:

 import org.apache.spark.rdd.JdbcRDD
 import java.sql.{Connection, DriverManager, ResultSet}

 val url = "jdbc:phoenix:zookeeper"
 val sql = "select ts,ename from random_data_date limit ?"
 val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url), sql,
   5, 10, 2, r => r.getString("ts") + ", " + r.getString("ename"))

 But this doesn't work because the sql expression that the JdbcRDD expects
 has to have two ?s to represent the lower and upper bound.

 How can I run my query through the JdbcRDD?

 Regards,
 Alaa Ali



Re: Data from Mysql using JdbcRDD

2014-07-30 Thread Josh Mahonin
Hi Srini,

I believe the JdbcRDD requires input splits based on ranges within the
query itself. As an example, you could adjust your query to something like:
SELECT * FROM student_info WHERE id >= ? AND id <= ?

Note that the values you've passed in '1, 20, 2' correspond to the lower
bound index, upper bound index, and number of partitions. With that example
query and those values, you should end up with an RDD with two partitions,
one containing the student_info rows with ids 1 through 10, and the second with ids 11
through 20.
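
Concretely, a corrected version of your snippet might look roughly like this
(reusing your url/username/password values, and assuming a numeric id column to
range over):

  val mysqlrdd = new org.apache.spark.rdd.JdbcRDD(sc, () => {
      Class.forName("com.mysql.jdbc.Driver")
      DriverManager.getConnection(url, username, password)
    },
    // the two ?s are bound to each partition's [lower, upper] id range
    "SELECT * FROM student_info WHERE id >= ? AND id <= ?",
    1, 20, 2, r => r.getString("studentname"))
  mysqlrdd.saveAsTextFile("/home/ubuntu/mysqljdbc")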

Josh


On Wed, Jul 30, 2014 at 6:58 PM, chaitu reddy chaitzre...@gmail.com wrote:

 Kc
 On Jul 30, 2014 3:55 PM, srinivas kusamsrini...@gmail.com wrote:

 Hi,
  I am trying to get data from MySQL using the JdbcRDD, with the code below.
 The table has three columns.

 val url = "jdbc:mysql://localhost:3306/studentdata"
 val username = "root"
 val password = "root"
 val mysqlrdd = new org.apache.spark.rdd.JdbcRDD(sc, () => {
   Class.forName("com.mysql.jdbc.Driver")
   DriverManager.getConnection(url, username, password)
 }, "SELECT * FROM student_info",
   1, 20, 2, r => r.getString("studentname"))
 mysqlrdd.saveAsTextFile("/home/ubuntu/mysqljdbc")

 I am getting a runtime error:

 14/07/30 22:05:04 INFO JdbcRDD: statement fetch size set to: -2147483648 to force MySQL streaming
 14/07/30 22:05:04 INFO JdbcRDD: statement fetch size set to: -2147483648 to force MySQL streaming
 14/07/30 22:05:04 INFO JdbcRDD: closed connection
 14/07/30 22:05:04 INFO JdbcRDD: closed connection
 14/07/30 22:05:04 ERROR Executor: Exception in task ID 0
 java.sql.SQLException: Parameter index out of range (1 > number of
 parameters, which is 0).
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1084)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:987)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:973)

 Can anyone help? And let me know if I am missing anything.

 Thanks,
 -Srini.






 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Data-from-Mysql-using-JdbcRDD-tp10994.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark and HBase

2014-04-25 Thread Josh Mahonin
Phoenix generally presents itself as an endpoint using JDBC, which in my
testing seems to play nicely with the JdbcRDD.

However, a few days ago a patch was made against Phoenix to implement
support for Pig using a custom Hadoop InputFormat, which means it now has
Spark support too.

Here's a code snippet that sets up an RDD for a specific query:

--
val phoenixConf = new PhoenixPigConfiguration(new Configuration())
phoenixConf.setSelectStatement("SELECT EVENTTYPE, EVENTTIME FROM EVENTS WHERE EVENTTYPE = 'some_type'")
phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
phoenixConf.configure("servername", "EVENTS", 100L)

val phoenixRDD = sc.newAPIHadoopRDD(
  phoenixConf.getConfiguration(),
  classOf[PhoenixInputFormat],
  classOf[NullWritable],
  classOf[PhoenixRecord])
--

I'm still very new at Spark and even less experienced with Phoenix, but I'm
hoping there's an advantage over the JdbcRDD in terms of partitioning. The
JdbcRDD seems to implement partitioning based on a query predicate that is
user defined, but I think Phoenix's InputFormat is able to figure out the
splits which Spark is able to leverage. I don't really know how to verify
if this is the case or not though, so if anyone else is looking into this,
I'd love to hear their thoughts.
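
One quick, hedged way to check, assuming the snippet above, is simply to look
at how many partitions the resulting RDD reports:

  // Number of input splits Spark derived for the Phoenix-backed RDD
  println("phoenixRDD partitions: " + phoenixRDD.partitions.size)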

Josh


On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Just took a quick look at the overview here (http://phoenix.incubator.apache.org/)
 and the quick start guide here
 (http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html).

 It looks like Apache Phoenix aims to provide flexible SQL access to data,
 both for transactional and analytic purposes, and at interactive speeds.

 Nick


 On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang binwang...@gmail.com wrote:

  First, I have not tried it myself. However, from what I have heard it has some
  basic SQL features, so you can query your HBase table much like you query
  content on HDFS using Hive.
  So it is not just querying a simple column; I believe you can do joins and
  other SQL queries. Maybe you can spin up an EMR cluster with HBase
  preconfigured and give it a try.

  Sorry I cannot provide a more detailed explanation or more help.



 On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier pomperma...@okkam.it
  wrote:

  Thanks for the quick reply Bin. Phoenix is something I'm going to try for
  sure, but it seems somehow useless if I can use Spark.
  Probably, as you said, since Phoenix uses a dedicated data structure
  within each HBase table, it has more effective memory usage; but if I need to
  deserialize data stored in an HBase cell, I still have to read that object
  into memory, and thus I need Spark. From what I understood, Phoenix is good if
  I have to query a simple column of HBase, but things get really complicated if
  I have to add an index for each column in my table and I store complex
  objects within the cells. Is that correct?

 Best,
 Flavio




 On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang binwang...@gmail.com wrote:

 Hi Flavio,

  I happened to attend (actually, I am attending) the 2014 Apache Conf, where I
  heard about a project called Apache Phoenix, which fully leverages HBase and
  is supposed to be 1000x faster than Hive. And it is not memory-bounded,
  whereas memory sets a limit for Spark. It is still in the incubator, and the
  stats functions Spark has already implemented are still on its roadmap. I
  am not sure whether it will be good, but it might be something interesting to
  check out.

 /usr/bin


 On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier 
 pomperma...@okkam.it wrote:

 Hi to everybody,

   These days I have looked a bit at the recent evolution of the big data
  stacks, and it seems that HBase is somehow fading away in favour of
  Spark+HDFS. Am I correct?
 Do you think that Spark and HBase should work together or not?

 Best regards,
 Flavio