Hi Alaa Ali,

In order for Spark to run the JDBC query in parallel, JdbcRDD expects lower
and upper bounds for your input data, as well as a number of partitions, so
that it can split the query across multiple tasks.

For example, depending on your data distribution, you could set lower and
upper bounds on your timestamp range, and Spark should be able to create
new sub-queries to split up the data.
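
Something like this might work (untested, and assuming 'ts' is stored as
epoch milliseconds in a BIGINT column, since JdbcRDD binds the two ?s with
setLong(); the bounds below are made up):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val url = "jdbc:phoenix:zookeeper"

// JdbcRDD fills in the two ?s with per-partition sub-ranges of
// [lowerBound, upperBound], issuing one sub-query per partition.
val sql = "select ts, ename from random_data_date where ts >= ? and ts <= ?"

// Hypothetical bounds covering the data; with 4 partitions this
// becomes 4 sub-queries over adjacent sub-ranges.
val lowerBound = 1415000000000L
val upperBound = 1416600000000L

val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url), sql,
  lowerBound, upperBound, 4,
  (r: ResultSet) => r.getString("ts") + ", " + r.getString("ename"))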

Another option is to load the whole table using the PhoenixInputFormat as a
NewHadoopRDD. It doesn't yet support many of Phoenix's aggregate functions,
but it does let you load entire tables as RDDs.
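
A rough sketch of that approach, using the Phoenix MapReduce integration (I
haven't run this exact snippet, the RecordWritable class is a hypothetical
DBWritable you'd write for your two columns, and depending on your Phoenix
version you may also need to set the select columns or an input query on
the configuration):

import java.io.{DataInput, DataOutput}
import java.sql.{PreparedStatement, ResultSet}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{NullWritable, Writable}
import org.apache.hadoop.mapreduce.lib.db.DBWritable
import org.apache.phoenix.mapreduce.PhoenixInputFormat
import org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil

// Hypothetical glue class: Phoenix populates one of these per row.
class RecordWritable extends DBWritable with Writable {
  var ts: String = _
  var ename: String = _
  override def readFields(rs: ResultSet): Unit = {
    ts = rs.getString("ts")
    ename = rs.getString("ename")
  }
  override def write(ps: PreparedStatement): Unit = {}  // input-only
  override def readFields(in: DataInput): Unit = {}
  override def write(out: DataOutput): Unit = {}
}

val conf = new Configuration()
conf.set("hbase.zookeeper.quorum", "zookeeper")
PhoenixConfigurationUtil.setInputTableName(conf, "RANDOM_DATA_DATE")
PhoenixConfigurationUtil.setInputClass(conf, classOf[RecordWritable])

// Keys are NullWritable; values are the populated RecordWritables.
val phoenixRDD = sc.newAPIHadoopRDD(conf,
  classOf[PhoenixInputFormat[RecordWritable]],
  classOf[NullWritable],
  classOf[RecordWritable])

val rows = phoenixRDD.map { case (_, rec) => rec.ts + ", " + rec.ename }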

I've previously posted example code here:
http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAJ6CGtA1DoTdadRtT5M0+75rXTyQgu5gexT+uLccw_8Ppzyt=q...@mail.gmail.com%3E

There's also an example library implementation here, although I haven't had
a chance to test it yet:
https://github.com/simplymeasured/phoenix-spark

Josh

On Fri, Nov 21, 2014 at 4:14 PM, Alaa Ali <contact.a...@gmail.com> wrote:

> I want to run queries on Apache Phoenix which has a JDBC driver. The query
> that I want to run is:
>
>     select ts,ename from random_data_date limit 10
>
> But I'm having issues with the JdbcRDD lowerBound and upperBound
> parameters (which I don't actually understand).
>
> Here's what I have so far:
>
> import org.apache.spark.rdd.JdbcRDD
> import java.sql.{Connection, DriverManager, ResultSet}
>
> val url="jdbc:phoenix:zookeeper"
> val sql = "select ts,ename from random_data_date limit ?"
> val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url), sql,
> 5, 10, 2, r => r.getString("ts") + ", " + r.getString("ename"))
>
> But this doesn't work, because the SQL expression that the JdbcRDD expects
> has to have two ?s to represent the lower and upper bounds.
>
> How can I run my query through the JdbcRDD?
>
> Regards,
> Alaa Ali
>
