Awesome, thanks Josh, I missed that previous post of yours! Your code
snippet shows a select statement, though, so am I right that I can just
run a simple select with a where clause and then do my data processing
on the RDD to mimic the aggregation I would otherwise do in SQL? One
more question: I haven't tried this out yet, but I'll actually be using
this from PySpark, so I'm guessing the PhoenixPigConfiguration and
newHadoopRDD can be set up in PySpark as well?
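
Something like this is what I have in mind for the first part (an
untested sketch against my table below; it assumes ts is stored as a
numeric epoch value, and the reduceByKey stands in for a SQL group-by):

    import java.sql.DriverManager
    import org.apache.spark.SparkContext._ // pair-RDD functions
    import org.apache.spark.rdd.JdbcRDD

    val url = "jdbc:phoenix:zookeeper"
    // The where clause filters inside Phoenix; the two ?s are only
    // JdbcRDD's partitioning bounds over the (numeric) ts range.
    val sql = "select ts,ename from random_data_date " +
      "where ename is not null and ts >= ? and ts <= ?"
    val pairs = new JdbcRDD(sc,
      () => DriverManager.getConnection(url),
      sql, 0L, 1416600000000L, 4,
      r => (r.getString("ename"), 1L))
    // Mimic "select ename, count(*) ... group by ename" on the RDD.
    val counts = pairs.reduceByKey(_ + _)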

Regards,
Alaa Ali

On Fri, Nov 21, 2014 at 4:34 PM, Josh Mahonin <jmaho...@interset.com> wrote:

> Hi Alaa Ali,
>
> In order for Spark to split the JDBC query in parallel, it expects an
> upper and lower bound for your input data, as well as a number of
> partitions so that it can split the query across multiple tasks.
>
> For example, depending on your data distribution, you could set lower
> and upper bounds on your timestamp range, and Spark should be able to
> create sub-queries that split up the data.
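>
> A minimal sketch of what that could look like (hedged: it assumes ts
> is stored as a numeric epoch value you can bound on, and the bounds
> and partition count here are made up):
>
>     val sql = "select ts,ename from random_data_date " +
>       "where ts >= ? and ts <= ?"
>     // Spark binds the two ?s per partition: with 4 partitions it
>     // issues 4 sub-queries that together cover [lowerBound, upperBound].
>     val rdd = new JdbcRDD(sc,
>       () => DriverManager.getConnection(url),
>       sql, 1416500000000L, 1416600000000L, 4,
>       r => (r.getLong("ts"), r.getString("ename")))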
>
> Another option is to load the whole table using the PhoenixInputFormat
> as a NewHadoopRDD. That path doesn't yet support many of Phoenix's
> aggregate functions, but it does let you pull entire tables in as RDDs.
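>
> From memory, the shape of it is roughly this (untested as written, and
> the PhoenixPigConfiguration method names may differ across Phoenix
> versions, so treat it as a sketch and check the post below for the
> working version):
>
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.io.NullWritable
>     import org.apache.phoenix.pig.PhoenixPigConfiguration
>     import org.apache.phoenix.pig.hadoop.{PhoenixInputFormat, PhoenixRecord}
>
>     val phoenixConf = new PhoenixPigConfiguration(new Configuration())
>     // ZooKeeper quorum, table to read, and scan batch size.
>     phoenixConf.configure("zookeeper", "RANDOM_DATA_DATE", 100L)
>     phoenixConf.setSelectColumns("TS,ENAME")
>     val tableRDD = sc.newAPIHadoopRDD(phoenixConf.getConfiguration,
>       classOf[PhoenixInputFormat], // from org.apache.phoenix.pig.hadoop
>       classOf[NullWritable],
>       classOf[PhoenixRecord])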
>
> I've previously posted example code here:
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAJ6CGtA1DoTdadRtT5M0+75rXTyQgu5gexT+uLccw_8Ppzyt=q...@mail.gmail.com%3E
>
> There's also an example library implementation here, although I haven't
> had a chance to test it yet:
> https://github.com/simplymeasured/phoenix-spark
>
> Josh
>
> On Fri, Nov 21, 2014 at 4:14 PM, Alaa Ali <contact.a...@gmail.com> wrote:
>
>> I want to run queries on Apache Phoenix which has a JDBC driver. The
>> query that I want to run is:
>>
>>     select ts,ename from random_data_date limit 10
>>
>> But I'm having trouble with the JdbcRDD lowerBound and upperBound
>> parameters (which I don't actually understand).
>>
>> Here's what I have so far:
>>
>>     import org.apache.spark.rdd.JdbcRDD
>>     import java.sql.{Connection, DriverManager, ResultSet}
>>
>>     val url = "jdbc:phoenix:zookeeper"
>>     val sql = "select ts,ename from random_data_date limit ?"
>>     val myRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url),
>>       sql, 5, 10, 2,
>>       r => r.getString("ts") + ", " + r.getString("ename"))
>>
>> But this doesn't work, because the SQL expression that JdbcRDD expects
>> has to contain two ?s representing the lower and upper bound.
>>
>> How can I run my query through the JdbcRDD?
>>
>> Regards,
>> Alaa Ali
>>
>
>
