Phoenix generally presents itself as an endpoint using JDBC, which in my
testing seems to play nicely with JdbcRDD.
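
For reference, here's a minimal sketch of that approach (the table,
column, and host names are hypothetical; the Phoenix JDBC URL format is
jdbc:phoenix:<zookeeper quorum>):

--
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

// Register the Phoenix JDBC driver
Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")

// JdbcRDD partitions the query by binding lower/upper bounds into the
// two '?' placeholders, split across numPartitions queries
val jdbcRDD = new JdbcRDD(sc,
  () => DriverManager.getConnection("jdbc:phoenix:servername"),
  "SELECT EVENTTYPE, EVENTTIME FROM EVENTS WHERE ID >= ? AND ID <= ?",
  1L, 1000L, 10,
  r => (r.getString(1), r.getTimestamp(2)))
--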

However, a few days ago a patch was committed to Phoenix that implements
Pig support via a custom Hadoop InputFormat, which means it now has Spark
support too.

Here's a code snippet that sets up an RDD for a specific query:

--
// Package names here are an assumption based on the Phoenix Pig
// module; adjust them to match your Phoenix version.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.apache.phoenix.pig.PhoenixPigConfiguration
import org.apache.phoenix.pig.hadoop.{PhoenixInputFormat, PhoenixRecord}

// Configure the query to run and the columns to read
val phoenixConf = new PhoenixPigConfiguration(new Configuration())
phoenixConf.setSelectStatement(
  "SELECT EVENTTYPE,EVENTTIME FROM EVENTS WHERE EVENTTYPE = 'some_type'")
phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
phoenixConf.configure("servername", "EVENTS", 100L)

// Build an RDD of (NullWritable, PhoenixRecord) pairs via the new
// Hadoop InputFormat API
val phoenixRDD = sc.newAPIHadoopRDD(
  phoenixConf.getConfiguration(),
  classOf[PhoenixInputFormat],
  classOf[NullWritable],
  classOf[PhoenixRecord])
--

I'm still very new to Spark and even less experienced with Phoenix, but
I'm hoping there's an advantage over the JdbcRDD in terms of partitioning.
The JdbcRDD partitions its input based on a user-defined query predicate,
whereas I think Phoenix's InputFormat can compute the splits itself, which
Spark can then leverage. I don't really know how to verify whether this is
the case, though, so if anyone else is looking into this, I'd love to hear
their thoughts.
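
As a rough sanity check (just a sketch, reusing phoenixRDD from the
snippet above), comparing partition counts should show whether Phoenix is
actually producing multiple splits:

--
// Number of partitions Spark derived from Phoenix's InputFormat splits;
// a JdbcRDD's count is fixed by its numPartitions argument instead
println(phoenixRDD.partitions.length)
--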

Josh


On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:

> Just took a quick look at the overview here
> <http://phoenix.incubator.apache.org/> and the quick start guide here
> <http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html>.
>
> It looks like Apache Phoenix aims to provide flexible SQL access to data,
> both for transactional and analytic purposes, and at interactive speeds.
>
> Nick
>
>
> On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang <binwang...@gmail.com> wrote:
>
>> First, I have not tried it myself. However, from what I have heard, it
>> has some basic SQL features, so you can query your HBase table much like
>> you query content on HDFS using Hive.
>> So it is not just "query a simple column"; I believe you can do joins
>> and other SQL queries. Maybe you can spin up an EMR cluster with HBase
>> preconfigured and give it a try.
>>
>> Sorry I cannot provide a more detailed explanation or help.
>>
>>
>>
>> On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier
>> <pomperma...@okkam.it> wrote:
>>
>>> Thanks for the quick reply Bin. Phoenix is something I'm going to try
>>> for sure, but it seems somewhat redundant if I can use Spark.
>>> Probably, as you said, since Phoenix uses a dedicated data structure
>>> within each HBase table, it makes more effective use of memory, but if
>>> I need to deserialize data stored in an HBase cell I still have to read
>>> that object into memory, and thus I need Spark. From what I understood,
>>> Phoenix is good if I have to query a simple column of HBase, but things
>>> get really complicated if I have to add an index for each column in my
>>> table and I store complex objects within the cells. Is that correct?
>>>
>>> Best,
>>> Flavio
>>>
>>>
>>>
>>>
>>> On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang <binwang...@gmail.com> wrote:
>>>
>>>> Hi Flavio,
>>>>
>>>> I am actually attending the 2014 Apache Conf, where I heard about a
>>>> project called "Apache Phoenix", which fully leverages HBase and is
>>>> supposed to be 1000x faster than Hive. And it is not memory-bound,
>>>> whereas memory sets up a limit for Spark. It is still in the
>>>> incubator, and the "stats" functions Spark has already implemented
>>>> are still on Phoenix's roadmap. I am not sure whether it will be
>>>> good, but it might be something interesting to check out.
>>>>
>>>> /usr/bin
>>>>
>>>>
>>>> On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier
>>>> <pomperma...@okkam.it> wrote:
>>>>
>>>>> Hi to everybody,
>>>>>
>>>>> In the last few days I looked a bit at the recent evolution of the
>>>>> big data stacks, and it seems that HBase is somehow fading away in
>>>>> favour of Spark+HDFS. Am I correct?
>>>>> Do you think that Spark and HBase should work together or not?
>>>>>
>>>>> Best regards,
>>>>> Flavio
>>>>>
>>>>
>>
>
