It has some implications because it imposes the SQL model on HBase. Internally 
it translates SQL queries into custom HBase coprocessors. Keep also in mind 
that HBase needs a proper key design, and look at how Phoenix designs those 
keys for you to get the best performance out of it. I think for OLTP it is a 
workable model, and I think they plan to offer Phoenix as a default interface 
as part of HBase anyway.
For OLAP it depends. 
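
For illustration, a rough sketch of what that key design looks like in Phoenix 
(table and column names below are made up): the composite primary key becomes 
the HBase row key, so the leading key columns should match your most common 
access pattern.

CREATE TABLE web_stat (
    host        VARCHAR NOT NULL,
    domain      VARCHAR NOT NULL,
    usage_date  DATE NOT NULL,
    page_views  BIGINT
    CONSTRAINT pk PRIMARY KEY (host, domain, usage_date)
);
-- Phoenix concatenates host + domain + usage_date into the HBase row key,
-- so a query filtering on host (and optionally domain and usage_date)
-- becomes a range scan rather than a full table scan.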


> On 17 Oct 2016, at 22:34, ayan guha <guha.a...@gmail.com> wrote:
> 
> Hi
> 
> Any reason not to recommend Phoenix? I haven't used it myself, so I am curious 
> about the pros and cons of using it.
> 
>> On 18 Oct 2016 03:17, "Michael Segel" <msegel_had...@hotmail.com> wrote:
>> Guys, 
>> Sorry for jumping in late to the game… 
>> 
>> If memory serves (which may not be a good thing…) :
>> 
>> You can use HiveServer2 as a connection point to HBase.  
>> While this doesn’t perform well, it’s probably the cleanest solution. 
>> I’m not keen on Phoenix… wouldn’t recommend it…. 
>> 
>> 
>> The issue is that you’re trying to make HBase, a key/value object store, a 
>> Relational Engine… it’s not. 
>> 
>> There are some considerations which make HBase not ideal for all use cases 
>> and you may find better performance with Parquet files. 
>> 
>> One thing missing is the secondary indexing and query optimization that you 
>> have in RDBMSs but that is lacking in HBase / MapRDB / etc., so your 
>> performance will vary. 
>> 
>> With respect to Tableau… their entire interface into the big data world 
>> revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
>> part of your solution, you’re DOA with respect to Tableau. 
>> 
>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
>> Apache project) 
>> 
>> 
>>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> 
>>> Thanks for all the suggestions. It would seem you guys are right about the 
>>> Tableau side of things. The reports don’t need to be real-time, and they 
>>> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be 
>>> batched to Parquet or Kudu/Impala or even PostgreSQL.
>>> 
>>> I originally thought that we needed two-way data retrieval from the DMP 
>>> HBase for ID generation, but after further investigation into the use-case 
>>> and architecture, the ID generation needs to happen local to the Ad Servers 
>>> where we generate a unique ID and store it in an ID linking table. Even 
>>> better, many of the 3rd party services supply this ID. So, data only needs 
>>> to flow in one direction. We will use Kafka as the bus for this. No JDBC 
>>> required. This also goes for the REST endpoints. 3rd party services will 
>>> hit ours to update our data with no need to read from our data. And, when 
>>> we want to update their data, we will hit theirs to update their data using 
>>> a triggered job.
>>> 
>>> This all boils down to just integrating with Kafka.
>>> 
>>> Once again, thanks for all the help.
>>> 
>>> Cheers,
>>> Ben
>>> 
>>> 
>>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> 
>>>> Please also keep in mind that Tableau Server has the capability to store 
>>>> data in-memory and to refresh the in-memory data only when needed. This 
>>>> means you can import it from any source and let your users work only on 
>>>> the in-memory data in Tableau Server.
>>>> 
>>>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich 
>>>>> already provided a good alternative. However, you should check whether it 
>>>>> contains a recent version of HBase and Phoenix. That being said, I just 
>>>>> wonder what the dataflow, the data model and the analysis you plan to do 
>>>>> look like. Maybe completely different solutions are possible. In 
>>>>> particular, single inserts, upserts etc. should be avoided as much as 
>>>>> possible in the Big Data (analysis) world with any technology, because 
>>>>> they do not perform well. 
>>>>> 
>>>>> Hive with LLAP will provide an in-memory cache for interactive analytics. 
>>>>> You can also put full tables in-memory with Hive using the Ignite HDFS 
>>>>> in-memory solution. All of this only makes sense if you do not use MR as 
>>>>> the engine, and if you use the right input format (ORC, Parquet) and a 
>>>>> recent Hive version.
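>>>>> 
>>>>> As a minimal illustration (table and column names are made up), storing a 
>>>>> table in ORC is just a matter of the storage clause in the DDL:
>>>>> 
>>>>> CREATE TABLE stock_daily (
>>>>>   ticker      STRING,
>>>>>   trade_date  DATE,
>>>>>   close_price DECIMAL(10,2),
>>>>>   volume      BIGINT
>>>>> )
>>>>> STORED AS ORC;
>>>>> -- Columnar ORC (or Parquet), combined with Tez/LLAP rather than MR,
>>>>> -- is what makes interactive analytics over such tables feasible.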
>>>>> 
>>>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>> 
>>>>>> Mich,
>>>>>> 
>>>>>> Unfortunately, we are moving away from Hive and unifying on Spark using 
>>>>>> CDH 5.8 as our distro. And Tableau has released a Spark ODBC/JDBC 
>>>>>> driver too. I will either try Phoenix JDBC Server for HBase or push to 
>>>>>> move faster to Kudu with Impala. We will use Impala as the JDBC 
>>>>>> in-between until the Kudu team completes Spark SQL support for JDBC.
>>>>>> 
>>>>>> Thanks for the advice.
>>>>>> 
>>>>>> Cheers,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh 
>>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Sure. But essentially you are looking at batch data for analytics for 
>>>>>>> your Tableau users, so Hive may be a better choice with its rich SQL and 
>>>>>>> its existing ODBC/JDBC connectivity to Tableau.
>>>>>>> 
>>>>>>> I would go for Hive, especially as the new release will have an in-memory 
>>>>>>> offering as well for frequently accessed data :)
>>>>>>> 
>>>>>>> 
>>>>>>> Dr Mich Talebzadeh
>>>>>>>  
>>>>>>> LinkedIn  
>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>  
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>> 
>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>>>>>> loss, damage or destruction of data or any other property which may 
>>>>>>> arise from relying on this email's technical content is explicitly 
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages arising from such loss, damage or destruction.
>>>>>>>  
>>>>>>> 
>>>>>>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>> Mich,
>>>>>>>> 
>>>>>>>> First and foremost, we have visualization servers that run Tableau for 
>>>>>>>> external user reports. Second, we have servers that are ad servers and 
>>>>>>>> REST endpoints for cookie sync and segmentation data exchange. These 
>>>>>>>> will use JDBC directly within the same data-center. When not colocated 
>>>>>>>> in the same data-center, they will connect to a local database server 
>>>>>>>> using JDBC. Either way, using JDBC everywhere simplifies and unifies 
>>>>>>>> the code around the JDBC industry standard.
>>>>>>>> 
>>>>>>>> Does this make sense?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Ben
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh 
>>>>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Like any other design, what are your presentation layer and end users?
>>>>>>>>> 
>>>>>>>>> Are they SQL-centric users from a Tableau background, or may they use 
>>>>>>>>> Spark functional programming?
>>>>>>>>> 
>>>>>>>>> It is best to describe the use case.
>>>>>>>>> 
>>>>>>>>> HTH
>>>>>>>>> 
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>  
>>>>>>>>> LinkedIn  
>>>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>  
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>> 
>>>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for 
>>>>>>>>> any loss, damage or destruction of data or any other property which 
>>>>>>>>> may arise from relying on this email's technical content is 
>>>>>>>>> explicitly disclaimed. The author will in no case be liable for any 
>>>>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>>> On 8 October 2016 at 19:40, Felix Cheung <felixcheun...@hotmail.com> 
>>>>>>>>>> wrote:
>>>>>>>>>> I wouldn't be too surprised if Spark SQL -> JDBC data source -> 
>>>>>>>>>> Phoenix JDBC server -> HBase would work better.
>>>>>>>>>> 
>>>>>>>>>> Without naming specifics, there are at least 4 or 5 different 
>>>>>>>>>> implementations of HBase sources, each at a varying level of 
>>>>>>>>>> development and with different requirements (HBase release version, 
>>>>>>>>>> Kerberos support, etc.)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> _____________________________
>>>>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>> To: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>>>>> Cc: <user@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Mich,
>>>>>>>>>> 
>>>>>>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about 
>>>>>>>>>> that alternative.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Ben
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh 
>>>>>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> I don't think it will work directly.
>>>>>>>>>> 
>>>>>>>>>> You can use Phoenix on top of HBase:
>>>>>>>>>> 
>>>>>>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>>>>>>> ROW             COLUMN+CELL
>>>>>>>>>>  TSCO-1-Apr-08  column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>>>>>>  TSCO-1-Apr-08  column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>>>>>>  TSCO-1-Apr-08  column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>>>>>>  TSCO-1-Apr-08  column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>>>>>>  TSCO-1-Apr-08  column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>>>>>>  TSCO-1-Apr-08  column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>>>>>>  TSCO-1-Apr-08  column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>>>>>>  TSCO-1-Apr-08  column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>>>>>> 
>>>>>>>>>> And the same via Phoenix on top of the HBase table:
>>>>>>>>>> 
>>>>>>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select 
>>>>>>>>>> substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, 
>>>>>>>>>> "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's 
>>>>>>>>>> Low", "open" AS "Day's Open", "ticker", "volume", 
>>>>>>>>>> (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from 
>>>>>>>>>> "tsco" where to_number("volume") > 0 and "high" != '-' and 
>>>>>>>>>> to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') 
>>>>>>>>>> order by  to_date("Date",'dd-MMM-yy') limit 1;
>>>>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>>>>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>>>> 
>>>>>>>>>> HTH
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>  
>>>>>>>>>> LinkedIn  
>>>>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>>  
>>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>> 
>>>>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for 
>>>>>>>>>> any loss, damage or destruction of data or any other property which 
>>>>>>>>>> may arise from relying on this email's technical content is 
>>>>>>>>>> explicitly disclaimed. The author will in no case be liable for any 
>>>>>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>  
>>>>>>>>>> 
>>>>>>>>>>> On 8 October 2016 at 19:05, Felix Cheung 
>>>>>>>>>>> <felixcheun...@hotmail.com> wrote:
>>>>>>>>>>> Great, then I think those packages, as Spark data sources, should 
>>>>>>>>>>> allow you to do exactly that (replacing org.apache.spark.sql.jdbc 
>>>>>>>>>>> with an HBase one).
>>>>>>>>>>> 
>>>>>>>>>>> I do think it will be great to get more examples around this 
>>>>>>>>>>> though. Would be great if you could share your experience with this!
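>>>>>>>>>>> 
>>>>>>>>>>> As a rough, untested sketch of what that could look like (the data 
>>>>>>>>>>> source class and option names vary between connectors; this assumes 
>>>>>>>>>>> the hbase-spark module is on the classpath, and it borrows the 
>>>>>>>>>>> column names from the tsco HBase table in Mich's example):
>>>>>>>>>>> 
>>>>>>>>>>> CREATE TABLE tsco_hbase
>>>>>>>>>>> USING org.apache.hadoop.hbase.spark
>>>>>>>>>>> OPTIONS (
>>>>>>>>>>>   hbase.table "tsco",
>>>>>>>>>>>   hbase.columns.mapping
>>>>>>>>>>>     "rowkey STRING :key, close STRING stock_daily:close, high STRING stock_daily:high, low STRING stock_daily:low, volume STRING stock_daily:volume"
>>>>>>>>>>> );
>>>>>>>>>>> -- Other connectors (e.g. SHC) take a JSON catalog option instead of
>>>>>>>>>>> -- hbase.columns.mapping, so check the documentation of whichever
>>>>>>>>>>> -- package you pick.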
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _____________________________
>>>>>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>>>>>>> Cc: <user@spark.apache.org>
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Felix,
>>>>>>>>>>> 
>>>>>>>>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase 
>>>>>>>>>>> tables using just SQL. I have been able to CREATE tables using this 
>>>>>>>>>>> statement below in the past:
>>>>>>>>>>> 
>>>>>>>>>>> CREATE TABLE <table-name>
>>>>>>>>>>> USING org.apache.spark.sql.jdbc
>>>>>>>>>>> OPTIONS (
>>>>>>>>>>>   url 
>>>>>>>>>>> "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>>>>>>>   dbtable "dim.dimension_acamp"
>>>>>>>>>>> );
>>>>>>>>>>> 
>>>>>>>>>>> After doing this, I can access the PostgreSQL table using Spark SQL 
>>>>>>>>>>> JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, 
>>>>>>>>>>> etc.). I want to do the same with HBase tables. We tried this using 
>>>>>>>>>>> Hive and HiveServer2, but the response times are just too long.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung 
>>>>>>>>>>> <felixcheun...@hotmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Ben,
>>>>>>>>>>> 
>>>>>>>>>>> I'm not sure I'm following completely.
>>>>>>>>>>> 
>>>>>>>>>>> Is your goal to use Spark to create or access tables in HBase? If 
>>>>>>>>>>> so, the link below and several packages out there support that by 
>>>>>>>>>>> providing an HBase data source for Spark. There are some examples of 
>>>>>>>>>>> what the Spark code looks like in that link as well. On that note, 
>>>>>>>>>>> you should also be able to use the HBase data source from a pure SQL 
>>>>>>>>>>> (Spark SQL) query as well, which should work in the case of the 
>>>>>>>>>>> Spark SQL JDBC Thrift Server (with USING, 
>>>>>>>>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _____________________________
>>>>>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>>>>>>> Cc: <user@spark.apache.org>
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Felix,
>>>>>>>>>>> 
>>>>>>>>>>> The only alternative way is to create a stored procedure (UDF) in 
>>>>>>>>>>> database terms that would run Spark Scala code underneath. In this 
>>>>>>>>>>> way, I can use the Spark SQL JDBC Thriftserver to execute it using 
>>>>>>>>>>> SQL code, passing the keys and values I want to UPSERT. I wonder if 
>>>>>>>>>>> this is possible, since I cannot CREATE a wrapper table on top of an 
>>>>>>>>>>> HBase table in Spark SQL?
>>>>>>>>>>> 
>>>>>>>>>>> What do you think? Is this the right approach?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>> 
>>>>>>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung 
>>>>>>>>>>> <felixcheun...@hotmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> HBase has released support for Spark
>>>>>>>>>>> hbase.apache.org/book.html#spark
>>>>>>>>>>> 
>>>>>>>>>>> And if you search you should find several alternative approaches.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" 
>>>>>>>>>>> <bbuil...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Does anyone know if Spark can work with HBase tables using Spark 
>>>>>>>>>>> SQL? I know in Hive we are able to create tables on top of an 
>>>>>>>>>>> underlying HBase table that can be accessed using MapReduce jobs. 
>>>>>>>>>>> Can the same be done using HiveContext or SQLContext? We are trying 
>>>>>>>>>>> to set up a way to GET and POST data to and from the HBase table 
>>>>>>>>>>> using the Spark SQL JDBC thriftserver from our RESTful API 
>>>>>>>>>>> endpoints and/or HTTP web farms. If we can get this to work, then 
>>>>>>>>>>> we can load balance the thriftservers. In addition, this will 
>>>>>>>>>>> benefit us in giving us a way to abstract the data storage layer 
>>>>>>>>>>> away from the presentation layer code. There is a chance that we 
>>>>>>>>>>> will swap out the data storage technology in the future. We are 
>>>>>>>>>>> currently experimenting with Kudu.
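>>>>>>>>>>> 
>>>>>>>>>>> For reference, the Hive-on-HBase tables I am referring to are defined 
>>>>>>>>>>> roughly like this (table, column family and column names here are 
>>>>>>>>>>> only illustrative):
>>>>>>>>>>> 
>>>>>>>>>>> CREATE EXTERNAL TABLE hbase_profiles (
>>>>>>>>>>>   rowkey     STRING,
>>>>>>>>>>>   segment    STRING,
>>>>>>>>>>>   updated_at STRING
>>>>>>>>>>> )
>>>>>>>>>>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>>>>>>>>>>> WITH SERDEPROPERTIES (
>>>>>>>>>>>   "hbase.columns.mapping" = ":key,d:segment,d:updated_at")
>>>>>>>>>>> TBLPROPERTIES ("hbase.table.name" = "profiles");
>>>>>>>>>>> -- Such a table can then be queried through HiveServer2 or a
>>>>>>>>>>> -- HiveContext; the question is whether something equivalent works
>>>>>>>>>>> -- through the Spark SQL thriftserver.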
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 
