Is this technique similar to what Kinesis is offering or what Structured Streaming is going to have eventually?
Just curious.

Cheers,
Ben

> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>
> I would suggest coding your own Spark thriftserver, which seems to be very easy:
> http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server
>
> I am starting to test it. The big advantage is that you can implement any logic, because it's a Spark job, and then start a thrift server on a temporary table. For example, you can query a micro-batch RDD from a Kafka stream, or pre-load some tables and implement a rolling cache to periodically update the Spark in-memory tables from a persistent store...
> It's not part of the public API, and I don't yet know what issues doing this may raise, but I think the Spark community should look at this path: making the thriftserver instantiable in any Spark job.
>
> 2016-10-17 18:17 GMT+02:00 Michael Segel <msegel_had...@hotmail.com>:
> Guys,
> Sorry for jumping in late to the game…
>
> If memory serves (which may not be a good thing…):
>
> You can use HiveServer2 as a connection point to HBase.
> While this doesn't perform well, it's probably the cleanest solution.
> I'm not keen on Phoenix… wouldn't recommend it….
>
> The issue is that you're trying to make HBase, a key/value object store, into a relational engine… it's not.
>
> There are some considerations which make HBase not ideal for all use cases, and you may find better performance with Parquet files.
>
> One thing missing is the secondary indexing and query optimizations that you have in RDBMSs and that are lacking in HBase / MapR-DB / etc.… so your performance will vary.
>
> With respect to Tableau… their entire interface into the big data world revolves around the JDBC/ODBC interface.
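[Editor's note] Vincent's "thriftserver inside a Spark job" idea can be sketched as below, assuming Spark 2.x with the `spark-hive-thriftserver` module on the classpath. As he notes, `HiveThriftServer2.startWithContext` is not part of the public API, so this may break between Spark versions; the input path and table name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object EmbeddedThriftServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("embedded-thriftserver")
      .enableHiveSupport()
      .getOrCreate()

    // Any job logic can run first, e.g. pre-loading and caching a table
    // (the "rolling cache" idea). Path and table name are hypothetical.
    val df = spark.read.parquet("/data/some_table")
    df.createOrReplaceTempView("some_table")
    spark.catalog.cacheTable("some_table")

    // Expose this session's temporary tables over JDBC/ODBC.
    // Not a public API: behavior may differ across Spark versions.
    HiveThriftServer2.startWithContext(spark.sqlContext)
  }
}
```

Because the server shares the job's session, JDBC clients see exactly the temporary tables the job registered, which is what makes the micro-batch / rolling-cache pattern possible.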
> So if you don't have that piece as part of your solution, you're DOA with respect to Tableau.
>
> Have you considered Drill as your JDBC connection point? (YAAP: Yet Another Apache Project)
>
>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>> Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don't need to be real-time, and they won't be directly feeding off of the main DMP HBase data. Instead, it'll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>>
>> I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use case and architecture, the ID generation needs to happen local to the ad servers, where we generate a unique ID and store it in an ID linking table. Even better, many of the 3rd-party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. The same goes for the REST endpoints: 3rd-party services will hit ours to update our data, with no need to read from our data. And when we want to update their data, we will hit theirs using a triggered job.
>>
>> This all boils down to just integrating with Kafka.
>>
>> Once again, thanks for all the help.
>>
>> Cheers,
>> Ben
>>
>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>> Please also keep in mind that Tableau Server has the capability to store data in-memory and refresh the in-memory data only when needed. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.
>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich already provided a good alternative. However, you should check whether it contains a recent version of HBase and Phoenix. That being said, I just wonder what the dataflow, data model, and planned analysis are. Maybe completely different solutions are possible. In particular, these single inserts, upserts, etc. should be avoided as much as possible in the big data (analysis) world with any technology, because they do not perform well.
>>>
>>> Hive with LLAP will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using the Ignite HDFS in-memory solution. All this only makes sense if you do not use MR as the engine, and use the right input format (ORC, Parquet) and a recent Hive version.
>>>
>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>>> Mich,
>>>>
>>>> Unfortunately, we are moving away from Hive and unifying on Spark, using CDH 5.8 as our distro. And Tableau released a Spark ODBC/JDBC driver too. I will either try the Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.
>>>>
>>>> Thanks for the advice.
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Sure. But essentially you are looking at batch data for analytics for your Tableau users, so Hive may be a better choice with its rich SQL and an ODBC/JDBC connection to Tableau already available.
>>>>> I would go for Hive, especially as the new release will have an in-memory offering as well for frequently accessed data :)
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>
>>>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>> Mich,
>>>>>
>>>>> First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data center. When not colocated in the same data center, they will connect to a colocated database server using JDBC. Either way, using JDBC everywhere simplifies and unifies the code on the JDBC industry standard.
>>>>>
>>>>> Does this make sense?
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> Like any other design: what are your presentation layer and end users?
>>>>>>
>>>>>> Are they SQL-centric users from a Tableau background, or might they use Spark functional programming?
>>>>>>
>>>>>> It is best to describe the use case.
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> On 8 October 2016 at 19:40, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>> I wouldn't be too surprised if Spark SQL -> JDBC data source -> Phoenix JDBC server -> HBase would work better.
>>>>>>
>>>>>> Without naming specifics, there are at least 4 or 5 different implementations of HBase sources, each at a varying level of development and with different requirements (HBase release version, Kerberos support, etc.)
>>>>>>
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>> Cc: <user@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>
>>>>>>
>>>>>> Mich,
>>>>>>
>>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> I don't think it will work.
>>>>>>
>>>>>> You can use Phoenix on top of HBase:
>>>>>>
>>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>>> ROW            COLUMN+CELL
>>>>>> TSCO-1-Apr-08  column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>> TSCO-1-Apr-08  column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>> TSCO-1-Apr-08  column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>> TSCO-1-Apr-08  column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>> TSCO-1-Apr-08  column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>> TSCO-1-Apr-08  column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>> TSCO-1-Apr-08  column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>> TSCO-1-Apr-08  column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>>
>>>>>> And the same on Phoenix on top of the HBase table:
>>>>>>
>>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate,
>>>>>>   "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open",
>>>>>>   "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice"
>>>>>> from "tsco"
>>>>>> where to_number("volume") > 0 and "high" != '-'
>>>>>>   and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
>>>>>> order by to_date("Date",'dd-MMM-yy') limit 1;
>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>> | TRADEDATE   | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  | volume    | AverageDailyPrice  |
>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> On 8 October 2016 at 19:05, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>> Great, then I think those packages as Spark data sources should allow you to do exactly that (replace org.apache.spark.sql.jdbc with the HBase one).
>>>>>>
>>>>>> I do think it would be great to get more examples around this, though. It would be great if you could share your experience with this!
>>>>>>
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>> Cc: <user@spark.apache.org>
>>>>>>
>>>>>> Felix,
>>>>>>
>>>>>> My goal is to use the Spark SQL JDBC Thriftserver to access HBase tables using just SQL.
>>>>>> I have been able to CREATE tables using this statement below in the past:
>>>>>>
>>>>>> CREATE TABLE <table-name>
>>>>>> USING org.apache.spark.sql.jdbc
>>>>>> OPTIONS (
>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>>   dbtable "dim.dimension_acamp"
>>>>>> );
>>>>>>
>>>>>> After doing this, I can access the PostgreSQL table through the Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>
>>>>>> Ben,
>>>>>>
>>>>>> I'm not sure I'm following completely.
>>>>>>
>>>>>> Is your goal to use Spark to create or access tables in HBase? If so, the link below and several packages out there support that by providing an HBase data source for Spark. There are some examples of what the Spark code looks like in that link as well. On that note, you should also be able to use the HBase data source from a pure SQL (Spark SQL) query, which should work in the case of the Spark SQL JDBC Thrift Server (with USING: http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
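[Editor's note] Swapping the JDBC data source for an HBase one in the same CREATE TABLE ... USING pattern might look like the sketch below. This assumes the hbase-spark module's data source; the class name (`org.apache.hadoop.hbase.spark`) and option keys (`hbase.table`, `hbase.columns.mapping`) follow the HBase reference guide, but other HBase connectors use entirely different names, and the table/columns here are borrowed from Mich's `tsco` example for illustration only.

```scala
// Register an HBase-backed table in Spark SQL, then query it with plain SQL
// (and, through the Thriftserver, over JDBC). Connector class and options
// are assumptions based on the hbase-spark module.
spark.sql(
  """CREATE TABLE tsco_sql
    |USING org.apache.hadoop.hbase.spark
    |OPTIONS (
    |  'hbase.table' 'tsco',
    |  'hbase.columns.mapping'
    |    'KEY_FIELD STRING :key,
    |     close STRING stock_daily:close,
    |     high STRING stock_daily:high'
    |)""".stripMargin)

spark.sql("SELECT KEY_FIELD, close FROM tsco_sql").show()
```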
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>> Cc: <user@spark.apache.org>
>>>>>>
>>>>>> Felix,
>>>>>>
>>>>>> The only alternative way is to create a stored procedure (a UDF, in database terms) that would run Spark Scala code underneath. That way, I could use the Spark SQL JDBC Thriftserver to execute it from SQL, passing the keys and values I want to UPSERT. I wonder if this is possible, since I cannot CREATE a wrapper table on top of an HBase table in Spark SQL.
>>>>>>
>>>>>> What do you think? Is this the right approach?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>
>>>>>> HBase has released support for Spark:
>>>>>> hbase.apache.org/book.html#spark
>>>>>>
>>>>>> And if you search, you should find several alternative approaches.
>>>>>>
>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>>
>>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to set up a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load-balance the thriftservers.
>>>>>> In addition, this will benefit us by giving us a way to abstract the data storage layer away from the presentation-layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
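[Editor's note] The client side that Ben describes (REST endpoints and web farms talking to the Thriftserver) is plain JDBC against the HiveServer2 protocol, which the Spark SQL Thriftserver speaks. A minimal sketch, assuming the Hive JDBC driver is on the classpath; host, port, credentials, and table name are hypothetical:

```scala
import java.sql.DriverManager

object ThriftClient {
  def main(args: Array[String]): Unit = {
    // The Spark SQL Thriftserver listens on HiveServer2's port (10000 by
    // default), so the standard hive2 JDBC URL applies.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://thrift-host:10000/default", "user", "")
    try {
      val rs = conn.createStatement()
        .executeQuery("SELECT ticker, close FROM some_table LIMIT 10")
      while (rs.next())
        println(s"${rs.getString(1)} ${rs.getString(2)}")
    } finally conn.close()
  }
}
```

Because every endpoint speaks the same JDBC interface, multiple Thriftserver instances can sit behind an ordinary TCP load balancer, which is the load-balancing setup Ben mentions.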