This will give me an opportunity to start using Structured Streaming. Then, I can try adding more functionality. If all goes well, we could transition off HBase to a more in-memory data solution that can "spill over" data for us.
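For reference, a minimal sketch of that first step: a Structured Streaming query landing in Spark's in-memory sink, which a thriftserver started in the same session could then expose over JDBC. This assumes Spark 2.x with the spark-sql-kafka-0-10 source available; the broker, topic, and table names are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming-cache").getOrCreate()

// Read a Kafka topic as a streaming DataFrame (placeholder broker and topic).
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "dmp-events")
  .load()
  .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

// The memory sink keeps results in an in-memory table named after the query,
// here queryable as `dmp_events` from the same SparkSession.
val query = events.writeStream
  .format("memory")
  .queryName("dmp_events")
  .outputMode("append")
  .start()

The memory sink is meant for modest result sizes, which is where a solution that can spill over data, as mentioned above, would eventually come in.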
> On Oct 17, 2016, at 11:53 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>
> Instead of (or in addition to) saving results somewhere, you just start a thriftserver that exposes the Spark tables of the SQLContext (or SparkSession now). That means you can implement any logic (and maybe use Structured Streaming) to expose your data. Today, using the thriftserver means reading data from the persistent store on every query, so if the data modeling doesn't fit the query, it can take quite long. What you generally do in a common Spark job is load the data and cache it as a Spark table in an in-memory columnar format, which is quite efficient for any kind of query; the counterpart is that the cache isn't updated, so you have to implement a reload mechanism, and that isn't available with the stock thriftserver.
>
> What I propose is to mix the two worlds: periodically/delta-load data into the Spark table cache and expose it through the thriftserver. But you have to implement the loading logic; it can be very simple to very complex depending on your needs.
>
> 2016-10-17 19:48 GMT+02:00 Benjamin Kim <bbuil...@gmail.com>:
> Is this technique similar to what Kinesis is offering, or what Structured Streaming is going to have eventually?
>
> Just curious.
>
> Cheers,
> Ben
>
>> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>>
>> I would suggest coding your own Spark thriftserver, which seems to be very easy.
>> http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server
>>
>> I am starting to test it. The big advantage is that you can implement any logic, because it's a Spark job, and then start a thrift server on the temporary tables. For example, you can query a micro-batch RDD from a Kafka stream, or pre-load some tables and implement a rolling cache that periodically updates the Spark in-memory tables from the persistent store...
>> It's not part of the public API, and I don't know yet what the issues with doing this are, but I think the Spark community should look at this path: making the thriftserver instantiable in any Spark job.
>>
>> 2016-10-17 18:17 GMT+02:00 Michael Segel <msegel_had...@hotmail.com>:
>> Guys,
>> Sorry for jumping in late to the game…
>>
>> If memory serves (which may not be a good thing…):
>>
>> You can use HiveServer2 as a connection point to HBase. While this doesn't perform well, it's probably the cleanest solution. I'm not keen on Phoenix… wouldn't recommend it….
>>
>> The issue is that you're trying to make HBase, a key/value object store, into a relational engine… it's not.
>>
>> There are some considerations which make HBase not ideal for all use cases, and you may find better performance with Parquet files.
>>
>> One thing missing is the secondary indexing and query optimizations that you have in RDBMSs and that are lacking in HBase / MapR-DB / etc.… so your performance will vary.
>>
>> With respect to Tableau… their entire interface into the big data world revolves around the JDBC/ODBC interface. So if you don't have that piece as part of your solution, you're DOA with respect to Tableau.
>>
>> Have you considered Drill as your JDBC connection point? (YAAP: yet another Apache project)
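As a concrete sketch of the pattern Vincent describes, the thriftserver can be started from inside a Spark job with HiveThriftServer2.startWithContext, serving a cached table that the job reloads on a schedule. This assumes Spark 2.x built with Hive support; the source path, table name, and refresh interval are made up.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object CachedThriftServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cached-thriftserver")
      .enableHiveSupport()
      .getOrCreate()

    // Load (or reload) the data and pin it in the in-memory columnar cache.
    def reload(): Unit = {
      spark.read.parquet("/data/events")   // hypothetical source path
        .createOrReplaceTempView("events")
      spark.catalog.cacheTable("events")
    }

    reload()
    // Expose this session's tables over JDBC/ODBC (port 10000 by default).
    HiveThriftServer2.startWithContext(spark.sqlContext)

    // Naive rolling refresh; a real job would handle errors and atomicity.
    while (true) {
      Thread.sleep(15 * 60 * 1000)
      spark.catalog.uncacheTable("events")
      reload()
    }
  }
}

Because the server shares the job's session state, JDBC clients keep hitting the cache rather than the persistent store, at the cost of staleness bounded by the refresh interval.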
>>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>> Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don't need to be real-time, and they won't be directly feeding off of the main DMP HBase data. Instead, it'll be batched to Parquet or Kudu/Impala, or even PostgreSQL.
>>>
>>> I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use case and architecture, the ID generation needs to happen local to the ad servers, where we generate a unique ID and store it in an ID linking table. Even better, many of the 3rd-party services supply this ID. So data only needs to flow in one direction, and we will use Kafka as the bus for this. No JDBC required. The same goes for the REST endpoints: 3rd-party services will hit ours to update our data, with no need to read from our data. And when we want to update their data, we will hit theirs using a triggered job.
>>>
>>> This all boils down to just integrating with Kafka.
>>>
>>> Once again, thanks for all the help.
>>>
>>> Cheers,
>>> Ben
>>>
>>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> Please keep in mind also that Tableau Server can store data in-memory and refresh the in-memory data only when needed. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.
>>>>
>>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> Cloudera 5.8 has a very old version of Hive, without Tez, but Mich has already provided a good alternative. However, you should check whether it contains a recent version of HBase and Phoenix. That being said, I just wonder what the dataflow, the data model, and the analysis you plan to do are. Maybe completely different solutions are possible. In particular, single inserts, upserts, etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well.
>>>>
>>>> Hive with LLAP will provide an in-memory cache for interactive analytics. You can also put full tables in-memory with Hive using the Ignite HDFS in-memory solution. All of this only makes sense if you do not use MR as an engine, and use the right input format (ORC, Parquet) and a recent Hive version.
>>>>
>>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>
>>>>> Mich,
>>>>>
>>>>> Unfortunately, we are moving away from Hive and unifying on Spark, using CDH 5.8 as our distro. And Tableau has released a Spark ODBC/JDBC driver too. I will either try the Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.
>>>>>
>>>>> Thanks for the advice.
>>>>>
>>>>> Cheers,
>>>>> Ben
>>>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> Sure. But essentially you are looking at batch data for analytics for your Tableau users, so Hive may be a better choice, with its rich SQL and an ODBC/JDBC connection to Tableau already.
>>>>>>
>>>>>> I would go for Hive, especially as the new release will have an in-memory offering as well for frequently accessed data :)
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>> Mich,
>>>>>>
>>>>>> First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data center. When not colocated in the same data center, they will connect to a local database server using JDBC. Either way, using JDBC everywhere simplifies and unifies the code on the JDBC industry standard.
>>>>>>
>>>>>> Does this make sense?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>> Like any other design: what are your presentation layer and end users? Are they SQL-centric users from a Tableau background, or might they use Spark functional programming?
>>>>>>>
>>>>>>> It is best to describe the use case.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>> On 8 October 2016 at 19:40, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>> I wouldn't be too surprised if Spark SQL -> JDBC data source -> Phoenix JDBC server -> HBase worked better.
>>>>>>>
>>>>>>> Without naming specifics, there are at least 4 or 5 different implementations of HBase sources, each at a varying level of development and with different requirements (HBase release version, Kerberos support, etc.).
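As a sketch of the chain Felix outlines, Spark's generic JDBC data source can read through the Phoenix query server's thin driver. The host reuses the one from Mich's example below; the serialization option follows Phoenix Query Server defaults, and note that Spark has no Phoenix-specific JDBC dialect, so identifier quoting and filter pushdown may need care.

// Spark SQL -> JDBC data source -> Phoenix query server -> HBase.
// Assumes the Phoenix thin-client jar is on the classpath.
val tsco = spark.read
  .format("jdbc")
  .option("url", "jdbc:phoenix:thin:url=http://rhes564:8765;serialization=PROTOBUF")
  .option("driver", "org.apache.phoenix.queryserver.client.Driver")
  .option("dbtable", "\"tsco\"")   // Phoenix table names are case-sensitive when quoted
  .load()

// Register it so a thriftserver started with this session can serve it.
tsco.createOrReplaceTempView("tsco")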
>>>>>>> _____________________________
>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>> To: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>> Cc: <user@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>
>>>>>>>
>>>>>>> Mich,
>>>>>>>
>>>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>>
>>>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>> I don't think it will work.
>>>>>>>
>>>>>>> You can use Phoenix on top of HBase:
>>>>>>>
>>>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>>>> ROW            COLUMN+CELL
>>>>>>> TSCO-1-Apr-08  column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>>> TSCO-1-Apr-08  column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>>> TSCO-1-Apr-08  column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>>> TSCO-1-Apr-08  column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>>> TSCO-1-Apr-08  column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>>> TSCO-1-Apr-08  column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>>> TSCO-1-Apr-08  column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>>> TSCO-1-Apr-08  column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>>>
>>>>>>> And the same on Phoenix, on top of the HBase table:
>>>>>>>
>>>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by to_date("Date",'dd-MMM-yy') limit 1;
>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>> On 8 October 2016 at 19:05, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>> Great, then I think those packages, as Spark data sources, should allow you to do exactly that (replace org.apache.spark.sql.jdbc with an HBase one).
>>>>>>>
>>>>>>> I do think it would be great to get more examples around this, though. It would be great if you could share your experience with it!
>>>>>>>
>>>>>>> _____________________________
>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>>> Cc: <user@spark.apache.org>
>>>>>>>
>>>>>>> Felix,
>>>>>>>
>>>>>>> My goal is to use the Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:
>>>>>>>
>>>>>>> CREATE TABLE <table-name>
>>>>>>> USING org.apache.spark.sql.jdbc
>>>>>>> OPTIONS (
>>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>>>   dbtable "dim.dimension_acamp"
>>>>>>> );
>>>>>>>
>>>>>>> After doing this, I can access the PostgreSQL table through the Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>>
>>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>>
>>>>>>> Ben,
>>>>>>>
>>>>>>> I'm not sure I'm following completely.
>>>>>>>
>>>>>>> Is your goal to use Spark to create or access tables in HBase? If so, the link below and several packages out there support that by providing an HBase data source for Spark. There are some examples of what the Spark code looks like in that link as well. On that note, you should also be able to use the HBase data source from a pure SQL (Spark SQL) query, which should work in the case of the Spark SQL JDBC Thrift Server (with USING, http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
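To make "replace org.apache.spark.sql.jdbc with an HBase one" concrete, here is a hedged sketch against one such package, the Hortonworks shc connector; the format string and catalog JSON follow that project's documented usage, and the table and column mapping are placeholders borrowed from Mich's example.

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Maps the HBase row key and columns to Spark SQL fields (shc catalog format).
val catalog = """{
  "table":{"namespace":"default", "name":"tsco"},
  "rowkey":"key",
  "columns":{
    "rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
    "close":{"cf":"stock_daily", "col":"close", "type":"string"},
    "volume":{"cf":"stock_daily", "col":"volume", "type":"string"}
  }
}"""

val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// Once registered, the table is reachable over the Spark SQL thriftserver.
df.createOrReplaceTempView("tsco")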
>>>>>>> _____________________________
>>>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>>>> Cc: <user@spark.apache.org>
>>>>>>>
>>>>>>> Felix,
>>>>>>>
>>>>>>> The only alternative way is to create a stored procedure (a UDF, in database terms) that would run Spark Scala code underneath. In this way, I can use the Spark SQL JDBC Thriftserver to execute it using SQL code, passing the key/values I want to UPSERT. I wonder if this is possible, since I cannot CREATE a wrapper table on top of an HBase table in Spark SQL.
>>>>>>>
>>>>>>> What do you think? Is this the right approach?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>>
>>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>>>>>>>
>>>>>>> HBase has released support for Spark:
>>>>>>> hbase.apache.org/book.html#spark
>>>>>>>
>>>>>>> And if you search, you should find several alternative approaches.
>>>>>>>
>>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>>>
>>>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I know that in Hive we are able to create tables on top of an underlying HBase table, which can then be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to set up a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load-balance the thriftservers. In addition, this will benefit us by giving us a way to abstract the data storage layer away from the presentation-layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
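On the GET side of the original question, the Spark SQL thriftserver speaks the HiveServer2 protocol, so a REST endpoint or web app can reach it with the standard Hive JDBC driver. A minimal sketch; host, port, credentials, and table are placeholders.

import java.sql.DriverManager

// jdbc:hive2:// URLs work against the Spark SQL thriftserver as well as HiveServer2.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT COUNT(*) FROM tsco")
while (rs.next()) println(rs.getLong(1))
rs.close(); stmt.close(); conn.close()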