I would suggest coding your own Spark Thrift Server, which seems to be quite
easy:
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any
logic, because it is a Spark job, and then start a Thrift server on the
resulting temporary tables. For example, you can query a micro-batch RDD from
a Kafka stream, or pre-load some tables and implement a rolling cache that
periodically refreshes the Spark in-memory tables from a persistent store
(a minimal sketch follows below).
It's not part of the public API, and I don't yet know what issues this
approach raises, but I think the Spark community should look at this path:
making the Thrift server instantiable in any Spark job.
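
A minimal sketch of this idea (Spark 1.x style; note that
HiveThriftServer2.startWithContext is not a public API, and the source path
and table name below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object EmbeddedThriftServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("embedded-thriftserver"))
    val sqlContext = new HiveContext(sc)

    // Load whatever data the job should expose; this could just as well be a
    // Kafka micro-batch or a periodically refreshed snapshot (placeholder path).
    val df = sqlContext.read.parquet("/path/to/snapshot")
    df.cache()
    df.registerTempTable("events") // temporary table visible to JDBC clients

    // Expose this context (and its temporary tables) over JDBC/ODBC.
    HiveThriftServer2.startWithContext(sqlContext)

    // Keep the job alive so the Thrift server stays up; a real job would
    // refresh "events" on a schedule instead of just sleeping.
    Thread.sleep(Long.MaxValue)
  }
}

The same pattern should work from inside a streaming job's foreachRDD,
re-registering the temporary table on every micro-batch.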

2016-10-17 18:17 GMT+02:00 Michael Segel <msegel_had...@hotmail.com>:

> Guys,
> Sorry for jumping in late to the game…
>
> If memory serves (which may not be a good thing…) :
>
> You can use HiveServer2 as a connection point to HBase.
> While this doesn’t perform well, it’s probably the cleanest solution.
> I’m not keen on Phoenix… wouldn’t recommend it….
>
>
> The issue is that you’re trying to make HBase, a key/value object store, into a
> relational engine… it’s not.
>
> There are some considerations which make HBase not ideal for all use cases
> and you may find better performance with Parquet files.
>
> One thing missing is the use of secondary indexing and query optimizations
> that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your
> performance will vary.
>
> With respect to Tableau… their entire interface into the big data world
> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
> part of your solution, you’re DOA with respect to Tableau.
>
> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
> another Apache project)
>
>
> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
> Thanks for all the suggestions. It would seem you guys are right about the
> Tableau side of things. The reports don’t need to be real-time, and they
> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be
> batched to Parquet or Kudu/Impala or even PostgreSQL.
>
> I originally thought that we needed two-way data retrieval from the DMP
> HBase for ID generation, but after further investigation into the use-case
> and architecture, the ID generation needs to happen local to the Ad Servers,
> where we generate a unique ID and store it in an ID linking table. Even
> better, many of the 3rd party services supply this ID. So, data only needs
> to flow in one direction. We will use Kafka as the bus for this. No JDBC
> required. This also goes for the REST endpoints. 3rd party services will
> hit ours to update our data with no need to read from our data. And, when
> we want to update their data, we will hit theirs using a triggered job.
>
> This all boils down to just integrating with Kafka.
>
> Once again, thanks for all the help.
>
> Cheers,
> Ben
>
>
> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Please also keep in mind that Tableau Server has the capability to store
> data in-memory and refresh the in-memory data only when needed. This means
> you can import it from any source and let your users work only on the
> in-memory data in Tableau Server.
>
> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich already
>> provided a good alternative. However, you should check if it contains a
>> recent version of HBase and Phoenix. That being said, I just wonder what
>> dataflow, data model and analysis you plan to do. Maybe there are
>> completely different solutions possible. In particular, single inserts,
>> upserts, etc. should be avoided as much as possible in the Big Data
>> (analysis) world with any technology, because they do not perform well.
>>
>> Hive with LLAP will provide an in-memory cache for interactive analytics.
>> You can put full tables in-memory with Hive using the Ignite HDFS in-memory
>> solution. All of this only makes sense if you do not use MR as the engine,
>> and if you use the right input format (ORC, Parquet) and a recent Hive
>> version.
>>
>> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>> Mich,
>>
>> Unfortunately, we are moving away from Hive and unifying on Spark using
>> CDH 5.8 as our distro. And Tableau has released a Spark ODBC/JDBC driver
>> too. I will either try Phoenix JDBC Server for HBase or push to move faster
>> to Kudu with Impala. We will use Impala as the JDBC in-between until the
>> Kudu team completes Spark SQL support for JDBC.
>>
>> Thanks for the advice.
>>
>> Cheers,
>> Ben
>>
>>
>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Sure. But essentially you are looking at batch data for analytics for your
>> Tableau users, so Hive may be a better choice with its rich SQL and existing
>> ODBC/JDBC connectivity to Tableau.
>>
>> I would go for Hive, especially since the new release will have an in-memory
>> offering for frequently accessed data as well :)
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Mich,
>>>
>>> First and foremost, we have visualization servers that run Tableau for
>>> external user reports. Second, we have servers that are ad servers and REST
>>> endpoints for cookie sync and segmentation data exchange. These will use
>>> JDBC directly within the same data-center. When not colocated in the same
>>> data-center, they will connect to a database server located alongside them
>>> using JDBC. Either way, using JDBC everywhere simplifies and unifies the
>>> code around the JDBC industry standard.
>>>
>>> Does this make sense?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> As with any other design, what are your presentation layer and end users?
>>>
>>> Are they SQL-centric users from a Tableau background, or might they use
>>> Spark functional programming?
>>>
>>> It is best to describe the use case.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>> On 8 October 2016 at 19:40, Felix Cheung <felixcheun...@hotmail.com>
>>> wrote:
>>>
>>>> I wouldn't be too surprised if Spark SQL - JDBC data source - Phoenix JDBC
>>>> server - HBase would work better.
>>>>
>>>> Without naming specifics, there are at least 4 or 5 different
>>>> implementations of HBase sources, each at a varying level of development
>>>> and with different requirements (HBase release version, Kerberos support,
>>>> etc.).
>>>>
>>>>
>>>> _____________________________
>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>> To: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> Cc: <user@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>
>>>>
>>>>
>>>>
>>>> Mich,
>>>>
>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about
>>>> that alternative.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> I don't think it will work.
>>>>
>>>> You can use Phoenix on top of HBase. For example, here is an HBase shell
>>>> scan of the underlying table:
>>>>
>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>> ROW            COLUMN+CELL
>>>>  TSCO-1-Apr-08 column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>  TSCO-1-Apr-08 column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>  TSCO-1-Apr-08 column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>  TSCO-1-Apr-08 column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>  TSCO-1-Apr-08 column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>  TSCO-1-Apr-08 column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>  TSCO-1-Apr-08 column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>  TSCO-1-Apr-08 column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>
>>>> And the same data queried via Phoenix on top of the HBase table:
>>>>
>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select
>>>>   substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate,
>>>>   "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low",
>>>>   "open" AS "Day's Open", "ticker", "volume",
>>>>   (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice"
>>>> from "tsco"
>>>> where to_number("volume") > 0 and "high" != '-'
>>>>   and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
>>>> order by to_date("Date",'dd-MMM-yy') limit 1;
>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 8 October 2016 at 19:05, Felix Cheung <felixcheun...@hotmail.com>
>>>> wrote:
>>>>
>>>>> Great, then I think those packages, as Spark data sources, should allow
>>>>> you to do exactly that (replace org.apache.spark.sql.jdbc with an HBase one).
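>>>>>
>>>>> As a rough, hedged sketch (the data source name and option keys below
>>>>> follow the Apache hbase-spark module and should be treated as assumptions
>>>>> to verify against whichever connector is used), the Scala side could look
>>>>> like:
>>>>>
>>>>> // Map an HBase table as a Spark SQL data source and register it so the
>>>>> // Thrift Server can serve it over JDBC.
>>>>> val hbaseDF = sqlContext.read
>>>>>   .format("org.apache.hadoop.hbase.spark")   // assumed connector class
>>>>>   .option("hbase.table", "tsco")             // assumed option key
>>>>>   .option("hbase.columns.mapping",           // assumed option key
>>>>>     "ROWKEY STRING :key, close STRING stock_daily:close, volume STRING stock_daily:volume")
>>>>>   .load()
>>>>> hbaseDF.registerTempTable("tsco_sql") // queryable through the Thrift Server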
>>>>>
>>>>> I do think it will be great to get more examples around this though.
>>>>> Would be great if you could share your experience with this!
>>>>>
>>>>>
>>>>> _____________________________
>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>> Cc: <user@spark.apache.org>
>>>>>
>>>>>
>>>>> Felix,
>>>>>
>>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables
>>>>> using just SQL. I have been able to CREATE tables using this statement
>>>>> below in the past:
>>>>>
>>>>> CREATE TABLE <table-name>
>>>>> USING org.apache.spark.sql.jdbc
>>>>> OPTIONS (
>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>   dbtable "dim.dimension_acamp"
>>>>> );
>>>>>
>>>>>
>>>>> After doing this, I can access the PostgreSQL table through the Spark SQL
>>>>> JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I
>>>>> want to do the same with HBase tables. We tried this using Hive and
>>>>> HiveServer2, but the response times are just too long.
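>>>>>
>>>>> For reference, a minimal sketch of how a client (for example one of our
>>>>> REST endpoints) might query the Thrift Server over JDBC; the host, port,
>>>>> user and table name are placeholders, and the Hive JDBC driver must be on
>>>>> the classpath:
>>>>>
>>>>> import java.sql.DriverManager
>>>>>
>>>>> Class.forName("org.apache.hive.jdbc.HiveDriver")
>>>>> val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
>>>>> try {
>>>>>   val rs = conn.createStatement().executeQuery("SELECT count(*) FROM dimension_acamp")
>>>>>   while (rs.next()) println(rs.getLong(1)) // row count returned by the Thrift Server
>>>>> } finally {
>>>>>   conn.close()
>>>>> }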
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheun...@hotmail.com>
>>>>> wrote:
>>>>>
>>>>> Ben,
>>>>>
>>>>> I'm not sure I'm following completely.
>>>>>
>>>>> Is your goal to use Spark to create or access tables in HBase? If so,
>>>>> the link below and several packages out there support that by providing
>>>>> an HBase data source for Spark. There are some examples of what the Spark
>>>>> code looks like in that link as well. On that note, you should also be
>>>>> able to use the HBase data source from a pure SQL (Spark SQL) query as
>>>>> well, which should work with the Spark SQL JDBC Thrift Server (with USING,
>>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
>>>>>
>>>>>
>>>>> _____________________________
>>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>>> Cc: <user@spark.apache.org>
>>>>>
>>>>>
>>>>> Felix,
>>>>>
>>>>> The only alternative way is to create a stored procedure (UDF), in
>>>>> database terms, that would run Spark Scala code underneath. That way, I
>>>>> could use the Spark SQL JDBC Thriftserver to execute it with SQL, passing
>>>>> the keys and values I want to UPSERT. I wonder if this is possible, since
>>>>> I cannot CREATE a wrapper table on top of an HBase table in Spark SQL.
>>>>>
>>>>> What do you think? Is this the right approach?
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheun...@hotmail.com>
>>>>> wrote:
>>>>>
>>>>> HBase has released support for Spark
>>>>> hbase.apache.org/book.html#spark
>>>>>
>>>>> And if you search you should find several alternative approaches.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <
>>>>> bbuil...@gmail.com> wrote:
>>>>>
>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I
>>>>> know in Hive we are able to create tables on top of an underlying HBase
>>>>> table that can be accessed using MapReduce jobs. Can the same be done
>>>>> using HiveContext or SQLContext? We are trying to set up a way to GET and
>>>>> POST data to and from the HBase table using the Spark SQL JDBC
>>>>> thriftserver from our RESTful API endpoints and/or HTTP web farms. If we
>>>>> can get this to work, then we can load balance the thriftservers. In
>>>>> addition, this will give us a way to abstract the data storage layer away
>>>>> from the presentation-layer code. There is a chance that we will swap out
>>>>> the data storage technology in the future. We are currently experimenting
>>>>> with Kudu.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>
