Please also keep in mind that Tableau Server can store data in-memory and refresh that in-memory data only when needed. This means you can import data from any source and let your users work only on the in-memory data in Tableau Server.
On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> Cloudera 5.8 has a very old version of Hive without Tez, but Mich already
> provided a good alternative. However, you should check whether it contains
> a recent version of HBase and Phoenix. That being said, I wonder what the
> dataflow, data model and planned analysis are. Maybe completely different
> solutions are possible. In particular, single inserts, upserts, etc.
> should be avoided as much as possible in the Big Data (analysis) world
> with any technology, because they do not perform well.
>
> Hive with LLAP will provide an in-memory cache for interactive analytics.
> You can also put full tables in-memory with Hive using the Ignite HDFS
> in-memory solution. All of this only makes sense if you do not use MR as
> the engine, and if you use the right input format (ORC, Parquet) and a
> recent Hive version.
>
> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com> wrote:
>
> Mich,
>
> Unfortunately, we are moving away from Hive and unifying on Spark using
> CDH 5.8 as our distro. And Tableau released a Spark ODBC/JDBC driver too.
> I will either try the Phoenix JDBC Server for HBase or push to move
> faster to Kudu with Impala. We will use Impala as the JDBC in-between
> until the Kudu team completes Spark SQL support for JDBC.
>
> Thanks for the advice.
>
> Cheers,
> Ben
>
>
> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Sure. But essentially you are looking at batch data for analytics for
> your Tableau users, so Hive may be a better choice, with its rich SQL and
> its existing ODBC/JDBC connection to Tableau.
>
> I would go for Hive, especially since the new release will have an
> in-memory offering for frequently accessed data as well :)
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn *
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising
> from such loss, damage or destruction.
>
>
>
> On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Mich,
>>
>> First and foremost, we have visualization servers that run Tableau for
>> external user reports. Second, we have servers that act as ad servers
>> and REST endpoints for cookie sync and segmentation data exchange. These
>> will use JDBC directly within the same data-center. When not colocated
>> in the same data-center, they will connect to a colocated database
>> server using JDBC. Either way, using JDBC everywhere simplifies and
>> unifies the code on the JDBC industry standard.
>>
>> Does this make sense?
>>
>> Thanks,
>> Ben
>>
>>
>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Like any other design: what are your presentation layer and end users?
>>
>> Are they SQL-centric users from a Tableau background, or might they use
>> Spark functional programming?
>>
>> It is best to describe the use case.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> On 8 October 2016 at 19:40, Felix Cheung <felixcheun...@hotmail.com>
>> wrote:
>>
>>> I wouldn't be too surprised if Spark SQL - JDBC data source - Phoenix
>>> JDBC server - HBASE would work better.
>>>
>>> Without naming specifics, there are at least 4 or 5 different
>>> implementations of HBASE sources, each at a varying level of
>>> development and with different requirements (HBASE release version,
>>> Kerberos support, etc.).
>>>
>>>
>>> _____________________________
>>> From: Benjamin Kim <bbuil...@gmail.com>
>>> Sent: Saturday, October 8, 2016 11:26 AM
>>> Subject: Re: Spark SQL Thriftserver with HBase
>>> To: Mich Talebzadeh <mich.talebza...@gmail.com>
>>> Cc: <user@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>
>>>
>>>
>>>
>>> Mich,
>>>
>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about
>>> that alternative.
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> I don't think it will work.
>>>
>>> You can use Phoenix on top of HBase:
>>>
>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>> ROW            COLUMN+CELL
>>> TSCO-1-Apr-08  column=stock_daily:Date,   timestamp=1475866783376, value=1-Apr-08
>>> TSCO-1-Apr-08  column=stock_daily:close,  timestamp=1475866783376, value=405.25
>>> TSCO-1-Apr-08  column=stock_daily:high,   timestamp=1475866783376, value=406.75
>>> TSCO-1-Apr-08  column=stock_daily:low,    timestamp=1475866783376, value=379.25
>>> TSCO-1-Apr-08  column=stock_daily:open,   timestamp=1475866783376, value=380.00
>>> TSCO-1-Apr-08  column=stock_daily:stock,  timestamp=1475866783376, value=TESCO PLC
>>> TSCO-1-Apr-08  column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>> TSCO-1-Apr-08  column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>
>>> And the same with Phoenix on top of the HBase table:
>>>
>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select
>>> substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate,
>>> "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low",
>>> "open" AS "Day's Open", "ticker", "volume",
>>> (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice"
>>> from "tsco"
>>> where to_number("volume") > 0 and "high" != '-'
>>> and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
>>> order by to_date("Date",'dd-MMM-yy') limit 1;
>>>
>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>
>>> HTH
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> On 8 October 2016 at 19:05, Felix Cheung <felixcheun...@hotmail.com>
>>> wrote:
>>>
>>>> Great, then I think those packages as a Spark data source should allow
>>>> you to do exactly that (replace org.apache.spark.sql.jdbc with the
>>>> HBASE one).
>>>>
>>>> I do think it would be great to get more examples around this, though.
>>>> It would be great if you could share your experience with this!
>>>>
>>>>
>>>> _____________________________
>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>> Cc: <user@spark.apache.org>
>>>>
>>>>
>>>> Felix,
>>>>
>>>> My goal is to use the Spark SQL JDBC Thriftserver to access HBase
>>>> tables using just SQL. I have been able to CREATE tables using this
>>>> statement below in the past:
>>>>
>>>> CREATE TABLE <table-name>
>>>> USING org.apache.spark.sql.jdbc
>>>> OPTIONS (
>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>   dbtable "dim.dimension_acamp"
>>>> );
>>>>
>>>> After doing this, I can access the PostgreSQL table through the Spark
>>>> SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT,
>>>> etc.). I want to do the same with HBase tables. We tried this using
>>>> Hive and HiveServer2, but the response times are just too long.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheun...@hotmail.com>
>>>> wrote:
>>>>
>>>> Ben,
>>>>
>>>> I'm not sure I'm following completely.
>>>>
>>>> Is your goal to use Spark to create or access tables in HBASE? If so,
>>>> the link below and several packages out there support that by
>>>> providing an HBASE data source for Spark. There are some examples of
>>>> what the Spark code looks like in that link as well. On that note, you
>>>> should also be able to use the HBASE data source from a pure SQL
>>>> (Spark SQL) query, which should work in the case of the Spark SQL JDBC
>>>> Thrift Server (with USING,
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
>>>>
>>>>
>>>> _____________________________
>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>> To: Felix Cheung <felixcheun...@hotmail.com>
>>>> Cc: <user@spark.apache.org>
>>>>
>>>>
>>>> Felix,
>>>>
>>>> The only alternative way is to create a stored procedure (UDF), in
>>>> database terms, that would run Spark Scala code underneath. That way,
>>>> I could use the Spark SQL JDBC Thriftserver to execute it using SQL
>>>> code, passing the key/values I want to UPSERT. I wonder if this is
>>>> possible, since I cannot CREATE a wrapper table on top of an HBase
>>>> table in Spark SQL.
>>>>
>>>> What do you think? Is this the right approach?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheun...@hotmail.com>
>>>> wrote:
>>>>
>>>> HBase has released support for Spark:
>>>> hbase.apache.org/book.html#spark
>>>>
>>>> And if you search you should find several alternative approaches.
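For concreteness, an HBase analogue of Ben's PostgreSQL `CREATE TABLE ... USING` statement above might look like the sketch below. This is an assumption, not something demonstrated in the thread: it presumes a third-party HBase connector such as the Hortonworks shc package is on the Thriftserver's classpath, and the `catalog` column mapping for the `tsco` table is illustrative only.

```sql
-- Sketch only (assumes the shc HBase data source is available).
-- The catalog JSON maps the HBase row key and the stock_daily
-- column family, seen in Mich's scan output, to SQL columns.
CREATE TABLE tsco_hbase
USING org.apache.spark.sql.execution.datasources.hbase
OPTIONS (
  catalog '{
    "table":{"namespace":"default", "name":"tsco"},
    "rowkey":"key",
    "columns":{
      "rowkey":{"cf":"rowkey",      "col":"key",    "type":"string"},
      "close": {"cf":"stock_daily", "col":"close",  "type":"string"},
      "high":  {"cf":"stock_daily", "col":"high",   "type":"string"},
      "low":   {"cf":"stock_daily", "col":"low",    "type":"string"},
      "volume":{"cf":"stock_daily", "col":"volume", "type":"string"}
    }
  }'
);
```

Once registered, the table should be queryable with plain SELECTs through the Thriftserver, which is exactly the access pattern Ben describes; whether UPDATE/UPSERT works depends on the connector, per Felix's caveat about the varying maturity of HBase sources.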
>>>>
>>>>
>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <
>>>> bbuil...@gmail.com> wrote:
>>>>
>>>> Does anyone know if Spark can work with HBase tables using Spark SQL?
>>>> I know that in Hive we are able to create tables on top of an
>>>> underlying HBase table that can be accessed using MapReduce jobs. Can
>>>> the same be done using HiveContext or SQLContext? We are trying to set
>>>> up a way to GET and POST data to and from the HBase table using the
>>>> Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP
>>>> web farms. If we can get this to work, then we can load balance the
>>>> thriftservers. In addition, this will benefit us by giving us a way to
>>>> abstract the data storage layer away from the presentation layer code.
>>>> There is a chance that we will swap out the data storage technology in
>>>> the future. We are currently experimenting with Kudu.
>>>>
>>>> Thanks,
>>>> Ben
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
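The route Felix suggests earlier (Spark SQL → generic JDBC data source → Phoenix Query Server → HBase) could be wired up roughly as below. The host and port are taken from Mich's Phoenix example in this thread; the thin-client driver class is Phoenix's standard one, but the exact URL parameters vary by Phoenix version, so treat the whole statement as a sketch to verify.

```sql
-- Sketch: register the Phoenix "tsco" table behind the Spark SQL
-- Thriftserver via the generic JDBC data source. Requires the Phoenix
-- thin-client jar on the Thriftserver classpath.
CREATE TABLE tsco_phoenix
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:phoenix:thin:url=http://rhes564:8765;serialization=PROTOBUF",
  driver "org.apache.phoenix.queryserver.client.Driver",
  dbtable "tsco"
);

-- Any Thriftserver client can then read it:
SELECT * FROM tsco_phoenix LIMIT 10;
```

Clients such as Tableau, or `beeline -u jdbc:hive2://<thriftserver-host>:10000`, would then see `tsco_phoenix` as an ordinary table, which matches the load-balanced, storage-agnostic JDBC layer Ben is after. Note that the generic JDBC source is read-oriented; UPSERTs would still have to go to Phoenix directly.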