Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
Please keep also in mind that Tableau Server has the capability to store data in-memory and refresh the in-memory data only when needed. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich has already provided a good alternative. However, you should check whether it contains recent versions of HBase and Phoenix. That being said, I just wonder what the dataflow, the data model and the analysis you plan to do look like. Maybe completely different solutions are possible. In particular, single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well.

Hive with LLAP will provide an in-memory cache for interactive analytics. You can also put full tables in memory with Hive using the Ignite HDFS in-memory solution. All of this only makes sense if you do not use MR as the engine and use the right input format (ORC, Parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark, using CDH 5.8 as our distro. And Tableau has released a Spark ODBC/JDBC driver too. I will either try the Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben


On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Sure. But essentially you are looking at batch data for analytics for your Tableau users, so Hive may be a better choice with its rich SQL and its existing ODBC/JDBC connectivity to Tableau.

I would go for Hive, especially as the new release will have an in-memory offering as well for frequently accessed data :)


Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
Mich,

First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that act as ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data center. When not colocated in the same data center, they will connect to a database server located there using JDBC. Either way, by using JDBC everywhere we simplify and unify the code around the JDBC industry standard.

Does this make sense?

Thanks,
Ben
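
(A minimal sketch of what such a JDBC client could look like against a Spark Thrift Server or any other HiveServer2-compatible endpoint; the host, user and the "segments" table below are placeholders, not details from this thread.)

  import java.sql.DriverManager

  // Hypothetical endpoint: a Spark Thrift Server speaking the HiveServer2 protocol on its default port.
  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "etl_user", "")
  val rs   = conn.createStatement().executeQuery("SELECT COUNT(*) FROM segments")
  while (rs.next()) println(rs.getLong(1))
  conn.close()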


On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Like any other design, what are your presentation layer and end users?

Are they SQL-centric users from a Tableau background, or might they use Spark functional programming?

It is best to describe the use case.

HTH


On 8 October 2016 at 19:40, Felix Cheung <felixcheun...@hotmail.com> wrote:
I wouldn't be too surprised if Spark SQL -> JDBC data source -> Phoenix JDBC server -> HBase worked better.

Without naming specifics, there are at least 4 or 5 different implementations of HBase sources, each at a varying level of development and with different requirements (HBase release version, Kerberos support, etc.).
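
(A hedged sketch of that chain, assuming a Phoenix Query Server is running and an existing SparkSession named spark, e.g. from spark-shell; the host name, the thin-driver class and the quoting of the table name are assumptions, not something confirmed in this thread.)

  // Spark SQL -> generic JDBC data source -> Phoenix Query Server -> HBase.
  // "phoenix-qs:8765" and the table name are placeholders; the quotes keep the
  // lowercase table name as-is for Phoenix.
  val stocks = spark.read
    .format("jdbc")
    .option("url", "jdbc:phoenix:thin:url=http://phoenix-qs:8765;serialization=PROTOBUF")
    .option("driver", "org.apache.phoenix.queryserver.client.Driver")
    .option("dbtable", "\"tsco\"")
    .load()

  stocks.createOrReplaceTempView("tsco")
  spark.sql("SELECT ticker, count(*) FROM tsco GROUP BY ticker").show()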


_________
From: Benjamin Kim <bbuil...@gmail.com>
Sent: Saturday, October 8, 2016 11:26 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Mich Talebzadeh <mich.talebza...@gmail.com>
Cc: <user@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>



Mich,

Are you talking about the Phoenix JDBC Server? If so, I forgot about that 
alternative.

Thanks,
Ben


On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

I don't think it will work.

You can use Phoenix on top of HBase:

hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
ROW                COLUMN+CELL
 TSCO-1-Apr-08     column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
 TSCO-1-Apr-08     column=stock_daily:close, timestamp=1475866783376, value=405.25
 TSCO-1-Apr-08     column=stock_daily:high, timestamp=1475866783376, value=406.75
 TSCO-1-Apr-08     column=stock_daily:low, timestamp=1475866783376, value=379.25
 TSCO-1-Apr-08     column=stock_daily:open, timestamp=1475866783376, value=380.00
 TSCO-1-Apr-08     column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
 TSCO-1-Apr-08     column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
 TSCO-1-Apr-08     column=stock_daily:volume, timestamp=1475866783376, value=49664486

And the same on Phoenix on top of the HBase table:

0: jdbc:phoenix:thin:url=http://rhes564:8765> select
substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close",
"high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume",
(to_number("low")+to_number("high"))/2 AS "AverageDailyPrice"
from "tsco"
where to_number("volume") > 0 and "high" != '-'
and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
order by to_date("Date",'dd-MMM-yy') lim


Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
Skip Phoenix

On Oct 17, 2016, at 2:20 PM, Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:

Ben,

Also look at Phoenix (Apache project) which provides a better (one of the best) 
SQL/JDBC layer on top of HBase.
http://phoenix.apache.org/

Cheers,
Jayesh


From: vincent gromakowski <vincent.gromakow...@gmail.com>
Date: Monday, October 17, 2016 at 1:53 PM
To: Benjamin Kim <bbuil...@gmail.com>
Cc: Michael Segel <msegel_had...@hotmail.com>, Jörn Franke <jornfra...@gmail.com>, Mich Talebzadeh <mich.talebza...@gmail.com>, Felix Cheung <felixcheun...@hotmail.com>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Spark SQL Thriftserver with HBase

Instead of (or in addition to) saving results somewhere, you just start a thriftserver that exposes the Spark tables of the SQLContext (or SparkSession now). That means you can implement any logic (and maybe use Structured Streaming) to expose your data. Today, using the thriftserver means reading data from the persistent store on every query, so if the data modelling doesn't fit the query it can be quite slow. What you generally do in a common Spark job is to load the data and cache the Spark table in an in-memory columnar table, which is quite efficient for any kind of query; the counterpart is that the cache isn't updated, so you have to implement a reload mechanism, and that solution isn't available with the stock thriftserver.

What I propose is to mix the two worlds: periodically (or by delta) load data into the Spark table cache and expose it through the thriftserver. But you have to implement the loading logic, which can be anywhere from very simple to very complex depending on your needs.
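
(A rough sketch of that mix-and-match, assuming Spark 2.x with Hive support; the path, table name and refresh interval are placeholders, and HiveThriftServer2.startWithContext is the not-really-public entry point being discussed here.)

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

  val spark = SparkSession.builder()
    .appName("cached-analytics-endpoint")
    .enableHiveSupport()
    .getOrCreate()

  // Placeholder load: this could just as well be a Phoenix/JDBC read, Kudu, etc.
  def reload(): Unit = {
    spark.read.parquet("/data/landing/segments").createOrReplaceTempView("segments")
    spark.sql("CACHE TABLE segments")                  // pull it into the in-memory columnar cache
  }

  reload()
  HiveThriftServer2.startWithContext(spark.sqlContext)  // JDBC/ODBC clients can now query "segments"

  // Naive refresh loop; a real job would schedule this or trigger it on new data.
  while (true) {
    Thread.sleep(15 * 60 * 1000)
    spark.sql("UNCACHE TABLE segments")
    reload()
  }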


2016-10-17 19:48 GMT+02:00 Benjamin Kim <bbuil...@gmail.com>:
Is this technique similar to what Kinesis is offering or what Structured 
Streaming is going to have eventually?

Just curious.

Cheers,
Ben


On Oct 17, 2016, at 10:14 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:

I would suggest coding your own Spark thriftserver, which seems to be very easy:
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic, because it's a Spark job, and then start a thrift server on a temporary table. For example you can query a micro-batch RDD from a Kafka stream, or pre-load some tables and implement a rolling cache to periodically update the Spark in-memory tables from the persistent store...

It's not part of the public API and I don't know yet what the issues with doing this are, but I think the Spark community should look at this path: making the thriftserver instantiable in any Spark job.
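
(For the micro-batch case mentioned above, a hedged sketch of the same startWithContext idea; the socket source stands in for a Kafka direct stream, and the host, port and view name are invented for illustration.)

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val spark = SparkSession.builder().appName("stream-behind-thriftserver").enableHiveSupport().getOrCreate()
  import spark.implicits._

  val ssc   = new StreamingContext(spark.sparkContext, Seconds(10))
  val lines = ssc.socketTextStream("localhost", 9999)   // stand-in for a Kafka stream

  lines.foreachRDD { rdd =>
    // Each micro-batch replaces the view that JDBC clients see.
    rdd.toDF("line").createOrReplaceTempView("latest_batch")
  }

  HiveThriftServer2.startWithContext(spark.sqlContext)
  ssc.start()
  ssc.awaitTermination()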

2016-10-17 18:17 GMT+02:00 Michael Segel <msegel_had...@hotmail.com>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…):

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, it’s probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, into a relational engine… it’s not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.
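
(One hedged way to act on that: land a periodic snapshot as Parquet and point the analytic queries at it rather than at HBase. This assumes an existing SparkSession named spark and some already-loaded "stocks" table; the names and path are made up.)

  // Write a batch snapshot as Parquet and expose it for SQL/Tableau-style access.
  val stocks = spark.table("stocks")
  stocks.write.mode("overwrite").parquet("/analytics/stocks_snapshot")

  spark.read.parquet("/analytics/stocks_snapshot").createOrReplaceTempView("stocks_snapshot")
  spark.sql("SELECT ticker, max(high) FROM stocks_snapshot GROUP BY ticker").show()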

One thing missing is secondary indexing and the query optimizations that you have in RDBMSs but that are lacking in HBase / MapR-DB / etc., so your performance will vary.

With respect to Tableau… their entire interface into the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA with respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use case and architecture, the ID generation needs to happen local to the ad servers, where we generate a unique ID and store it in an ID-linking table. Even better, many of the 3rd-party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. The same goes for the REST endpoints: 3rd-party services will hit ours to update our data, with no need to read from our data; and when we want to update their data, we will hit theirs using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the suggestions.
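
(Purely illustrative of the one-way flow described above; nothing here is code from the thread, and the broker, topic and IDs are invented.)

  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

  // An ad server publishing one ID-linking event onto the Kafka "bus".
  val props = new Properties()
  props.put("bootstrap.servers", "kafka-broker:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  producer.send(new ProducerRecord("id-linking", "local-uuid-123", "third-party-id-456"))
  producer.close()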


Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Mich Talebzadeh
Ben,



> Also look at Phoenix (Apache project) which provides a better (one of the best) SQL/JDBC layer on top of HBase.
> http://phoenix.apache.org/

I am afraid this does not work with Spark 2!


Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Thakrar, Jayesh
Ben,

Also look at Phoenix (Apache project) which provides a better (one of the best) 
SQL/JDBC layer on top of HBase.
http://phoenix.apache.org/

Cheers,
Jayesh


From: vincent gromakowski <vincent.gromakow...@gmail.com>
Date: Monday, October 17, 2016 at 1:53 PM
To: Benjamin Kim <bbuil...@gmail.com>
Cc: Michael Segel <msegel_had...@hotmail.com>, Jörn Franke 
<jornfra...@gmail.com>, Mich Talebzadeh <mich.talebza...@gmail.com>, Felix 
Cheung <felixcheun...@hotmail.com>, "user@spark.apache.org" 
<user@spark.apache.org>
Subject: Re: Spark SQL Thriftserver with HBase

Instead of (or additionally to) saving results somewhere, you just start a 
thriftserver that expose the Spark tables of the SQLContext (or SparkSession 
now). That means you can implement any logic (and maybe use structured 
streaming) to expose your data. Today using the thriftserver means reading data 
from the persistent store every query, so if the data modeling doesn't fit the 
query it can be quite long.  What you generally do in a common spark job is to 
load the data and cache spark table in a in-memory columnar table which is 
quite efficient for any kind of query, the counterpart is that the cache isn't 
updated you have to implement a reload mechanism, and this solution isn't 
available using the thriftserver.
What I propose is to mix the two world: periodically/delta load data in spark 
table cache and expose it through the thriftserver. But you have to implement 
the loading logic, it can be very simple to very complex depending on your 
needs.


2016-10-17 19:48 GMT+02:00 Benjamin Kim 
<bbuil...@gmail.com<mailto:bbuil...@gmail.com>>:
Is this technique similar to what Kinesis is offering or what Structured 
Streaming is going to have eventually?

Just curious.

Cheers,
Ben


On Oct 17, 2016, at 10:14 AM, vincent gromakowski 
<vincent.gromakow...@gmail.com<mailto:vincent.gromakow...@gmail.com>> wrote:

I would suggest to code your own Spark thriftserver which seems to be very easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic 
because it's a spark job and then start a thrift server on temporary table. For 
example you can query a micro batch rdd from a kafka stream, or pre load some 
tables and implement a rolling cache to periodically update the spark in memory 
tables with persistent store...
It's not part of the public API and I don't know yet what are the issues doing 
this but I think Spark community should look at this path: making the 
thriftserver be instantiable in any spark job.

2016-10-17 18:17 GMT+02:00 Michael Segel 
<msegel_had...@hotmail.com<mailto:msegel_had...@hotmail.com>>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface in to the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
<bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in a ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This is 
also goes for the REST Endpoints. 3rd party services will hit ours to update 
our data with no need to read from our data. And, when we want to update their 
data, we will hit theirs to update their data using a triggered job.

This al boils down to just integrating with Kafka.

Once again, thanks for all th

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Mich Talebzadeh
ying on Spark using
>>>> CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver
>>>> too. I will either try Phoenix JDBC Server for HBase or push to move faster
>>>> to Kudu with Impala. We will use Impala as the JDBC in-between until the
>>>> Kudu team completes Spark SQL support for JDBC.
>>>>
>>>> Thanks for the advice.
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>
>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> Sure. But essentially you are looking at batch data for analytics for
>>>> your tableau users so Hive may be a better choice with its rich SQL and
>>>> ODBC.JDBC connection to Tableau already.
>>>>
>>>> I would go for Hive especially the new release will have an in-memory
>>>> offering as well for frequently accessed data :)
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>
>>>>> Mich,
>>>>>
>>>>> First and foremost, we have visualization servers that run Tableau for
>>>>> external user reports. Second, we have servers that are ad servers and 
>>>>> REST
>>>>> endpoints for cookie sync and segmentation data exchange. These will use
>>>>> JDBC directly within the same data-center. When not colocated in the same
>>>>> data-center, they will connected to a located database server using JDBC.
>>>>> Either way, by using JDBC everywhere, it simplifies and unifies the code 
>>>>> on
>>>>> the JDBC industry standard.
>>>>>
>>>>> Does this make sense?
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Like any other design what is your presentation layer and end users?
>>>>>
>>>>> Are they SQL centric users from Tableau background or they may use
>>>>> spark functional programming.
>>>>>
>>>>> It is best to describe the use case.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 8 October 2016 at 19:40, Felix Cheung <felixcheun...@hotmail.com>
>>>>> wrote:
>>>>>
>>>>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix
>>>>>> JDBC server - HBASE would work better.
>>>>>>
>>>>>> Without naming specifics, there are at least 4 or 5 different
>>>>>> implementations of HBASE sources, each at varying level 

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread vincent gromakowski
>> store data in-memory and refresh only when needed the in-memory data. This
>> means you can import it from any source and let your users work only on the
>> in-memory data in Tableau Server.
>>
>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich
>>> provided already a good alternative. However, you should check if it
>>> contains a recent version of Hbase and Phoenix. That being said, I just
>>> wonder what is the dataflow, data model and the analysis you plan to do.
>>> Maybe there are completely different solutions possible. Especially these
>>> single inserts, upserts etc. should be avoided as much as possible in the
>>> Big Data (analysis) world with any technology, because they do not perform
>>> well.
>>>
>>> Hive with Llap will provide an in-memory cache for interactive
>>> analytics. You can put full tables in-memory with Hive using Ignite HDFS
>>> in-memory solution. All this does only make sense if you do not use MR as
>>> an engine, the right input format (ORC, parquet) and a recent Hive version.
>>>
>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>> Mich,
>>>
>>> Unfortunately, we are moving away from Hive and unifying on Spark using
>>> CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver
>>> too. I will either try Phoenix JDBC Server for HBase or push to move faster
>>> to Kudu with Impala. We will use Impala as the JDBC in-between until the
>>> Kudu team completes Spark SQL support for JDBC.
>>>
>>> Thanks for the advice.
>>>
>>> Cheers,
>>> Ben
>>>
>>>
>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> Sure. But essentially you are looking at batch data for analytics for
>>> your tableau users so Hive may be a better choice with its rich SQL and
>>> ODBC.JDBC connection to Tableau already.
>>>
>>> I would go for Hive especially the new release will have an in-memory
>>> offering as well for frequently accessed data :)
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>>> Mich,
>>>>
>>>> First and foremost, we have visualization servers that run Tableau for
>>>> external user reports. Second, we have servers that are ad servers and REST
>>>> endpoints for cookie sync and segmentation data exchange. These will use
>>>> JDBC directly within the same data-center. When not colocated in the same
>>>> data-center, they will connected to a located database server using JDBC.
>>>> Either way, by using JDBC everywhere, it simplifies and unifies the code on
>>>> the JDBC industry standard.
>>>>
>>>> Does this make sense?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> Like any other design what is your presentation layer and end users?
>>>>
>>>> Are they SQL centric users from Tableau background or they may use
>>>> spark functional programming.
>>>>
>>>> It is best to describe the use case.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Benjamin Kim
 as much as possible in the Big Data 
>>> (analysis) world with any technology, because they do not perform well. 
>>> 
>>> Hive with Llap will provide an in-memory cache for interactive analytics. 
>>> You can put full tables in-memory with Hive using Ignite HDFS in-memory 
>>> solution. All this does only make sense if you do not use MR as an engine, 
>>> the right input format (ORC, parquet) and a recent Hive version.
>>> 
>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>>> Mich,
>>>> 
>>>> Unfortunately, we are moving away from Hive and unifying on Spark using 
>>>> CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver 
>>>> too. I will either try Phoenix JDBC Server for HBase or push to move 
>>>> faster to Kudu with Impala. We will use Impala as the JDBC in-between 
>>>> until the Kudu team completes Spark SQL support for JDBC.
>>>> 
>>>> Thanks for the advice.
>>>> 
>>>> Cheers,
>>>> Ben
>>>> 
>>>> 
>>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>>>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>>>> 
>>>>> Sure. But essentially you are looking at batch data for analytics for 
>>>>> your tableau users so Hive may be a better choice with its rich SQL and 
>>>>> ODBC.JDBC connection to Tableau already.
>>>>> 
>>>>> I would go for Hive especially the new release will have an in-memory 
>>>>> offering as well for frequently accessed data :)
>>>>> 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>>  
>>>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>>> 
>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>>>> loss, damage or destruction of data or any other property which may arise 
>>>>> from relying on this email's technical content is explicitly disclaimed. 
>>>>> The author will in no case be liable for any monetary damages arising 
>>>>> from such loss, damage or destruction.
>>>>>  
>>>>> 
>>>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com 
>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>> Mich,
>>>>> 
>>>>> First and foremost, we have visualization servers that run Tableau for 
>>>>> external user reports. Second, we have servers that are ad servers and 
>>>>> REST endpoints for cookie sync and segmentation data exchange. These will 
>>>>> use JDBC directly within the same data-center. When not colocated in the 
>>>>> same data-center, they will connected to a located database server using 
>>>>> JDBC. Either way, by using JDBC everywhere, it simplifies and unifies the 
>>>>> code on the JDBC industry standard.
>>>>> 
>>>>> Does this make sense?
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
>>>>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>>>>> 
>>>>>> Like any other design what is your presentation layer and end users?
>>>>>> 
>>>>>> Are they SQL centric users from Tableau background or they may use spark 
>>>>>> functional programming.
>>>>>> 
>>>>>> It is best to describe the use case.
>>>>>> 
>>>>>> HTH
>>>>>> 
>>>>>> Dr Mich Talebzadeh
>>>>>>  
>>>>>> LinkedIn  
>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>  
>>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>>>  
>>>>>> http://talebzadehmich.wordpress.com 
>>>

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread vincent gromakowski
tween until the
>> Kudu team completes Spark SQL support for JDBC.
>>
>> Thanks for the advice.
>>
>> Cheers,
>> Ben
>>
>>
>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Sure. But essentially you are looking at batch data for analytics for
>> your tableau users so Hive may be a better choice with its rich SQL and
>> ODBC.JDBC connection to Tableau already.
>>
>> I would go for Hive especially the new release will have an in-memory
>> offering as well for frequently accessed data :)
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 8 October 2016 at 20:15, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Mich,
>>>
>>> First and foremost, we have visualization servers that run Tableau for
>>> external user reports. Second, we have servers that are ad servers and REST
>>> endpoints for cookie sync and segmentation data exchange. These will use
>>> JDBC directly within the same data-center. When not colocated in the same
>>> data-center, they will connected to a located database server using JDBC.
>>> Either way, by using JDBC everywhere, it simplifies and unifies the code on
>>> the JDBC industry standard.
>>>
>>> Does this make sense?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> Like any other design what is your presentation layer and end users?
>>>
>>> Are they SQL centric users from Tableau background or they may use spark
>>> functional programming.
>>>
>>> It is best to describe the use case.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 8 October 2016 at 19:40, Felix Cheung <felixcheun...@hotmail.com>
>>> wrote:
>>>
>>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC
>>>> server - HBASE would work better.
>>>>
>>>> Without naming specifics, there are at least 4 or 5 different
>>>> implementations of HBASE sources, each at varying level of development and
>>>> different requirements (HBASE release version, Kerberos support etc)
>>>>
>>>>
>>>> _
>>>> From: Benjamin Kim <bbuil...@gmail.com>
>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>> To: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> Cc: <user@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>
>>>>
>>>>
>>>>
>>>> Mich,
>>>>
>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about
>>>> that alternative.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> I don't think it will work

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
lt;bbuil...@gmail.com<mailto:bbuil...@gmail.com>> wrote:
Mich,

First and foremost, we have visualization servers that run Tableau for external 
user reports. Second, we have servers that are ad servers and REST endpoints 
for cookie sync and segmentation data exchange. These will use JDBC directly 
within the same data-center. When not colocated in the same data-center, they 
will connected to a located database server using JDBC. Either way, by using 
JDBC everywhere, it simplifies and unifies the code on the JDBC industry 
standard.

Does this make sense?

Thanks,
Ben


On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh 
<mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>> wrote:

Like any other design what is your presentation layer and end users?

Are they SQL centric users from Tableau background or they may use spark 
functional programming.

It is best to describe the use case.

HTH

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 8 October 2016 at 19:40, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC server 
- HBASE would work better.

Without naming specifics, there are at least 4 or 5 different implementations 
of HBASE sources, each at varying level of development and different 
requirements (HBASE release version, Kerberos support etc)


_
From: Benjamin Kim <bbuil...@gmail.com<mailto:bbuil...@gmail.com>>
Sent: Saturday, October 8, 2016 11:26 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Mich Talebzadeh 
<mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>>
Cc: <user@spark.apache.org<mailto:user@spark.apache.org>>, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>



Mich,

Are you talking about the Phoenix JDBC Server? If so, I forgot about that 
alternative.

Thanks,
Ben


On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh 
<mich.talebza...@gmail.com<mailto:mich.talebza...@gmail.com>> wrote:

I don't think it will work

You can use Phoenix on top of HBase

hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
ROW   COLUMN+CELL
 TSCO-1-Apr-08    column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
 TSCO-1-Apr-08    column=stock_daily:close, timestamp=1475866783376, value=405.25
 TSCO-1-Apr-08    column=stock_daily:high, timestamp=1475866783376, value=406.75
 TSCO-1-Apr-08    column=stock_daily:low, timestamp=1475866783376, value=379.25
 TSCO-1-Apr-08    column=stock_daily:open, timestamp=1475866783376, value=380.00
 TSCO-1-Apr-08    column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
 TSCO-1-Apr-08    column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
 TSCO-1-Apr-08    column=stock_daily:volume, timestamp=1475866783376, value=49664486

And the same via Phoenix on top of the HBase table:

0: jdbc:phoenix:thin:url=http://rhes564:8765> select
substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close",
"high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume",
(to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco"
where to_number("volume") > 0 and "high" != '-'
and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
order by to_date("Date",'dd-MMM-yy') limit 1;
+-------------+---------------+--------------+-------------+--------------+----------+------------+---------------------+
|  TRADEDATE  |  Day's close  |  Day's High  |  Day's Low  |  Day's Open  |  ticker  |   volume   |  AverageDailyPrice  |
+-------------+---------------+--------------+-------------+--------------+----------+------------+---------------------+
| 2015-10-07  | 197.00        | 198.05       | 184.84      | 192.20       | TSCO     | 30046994   | 191.445             |
+-------------+---------------+--------------+-------------+--------------+----------+------------+---------------------+


HTH
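
For illustration only, a minimal Spark-side sketch of the route Felix suggests
(Spark SQL - JDBC data source - Phoenix Query Server - HBase). It is a sketch under
assumptions, not something from the thread: the driver class, the queryserver URL,
the serialization option and the Spark 2.x API all depend on the Phoenix and Spark
versions actually installed.

// Sketch: read the Phoenix "tsco" table through Spark's generic JDBC data source.
// Assumes the Phoenix thin-client jar is on the classpath and the Phoenix Query
// Server is listening on rhes564:8765, as in the prompt above.
import org.apache.spark.sql.SparkSession

object PhoenixOverJdbcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("phoenix-over-jdbc").getOrCreate()

    val tsco = spark.read
      .format("jdbc")
      .option("url", "jdbc:phoenix:thin:url=http://rhes564:8765;serialization=PROTOBUF")
      .option("driver", "org.apache.phoenix.queryserver.client.Driver")
      .option("dbtable", "\"tsco\"")        // quoted: the Phoenix table name is lower case
      .load()

    // register the table so JDBC/ODBC clients of a Thrift Server sharing this session can see it
    tsco.createOrReplaceTempView("tsco")
    spark.sql("select `ticker`, `close` from tsco limit 10").show()
  }
}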




Dr Mic

Re: Spark SQL Thriftserver

2016-09-14 Thread Mich Talebzadeh
Actually this is what it says

Connecting to jdbc:hive2://rhes564:10055
Connected to: Spark SQL (version 2.0.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive

So it uses Spark SQL. However, they do not seem to have upgraded the Beeline
version from 1.2.1.

It is a useful tool with Zeppelin.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 September 2016 at 00:55, ayan guha  wrote:

> Hi
>
> AFAIK STS uses Spark SQL and not Map Reduce. Is that not correct?
>
> Best
> Ayan
>
> On Wed, Sep 14, 2016 at 8:51 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> STS will rely on Hive execution engine. My Hive uses Spark execution
>> engine so STS will pass the SQL to Hive and let it do the work and return
>> the result set
>>
>>  which beeline
>> /usr/lib/spark-2.0.0-bin-hadoop2.6/bin/beeline
>> ${SPARK_HOME}/bin/beeline -u jdbc:hive2://rhes564:10055 -n hduser -p
>> 
>> Connecting to jdbc:hive2://rhes564:10055
>> Connected to: Spark SQL (version 2.0.0)
>> Driver: Hive JDBC (version 1.2.1.spark2)
>> Transaction isolation: TRANSACTION_REPEATABLE_READ
>> Beeline version 1.2.1.spark2 by Apache Hive
>> 0: jdbc:hive2://rhes564:10055>
>>
>> jdbc:hive2://rhes564:10055> select count(1) from test.prices;
>> Ok, I did a simple query in STS. You will see this in hive.log
>>
>> 2016-09-13T23:44:50,996 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217
>> get_database: test
>> 2016-09-13T23:44:50,996 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_database: test
>> 2016-09-13T23:44:50,998 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
>> db=test tbl=prices
>> 2016-09-13T23:44:50,998 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
>> tbl=prices
>> 2016-09-13T23:44:51,007 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
>> db=test tbl=prices
>> 2016-09-13T23:44:51,007 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
>> tbl=prices
>> 2016-09-13T23:44:51,021 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217
>> get_database: test
>> 2016-09-13T23:44:51,021 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_database: test
>> 2016-09-13T23:44:51,023 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
>> db=test tbl=prices
>> 2016-09-13T23:44:51,023 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
>> tbl=prices
>> 2016-09-13T23:44:51,029 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
>> db=test tbl=prices
>> 2016-09-13T23:44:51,029 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
>> tbl=prices
>>
>> I think it is a good idea to switch to the Spark engine (as opposed to MR).
>> My tests proved that Hive on Spark, using DAG and its in-memory offering, runs
>> at least an order of magnitude faster than map-reduce.
>>
>> You can either connect to beeline from $HIVE_HOME/... or beeline from
>> $SPARK_HOME
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this 

Re: Spark SQL Thriftserver

2016-09-13 Thread Takeshi Yamamuro
Hi, all

Spark STS just uses HiveContext inside and does not use MR.
Anyway, Spark STS misses some HiveServer2 functionalities such as HA (See:
https://issues.apache.org/jira/browse/SPARK-11100) and has some known
issues there.
So, you'd better off checking all the jira issues related to STS for
considering the replacement.

// maropu

On Wed, Sep 14, 2016 at 8:55 AM, ayan guha  wrote:

> Hi
>
> AFAIK STS uses Spark SQL and not Map Reduce. Is that not correct?
>
> Best
> Ayan
>
> On Wed, Sep 14, 2016 at 8:51 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> STS will rely on Hive execution engine. My Hive uses Spark execution
>> engine so STS will pass the SQL to Hive and let it do the work and return
>> the result set
>>
>>  which beeline
>> /usr/lib/spark-2.0.0-bin-hadoop2.6/bin/beeline
>> ${SPARK_HOME}/bin/beeline -u jdbc:hive2://rhes564:10055 -n hduser -p
>> 
>> Connecting to jdbc:hive2://rhes564:10055
>> Connected to: Spark SQL (version 2.0.0)
>> Driver: Hive JDBC (version 1.2.1.spark2)
>> Transaction isolation: TRANSACTION_REPEATABLE_READ
>> Beeline version 1.2.1.spark2 by Apache Hive
>> 0: jdbc:hive2://rhes564:10055>
>>
>> jdbc:hive2://rhes564:10055> select count(1) from test.prices;
>> Ok, I did a simple query in STS. You will see this in hive.log
>>
>> 2016-09-13T23:44:50,996 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217
>> get_database: test
>> 2016-09-13T23:44:50,996 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_database: test
>> 2016-09-13T23:44:50,998 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
>> db=test tbl=prices
>> 2016-09-13T23:44:50,998 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
>> tbl=prices
>> 2016-09-13T23:44:51,007 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
>> db=test tbl=prices
>> 2016-09-13T23:44:51,007 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
>> tbl=prices
>> 2016-09-13T23:44:51,021 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217
>> get_database: test
>> 2016-09-13T23:44:51,021 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_database: test
>> 2016-09-13T23:44:51,023 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
>> db=test tbl=prices
>> 2016-09-13T23:44:51,023 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
>> tbl=prices
>> 2016-09-13T23:44:51,029 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
>> db=test tbl=prices
>> 2016-09-13T23:44:51,029 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
>> tbl=prices
>>
>> I think it is a good idea to switch to the Spark engine (as opposed to MR).
>> My tests proved that Hive on Spark, using DAG and its in-memory offering, runs
>> at least an order of magnitude faster than map-reduce.
>>
>> You can either connect to beeline from $HIVE_HOME/... or beeline from
>> $SPARK_HOME
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 13 September 2016 at 23:28, Benjamin Kim  wrote:
>>
>>> Mich,
>>>
>>> It sounds like there would be no harm in changing, then. Are you
>>> saying that using STS would still use MapReduce to run the SQL statements?
>>> What our users are doing in our CDH 5.7.2 installation is changing the
>>> execution engine to Spark when connected to HiveServer2 to get faster
>>> results. Would they still have to do this using 

Re: Spark SQL Thriftserver

2016-09-13 Thread ayan guha
Hi

AFAIK STS uses Spark SQL and not Map Reduce. Is that not correct?

Best
Ayan

On Wed, Sep 14, 2016 at 8:51 AM, Mich Talebzadeh 
wrote:

> STS will rely on Hive execution engine. My Hive uses Spark execution
> engine so STS will pass the SQL to Hive and let it do the work and return
> the result set
>
>  which beeline
> /usr/lib/spark-2.0.0-bin-hadoop2.6/bin/beeline
> ${SPARK_HOME}/bin/beeline -u jdbc:hive2://rhes564:10055 -n hduser -p
> 
> Connecting to jdbc:hive2://rhes564:10055
> Connected to: Spark SQL (version 2.0.0)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 1.2.1.spark2 by Apache Hive
> 0: jdbc:hive2://rhes564:10055>
>
> jdbc:hive2://rhes564:10055> select count(1) from test.prices;
> Ok, I did a simple query in STS. You will see this in hive.log
>
> 2016-09-13T23:44:50,996 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217
> get_database: test
> 2016-09-13T23:44:50,996 INFO  [pool-4-thread-4]: HiveMetaStore.audit
> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
> ip=50.140.197.217   cmd=source:50.140.197.217 get_database: test
> 2016-09-13T23:44:50,998 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
> db=test tbl=prices
> 2016-09-13T23:44:50,998 INFO  [pool-4-thread-4]: HiveMetaStore.audit
> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
> tbl=prices
> 2016-09-13T23:44:51,007 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
> db=test tbl=prices
> 2016-09-13T23:44:51,007 INFO  [pool-4-thread-4]: HiveMetaStore.audit
> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
> tbl=prices
> 2016-09-13T23:44:51,021 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217
> get_database: test
> 2016-09-13T23:44:51,021 INFO  [pool-4-thread-4]: HiveMetaStore.audit
> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
> ip=50.140.197.217   cmd=source:50.140.197.217 get_database: test
> 2016-09-13T23:44:51,023 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
> db=test tbl=prices
> 2016-09-13T23:44:51,023 INFO  [pool-4-thread-4]: HiveMetaStore.audit
> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
> tbl=prices
> 2016-09-13T23:44:51,029 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
> db=test tbl=prices
> 2016-09-13T23:44:51,029 INFO  [pool-4-thread-4]: HiveMetaStore.audit
> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
> tbl=prices
>
> I think it is a good idea to switch to the Spark engine (as opposed to MR). My
> tests proved that Hive on Spark, using DAG and its in-memory offering, runs at
> least an order of magnitude faster than map-reduce.
>
> You can either connect to beeline from $HIVE_HOME/... or beeline from
> $SPARK_HOME
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 13 September 2016 at 23:28, Benjamin Kim  wrote:
>
>> Mich,
>>
>> It sounds like there would be no harm in changing, then. Are you
>> saying that using STS would still use MapReduce to run the SQL statements?
>> What our users are doing in our CDH 5.7.2 installation is changing the
>> execution engine to Spark when connected to HiveServer2 to get faster
>> results. Would they still have to do this using STS? Lastly, we are seeing
>> zombie YARN jobs left behind even after a user disconnects. Are you seeing
>> this happen with STS? If not, then this would be even better.
>>
>> Thanks for your fast reply.
>>
>> Cheers,
>> Ben
>>
>> On Sep 13, 2016, at 3:15 PM, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> Spark Thrift server (STS) still uses hive thrift server. If you look at
>> $SPARK_HOME/sbin/start-thriftserver.sh you will see (mine is Spark 2)
>>
>> function usage {
>>   echo "Usage: 

Re: Spark SQL Thriftserver

2016-09-13 Thread Mich Talebzadeh
STS will rely on Hive execution engine. My Hive uses Spark execution engine
so STS will pass the SQL to Hive and let it do the work and return the
result set

 which beeline
/usr/lib/spark-2.0.0-bin-hadoop2.6/bin/beeline
${SPARK_HOME}/bin/beeline -u jdbc:hive2://rhes564:10055 -n hduser -p

Connecting to jdbc:hive2://rhes564:10055
Connected to: Spark SQL (version 2.0.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://rhes564:10055>

jdbc:hive2://rhes564:10055> select count(1) from test.prices;
Ok, I did a simple query in STS. You will see this in hive.log

2016-09-13T23:44:50,996 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
(HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_database:
test
2016-09-13T23:44:50,996 INFO  [pool-4-thread-4]: HiveMetaStore.audit
(HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
ip=50.140.197.217   cmd=source:50.140.197.217 get_database: test
2016-09-13T23:44:50,998 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
(HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
db=test tbl=prices
2016-09-13T23:44:50,998 INFO  [pool-4-thread-4]: HiveMetaStore.audit
(HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
tbl=prices
2016-09-13T23:44:51,007 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
(HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
db=test tbl=prices
2016-09-13T23:44:51,007 INFO  [pool-4-thread-4]: HiveMetaStore.audit
(HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
tbl=prices
2016-09-13T23:44:51,021 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
(HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_database:
test
2016-09-13T23:44:51,021 INFO  [pool-4-thread-4]: HiveMetaStore.audit
(HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
ip=50.140.197.217   cmd=source:50.140.197.217 get_database: test
2016-09-13T23:44:51,023 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
(HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
db=test tbl=prices
2016-09-13T23:44:51,023 INFO  [pool-4-thread-4]: HiveMetaStore.audit
(HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
tbl=prices
2016-09-13T23:44:51,029 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
(HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
db=test tbl=prices
2016-09-13T23:44:51,029 INFO  [pool-4-thread-4]: HiveMetaStore.audit
(HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
tbl=prices

I think it is a good idea to switch to the Spark engine (as opposed to MR). My
tests proved that Hive on Spark, using DAG and its in-memory offering, runs at
least an order of magnitude faster than map-reduce.

You can either connect to beeline from $HIVE_HOME/... or beeline from
$SPARK_HOME
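
A minimal programmatic sketch of the same connection, assuming only that the Hive
JDBC driver jar shipped with Spark is on the client classpath; the query and the
credentials are placeholders taken from the beeline session above.

// Sketch: connect to the Spark Thrift Server over plain JDBC, exactly as beeline does.
import java.sql.DriverManager

object StsJdbcSketch {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://rhes564:10055", "hduser", "")
    try {
      val rs = conn.createStatement().executeQuery("select count(1) from test.prices")
      while (rs.next()) println(s"count = ${rs.getLong(1)}")
    } finally {
      conn.close()
    }
  }
}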

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 13 September 2016 at 23:28, Benjamin Kim  wrote:

> Mich,
>
> It sounds like there would be no harm in changing, then. Are you
> saying that using STS would still use MapReduce to run the SQL statements?
> What our users are doing in our CDH 5.7.2 installation is changing the
> execution engine to Spark when connected to HiveServer2 to get faster
> results. Would they still have to do this using STS? Lastly, we are seeing
> zombie YARN jobs left behind even after a user disconnects. Are you seeing
> this happen with STS? If not, then this would be even better.
>
> Thanks for your fast reply.
>
> Cheers,
> Ben
>
> On Sep 13, 2016, at 3:15 PM, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> Spark Thrift server (STS) still uses hive thrift server. If you look at
> $SPARK_HOME/sbin/start-thriftserver.sh you will see (mine is Spark 2)
>
> function usage {
>   echo "Usage: ./sbin/start-thriftserver [options] [thrift server options]"
>   pattern="usage"
>   pattern+="\|Spark assembly has been built with Hive"
>   pattern+="\|NOTE: SPARK_PREPEND_CLASSES is set"
>   pattern+="\|Spark Command: "
>   pattern+="\|==="
>   pattern+="\|--help"
>
>
> Indeed when you start STS, you pass hiveconf parameter to it
>
> 

Re: Spark SQL Thriftserver

2016-09-13 Thread Benjamin Kim
Mich,

It sounds like there would be no harm in changing, then. Are you saying 
that using STS would still use MapReduce to run the SQL statements? What our 
users are doing in our CDH 5.7.2 installation is changing the execution engine 
to Spark when connected to HiveServer2 to get faster results. Would they still 
have to do this using STS? Lastly, we are seeing zombie YARN jobs left behind 
even after a user disconnects. Are you seeing this happen with STS? If not, 
then this would be even better.

Thanks for your fast reply.

Cheers,
Ben

> On Sep 13, 2016, at 3:15 PM, Mich Talebzadeh  
> wrote:
> 
> Hi,
> 
> Spark Thrift server (STS) still uses hive thrift server. If you look at 
> $SPARK_HOME/sbin/start-thriftserver.sh you will see (mine is Spark 2)
> 
> function usage {
>   echo "Usage: ./sbin/start-thriftserver [options] [thrift server options]"
>   pattern="usage"
>   pattern+="\|Spark assembly has been built with Hive"
>   pattern+="\|NOTE: SPARK_PREPEND_CLASSES is set"
>   pattern+="\|Spark Command: "
>   pattern+="\|==="
>   pattern+="\|--help"
> 
> 
> Indeed when you start STS, you pass hiveconf parameter to it
> 
> ${SPARK_HOME}/sbin/start-thriftserver.sh \
> --master  \
> --hiveconf hive.server2.thrift.port=10055 \
> 
> and STS bypasses Spark optimiser and uses Hive optimizer and execution 
> engine. You will see this in hive.log file
> 
> So I don't think it is going to make much difference, unless they have 
> recently changed the design of STS.
> 
> HTH
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 13 September 2016 at 22:32, Benjamin Kim  > wrote:
> Does anyone have any thoughts about using Spark SQL Thriftserver in Spark 
> 1.6.2 instead of HiveServer2? We are considering abandoning HiveServer2 for 
> it. Some advice and gotchas would be nice to know.
> 
> Thanks,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 
> 



Re: Spark SQL Thriftserver

2016-09-13 Thread Mich Talebzadeh
Hi,

Spark Thrift server (STS) still uses hive thrift server. If you look at
$SPARK_HOME/sbin/start-thriftserver.sh you will see (mine is Spark 2)

function usage {
  echo "Usage: ./sbin/start-thriftserver [options] [thrift server options]"
  pattern="usage"
  pattern+="\|Spark assembly has been built with Hive"
  pattern+="\|NOTE: SPARK_PREPEND_CLASSES is set"
  pattern+="\|Spark Command: "
  pattern+="\|==="
  pattern+="\|--help"


Indeed when you start STS, you pass hiveconf parameter to it

${SPARK_HOME}/sbin/start-thriftserver.sh \
--master  \
--hiveconf hive.server2.thrift.port=10055 \

and STS bypasses Spark optimiser and uses Hive optimizer and execution
engine. You will see this in hive.log file

So I don't think it is going to make much difference, unless they have
recently changed the design of STS.

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 13 September 2016 at 22:32, Benjamin Kim  wrote:

> Does anyone have any thoughts about using Spark SQL Thriftserver in Spark
> 1.6.2 instead of HiveServer2? We are considering abandoning HiveServer2 for
> it. Some advice and gotchas would be nice to know.
>
> Thanks,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark SQL Thriftserver and Hive UDF in Production

2015-10-19 Thread Deenar Toraskar
Reece

You can do the following. Start the spark-shell. Register the UDFs in the
shell using sqlContext, then start the Thrift Server using startWithContext
from the spark shell:
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L56
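
A minimal sketch of that sequence, to paste into spark-shell. It assumes a Spark 1.x
build with Hive support, where sqlContext is a HiveContext; "myupper" is just an
illustrative UDF name, not one from this thread.

import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// register the UDF first, then start the Thrift Server on the same context so
// JDBC/ODBC clients such as Tableau can call it
sqlContext.udf.register("myupper", (s: String) => s.toUpperCase)
HiveThriftServer2.startWithContext(sqlContext)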



Regards
Deenar

On 19 October 2015 at 04:42, Mohammed Guller  wrote:

> Have you tried registering the function using the Beeline client?
>
> Another alternative would be to create a Spark SQL UDF and launch the
> Spark SQL Thrift server programmatically.
>
> Mohammed
>
> -Original Message-
> From: ReeceRobinson [mailto:re...@therobinsons.gen.nz]
> Sent: Sunday, October 18, 2015 8:05 PM
> To: user@spark.apache.org
> Subject: Spark SQL Thriftserver and Hive UDF in Production
>
> Does anyone have some advice on the best way to deploy a Hive UDF for use
> with a Spark SQL Thriftserver where the client is Tableau using Simba ODBC
> Spark SQL driver.
>
> I have seen the hive documentation that provides an example of creating
> the function using a hive client ie: CREATE FUNCTION myfunc AS 'myclass'
> USING JAR 'hdfs:///path/to/jar';
>
> However using Tableau I can't run this create function statement to
> register my UDF. Ideally there is a configuration setting that will load my
> UDF jar and register it at start-up of the thriftserver.
>
> Can anyone tell me what the best option is, if this is possible?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thriftserver-and-Hive-UDF-in-Production-tp25114.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL Thriftserver and Hive UDF in Production

2015-10-19 Thread Todd Nist
From Tableau, you should be able to use the Initial SQL option to support
this:

So in Tableau add the following to the “Initial SQL”

create function myfunc AS 'myclass'
using jar 'hdfs:///path/to/jar';



HTH,
Todd


On Mon, Oct 19, 2015 at 11:22 AM, Deenar Toraskar  wrote:

> Reece
>
> You can do the following. Start the spark-shell. Register the UDFs in the
> shell using sqlContext, then start the Thrift Server using startWithContext
> from the spark shell:
> https://github.com/apache/spark/blob/master/sql/hive-thriftserver
> /src/main/scala/org/apache/spark/sql/hive/thriftserver
> /HiveThriftServer2.scala#L56
>
>
>
> Regards
> Deenar
>
> On 19 October 2015 at 04:42, Mohammed Guller 
> wrote:
>
>> Have you tried registering the function using the Beeline client?
>>
>> Another alternative would be to create a Spark SQL UDF and launch the
>> Spark SQL Thrift server programmatically.
>>
>> Mohammed
>>
>> -Original Message-
>> From: ReeceRobinson [mailto:re...@therobinsons.gen.nz]
>> Sent: Sunday, October 18, 2015 8:05 PM
>> To: user@spark.apache.org
>> Subject: Spark SQL Thriftserver and Hive UDF in Production
>>
>> Does anyone have some advice on the best way to deploy a Hive UDF for use
>> with a Spark SQL Thriftserver where the client is Tableau using Simba ODBC
>> Spark SQL driver.
>>
>> I have seen the hive documentation that provides an example of creating
>> the function using a hive client ie: CREATE FUNCTION myfunc AS 'myclass'
>> USING JAR 'hdfs:///path/to/jar';
>>
>> However using Tableau I can't run this create function statement to
>> register my UDF. Ideally there is a configuration setting that will load my
>> UDF jar and register it at start-up of the thriftserver.
>>
>> Can anyone tell me what the best option is, if this is possible?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thriftserver-and-Hive-UDF-in-Production-tp25114.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
>> commands, e-mail: user-h...@spark.apache.org
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


RE: Spark SQL Thriftserver and Hive UDF in Production

2015-10-18 Thread Mohammed Guller
Have you tried registering the function using the Beeline client?

Another alternative would be to create a Spark SQL UDF and launch the Spark SQL 
Thrift server programmatically.

Mohammed

-Original Message-
From: ReeceRobinson [mailto:re...@therobinsons.gen.nz] 
Sent: Sunday, October 18, 2015 8:05 PM
To: user@spark.apache.org
Subject: Spark SQL Thriftserver and Hive UDF in Production

Does anyone have some advice on the best way to deploy a Hive UDF for use with 
a Spark SQL Thriftserver, where the client is Tableau using the Simba ODBC Spark SQL 
driver?

I have seen the hive documentation that provides an example of creating the 
function using a hive client ie: CREATE FUNCTION myfunc AS 'myclass' USING JAR 
'hdfs:///path/to/jar';

However using Tableau I can't run this create function statement to register my 
UDF. Ideally there is a configuration setting that will load my UDF jar and 
register it at start-up of the thriftserver.

Can anyone tell me what the best option is, if this is possible?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thriftserver-and-Hive-UDF-in-Production-tp25114.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL ThriftServer Impersonation Support

2015-05-03 Thread Night Wolf
Thanks Andrew. What version of HS2 is the SparkSQL thrift server using?
What would be involved in updating? Is it a simple case of bumping the
dependency version in one of the project POMs?

Cheers,
~N

On Sat, May 2, 2015 at 11:38 AM, Andrew Lee alee...@hotmail.com wrote:

 Hi N,

 See: https://issues.apache.org/jira/browse/SPARK-5159

 I don't think it is yet supported until the HS2 code base is updated in
 Spark hive-thriftserver project.

 --
 Date: Fri, 1 May 2015 15:56:30 +1000
 Subject: Spark SQL ThriftServer Impersonation Support
 From: nightwolf...@gmail.com
 To: user@spark.apache.org


 Hi guys,


 Trying to use the SparkSQL Thriftserver with the Hive metastore. It seems that
 hive metastore impersonation works fine (when running Hive tasks). However, when
 spinning up the SparkSQL Thriftserver, impersonation doesn't seem to work...

 What settings do I need to enable impersonation?

 I've copied the same config as in my hive-site. Here is my launch command
 for the spark thrift server;

 --hiveconf hive.server2.enable.impersonation=true --hiveconf
 hive.server2.enable.doAs=true --hiveconf hive.metastore.execute.setugi=true

 Here is my full run script:

 export HIVE_SERVER2_THRIFT_BIND_HOST=0.0.0.0
 export HIVE_SERVER2_THRIFT_PORT=1

 export HIVE_CONF_DIR=/opt/mapr/hive/hive-0.13/conf/
 export HIVE_HOME=/opt/mapr/hive/hive-0.13/
 export HADOOP_HOME=/opt/mapr/hadoop/hadoop-2.5.1/
 export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop

 export EXECUTOR_MEMORY=30g
 export DRIVER_MEMORY=4g
 export EXECUTOR_CORES=15
 export NUM_EXECUTORS=20
 export KRYO_BUFFER=512
 export SPARK_DRIVER_MAXRESULTSIZE=4096

 export HIVE_METASTORE_URIS=thrift://localhost:9083
 export HIVE_METASTORE_WAREHOUSE_DIR=/user/hive/warehouse

 export
 SPARK_DIST_CLASSPATH=/opt/mapr/lib/*:/opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/yarn/*:/opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/common/lib/*:/opt/mapr/hive/hive-current/lib/*
 export SPARK_LOG_DIR=/tmp/spark-log
 export
 SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=hdfs:///log/spark-events
 export SPARK_CONF_DIR=/apps/spark/global-conf/

 export SPARK_HOME=/apps/spark/spark-1.3.1-bin-mapr4.0.2_yarn_j6_2.10

 export SPARK_LIBRARY_PATH=/opt/mapr/lib/*
 export SPARK_JAVA_OPTS=-Djava.library.path=/opt/mapr/lib


 $SPARK_HOME/*sbin/start-thriftserver.sh* --master yarn-client --jars
 /opt/mapr/lib/libjpam.so --executor-memory $EXECUTOR_MEMORY --driver-memory
 $DRIVER_MEMORY --executor-cores $EXECUTOR_CORES --num-executors
 $NUM_EXECUTORS --conf spark.scheduler.mode=FAIR --conf
 spark.kryoserializer.buffer.mb=$KRYO_BUFFER --conf
 spark.serializer=org.apache.spark.serializer.KryoSerializer --conf
 spark.files.useFetchCache=false --conf
 spark.driver.maxResultSize=$SPARK_DRIVER_MAXRESULTSIZE --hiveconf
 hive.metastore.uris=$HIVE_METASTORE_URIS --hiveconf
 hive.metastore.warehouse.dir=$HIVE_METASTORE_WAREHOUSE_DIR --hiveconf
 hive.server2.enable.impersonation=true --hiveconf
 hive.server2.enable.doAs=true --hiveconf hive.metastore.execute.setugi=true


 Cheers,
 N