Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Gopal Vijayaraghavan

> That being said all systems are evolving. Hive supports tez+llap which
>is basically the in-memory support.

There is a big difference between LLAP & SparkSQL, which has to do with
access-pattern needs.

The first one is related to the lifetime of the cache - the Spark RDD
cache is per-user-session, which allows further operations in that
session to be optimized.

LLAP is designed to be hammered by multiple user sessions running
different queries, and is designed to automate the cache eviction &
selection process. There's no user-visible explicit .cache() to remember -
it's automatic and concurrent.

My team works with both engines, trying to improve them for ORC, but the
goals of the two are different.

I will probably have to write a proper academic paper & get it
edited/reviewed instead of sending my ramblings to the user lists like
this. Still, this needs an example to talk about.

To give a qualified example, let's leave the world of single-use clusters
and take the use-case detailed here:

http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/


There are two distinct problems there - one is that a single day sees up to
100k independent user sessions running queries, and most queries cover
the last hour (& possibly join/compare against a similar hourly aggregate
from the past).

The problem with having 100k independent user-sessions from different
connections was that the SparkSQL layer drops the RDD lineage & cache
whenever a user ends a session.

The scale problem in general for Impala was that even though the data size
was multiple terabytes, the actual hot data was approximately <20 GB, which
resided on <10 machines with locality.

The same problem applies when you use RDD caching with something
un-replicated like Tachyon/Alluxio, since the same RDD becomes so
popular that the machines which hold those blocks run extra hot.

A cache model per user-session is entirely wasteful, and a common cache +
MPP model effectively overloads 2-3% of the cluster while leaving the
other machines idle.

LLAP was designed specifically to prevent that hotspotting while
maintaining the common cache model - within a few minutes after an hour
ticks over, the whole cluster develops temporal popularity for the hot
data, and nearly every rack has at least one cached copy of the same data
for availability/performance.

Since these data streams tend to be extremely wide tables (Omniture
clickstream data comes to mind), the cache does not hold all columns in a
table; and since Zipf distributions are extremely common in these real
data sets, the cache does not hold all rows either.

select count(clicks) from table where zipcode = 695506;

with ORC data bucketed + *sorted* by zipcode, the cache will hold only the
2 columns involved (clicks & zipcode) for the matching row-groups; all
bloom-filter indexes for all files will be loaded into memory, and
row-groups that miss on the bloom filter will not even feature in the cache.

A subsequent query for

select count(clicks) from table where zipcode = 695586;

will run against the already-collected indexes before deciding which files
need to be loaded into cache.


Then again, 

select count(clicks)/count(impressions) from table where zipcode = 695586;

will load only the impressions column out of the table, adding it to the
columnar cache without producing another complete copy (RDDs are
immutable, but the LLAP cache is additive).

The column-split cache & index-cache separation allows this to be cheaper
than a full rematerialization - both are evicted as they fill up, with
different priorities.

Following the same vein, LLAP can do a bit of clairvoyant pre-processing,
with a bit of input from UX patterns observed from Tableau/MicroStrategy
users, to give the impression of being much faster than the engine really
can be.

The illusion of performance is likely to be indistinguishable from the
real thing - I'm actually looking for subjects for that experiment :)

Cheers,
Gopal




Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Marcin Tustin
Mich - it sounds like maybe you should try these benchmarks with Alluxio
abstracting the storage layer, and see how much of a difference it makes.
Alluxio should (if I understand it right) provide a lot of the optimisation
you're looking for with in-memory work.

I've never used it, but I would love to hear the experiences of people who
have.

On Mon, May 30, 2016 at 5:32 PM, Mich Talebzadeh 
wrote:

> I think we are going to move to a model that the computation stack will be
> separate from storage stack and moreover something like Hive that provides
> the means for persistent storage (well HDFS is the one that stores all the
> data) will have an in-memory type capability much like what Oracle TimesTen
> IMDB does with its big brother Oracle. Now TimesTen is effectively designed
> to provide in-memory capability for analytics for Oracle 12c. These two work 
> like
> an index or materialized view.  You write queries against tables -
> optimizer figures out whether to use row oriented storage and indexes to
> access (Oracle classic) or column non-indexed storage to answer (TimesTen).
> just one optimizer.
>
> I gather Hive will be like that eventually. it will decide based on the
> frequency of access where to look for data. Yes we may have 10 TB of data
> on disk but how much of it is frequently accessed (hot data). 80-20 rule?
> In reality may be just 2TB or most recent partitions etc. The rest is cold
> data.
>
> cheers
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 May 2016 at 21:59, Michael Segel  wrote:
>
>> And you have MapR supporting Apache Drill.
>>
>> So these are all alternatives to Spark, and its not necessarily an either
>> or scenario. You can have both.
>>
>> On May 30, 2016, at 12:49 PM, Mich Talebzadeh 
>> wrote:
>>
>> yep Hortonworks supports Tez for one reason or other which I am going
>> hopefully to test it as the query engine for Hive. Though I think Spark
>> will be faster because of its in-memory support.
>>
>> Also if you are independent then you better off dealing with Spark and
>> Hive without the need to support another stack like Tez.
>>
>> Cloudera support Impala instead of Hive but it is not something I have
>> used. .
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 30 May 2016 at 20:19, Michael Segel  wrote:
>>
>>> Mich,
>>>
>>> Most people use vendor releases because they need to have the support.
>>> Hortonworks is the vendor who has the most skin in the game when it
>>> comes to Tez.
>>>
>>> If memory serves, Tez isn’t going to be M/R but a local execution
>>> engine? Then LLAP is the in-memory piece to speed up Tez?
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh 
>>> wrote:
>>>
>>> thanks I think the problem is that the TEZ user group is exceptionally
>>> quiet. Just sent an email to the Hive user group to see if anyone has
>>> managed to build a vendor-independent version.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 29 May 2016 at 21:23, Jörn Franke  wrote:
>>>
 Well I think it is different from MR. It has some optimizations which
 you do not find in MR. Especially the LLAP option in Hive2 makes it
 interesting.

 I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it
 is integrated in the Hortonworks distribution.


 On 29 May 2016, at 21:43, Mich Talebzadeh 
 wrote:

 Hi Jorn,

 I started building apache-tez-0.8.2 but got few errors. Couple of guys
 from TEZ user group kindly gave a hand but I could not go very far (or may
 be I did not make enough efforts) making it work.

 That TEZ user group is very quiet as well.

 My understanding is TEZ is MR with DAG but of course Spark has both
 plus in-memory capability.

 It would be interesting to see what version of TEZ works as execution
 engine with Hive.

 Vendors are divided on this (use Hive with TEZ) or use Impala instead
 of Hive etc as I am sure you already know.

 Cheers,




 Dr Mich Talebzadeh


 LinkedIn * 
 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Mich Talebzadeh
I think we are going to move to a model where the computation stack will be
separate from the storage stack; moreover, something like Hive that provides
the means for persistent storage (well, HDFS is the one that stores all the
data) will have an in-memory capability much like what Oracle TimesTen
IMDB does with its big brother Oracle. TimesTen is effectively designed
to provide in-memory analytics capability for Oracle 12c. The two work like
an index or materialized view: you write queries against tables, and the
optimizer figures out whether to use row-oriented storage and indexes to
answer (Oracle classic) or non-indexed column storage (TimesTen) - just
one optimizer.

I gather Hive will eventually be like that: it will decide, based on the
frequency of access, where to look for data. Yes, we may have 10 TB of data
on disk, but how much of it is frequently accessed (hot data)? The 80-20
rule? In reality it may be just 2 TB, or the most recent partitions, etc.
The rest is cold data.

cheers



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 30 May 2016 at 21:59, Michael Segel  wrote:

> And you have MapR supporting Apache Drill.
>
> So these are all alternatives to Spark, and its not necessarily an either
> or scenario. You can have both.
>
> On May 30, 2016, at 12:49 PM, Mich Talebzadeh 
> wrote:
>
> yep Hortonworks supports Tez for one reason or other which I am going
> hopefully to test it as the query engine for Hive. Though I think Spark
> will be faster because of its in-memory support.
>
> Also if you are independent then you better off dealing with Spark and
> Hive without the need to support another stack like Tez.
>
> Cloudera support Impala instead of Hive but it is not something I have
> used. .
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 May 2016 at 20:19, Michael Segel  wrote:
>
>> Mich,
>>
>> Most people use vendor releases because they need to have the support.
>> Hortonworks is the vendor who has the most skin in the game when it comes
>> to Tez.
>>
>> If memory serves, Tez isn’t going to be M/R but a local execution engine?
>> Then LLAP is the in-memory piece to speed up Tez?
>>
>> HTH
>>
>> -Mike
>>
>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh 
>> wrote:
>>
>> thanks I think the problem is that the TEZ user group is exceptionally
>> quiet. Just sent an email to Hive user group to see anyone has managed to
>> built a vendor independent version.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 29 May 2016 at 21:23, Jörn Franke  wrote:
>>
>>> Well I think it is different from MR. It has some optimizations which
>>> you do not find in MR. Especially the LLAP option in Hive2 makes it
>>> interesting.
>>>
>>> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it
>>> is integrated in the Hortonworks distribution.
>>>
>>>
>>> On 29 May 2016, at 21:43, Mich Talebzadeh 
>>> wrote:
>>>
>>> Hi Jorn,
>>>
>>> I started building apache-tez-0.8.2 but got few errors. Couple of guys
>>> from TEZ user group kindly gave a hand but I could not go very far (or may
>>> be I did not make enough efforts) making it work.
>>>
>>> That TEZ user group is very quiet as well.
>>>
>>> My understanding is TEZ is MR with DAG but of course Spark has both plus
>>> in-memory capability.
>>>
>>> It would be interesting to see what version of TEZ works as execution
>>> engine with Hive.
>>>
>>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of
>>> Hive etc as I am sure you already know.
>>>
>>> Cheers,
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 29 May 2016 at 20:19, Jörn Franke  wrote:
>>>
 Very interesting do you plan also a test with TEZ?

 On 29 May 2016, at 13:40, Mich Talebzadeh 
 wrote:

 Hi,

 I did another study of Hive using Spark engine compared to Hive with MR.

 Basically took the original table imported using Sqoop and created 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
And you have MapR supporting Apache Drill. 

So these are all alternatives to Spark, and its not necessarily an either or 
scenario. You can have both. 

> On May 30, 2016, at 12:49 PM, Mich Talebzadeh  
> wrote:
> 
> yep Hortonworks supports Tez for one reason or other which I am going 
> hopefully to test it as the query engine for Hive. Though I think Spark will 
> be faster because of its in-memory support.
> 
> Also if you are independent then you better off dealing with Spark and Hive 
> without the need to support another stack like Tez.
> 
> Cloudera support Impala instead of Hive but it is not something I have used. .
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 30 May 2016 at 20:19, Michael Segel  > wrote:
> Mich, 
> 
> Most people use vendor releases because they need to have the support. 
> Hortonworks is the vendor who has the most skin in the game when it comes to 
> Tez. 
> 
> If memory serves, Tez isn’t going to be M/R but a local execution engine? 
> Then LLAP is the in-memory piece to speed up Tez? 
> 
> HTH
> 
> -Mike
> 
>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh > > wrote:
>> 
>> thanks I think the problem is that the TEZ user group is exceptionally 
>> quiet. Just sent an email to Hive user group to see anyone has managed to 
>> built a vendor independent version.
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 29 May 2016 at 21:23, Jörn Franke > > wrote:
>> Well I think it is different from MR. It has some optimizations which you do 
>> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
>> 
>> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
>> integrated in the Hortonworks distribution. 
>> 
>> 
>> On 29 May 2016, at 21:43, Mich Talebzadeh > > wrote:
>> 
>>> Hi Jorn,
>>> 
>>> I started building apache-tez-0.8.2 but got few errors. Couple of guys from 
>>> TEZ user group kindly gave a hand but I could not go very far (or may be I 
>>> did not make enough efforts) making it work.
>>> 
>>> That TEZ user group is very quiet as well.
>>> 
>>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>>> in-memory capability.
>>> 
>>> It would be interesting to see what version of TEZ works as execution 
>>> engine with Hive.
>>> 
>>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>>> Hive etc as I am sure you already know.
>>> 
>>> Cheers,
>>> 
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 29 May 2016 at 20:19, Jörn Franke >> > wrote:
>>> Very interesting do you plan also a test with TEZ?
>>> 
>>> On 29 May 2016, at 13:40, Mich Talebzadeh >> > wrote:
>>> 
 Hi,
 
 I did another study of Hive using Spark engine compared to Hive with MR.
 
 Basically took the original table imported using Sqoop and created and 
 populated a new ORC table partitioned by year and month into 48 partitions 
 as follows:
 
 
 ​ 
 Connections use JDBC via beeline. Now for each partition using MR it takes 
 an average of 17 minutes as seen below for each PARTITION..  Now that is 
 just an individual partition and there are 48 partitions.
 
 In contrast doing the same operation with Spark engine took 10 minutes all 
 inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
 from below
 
 
 
This by no means indicates that Spark is much better than MR, but it shows 
that some very good results can be achieved using the Spark engine.
 
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 
  
 http://talebzadehmich.wordpress.com 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Jörn Franke
I do not think that in-memory by itself will make things faster in all
cases, especially if you use Tez with ORC or Parquet. For ad hoc queries on
large datasets (independently of whether they fit in memory or not) this
has a significant impact. This is an experience I also have with the
in-memory options of Oracle or SQL Server. It might sound surprising, but
it has some explanations. ORC and Parquet have min/max indexes, store and
process data very efficiently (important: choose the right datatype - if
everything is varchar then it is your fault that the database is not
performing), and only load into memory what is needed. This is not the case
for in-memory systems: usually everything is loaded into memory, not only
the parts which are needed, and due to the absence of min/max indexes you
have to go through everything.

Let us assume the table has a size of 10 TB and there are different ad hoc
queries that each process only 1 GB (each one addressing different areas).
In Hive+Tez this is currently rather efficient: you load 1 GB (negligible
in a cluster) and process 1 GB. In Spark you would cache all 10 TB (you do
not know which part will be addressed), which takes a long time to load
initially, and each query then needs to go through 10 TB in memory. This
might be an extreme case, but it is not uncommon. An exception is of course
machine learning algorithms (the original purpose of Spark), where I see
more advantages for Spark. Most traditional companies probably have both
use cases (maybe with a bias towards the first); Internet companies lean
more towards the latter.
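A sketch of the kind of ad hoc query where this matters (table and column names are made up, not from the thread): with ORC or Parquet, the min/max index on the filtered column lets the reader skip almost every stripe, so only the relevant ~1 GB slice is ever read:

```sql
-- Hypothetical 10 TB events table; only one day's slice is touched.
-- With ORC min/max indexes (and data partitioned or sorted by event_date),
-- stripes outside the predicate range are never read from disk at all.
SELECT advertiser_id,
       SUM(revenue) AS total_revenue
FROM events
WHERE event_date = '2016-05-29'   -- prunes the scan to ~1 GB of the table
GROUP BY advertiser_id;
```

A cache-everything engine has to pay the cost of materializing the full 10 TB before the first such query can benefit.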

That being said, all systems are evolving. Hive supports Tez+LLAP, which is 
basically the in-memory support. Spark stores data more efficiently in 1.5 
and 1.6 (in the Dataset API and DataFrames - the issue here is that it is not 
the same format as the files on disk). Let's see if there will be a 
convergence - my bet is that both systems will be used, each optimized for 
its own use cases.

The bottom line is that you have to first optimize and think about what you 
need to do before going in-memory. Never load everything into memory - you 
will be surprised. Have multiple technologies in your ecosystem and 
understand them. Unfortunately, most consulting companies have only poor 
experience and understanding of the complete picture, and thus they fail 
with both technologies, which is sad, because both can be extremely 
powerful and a competitive advantage.

> On 30 May 2016, at 21:49, Mich Talebzadeh  wrote:
> 
> yep Hortonworks supports Tez for one reason or other which I am going 
> hopefully to test it as the query engine for Hive. Though I think Spark will 
> be faster because of its in-memory support.
> 
> Also if you are independent then you better off dealing with Spark and Hive 
> without the need to support another stack like Tez.
> 
> Cloudera support Impala instead of Hive but it is not something I have used. .
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 30 May 2016 at 20:19, Michael Segel  wrote:
>> Mich, 
>> 
>> Most people use vendor releases because they need to have the support. 
>> Hortonworks is the vendor who has the most skin in the game when it comes to 
>> Tez. 
>> 
>> If memory serves, Tez isn’t going to be M/R but a local execution engine? 
>> Then LLAP is the in-memory piece to speed up Tez? 
>> 
>> HTH
>> 
>> -Mike
>> 
>>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh  
>>> wrote:
>>> 
>>> thanks I think the problem is that the TEZ user group is exceptionally 
>>> quiet. Just sent an email to Hive user group to see anyone has managed to 
>>> built a vendor independent version.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
 On 29 May 2016 at 21:23, Jörn Franke  wrote:
 Well I think it is different from MR. It has some optimizations which you 
 do not find in MR. Especially the LLAP option in Hive2 makes it 
 interesting. 
 
 I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
 integrated in the Hortonworks distribution. 
 
 
> On 29 May 2016, at 21:43, Mich Talebzadeh  
> wrote:
> 
> Hi Jorn,
> 
> I started building apache-tez-0.8.2 but got few errors. Couple of guys 
> from TEZ user group kindly gave a hand but I could not go very far (or 
> may be I did not make enough efforts) making it work.
> 
> That TEZ user group is very quiet as well.
> 
> My understanding is TEZ is MR with DAG but of course Spark has both plus 
> in-memory capability.
> 
> It would be interesting to see what 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Mich Talebzadeh
Yep, Hortonworks supports Tez for one reason or another, and I am hopefully
going to test it as the query engine for Hive, though I think Spark will be
faster because of its in-memory support.

Also, if you are independent then you are better off dealing with Spark and
Hive without the need to support another stack like Tez.

Cloudera supports Impala instead of Hive, but it is not something I have
used.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 30 May 2016 at 20:19, Michael Segel  wrote:

> Mich,
>
> Most people use vendor releases because they need to have the support.
> Hortonworks is the vendor who has the most skin in the game when it comes
> to Tez.
>
> If memory serves, Tez isn’t going to be M/R but a local execution engine?
> Then LLAP is the in-memory piece to speed up Tez?
>
> HTH
>
> -Mike
>
> On May 29, 2016, at 1:35 PM, Mich Talebzadeh 
> wrote:
>
> thanks I think the problem is that the TEZ user group is exceptionally
> quiet. Just sent an email to Hive user group to see anyone has managed to
> built a vendor independent version.
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 29 May 2016 at 21:23, Jörn Franke  wrote:
>
>> Well I think it is different from MR. It has some optimizations which you
>> do not find in MR. Especially the LLAP option in Hive2 makes it
>> interesting.
>>
>> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is
>> integrated in the Hortonworks distribution.
>>
>>
>> On 29 May 2016, at 21:43, Mich Talebzadeh 
>> wrote:
>>
>> Hi Jorn,
>>
>> I started building apache-tez-0.8.2 but got few errors. Couple of guys
>> from TEZ user group kindly gave a hand but I could not go very far (or may
>> be I did not make enough efforts) making it work.
>>
>> That TEZ user group is very quiet as well.
>>
>> My understanding is TEZ is MR with DAG but of course Spark has both plus
>> in-memory capability.
>>
>> It would be interesting to see what version of TEZ works as execution
>> engine with Hive.
>>
>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of
>> Hive etc as I am sure you already know.
>>
>> Cheers,
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 29 May 2016 at 20:19, Jörn Franke  wrote:
>>
>>> Very interesting do you plan also a test with TEZ?
>>>
>>> On 29 May 2016, at 13:40, Mich Talebzadeh 
>>> wrote:
>>>
>>> Hi,
>>>
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>>
>>> Basically took the original table imported using Sqoop and created and
>>> populated a new ORC table partitioned by year and month into 48 partitions
>>> as follows:
>>>
>>> 
>>> ​
>>> Connections use JDBC via beeline. Now for each partition using MR it
>>> takes an average of 17 minutes as seen below for each PARTITION..  Now that
>>> is just an individual partition and there are 48 partitions.
>>>
>>> In contrast doing the same operation with Spark engine took 10 minutes
>>> all inclusive. I just gave up on MR. You can see the StartTime and
>>> FinishTime from below
>>>
>>> 
>>>
>>> This by no means indicates that Spark is much better than MR, but it shows
>>> that some very good results can be achieved using the Spark engine.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 24 May 2016 at 08:03, Mich Talebzadeh 
>>> wrote:
>>>
 Hi,

 We use Hive as the database and use Spark as an all purpose query tool.

 Whether Hive is the write database for purpose or one is better off
 with something like Phoenix on Hbase, well the answer is it depends and
 your mileage varies.

 So fit for purpose.

 Ideally what wants is to use the fastest  method to get the results.
 How fast we confine it to our SLA agreements in production and that helps
 us from unnecessary further work as we technologists like to play around.

 So in short, we use Spark most of the time and use Hive as the backend
 engine for data storage, mainly ORC 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
Mich, 

Most people use vendor releases because they need to have the support. 
Hortonworks is the vendor who has the most skin in the game when it comes to 
Tez. 

If memory serves, Tez isn’t going to be M/R but a local execution engine? Then 
LLAP is the in-memory piece to speed up Tez? 

HTH

-Mike

> On May 29, 2016, at 1:35 PM, Mich Talebzadeh  
> wrote:
> 
> thanks I think the problem is that the TEZ user group is exceptionally quiet. 
> Just sent an email to Hive user group to see anyone has managed to built a 
> vendor independent version.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 29 May 2016 at 21:23, Jörn Franke  > wrote:
> Well I think it is different from MR. It has some optimizations which you do 
> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
> 
> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
> integrated in the Hortonworks distribution. 
> 
> 
> On 29 May 2016, at 21:43, Mich Talebzadeh  > wrote:
> 
>> Hi Jorn,
>> 
>> I started building apache-tez-0.8.2 but got few errors. Couple of guys from 
>> TEZ user group kindly gave a hand but I could not go very far (or may be I 
>> did not make enough efforts) making it work.
>> 
>> That TEZ user group is very quiet as well.
>> 
>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>> in-memory capability.
>> 
>> It would be interesting to see what version of TEZ works as execution engine 
>> with Hive.
>> 
>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>> Hive etc as I am sure you already know.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 29 May 2016 at 20:19, Jörn Franke > > wrote:
>> Very interesting do you plan also a test with TEZ?
>> 
>> On 29 May 2016, at 13:40, Mich Talebzadeh > > wrote:
>> 
>>> Hi,
>>> 
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>> 
>>> Basically took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>> as follows:
>>> 
>>> 
>>> ​ 
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes as seen below for each PARTITION..  Now that is 
>>> just an individual partition and there are 48 partitions. 
>>> 
>>> In contrast doing the same operation with Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> from below
>>> 
>>> 
>>> 
>>> This by no means indicates that Spark is much better than MR, but it shows 
>>> that some very good results can be achieved using the Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 24 May 2016 at 08:03, Mich Talebzadeh >> > wrote:
>>> Hi,
>>> 
>>> We use Hive as the database and use Spark as an all purpose query tool.
>>> 
>>> Whether Hive is the write database for purpose or one is better off with 
>>> something like Phoenix on Hbase, well the answer is it depends and your 
>>> mileage varies. 
>>> 
>>> So fit for purpose.
>>> 
>>> Ideally what wants is to use the fastest  method to get the results. How 
>>> fast we confine it to our SLA agreements in production and that helps us 
>>> from unnecessary further work as we technologists like to play around.
>>> 
>>> So in short, we use Spark most of the time and use Hive as the backend 
>>> engine for data storage, mainly ORC tables.
>>> 
>>> We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a 
>>> combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but 
>>> at the moment it is one of my projects.
>>> 
>>> We do not use any vendor's products as it enables us to move away  from 
>>> being tied down after years of SAP, Oracle and MS dependency to yet another 
>>> vendor. Besides there is some politics going on 

Re: SHOW DATABASES/TABLES with SQL standard authorization

2016-05-30 Thread Mich Talebzadeh
OK, that is different from seeing the list of databases - that is just
schema metadata.

Case in point, in SAP ASE, a normal RDBMS:

> sp_addlogin someuser, someuser123, scratchpad
2> go
Password correctly set.
Account unlocked.
New login created.
(return status = 0)
1> exit
 isql -U someuser -w1000
Password:
-- Show me list of databases. full list is displayed
1> sp_helpdb
2> go
 name                  db_size      owner  dbid  created       durability
 --------------------- -----------  -----  ----  ------------  -----------
 ASEIMDB                5000.0 MB   sa        6  Mar 05, 2012  no_recovery
 ASEIMDB_template       5000.0 MB   sa        9  Apr 26, 2016  full
 DBA_CONTROL_20150613    150.0 MB   sa       10  Apr 26, 2016  full
 DBA_CONTROL_old         150.0 MB   sa       11  Apr 26, 2016  full
 DBHDD                 12000.0 MB   sa        4  Oct 10, 2011  full
 DBSSD                 27690.0 MB   sa        7  Apr 26, 2016  full
 master                  100.0 MB   sa        1  Oct 10, 2011  full
 mda_analysis           1200.0 MB   sa       17  Jul 06, 2013  full
 model                    12.0 MB   sa        3  Oct 10, 2011  full
 scratchpad            77756.0 MB   sa        5  Apr 25, 2016  full

-- Can I use ASEIMDB with no access right given?
--
1> use ASEIMDB
2> go
Msg 10351, Level 14, State 1:
Server 'SYB_157', Line 1:
Server user id 24 is not a valid user in database 'ASEIMDB'

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 30 May 2016 at 18:52, Lukas Lalinsky  wrote:

> I realize it's just a list, but it's still not something I'd expect. The
> difference compared to a normal RDBMS is that they typically have a CONNECT
> privilege, which I can use to restrict the user to connecting to other
> databases.
>
> I'm also more concerned about SHOW TABLES. It just seems strange that I
> can do this for any database:
>
> USE any_db;
> SHOW TABLES;
>
> Regards,
>
> Lukas
>
>
> On Mon, May 30, 2016 at 7:34 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> the behaviour is no different from a normal RDBMS.
>>
>> show databases actually queries the Hive metastore table DBS.
>>
>>  select NAME, OWNER_NAME from DBS order by 1,2;
>> NAME   OWNER_NAME
>> -- --
>> accounts   hduser
>> asehadoop  hduser
>> defaultpublic
>> iqhadoop   hduser
>> mytable_db hduser
>> oraclehadoop   hduser
>> test   hduser
>> 7 rows selected.
>>
>> However, that is just a list. It does not mean you have access rights to
>> that database.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 30 May 2016 at 18:20, Lukas Lalinsky 
>> wrote:
>>
>>> I'm setting up a Hive instance with the SQL standard authorization and
>>> it seems to be working great for all normal operations, but for SHOW
>>> DATABASES/TABLES it's behaving differently from what I would expect.
>>>
>>> It always shows all databases/tables, even though I do not have access
>>> to those tables. Is that the intended behavior? Or is there something that
>>> can be done to filter out items which I can't access?
>>>
>>> Regards,
>>>
>>> Lukas
>>>
>>
>>
>


Is it possible to use external table on top of Elasticsearch index for arbitrary FTS

2016-05-30 Thread Igor Kravzov
I know that an external table can be defined like this:

CREATE EXTERNAL TABLE artists (
  id      BIGINT,
  name    STRING,
  links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists', 'es.query' = '?q=me*');


But can I just define the table without the es.query parameter and later run
arbitrary FTS searches?

Thanks in advance.
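For reference, the same definition without the es.query property would look like the following (a sketch only; per elasticsearch-hadoop conventions an omitted es.query matches all documents, so filters can then be expressed as ordinary HiveQL predicates):

```sql
CREATE EXTERNAL TABLE artists (
  id      BIGINT,
  name    STRING,
  links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists');

-- Arbitrary searches are then written in HiveQL against the full index:
SELECT name FROM artists WHERE name LIKE 'me%';
```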


Re: SHOW DATABASES/TABLES with SQL standard authorization

2016-05-30 Thread Lukas Lalinsky
I realize it's just a list, but it's still not something I'd expect. The
difference compared to a normal RDBMS is that they typically have a CONNECT
privilege, which I can use to restrict the user to connecting to other
databases.

I'm also more concerned about SHOW TABLES. It just seems strange that I can
do this for any database:

USE any_db;
SHOW TABLES;

Regards,

Lukas


On Mon, May 30, 2016 at 7:34 PM, Mich Talebzadeh 
wrote:

> the behaviour is no different from a normal RDBMS.
>
> show databases actually queries the Hive metastore table DBS.
>
>  select NAME, OWNER_NAME from DBS order by 1,2;
> NAME   OWNER_NAME
> -- --
> accounts   hduser
> asehadoop  hduser
> defaultpublic
> iqhadoop   hduser
> mytable_db hduser
> oraclehadoop   hduser
> test   hduser
> 7 rows selected.
>
> However, that is just a list. It does not mean you have access rights to
> that database.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 May 2016 at 18:20, Lukas Lalinsky 
> wrote:
>
>> I'm setting up a Hive instance with the SQL standard authorization and it
>> seems to be working great for all normal operations, but for SHOW
>> DATABASES/TABLES it's behaving differently from what I would expect.
>>
>> It always shows all databases/tables, even though I do not have access to
>> those tables. Is that the intended behavior? Or is there something that can
>> be done to filter out items which I can't access?
>>
>> Regards,
>>
>> Lukas
>>
>
>


Re: SHOW DATABASES/TABLES with SQL standard authorization

2016-05-30 Thread Mich Talebzadeh
the behaviour is no different from a normal RDBMS.

show databases actually queries the Hive metastore table DBS.

 select NAME, OWNER_NAME from DBS order by 1,2;
NAME   OWNER_NAME
-- --
accounts   hduser
asehadoop  hduser
defaultpublic
iqhadoop   hduser
mytable_db hduser
oraclehadoop   hduser
test   hduser
7 rows selected.

However, that is just a list. It does not mean you have access rights to
that database.
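The same metastore can be joined to list every table per database as well. This is a sketch against the standard metastore schema (TBLS and DBS tables), run in the backing RDBMS rather than in Hive itself:

```sql
-- List every table with its owning database, straight from the metastore
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
ORDER BY d.NAME, t.TBL_NAME;
```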

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 30 May 2016 at 18:20, Lukas Lalinsky  wrote:

> I'm setting up a Hive instance with the SQL standard authorization and it
> seems to be working great for all normal operations, but for SHOW
> DATABASES/TABLES it's behaving differently from what I would expect.
>
> It always shows all databases/tables, even though I do not have access to
> those tables. Is that the intended behavior? Or is there something that can
> be done to filter out items which I can't access?
>
> Regards,
>
> Lukas
>


SHOW DATABASES/TABLES with SQL standard authorization

2016-05-30 Thread Lukas Lalinsky
I'm setting up a Hive instance with the SQL standard authorization and it
seems to be working great for all normal operations, but for SHOW
DATABASES/TABLES it's behaving differently from what I would expect.

It always shows all databases/tables, even though I do not have access to
those tables. Is that the intended behavior? Or is there something that can
be done to filter out items which I can't access?

Regards,

Lukas


RE: Does hive need exact schema in Hive Export/Import?

2016-05-30 Thread Markovitz, Dudu
Hi

1)
I was able to do the import by doing the following manipulation:


· Export table dev101

· Create an empty table dev102

· Export table dev102

· replace the _metadata file of dev101 with the _metadata file of dev102

· import table dev101 to table dev102

2)
Another option is not to create dev102 in advance but let the import from 
dev101 to create it.
After the import you can alter the table, e.g.:

Alter table dev102 change column col2 col2 varchar(10);
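The metadata swap in option 1 can be sketched as shell commands (table names and paths are illustrative, assuming the _metadata/data export layout shown elsewhere in this thread):

```shell
# Export the source table and an empty target table so both have a _metadata file
hive -e "EXPORT TABLE dev101 TO '/tmp/dev101';"
hive -e "CREATE TABLE dev102 (col1 INT, col2 STRING);
         EXPORT TABLE dev102 TO '/tmp/dev102';"

# Replace dev101's exported metadata with dev102's
hdfs dfs -rm /tmp/dev101/_metadata
hdfs dfs -cp /tmp/dev102/_metadata /tmp/dev101/_metadata

# The import now succeeds because the metadata matches the target schema
hive -e "IMPORT TABLE dev102 FROM '/tmp/dev101';"
```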


Dudu

From: Devender Yadav [mailto:devender.ya...@impetus.co.in]
Sent: Monday, May 30, 2016 2:38 PM
To: user@hive.apache.org
Subject: Does hive need exact schema in Hive Export/Import?


Hi All,


I am using HDP 2.3

- Hadoop version - 2.7.1

- Hive version - 1.2.1


I created a table dev101 in hive using

create table dev101 (col1 int, col2 char(10));

I inserted two records using

insert into dev101 values (1, 'value1');
insert into dev101 values (2, 'value2');

I exported data to HDFS using

export table dev101 to '/tmp/dev101';


Then, I created a new table dev102 using

create table dev102 (col1 int, col2 String);


I imported data from `/tmp/dev101` into `dev102` using

import table dev102 from '/tmp/dev101';

I got error:

>FAILED: SemanticException [Error 10120]: The existing table is not compatible 
>with the import spec.   Column Schema does not match


Then I created another table `dev103` using

create table dev103 (col1 int, col2 char(50));

Again imported:

import table dev103 from '/tmp/dev101';

Same error:

>FAILED: SemanticException [Error 10120]: The existing table is not compatible 
>with the import spec.   Column Schema does not match

Finally, I create table with **exactly same schema**

create table dev104 (col1 int, col2 char(10));

And imported

import table dev104 from '/tmp/dev101';

Imported Successfully.

Does hive need exact schema in Hive Export/Import?




Regards,
Devender








NOTE: This message may contain information that is confidential, proprietary, 
privileged or otherwise protected by law. The message is intended solely for 
the named addressee. If received in error, please destroy and notify the 
sender. Any use of this email is prohibited when received in error. Impetus 
does not represent, warrant and/or guarantee, that the integrity of this 
communication has been maintained nor that the communication is free of errors, 
virus, interception or interference.


Re: Permission denied with select *

2016-05-30 Thread Al Pivonka
Access control lists...
Who owns the system, database, tables, & files?
Not only owner, but also do you belong to the group?
What are the permissions on the files behind the table?

Is Sentry enabled?
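The checks above can be sketched as shell commands (the warehouse path is illustrative; adjust for your table location):

```shell
# Who owns the files behind the table, and what are their permissions?
hdfs dfs -ls /user/hive/warehouse/mydb.db/mytable

# Which groups does the querying user belong to?
hdfs groups $(whoami)
```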
On May 30, 2016 9:52 AM, "kishore kumar"  wrote:

Hi,

If i run "select * from table"

permission denied error we are encountering, where as

select * from table limit 10; or

select count(*) from table;

working fine, what could be the reason any guess ?

-- 
Thanks,
Kishore.


Permission denied with select *

2016-05-30 Thread kishore kumar
Hi,

If i run "select * from table"

permission denied error we are encountering, where as

select * from table limit 10; or

select count(*) from table;

working fine, what could be the reason any guess ?

-- 
Thanks,
Kishore.


Re: Does hive need exact schema in Hive Export/Import?

2016-05-30 Thread Mich Talebzadeh
I guess one alternative is to import it AS IS (the same column type) to a
staging table and then do insert/select into the target table from the
staging table.
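That staging approach can be sketched in HiveQL (table names are illustrative; the IMPORT creates the staging table with the original char(10) schema from the export metadata, and the INSERT...SELECT does the conversion):

```sql
-- Import like-for-like into a staging table created from the export metadata
IMPORT TABLE dev101_staging FROM '/tmp/dev101';

-- Then convert into the differently typed target table
CREATE TABLE dev102 (col1 INT, col2 STRING);
INSERT INTO TABLE dev102
SELECT col1, CAST(col2 AS STRING) FROM dev101_staging;
```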

import/export is for copying data from, say, prod to dev, like for like.

The problem is that it does two things: it exports both data and metadata.
See below:


hduser@rhes564:: :/home/hduser/dba/bin> hdfs dfs -ls
hdfs://rhes564:9000/export
16/05/30 13:07:02 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Found 2 items
-rwxr-xr-x   2 hduser supergroup   1588 2016-05-25 16:46 hdfs://rhes564:9000/export/_metadata
drwxr-xr-x   - hduser supergroup      0 2016-05-25 16:46 hdfs://rhes564:9000/export/data

and uses the metadata file to create the target table which somehow does
not work in this case!

HTH



Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 30 May 2016 at 13:00, Devender Yadav 
wrote:

> Hi Mich,
>
>
> you did not get my question I guess .
>
>
> I am able to use import export.
>
>
> I am exporting data from dev101 (col1 int, col2 char(10)) and importing
> in dev102 (col1 int, col2 string)
>
>
>
> I am getting issue :
>
>
> >FAILED: SemanticException [Error 10120]: The existing table is not
> compatible with the import spec.   Column Schema does not match
>
>
>
> Is it possible to import char(10) field in string or char(20) ?
>
>
>
> Because I tried and got above mentioned exception.
>
>
>
> Regards,
> Devender
> --
> *From:* Mich Talebzadeh 
> *Sent:* Monday, May 30, 2016 5:19 PM
> *To:* user
> *Subject:* Re: Does hive need exact schema in Hive Export/Import?
>
> it is pretty straight forward
>
> !hdfs dfs -rm -r hdfs://rhes564:9000/export;
> EXPORT TABLE oraclehadoop.sales_staging to  "hdfs://rhes564:9000/export";
> --
> DROP TABLE IF EXISTS test.sales_staging;
> IMPORT TABLE test.sales_staging FROM  "hdfs://rhes564:9000/export";
> select count(1) from test.sales_staging;
> exit;
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 30 May 2016 at 12:38, Devender Yadav 
> wrote:
>
>> Hi All,
>>
>>
>> I am using HDP 2.3
>>
>> - Hadoop version - 2.7.1
>>
>> - Hive version - 1.2.1
>>
>>
>> I created a table dev101 in hive using
>>
>> create table dev101 (col1 int, col2 char(10));
>>
>> I inserted two records using
>>
>> insert into dev101 values (1, 'value1');
>> insert into dev101 values (2, 'value2');
>>
>> I exported data to HDFS using
>>
>> export table dev101 to '/tmp/dev101';
>>
>>
>> Then, I created a new table dev102 using
>>
>> create table dev102 (col1 int, col2 String);
>>
>>
>> I imported data from `/tmp/dev101` into `dev102` using
>>
>> import table dev102 from '/tmp/dev101';
>>
>> I got error:
>>
>> >FAILED: SemanticException [Error 10120]: The existing table is not
>> compatible with the import spec.   Column Schema does not match
>>
>>
>> Then I created another table `dev103` using
>>
>> create table dev103 (col1 int, col2 char(50));
>>
>> Again imported:
>>
>> import table dev103 from '/tmp/dev101';
>>
>> Same error:
>>
>> >FAILED: SemanticException [Error 10120]: The existing table is not
>> compatible with the import spec.   Column Schema does not match
>>
>> Finally, I create table with **exactly same schema**
>>
>> create table dev104 (col1 int, col2 char(10));
>>
>> And imported
>>
>> import table dev104 from '/tmp/dev101';
>>
>> Imported Successfully.
>>
>> Does hive need exact schema in Hive Export/Import? ​
>>
>>
>>
>> Regards,
>> Devender​
>>
>> --
>>
>>
>>
>>
>>
>>
>> NOTE: This message may contain information that is confidential,
>> proprietary, privileged or otherwise protected by law. The message is
>> intended solely for the named addressee. If received in error, please
>> destroy and notify the sender. Any use of this email is prohibited when
>> received in error. Impetus does not represent, warrant and/or guarantee,
>> that the integrity of this communication has been maintained nor that the
>> communication is free of errors, virus, interception or interference.
>>
>
>
> --
>
>
>
>
>
>
> NOTE: This message may contain information that is confidential,
> proprietary, privileged or otherwise protected by law. The message is
> intended solely for the named addressee. If received in error, please
> destroy and notify the sender. Any use of this email is prohibited when
> received in error. Impetus does not represent, 

Re: My first TEZ job fails

2016-05-30 Thread Gopal Vijayaraghavan
> hduser@rhes564: /usr/lib/apache-tez-0.7.1-bin> hadoop jar
>./tez-examples-0.7.1.jar orderedwordcount /tmp/input/test.txt
>/tmp/out/test.log

Sure, you're missing file:/// - the defaultFS is most likely
hdfs://<namenode>:<port>/

The inputs and outputs without a scheme prefix will go the defaultFS
configured in core-site.xml.
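The resolution rule can be illustrated with a tiny shell sketch. This is a simplified analogue of Hadoop's Path handling, not the actual implementation, and the defaultFS value is made up:

```shell
resolve() {
  case "$1" in
    *://*) echo "$1" ;;                       # an explicit scheme is kept as-is
    *)     echo "hdfs://namenode:9000$1" ;;   # a bare path falls back to the defaultFS
  esac
}

resolve /tmp/input/test.txt          # resolves against the defaultFS
resolve file:///tmp/input/test.txt   # stays on the local filesystem
```

So the fix is to qualify the paths explicitly, e.g. `hadoop jar ./tez-examples-0.7.1.jar orderedwordcount file:///tmp/input/test.txt file:///tmp/out/test.log`, or to put the input file into HDFS first and keep the unqualified paths.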

Cheers,
Gopal



Re: Does hive need exact schema in Hive Export/Import?

2016-05-30 Thread Devender Yadav
Hi Mich,


you did not get my question I guess .


I am able to use import export.


I am exporting data from dev101 (col1 int, col2 char(10)) and importing in 
dev102 (col1 int, col2 string)



I am getting issue :


>FAILED: SemanticException [Error 10120]: The existing table is not compatible 
>with the import spec.   Column Schema does not match



Is it possible to import char(10) field in string or char(20) ?



Because I tried and got above mentioned exception.



Regards,
Devender

From: Mich Talebzadeh 
Sent: Monday, May 30, 2016 5:19 PM
To: user
Subject: Re: Does hive need exact schema in Hive Export/Import?

it is pretty straight forward

!hdfs dfs -rm -r hdfs://rhes564:9000/export;
EXPORT TABLE oraclehadoop.sales_staging to  "hdfs://rhes564:9000/export";
--
DROP TABLE IF EXISTS test.sales_staging;
IMPORT TABLE test.sales_staging FROM  "hdfs://rhes564:9000/export";
select count(1) from test.sales_staging;
exit;



Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 30 May 2016 at 12:38, Devender Yadav 
> wrote:

Hi All,


I am using HDP 2.3

- Hadoop version - 2.7.1

- Hive version - 1.2.1


I created a table dev101 in hive using

create table dev101 (col1 int, col2 char(10));

I inserted two records using

insert into dev101 values (1, 'value1');
insert into dev101 values (2, 'value2');

I exported data to HDFS using

export table dev101 to '/tmp/dev101';


Then, I created a new table dev102 using

create table dev102 (col1 int, col2 String);


I imported data from `/tmp/dev101` into `dev102` using

import table dev102 from '/tmp/dev101';

I got error:

>FAILED: SemanticException [Error 10120]: The existing table is not compatible 
>with the import spec.   Column Schema does not match


Then I created another table `dev103` using

create table dev103 (col1 int, col2 char(50));

Again imported:

import table dev103 from '/tmp/dev101';

Same error:

>FAILED: SemanticException [Error 10120]: The existing table is not compatible 
>with the import spec.   Column Schema does not match

Finally, I create table with **exactly same schema**

create table dev104 (col1 int, col2 char(10));

And imported

import table dev104 from '/tmp/dev101';

Imported Successfully.

Does hive need exact schema in Hive Export/Import?



Regards,
Devender








NOTE: This message may contain information that is confidential, proprietary, 
privileged or otherwise protected by law. The message is intended solely for 
the named addressee. If received in error, please destroy and notify the 
sender. Any use of this email is prohibited when received in error. Impetus 
does not represent, warrant and/or guarantee, that the integrity of this 
communication has been maintained nor that the communication is free of errors, 
virus, interception or interference.









NOTE: This message may contain information that is confidential, proprietary, 
privileged or otherwise protected by law. The message is intended solely for 
the named addressee. If received in error, please destroy and notify the 
sender. Any use of this email is prohibited when received in error. Impetus 
does not represent, warrant and/or guarantee, that the integrity of this 
communication has been maintained nor that the communication is free of errors, 
virus, interception or interference.


Re: Anyone successfully deployed Hive on TEZ engine?

2016-05-30 Thread Mich Talebzadeh
Hi Gopal,

please see my correspondence about Tez in tez user group. I forwarded to
hive user group.

thanks

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 30 May 2016 at 12:30, Gopal Vijayaraghavan  wrote:

> > I do not use any vendor's product., All my own set up, build and
> >configure.
>
> My autobuild scripts should serve as readable documentation for this,
> since nearly everything's in a single Makefile with an install: target.
>
> Or take the easy route with
>
> $ make dist install
>
> In case you use the llap branch, just do "set
> hive.llap.execution.mode=none;" to use Tez.
>
> > java version "1.8.0_77"
> > Hadoop 2.6.0
> ...
> > https://tez.apache.org/install.html
>
> Looks good so far.
>
> > Ok I just need to make it work as I have hive on spark engine as well.
>
> You're missing 3 things approximately - if you read through the Makefile
> in github.
>
> First, a good tez-site.xml in the classpath (remember, tez.lib.uris needs
> to be an HDFS path - for the rest, see the base file from autobuild).
>
> I usually update Tez to ${fs.default.name}/user/gopal/tez/tez.tar.gz and I
> do not use the minimal tarball, but the full dist tarball.
>
> The fixed tarball means it hits all the good localization characteristics
> of YARN, which can add up to minutes on a >250+ node cluster.
>
> Second, put that in the classpath for Hive (append to
> $INSTALL_ROOT/hive/bin/hive-config.sh)
>
> > export
> >HADOOP_CLASSPATH="$INSTALL_ROOT/tez/*:$INSTALL_ROOT/tez/lib/*:$INSTALL_ROO
> >T/tez/conf/:$HADOOP_CLASSPATH"
>
> > export HADOOP_USER_CLASSPATH_FIRST=true
>
>
> Replace $INSTALL_ROOT with wherever Tez is located.
>
> Third, disable the hive-1.x jars coming from SparkSQL (append/create in
> $INSTALL_ROOT/hive/conf/hive-env.sh)
>
> > export HIVE_SKIP_SPARK_ASSEMBLY=true
>
>
> After that, you can do
>
> > hive --hiveconf hive.execution.engine=tez
>
> to get Tez working (add --hiveconf tez.queue.name=<queue name> to use queues).
>
> Cheers,
> Gopal
>
>
>


Re: Does hive need exact schema in Hive Export/Import?

2016-05-30 Thread Mich Talebzadeh
it is pretty straight forward

!hdfs dfs -rm -r hdfs://rhes564:9000/export;
EXPORT TABLE oraclehadoop.sales_staging to  "hdfs://rhes564:9000/export";
--
DROP TABLE IF EXISTS test.sales_staging;
IMPORT TABLE test.sales_staging FROM  "hdfs://rhes564:9000/export";
select count(1) from test.sales_staging;
exit;


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 30 May 2016 at 12:38, Devender Yadav 
wrote:

> Hi All,
>
>
> I am using HDP 2.3
>
> - Hadoop version - 2.7.1
>
> - Hive version - 1.2.1
>
>
> I created a table dev101 in hive using
>
> create table dev101 (col1 int, col2 char(10));
>
> I inserted two records using
>
> insert into dev101 values (1, 'value1');
> insert into dev101 values (2, 'value2');
>
> I exported data to HDFS using
>
> export table dev101 to '/tmp/dev101';
>
>
> Then, I created a new table dev102 using
>
> create table dev102 (col1 int, col2 String);
>
>
> I imported data from `/tmp/dev101` into `dev102` using
>
> import table dev102 from '/tmp/dev101';
>
> I got error:
>
> >FAILED: SemanticException [Error 10120]: The existing table is not
> compatible with the import spec.   Column Schema does not match
>
>
> Then I created another table `dev103` using
>
> create table dev103 (col1 int, col2 char(50));
>
> Again imported:
>
> import table dev103 from '/tmp/dev101';
>
> Same error:
>
> >FAILED: SemanticException [Error 10120]: The existing table is not
> compatible with the import spec.   Column Schema does not match
>
> Finally, I create table with **exactly same schema**
>
> create table dev104 (col1 int, col2 char(10));
>
> And imported
>
> import table dev104 from '/tmp/dev101';
>
> Imported Successfully.
>
> Does hive need exact schema in Hive Export/Import? ​
>
>
>
> Regards,
> Devender​
>
> --
>
>
>
>
>
>
> NOTE: This message may contain information that is confidential,
> proprietary, privileged or otherwise protected by law. The message is
> intended solely for the named addressee. If received in error, please
> destroy and notify the sender. Any use of this email is prohibited when
> received in error. Impetus does not represent, warrant and/or guarantee,
> that the integrity of this communication has been maintained nor that the
> communication is free of errors, virus, interception or interference.
>


Does hive need exact schema in Hive Export/Import?

2016-05-30 Thread Devender Yadav
Hi All,


I am using HDP 2.3

- Hadoop version - 2.7.1

- Hive version - 1.2.1


I created a table dev101 in hive using

create table dev101 (col1 int, col2 char(10));

I inserted two records using

insert into dev101 values (1, 'value1');
insert into dev101 values (2, 'value2');

I exported data to HDFS using

export table dev101 to '/tmp/dev101';


Then, I created a new table dev102 using

create table dev102 (col1 int, col2 String);


I imported data from `/tmp/dev101` into `dev102` using

import table dev102 from '/tmp/dev101';

I got error:

>FAILED: SemanticException [Error 10120]: The existing table is not compatible 
>with the import spec.   Column Schema does not match


Then I created another table `dev103` using

create table dev103 (col1 int, col2 char(50));

Again imported:

import table dev103 from '/tmp/dev101';

Same error:

>FAILED: SemanticException [Error 10120]: The existing table is not compatible 
>with the import spec.   Column Schema does not match

Finally, I create table with **exactly same schema**

create table dev104 (col1 int, col2 char(10));

And imported

import table dev104 from '/tmp/dev101';

Imported Successfully.

Does hive need exact schema in Hive Export/Import?



Regards,
Devender








NOTE: This message may contain information that is confidential, proprietary, 
privileged or otherwise protected by law. The message is intended solely for 
the named addressee. If received in error, please destroy and notify the 
sender. Any use of this email is prohibited when received in error. Impetus 
does not represent, warrant and/or guarantee, that the integrity of this 
communication has been maintained nor that the communication is free of errors, 
virus, interception or interference.


Re: Anyone successfully deployed Hive on TEZ engine?

2016-05-30 Thread Gopal Vijayaraghavan
> I do not use any vendor's product., All my own set up, build and
>configure.

My autobuild scripts should serve as readable documentation for this,
since nearly everything's in a single Makefile with an install: target.

Or take the easy route with

$ make dist install

In case you use the llap branch, just do "set
hive.llap.execution.mode=none;" to use Tez.

> java version "1.8.0_77"
> Hadoop 2.6.0
...
> https://tez.apache.org/install.html

Looks good so far.

> Ok I just need to make it work as I have hive on spark engine as well.

You're missing 3 things approximately - if you read through the Makefile
in github.

First, a good tez-site.xml in the classpath (remember, tez.lib.uris needs
to be an HDFS path - for the rest, see the base file from autobuild).

I usually update Tez to ${fs.default.name}/user/gopal/tez/tez.tar.gz and I
do not use the minimal tarball, but the full dist tarball.

The fixed tarball means it hits all the good localization characteristics
of YARN, which can add up to minutes on a >250+ node cluster.

Second, put that in the classpath for Hive (append to
$INSTALL_ROOT/hive/bin/hive-config.sh)

> export 
>HADOOP_CLASSPATH="$INSTALL_ROOT/tez/*:$INSTALL_ROOT/tez/lib/*:$INSTALL_ROO
>T/tez/conf/:$HADOOP_CLASSPATH"

> export HADOOP_USER_CLASSPATH_FIRST=true


Replace $INSTALL_ROOT with wherever Tez is located.

Third, disable the hive-1.x jars coming from SparkSQL (append/create in
$INSTALL_ROOT/hive/conf/hive-env.sh)

> export HIVE_SKIP_SPARK_ASSEMBLY=true


After that, you can do

> hive --hiveconf hive.execution.engine=tez

to get Tez working (add --hiveconf tez.queue.name=<queue name> to use queues).

Cheers,
Gopal




Re: Anyone successfully deployed Hive on TEZ engine?

2016-05-30 Thread Mich Talebzadeh
Thanks Gopal.

I do not use any vendor's product., All my own set up, build and configure.
No CDH, no HDP, etc.

This the current stack that I have:

Java

*java -version*
java version "1.8.0_77"
Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)

HDFS version


*hadoop version*
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0

YARN version


*yarn version*
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0

HIVE version

hive --version
Hive 2.0.0
Subversion git://reznor-mbp-2.local/Users/sergey/git/hivegit -r
7f9f1fcb8697fb33f0edc2c391930a3728d247d7
Compiled by sergey on Tue Feb 9 18:12:08 PST 2016


Spark version

version 1.6.1
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_77)


TEZ I downloaded

tez-0.8.3

And built it using the following instructions

https://tez.apache.org/install.html


Ok I just need to make it work as I have hive on spark engine as well.

please tell me what versions of tez and yarn etc. I should use.

thanks



Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 30 May 2016 at 10:16, Gopal Vijayaraghavan  wrote:

>
> > In short at the simplest set up what Resource Manager it works with?
>
> Tez+Hive needs HDFS and YARN 2.6.0+ (preferably as close to an Apache
> build as possible - CDH clusters need more work).
>
> Hive2 needs Apache Slider 0.91 right now, to start the cache daemons on
> YARN (see SLIDER-82).
>
> > If so kindly specify both Hive and TEZ versions.
>
> I maintain build scripts & configuration setups for Hive+Tez, for work
>
> https://github.com/t3rmin4t0r/tez-autobuild/tree/llap
>
>
> Both that & the master there builds Hive (2.1.0-SNAPSHOT) + Tez
> (0.8.4-SNAPSHOT), this one has the
> LLAP cache configurations turned on.
>
> This is what I use to develop Hive, before there are releases and it will
> allow each user
> on a shared cluster to maintain their own independent private install of
> hive - if you look at
> something like the old Spotify Hive query presentations, you'll see that
> more people have
> used that to run their own private builds successfully :)
>
> Purely out of laziness, the LLAP configurations in slider-gen.sh (i.e the
> Xmx & cache values)
> are configured exactly to match my dev cluster - 32 vcore + 256Gb RAM.
>
> Cheers,
> Gopal
>
>
>


Re: Anyone successfully deployed Hive on TEZ engine?

2016-05-30 Thread Gopal Vijayaraghavan

> In short at the simplest set up what Resource Manager it works with?

Tez+Hive needs HDFS and YARN 2.6.0+ (preferably as close to an Apache
build as possible - CDH clusters need more work).

Hive2 needs Apache Slider 0.91 right now, to start the cache daemons on
YARN (see SLIDER-82).

> If so kindly specify both Hive and TEZ versions.

I maintain build scripts & configuration setups for Hive+Tez, for work

https://github.com/t3rmin4t0r/tez-autobuild/tree/llap


Both that & the master there builds Hive (2.1.0-SNAPSHOT) + Tez
(0.8.4-SNAPSHOT), this one has the
LLAP cache configurations turned on.

This is what I use to develop Hive, before there are releases and it will
allow each user
on a shared cluster to maintain their own independent private install of
hive - if you look at
something like the old Spotify Hive query presentations, you'll see that
more people have 
used that to run their own private builds successfully :)

Purely out of laziness, the LLAP configurations in slider-gen.sh (i.e the
Xmx & cache values)
are configured exactly to match my dev cluster - 32 vcore + 256Gb RAM.

Cheers,
Gopal




Re: Anyone successfully deployed Hive on TEZ engine?

2016-05-30 Thread Mich Talebzadeh
thanks Damien.

I tried TEZ 0.8.2 with Hive 2 although I did not persevere.

When you say "Not stable" are you referring to using it with YARN etc.

In short at the simplest set up what Resource Manager it works with?

Cheers

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 30 May 2016 at 08:59, Damien Carol  wrote:

> HIVE 1.2.1 and Tez 0.5.2 or 0.7.0 works pretty well
>
> beginning to use HIVE 2.0.0 and 0.8.x but not stable :/
>
> 2016-05-29 22:26 GMT+02:00 Mich Talebzadeh :
>
>>
>> Please bear in mind that I am talking about your own build not anything
>> comes as part of Vendor's package.
>>
>> If so kindly specify both Hive and TEZ versions.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>


Re: Anyone successfully deployed Hive on TEZ engine?

2016-05-30 Thread Damien Carol
HIVE 1.2.1 and Tez 0.5.2 or 0.7.0 works pretty well

beginning to use HIVE 2.0.0 and 0.8.x but not stable :/

2016-05-29 22:26 GMT+02:00 Mich Talebzadeh :

>
> Please bear in mind that I am talking about your own build not anything
> comes as part of Vendor's package.
>
> If so kindly specify both Hive and TEZ versions.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>