Re: Why does the user need write permission on the location of external hive table?

2016-05-31 Thread Mich Talebzadeh
Right, that directly belongs to hdfs:hdfs and no one else bar that user can
write to it.

if you are connecting via beeline you need to specify the user and password

beeline -u jdbc:hive2://rhes564:10010/default
org.apache.hive.jdbc.HiveDriver -n hduser -p 

When I look at the permissions I see that only hdfs can write to it, not user
Sandeep?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 09:20, Sandeep Giri  wrote:

> Yes, when I run hadoop fs it gives results correctly.
>
> *hadoop fs -ls /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/*
> *Found 30 items*
> *-rw-r--r--   3 hdfs hdfs   6148 2015-12-04 15:19
> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/.DS_Store*
> *-rw-r--r--   3 hdfs hdfs 803323 2015-12-04 15:19
> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670393.gz*
> *-rw-r--r--   3 hdfs hdfs 284355 2015-12-04 15:19
> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670394.gz*
> **
>
>
>
>
> On Tue, May 31, 2016 at 1:42 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> is this location correct and valid?
>>
>> LOCATION '/data/SentimentFiles/*SentimentFiles*/upload/data/tweets_raw/'
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 31 May 2016 at 08:50, Sandeep Giri  wrote:
>>
>>> Hi Hive Team,
>>>
>>> As per my understanding, in Hive, you can create two kinds of tables:
>>> Managed and External.
>>>
>>> In case of managed table, you own the data and hence when you drop the
>>> table the data is deleted.
>>>
>>> In case of external table, you don't have ownership of the data and
>>> hence when you delete such a table, the underlying data is not deleted.
>>> Only metadata is deleted.
>>>
>>> Now, recently i have observed that you can not create an external table
>>> over a location on which you don't have write (modification) permissions in
>>> HDFS. I completely fail to understand this.
>>>
>>> Use case: It is quite common that the data you are churning is huge and
>>> read-only. So, to churn such data via Hive, will you have to copy this huge
>>> data to a location on which you have write permissions?
>>>
>>> Please help.
>>>
>>> My data is located in a hdfs folder
>>> (/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/)  on which I
>>> only have readonly permission. And I am trying to execute the following
>>> command
>>>
>>> *CREATE EXTERNAL TABLE tweets_raw (*
>>> *id BIGINT,*
>>> *created_at STRING,*
>>> *source STRING,*
>>> *favorited BOOLEAN,*
>>> *retweet_count INT,*
>>> *retweeted_status STRUCT<*
>>> *text:STRING,*
>>> *users:STRUCT>,*
>>> *entities STRUCT<*
>>> *urls:ARRAY>,*
>>> *user_mentions:ARRAY>,*
>>> *hashtags:ARRAY>>,*
>>> *text STRING,*
>>> *user1 STRUCT<*
>>> *screen_name:STRING,*
>>> *name:STRING,*
>>> *friends_count:INT,*
>>> *followers_count:INT,*
>>> *statuses_count:INT,*
>>> *verified:BOOLEAN,*
>>> *utc_offset:STRING, -- was INT but nulls are strings*
>>> *time_zone:STRING>,*
>>> *in_reply_to_screen_name STRING,*
>>> *year int,*
>>> *month int,*
>>> *day int,*
>>> *hour int*
>>> *)*
>>> *ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'*
>>> *WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")*
>>> *LOCATION
>>> '/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'*
>>> *;*
>>>
>>> It throws the following error:
>>>
>>> FAILED: Execution Error, 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Mich Talebzadeh
Couple of points if I may and kindly bear with my remarks.

Whilst it will be very interesting to try TEZ with LLAP, as I read from the LLAP announcement:

"Sub-second queries require fast query execution and low setup cost. The
challenge for Hive is to achieve this without giving up on the scale and
flexibility that users depend on. This requires a new approach using a
hybrid engine that leverages Tez and something new called  LLAP (Live Long
and Process, #llap online).

LLAP is an optional daemon process running on multiple nodes, that provides
the following:

   - Caching and data reuse across queries with compressed columnar data
   in-memory (off-heap)
   - Multi-threaded execution including reads with predicate pushdown and
   hash joins
   - High throughput IO using Async IO Elevator with dedicated thread and
   core per disk
   - Granular column level security across applications
   - "

OK, so we have added an in-memory capability to TEZ by way of LLAP. In other
words, what Spark does already, and BTW Spark does not require a daemon running
on any host. Don't get me wrong, it is interesting, but this sounds to me
(without testing myself) like adding a caching capability to TEZ to bring it on
par with SPARK.

Remember:

Spark -> DAG + in-memory caching
TEZ = MR on DAG
TEZ + LLAP => DAG + in-memory caching

OK, it is another way of getting the same result. However, my concerns:


   - Spark has a wide user base. I judge this from Spark user group traffic
   - TEZ user group has no traffic I am afraid
   - LLAP I don't know

It sounds like Hortonworks promote TEZ and Cloudera does not want to know
anything about Hive; they promote Impala instead, but that sounds like a
sinking ship these days.

Having said that I will try TEZ + LLAP :) No pun intended
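For anyone who wants to run the same comparison, switching engines is just a
session-level setting. A minimal sketch (property names as in recent Hive
releases; hive.llap.execution.mode only matters once LLAP daemons are actually
running, so treat the values as assumptions to verify against your build):

SET hive.execution.engine=mr;      -- classic MapReduce
SET hive.execution.engine=tez;     -- TEZ
SET hive.execution.engine=spark;   -- Hive on Spark
SET hive.llap.execution.mode=all;  -- e.g. none | map | all | only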

Regards

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 08:19, Jörn Franke  wrote:

> Thanks very interesting explanation. Looking forward to test it.
>
> > On 31 May 2016, at 07:51, Gopal Vijayaraghavan 
> wrote:
> >
> >
> >> That being said all systems are evolving. Hive supports tez+llap which
> >> is basically the in-memory support.
> >
> > There is a big difference between where LLAP & SparkSQL, which has to do
> > with access pattern needs.
> >
> > The first one is related to the lifetime of the cache - the Spark RDD
> > cache is per-user-session which allows for further operation in that
> > session to be optimized.
> >
> > LLAP is designed to be hammered by multiple user sessions running
> > different queries, designed to automate the cache eviction & selection
> > process. There's no user visible explicit .cache() to remember - it's
> > automatic and concurrent.
> >
> > My team works with both engines, trying to improve it for ORC, but the
> > goals of both are different.
> >
> > I will probably have to write a proper academic paper & get it
> > edited/reviewed instead of sending my ramblings to the user lists like this.
> > Still, this needs an example to talk about.
> >
> > To give a qualified example, let's leave the world of single use clusters
> > and take the use-case detailed here
> >
> > http://hortonworks.com/blog/impala-vs-hive-performance-benchmark/
> >
> >
> > There are two distinct problems there - one is that a single day sees up to
> > 100k independent user sessions running queries and that most queries cover
> > the last hour (& possibly join/compare against a similar hour aggregate
> > from the past).
> >
> > The problem with having independent 100k user-sessions from different
> > connections was that the SparkSQL layer drops the RDD lineage & cache
> > whenever a user ends a session.
> >
> > The scale problem in general for Impala was that even though the data size
> > was in multiple terabytes, the actual hot data was approx <20Gb, which
> > resides on <10 machines with locality.
> >
> > The same problem applies when you apply RDD caching with something
> > un-replicated like Tachyon/Alluxio, since the same RDD will be so
> > exceedingly popular that the machines which hold those blocks run extra hot.
> >
> > A cache model per-user session is entirely wasteful and a common cache +
> > MPP model effectively overloads 2-3% of cluster, while leaving the other
> > machines idle.
> >
> > LLAP was designed specifically to prevent that hotspotting, while
> > maintaining the common cache model - within a few minutes after an hour
> > ticks o

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Mich Talebzadeh
Thanks for that Gopal.

Can LLAP be used as a caching tool for data from an Oracle DB or any RDBMS?

In that case does it use JDBC to get the data out from the underlying DB?



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 21:48, Gopal Vijayaraghavan  wrote:

>
> > but this sounds to me (without testing myself) adding caching capability
> >to TEZ to bring it on par with SPARK.
>
> Nope, that was the crux of the earlier email.
>
> "Caching" seems to be catch-all term misused in that comparison.
>
> >> There is a big difference between where LLAP & SparkSQL, which has to do
> >> with access pattern needs.
>
> On another note, LLAP can actually be used inside Spark as well, just use
> LlapContext instead of HiveContext.
>
>
> <
> http://www.slideshare.net/HadoopSummit/llap-subsecond-analytical-queries-i
> n-hive/30>
>
>
> I even have a Postgres FDW for LLAP, which is mostly used for analytics
> web dashboards which are hooked into Hive.
>
> https://github.com/t3rmin4t0r/llap_fdw
>
>
> LLAP can do 200-400ms queries, but Postgres can get to the sub 10ms when
> it comes to slicing-dicing result sets <100k rows.
>
> Cheers,
> Gopal
>
>
>


Re: [ANNOUNCE] Apache Hive 2.0.1 Released

2016-05-31 Thread Mich Talebzadeh
Thanks Sergey,

Congratulations.

May I add that Hive 0.14 and above can also deploy Spark as its execution
engine, and with Spark on Hive on Spark as the execution engine you have a
winning combination.

BTW we are just discussing the merits of TEZ + LLAP versus Spark as the
execution engine for Hive. With Hive on Spark vs Hive on MapReduce the
performance gains are an order of magnitude.

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 21:39, Sergey Shelukhin  wrote:

> The Apache Hive team is proud to announce the release of Apache Hive
> version 2.0.1.
>
> The Apache Hive (TM) data warehouse software facilitates querying and
> managing large datasets residing in distributed storage. Built on top of
> Apache Hadoop (TM), it provides:
>
> * Tools to enable easy data extract/transform/load (ETL)
>
> * A mechanism to impose structure on a variety of data formats
>
> * Access to files stored either directly in Apache HDFS (TM) or in other
> data storage systems such as Apache HBase (TM)
>
> * Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.
>
> For Hive release details and downloads, please visit:
> https://hive.apache.org/downloads.html
>
> Hive 2.0.1 Release Notes are available here:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12334886&sty
> leName=Text&projectId=12310843
>
> We would like to thank the many contributors who made this release
> possible.
>
> Regards,
>
> The Apache Hive Team
>
>
>


Fwd: [ANNOUNCE] Apache Hive 2.0.1 Released

2016-05-31 Thread Mich Talebzadeh
Thanks Sergey,

Congratulations.

May I add that Hive 0.14 and above can also deploy Spark as its execution
engine, and with Spark on Hive on Spark as the execution engine you have a
winning combination.

BTW we are just discussing the merits of TEZ + LLAP versus Spark as the
execution engine for Hive. With Hive on Spark vs Hive on MapReduce the
performance gains are an order of magnitude.

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 21:39, Sergey Shelukhin  wrote:

> The Apache Hive team is proud to announce the release of Apache Hive
> version 2.0.1.
>
> The Apache Hive (TM) data warehouse software facilitates querying and
> managing large datasets residing in distributed storage. Built on top of
> Apache Hadoop (TM), it provides:
>
> * Tools to enable easy data extract/transform/load (ETL)
>
> * A mechanism to impose structure on a variety of data formats
>
> * Access to files stored either directly in Apache HDFS (TM) or in other
> data storage systems such as Apache HBase (TM)
>
> * Query execution via Apache Hadoop MapReduce and Apache Tez frameworks.
>
> For Hive release details and downloads, please visit:
> https://hive.apache.org/downloads.html
>
> Hive 2.0.1 Release Notes are available here:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12334886&sty
> leName=Text&projectId=12310843
> <https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12334886&styleName=Text&projectId=12310843>
>
> We would like to thank the many contributors who made this release
> possible.
>
> Regards,
>
> The Apache Hive Team
>
>
>


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Mich Talebzadeh
Thanks Gopal.

SAP Replication Server (SRS) does it to Hive in real time as well. That is the
main advantage of replication: it is real time. It picks up committed data
from the log and sends it to Hive as well. Also it is way ahead of Sqoop,
which only really does the initial load. It does 10k rows at a time with
insert into the Hive table. The Hive table cannot be transactional to start with.
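For context, a transactional Hive table is declared roughly as in the hedged
sketch below (table and column names are made up, and the server also needs
hive.support.concurrency and the DbTxnManager enabled); the replication target
above has to be a plain table without these properties:

CREATE TABLE t_txn (
  id  INT,
  val STRING
)
CLUSTERED BY (id) INTO 8 BUCKETS        -- ACID tables must be bucketed ORC
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

The replication trace below shows the bulk inserts going into a plain,
non-transactional table.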

I. 2016/04/08 09:38:23. REPLICATE Replication Server: Dropped subscription
<102_105_t> for replication definition <102_t> with replicate at

I. 2016/04/08 09:38:31. REPLICATE Replication Server: Creating subscription
<102_105_t> for replication definition <102_t> with replicate at

I. 2016/04/08 09:38:31. PRIMARY Replication Server: Creating subscription
<102_105_t> for replication definition <102_t> with replicate at

T. 2016/04/08 09:38:32. (84): Command sent to 'SYB_157.scratchpad':
T. 2016/04/08 09:38:32. (84): 'begin transaction  '
T. 2016/04/08 09:38:32. (84): Command sent to 'SYB_157.scratchpad':
T. 2016/04/08 09:38:32. (84): 'select  count (*) from t  '
T. 2016/04/08 09:38:34. (84): Command sent to 'SYB_157.scratchpad':
T. 2016/04/08 09:38:34. (84): 'select OWNER, OBJECT_NAME, SUBOBJECT_NAME,
OBJECT_ID, DATA_OBJECT_ID, OBJECT_TYPE, CREATED, LAST_DDL_TIME, TIMESTAMP2,
STATUS, TEMPORARY2, GENERATED, SECONDARY, NAMESPACE, EDITION_NA
ME, PADDING1, PADDING2, ATTRIBUTE from t  '
T. 2016/04/08 09:39:54. (86): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:39:54. (86): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:40:12. (89): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:40:12. (89): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:40:34. (87): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:40:34. (87): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:40:52. (88): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:40:52. (88): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:41:11. (90): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:41:11. (90): 'Bulk insert table 't' ( rows affected)'
T. 2016/04/08 09:41:56. (86): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:41:56. (86): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:42:30. (87): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:42:30. (87): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:42:53. (89): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:42:53. (89): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:43:14. (90): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:43:14. (90): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:43:33. (88): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:43:33. (88): 'Bulk insert table 't' (10000 rows affected)'
T. 2016/04/08 09:44:25. (86): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:44:25. (86): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:44:44. (89): Command sent to 'hiveserver2.asehadoop':
T. 2016/04/08 09:44:44. (89): 'Bulk insert table 't' (1 rows affected)'
T. 2016/04/08 09:45:37. (90): Command sent to 'hiveserver2.asehadoop':

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 31 May 2016 at 22:18, Gopal Vijayaraghavan  wrote:

>
> > Can LLAP be used as a caching tool for data from Oracle DB or any RDBMS.
>
> No, LLAP intermediates HDFS. It holds column & index data streams as-is
> (i.e dictionary encoding, RLE, bloom filters etc are preserved).
>
> Because it does not cache row-tuples, it cannot exist as a caching tool
> for another RDBMS.
>
> I have heard of Oracle GoldenGate replicating into Hive, but it is not
> without its own pains of schema compat.
>
> Cheers,
> Gopal
>
>
>
>


Re: why does HIVE can run normally without starting yarn?

2016-06-01 Thread Mich Talebzadeh
Hi,

Hadoop core comes with HDFS, map-reduce and Yarn.

When you start the core you tend to start HDFS, the YARN resourcemanager on
the resourcemanager node and the YARN nodemanager on all slaves. You can also
start the history server as an option:

start-dfs.sh
## Start YARN daemons
## Start the resourcemanager daemon
## ONLY ON THE RESOURCEMANAGER NODE!
#
yarn-daemon.sh start resourcemanager
#
## Start the nodemanager daemon
# ON ALL SLAVES
#
yarn-daemon.sh start nodemanager
#
mr-jobhistory-daemon.sh start historyserver

Now yarn is the chosen resource manager.

In the slaves file under $HADOOP_HOME/etc/hadoop you tell HDFS where to start
all the datanodes. For example I have a two-node cluster:

cat slaves
rhes564
rhes5

Now if you just start HDFS it will still go and look at the slaves file to
find the datanodes; regardless of YARN, it will start the namenode and the
datanodes!

hduser@rhes564:: :/home/hduser/dba/bin> start-dfs.sh
16/06/01 12:08:23 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Starting namenodes on [rhes564]



rhes564: starting namenode, logging to
/home/hduser/hadoop-2.6.0/logs/hadoop-hduser-namenode-rhes564.out
rhes564: starting datanode, logging to
/home/hduser/hadoop-2.6.0/logs/hadoop-hduser-datanode-rhes564.out
rhes5: starting datanode, logging to
/home/hduser/hadoop-2.6.0/logs/hadoop-hduser-datanode-rhes5.out

So you still have it.

In general you should deploy a resource manager. I do not use MapReduce; I use
Spark as the execution engine, and it runs in yarn-client mode in my case.
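To answer Joseph's question below directly: whether those jobs run locally or
on the cluster is governed by mapreduce.framework.name. A quick check from a
Hive session (standard Hadoop 2.x property names; a hedged example, values on
your install may differ):

hive> set mapreduce.framework.name;   -- yarn, local or classic
hive> set hive.execution.engine;      -- mr, tez or spark

If it comes back as local, the map-reduce work runs inside the local JVM (the
LocalJobRunner) and does not need the YARN daemons at all.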

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 1 June 2016 at 03:09, Joseph  wrote:

>
> Hi all,
>
> I use hadoop 2.7.2, and I just start HDFS, then I can submit mapreduce
> jobs and run HIVE 1.2.1.  Do the jobs just execute locally If I don't
> start YARN?
>
> --
> Joseph
>


Spark support for update/delete operations on Hive ORC transactional tables

2016-06-02 Thread Mich Talebzadeh
Hi,

Spark does not support transactions because, as I understand it, there is a
piece on the execution side that needs to send heartbeats to the Hive metastore
saying "a transaction is still alive". That has not been implemented in Spark
yet to my knowledge.

Any idea on the timelines for when we are going to have support for
transactions in Spark for Hive ORC tables? This will really be useful.


Thanks,


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-02 Thread Mich Talebzadeh
thanks for that.

I will have a look

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 2 June 2016 at 10:46, Elliot West  wrote:

> Related to this, there exists an API in Hive to simplify the integrations
> of other frameworks with Hive's ACID feature:
>
> See:
> https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API
>
> It contains code for maintaining heartbeats, handling locks and
> transactions, and submitting mutations in a distributed environment.
>
> We have used it to write to transactional tables from Cascading based
> processes.
>
> Elliot.
>
>
> On 2 June 2016 at 09:54, Mich Talebzadeh 
> wrote:
>
>>
>> Hi,
>>
>> Spark does not support transactions because as I understand there is a
>> piece in the execution side that needs to send heartbeats to Hive metastore
>> saying a transaction is still alive". That has not been implemented in
>> Spark yet to my knowledge."
>>
>> Any idea on the timelines when we are going to have support for
>> transactions in Spark for Hive ORC tables. This will really be useful.
>>
>>
>> Thanks,
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>


Re: Convert date in string format to timestamp in table definition

2016-06-04 Thread Mich Talebzadeh
or just create an internal table and do an insert/select from the external
table into that table, as Dudu mentioned:

hive> use test;
OK
hive> desc mytime;
OK
adddate timestamp

hive> insert into
> test.mytime
> select cast(concat_ws(' ',substring
("2016-05-17T02:10:44.527",1,10),substring ("2016-05-17T02:10:44.527",12))
as timestamp)  as adddate;

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 4 June 2016 at 21:31, Igor Kravzov  wrote:

> Thanks Dudu.
> So if I need actual date I will use view.
> Regarding partition column:  I can create 2 external tables based on the
> same data with integer or string column partition and see which one is more
> convenient for our use.
>
> On Sat, Jun 4, 2016 at 2:20 PM, Markovitz, Dudu 
> wrote:
>
>> I'm not aware of an option to do what you request in the external table
>> definition, but you might want to do that using a view.
>>
>>
>>
>> P.S.
>>
>> It seems to me that defining the partition column as a string would be
>> more user friendly than an integer, e.g. --
>>
>> select * from threads_test where yyyymmdd like '2016%'    -- year 2016;
>>
>> select * from threads_test where yyyymmdd like '201603%'  -- March 2016;
>>
>> select * from threads_test where yyyymmdd like '______01' -- first of
>> every month;
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> $ hdfs dfs -ls -R /tmp/threads_test
>>
>> drwxr-xr-x   - cloudera supergroup  0 2016-06-04 10:45
>> /tmp/threads_test/20160604
>>
>> -rw-r--r--   1 cloudera supergroup136 2016-06-04 10:45
>> /tmp/threads_test/20160604/data.txt
>>
>>
>>
>> $ hdfs dfs -cat /tmp/threads_test/20160604/data.txt
>>
>> {"url":"www.blablabla.com
>> ","pageType":"pg1","addDate":"2016-05-17T02:10:44.527","postDate":"2016-05-16T02:08:55","postText":"YadaYada"}
>>
>>
>>
>>
>> 
>>
>>
>>
>>
>>
>> hive> add jar
>> /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
>>
>>
>>
>> hive>
>>
>> create external table threads_test
>> (
>>     url       string
>>    ,pagetype  string
>>    ,adddate   string
>>    ,postdate  string
>>    ,posttext  string
>> )
>> partitioned by (yyyymmdd string)
>>
>> row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
>>
>> location '/tmp/threads_test'
>>
>> ;
>>
>>
>>
>> hive> alter table threads_test add partition (yyyymmdd=20160604) location
>> '/tmp/threads_test/20160604';
>>
>>
>>
>> hive> select * from threads_test;
>>
>>
>>
>> www.blablabla.com    pg1    2016-05-17T02:10:44.527    2016-05-16T02:08:55    YadaYada    20160604
>>
>>
>>
>> hive>
>>
>> create view threads_test_v
>> as
>> select   url
>>         ,pagetype
>>         ,cast (concat_ws(' ',substr (adddate ,1,10),substr (adddate ,12)) as timestamp)  as adddate
>>         ,cast (concat_ws(' ',substr (postdate,1,10),substr (postdate,12)) as timestamp)  as postdate
>>         ,posttext
>> from     threads_test
>> ;
>>
>>
>>
>> hive> select * from threads_test_v;
>>
>>
>>
>> www.blablabla.com    pg1    2016-05-17 02:10:44.527    2016-05-16 02:08:55    YadaYada
>>
>>
>>
>>
>>
>> *From:* Igor Kravzov [mailto:igork.ine...@gmail.com]
>> *Sent:* Saturday, June 04, 2016 8:13 PM
>> *To:* user@hive.apache.org
>> *Subject:* Convert date in string format to timestamp in table definition
>>
>>
>>
>> Hi,
>>
>>
>>
>> I have 2 dates in Json file defined like this
>>
>> "addDate": "2016-05-17T02:10:44.527",
>>
>>   "postDate": "2016-05-16T02:08:55",
>>
>>
>>
>> Right now I define external table based on this file like this:
>>
>> CREATE external TABLE threads_test
>>
>> (url string,
>>
>>  pagetype string,
>>
>>  adddate string,
>>
>>  postdate string,
>>
>>  posttext string)
>>
>> partitioned by (yyyymmdd int)
>>
>> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
>>
>> location 'my location';
>>
>>
>>
>> is it possible to define these 2 dates as timestamp?
>>
>> Do I need to change date format in the file? is it possible to specify
>> date format in table definition?
>>
>> Or I better off with string?
>>
>>
>>
>> Thanks in advance.
>>
>
>


Re: alter partitions on hive external table

2016-06-06 Thread Mich Talebzadeh
That order datetime/userid/customerId looks more natural to me.

Two questions:

What is the type of table in Hive?

Are you doing this for certain queries where you think userid, as the most
significant column, is going to help the queries?

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 6 June 2016 at 04:02, raj hive  wrote:

> Hi friends,
>
> I have created partitions on hive external tables. partitions on
> datetime/userid/customerId.
>
> now i have to change the order of the partitions for the existing data for
> all the dates.
>
> order of the partition is custerid/userid/datetime.
>
> Anyone can help me, how to alter the partitions for the existing table.
> Need a help to write a script to change the partions on existing data.
> almost 3 months data is there to modify as per new partition so changing
> each date is difficult. Any expert can help me.
>
> Thanks
> Raj
>


Re: alter partitions on hive external table

2016-06-06 Thread Mich Talebzadeh
so you are doing this for partition elimination?

it is a tough call whatever you do

Since userid is unique you can try

CLUSTERED BY (userid,datetime,customerid) INTO 256 BUCKETS

or try creating a new table based on the new partition column order and
insert/select part of the data to see if it actually improves performance.

I much doubt that whichever way you go it is really going to have that much
impact on your performance.
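As a rough sketch of the second option (all table and column names below are
hypothetical, and a dynamic-partition insert over three partition levels can
create a very large number of partitions, so check
hive.exec.max.dynamic.partitions first):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE EXTERNAL TABLE events_new (
  event_id STRING,
  payload  STRING
)
PARTITIONED BY (customerid STRING, userid STRING, datetime STRING)
STORED AS ORC
LOCATION '/data/events_new';

-- partition columns go last in the SELECT, in the declared order
INSERT OVERWRITE TABLE events_new PARTITION (customerid, userid, datetime)
SELECT event_id, payload, customerid, userid, datetime
FROM   events_old;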

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 6 June 2016 at 08:18, raj hive  wrote:

> Hi Mich,
>
> table type is external table. Yes, I am doing this for certain queries
> where userid as the most significant column.
>
> On Mon, Jun 6, 2016 at 12:35 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> That order datetime/userid/customerId looks more natural to me.
>>
>> Two questions:
>>
>> What is the type of table in Hive?
>>
>> Are you doing this for certain queries where you think userid as the most
>> significant column is going to help queries better?
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 6 June 2016 at 04:02, raj hive  wrote:
>>
>>> Hi friends,
>>>
>>> I have created partitions on hive external tables. partitions on
>>> datetime/userid/customerId.
>>>
>>> now i have to change the order of the partitions for the existing data
>>> for all the dates.
>>>
>>> order of the partition is custerid/userid/datetime.
>>>
>>> Anyone can help me, how to alter the partitions for the existing table.
>>> Need a help to write a script to change the partions on existing data.
>>> almost 3 months data is there to modify as per new partition so changing
>>> each date is difficult. Any expert can help me.
>>>
>>> Thanks
>>> Raj
>>>
>>
>>
>


Re: Why does the user need write permission on the location of external hive table?

2016-06-06 Thread Mich Talebzadeh
Well Sandeep, the permission model on HDFS resembles that of the Linux file
system.

For security reasons it does not allow you to write to that file. An external
table in Hive is just an interface.

Any reason why you have not got access to that file? Can you try to log in
with beeline with a username and password?

The data is immutable. What is the use case for this table? Are you going to
use the data later in an app/Hive, and if so do you have permission to read it?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 6 June 2016 at 16:59, Sandeep Giri  wrote:

> Yes, Mich that's right. That folder is read-only to me.
>
> That's my question. Why do we need modification permissions on the location
> while creating external table.
>
> This data is read-only. In hive, how can we process the huge data on which
> we don't have write permissions? Is cloning this data the only possibility?
> On May 31, 2016 3:15 PM, "Mich Talebzadeh" 
> wrote:
>
>> right that directly belongs to hdfs:hdfs and nonone else bar that user
>> can write to it.
>>
>> if you are connecting via beeline you need to specify the user and
>> password
>>
>> beeline -u jdbc:hive2://rhes564:10010/default
>> org.apache.hive.jdbc.HiveDriver -n hduser -p xxxx
>>
>> When I look at permissioning I see only hdfs can write to it not user
>> Sandeep?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 31 May 2016 at 09:20, Sandeep Giri  wrote:
>>
>>> Yes, when I run hadoop fs it gives results correctly.
>>>
>>> *hadoop fs -ls
>>> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/*
>>> *Found 30 items*
>>> *-rw-r--r--   3 hdfs hdfs   6148 2015-12-04 15:19
>>> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/.DS_Store*
>>> *-rw-r--r--   3 hdfs hdfs 803323 2015-12-04 15:19
>>> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670393.gz*
>>> *-rw-r--r--   3 hdfs hdfs 284355 2015-12-04 15:19
>>> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670394.gz*
>>> **
>>>
>>>
>>>
>>>
>>> On Tue, May 31, 2016 at 1:42 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> is this location correct and valid?
>>>>
>>>> LOCATION '/data/SentimentFiles/*SentimentFiles*/upload/data/
>>>> tweets_raw/'
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 31 May 2016 at 08:50, Sandeep Giri  wrote:
>>>>
>>>>> Hi Hive Team,
>>>>>
>>>>> As per my understanding, in Hive, you can create two kinds of tables:
>>>>> Managed and External.
>>>>>
>>>>> In case of managed table, you own the data and hence when you drop the
>>>>> table the data is deleted.
>>>>>
>>>>> In case of external table, you don't have ownership of the data and
>>>>> hence when you delete such a table, the underlying data is not deleted.
>>>>> Only metadata is deleted.
>>>>>
>>>>> Now, recently i have observed that you can not create an external
>>>>> table over a location on which you don't have write (modification)
>>>>> permissions in HDFS. I completely fail to understand this.
>>>>>
>>>>> Use case: It is quite common that the data you are churning is huge
>>>>> and read-only. So, to churn such data via Hive, will you have to copy this
>>>>> huge data to a location on which you have write permissions?
>>>>>
>>>>> Please help.
>>>>

Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-06 Thread Mich Talebzadeh
iveContext.scala:331)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:35)
at $iwC$$iwC$$iwC$$iwC.(:37)
at $iwC$$iwC$$iwC.(:39)
at $iwC$$iwC.(:41)
at $iwC.(:43)
at (:45)
at .(:49)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at
org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 6 June 2016 at 19:15, Alan Gates  wrote:

> This JIRA https://issues.apache.org/jira/browse/HIVE-12366 moved the
> heartbeat logic from the engine to the client.  AFAIK this was the only
> issue preventing working with Spark as an engine.  That JIRA was released
> in 2.0.
>
> I want to stress that to my knowledge no one has tested this combination
> of features, so there may be other problem.  But at least this issue has
> been resolved.
>
> Alan.
>
> > On Jun 2, 2016, at 01:54, Mich Talebzadeh 
> wrote:
> >
> >
> > Hi,
> >
> > Spark does not support transactions because as I understand there is a
> piece in the execution side that needs to send heartbeats to Hive metastore
> saying a transaction is still alive". That has not been implemented in
> Spark yet to my knowledge."
> >
> > Any idea on the timelines when we are going to have support for
> transactions in Spark for Hive ORC tables. This will really be useful.
> >
> >
> > Thanks,
> >
> >
> > Dr Mich Talebzadeh
> >
> > LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> > http://talebzadehmich.wordpress.com
> >
>
>


Re: Why does the user need write permission on the location of external hive table?

2016-06-06 Thread Mich Talebzadeh
Hi Sandeep.

I tend to use Hive external tables as staffing tables, but I will still
require write access to HDFS.

Zip files work OK as well. For example our CSV files are zipped using bzip2
to save space

However, you may try a temporary workaround by disabling permission checking
in $HADOOP_HOME/etc/hadoop/hdfs-site.xml:


<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>


There are other ways as well.

Check this

http://stackoverflow.com/questions/11593374/permission-denied-at-hdfs

HTH






Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 6 June 2016 at 21:00, Igor Kravzov  wrote:

> I see file are with extension .gz. Are these zipped?
> Did you try with unzipped files?
> Maybe in order to read the data hive needs to unzip files but does not
> have write permission?
> Just a wild guess...
>
> On Tue, May 31, 2016 at 4:20 AM, Sandeep Giri 
> wrote:
>
>> Yes, when I run hadoop fs it gives results correctly.
>>
>> *hadoop fs -ls
>> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/*
>> *Found 30 items*
>> *-rw-r--r--   3 hdfs hdfs   6148 2015-12-04 15:19
>> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/.DS_Store*
>> *-rw-r--r--   3 hdfs hdfs 803323 2015-12-04 15:19
>> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670393.gz*
>> *-rw-r--r--   3 hdfs hdfs 284355 2015-12-04 15:19
>> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670394.gz*
>> **
>>
>>
>>
>>
>> On Tue, May 31, 2016 at 1:42 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> is this location correct and valid?
>>>
>>> LOCATION '/data/SentimentFiles/*SentimentFiles*/upload/data/tweets_raw/'
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 31 May 2016 at 08:50, Sandeep Giri  wrote:
>>>
>>>> Hi Hive Team,
>>>>
>>>> As per my understanding, in Hive, you can create two kinds of tables:
>>>> Managed and External.
>>>>
>>>> In case of managed table, you own the data and hence when you drop the
>>>> table the data is deleted.
>>>>
>>>> In case of external table, you don't have ownership of the data and
>>>> hence when you delete such a table, the underlying data is not deleted.
>>>> Only metadata is deleted.
>>>>
>>>> Now, recently i have observed that you can not create an external table
>>>> over a location on which you don't have write (modification) permissions in
>>>> HDFS. I completely fail to understand this.
>>>>
>>>> Use case: It is quite common that the data you are churning is huge and
>>>> read-only. So, to churn such data via Hive, will you have to copy this huge
>>>> data to a location on which you have write permissions?
>>>>
>>>> Please help.
>>>>
>>>> My data is located in a hdfs folder
>>>> (/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/)  on which I
>>>> only have readonly permission. And I am trying to execute the following
>>>> command
>>>>
>>>> *CREATE EXTERNAL TABLE tweets_raw (*
>>>> *id BIGINT,*
>>>> *created_at STRING,*
>>>> *source STRING,*
>>>> *favorited BOOLEAN,*
>>>> *retweet_count INT,*
>>>> *retweeted_status STRUCT<*
>>>> *text:STRING,*
>>>> *users:STRUCT>,*
>>>> *entities STRUCT<*
>>>> *urls:ARRAY>,*
>>>> *user_mentions:ARRAY>,*
>>>> *hashtags:ARRAY>>,*
>>>> *text STRING,*
>>>> *user1 STRUCT<*
>>>> *screen_name:STRING,*
>>>> *name:STRING,*
>>>> *friends_count:INT,*
>>>> *followers_count:INT,*
>>>> *statuses_count:INT,*
>>>> *verified:BOOLEAN,*
>>>> *utc_offs

Re: Why does the user need write permission on the location of external hive table?

2016-06-06 Thread Mich Talebzadeh
sorry, should read *staging* tables.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 6 June 2016 at 21:14, Mich Talebzadeh  wrote:

> Hi Sandeep.
>
> I tend to use Hive external tables as staffing tables but still I will
> require access writes to hdfs.
>
> Zip files work OK as well. For example our CSV files are zipped using
> bzip2 to save space
>
> However, you may request a temporary solution by disabling permission in
> $HADOOP_HOME/etc/Hadoop/hdfs-site.xml
>
> 
> <property>
>   <name>dfs.permissions</name>
>   <value>false</value>
> </property>
> 
>
> There are other ways as well.
>
> Check this
>
> http://stackoverflow.com/questions/11593374/permission-denied-at-hdfs
>
> HTH
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 6 June 2016 at 21:00, Igor Kravzov  wrote:
>
>> I see file are with extension .gz. Are these zipped?
>> Did you try with unzipped files?
>> Maybe in order to read the data hive needs to unzip files but does not
>> have write permission?
>> Just a wild guess...
>>
>> On Tue, May 31, 2016 at 4:20 AM, Sandeep Giri 
>> wrote:
>>
>>> Yes, when I run hadoop fs it gives results correctly.
>>>
>>> *hadoop fs -ls
>>> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/*
>>> *Found 30 items*
>>> *-rw-r--r--   3 hdfs hdfs   6148 2015-12-04 15:19
>>> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/.DS_Store*
>>> *-rw-r--r--   3 hdfs hdfs 803323 2015-12-04 15:19
>>> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670393.gz*
>>> *-rw-r--r--   3 hdfs hdfs 284355 2015-12-04 15:19
>>> /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670394.gz*
>>> **
>>>
>>>
>>>
>>>
>>> On Tue, May 31, 2016 at 1:42 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> is this location correct and valid?
>>>>
>>>> LOCATION '/data/SentimentFiles/*SentimentFiles*/upload/data/
>>>> tweets_raw/'
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 31 May 2016 at 08:50, Sandeep Giri  wrote:
>>>>
>>>>> Hi Hive Team,
>>>>>
>>>>> As per my understanding, in Hive, you can create two kinds of tables:
>>>>> Managed and External.
>>>>>
>>>>> In case of managed table, you own the data and hence when you drop the
>>>>> table the data is deleted.
>>>>>
>>>>> In case of external table, you don't have ownership of the data and
>>>>> hence when you delete such a table, the underlying data is not deleted.
>>>>> Only metadata is deleted.
>>>>>
>>>>> Now, recently i have observed that you can not create an external
>>>>> table over a location on which you don't have write (modification)
>>>>> permissions in HDFS. I completely fail to understand this.
>>>>>
>>>>> Use case: It is quite common that the data you are churning is huge
>>>>> and read-only. So, to churn such data via Hive, will you have to copy this
>>>>> huge data to a location on which you have write permissions?
>>>>>
>>>>> Please help.
>>>>>
>>>>> My data is located in a hdfs folder
>>>>> (/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/)  on which I
>>>>> only have readonly permission. And I am trying to execute the following
>>>>> command
>>>>>
>>>>> *CREATE EXTERNAL TABLE tweets_raw (*
>>>>> *id BIGINT,*
>>>>> *created_at STRING,*

Re: Why does the user need write permission on the location of external hive table?

2016-06-06 Thread Mich Talebzadeh
Hi Igor,

Hive can read from zipped files. If you are getting a lot of external files
it makes sense to zip them and store them in a staging HDFS directory.

1) download say these csv files into your local file system and use bzip2
to zip them as part of ETL

 ls -l
total 68
-rw-r--r-- 1 hduser hadoop 7334 Apr 25 11:29 nw_2011.csv.bz2
-rw-r--r-- 1 hduser hadoop 6235 Apr 25 11:29 nw_2012.csv.bz2
-rw-r--r-- 1 hduser hadoop 5476 Apr 25 11:29 nw_2013.csv.bz2
-rw-r--r-- 1 hduser hadoop 2725 Apr 25 11:29 nw_2014.csv.bz2
-rw-r--r-- 1 hduser hadoop 1868 Apr 25 11:29 nw_2015.csv.bz2
-rw-r--r-- 1 hduser hadoop  693 Apr 25 11:29 nw_2016.csv.bz2

Then put these files in a staging directory on HDFS using a shell script:


for FILE in `ls *.*|grep -v .ksh`
do
  echo "Bzipping ${FILE}"
  /usr/bin/bzip2 ${FILE}
  hdfs dfs -copyFromLocal ${FILE}.bz2 ${TARGETDIR}
done

OK now the files are saved in ${TARGETDIR}

Now create the external table looking at this staging directory. *No need
to tell hive that these files are compressed*. It knows how to handle it.
They are stored as textfiles


DROP TABLE IF EXISTS stg_t2;
CREATE EXTERNAL TABLE stg_t2 (
 INVOICENUMBER string
,PAYMENTDATE string
,NET string
,VAT string
,TOTAL string
)
COMMENT 'from csv file from excel sheet nw_10124772'
ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

STORED AS TEXTFILE
LOCATION '/data/stg/accounts/nw/10124772'
TBLPROPERTIES ("skip.header.line.count"="1")

Now create the Hive table internally. Note that I want this data to be
compressed. You will tell it to compress the table with ZLIB or SNAPPY


DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (
 INVOICENUMBER  INT
,PAYMENTDATE    date
,NET            DECIMAL(20,2)
,VAT            DECIMAL(20,2)
,TOTAL          DECIMAL(20,2)
)
COMMENT 'from csv file from excel sheet nw_10124772'
CLUSTERED BY (INVOICENUMBER) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" )

Put the data in the target table, do the conversion and ignore empty rows:

INSERT INTO TABLE t2
SELECT
  INVOICENUMBER
,
TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd'))
AS paymentdate
, CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
, CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
, CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
FROM
stg_t2
WHERE
--INVOICENUMBER > 0 AND
CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > 0.0
-- Exclude empty rows
;

So pretty straight forward.

Now to your question

"it will affect performance, correct?"


Compression is a well-established technique. It has been around in databases
for a long time; almost all RDBMSs (Oracle, Sybase etc.) can compress the data
in the database and in backups through an option. Compression is more CPU
intensive than no compression. However, the database will handle the
conversion of data from compressed to uncompressed when you read it. So yes,
there is a performance price to pay, albeit a small one, in using more CPU to
uncompress the data and present it. However, that is a small price to pay to
reduce the storage cost for the data.
HTH











Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 6 June 2016 at 23:18, Igor Kravzov  wrote:

> Mich, will Hive automatically detect and unzip zipped files? Or is there a
> special option in the table configuration?
> It will affect performance, correct?
>
> On Mon, Jun 6, 2016 at 4:14 PM, Mich Talebzadeh  > wrote:
>
>> Hi Sandeep.
>>
>> I tend to use Hive external tables as staffing tables but still I will
>> require access writes to hdfs.
>>
>> Zip files work OK as well. For example our CSV files are zipped using
>> bzip2 to save space
>>
>> However, you may request a temporary solution by disabling permission in
>> $HADOOP_HOME/etc/Hadoop/hdfs-site.xml
>>
>> 
>> <property>
>>   <name>dfs.permissions</name>
>>   <value>false</value>
>> </property>
>> 
>>
>> There are other ways as well.
>>
>> Check this
>>
>> http://stackoverflow.com/questions/11593374/permission-denied-at-hdfs
>>
>> HTH
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
&g

Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
Hi,

I noticed an issue with Spark creating and populating a Hive table.

The process as I see is as follows:


   1. Spark creates the Hive table. In this case an ORC table in a Hive
      database.
   2. Spark uses a JDBC connection to get data out from an Oracle DB.
   3. I create a temp table in Spark through registerTempTable.
   4. Spark populates that table. That table is actually created in

      hdfs dfs -ls /tmp/hive/hduser
      drwx------   - hduser supergroup
      /tmp/hive/hduser/b1ea6829-790f-4b37-a0ff-3ed218388059

   5. However, the original table itself does not have any locking on it!
   6. I log in to Hive and drop that table:

      hive> drop table dummy;
      OK

   7. That table is dropped OK.
   8. Spark crashes with the message below.

Started at
[08/06/2016 18:37:53.53]
16/06/08 19:13:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on
/user/hive/warehouse/oraclehadoop.db/dummy/.hive-staging_hive_2016-06-08_18-38-08_804_3299712811201460314-1/-ext-1/_temporary/0/_temporary/attempt_201606081838_0001_m_00_0/part-0
(inode 831621): File does not exist. Holder
DFSClient_NONMAPREDUCE_-1836386597_1 does not have any open files.
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3313)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3169)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy22.addBlock(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy23.addBlock(Unknown Source)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
16/06/08 19:13:46 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times;
aborting job

Suggested solution:
In a concurrent environment, Spark should apply locks in order to prevent such
operations. Locks are kept in the Hive metadata table HIVE_LOCKS.
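For anyone reproducing this, the lock state can also be inspected from the
Hive side while the Spark job is running; a minimal check, assuming the
DbTxnManager is configured and running from the database that owns the table
in the example above:

hive> SHOW LOCKS dummy EXTENDED;

If Spark had registered a lock it would show up here; in the steps above
nothing does, which is the gap being reported.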

HTH
Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
Hi,


The idea of accessing the Hive metadata is to be aware of concurrency.



In general, if I do the following in Hive



hive> create table test.dummy as select * from oraclehadoop.dummy;



We can see that Hive applies the locks:



[image: Inline images 2]




However, there seems to be an issue. *I do not see any exclusive lock on
the target table* (i.e. test.dummy). The locking type SHARED_READ on source
table oraclehadoop.dummy looks OK



 One can see the locks in the Hive metastore database:



[image: Inline images 1]



So there are few issues here:


   1. With Hive -> The source table is locked as SHARED_READ
   2. With Spark --> No locks at all
   3. With HIVE --> No locks on the target table
   4. With Spark --> No locks at all

 HTH







Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 8 June 2016 at 20:22, David Newberger 
wrote:

> Could you be looking at 2 jobs trying to use the same file and one getting
> to it before the other and finally removing it?
>
>
>
> *David Newberger*
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Wednesday, June 8, 2016 1:33 PM
> *To:* user; user @spark
> *Subject:* Creating a Hive table through Spark and potential locking
> issue (a bug)
>
>
>
>
>
> Hi,
>
>
>
> I noticed an issue with Spark creating and populating a Hive table.
>
>
>
> The process as I see is as follows:
>
>
>
>1. Spark creates the Hive table. In this case an ORC table in a Hive
>Database
>2. Spark uses JDBC connection to get data out from an Oracle
>3. I create a temp table in Spark through (registerTempTable)
>4. Spark populates that table. That table is actually created in
>
>hdfs dfs -ls /tmp/hive/hduser
>
>drwx--   - hduser supergroup
>
>/tmp/hive/hduser/b1ea6829-790f-4b37-a0ff-3ed218388059
>
>
>
>
>
>1. However, The original table itself does not have any locking on it!
>2. I log in into Hive and drop that table
>
> 3. hive> drop table dummy;
>
> OK
>
>
>
>1.  That table is dropped OK
>2. Spark crashes with message
>
> Started at
> [08/06/2016 18:37:53.53]
> 16/06/08 19:13:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID
> 1)
>
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /user/hive/warehouse/oraclehadoop.db/dummy/.hive-staging_hive_2016-06-08_18-38-08_804_3299712811201460314-1/-ext-1/_temporary/0/_temporary/attempt_201606081838_0001_m_00_0/part-0
> (inode 831621): File does not exist. Holder
> DFSClient_NONMAPREDUCE_-1836386597_1 does not have any open files.
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3516)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3313)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3169)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1468)
> at org.apache.hadoop.ipc.Client.call(Client.java:1399)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy22.addBlock(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
> at sun.reflect.GeneratedMe

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
Hive version is 2

We can discuss all sorts of scenarios. However, Hive is pretty good at
applying the locks at both the table and partition level. The idea of
having the metadata is to enforce these rules.

[image: Inline images 1]

For example, the insert above from a source table into a target table partitioned by
(year, month) shows that the locks are applied correctly.

This is Hive running on the Spark engine. The crucial point is that Hive
accesses its metadata and updates its HIVE_LOCKS table. Again, one can see
this from the data held in that table in the metastore:

[image: Inline images 2]

So I think there is a genuine issue here

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 8 June 2016 at 22:36, Michael Segel  wrote:

> Hi,
>
> Lets take a step back…
>
> Which version of Hive?
>
> Hive recently added transaction support so you have to know your isolation
> level.
>
> Also are you running spark as your execution engine, or are you talking
> about a spark app running w a hive context and then you drop the table from
> within a Hive shell while the spark app is still running?
>
> You also have two different things happening… you’re mixing a DDL with a
> query.  How does hive know you have another app reading from the table?
> I mean what happens when you try a select * from foo; and in another shell
> try dropping foo?  and if you want to simulate a m/r job add something like
> an order by 1 clause.
>
> HTH
>
> -Mike
>
>
>
> On Jun 8, 2016, at 1:44 PM, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> The idea of accessing Hive metada is to be aware of concurrency.
>
>
>  In generall if I do the following In Hive
>
>
> hive> create table test.dummy as select * from oraclehadoop.dummy;
>
>
> We can see that hive applies the locks in Hive
>
>
> 
>
>
>
>
>
> However, there seems to be an issue. *I do not see any exclusive lock on
> the target table* (i.e. test.dummy). The locking type SHARED_READ on
> source table oraclehadoop.dummy looks OK
>
>
>  One can see the locks  in Hive database
>
>
>
>
> 
>
>
>
>
> So there are few issues here:
>
>
>1. With Hive -> The source table is locked as SHARED_READ
>2. With Spark --> No locks at all
>3. With HIVE --> No locks on the target table
>4. With Spark --> No locks at all
>
>  HTH
>
>
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 8 June 2016 at 20:22, David Newberger 
> wrote:
>
>> Could you be looking at 2 jobs trying to use the same file and one
>> getting to it before the other and finally removing it?
>>
>>
>>
>> *David Newberger*
>>
>>
>>
>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
>> *Sent:* Wednesday, June 8, 2016 1:33 PM
>> *To:* user; user @spark
>> *Subject:* Creating a Hive table through Spark and potential locking
>> issue (a bug)
>>
>>
>>
>>
>>
>> Hi,
>>
>>
>>
>> I noticed an issue with Spark creating and populating a Hive table.
>>
>>
>>
>> The process as I see is as follows:
>>
>>
>>
>>1. Spark creates the Hive table. In this case an ORC table in a Hive
>>Database
>>2. Spark uses JDBC connection to get data out from an Oracle
>>3. I create a temp table in Spark through (registerTempTable)
>>4. Spark populates that table. That table is actually created in
>>
>>hdfs dfs -ls /tmp/hive/hduser
>>
>>drwx--   - hduser supergroup
>>
>>/tmp/hive/hduser/b1ea6829-790f-4b37-a0ff-3ed218388059
>>
>>
>>
>>
>>
>>1. However, The original table itself does not have any locking on it!
>>2. I log in into Hive and drop that table
>>
>> 3. hive> drop table dummy;
>>
>> OK
>>
>>
>>
>>1.  That table is dropped OK
>>2. Spark crashes with message
>>
>> Started at
>> [08/06/2016 18:37:53.53]
>> 16/06/08 19:13:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID
>> 1)
>>
>> org.apache.hadoop.ipc.RemoteException(org.apache.had

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
OK this seems to work.


   1. Create the target table first
   2.  Populate afterwards

 I first created the target table with

hive> create table test.dummy as select * from oraclehadoop.dummy where 1 =
2;

Then I did the INSERT/SELECT and tried to drop the target table while the DML
(INSERT/SELECT) was going on.

Now process 6856 (drop table ...) is waiting for the locks to be
released, which is correct:


Lock ID  Database      Table  Partition  State     Type         Transaction ID  Last Heartbeat  Acquired At    User    Hostname
6855     test          dummy  NULL       ACQUIRED  SHARED_READ  NULL            1465425703092   1465425703054  hduser  rhes564
6855     oraclehadoop  dummy  NULL       ACQUIRED  SHARED_READ  NULL            1465425703092   1465425703056  hduser  rhes564
6856     test          dummy  NULL       WAITING   EXCLUSIVE    NULL            1465425820073   NULL           hduser  rhes564

It sounds like the issue in Hive is with DDL + DML locks applied in a
single statement, i.e. CREATE TABLE a AS SELECT * FROM b.
HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 8 June 2016 at 23:35, Eugene Koifman  wrote:

> if you split “create table test.dummy as select * from oraclehadoop.dummy;
> ”
> into create table statement, followed by insert into test.dummy as select…
> you should see the behavior you expect with Hive.
> Drop statement will block while insert is running.
>
> Eugene
>
> From: Mich Talebzadeh 
> Reply-To: "user@hive.apache.org" 
> Date: Wednesday, June 8, 2016 at 3:12 PM
> To: Michael Segel 
> Cc: David Newberger , "user@hive.apache.org"
> , "user @spark" 
> Subject: Re: Creating a Hive table through Spark and potential locking
> issue (a bug)
>
> Hive version is 2
>
> We can discuss all sorts of scenarios.  However, Hivek is pretty good at
> applying the locks at both the table and partition level. The idea of
> having a metadata is to enforce these rules.
>
> [image: Inline images 1]
>
> For example above inserting from source to target table partitioned (year,
> month) shows that locks are applied correctly
>
> This is Hive running on Spark engine. The crucial point is that Hive
> accesses its metadata and updates its hive_locks table. Again one can see
> from data held in that table in metadata
>
> [image: Inline images 2]
>
> So I think there is a genuine issue here
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 8 June 2016 at 22:36, Michael Segel  wrote:
>
>> Hi,
>>
>> Lets take a step back…
>>
>> Which version of Hive?
>>
>> Hive recently added transaction support so you have to know your
>> isolation level.
>>
>> Also are you running spark as your execution engine, or are you talking
>> about a spark app running w a hive context and then you drop the table from
>> within a Hive shell while the spark app is still running?
>>
>> You also have two different things happening… you’re mixing a DDL with a
>> query.  How does hive know you have another app reading from the table?
>> I mean what happens when you try a select * from foo; and in another
>> shell try dropping foo?  and if you want to simulate a m/r job add
>> something like an order by 1 clause.
>>
>> HTH
>>
>> -Mike
>>
>>
>>
>> On Jun 8, 2016, at 1:44 PM, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> The idea of accessing Hive metada is to be aware of concurrency.
>>
>>
>>  In generall if I do the following In Hive
>>
>>
>> hive> create table test.dummy as select * from oraclehadoop.dummy;
>>
>>
>> We can see that hive applies the locks in Hive
>>
>>
>> 
>>
>>
>>
>>
>>
>> However, there seems to be an issue. *I do not see any exclusive lock on
>> the target table* (i.e. test.dummy). The locking type SHARED_READ on
>> source table oraclehadoop.dummy looks OK
>>
>>
>>  One can see the locks  in Hive database
>>
>>
>>
>>
>> 
>>
>>
>>
>>
>> So there are few issues here:
>>
>>
>>1. With Hive -> The source table is locked as SHARED_READ
>>2. With Spark --> No locks at all
>>3. With HIVE --> No l

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
BTW

DbTxnManager is set as well
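This can be confirmed from a beeline/hive session; a minimal sketch (SET with no value
simply prints the current setting):

-- expect org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
SET hive.txn.manager;
-- expect true, otherwise the lock manager is not used
SET hive.support.concurrency;
SET hive.compactor.initiator.on;
SET hive.compactor.worker.threads;
SET hive.exec.dynamic.partition.mode;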


  
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
  <description>
    Set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager as part of turning on Hive
    transactions, which also requires appropriate settings for hive.compactor.initiator.on,
    hive.compactor.worker.threads, hive.support.concurrency (true), hive.enforce.bucketing
    (true), and hive.exec.dynamic.partition.mode (nonstrict).
    The default DummyTxnManager replicates pre-Hive-0.13 behavior and provides
    no transactions.
  </description>
</property>


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 8 June 2016 at 23:52, Mich Talebzadeh  wrote:

> OK this seems to work.
>
>
>1. Create the target table first
>2.  Populate afterwards
>
>  I first created the target table with
>
> hive> create table test.dummy as select * from oraclehadoop.dummy where 1
> = 2;
>
>  Then did  INSERT/SELECT and tried to drop the target table when DML
> (INSERT/SELECT) was going on
>
> Now the process 6856 (drop table ..)  is waiting for the locks to be
> released which is correct
>
>
> Lock ID DatabaseTable   Partition   State   Type
> Transaction ID  Last Hearbeat   Acquired At UserHostname
> 6855testdummy   NULLACQUIREDSHARED_READ NULL
> 1465425703092   1465425703054   hduser  rhes564
> 6855oraclehadoopdummy   NULLACQUIREDSHARED_READ
> NULL1465425703092   1465425703056   hduser  rhes564
> 6856testdummy   NULLWAITING EXCLUSIVE   NULL
> 1465425820073   NULLhduser  rhes564
>
> Sounds like with Hive there is the issue with DDL + DML locks applied in a
> single transaction i.e. --> create table A as select * from b
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 8 June 2016 at 23:35, Eugene Koifman  wrote:
>
>> if you split “create table test.dummy as select * from
>> oraclehadoop.dummy;”
>> into create table statement, followed by insert into test.dummy as
>> select… you should see the behavior you expect with Hive.
>> Drop statement will block while insert is running.
>>
>> Eugene
>>
>> From: Mich Talebzadeh 
>> Reply-To: "user@hive.apache.org" 
>> Date: Wednesday, June 8, 2016 at 3:12 PM
>> To: Michael Segel 
>> Cc: David Newberger , "user@hive.apache.org"
>> , "user @spark" 
>> Subject: Re: Creating a Hive table through Spark and potential locking
>> issue (a bug)
>>
>> Hive version is 2
>>
>> We can discuss all sorts of scenarios.  However, Hivek is pretty good at
>> applying the locks at both the table and partition level. The idea of
>> having a metadata is to enforce these rules.
>>
>> [image: Inline images 1]
>>
>> For example above inserting from source to target table partitioned
>> (year, month) shows that locks are applied correctly
>>
>> This is Hive running on Spark engine. The crucial point is that Hive
>> accesses its metadata and updates its hive_locks table. Again one can see
>> from data held in that table in metadata
>>
>> [image: Inline images 2]
>>
>> So I think there is a genuine issue here
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 8 June 2016 at 22:36, Michael Segel  wrote:
>>
>>> Hi,
>>>
>>> Lets take a step back…
>>>
>>> Which version of Hive?
>>>
>>> Hive recently added transaction support so you have to know your
>>> isolation level.
>>>
>>> Also are you running spark as your execution engine, or are you talking
>>> about a spark app running w a hive context and then you drop the table from
>>> within a Hive shell while the spark app is still running?
>>>
>>> You also have two different things happening… you’re mixing a DDL with a
>>> query.  How does hive know you have another app readi

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
Hi,

Just to clarify, I use Hive with the Spark engine (the default here), so this is Hive on
Spark as we discussed and observed.

Now with regard to Spark (as an application, NOT as an execution engine) doing the create
table in Hive and populating it: I don't think Spark itself does any
transactional enforcement. This means that Spark assumes no concurrency
for the Hive table. It is probably the same reason why updates/deletes to Hive
ORC transactional tables through Spark fail.
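For context, the kind of table referred to here is a transactional ORC table; a minimal
sketch (test.tx_demo is a hypothetical table used only for illustration):

CREATE TABLE test.tx_demo (id BIGINT, payload STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- UPDATE/DELETE work from Hive with DbTxnManager and hive.support.concurrency=true,
-- whereas the same statements issued through Spark against this table typically fail
UPDATE test.tx_demo SET payload = 'x' WHERE id = 1;
DELETE FROM test.tx_demo WHERE id = 2;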

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 9 June 2016 at 00:46, Eugene Koifman  wrote:

> Locks in Hive are acquired by the query complier and should be independent
> of the execution engine.
> Having said that, I’ve not tried this on Spark, so my answer is only
> accurate with Hive.
>
> Eugene
>
>
> From: Michael Segel 
> Reply-To: "user@hive.apache.org" 
> Date: Wednesday, June 8, 2016 at 3:42 PM
> To: "user@hive.apache.org" 
> Cc: David Newberger , "user @spark" <
> u...@spark.apache.org>
> Subject: Re: Creating a Hive table through Spark and potential locking
> issue (a bug)
>
>
> On Jun 8, 2016, at 3:35 PM, Eugene Koifman 
> wrote:
>
> if you split “create table test.dummy as select * from oraclehadoop.dummy;
> ”
> into create table statement, followed by insert into test.dummy as select…
> you should see the behavior you expect with Hive.
> Drop statement will block while insert is running.
>
> Eugene
>
>
> OK, assuming true…
>
> Then the ddl statement is blocked because Hive sees the table in use.
>
> If you can confirm this to be the case, and if you can confirm the same
> for spark and then you can drop the table while spark is running, then you
> would have a bug since Spark in the hive context doesn’t set any locks or
> improperly sets locks.
>
> I would have to ask which version of hive did you build spark against?
> That could be another factor.
>
> HTH
>
> -Mike
>
>
>


Using Hive table for twitter data

2016-06-09 Thread Mich Talebzadeh
Hi,

I am just exploring this.

Has anyone done a recent load of Twitter data into a Hive table?

I have looked at a few approaches.

This is the one I tried:

ADD JAR /home/hduser/jars/hive-serdes-1.0-SNAPSHOT.jar;
--SET hive.support.sql11.reserved.keywords=false;
use test;
drop table if exists tweets;
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweeted_status STRUCT<
    text:STRING,
    user1:STRUCT<screen_name:STRING,name:STRING>,
    retweet_count:INT>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  user1 STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/twitter_data'
;

The table creates OK, but no data shows up.

I use Flume to populate that external directory:

hdfs dfs -ls /twitter_data
-rw-r--r--   2 hduser supergroup 433868 2016-06-09 09:52
/twitter_data/FlumeData.1465462333430
-rw-r--r--   2 hduser supergroup 438933 2016-06-09 09:53
/twitter_data/FlumeData.1465462365382
-rw-r--r--   2 hduser supergroup 559724 2016-06-09 09:53
/twitter_data/FlumeData.1465462403606
-rw-r--r--   2 hduser supergroup 455594 2016-06-09 09:54
/twitter_data/FlumeData.1465462435124
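Since the table is partitioned by datehour, files sitting directly under /twitter_data are
not visible until at least one partition is registered; a minimal sketch (2016060909 is a
hypothetical datehour value):

ALTER TABLE tweets ADD IF NOT EXISTS PARTITION (datehour = 2016060909)
LOCATION '/twitter_data';

SHOW PARTITIONS tweets;
SELECT COUNT(*) FROM tweets;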

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


Re: Using Hive table for twitter data

2016-06-09 Thread Mich Talebzadeh
thanks Gopal

that link

404 - OOPS!
Looks like you wandered too far from the herd!

LOL

Any reason why that table in Hive cannot read the data in?

cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 9 June 2016 at 10:09, Gopal Vijayaraghavan  wrote:

>
> > Has anyone done recent load of twitter data into Hive table.
>
> Not anytime recently, but the twitter corpus was heavily used to demo Hive.
>
> Here's the original post on auto-learning schemas from an arbitrary
> collection of JSON docs (like a MongoDB dump).
>
> http://hortonworks.com/blog/discovering-hive-schema-in-collections-of-json-
> documents/
>
>
> Cheers,
> Gopal
>
>
>


Re: Hive Table Creation failure on Postgres

2016-06-09 Thread Mich Talebzadeh
Well, I know that the script works fine for Oracle (both base and
transactional).

OK, this is what this table looks like in Oracle. That column is 256 bytes:

[image: Inline images 2]
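On the Postgres side, the actual definition can be checked with plain SQL; a minimal sketch
(assuming the metastore tables were created with the standard quoted upper-case names):

SELECT column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_name = 'COLUMNS_V2'
  AND column_name = 'COMMENT';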


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 9 June 2016 at 19:43, Siddhi Mehta  wrote:

> Hello Everyone,
>
> We are using postgres for hive persistent store.
>
> We are making use of the schematool to create hive schema and our hive
> configs have table and column validation enabled.
>
> While trying to create a simple hive table we ran into the following error.
>
> Error: Error while processing statement: FAILED: Execution Error, return
> code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
> MetaException(message:javax.jdo.JDODataStoreException: Wrong precision
> for column "*COLUMNS_V2"."COMMENT*" : was 4000 (according to the JDBC
> driver) but should be 256 (based on the MetaData definition for field
> org.apache.hadoop.hive.metastore.model.MFieldSchema.comment).
>
> Looks like the Hive Metastore validation expects it to be 255 but when I
> looked at the metastore script for Postgres  it creates the column with
> precision 4000.
>
> Interesting thing is that mysql scripts for the same hive version create
> the column with precision 255.
>
> Is there a config to communicate with Hive MetaStore validation layers as
> to what is the appropriate column precision to be based on the underlying
> persistent store  used or
> is this a known workaround to turn of validation when using postgress as
> the persistent store.
>
> Thanks,
> Siddhi
>


Re: column statistics for non-primitive types

2016-06-13 Thread Mich Talebzadeh
which version of Hive are you using?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 13 June 2016 at 16:00, Michael Häusler  wrote:

> Hi there,
>
>
> when testing column statistics I stumbled upon the following error message:
>
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (foo BIGINT, bar ARRAY, foobar
> STRUCT);
>
> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
> FAILED: UDFArgumentTypeException Only primitive type arguments are
> accepted but array is passed.
>
> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
> FAILED: UDFArgumentTypeException Only primitive type arguments are
> accepted but struct is passed.
>
>
> 1) Basically, it seems that column statistics don't work for non-primitive
> types. Are there any workarounds or any plans to change this?
>
> 2) Furthermore, the convenience syntax to compute statistics for all
> columns does not work as soon as there is a non-supported column. Are there
> any plans to change this, so it is easier to compute statistics for all
> supported columns?
>
> 3) ANALYZE TABLE will only provide the first failing *type* in the error
> message. Especially for wide tables it would be much easier if all
> non-supported column *names* would be printed.
>
>
> Best regards
> Michael
>
>


Re: Optimized Hive query

2016-06-13 Thread Mich Talebzadeh
You want to flatten the query, I understand.

create temporary table tmp as select c from d;

INSERT INTO TABLE a
SELECT c FROM tmp
WHERE condition;

Is the INSERT code correct?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 13 June 2016 at 17:55, Aviral Agarwal  wrote:

> Hi,
> I would like to know if there is a way to convert nested hive sub-queries
> into optimized queries.
>
> For example :
> INSERT INTO TABLE a.b SELECT * FROM ( SELECT c FROM d)
>
> into
>
> INSERT INTO TABLE a.b SELECT c FROM D
>
> This is a simple example but the solution should apply is there were
> deeper nesting levels present.
>
> Thanks,
> Aviral Agarwal
>
>


Re: Optimized Hive query

2016-06-14 Thread Mich Talebzadeh
I presume the user is concerned with performance?

The whole use case of a CBO is to take care of queries by finding the
optimum access path.

Otherwise we would have an RBO, as in the old days of Hive.

If you are on a more recent version of Hive, the CBO does the job.

However, you may think of moving from the map-reduce execution engine to
something like Spark to accelerate the whole thing.

Alternatively, use Spark for the query on Hive (assuming that you are
familiar with the product) to do the whole thing (CBO + execution).

Hive is pretty mature; Hive on map-reduce is problematic.
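A quick way to check the relevant switches in your own session; a minimal sketch (a.b and d
are the example tables from the original question):

SET hive.cbo.enable;
SET hive.stats.fetch.column.stats;
SET hive.optimize.remove.identity.project;

-- compare the plans for the nested and flattened forms
EXPLAIN INSERT INTO TABLE a.b SELECT * FROM (SELECT c FROM d) t;
EXPLAIN INSERT INTO TABLE a.b SELECT c FROM d;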

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 08:37, Aviral Agarwal  wrote:

> Hi,
> Thanks for the replies.
> I already knew that the optimizer already does that.
> My usecase is a bit different though.
> I want to display the flattened query back to the user.
> So I was hoping of using internal Hive CBO to somehow change the AST
> generated for the query somehow.
>
> Thanks,
> Aviral
>
> On Tue, Jun 14, 2016 at 12:42 PM, Gopal Vijayaraghavan 
> wrote:
>
>>
>> > You can see that you get identical execution plans for the nested query
>> >and the flatten one.
>>
>> Wasn't that always though. Back when I started with Hive, before Stinger,
>> it didn't have the identity project remover.
>>
>> To know if your version has this fix, try looking at
>>
>> hive> set hive.optimize.remove.identity.project;
>>
>>
>> Cheers,
>> Gopal
>>
>>
>>
>>
>>
>


Re: column statistics for non-primitive types

2016-06-14 Thread Mich Talebzadeh
Hi Michael,

Statistics for columns in Hive are kept in Hive metadata table
tab_col_stats.

When I look at this table in Oracle, I only see statistics for
primitive columns here. STRUCT columns do not have them, as a STRUCT column
would have to be broken into its primitive columns, and I don't think Hive has
the means to do that.

desc tab_col_stats;
 Name                     Null?      Type
 ------------------------ ---------- ---------------
 CS_ID                    NOT NULL   NUMBER
 DB_NAME                  NOT NULL   VARCHAR2(128)
 TABLE_NAME               NOT NULL   VARCHAR2(128)
 COLUMN_NAME              NOT NULL   VARCHAR2(1000)
 COLUMN_TYPE              NOT NULL   VARCHAR2(128)
 TBL_ID                   NOT NULL   NUMBER
 LONG_LOW_VALUE                      NUMBER
 LONG_HIGH_VALUE                     NUMBER
 DOUBLE_LOW_VALUE                    NUMBER
 DOUBLE_HIGH_VALUE                   NUMBER
 BIG_DECIMAL_LOW_VALUE               VARCHAR2(4000)
 BIG_DECIMAL_HIGH_VALUE              VARCHAR2(4000)
 NUM_NULLS                NOT NULL   NUMBER
 NUM_DISTINCTS                       NUMBER
 AVG_COL_LEN                         NUMBER
 MAX_COL_LEN                         NUMBER
 NUM_TRUES                           NUMBER
 NUM_FALSES                          NUMBER
 LAST_ANALYZED            NOT NULL   NUMBER



So in summary, although column types such as STRUCT do exist, I don't think Hive can
cater for their statistics. Actually, I don't think Oracle itself does it either.

HTH

P.S. I am on Hive 2 and it does not.

hive> analyze table foo compute statistics for columns;
FAILED: UDFArgumentTypeException Only primitive type arguments are accepted
but array is passed.
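A partial workaround is to list the primitive columns explicitly; a minimal sketch against
the same table:

-- works: column statistics for the primitive BIGINT column only
ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foo;

-- fails again as soon as a complex column is included
-- ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foo, bar;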


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 09:57, Michael Häusler  wrote:

> Hi there,
>
> you can reproduce the messages below with Hive 1.2.1.
>
> Best regards
> Michael
>
>
> On 2016-06-13, at 22:21, Mich Talebzadeh 
> wrote:
>
> which version of Hive are you using?
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 13 June 2016 at 16:00, Michael Häusler  wrote:
>
>> Hi there,
>>
>>
>> when testing column statistics I stumbled upon the following error
>> message:
>>
>> DROP TABLE IF EXISTS foo;
>> CREATE TABLE foo (foo BIGINT, bar ARRAY, foobar
>> STRUCT);
>>
>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>> accepted but array is passed.
>>
>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS foobar, bar;
>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>> accepted but struct is passed.
>>
>>
>> 1) Basically, it seems that column statistics don't work for
>> non-primitive types. Are there any workarounds or any plans to change this?
>>
>> 2) Furthermore, the convenience syntax to compute statistics for all
>> columns does not work as soon as there is a non-supported column. Are there
>> any plans to change this, so it is easier to compute statistics for all
>> supported columns?
>>
>> 3) ANALYZE TABLE will only provide the first failing *type* in the error
>> message. Especially for wide tables it would be much easier if all
>> non-supported column *names* would be printed.
>>
>>
>> Best regards
>> Michael
>>
>>
>
>


Re: Optimized Hive query

2016-06-14 Thread Mich Talebzadeh
Amazing. That is the first time I have heard that an optimizer does not
have the concept of a flattened query.

So what is the definition of a syntax tree? Are you referring to the industry
notation "access path"? This is the first time I have heard of such a
notation being called a syntax tree. Are you stating that there is some
explanation of the optimizer's "access path" that comes out independent of the
optimizer and is called a syntax tree?



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 17:46, Markovitz, Dudu  wrote:

> It’s not the query that is being optimized but the syntax tree that is
> created upon the query (execute “explain extended select …”)
>
> In no point do we have a “flattened query”
>
>
>
> Dudu
>
>
>
> *From:* Aviral Agarwal [mailto:aviral12...@gmail.com]
> *Sent:* Tuesday, June 14, 2016 10:37 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Optimized Hive query
>
>
>
> Hi,
>
> Thanks for the replies.
>
> I already knew that the optimizer already does that.
>
> My usecase is a bit different though.
>
> I want to display the flattened query back to the user.
>
> So I was hoping of using internal Hive CBO to somehow change the AST
> generated for the query somehow.
>
>
>
> Thanks,
>
> Aviral
>
>
>
> On Tue, Jun 14, 2016 at 12:42 PM, Gopal Vijayaraghavan 
> wrote:
>
>
> > You can see that you get identical execution plans for the nested query
> >and the flatten one.
>
> Wasn't that always though. Back when I started with Hive, before Stinger,
> it didn't have the identity project remover.
>
> To know if your version has this fix, try looking at
>
> hive> set hive.optimize.remove.identity.project;
>
>
> Cheers,
> Gopal
>
>
>
>
>


Re: ORC does not support type conversion from INT to STRING.

2016-06-14 Thread Mich Talebzadeh
Hi Mahendar,


Did you load the metadata DB/schema from backup and are now seeing this error?



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 19:04, Mahender Sarangam 
wrote:

> ping.
>
> On 6/13/2016 1:19 PM, Mahender Sarangam wrote:
>
> Hi,
>
> We are facing issue while reading data from ORC table. We have created ORC
> table and dumped data into it. We have deleted cluster due to some reason.
> When we recreated cluster (using Metastore) and table pointing to same
> location. When we perform reading from ORC table. We see below error.
>
> SELECT col2, Col1,
>   reflect("java.util.UUID", "randomUUID") AS ID,
>   Source,
>  1 ,
> SDate,
> EDate
> FROM Table ORC  JOIN Table2 _surr;
>
> ERROR : Vertex failed, vertexName=Map 1,
> vertexId=vertex_1465411930667_0212_1_01, diagnostics=[Task failed,
> taskId=task_1465411930667_0212_1_01_00, diagnostics=[TaskAttempt 0
> failed, info=[Error: Failure while running task:java.lang.RuntimeException:
> java.lang.RuntimeException: java.io.IOException: java.io.IOException: ORC
> does not support type conversion from INT to STRING.
>
>
> I think issue is reflect("java.util.UUID", "randomUUID") AS ID
>
>
> I know there is Bug raised while reading data from ORC table. Is there any
> workaround apart from reloading data.
>
> -MS
>
>
>
>
>


Re: ORC does not support type conversion from INT to STRING.

2016-06-14 Thread Mich Talebzadeh
You must excuse my ignorance.

Can you please elaborate on this, as it seems something has gone wrong
somewhere?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 19:42, Mahender Sarangam 
wrote:

> Yes Mich. We have restored cluster from metastore.
>
> On 6/14/2016 11:35 AM, Mich Talebzadeh wrote:
>
> Hi Mahendar,
>
>
> Did you load the meta-data DB/schema from backup and now seeing this error
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> <http://talebzadehmich.wordpress.com/>http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 19:04, Mahender Sarangam 
> wrote:
>
>> ping.
>>
>> On 6/13/2016 1:19 PM, Mahender Sarangam wrote:
>>
>> Hi,
>>
>> We are facing issue while reading data from ORC table. We have created
>> ORC table and dumped data into it. We have deleted cluster due to some
>> reason. When we recreated cluster (using Metastore) and table pointing to
>> same location. When we perform reading from ORC table. We see below error.
>>
>> SELECT col2, Col1,
>>   reflect("java.util.UUID", "randomUUID") AS ID,
>>   Source,
>>  1 ,
>> SDate,
>> EDate
>> FROM Table ORC  JOIN Table2 _surr;
>>
>> ERROR : Vertex failed, vertexName=Map 1,
>> vertexId=vertex_1465411930667_0212_1_01, diagnostics=[Task failed,
>> taskId=task_1465411930667_0212_1_01_00, diagnostics=[TaskAttempt 0
>> failed, info=[Error: Failure while running task:java.lang.RuntimeException:
>> java.lang.RuntimeException: java.io.IOException: java.io.IOException: ORC
>> does not support type conversion from INT to STRING.
>>
>>
>> I think issue is reflect("java.util.UUID", "randomUUID") AS ID
>>
>>
>> I know there is Bug raised while reading data from ORC table. Is there
>> any workaround apart from reloading data.
>>
>> -MS
>>
>>
>>
>>
>>
>
>


Re: Optimized Hive query

2016-06-14 Thread Mich Talebzadeh
Thank you for the cut-and-paste monologue.

Very impressive. I will try to remember it.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 20:01, Markovitz, Dudu  wrote:

> *1)*
>
> Cost-based optimization in Hive
> <https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive>
>
>
> https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive
>
>
>
> Calcite is an open source, Apache Licensed, query planning and execution
> framework. Many pieces of Calcite are derived from Eigenbase Project.
> Calcite has optional JDBC server, query parser and validator, query
> optimizer and pluggable data source adapters. One of the available Calcite
> optimizer is a cost based optimizer based on volcano paper.
>
>
>
> *2)*
>
> The Volcano Optimizer Generator: Extensibility and Efficient Search
>
> Goetz Graefe, Portland State University
>
> William J. McKenna, University of Colorado at Boulder
>
> From Proc. IEEE Conf. on Data Eng., Vienna, April 1993, p. 209.
>
>
>
> *2.2. Optimizer Generator Input and Optimizer Operation*
>
> …
>
> The user queries to be optimized by a generated optimizer are specified as
> an algebra
>
> expression (tree) of *logical operators*. The translation from a user
> interface into a logical algebra
>
> expression must be performed by the parser and is not discussed here.
>
> …
>
>
>
> *3)*
>
> Abstract syntax tree
>
> From Wikipedia, the free encyclopedia
>
> https://en.wikipedia.org/wiki/Abstract_syntax_tree
>
>
>
> In computer science <https://en.wikipedia.org/wiki/Computer_science>, an 
> *abstract
> syntax tree* (*AST*), or just *syntax tree*, is a tree
> <https://en.wikipedia.org/wiki/Directed_tree> representation of the abstract
> syntactic <https://en.wikipedia.org/wiki/Abstract_syntax> structure of source
> code <https://en.wikipedia.org/wiki/Source_code> written in a programming
> language <https://en.wikipedia.org/wiki/Programming_language>.
>
>
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Tuesday, June 14, 2016 7:58 PM
> *To:* user 
>
> *Subject:* Re: Optimized Hive query
>
>
>
> Amazing. that is the first time I have heard that an optimizer does not
> have the concept of flattened query?
>
>
>
> So what is the definition of syntax tree? Are you referring to the
> industry notation "access path". This is the first time I have heard of
> such notation called syntax tree. Are you stating that there is somehow
> some explanation for optimiser "access path" that comes out independent of
> the optimizer and is called syntax tree?
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 14 June 2016 at 17:46, Markovitz, Dudu  wrote:
>
> It’s not the query that is being optimized but the syntax tree that is
> created upon the query (execute “explain extended select …”)
>
> In no point do we have a “flattened query”
>
>
>
> Dudu
>
>
>
> *From:* Aviral Agarwal [mailto:aviral12...@gmail.com]
> *Sent:* Tuesday, June 14, 2016 10:37 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Optimized Hive query
>
>
>
> Hi,
>
> Thanks for the replies.
>
> I already knew that the optimizer already does that.
>
> My usecase is a bit different though.
>
> I want to display the flattened query back to the user.
>
> So I was hoping of using internal Hive CBO to somehow change the AST
> generated for the query somehow.
>
>
>
> Thanks,
>
> Aviral
>
>
>
> On Tue, Jun 14, 2016 at 12:42 PM, Gopal Vijayaraghavan 
> wrote:
>
>
> > You can see that you get identical execution plans for the nested query
> >and the flatten one.
>
> Wasn't that always though. Back when I started with Hive, before Stinger,
> it didn't have the identity project remover.
>
> To know if your version has this fix, try looking at
>
> hive> set hive.optimize.remove.identity.project;
>
>
> Cheers,
> Gopal
>
>
>
>
>
>


Re: column statistics for non-primitive types

2016-06-14 Thread Mich Talebzadeh
Hi,

My point was that we are where we are, and at this juncture there is no
collection of statistics for complex columns. That may be a future
enhancement.

But then the obvious question is how useful or meaningful these statistics
are going to be.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 21:03, Michael Häusler  wrote:

> Hi there,
>
> there might be two topics here:
>
> 1) feasibility of stats for non-primitive columns
> 2) ease of use
>
>
> 1) feasibility of stats for non-primitive columns:
>
> Hive currently collects different kind of statistics for different kind of
> types:
> numeric values: min, max, #nulls, #distincts
> boolean values: #nulls, #trues, #falses
> string values: #nulls, #distincts, avgLength, maxLength
>
> So, it seems quite possible to also collect at least partial stats for
> top-level non-primitive columns, e.g.:
> array values: #nulls, #distincts, avgLength, maxLength
> map values: #nulls, #distincts, avgLength, maxLength
> struct values: #nulls, #distincts
> union values: #nulls, #distincts
>
>
> 2) ease of use
>
> The presence of a single non-primitive column currently breaks the use of
> the convenience shorthand to gather statistics for all columns (ANALYZE
> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
> of column statistics for hive users.
>
> Best regards
> Michael
>
>
>
> On 2016-06-14, at 12:04, Mich Talebzadeh 
> wrote:
>
> Hi Michael,
>
> Statistics for columns in Hive are kept in Hive metadata table
> tab_col_stats.
>
> When I am looking at this table in Oracle, I only see statistics for
> primitives columns here. STRUCT columns do not have it as a STRUCT column
> will have to be broken into its primitive columns.  I don't think Hive has
> the means to do that.
>
> desc tab_col_stats;
>  Name
> Null?Type
>  
>  -
>  CS_ID
> NOT NULL NUMBER
>  DB_NAME
> NOT NULL VARCHAR2(128)
>  TABLE_NAME
> NOT NULL VARCHAR2(128)
>  COLUMN_NAME
> NOT NULL VARCHAR2(1000)
>  COLUMN_TYPE
> NOT NULL VARCHAR2(128)
>  TBL_ID
> NOT NULL NUMBER
>  LONG_LOW_VALUE
> NUMBER
>  LONG_HIGH_VALUE
> NUMBER
>  DOUBLE_LOW_VALUE
> NUMBER
>  DOUBLE_HIGH_VALUE
> NUMBER
>  BIG_DECIMAL_LOW_VALUE
> VARCHAR2(4000)
>  BIG_DECIMAL_HIGH_VALUE
> VARCHAR2(4000)
>  NUM_NULLS
> NOT NULL NUMBER
>  NUM_DISTINCTS
> NUMBER
>  AVG_COL_LEN
> NUMBER
>  MAX_COL_LEN
> NUMBER
>  NUM_TRUES
> NUMBER
>  NUM_FALSES
> NUMBER
>  LAST_ANALYZED
> NOT NULL NUMBER
>
>
>
>  So in summary although column type STRUCT do exit, I don't think Hive can
> cater for their statistics. Actually I don't think Oracle itself does it.
>
> HTH
>
> P.S. I am on Hive 2 and it does not.
>
> hive> analyze table foo compute statistics for columns;
> FAILED: UDFArgumentTypeException Only primitive type arguments are
> accepted but array is passed.
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 09:57, Michael Häusler  wrote:
>
>> Hi there,
>>
>> you can reproduce the messages below with Hive 1.2.1.
>>
>> Best regards
>> Michael
>>
>>
>> On 2016-06-13, at 22:21, Mich Talebzadeh 
>> wrote:
>>
>> which version of Hive are you using?
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 13 June 2016 at 16:00, Michael Häusler  wrote:
>>
>>> Hi there,
>>>
>>>
>>> when testing column statistics I stumbled upon the following error
>>> message:
>>>
>>> DROP TABLE IF EXISTS foo;
>>> CREATE TABLE foo (foo BIGINT, bar ARRAY, foobar
>>> STRUCT);
>>>
>>> ANALYZE TABLE foo COMPUTE STATISTICS FOR COLUMNS;
>>> FAILED: UDFArgumentTypeException Only primitive type arguments are
>>> accepte

Re: column statistics for non-primitive types

2016-06-14 Thread Mich Talebzadeh
hi,

(2) There is a configuration "hive.stats.fetch.column.stats". If you set it
to true, it will automatically collect column stats for you when you insert
into/overwrite a new table. You can refer to HIVE-11160 for more details.

Not without its overheads.

Automatic stats gathering is not new. It has been around for a good while in
RDBMSs and can impact the performance of other queries that are running, so I am not
sure it can be considered a blessing.

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 21:25, Pengcheng Xiong  wrote:

> Exactly, "the useful or meaningful these statistics is going to be" (The
> motivation behind).
>
> Best
> Pengcheng
>
>
> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>> Hi,
>>
>> My point was we are where we are and in this juncture there is no
>> collection of statistics for complex columns. That may be a future
>> enhancement.
>>
>> But then the obvious question is how useful or meaningful these
>> statistics is going to be?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 21:03, Michael Häusler  wrote:
>>
>>> Hi there,
>>>
>>> there might be two topics here:
>>>
>>> 1) feasibility of stats for non-primitive columns
>>> 2) ease of use
>>>
>>>
>>> 1) feasibility of stats for non-primitive columns:
>>>
>>> Hive currently collects different kind of statistics for different kind
>>> of types:
>>> numeric values: min, max, #nulls, #distincts
>>> boolean values: #nulls, #trues, #falses
>>> string values: #nulls, #distincts, avgLength, maxLength
>>>
>>> So, it seems quite possible to also collect at least partial stats for
>>> top-level non-primitive columns, e.g.:
>>> array values: #nulls, #distincts, avgLength, maxLength
>>> map values: #nulls, #distincts, avgLength, maxLength
>>> struct values: #nulls, #distincts
>>> union values: #nulls, #distincts
>>>
>>>
>>> 2) ease of use
>>>
>>> The presence of a single non-primitive column currently breaks the use
>>> of the convenience shorthand to gather statistics for all columns (ANALYZE
>>> TABLE foo COMPUTE STATISTICS FOR COLUMNS;). Imho, this slows down adoption
>>> of column statistics for hive users.
>>>
>>> Best regards
>>> Michael
>>>
>>>
>>>
>>> On 2016-06-14, at 12:04, Mich Talebzadeh 
>>> wrote:
>>>
>>> Hi Michael,
>>>
>>> Statistics for columns in Hive are kept in Hive metadata table
>>> tab_col_stats.
>>>
>>> When I am looking at this table in Oracle, I only see statistics for
>>> primitives columns here. STRUCT columns do not have it as a STRUCT column
>>> will have to be broken into its primitive columns.  I don't think Hive has
>>> the means to do that.
>>>
>>> desc tab_col_stats;
>>>  Name
>>> Null?Type
>>>  
>>>  -
>>>  CS_ID
>>> NOT NULL NUMBER
>>>  DB_NAME
>>> NOT NULL VARCHAR2(128)
>>>  TABLE_NAME
>>> NOT NULL VARCHAR2(128)
>>>  COLUMN_NAME
>>> NOT NULL VARCHAR2(1000)
>>>  COLUMN_TYPE
>>> NOT NULL VARCHAR2(128)
>>>  TBL_ID
>>> NOT NULL NUMBER
>>>  LONG_LOW_VALUE
>>> NUMBER
>>>  LONG_HIGH_VALUE
>>> NUMBER
>>>  DOUBLE_LOW_VALUE
>>> NUMBER
>>>  DOUBLE_HIGH_VALUE
>>> NUMBER
>>>  BIG_DECIMAL_LOW_VALUE
>>> VARCHAR2(4000)
>>>  BIG_DECIMAL_HIGH_VALUE
>>> VARCHAR2(4000)
>>>  NUM_NULLS
>>> NOT NULL NUMBER
>>>  NUM_DISTINCTS
>>> NUMBER
>>>  AVG_COL_LEN
>>> NUMBER
>>>  MAX_COL_LEN
>>> NUMBER
>>>  NUM_TRUES
>>> NUMBER
>>>  NUM_FALSES
>>> NUMBER
>>>  LA

Re: column statistics for non-primitive types

2016-06-14 Thread Mich Talebzadeh
Hi,

There is another approach to reduce the time spent analysing stats, and that is
sampling, i.e. looking at a fraction of the data. For example, in Oracle one
can do:

EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => user, tabname => 'T1', estimate_percent => 50);

In general, Hive statistics are pretty straightforward. This is because,
unless a table is ORC and transactional, updates/deletes won't happen;
most data is new inserts only (immutable).

My opinion on this is also mixed. Most RDBMSs provide something like a
datachange() function that recommends analysing a table if the underlying
table size has changed; that does not really apply to Hive.

I gather that with Hive any new insert will trigger an automatic stats gathering.
New data coming into an existing table does not imply a lack of
quality statistics. If the distribution remains more or less the same for a
given column, updating statistics is not going to make that much
difference. I would rather spend more time on making external indexes
useful for the optimizer.


 HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 22:05, Pengcheng Xiong  wrote:

> Hi Mich,
>
> I agree with you that column stats gathering in Hive is not cheap and
> comes with overheads. This is due to the large volume of data that Hive has
> to process. However, this is the price you have to pay anyway even with
> current "analyze table" solution.
>
>The new feature not only provides a way to make users have column stats
> automatically, but also saves overhead for the "insert into" case. In this
> case, the new column stats are generated incrementally, i.e., by merging
> with the existing stats. Without this feature, you have to scan the whole
> table and compute stats.
>
>In conclusion, this new feature should not have any more overhead than
> the current solution.
>
> Best
> Pengcheng
>
> On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> hi,
>>
>> (2) There is a configuration "hive.stats.fetch.column.stats". If you set
>> it to true, it will automatically collect column stats for you when you
>> insert into/overwrite a new table. You can refer to HIVE-11160 for more
>> details.
>>
>> Not without its overheads.
>>
>> Automatic gather stats is not new. Has been around for a good time in
>> RDBMS and can impact the performance of other queries running. So I am not
>> sure it can be considered as blessing.
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 21:25, Pengcheng Xiong  wrote:
>>
>>> Exactly, "the useful or meaningful these statistics is going to be" (The
>>> motivation behind).
>>>
>>> Best
>>> Pengcheng
>>>
>>>
>>> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> My point was we are where we are and in this juncture there is no
>>>> collection of statistics for complex columns. That may be a future
>>>> enhancement.
>>>>
>>>> But then the obvious question is how useful or meaningful these
>>>> statistics is going to be?
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 14 June 2016 at 21:03, Michael Häusler  wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> there might be two topics here:
>>>>>
>>>>> 1) feasibility of stats for non-primitive columns
>>>>> 2) ease of use
>>>>>
>>>>>
>>>>> 1) feasibility of stats for non

Re: column statistics for non-primitive types

2016-06-14 Thread Mich Talebzadeh
Hi,

Is this automatic stats update is basic statistics or for all columns?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 22:10, Michael Häusler  wrote:

> Hi Mich,
>
> I agree with Pengcheng here. Automatic stats gathering can be extremely
> useful - and it is configurable.
>
> E.g., initial import into Hive happens as CSV or AVRO. Then you might want
> to do a conversion to ORC within Hive via create-table-as-select. At that
> point Hive is reading all the records anyway and we might just as well get
> as much useful information as possible for stats.
>
> Best
> Michael
>
>
> On 2016-06-14, at 23:05, Pengcheng Xiong  wrote:
>
> Hi Mich,
>
> I agree with you that column stats gathering in Hive is not cheap and
> comes with overheads. This is due to the large volume of data that Hive has
> to process. However, this is the price you have to pay anyway even with
> current "analyze table" solution.
>
>The new feature not only provides a way to make users have column stats
> automatically, but also saves overhead for the "insert into" case. In this
> case, the new column stats are generated incrementally, i.e., by merging
> with the existing stats. Without this feature, you have to scan the whole
> table and compute stats.
>
>In conclusion, this new feature should not have any more overhead than
> the current solution.
>
> Best
> Pengcheng
>
> On Tue, Jun 14, 2016 at 1:41 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> hi,
>>
>> (2) There is a configuration "hive.stats.fetch.column.stats". If you set
>> it to true, it will automatically collect column stats for you when you
>> insert into/overwrite a new table. You can refer to HIVE-11160 for more
>> details.
>>
>> Not without its overheads.
>>
>> Automatic gather stats is not new. Has been around for a good time in
>> RDBMS and can impact the performance of other queries running. So I am not
>> sure it can be considered as blessing.
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 21:25, Pengcheng Xiong  wrote:
>>
>>> Exactly, "the useful or meaningful these statistics is going to be" (The
>>> motivation behind).
>>>
>>> Best
>>> Pengcheng
>>>
>>>
>>> On Tue, Jun 14, 2016 at 1:21 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> My point was we are where we are and in this juncture there is no
>>>> collection of statistics for complex columns. That may be a future
>>>> enhancement.
>>>>
>>>> But then the obvious question is how useful or meaningful these
>>>> statistics is going to be?
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 14 June 2016 at 21:03, Michael Häusler  wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> there might be two topics here:
>>>>>
>>>>> 1) feasibility of stats for non-primitive columns
>>>>> 2) ease of use
>>>>>
>>>>>
>>>>> 1) feasibility of stats for non-primitive columns:
>>>>>
>>>>> Hive currently collects different kind of statistics for different
>>>>> kind of types:
>>>>> numeric values: min, max, #nulls, #distincts
>>>>> boolean values: #nulls, #trues, #falses
>>>>> string values: #nulls, #distincts, avgLength, maxLength
>>>>>
>>>>> So, it seems quite possible to also collect at least partial stats for
>>>

Re: column statistics for non-primitive types

2016-06-14 Thread Mich Talebzadeh
hm,

I am on Hive 2 and still the same

hive> create table testme as select * from oraclehadoop.sales_staging where
1 = 2;

hive> insert into testme select * from sales_staging limit  10;

hive> desc formatted testme;
OK
# col_name  data_type   comment
prod_id bigint
cust_id bigint
time_id timestamp
channel_id  bigint
promo_idbigint
quantity_sold   decimal(10,0)
amount_sold decimal(10,0)
# Detailed Table Information
Database:   test
Owner:  hduser
CreateTime: Tue Jun 14 23:00:55 BST 2016
LastAccessTime: UNKNOWN
Retention:  0
Location:
hdfs://rhes564:9000/user/hive/warehouse/test.db/testme
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
numFiles2
numRows 10
rawDataSize 3848068
totalSize   3948068
transient_lastDdlTime   1465941690

hive> analyze table testme compute statistics for columns;
OK
hive> desc formatted testme;
OK
# col_name  data_type   comment
prod_id bigint
cust_id bigint
time_id timestamp
channel_id  bigint
promo_idbigint
quantity_sold   decimal(10,0)
amount_sold decimal(10,0)
# Detailed Table Information
Database:   test
Owner:  hduser
CreateTime: Tue Jun 14 23:00:55 BST 2016
LastAccessTime: UNKNOWN
Retention:  0
Location:
hdfs://rhes564:9000/user/hive/warehouse/test.db/testme
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE
{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"prod_id\":\"true\",\"cust_id\":\"true\",\"time_id\":\"true\",\"channel_id\":\"true\",\"promo_id\":\"true\",\"quantity_sold\":\"true\",\"amount_sold\":\"true\"}}
numFiles2
numRows 10
rawDataSize 3848068
totalSize   3948068
transient_lastDdlTime   1465941690
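
To confirm what was stored for each column, the per-column form of DESC FORMATTED
can be used, and the ANALYZE can be restricted to selected columns to keep the
overhead down. A sketch, reusing the table from the session above:

hive> desc formatted testme prod_id;
hive> analyze table testme compute statistics for columns prod_id, cust_id;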


Although there is a gain to be made by having up-to-date stats, your quickest
performance win is going to come from running Hive on the Spark engine (an order
of magnitude) or from using Spark on Hive tables. As ever, your mileage varies
depending on the availability of RAM on your cluster. Having external
indexes visible to the Hive optimizer would help, but I suppose that is another
discussion.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 June 2016 at 22:51, Michael Häusler  wrote:

> Hi Mich,
>
> as we are still on Hive 1.2.1, it is only working like this for basic
> stats.
> I would welcome it though, if it would work for column statistics as well
> - and it seems this feature is coming via HIVE-11160.
>
> Best
> Michael
>
> On 2016-06-14, at 23:42, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> Is this automatic stats update for basic statistics only or for all columns?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 22:10, Michael Häusler  wrote:
>
>> Hi Mich,
>>
>> I agree with Pengcheng here. Automatic stats gathering can be extremely
>> useful - and it is configurable.
>>
>> E.g., initial import into Hive happens as CSV or AVRO. Then you might
>> want to do a conversion to ORC within Hive via create-table-as-select. At
>> that point Hive is reading all the records anyway and we might just as well
>> get as much useful information as possible for stats.
>>
>> Best
>> Michael
>>
>>
>> On 2016-06-14, at 23:05, Pengcheng Xiong  wrote:
>>
>> Hi Mich,
>>
>> I agree with you that column stats gathering in Hive is not cheap and
>> comes with overheads. This is due to the large volume of data that Hive has
>> to process. However, this is the price you have to pay anyway even with
>> current "analyze table" solution.
>>
>>The new feature not only provides a way to make users have column
>> stats automatically, but also saves overhead.

Re: Hive indexes without improvement of performance

2016-06-16 Thread Mich Talebzadeh
Nothing.

Hive does not support external indexes even in version 2.

In other words, although you can create indexes, they are not visible to the Hive
optimizer, as you have found out.

I wrote an article on this hoping that we would eventually have external indexes
being used.
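
For completeness: the index data itself goes into an ordinary table (as Jörn notes
elsewhere in this thread), so you can query it directly even though the optimizer
will not use it for you. A sketch only; the generated index table name and its
_bucketname/_offsets columns should be confirmed with SHOW FORMATTED INDEX ON doc_t:

select `_bucketname`, `_offsets`
from   my_schema_name__doc_t_doc_id_idx__
where  id = '3723445235879';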

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 16 June 2016 at 21:50, Vadim Dedkov  wrote:

> Hello!
>
> I use Hive 1.1.0-cdh5.5.0 and try to use indexes support.
>
> My index creation:
> *CREATE INDEX doc_id_idx on TABLE my_schema_name.doc_t (id) AS 'COMPACT'
> WITH DEFERRED REBUILD;*
> *ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;*
>
> Then I set configs:
> *set hive.optimize.autoindex=true;*
> *set hive.optimize.index.filter=true;*
> *set hive.optimize.index.filter.compact.minsize=0;*
> *set hive.index.compact.query.max.size=-1;*
> *set hive.index.compact.query.max.entries=-1; *
>
> And my query is:
> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>
> Sometimes I see an improvement in performance, but in most cases I do not.
>
> In cases when I have improvement:
> 1. my query is
> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
> gives me a NullPointerException (in the logs I see that Hive doesn't find my
> index table)
> 2. then I write:
> *USE my_schema_name;*
> *select count(*) from doc_t WHERE id = '3723445235879';*
> and have result with improvement
> (172 sec)
>
> In case when I don't have improvement, I can use either
> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
> without an exception, or
> *USE my_schema_name;*
> *select count(*) from doc_t WHERE id = '3723445235879';*
> and have result
> (1153 sec)
>
> My table is about 6 billion rows.
> I tried various combinations on index configs, including only these two:
> *set hive.optimize.index.filter=true;*
> *set hive.optimize.index.filter.compact.minsize=0;*
> My hadoop version is 2.6.0-cdh5.5.0
>
> What am I doing wrong?
>
> Thank you.
>
> --
> ___ ___
> Best regards,С уважением
> Vadim Dedkov.  Вадим Дедков.
>


Re: Hive indexes without improvement of performance

2016-06-16 Thread Mich Talebzadeh
Well I guess I have to agree to differ on this with Jorn as before.

Vadim,

Please go ahead and try what Jorn suggests. Report back if you see any
improvement.

Couple of points if I may:

Using Hive on Tez is not going to improve the optimiser's performance. That is
just the execution engine and, BTW, I would rather use Hive on Spark. Both
Tez and Spark will be a better fit than the usual map-reduce engine.

Actually my suggestion would be to use Hive as the storage layer only and use
Spark as the query tool. In that case you don't need to worry about indexes
etc. in Hive. Spark, with its DAG and in-memory computing, will do a much better
job.

So


   1. Use Hive with its metadata to store data on HDFS
   2. Use Spark SQL to query that Data. Orders of magnitude faster.


However, I am all for you trying what Jorn suggested.

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 16 June 2016 at 22:02, Jörn Franke  wrote:

> The indexes are based on HDFS blocksize, which is usually around 128 mb.
> This means for hitting a single row you must always load the full block. In
> traditional databases this blocksize it is much faster. If the optimizer
> does not pick up the index then you can query the index directly (it is
> just a table!). Keep in mind that you should use for the index also an
> adequate storage format, such as Orc or parquet.
>
> You should not use the traditional indexes, but use Hive+Tez and the Orc
> format with storage indexes and bloom filters (i.e. Min Hive 1.2). It is of
> key importance that you insert the data sorted on the columns that you use
> in the where clause. You should compress the table with snappy.
> Additionally partitions make sense. Finally please use the right data types
> . Storage indexes work best with ints etc. for text fields you can try
> bloom filters.
>
> That being said, also in other relational databases such as Oracle
> Exadata, the use of traditional indexes is discouraged for warehouse
> scenarios, but storage indexes and columnar formats including compression
> will bring the most performance.
>
> On 16 Jun 2016, at 22:50, Vadim Dedkov  wrote:
>
> Hello!
>
> I use Hive 1.1.0-cdh5.5.0 and try to use indexes support.
>
> My index creation:
> *CREATE INDEX doc_id_idx on TABLE my_schema_name.doc_t (id) AS 'COMPACT'
> WITH DEFERRED REBUILD;*
> *ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;*
>
> Then I set configs:
> *set hive.optimize.autoindex=true;*
> *set hive.optimize.index.filter=true;*
> *set hive.optimize.index.filter.compact.minsize=0;*
> *set hive.index.compact.query.max.size=-1;*
> *set hive.index.compact.query.max.entries=-1; *
>
> And my query is:
> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>
> Sometimes I have improvement of performance, but most of cases - not.
>
> In cases when I have improvement:
> 1. my query is
> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
> give me NullPointerException (in logs I see that Hive doesn't find my
> index table)
> 2. then I write:
> *USE my_schema_name;*
> *select count(*) from doc_t WHERE id = '3723445235879';*
> and have result with improvement
> (172 sec)
>
> In case when I don't have improvement, I can use either
> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
> without exception, either
> *USE my_schema_name;*
> *select count(*) from doc_t WHERE id = '3723445235879';*
> and have result
> (1153 sec)
>
> My table is about 6 billion rows.
> I tried various combinations on index configs, including only these two:
> *set hive.optimize.index.filter=true;*
> *set hive.optimize.index.filter.compact.minsize=0;*
> My hadoop version is 2.6.0-cdh5.5.0
>
> What I do wrong?
>
> Thank you.
>
> --
> ___ ___
> Best regards,С уважением
> Vadim Dedkov.  Вадим Дедков.
>
>


Re: Hive indexes without improvement of performance

2016-06-16 Thread Mich Talebzadeh
OK, use EXPLAIN EXTENDED on your SQL query to see whether the optimizer makes a
good decision.

Help the optimizer by doing a stats update at column level:

ANALYZE TABLE <table_name> COMPUTE STATISTICS FOR COLUMNS;

Then use DESC FORMATTED <table_name> to see the stats.
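
A concrete sketch for the table in this thread (if doc_t is partitioned you will
also need a partition spec on the ANALYZE statement):

explain extended
select count(*) from my_schema_name.doc_t where id = '3723445235879';

analyze table my_schema_name.doc_t compute statistics for columns;
desc formatted my_schema_name.doc_t id;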

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 16 June 2016 at 23:01, Vadim Dedkov  wrote:

> *without any improvement of performance
> 17 июня 2016 г. 1:00 пользователь "Vadim Dedkov" 
> написал:
>
> Ok, thank you. I tried Hive with Tez for my index-problem without any
>> performance
>> 17 июня 2016 г. 0:22 пользователь "Mich Talebzadeh" <
>> mich.talebza...@gmail.com> написал:
>>
>>>
>>> Well I guess I have to agree to differ on this with Jorn as before.
>>>
>>> Vadim,
>>>
>>> Please go ahead and try what Jorn suggests. Report back if you see any
>>> improvement.
>>>
>>> Couple of points if I may:
>>>
>>> Using Hive on Tez is not going to improve Optimiser's performance. That
>>> is just the execution engine and BTW I would rather use Hive on Spark. Both
>>> TEZ and Spark will be a better fit than the usual map-reduce engibe.
>>>
>>> Actually my suggestion would be to use Hive as storage layer only and
>>> use Spark as the query tool. In that case you don't need to worry about
>>> indexes etc in Hive. Spark with DAG and In-memory computing will do a much
>>> better job.
>>>
>>> So
>>>
>>>
>>>1. Use Hive with its metadata to store data on HDFS
>>>2. Use Spark SQL to query that Data. Orders of magnitude faster.
>>>
>>>
>>> However, I am all for you trying what Jorn suggested.
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 16 June 2016 at 22:02, Jörn Franke  wrote:
>>>
>>>> The indexes are based on HDFS blocksize, which is usually around 128
>>>> mb. This means for hitting a single row you must always load the full
>>>> block. In traditional databases this blocksize it is much faster. If the
>>>> optimizer does not pick up the index then you can query the index directly
>>>> (it is just a table!). Keep in mind that you should use for the index also
>>>> an adequate storage format, such as Orc or parquet.
>>>>
>>>> You should not use the traditional indexes, but use Hive+Tez and the
>>>> Orc format with storage indexes and bloom filters (i.e. Min Hive 1.2). It
>>>> is of key importance that you insert the data sorted on the columns that
>>>> you use in the where clause. You should compress the table with snappy.
>>>> Additionally partitions make sense. Finally please use the right data types
>>>> . Storage indexes work best with ints etc. for text fields you can try
>>>> bloom filters.
>>>>
>>>> That being said, also in other relational databases such as Oracle
>>>> Exadata, the use of traditional indexes is discouraged for warehouse
>>>> scenarios, but storage indexes and columnar formats including compression
>>>> will bring the most performance.
>>>>
>>>> On 16 Jun 2016, at 22:50, Vadim Dedkov  wrote:
>>>>
>>>> Hello!
>>>>
>>>> I use Hive 1.1.0-cdh5.5.0 and try to use indexes support.
>>>>
>>>> My index creation:
>>>> *CREATE INDEX doc_id_idx on TABLE my_schema_name.doc_t (id) AS
>>>> 'COMPACT' WITH DEFERRED REBUILD;*
>>>> *ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;*
>>>>
>>>> Then I set configs:
>>>> *set hive.optimize.autoindex=true;*
>>>> *set hive.optimize.index.filter=true;*
>>>> *set hive.optimize.index.filter.compact.minsize=0;*
>>>> *set hive.index.compact.query.max.size=-1;*
>>>> *set hive.index.compact.query.max.entries=-1; *
>>>>
>>>> And my query is:
>>>> *select count(*) from my_schem

Re: last stats time on table columns

2016-06-17 Thread Mich Talebzadeh
In general, to see if column stats have been updated you need to look at
the metastore tables tbls, tab_col_stats, table_params and so forth.

For example, all table parameters are stored in the table_params table, and the
column stats, with the last-analysed time held as an epoch value, are in tab_col_stats.

My metadata is on Oracle, but the table structure will be the same across all
databases, I assume.

For example the following will get the last stats time for columns

set echo off
set linesize 180
set pagesize 40
set heading on
break on Database skip 1 on report
break on Table skip 1 on report
column time format a25 heading "LAST_ANALYSED Time (GMT)"
SELECT
  SUBSTR(DB_NAME,1,12) AS "Database"
, SUBSTR(TABLE_NAME,1,15) AS "Table"
, SUBSTR(COLUMN_NAME,1,15) AS "Column"
, SUBSTR((timestamp '1970-01-01 00:00:00' +
NUMTODSINTERVAL(LAST_ANALYZED,'second')) AT TIME ZONE
tz_offset('GMT'),1,18) AS time
FROM tab_col_stats
ORDER by DB_NAME, TABLE_NAME, COLUMN_NAME;

And the output will be something like

Database Table   Column  LAST_ANALYSED Time (GMT)
 --- --- -
oraclehadoop dummy   clustered   16-JUN-16 16.23.32
oraclehadoop id  16-JUN-16 16.23.32
oraclehadoop padding 16-JUN-16 16.23.32
oraclehadoop random_string   16-JUN-16 16.23.32
oraclehadoop randomised  16-JUN-16 16.23.32
oraclehadoop scattered   16-JUN-16 16.23.32
oraclehadoop small_vc16-JUN-16 16.23.32

oraclehadoop sales_staging   amount_sold 17-JUN-16 09.25.08
oraclehadoop channel_id  17-JUN-16 09.25.08
oraclehadoop cust_id 17-JUN-16 09.25.08
oraclehadoop prod_id 17-JUN-16 09.25.08
oraclehadoop promo_id17-JUN-16 09.25.08
oraclehadoop quantity_sold   17-JUN-16 09.25.08
oraclehadoop time_id 17-JUN-16 09.25.08
That is the only way I could find the stats time for columns.
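
If your metastore is on MySQL rather than Oracle, the equivalent query is along
these lines (a sketch only, assuming the standard metastore schema of this era,
where tab_col_stats still carries db_name and table_name):

SELECT DB_NAME, TABLE_NAME, COLUMN_NAME,
       FROM_UNIXTIME(LAST_ANALYZED) AS last_analyzed_time
FROM   TAB_COL_STATS
ORDER  BY DB_NAME, TABLE_NAME, COLUMN_NAME;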

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 17 June 2016 at 06:55, Damien Carol  wrote:

> ANALYZE TABLE  COMPUTE STATISTICS => change stats for the
> table and should should it
> ANALYZE TABLE  COMPUTE STATISTICS for COLUMNS => change stats
> for columns and should change it for columns but NOT for the table
>
> That's it.
>
> 2016-06-16 21:10 GMT+02:00 Ashok Kumar :
>
>> Greeting gurus,
>>
>> When I use
>>
>> ANALYZE TABLE  COMPUTE STATISTICS for COLUMNS,
>>
>> Where can I get the last stats time.
>>
>> DESC FORMATTED  does not show it
>>
>> thanking you
>>
>
>


Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-20 Thread Mich Talebzadeh
Hi David,

What are you actually trying to do with the data?

Hive and map-reduce are notoriously slow for this type of operation. Hive
is good for storage; that is what I vouch for.

There are other alternatives.
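
One such alternative, which avoids streaming 100 million rows through JDBC, is to
write the result set straight to HDFS and pull the files from there. A rough
sketch only (the output path is illustrative, substitute your real column list for
the *, and ROW FORMAT on INSERT OVERWRITE DIRECTORY needs Hive 0.11 or later):

INSERT OVERWRITE DIRECTORY '/tmp/hive_export/my_dump'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT *
FROM   `db`.`table`
WHERE  year = 2016 AND month = 6 AND day = 1 AND hour = 10;

-- then fetch the files with: hdfs dfs -get /tmp/hive_export/my_dump <local_dir>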

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 20 June 2016 at 15:43, David Nies  wrote:

> Dear Hive mailing list,
>
> in my setup, network throughput from the HiveServer2 to the client seems
> to be the bottleneck and I’m seeking a way to increase throughput. Let me
> elaborate my use case:
>
> I’m using Hive version 1.1.0 that is bundled with Cloudera 5.5.1.
>
> I want to fetch a huge amount of data from our Hive cluster. By huge I
> mean something around 100 million rows. The Hive table I’m querying is an
> external table whose data is stored in .avro. On HDFS, the data I want to
> fetch (i.e. the aforementioned 100 million rows) is about 5GB in size. A
> cleverer filtering strategy (to reduce the amount of data) is no option,
> sadly, since I need all the data.
>
> I was able to reduce the time the MapReduce job takes to an agreeable
> interval fiddling around with
> `mapreduce.input.fileinputformat.split.maxsize`. The part that is taking
> ages comes after MapReduce. I’m observing that the Hadoop namenode that is
> hosting the HiveServer2 is merely sending data with around 3 MB/sec. Our
> network is capable of much more. Playing around with `fetchSize` did not
> increase throughput.
>
> As I identified network throughput to be the bottleneck, I restricted my
> efforts to trying to increase it. For this, I simply run the query I’d
> normally run through JDBC (from Clojure/Java) via `beeline` and dumping the
> output to `/dev/null`. My `beeline` query looks something like that:
>
> beeline \
> -u jdbc:hive2://srv:1/db \
> -n user -p password \
> --outputformat=csv2 \
> --incremental=true \
> --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
> -e 'SELECT  FROM `db`.`table` WHERE (year=2016 AND
> month=6 AND day=1 AND hour=10)' > /dev/null
>
> I already tried playing around with additional `—hiveconf`s:
>
> --hiveconf hive.exec.compress.output=true \
> --hiveconf mapred.output.compression.type=BLOCK \
> --hiveconf
> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>
> without success.
>
> In all cases, Hive is able only to utilize a tiny fraction of the
> bandwidth that is available. Is there a possibility to increase network
> throughput?
>
> Thank you in advance!
>
> Yours
>
> David Nies
> Entwickler Business Intelligence
>  ADITION technologies AG
>
> Oststraße 55, D-40211 Düsseldorf
> Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau
>
> T +49 211 987400 30
> F +49 211 987400 33
> E david.n...@adition.com
>
> Technischen Support erhalten Sie unter der +49 1805 2348466
> (Festnetzpreis: 14 ct/min; Mobilfunkpreise: maximal 42 ct/min)
>
> Abonnieren Sie uns auf XING oder besuchen Sie uns unter www.adition.com.
>
> Vorstände: Andreas Kleiser, Jörg Klekamp, Dr. Lutz Lowis, Marcus Schlüter
> Aufsichtsratsvorsitzender: Joachim Schneidmadl
> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
> UStIDNr.: DE 218 858 434
>
>
>


Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-21 Thread Mich Talebzadeh
Is the underlying table partitioned? i.e.

'SELECT  FROM `db`.`table` WHERE (year=2016 AND month=6
AND day=1 AND hour=10)'

And also, what is the expected result set (RS) size?

JDBC on its own should work. Is this an ORC table?

What version of Hive are you using?
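
A quick way to check all three from beeline (a sketch; substitute the real table
name):

show partitions `db`.`table`;
show create table `db`.`table`;   -- shows the storage format (ORC, Avro, ...)
-- the Hive version is printed on the beeline banner when you connect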

HTH





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 21 June 2016 at 07:52, David Nies  wrote:

> In my test case below, I’m using `beeline` as the Java application
> receiving the JDBC stream. As I understand, this is the reference command
> line interface to Hive. Are you saying that the reference command line
> interface is not efficiently implemented? :)
>
> -David Nies
>
> Am 20.06.2016 um 17:46 schrieb Jörn Franke :
>
> Aside from this the low network performance could also stem from the Java
> application receiving the JDBC stream (not threaded / not efficiently
> implemented etc). However that being said, do not use jdbc for this.
>
> On 20 Jun 2016, at 17:28, Jörn Franke  wrote:
>
> Hallo,
>
> For no databases (including traditional ones) it is advisable to fetch
> this amount through jdbc. Jdbc is not designed for this (neither for import
> nor for export of large data volumes). It is a highly questionable approach
> from a reliability point of view.
>
> Export it as file to HDFS and fetch it from there or use oozie to dump the
> file from HDFS to a sftp or other server. There are alternatives depending
> on your use case.
>
> Best regards
>
> On 20 Jun 2016, at 16:43, David Nies  wrote:
>
> Dear Hive mailing list,
>
> in my setup, network throughput from the HiveServer2 to the client seems
> to be the bottleneck and I’m seeking a way do increase throughput. Let me
> elaborate my use case:
>
> I’m using Hive version 1.1.0 that is bundeled with Clouders 5.5.1.
>
> I want to fetch a huge amount of data from our Hive cluster. By huge I
> mean something around 100 million rows. The Hive table I’m querying is an
> external table whose data is stored in .avro. On HDFS, the data I want to
> fetch (i.e. the aforementioned 100 million rows) is about 5GB in size. A
> cleverer filtering strategy (to reduce the amount of data) is no option,
> sadly, since I need all the data.
>
> I was able to reduce the time the MapReduce job takes to an agreeable
> interval fiddling around with
> `mapreduce.input.fileinputformat.split.maxsize`. The part that is taking
> ages comes after MapReduce. I’m observing that the Hadoop namenode that is
> hosting the HiveServer2 is merely sending data with around 3 MB/sec. Our
> network is capable of much more. Playing around with `fetchSize` did not
> increase throughput.
>
> As I identified network throughput to be the bottleneck, I restricted my
> efforts to trying to increase it. For this, I simply run the query I’d
> normally run through JDBC (from Clojure/Java) via `beeline` and dumping the
> output to `/dev/null`. My `beeline` query looks something like that:
>
> beeline \
> -u jdbc:hive2://srv:1/db \
> -n user -p password \
> --outputformat=csv2 \
> --incremental=true \
> --hiveconf mapreduce.input.fileinputformat.split.maxsize=33554432 \
> -e 'SELECT  FROM `db`.`table` WHERE (year=2016 AND
> month=6 AND day=1 AND hour=10)' > /dev/null
>
> I already tried playing around with additional `—hiveconf`s:
>
> --hiveconf hive.exec.compress.output=true \
> --hiveconf mapred.output.compression.type=BLOCK \
> --hiveconf
> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
>
> without success.
>
> In all cases, Hive is able only to utilize a tiny fraction of the
> bandwidth that is available. Is there a possibility to increase network
> throughput?
>
> Thank you in advance!
>
> Yours
>
> David Nies
> Entwickler Business Intelligence
>  ADITION technologies AG
>
> Oststraße 55, D-40211 Düsseldorf
> Schwarzwaldstraße 78b, D-79117 Freiburg im Breisgau
>
> T +49 211 987400 30
> F +49 211 987400 33
> E david.n...@adition.com
>
> Technischen Support erhalten Sie unter der +49 1805 2348466
> (Festnetzpreis: 14 ct/min; Mobilfunkpreise: maximal 42 ct/min)
>
> Abonnieren Sie uns auf XING oder besuchen Sie uns unter www.adition.com.
>
> Vorstände: Andreas Kleiser, Jörg Klekamp, Dr. Lutz Lowis, Marcus Schlüter
> Aufsichtsratsvorsitzender: Joachim Schneidmadl
> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
> UStIDNr.: DE 218 858 434
>
>
>
>


Re: Show Redudant database name in Beeline -Hive 2.0

2016-06-21 Thread Mich Talebzadeh
Hmm, that is very strange. What other databases besides default do you expect to
be there?

Mine is as below

Beeline version 2.0.0 by Apache Hive
0: jdbc:hive2://rhes564:10010/default> show databases;
++--+
| database_name  |
++--+
| accounts   |
| asehadoop  |
| default|
| iqhadoop   |
| mytable_db |
| oraclehadoop   |
| test   |
| twitterdb  |
++--+

Can you log in to your metastore database and check, or ask an admin to do
so?

It is in table DBS.

 select db_id, substr(name,1,20) AS DBName from dbs order by 1;
 DB_ID DBNAME
-- 
 1 default
 2 asehadoop
 6 oraclehadoop
11 test
16 iqhadoop
22 mytable_db
31 accounts
36 twitterdb
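
While you are in the metastore, a quick check for genuinely duplicated rows (a
sketch; it assumes the standard DBS metastore table):

select name, count(*) from dbs group by name having count(*) > 1;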

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 21 June 2016 at 07:56, karthi keyan  wrote:

> Hi all,
>
> Am using Hive 2.0 , once i have connected via beeline and i queried "show
> databases;" command , It will show same database name by more than once.
>
>
> ​
> Is there any issue over this ???
>
> Best,
> Karthik
>


Re: Network throughput from HiveServer2 to JDBC client too low

2016-06-21 Thread Mich Talebzadeh
This is a classic issue. Are there other users using the same network to
connect to Hive?

Can your Unix admin use a network sniffer to determine the issue in your
case?

In normal operations with a modest amount of data do you see the same issue,
or is this purely due to your load (the number of rows returned) of 100M
rows?

Yes, I noticed your version of Hive is 1.1 on a vendor's package.

At this stage the question is what other alternatives there are to fetch
those 100 million rows.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 21 June 2016 at 08:15, David Nies  wrote:

>
>
> Am 20.06.2016 um 20:20 schrieb Gopal Vijayaraghavan :
>
>
> is hosting the HiveServer2 is merely sending data with around 3 MB/sec.
> Our network is capable of much more. Playing around with `fetchSize` did
> not increase throughput.
>
> ...
>
> --hiveconf
> mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
> \
>
>
> The current implementation you have is CPU bound in HiveServer2, the
> compression generally makes it worse.
>
> The fetch size does help, but it only prevents the system from doing
> synchronized operations frequently (pausing every 50 rows is too often,
> the default is now 1 rows).
>
>   -e 'SELECT  FROM `db`.`table` WHERE (year=2016 AND
> month=6 AND day=1 AND hour=10)' > /dev/null
>
>
> Quick q - are year/month/day/hour partition columns? If so, there might be
> a very different fix to this problem.
>
>
> Yes, year, month, day and hour are partition columns. I.e. I want to
> export exactly one partition. In my real use case, I want to use another
> filter (WHERE some_other_column = ), but for this case right here, it is
> exactly the data of one partition I want.
>
>
> In all cases, Hive is able only to utilize a tiny fraction of the
> bandwidth that is available. Is there a possibility to increase network
> throughput?
>
>
> A series of work-items are in progress for fixing the large row-set
> performance in HiveServer2
>
> https://issues.apache.org/jira/browse/HIVE-11527
>
> https://issues.apache.org/jira/browse/HIVE-12427
>
> What would be great would be to attach a profiler to your HiveServer2 &
> see which functions are hot, that will help fix those codepaths as part of
> the joint effort with the ODBC driver teams.
>
>
> I’ll see what I can do. I can’t restart the server at will though, since
> other teams are using it as well.
>
>
> Cheers,
> Gopal
>
>
> Thank you :)
> -David
>
>


Re: loading in ORC from big compressed file

2016-06-22 Thread Mich Talebzadeh
Hi

Are you using map-reduce as the execution engine?

What version of Hive are you on?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 22 June 2016 at 00:45, @Sanjiv Singh  wrote:

> Hi ,
>
> I have big compressed data file *my_table.dat.gz* ( approx size 100 GB)
>
> # load staging table *STAGE_**my_table* from file *my_table.dat.gz*
>
> HIVE>> LOAD DATA  INPATH '/var/lib/txt/*my_table.dat.gz*' OVERWRITE INTO
> TABLE STAGE_my_table ;
>
> *# insert into ORC table "my_table"*
>
> HIVE>> INSERT INTO TABLE my_table SELECT * FROM TXT_my_table;
> 
> INFO  : Map 1: 0(+1)/1  Reducer 2: 0/1
> 
>
>
> Insertion into the ORC table has been going on for 5-6 hours. It seems everything is
> going sequentially, with one mapper reading the complete file?
>
> Please suggest ? help me in improving ORC table load.
>
>
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>


Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-22 Thread Mich Talebzadeh
Hi Ajay,

I am afraid that for now transaction heartbeats do not work through Spark, so I
have no other solution.

This is an interesting point, as with Hive running on the Spark engine there is no
issue with this because Hive handles the transactions.

I gather that, in its simplest form, Hive has to deal with its metadata for the
transaction logic, but Spark somehow cannot do that.

In short, that is it. You need to do that through Hive.
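
In practice that means issuing the DML from a Hive session (hive CLI or beeline)
against the transactional table. A sketch only, with a hypothetical table and
predicate; the table must be ORC, bucketed and created with
"transactional"="true" for DELETE to be accepted:

delete from my_orc_txn_table where some_key = 12345;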

Cheers,



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 22 June 2016 at 16:08, Ajay Chander  wrote:

> Hi Mich,
>
> Right now I have a similar usecase where I have to delete some rows from a
> hive table. My hive table is of type ORC, Bucketed and included
> transactional property. I can delete from hive shell but not from my
> spark-shell or spark app. Were you able to find any work around? Thank
> you.
>
> Regards,
> Ajay
>
>
> On Thursday, June 2, 2016, Mich Talebzadeh 
> wrote:
>
>> thanks for that.
>>
>> I will have a look
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 2 June 2016 at 10:46, Elliot West  wrote:
>>
>>> Related to this, there exists an API in Hive to simplify the
>>> integrations of other frameworks with Hive's ACID feature:
>>>
>>> See:
>>> https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API
>>>
>>> It contains code for maintaining heartbeats, handling locks and
>>> transactions, and submitting mutations in a distributed environment.
>>>
>>> We have used it to write to transactional tables from Cascading based
>>> processes.
>>>
>>> Elliot.
>>>
>>>
>>> On 2 June 2016 at 09:54, Mich Talebzadeh 
>>> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> Spark does not support transactions because as I understand there is a
>>>> piece in the execution side that needs to send heartbeats to Hive metastore
>>>> saying a transaction is still alive". That has not been implemented in
>>>> Spark yet to my knowledge."
>>>>
>>>> Any idea on the timelines when we are going to have support for
>>>> transactions in Spark for Hive ORC tables. This will really be useful.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>
>>>
>>


Re: Show Redudant database name in Beeline -Hive 2.0

2016-06-22 Thread Mich Talebzadeh
Hi Karthi,

Those database names are picked up from the Hive metadata. Do you know
the type of RDBMS that holds your Hive metastore database?

Check hive-site.xml for
 javax.jdo.option.ConnectionURL

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 22 June 2016 at 15:31, karthi keyan  wrote:

> Hi Mich,
>
> Some times  am facing this kind of issue with database "DEFAULT"..
>
> Connected to: Apache Hive (version 2.0.1)
> Driver: Hive JDBC (version 2.0.1)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://host:1/default> show databases;
> ++--+
> | database_name  |
> ++--+
> | default|
> | default|
> ++--+
> 2 rows selected (0.126 seconds)
>
>
> for Logging where to find the logging details??
>
> Best,
> Karthik
>
> On Tue, Jun 21, 2016 at 12:42 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> hm That is very strange. What other database besides default you expect
>> to be there
>>
>> Mine is as below
>>
>> Beeline version 2.0.0 by Apache Hive
>> 0: jdbc:hive2://rhes564:10010/default> show databases;
>> ++--+
>> | database_name  |
>> ++--+
>> | accounts   |
>> | asehadoop  |
>> | default|
>> | iqhadoop   |
>> | mytable_db |
>> | oraclehadoop   |
>> | test   |
>> | twitterdb  |
>> ++--+
>>
>> Can you log in to your metadata and see database or ask an admin guy to
>> do so
>>
>> It is in table DBS.
>>
>>  select db_id, substr(name,1,20) AS DBName from dbs order by 1;
>>      DB_ID DBNAME
>> -- 
>>  1 default
>>  2 asehadoop
>>  6 oraclehadoop
>> 11 test
>> 16 iqhadoop
>> 22 mytable_db
>> 31 accounts
>> 36 twitterdb
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 21 June 2016 at 07:56, karthi keyan  wrote:
>>
>>> Hi all,
>>>
>>> Am using Hive 2.0 , once i have connected via beeline and i queried
>>> "show databases;" command , It will show same database name by more than
>>> once.
>>>
>>>
>>> ​
>>> Is there any issue over this ???
>>>
>>> Best,
>>> Karthik
>>>
>>
>>
>


Re: Show Redudant database name in Beeline -Hive 2.0

2016-06-22 Thread Mich Talebzadeh
Sounds like it is picking up results from both metastores!

Maybe the cluster is not set up correctly. It should always pick up from
the active node (just one).



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 23 June 2016 at 06:23, karthi keyan  wrote:

> Hi Mich,
>
> Here is have used derby as JDBC metastore.
>
> jdbc:derby://:1527/metastore_db;create=true
>
> Let me explain the config:
>
> Actually in a cluster pointing the same MetaStore from 2 Hiverserver
> Running in two different node.
>
> 1- Starting Derby server - (networkserver)
> 2- Starting metaStore services.
> 3- Hiveserver 2 - Node 1
> 4- Hiveserver 2 - Node 2
>
> I think pointing same metastore will leads the issue ??  But it occurs in
> random not everytime !!
>
> -Karthik
>
>
>
> On Thu, Jun 23, 2016 at 2:48 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Karthi,
>>
>> Those database names are picked up from the metadata of Hive/ Do you know
>> the type of RDBMS that holds your Hive database.
>>
>> Check hive-site.xml  for
>>  javax.jdo.option.ConnectionURL
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 22 June 2016 at 15:31, karthi keyan  wrote:
>>
>>> Hi Mich,
>>>
>>> Some times  am facing this kind of issue with database "DEFAULT"..
>>>
>>> Connected to: Apache Hive (version 2.0.1)
>>> Driver: Hive JDBC (version 2.0.1)
>>> Transaction isolation: TRANSACTION_REPEATABLE_READ
>>> 0: jdbc:hive2://host:1/default> show databases;
>>> ++--+
>>> | database_name  |
>>> ++--+
>>> | default|
>>> | default|
>>> ++--+
>>> 2 rows selected (0.126 seconds)
>>>
>>>
>>> for Logging where to find the logging details??
>>>
>>> Best,
>>> Karthik
>>>
>>> On Tue, Jun 21, 2016 at 12:42 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> hm That is very strange. What other database besides default you expect
>>>> to be there
>>>>
>>>> Mine is as below
>>>>
>>>> Beeline version 2.0.0 by Apache Hive
>>>> 0: jdbc:hive2://rhes564:10010/default> show databases;
>>>> ++--+
>>>> | database_name  |
>>>> ++--+
>>>> | accounts   |
>>>> | asehadoop  |
>>>> | default|
>>>> | iqhadoop   |
>>>> | mytable_db |
>>>> | oraclehadoop   |
>>>> | test   |
>>>> | twitterdb  |
>>>> ++--+
>>>>
>>>> Can you log in to your metadata and see database or ask an admin guy to
>>>> do so
>>>>
>>>> It is in table DBS.
>>>>
>>>>  select db_id, substr(name,1,20) AS DBName from dbs order by 1;
>>>>  DB_ID DBNAME
>>>> -- 
>>>>  1 default
>>>>  2 asehadoop
>>>>  6 oraclehadoop
>>>> 11 test
>>>> 16 iqhadoop
>>>> 22 mytable_db
>>>> 31 accounts
>>>> 36 twitterdb
>>>>
>>>> HTH
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 21 June 2016 at 07:56, karthi keyan 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Am using Hive 2.0 , once i have connected via beeline and i queried
>>>>> "show databases;" command , It will show same database name by more than
>>>>> once.
>>>>>
>>>>>
>>>>> ​
>>>>> Is there any issue over this ???
>>>>>
>>>>> Best,
>>>>> Karthik
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Optimize Hive Query

2016-06-23 Thread Mich Talebzadeh
Do you also have the output from

desc formatted tuning_dd_key

If so, could you send the output please?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 23 June 2016 at 17:41, @Sanjiv Singh  wrote:

> Hi Gopal,
>
> I am using Tez as execution engine.
>
> DAG :
>
> ++--+
> |
>   Explain
> |
> +-+--+
> | Plan not optimized by CBO.
>   |
> |
>|
> | Vertex dependency in root stage
> |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)
> |
> |
>  |
> | Stage-0
>   |
> |Fetch Operator
>   |
> |   limit:-1
> |
> |   Stage-1
> |
> |  Reducer 2
> |
> |  File Output Operator [FS_55596]
>   |
> | compressed:false
>|
> | Statistics:Num rows: 6357592675 Data size: 54076899328 Basic
> stats: COMPLETE Column stats: NONE  |
> |
> table:{"serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","input
> format:":"org.apache.hadoop.mapred.TextInputFormat","output
> format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"}  |
> | Select Operator [SEL_55594]
> |
> |
>  
> outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"]
>|
> |Statistics:Num rows: 6357592675 Data size: 54076899328
> Basic stats: COMPLETE Column stats: NONE   |
> |PTF Operator [PTF_55593]
>|
> |   Function definitions:[{"Input
> definition":{"type:":"WINDOWING"}},{"partition by:":"_col0,
> _col1","name:":"windowingtablefunction","order by:":"_col2"}]  |
> |   Statistics:Num rows: 6357592675 Data size: 54076899328
> Basic stats: COMPLETE Column stats: NONE   |
> |   Select Operator [SEL_55592]
>   |
> |   |
>  outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6"]
>   |
> |   |  Statistics:Num rows: 6357592675 Data size:
> 54076899328 Basic stats: COMPLETE Column stats: NONE|
> |   |<-Map 1 [SIMPLE_EDGE] vectorized
>  |
> |  Reduce Output Operator [RS_55597]
>  |
> | key expressions:m_d_key (type: smallint),
> sb_gu_key (type: bigint), t_ev_st_dt (type: date) |
> | Map-reduce partition columns:m_d_key (type:
> smallint), sb_gu_key (type: bigint)|
> | sort order:+++
>  |
> | Statistics:Num rows: 6357592675 Data size:
> 54076899328 Basic stats: COMPLETE Column stats: NONE
>  |
> | value expressions:ad_zn_key (type: int), c_dt
> (type: date), e_p_dt (type: date), sq_nbr (type: int)   |
> | TableScan [TS_55590]
>   |
> |ACID table:true
>  |
> |alias:tuning_dd_key
>   |
> |Statistics:Num rows: 6357592675 Data size:
> 54076899328 Basic stats: COMPLETE Column stats: NONE  |
> |
>
> |
>
> +--

Re: Optimize Hive Query

2016-06-23 Thread Mich Talebzadeh
Funnily enough, it is pretty close to similar ORC transactional tables I have:
standard, with 256 buckets on two columns, as below:

number of distinct value in column m_d_key : 29
> number of distinct value in column sb_gu_key : 15434343


You also have vectorised execution, taking 1,024 rows at a time.

Still, the optimizer does not tell me much. Also, I don't use Tez; I use
Spark as the execution engine. From my experience (and I am sure there
will be plenty who will disagree with me :)), the optimiser does not make
much difference; it is the execution engine that delivers the performance.

The other alternative is, when you populate the table, to insert the data
sorted by m_d_key, sb_gu_key, t_ev_st_dt to ensure that the optimizer
will be better off.

It may also help if you add t_ev_st_dt as the third bucket column, as
the LAG() function is using it in its ORDER BY:


CLUSTERED BY (m_d_key, sb_gu_key, t_ev_st_dt)  INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  "transactional"="true",
  "orc.create.index"="true",
  "orc.bloom.filter.columns"="m_d_key, sb_gu_key, t_ev_st_dt",
  "orc.bloom.filter.fpp"="0.05",
  "orc.stripe.size"="16777216",
  "orc.row.index.stride"="1"
)
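
Note that the row-group indexes and bloom filters above only pay off at read time
if ORC predicate push-down is enabled for the querying session -- a one-line
sketch (verify the default on your version):

set hive.optimize.index.filter=true;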

Others may have better ideas.

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 23 June 2016 at 20:30, @Sanjiv Singh  wrote:

> Hi Mich ,
>
> Please find below output of command.
>
> desc formatted tuning_dd_key ;
>
>
> +---+---+---+--+
> |   col_name|
>   data_type   |comment
>|
>
> +---+---+---+--+
> | # col_name| data_type
>   | comment
>   |
> |   | NULL
>  | NULL
>  |
> | m_d_key   | smallint
>  |
>   |
> | sb_gu_key | bigint
>  |
>   |
> | t_ev_st_dt| date
>  |
>   |
> | ad_zn_key | int
>   |
>   |
> | c_dt  | date
>  |
>   |
> | e_p_dt| date
>  |
>   |
> | sq_nbr| int
>   |
>   |
> |   | NULL
>  | NULL
>  |
> | # Detailed Table Information  | NULL
>  | NULL
>  |
> | Database: | PRDDB
>   | NULL
>|
> | CreateTime:   | Thu Jun 23 11:03:53 EDT 2016
>  | NULL
>  |
> | LastAccessTime:   | UNKNOWN
>   | NULL
>|
> | Protect Mode: | None
>  | NULL
>  |
> | Retention:| 0
>   | NULL
>|
> | Table Type:   | MANAGED_TABLE
>   | NULL
>|
> | Table Parameters: | NULL
>  | NULL
>  |
> |   | COLUMN_STATS_ACCURATE
>   | true
>|
> |   | numFiles
>  | 256
>   |
> |   | numRows
>   | 6357592675
>|
> |   | rawDataSize
>   | 0
>   |
> |   | totalSize
>   | 54076898961
>

Re: Hive/Tez ORC tables -- rawDataSize value

2016-06-23 Thread Mich Talebzadeh
Hi,

Can you please send the output of

DESC FORMATTED <table_name>

after running (if you have not done so already)

ANALYZE TABLE <table_name> COMPUTE STATISTICS FOR COLUMNS

for both tables?
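
Something along these lines for each of the two tables (the table names here are
placeholders, not your real ones):

analyze table my_text_table compute statistics for columns;
desc formatted my_text_table;

analyze table my_orc_table compute statistics for columns;
desc formatted my_orc_table;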


HTH,



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 23 June 2016 at 23:49, Lalitha MV  wrote:

> Hi,
>
> I am using Hadoop 2.7.2, Tez 0.8.3 and Hive 2.0.1.
>
> I created a hive table with text file size = ~141 Mb.
> show tblproperties of this table (textfile):
> numFiles1
> numRows 100
> rawDataSize 141869803
> totalSize   142869803
>
> I then created a hive table, with orc compression from the above table.
> The compressed file size is ~50 Mb.
>
> show tblproperties for new table (orc):
>
> numFiles1
> numRows 100
> rawDataSize 47100
> totalSize   50444668
>
> I had two sets of questions regarding this:
>
> 1. Why is the rawDataSize so high in case of ORC table (3.3 times the text
> file size).
> How is the rawDataSize calculated in this case? (Is it the sum of each
> datatype size of the columns, multiplied the numRows)?
>
> 2. In Hive query plans, the estimated data size of the tables in each
> phase (map and reduce), are equal to the rawDataSize. The number of
> reducers get caluclated from this size (atleast in Tez, not in case of MR
> though). Isn't this wrong, shouldn't it pick the totalSize rather? Is there
> a way to force Hive/Tez to pick the totalSize in query plans/ or atleast
> while calculating the number of reducers?
>
> Thanks in advance.
>
> Cheers,
> Lalitha
>


Re: Optimize Hive Query

2016-06-24 Thread Mich Talebzadeh
Hi Sanjiv,

Normally when it comes to this, I will try to find the section of the code
which causes the largest lag.

SELECT
> sb_gu_key, m_d_key, t_ev_st_dt,
> LAG( t_ev_st_dt )  OVER ( PARTITION BY  m_d_key , sb_gu_key  ORDER BY
>  t_ev_st_dt ) AS LAG_START_DT,
> a_z_key,
> c_dt,
> e_p_dt,
> sq_nbr,
> CASE WHEN LAG( t_ev_st_dt )  OVER ( PARTITION BY  m_d_key , sb_gu_key
>  ORDER BY  t_ev_st_dt ) IS NULL  OR a_z_key <> LAG( a_z_key , 1 , -999 )
>  OVER ( PARTITION BY  m_d_key , sb_gu_key  ORDER BY  t_ev_st_dt )  THEN 'S'
>  ELSE NULL  END AS ST_FLAG
> FROM  `PRDDB`.tuning_dd_key ;




From the above query, which part is the most time consuming?

For example, is the LAG function the section that takes the
lion's share of the query time?

Just execute the code with the LAG(t_ev_st_dt) column commented out first.

I suspect

CASE WHEN LAG( t_ev_st_dt )  OVER ( PARTITION BY  m_d_key , sb_gu_key
 ORDER BY  t_ev_st_dt ) IS NULL  OR a_z_key <> LAG( a_z_key , 1 , -999 )
 OVER ( PARTITION BY

is the other possible candidate as well, with that OR being what can cause the
issue.

For example you can do the following to measure the timing (note that the LAG()
calls have to sit in a subquery, because window functions cannot appear directly
in a WHERE clause):

select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS StartTime;
SELECT COUNT(1)
FROM (
  SELECT CASE WHEN LAG(t_ev_st_dt) OVER (PARTITION BY m_d_key, sb_gu_key ORDER BY t_ev_st_dt) IS NULL
               OR  a_z_key <> LAG(a_z_key, 1, -999) OVER (PARTITION BY m_d_key, sb_gu_key ORDER BY t_ev_st_dt)
              THEN 1 END AS st_flag
  FROM `PRDDB`.tuning_dd_key
) t
WHERE st_flag = 1;
select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS EndTime;


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Any and all responsibility for any loss, damage or
destruction of data or any other property which may arise from relying on
this email's technical content is explicitly disclaimed. The author will in
no case be liable for any monetary damages arising from such loss, damage
or destruction.



On 24 June 2016 at 22:34, @Sanjiv Singh  wrote:

> Hi Vijay,
>
> Please help me on thislet me know you need other info.
>
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Thu, Jun 23, 2016 at 12:41 PM, @Sanjiv Singh 
> wrote:
>
>> Hi Gopal,
>>
>> I am using Tez as execution engine.
>>
>> DAG :
>>
>> ++--+
>> |
>>   Explain
>> |
>> +-+--+
>> | Plan not optimized by CBO.
>> |
>> |
>>|
>> | Vertex dependency in root stage
>> |
>> | Reducer 2 <- Map 1 (SIMPLE_EDGE)
>>   |
>> |
>>  |
>> | Stage-0
>>   |
>> |Fetch Operator
>>   |
>> |   limit:-1
>>   |
>> |   Stage-1
>> |
>> |  Reducer 2
>>   |
>> |  File Output Operator [FS_55596]
>> |
>> | compressed:false
>>  |
>> | Statistics:Num rows: 6357592675 Data size: 54076899328
>> Basic stats: COMPLETE Column stats: NONE  |
>> |
>> table:{"serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","input
>> format:":"org.apache.hadoop.mapred.TextInputFormat","output
>> format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"}  |
>> | Select Operator [SEL_55594]
>> |
>> |
>>  
>> outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"]
>>|
>> |Statistics:Num rows: 6357592675 Data size: 54076899328
>> Basic stats: COMPLETE Column stats: NONE   |
>> |PTF Operator [PTF_55593]
>>|
>> |   Function definitions:[{"Input
>> definition":{"type:"

Querying Hive tables from Spark

2016-06-27 Thread Mich Talebzadeh
Hi,

I have done some extensive tests with Spark querying Hive tables.

It appears to me that Spark does not rely on statistics that are collected
by Hive on say ORC tables. It seems that Spark uses its own optimization to
query the Hive tables irrespective of what Hive has collected by way of
statistics etc.?

Case in point I have a FACT table bucketed on 5 dimensional foreign keys
like below

 CREATE TABLE IF NOT EXISTS oraclehadoop.sales2
 (
  PROD_IDbigint   ,
  CUST_IDbigint   ,
  TIME_IDtimestamp,
  CHANNEL_ID bigint   ,
  PROMO_ID   bigint   ,
  QUANTITY_SOLD  decimal(10)  ,
  AMOUNT_SOLDdecimal(10)
)
CLUSTERED BY (PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="SNAPPY",
"orc.create.index"="true",
"orc.bloom.filter.columns"="PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID",
"orc.bloom.filter.fpp"="0.05",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="1")

The table is sorted in the order of prod_id, cust_id, time_id, channel_id and
promo_id. It has 22 million rows.
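
For reference, the table statistics referred to above are the ones gathered with
the usual commands, along these lines (a sketch):

analyze table oraclehadoop.sales2 compute statistics;
analyze table oraclehadoop.sales2 compute statistics for columns;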

A simple query like below:

val s = HiveContext.table("sales2")
  s.filter($"prod_id" ===13 && $"cust_id" === 50833 && $"time_id" ===
"2000-12-26 00:00:00" && $"channel_id" === 2 && $"promo_id" === 999
).explain
  s.filter($"prod_id" ===13 && $"cust_id" === 50833 && $"time_id" ===
"2000-12-26 00:00:00" && $"channel_id" === 2 && $"promo_id" === 999
).collect.foreach(println)

Shows the plan as

== Physical Plan ==
Filter (prod_id#10L = 13) && (cust_id#11L = 50833)) && (time_id#12 =
9777888)) && (channel_id#13L = 2)) && (promo_id#14L = 999))
+- HiveTableScan
[prod_id#10L,cust_id#11L,time_id#12,channel_id#13L,promo_id#14L,quantity_sold#15,amount_sold#16],
MetastoreRelation oraclehadoop, sales2, None

*Spark returns 24 rows pretty fast in 22 seconds.*

Running the same on Hive with Spark as execution engine shows:

STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
TableScan
  alias: sales2
  Filter Operator
predicate: (prod_id = 13) and (cust_id = 50833)) and
(UDFToString(time_id) = '2000-12-26 00:00:00')) and (channel_id = 2)) and
(promo_id = 999)) (type: boolean)
Select Operator
  expressions: 13 (type: bigint), 50833 (type: bigint),
2000-12-26 00:00:00.0 (type: timestamp), 2 (type: bigint), 999 (type:
bigint), quantity_sold (type: decimal(10,0)), amount_sold (type:
decimal(10,0))
  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
_col6
      ListSink

*And Hive on Spark returns the same 24 rows in 30 seconds*

OK, the Hive query is just slower with the Spark engine.

Assuming that the time taken is optimization time + query time, it
appears that in most cases the optimization time does not really make that much
impact on the overall performance?


Let me know your thoughts.


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Optimize Hive Query

2016-06-27 Thread Mich Talebzadeh
Hi,

Curious to see if this issue has been resolved (performance-wise) after compaction?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 26 June 2016 at 21:11, @Sanjiv Singh  wrote:

> Thanks Gopal for your inputs. For now I have created a non-ACID table and
> loaded the data; see below from the logs that proper grouped splits are happening.
>
> 2016-06-25 12:52:00,160 [INFO] [InputInitializer {Map 1} #0]
> |tez.HiveSplitGenerator|: Number of grouped splits: 512
>
>
> On the compaction issue: compaction is enabled with two workers. Why did compaction
> not happen? I will check the metastore logs.
>
> I have a lot of ACID tables in Hive; how many workers should be
> configured? Currently it is 2.
>
> Thanks a lot once again.
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Fri, Jun 24, 2016 at 9:14 PM, @Sanjiv Singh 
> wrote:
>
>> Thanks Gopal for your inputs. Let me run compaction explicitly on table
>> then see how query works.
>>
>>
>>
>> Let
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Fri, Jun 24, 2016 at 7:53 PM, Gopal Vijayaraghavan 
>> wrote:
>>
>>>
>>> > Yes for this tables, ACID enabled.  it has only 256 files for each
>>> >buckets. these are create only when data initially loaded in this table.
>>>
>>> Yes, the initial load goes in as an insert DELTA too - that requires
>>> another compaction to move into base files.
>>>
>>> The fact that they haven't been automatically compacted yet, suggests
>>> that
>>> the compactor isn't working for some reason (check hive metastore logs).
>>>
>>> > One thing that I am not able to understand that its is running with 1
>>> >MAPPER.
>>>
>>> The size of deltas shows up as 0, till the compaction goes through - in
>>> Hive2, it will be -1 which will be correctly interpreted as "unknown
>>> size".
>>>
>>>
>>> > | -rw-r--r--   3 H56473 hdfs  215973009 2016-06-23 17:38
>>>
>>> >/apps/hive/warehouse/PRDDB.db/tuning_dd_key/delta_0001570_0001570/bucket_0
>>> >  |
>>>
>>> Clearly an issue due to the lack of compaction - I see a single delta
>>> with
>>> 255 buckets and no base_* files at all.
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>


Re: Querying Hive tables from Spark

2016-06-27 Thread Mich Talebzadeh
Thanks Gopal.


I added a compact index to this table, as below, on 5 columns:

hive> show formatted indexes on sales2;
OK
idx_name    tab_name    col_names    idx_tab_name    idx_type    comment

sales2_idx  sales2  prod_id, cust_id, time_id,
channel_id, promo_id oraclehadoop__sales2_sales2_idx__   compact
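
For the record, the index was created along these lines (a sketch; the exact
original statement may have differed slightly):

create index sales2_idx on table oraclehadoop.sales2
  (prod_id, cust_id, time_id, channel_id, promo_id)
  as 'COMPACT' with deferred rebuild;
alter index sales2_idx on oraclehadoop.sales2 rebuild;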

But, as I expected, the CBO ignores it:

STAGE PLANS:
  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
TableScan
  alias: sales2
  Statistics: Num rows: 22052232 Data size: 6527460672 Basic stats:
COMPLETE Column stats: NONE
  Filter Operator
predicate: (prod_id = 13) and (cust_id = 50833)) and
(UDFToString(time_id) = '2000-12-26 00:00:00')) and (channel_id = 2)) and
(promo_id = 999)) (type: boolean)
Statistics: Num rows: 689132 Data size: 203983072 Basic stats:
COMPLETE Column stats: NONE
Select Operator
  expressions: 13 (type: bigint), 50833 (type: bigint),
2000-12-26 00:00:00.0 (type: timestamp), 2 (type: bigint), 999 (type:
bigint), quantity_sold (type: decimal(10,0)), amount_sold (type:
decimal(10,0))
  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
_col6
  Statistics: Num rows: 689132 Data size: 203983072 Basic
stats: COMPLETE Column stats: NONE
  ListSink

thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 June 2016 at 17:38, Gopal Vijayaraghavan  wrote:

> > It appears to me that Spark does not rely on statistics that are
> >collected by Hive on say ORC tables.
> > It seems that Spark uses its own optimization to query the Hive tables
> >irrespective of Hive has collected by way of statistics etc?
>
> Spark does not have a cost based optimizer yet - please follow this JIRA,
> which suggests that it is planned for the future.
>
> <https://issues.apache.org/jira/browse/SPARK-16026>
>
>
> > CLUSTERED BY (PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID) INTO 256
> >BUCKETS
> ...
> > Table is sorted in the order of prod_id, cust_id,time_id, channel_id and
> >promo_id. It has 22 million rows.
>
> No, it is not.
>
> Due to whatever backwards compatibility brain-damage of Hive-1, CLUSTERED
> BY *DOES* not CLUSTER at all.
>
> Add at least
>
> SORTED BY (PROD_ID)
>
> if what you care about is scanning performance with the ORC indexes.
>
>
> > And Hive on Spark returns the same 24 rows in 30 seconds
>
> That sounds slow for 22 million rows. That should be a 5-6 second query in
> Hive on a single 16-core box.
>
> Is this a build from source? Has the build got log4j1.x with INFO/DEBUG?
>
> > Assuming that the time taken will be optimization time + query time then
> >it appears that in most cases the optimization time does not really make
> >that impact on the overall performance?
>
> The optimizer's impact is most felt when you have 3+ joins - computing
> join order, filter transitivity etc.
>
> In this case, all the optimizer does is simplify predicates.
>
> Cheers,
> Gopal
>
>
>


Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-06 Thread Mich Talebzadeh
Dear forum members

I will be presenting on the topic of "Running Spark on Hive or Hive on
Spark, your mileage varies" in Future of Data: London
<http://www.meetup.com/futureofdata-london/events/232423292/>

*Details*

*Organized by: Hortonworks <http://hortonworks.com/>*

*Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM *

*Place: London*

*Location: One Canada Square, Canary Wharf,  London E14 5AB.*

*Nearest Underground:  Canary Wharf (map
<https://maps.google.com/maps?f=q&hl=en&q=One+Canada+Square%2C+Canary+Wharf%2C+E14+5AB%2C+London%2C+gb>)
*

If you are interested please register here
<http://www.meetup.com/futureofdata-london/events/232423292/>

Looking forward to seeing those who can make it to have an interesting
discussion and leverage your experience.
Regards,

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: hive 2.1.0 beeline cannot show verbose log

2016-07-07 Thread Mich Talebzadeh
Well this works in Hive 2

hive --hiveconf hive.root.logger=DEBUG,console
Logging initialized using configuration in
file:/usr/lib/hive/conf/hive-log4j2.properties
16/07/07 11:36:22 [main]: INFO SessionState:
Logging initialized using configuration in
file:/usr/lib/hive/conf/hive-log4j2.properties
16/07/07 11:36:22 [main]: DEBUG conf.VariableSubstitution: Substitution is
on: hive

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 7 July 2016 at 09:46, Xionghua Hu  wrote:

> Dear all,
>
>
> In Hive 1.2.1, with verbose logging configured, the beeline client shows the
> verbose log like this:
>
> 16/07/07 13:29:33 INFO mapreduce.Job: The url to track the job:
> http://host:8088/proxy/application_1467708727273_0035/
> 16/07/07 13:29:33 INFO exec.Task: Starting Job = job_1467708727273_0035,
> Tracking URL = http://host:8088/proxy/application_1467708727273_0035/
> 16/07/07 13:29:33 INFO exec.Task: Kill Command =
> /hadoop/hadoop-2.7.2/bin/hadoop job  -kill job_1467708727273_0035
> 16/07/07 13:30:07 INFO exec.Task: Hadoop job information for Stage-1:
> number of mappers: 1; number of reducers: 1
> 16/07/07 13:30:07 WARN mapreduce.Counters: Group
> org.apache.hadoop.mapred.Task$Counter is deprecated. Use
> org.apache.hadoop.mapreduce.TaskCounter instead
> 16/07/07 13:30:07 INFO exec.Task: 2016-07-07 13:30:07,905 Stage-1 map =
> 0%,  reduce = 0%
> 16/07/07 13:30:17 INFO exec.Task: 2016-07-07 13:30:17,757 Stage-1 map =
> 100%,  reduce = 0%, Cumulative CPU 2.39 sec
> 16/07/07 13:30:28
>
> the verbose config:
>
>   
> hive.server2.logging.operation.enabled
> true
> When true, HS2 will save operation logs and make them
> available for clients
>   
>   
> hive.server2.logging.operation.log.location
> /hadooplog/apache-hive-1.2.1-bin/operation_logs
> Top level directory where operation logs are stored if
> logging functionality is enabled
>   
>   
> hive.server2.logging.operation.level
> VERBOSE
> 
>   Expects one of [none, execution, performance, verbose].
>   HS2 operation logging mode available to clients to be set at session
> level.
>   For this to work, hive.server2.logging.operation.enabled should be
> set to true.
> NONE: Ignore any logging
> EXECUTION: Log completion of tasks
> PERFORMANCE: Execution + Performance logs
> VERBOSE: All logs
> 
>   
>
> However, after upgrading to Hive 2.1.0 with the same verbose configuration, the
> verbose log is not shown. (Also, Hue 3.10.0 cannot display the verbose log.)
>
> So, how can the verbose log be enabled in Hive 2.1.0?
>
> Any advice is appreciated.
>
> Thanks in advance!
>
>
>


Re: hive 2.1.0 beeline cannot show verbose log

2016-07-07 Thread Mich Talebzadeh
Hi

Is this available in Hive 2?

hive> set hive.async.log.enabled=false;
Query returned non-zero code: 1, cause: hive configuration
hive.async.log.enabled does not exists.
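
For what it is worth, a sketch of an alternative way of trying the same property without relying on "set" (mirroring the --hiveconf form I used earlier in this thread; whether this build honours it that way is something I have not verified):

hive --hiveconf hive.async.log.enabled=false

or put the property in hive-site.xml and restart HiveServer2.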

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 8 July 2016 at 04:26, Xionghua Hu  wrote:

> Apologies for my carelessness.
>
>  hive.async.log.enabled=false;
>
> can resolve the issue.
>
> The verbose log now shows normally.
>
> Thanks !
>
> 2016-07-08 10:48 GMT+08:00 Xionghua Hu :
>
>> I have tried the set : hive.async.log.enabled=false;
>>
>> The same issue happens. The verbose log cannot show.
>>
>> 2016-07-08 0:15 GMT+08:00 Prasanth Jayachandran <
>> pjayachand...@hortonworks.com>:
>>
>>> Hi
>>>
>>> Can you try disabling async logging in HS2 and see if that helps?
>>>
>>> set hive.async.log.enabled=false;
>>>
>>> Thanks
>>> Prasanth
>>>
>>>
>>>
>>>
>>> On Thu, Jul 7, 2016 at 3:37 AM -0700, "Mich Talebzadeh" <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>> Well this works in Hive 2
>>>
>>> hive --hiveconf hive.root.logger=DEBUG,console
>>> Logging initialized using configuration in
>>> file:/usr/lib/hive/conf/hive-log4j2.properties
>>> 16/07/07 11:36:22 [main]: INFO SessionState:
>>> Logging initialized using configuration in
>>> file:/usr/lib/hive/conf/hive-log4j2.properties
>>> 16/07/07 11:36:22 [main]: DEBUG conf.VariableSubstitution: Substitution
>>> is on: hive
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 7 July 2016 at 09:46, Xionghua Hu  wrote:
>>>
>>>> Dear all,
>>>>
>>>>
>>>> In Hive 1.2.1 , config the log verbose , the beeline client will show
>>>> the verbose log like this:
>>>>
>>>> 16/07/07 13:29:33 INFO mapreduce.Job: The url to track the job:
>>>> http://host:8088/proxy/application_1467708727273_0035/
>>>> 16/07/07 13:29:33 INFO exec.Task: Starting Job =
>>>> job_1467708727273_0035, Tracking URL =
>>>> http://host:8088/proxy/application_1467708727273_0035/
>>>> 16/07/07 13:29:33 INFO exec.Task: Kill Command =
>>>> /hadoop/hadoop-2.7.2/bin/hadoop job  -kill job_1467708727273_0035
>>>> 16/07/07 13:30:07 INFO exec.Task: Hadoop job information for Stage-1:
>>>> number of mappers: 1; number of reducers: 1
>>>> 16/07/07 13:30:07 WARN mapreduce.Counters: Group
>>>> org.apache.hadoop.mapred.Task$Counter is deprecated. Use
>>>> org.apache.hadoop.mapreduce.TaskCounter instead
>>>> 16/07/07 13:30:07 INFO exec.Task: 2016-07-07 13:30:07,905 Stage-1 map =
>>>> 0%,  reduce = 0%
>>>> 16/07/07 13:30:17 INFO exec.Task: 2016-07-07 13:30:17,757 Stage-1 map =
>>>> 100%,  reduce = 0%, Cumulative CPU 2.39 sec
>>>> 16/07/07 13:30:28
>>>>
>>>> the verbose config:
>>>>
>>>>   
>>>> hive.server2.logging.operation.enabled
>>>> true
>>>> When true, HS2 will save operation logs and make them
>>>> available for clients
>>>>   
>>>>   
>>>> hive.server2.logging.operation.log.location
>>>> /hadooplog/apache-hive-1.2.1-bin/operation_logs
>>>> Top level directory where operation logs are stored if
>>>> logging functionality is enabled
>>>>   
>>>>   
>>>> hive.server2.logging.operation.level
>>>> VERBOSE
>>>> 
>>>>   Expects one of [none, execution, performance, verbose].
>>>>   HS2 operation logging mode available to clients to be set at
>>>> session level.
>>>>   For this to work, hive.server2.logging.operation.enabled should
>>>> be set to true.
>>>> NONE: Ignore any logging
>>>> EXECUTION: Log completion of tasks
>>>> PERFORMANCE: Execution + Performance logs
>>>> VERBOSE: All logs
>>>> 
>>>>   
>>>>
>>>> However, when upgrade to hive 2.1.0, with the same verbose configure,
>>>> the verbose log cannot show.(Also, hue 3.10.0 cannot display the verbose
>>>> log)
>>>>
>>>> so , how to enable the verbose log in hive 2.1.0?
>>>>
>>>> Any advise is appreciated.
>>>>
>>>> Thanks in advance!
>>>>
>>>>
>>>>
>>>
>>
>


Re: Hive Metastore on Amazon Aurora

2016-07-11 Thread Mich Talebzadeh
Hi Elliot,

Am I correct that you want to put your Hive metastore on Amazon? Is the
metastore (database/schema) currently sitting on MySQL, and do you want to
migrate that MySQL instance to the cloud now?

Two questions need to be answered first:


   1. How big is your current metadata
   2. Do you do a lot of transaction activity using ORC files with
   Insert/Update/Delete that need to communicate with metastore with heartbeat
   etc?
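
Assuming the answers make Aurora a candidate, the usual metastore connection properties in hive-site.xml would simply point at the MySQL-compatible endpoint. A sketch only; the endpoint, database name and credentials below are placeholders and I have not tested this against Aurora myself:

javax.jdo.option.ConnectionURL         jdbc:mysql://<aurora-cluster-endpoint>:3306/hive_metastore?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName  com.mysql.jdbc.Driver
javax.jdo.option.ConnectionUserName    <metastore_user>
javax.jdo.option.ConnectionPassword    <metastore_password>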


HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 13:58, Elliot West  wrote:

> Hello,
>
> Is anyone running the Hive metastore database on Amazon Aurora?:
> https://aws.amazon.com/rds/aurora/details/. My expectation is that it
> should work nicely as it is derived from MySQL but I'd be keen to hear of
> user's experiences with this setup.
>
> Many thanks,
>
> Elliot.
>


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
The presentation will go deeper into the topic. In the meantime, some thoughts
of mine. Feel free to comment or criticise :)


   1. I am a member of Spark Hive and Tez user groups plus one or two others
   2. Spark is by far the biggest in terms of community interaction
   3. Tez, typically one thread in a month
   4. Personally started building Tez for Hive from Tez source and gave up
   as it was not working. This was my own build as opposed to a distro
   5. If Hive says you should use Spark or Tez, then using Spark is a
   perfectly valid choice
   6. If Tez & LLAP only offer you what Spark already gives (DAG + in-memory
   caching) under the bonnet, why bother?
   7. Yes, I have seen some test results (Hive on Spark vs Hive on Tez) etc.,
   but they are a bit dated (not being unkind) and cannot be taken as-is
   today. One of their concerns, if I recall, was excessive CPU and memory
   usage of Spark, but by the same token LLAP will add its own need for
   resources
   8. Essentially I am more comfortable using less of a technology stack
   than more. With Hive and Spark (in this context) we have two. With Hive,
   Tez and LLAP, we have three stacks to look after, which adds to skill cost
   as well.
   9. Yep. It is still good to keep it simple


My thoughts on this are that if you have a viable open source product like
Spark, which is becoming something of a vogue in the Big Data space and is
moving very fast, why look for another one? Hive does what it says on the tin
and is a good, reliable Data Warehouse.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 15:22, Ashok Kumar  wrote:

> Hi Mich,
>
> Your recent presentation in London on this topic "Running Spark on Hive or
> Hive on Spark"
>
> Have you made any more interesting findings that you like to bring up?
>
> If Hive is offering both Spark and Tez in addition to MR, what stopping
> one not to use Spark? I still don't get why TEZ + LLAP is going to be a
> better choice from what you mentioned?
>
> thanking you
>
>
>
> On Tuesday, 31 May 2016, 20:22, Mich Talebzadeh 
> wrote:
>
>
> Couple of points if I may and kindly bear with my remarks.
>
> Whilst it will be very interesting to try TEZ with LLAP. As I read from
> LLAP
>
> "Sub-second queries require fast query execution and low setup cost. The
> challenge for Hive is to achieve this without giving up on the scale and
> flexibility that users depend on. This requires a new approach using a
> hybrid engine that leverages Tez and something new called  LLAP (Live Long
> and Process, #llap online).
>
> LLAP is an optional daemon process running on multiple nodes, that
> provides the following:
>
>- Caching and data reuse across queries with compressed columnar data
>in-memory (off-heap)
>- Multi-threaded execution including reads with predicate pushdown and
>hash joins
>- High throughput IO using Async IO Elevator with dedicated thread and
>core per disk
>- Granular column level security across applications
>- "
>
> OK so we have added an in-memory capability to TEZ by way of LLAP, In
> other words what Spark does already and BTW it does not require a daemon
> running on any host. Don't take me wrong. It is interesting but this sounds
> to me (without testing myself) adding caching capability to TEZ to bring it
> on par with SPARK.
>
> Remember:
>
> Spark -> DAG + in-memory caching
> TEZ = MR on DAG
> TEZ + LLAP => DAG + in-memory caching
>
> OK it is another way getting the same result. However, my concerns:
>
>
>- Spark has a wide user base. I judge this from Spark user group
>traffic
>- TEZ user group has no traffic I am afraid
>- LLAP I don't know
>
> Sounds like Hortonworks promote TEZ and Cloudera does not want to know
> anything about Hive. and they promote Impala but that sounds like a sinking
> ship these days.
>
> Having said that I will try TEZ + LLAP :) No pun intended
>
> Regards
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
> http://talebzadehmich.wordpress.com
>
>
> On 31 May 2016 at 08:1

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
I leave it to you guys to guess which one is better :)

Cheers


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 17:02, Michael Segel  wrote:

> Just a clarification.
>
> Tez is ‘vendor’ independent.  ;-)
>
> Yeah… I know…  Anyone can support it.  Only Hortonworks has stacked the
> deck in their favor.
>
> Drill could be in the same boat, although there now more committers who
> are not working for MapR. I’m not sure who outside of HW is supporting Tez.
>
> But I digress.
>
> Here in the Spark user list, I have to ask how do you run hive on spark?
> Is the execution engine … the spark context always running? (Client mode I
> assume)
> Are the executors always running?   Can you run multiple queries from
> multiple users in parallel?
>
> These are some of the questions that should be asked and answered when
> considering how viable spark is going to be as the engine under Hive…
>
> Thx
>
> -Mike
>
> On May 29, 2016, at 3:35 PM, Mich Talebzadeh 
> wrote:
>
> thanks I think the problem is that the TEZ user group is exceptionally
> quiet. Just sent an email to Hive user group to see anyone has managed to
> built a vendor independent version.
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 29 May 2016 at 21:23, Jörn Franke  wrote:
>
>> Well I think it is different from MR. It has some optimizations which you
>> do not find in MR. Especially the LLAP option in Hive2 makes it
>> interesting.
>>
>> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is
>> integrated in the Hortonworks distribution.
>>
>>
>> On 29 May 2016, at 21:43, Mich Talebzadeh 
>> wrote:
>>
>> Hi Jorn,
>>
>> I started building apache-tez-0.8.2 but got few errors. Couple of guys
>> from TEZ user group kindly gave a hand but I could not go very far (or may
>> be I did not make enough efforts) making it work.
>>
>> That TEZ user group is very quiet as well.
>>
>> My understanding is TEZ is MR with DAG but of course Spark has both plus
>> in-memory capability.
>>
>> It would be interesting to see what version of TEZ works as execution
>> engine with Hive.
>>
>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of
>> Hive etc as I am sure you already know.
>>
>> Cheers,
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 29 May 2016 at 20:19, Jörn Franke  wrote:
>>
>>> Very interesting do you plan also a test with TEZ?
>>>
>>> On 29 May 2016, at 13:40, Mich Talebzadeh 
>>> wrote:
>>>
>>> Hi,
>>>
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>>
>>> Basically took the original table imported using Sqoop and created and
>>> populated a new ORC table partitioned by year and month into 48 partitions
>>> as follows:
>>>
>>> 
>>>
>>> Connections use JDBC via beeline. Now for each partition using MR it
>>> takes an average of 17 minutes as seen below for each PARTITION..  Now that
>>> is just an individual partition and there are 48 partitions.
>>>
>>> In contrast doing the same operation with Spark engine took 10 minutes
>>> all inclusive. I just gave up on MR. You can see the StartTime and
>>> FinishTime from below
>>>
>>> 
>>>
>>> This by no means indicates that Spark is much better than MR, but it shows
>>> that some very good results can be achieved using the Spark engine.
>>>
>>>
>>> Dr Mich Talebzadeh
>

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Ended Job = job_1468226887011_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 23  Reduce: 1   Cumulative CPU: 499.37 sec   HDFS Read:
403754774 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 8 minutes 19 seconds 370 msec
OK
1
Time taken: 202.333 seconds, Fetched: 1 row(s)

So in summary:

Table        MR/sec     Spark/sec
Parquet      239.532    14.38
ORC          202.333    17.77

Still, I would use Spark if I had a choice, and I agree that on VLTs (very
large tables) the limitation in available memory may be the overriding
factor in using Spark.

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 19:25, Gopal Vijayaraghavan  wrote:

>
> > Status: Finished successfully in 14.12 seconds
> > OK
> > 1
> > Time taken: 14.38 seconds, Fetched: 1 row(s)
>
> That might be an improvement over MR, but that still feels far too slow.
>
>
> Parquet numbers are in general bad in Hive, but that's because the Parquet
> reader gets no actual love from the devs. The community, if it wants to
> keep using Parquet heavily needs a Hive dev to go over to Parquet-mr and
> cut a significant number of memory copies out of the reader.
>
> The Spark 2.0 build for instance, has a custom Parquet reader for SparkSQL
> which does this. SPARK-12854 does for Spark+Parquet what Hive 2.0 does for
> ORC (actually, it looks more like hive's VectorizedRowBatch than
> Tungsten's flat layouts).
>
> But that reader cannot be used in Hive-on-Spark, because it is not a
> public reader impl.
>
>
> Not to pick an arbitrary dataset, my workhorse example is a TPC-H lineitem
> at 10Gb scale on a single 16-core box.
>
> hive(tpch_flat_orc_10)> select max(l_discount) from lineitem;
> Query ID = gopal_20160711175917_f96371aa-2721-49c8-99a0-f7c4a1eacfda
> Total jobs = 1
> Launching Job 1 out of 1
>
>
> Status: Running (Executing on YARN cluster with App id
> application_1466700718395_0256)
>
> ---
> ---
> VERTICES  MODESTATUS  TOTAL  COMPLETED  RUNNING
> PENDING  FAILED  KILLED
> ---
> ---
> Map 1 ..  llap SUCCEEDED 13 130
> 0   0   0
> Reducer 2 ..  llap SUCCEEDED  1  10
> 0   0   0
> ---
> ---
> VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 0.71 s
>
> ---
> ---
> Status: DAG finished successfully in 0.71 seconds
>
> Query Execution Summary
> ---
> ---
> OPERATIONDURATION
> ---
> ---
> Compile Query   0.21s
> Prepare Plan0.13s
> Submit Plan 0.34s
> Start DAG   0.23s
> Run DAG 0.71s
> ---
> ---
>
> Task Execution Summary
> ---
> ---
>   VERTICES   DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS
> OUTPUT_RECORDS
> ---
> ---
>  Map 1 604.00 00 59,957,438
>   13
>  Reducer 2 105.00 00 13
>0
> ---
> ---
>
> LLAP IO Summary
> ---
> ---
>   VERTICES ROWGROUPS  META_HIT  META_MISS  DATA_HIT  DATA_MISS  ALLOCATION
> USED  TOTAL

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Mich Talebzadeh
Another point on Hive on Spark versus Hive on Tez + LLAP; I am thinking out
loud :)


   1. I am using Hive on Spark and I have a table of, say, 10GB with 100
   users concurrently accessing the same partition of an ORC table (the last
   hour or so of data)
   2. Spark takes the data and puts it in memory. I gather the data for that
   partition will be loaded separately for each of the 100 users. In other
   words there will be 100 copies.
   3. Spark, unlike an RDBMS, does not have the notion of a hot cache with
   Most Recently Used (MRU) / Least Recently Used (LRU) chains. So once a
   user finishes, the data is released from Spark memory and the next user
   will load that data again. Potentially this is wasteful of resources?
   (See the sketch after this list.)
   4. With Tez we only have DAG; it is MR with DAG. So the same algorithm
   will be applied to the 100 user sessions, but with no in-memory caching
   5. If I add LLAP, will that be more efficient in terms of memory usage
   compared to Hive or not? Will it keep the data in memory for reuse or not?
   6. What I don't understand is what makes Tez and LLAP more efficient
   compared to Spark!
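
On point 3, a sketch of what explicit caching looks like on the Spark side (Spark SQL syntax; the table and predicate are illustrative only, re-using names from earlier in this thread). The point is that pinning and eviction are manual and per shared context, rather than automatic as LLAP claims to be:

CACHE TABLE hot_sales AS
SELECT * FROM sales2 WHERE time_id >= '2016-07-11 00:00:00';
-- concurrent queries against hot_sales then reuse the single in-memory copy
UNCACHE TABLE hot_sales;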

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 11 July 2016 at 21:54, Mich Talebzadeh  wrote:

> In my test I did like for like keeping the systematic the same namely:
>
>
>1. Table was a parquet table of 100 Million rows
>2. The same set up was used for both Hive on Spark and Hive on MR
>3. Spark was very impressive compared to MR on this particular test.
>
>
> Just to see any issues I created an ORC table in the image of Parquet
> (insert/select from Parquet to ORC) with stats updated for columns etc
>
> These were the results of the same run using ORC table this time:
>
> hive> select max(id) from oraclehadoop.dummy;
>
> Starting Spark Job = b886b869-5500-4ef7-aab9-ae6fb4dad22b
> Query Hive on Spark job[1] stages:
> 2
> 3
> Status: Running (Hive on Spark job[1])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-07-11 21:35:45,020 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:48,033 Stage-2_0: 0(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:51,046 Stage-2_0: 1(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:52,050 Stage-2_0: 3(+8)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:53,055 Stage-2_0: 8(+4)/23 Stage-3_0: 0/1
> 2016-07-11 21:35:54,060 Stage-2_0: 11(+1)/23Stage-3_0: 0/1
> 2016-07-11 21:35:55,065 Stage-2_0: 12(+0)/23Stage-3_0: 0/1
> 2016-07-11 21:35:56,071 Stage-2_0: 12(+8)/23Stage-3_0: 0/1
> 2016-07-11 21:35:57,076 Stage-2_0: 13(+8)/23Stage-3_0: 0/1
> 2016-07-11 21:35:58,081 Stage-2_0: 20(+3)/23Stage-3_0: 0/1
> 2016-07-11 21:35:59,085 Stage-2_0: 23/23 Finished   Stage-3_0: 0(+1)/1
> 2016-07-11 21:36:00,089 Stage-2_0: 23/23 Finished   Stage-3_0: 1/1
> Finished
> Status: Finished successfully in 16.08 seconds
> OK
> 1
> Time taken: 17.775 seconds, Fetched: 1 row(s)
>
> Repeat with MR engine
>
> hive> set hive.execution.engine=mr;
> Hive-on-MR is deprecated in Hive 2 and may not be available in the future
> versions. Consider using a different execution engine (i.e. spark, tez) or
> using Hive 1.X releases.
>
> hive> select max(id) from oraclehadoop.dummy;
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
> the future versions. Consider using a different execution engine (i.e.
> spark, tez) or using Hive 1.X releases.
> Query ID = hduser_20160711213100_8dc2afae-8644-4097-ba33-c7bd3c304bf8
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=
> Starting Job = job_1468226887011_0008, Tracking URL =
> http://rhes564:8088/proxy/application_1468226887011_0008/
> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
> job_1468226887011_0008
> Hadoop job information for Stage-1: number of mappers: 23; number of
> reducers: 1
> 2016-07-11 21:37:00,061 Stage-1 map = 0%,  reduce = 0%
> 2016-07-11 21:3

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
This is the whole idea. Spark uses DAG + IM (in-memory); MR is classic MapReduce.


This is for Hive on Spark

hive> explain select max(id) from dummy_parquet;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
Spark
  Edges:
Reducer 2 <- Map 1 (GROUP, 1)

*  DagName:
hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1*  Vertices:
Map 1
Map Operator Tree:
TableScan
  alias: dummy_parquet
  Statistics: Num rows: 1 Data size: 7
Basic stats: COMPLETE Column stats: NONE
  Select Operator
expressions: id (type: int)
outputColumnNames: id
Statistics: Num rows: 1 Data size: 7
Basic stats: COMPLETE Column stats: NONE
Group By Operator
  aggregations: max(id)
  mode: hash
  outputColumnNames: _col0
  Statistics: Num rows: 1 Data size: 4 Basic stats:
COMPLETE Column stats: NONE
  Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 4 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col0 (type: int)
Reducer 2
Reduce Operator Tree:
  Group By Operator
aggregations: max(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
Column stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 1 Data size: 4 Basic stats:
COMPLETE Column stats: NONE
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink
Time taken: 2.801 seconds, Fetched: 50 row(s)

And this is with setting the execution engine to MR

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions. Consider using a different execution engine (i.e. spark, tez) or
using Hive 1.X releases.

hive> explain select max(id) from dummy_parquet;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Map Operator Tree:
  TableScan
alias: dummy_parquet
Statistics: Num rows: 1 Data size: 7 Basic
stats: COMPLETE Column stats: NONE
Select Operator
  expressions: id (type: int)
  outputColumnNames: id
  Statistics: Num rows: 1 Data size: 7 Basic
stats: COMPLETE Column stats: NONE
  Group By Operator
aggregations: max(id)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
Column stats: NONE
Reduce Output Operator
  sort order:
  Statistics: Num rows: 1 Data size: 4 Basic stats:
COMPLETE Column stats: NONE
  value expressions: _col0 (type: int)
  Reduce Operator Tree:
Group By Operator
  aggregations: max(VALUE._col0)
  mode: mergepartial
  outputColumnNames: _col0
  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
stats: NONE
  File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink
Time taken: 0.1 seconds, Fetched: 44 row(s)


HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 July 2016 at 08:16, Markovitz, Dudu  wrote:


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
That is only a plan, not what the execution engine is doing.

As I stated before, Spark uses DAG + in-memory computing; MR is serial on
disk.

The key is the execution here or rather the execution engine.

In general

The standard MapReduce, as I know it, reads the data from HDFS, applies the
map-reduce algorithm and writes back to HDFS. If there are many iterations of
map-reduce then there will be many intermediate writes to HDFS, all serial
writes to disk. Each map-reduce step is completely independent of other
steps, and the executing engine does not have any global knowledge of which
map-reduce steps are going to come after each one. For many iterative
algorithms this is inefficient, as the data between each map-reduce pair gets
written to and read from the file system.

The equivalent of this parallelism in Big Data is deploying what is known as a
Directed Acyclic Graph (DAG
<https://en.wikipedia.org/wiki/Directed_acyclic_graph>) algorithm. In a
nutshell, deploying a DAG gives a fuller picture for global optimisation:
parallelism, pipelining consecutive map steps into one, and not writing
intermediate data to HDFS. In short this prevents writing data back and forth
after every reduce step, which for me is a significant improvement compared
to the classical MapReduce algorithm.

Now Tez is basically MR with DAG. With Spark you get DAG + in-memory
computing. Think of it as a comparison between a classic RDBMS like Oracle
and an IMDB like Oracle TimesTen with in-memory processing.

The outcome is that Hive using Spark as its execution engine is pretty
impressive. You have the advantage of Hive's CBO + in-memory computing. If
you use Spark for all of this (say Spark SQL) without Hive, Spark uses its
own optimizer called Catalyst, which does not have a CBO yet, plus in-memory
computing.

As usual your mileage varies.
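
The quickest way to see the difference on your own data is to run the same aggregate under both engines in one session (re-using the query from earlier in this thread; the engine names are the ones listed in Hive's own deprecation warning):

hive> set hive.execution.engine=mr;
hive> select max(id) from dummy_parquet;
hive> set hive.execution.engine=spark;
hive> select max(id) from dummy_parquet;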

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 July 2016 at 09:33, Markovitz, Dudu  wrote:

> I don’t see how this explains the time differences.
>
>
>
> Dudu
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Tuesday, July 12, 2016 10:56 AM
> *To:* user 
> *Cc:* user @spark 
>
> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its
> execution engine
>
>
>
> This the whole idea. Spark uses DAG + IM, MR is classic
>
>
>
>
>
> This is for Hive on Spark
>
>
>
> hive> explain select max(id) from dummy_parquet;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
>
> STAGE PLANS:
>   Stage: Stage-1
> Spark
>   Edges:
> Reducer 2 <- Map 1 (GROUP, 1)
> *  DagName:
> hduser_20160712083219_632c2749-7387-478f-972d-9eaadd9932c6:1*
>   Vertices:
> Map 1
> Map Operator Tree:
> TableScan
>   alias: dummy_parquet
>   Statistics: Num rows: 1 Data size: 7
> Basic stats: COMPLETE Column stats: NONE
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 7
> Basic stats: COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: max(id)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 4 Basic stats:
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 4 Basic stats:
> COMPLETE Column stats: NONE
> value expressions: _col0 (type: int)
> Reducer 2
> Reduce Operator Tree:
>   Group By Operator
> aggregations: max(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: NONE
> File Output Operator
>   compressed: false
>   Statistics: Num rows: 1 Data size: 4 Basic stats:
> COMPLETE Colum

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
I suggest that you try it for yourself then

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 July 2016 at 10:35, Markovitz, Dudu  wrote:

> The principals are very clear and if our use-case was a complex one,
> combined from many stages I would expect performance benefits from the
> Spark engine.
>
> Since our use-case is a simple one and most of the work here is just
> reading the files, I don’t see how we can explain the performance
> differences unless the data was already cached in the Spark test.
>
> Clearly, we’re missing something.
>
>
>
> Dudu
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Tuesday, July 12, 2016 12:16 PM
>
> *To:* user 
> *Cc:* user @spark 
> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its
> execution engine
>
>
>
> That is only a plan not what execution engine is doing.
>
>
>
> As I stated before Spark uses DAG + in-memory computing. MR is serial on
> disk.
>
>
>
> The key is the execution here or rather the execution engine.
>
>
>
> In general
>
>
>
>
> The standard MapReduce  as I know reads the data from HDFS, apply
> map-reduce algorithm and writes back to HDFS. If there are many iterations
> of map-reduce then, there will be many intermediate writes to HDFS. This is
> all serial writes to disk. Each map-reduce step is completely independent
> of other steps, and the executing engine does not have any global knowledge
> of what map-reduce steps are going to come after each map-reduce step. For
> many iterative algorithms this is inefficient as the data between each
> map-reduce pair gets written and read from the file system.
>
>
>
> The equivalent to parallelism in Big Data is deploying what is known as
> Directed Acyclic Graph (DAG
> <https://en.wikipedia.org/wiki/Directed_acyclic_graph>) algorithm. In a
> nutshell deploying DAG results in a fuller picture of global optimisation
> by deploying parallelism, pipelining consecutive map steps into one and not
> writing intermediate data to HDFS. So in short this prevents writing data
> back and forth after every reduce step which for me is a significant
> improvement, compared to the classical MapReduce algorithm.
>
>
>
> Now Tez is basically MR with DAG. With Spark you get DAG + in-memory
> computing. Think of it as a comparison between a classic RDBMS like Oracle
> and IMDB like Oracle TimesTen with in-memory processing.
>
>
>
> The outcome is that Hive using Spark as execution engine is pretty
> impressive. You have the advantage of Hive CBO + In-memory computing. If
> you use Spark for all this (say Spark SQL) but no Hive, Spark uses its own
> optimizer called Catalyst that does not have CBO yet plus in memory
> computing.
>
>
>
> As usual your mileage varies.
>
>
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 12 July 2016 at 09:33, Markovitz, Dudu  wrote:
>
> I don’t see how this explains the time differences.
>
>
>
> Dudu
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Tuesday, July 12, 2016 10:56 AM
> *To:* user 
> *Cc:* user @spark 
>
>
> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its
> execution engine
>
>
>
> This the whole idea. Spark uses DAG + IM, MR is classic
>
>
>
>
>
> This is for Hive on Spark
>
>
>
> hive> explain select max(id) from dummy_parquet;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: St

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
Sorry, I completely missed your points.

I was NOT talking about Exadata. I was comparing Oracle 12c caching with
that of Oracle TimesTen. No one mentioned Exadata here, nor storage indexes
etc.


So if Tez is not MR with DAG, could you give me an example of how it works?
No opinions, just points relevant to this: I do not know much about Tez, as I
stated before.

Case in point: if Tez could do the job on its own, why is Tez used in
conjunction with LLAP, as Martin alluded to as well in this thread?


Having said that, I would be interested if you could provide a working
example of Hive on Tez compared to Hive on MR.

One experiment is worth hundreds of opinions
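
For anyone on a distribution that already ships Tez, the switch itself is trivial (the engine name comes straight from the deprecation warning quoted earlier in this thread); the hard part, as I found, is building Tez independently of a distro:

hive> set hive.execution.engine=tez;
hive> select max(id) from dummy_parquet;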





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 July 2016 at 13:31, Jörn Franke  wrote:

>
> I think the comparison with Oracle rdbms and oracle times ten is not so
> good. There are times when the in-memory database of Oracle is slower than
> the rdbms (especially in case of Exadata) due to the issue that in-memory -
> as in Spark - means everything is in memory and everything is always
> processed (no storage indexes, no bloom filters etc) which explains this
> behavior quite well.
>
> Hence, I do not agree with the statement that tez is basically mr with dag
> (or that llap is basically in-memory which is also not correct). This is a
> wrong oversimplification and I do not think this is useful for the
> community, but better is to understand when something can be used and when
> not. In-memory is also not the solution to everything and if you look for
> example behind SAP Hana or NoSql there is much more around this, which is
> not even on the roadmap of Spark.
>
> Anyway, discovering good use case patterns should be done on standardized
> benchmarks going beyond the select count etc
>
> On 12 Jul 2016, at 11:16, Mich Talebzadeh 
> wrote:
>
> That is only a plan not what execution engine is doing.
>
> As I stated before Spark uses DAG + in-memory computing. MR is serial on
> disk.
>
> The key is the execution here or rather the execution engine.
>
> In general
>
> The standard MapReduce  as I know reads the data from HDFS, apply
> map-reduce algorithm and writes back to HDFS. If there are many iterations
> of map-reduce then, there will be many intermediate writes to HDFS. This is
> all serial writes to disk. Each map-reduce step is completely independent
> of other steps, and the executing engine does not have any global knowledge
> of what map-reduce steps are going to come after each map-reduce step. For
> many iterative algorithms this is inefficient as the data between each
> map-reduce pair gets written and read from the file system.
>
> The equivalent to parallelism in Big Data is deploying what is known as
> Directed Acyclic Graph (DAG
> <https://en.wikipedia.org/wiki/Directed_acyclic_graph>) algorithm. In a
> nutshell deploying DAG results in a fuller picture of global optimisation
> by deploying parallelism, pipelining consecutive map steps into one and not
> writing intermediate data to HDFS. So in short this prevents writing data
> back and forth after every reduce step which for me is a significant
> improvement, compared to the classical MapReduce algorithm.
>
> Now Tez is basically MR with DAG. With Spark you get DAG + in-memory
> computing. Think of it as a comparison between a classic RDBMS like Oracle
> and IMDB like Oracle TimesTen with in-memory processing.
>
> The outcome is that Hive using Spark as execution engine is pretty
> impressive. You have the advantage of Hive CBO + In-memory computing. If
> you use Spark for all this (say Spark SQL) but no Hive, Spark uses its own
> optimizer called Catalyst that does not have CBO yet plus in memory
> computing.
>
> As usual your mileage varies.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's te

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
Thanks Marcin.

What is your guesstimate of the order of "faster", please?

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 July 2016 at 14:35, Marcin Tustin  wrote:

> Quick note - my experience (no benchmarks) is that Tez without LLAP (we're
> still not on hive 2) is faster than MR by some way. I haven't dug into why
> that might be.
>
> On Tue, Jul 12, 2016 at 9:19 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> sorry I completely miss your points
>>
>> I was NOT talking about Exadata. I was comparing Oracle 12c caching with
>> that of Oracle TimesTen. no one mentioned Exadata here and neither
>> storeindex etc..
>>
>>
>> so if Tez is not MR with DAG could you give me an example of how it
>> works. No opinions but relevant to this point. I do not know much about Tez
>> as I stated it before
>>
>> Case in point if Tez could do the job on its own why Tez is used in
>> conjunction with LLAP as Martin alluded to as well in this thread.
>>
>>
>> Having said that , I would be interested if you provide a working example
>> of Hive on Tez, compared to Hive on MR.
>>
>> One experiment is worth hundreds of opinions
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 12 July 2016 at 13:31, Jörn Franke  wrote:
>>
>>>
>>> I think the comparison with Oracle rdbms and oracle times ten is not so
>>> good. There are times when the in-memory database of Oracle is slower than
>>> the rdbms (especially in case of Exadata) due to the issue that in-memory -
>>> as in Spark - means everything is in memory and everything is always
>>> processed (no storage indexes , no bloom filters etc) which explains this
>>> behavior quiet well.
>>>
>>> Hence, I do not agree with the statement that tez is basically mr with
>>> dag (or that llap is basically in-memory which is also not correct). This
>>> is a wrong oversimplification and I do not think this is useful for the
>>> community, but better is to understand when something can be used and when
>>> not. In-memory is also not the solution to everything and if you look for
>>> example behind SAP Hana or NoSql there is much more around this, which is
>>> not even on the roadmap of Spark.
>>>
>>> Anyway, discovering good use case patterns should be done on
>>> standardized benchmarks going beyond the select count etc
>>>
>>> On 12 Jul 2016, at 11:16, Mich Talebzadeh 
>>> wrote:
>>>
>>> That is only a plan not what execution engine is doing.
>>>
>>> As I stated before Spark uses DAG + in-memory computing. MR is serial on
>>> disk.
>>>
>>> The key is the execution here or rather the execution engine.
>>>
>>> In general
>>>
>>> The standard MapReduce  as I know reads the data from HDFS, apply
>>> map-reduce algorithm and writes back to HDFS. If there are many iterations
>>> of map-reduce then, there will be many intermediate writes to HDFS. This is
>>> all serial writes to disk. Each map-reduce step is completely independent
>>> of other steps, and the executing engine does not have any global knowledge
>>> of what map-reduce steps are going to come after each map-reduce step. For
>>> many iterative algorithms this is inefficient as the data between each
>>> map-reduce pair 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
I guess that is what DAG adds up to with Tez



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 July 2016 at 14:40, Marcin Tustin  wrote:

> More like 2x than 10x as I recall.
>
> On Tue, Jul 12, 2016 at 9:39 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> thanks Marcin.
>>
>> What Is your guesstimate on the order of "faster" please?
>>
>> Cheers
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 12 July 2016 at 14:35, Marcin Tustin  wrote:
>>
>>> Quick note - my experience (no benchmarks) is that Tez without LLAP
>>> (we're still not on hive 2) is faster than MR by some way. I haven't dug
>>> into why that might be.
>>>
>>> On Tue, Jul 12, 2016 at 9:19 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> sorry I completely miss your points
>>>>
>>>> I was NOT talking about Exadata. I was comparing Oracle 12c caching
>>>> with that of Oracle TimesTen. no one mentioned Exadata here and neither
>>>> storeindex etc..
>>>>
>>>>
>>>> so if Tez is not MR with DAG could you give me an example of how it
>>>> works. No opinions but relevant to this point. I do not know much about Tez
>>>> as I stated it before
>>>>
>>>> Case in point if Tez could do the job on its own why Tez is used in
>>>> conjunction with LLAP as Martin alluded to as well in this thread.
>>>>
>>>>
>>>> Having said that , I would be interested if you provide a working
>>>> example of Hive on Tez, compared to Hive on MR.
>>>>
>>>> One experiment is worth hundreds of opinions
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 12 July 2016 at 13:31, Jörn Franke  wrote:
>>>>
>>>>>
>>>>> I think the comparison with Oracle rdbms and oracle times ten is not
>>>>> so good. There are times when the in-memory database of Oracle is slower
>>>>> than the rdbms (especially in case of Exadata) due to the issue that
>>>>> in-memory - as in Spark - means everything is in memory and everything is
>>>>> always processed (no storage indexes , no bloom filters etc) which 
>>>>> explains
>>>>> this behavior quiet well.
>>>>>
>>>>> Hence, I do not agree with the statement that tez is basically mr with
>>>>> dag (or that llap is basically in-memory which is also not correct). T

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
Thanks Alan. Point taken.

In mitigation, there are members of the Spark forum who have shown interest
in using Hive directly, and I quote one:

"Did you have any benchmark for using Spark as backend engine for Hive vs
using Spark thrift server (and run spark code for hive queries)? We are
using later but it will be very useful to remove thriftserver, if we can. "

Cheers,

Mich

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 July 2016 at 15:39, Alan Gates  wrote:

>
> > On Jul 11, 2016, at 16:22, Mich Talebzadeh 
> wrote:
> >
> > 
> >   • If I add LLAP, will that be more efficient in terms of memory
> usage compared to Hive or not? Will it keep the data in memory for reuse or
> not.
> >
> Yes, this is exactly what LLAP does.  It keeps a cache of hot data (hot
> columns of hot partitions) and shares that across queries.  Unlike many MPP
> caches it will cache the same data on multiple nodes if it has more workers
> that want to access the data than can be run on a single node.
>
> As a side note, it is considered bad form in Apache to send a message to
> two lists.  It causes a lot of background noise for people on the Spark
> list who probably aren’t interested in Hive performance.
>
> Alan.
>
>
>


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-12 Thread Mich Talebzadeh
I just read further notes on LLAP.

As Gopal explained, LLAP has more to it than just in-memory, and I quote
Gopal:

"...  LLAP is designed to be hammered by multiple user sessions running
different queries, designed to automate the cache eviction & selection
process. There's no user visible explicit .cache() to remember - it's
automatic and concurrent. ..."

Sounds like what Oracle classic or SAP ASE do in terms of buffer management
strategy. As I understand it, Spark does not have this concept of a hot area
(MRU/LRU chain). It loads data into its memory if needed and gets rid of
it. If ten users read the same table, those blocks from that table will be
loaded 10 times, which is not efficient.

LLAP is more intelligent in this respect. So somehow it maintains a Most
Recently Used (MRU) / Least Recently Used (LRU) chain. It maintains this
buffer management strategy throughout the cluster. It must be using some
clever algorithm to do so.

Cheers

.



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 12 July 2016 at 15:59, Mich Talebzadeh  wrote:

> Thanks Alan. Point taken.
>
> In mitigation, here are members in Spark forum who have shown (interest)
> in using Hive directly and I quote one:
>
> "Did you have any benchmark for using Spark as backend engine for Hive vs
> using Spark thrift server (and run spark code for hive queries)? We are
> using later but it will be very useful to remove thriftserver, if we can. "
>
> Cheers,
>
> Mich
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 12 July 2016 at 15:39, Alan Gates  wrote:
>
>>
>> > On Jul 11, 2016, at 16:22, Mich Talebzadeh 
>> wrote:
>> >
>> > 
>> >   • If I add LLAP, will that be more efficient in terms of memory
>> usage compared to Hive or not? Will it keep the data in memory for reuse or
>> not.
>> >
>> Yes, this is exactly what LLAP does.  It keeps a cache of hot data (hot
>> columns of hot partitions) and shares that across queries.  Unlike many MPP
>> caches it will cache the same data on multiple nodes if it has more workers
>> that want to access the data than can be run on a single node.
>>
>> As a side note, it is considered bad form in Apache to send a message to
>> two lists.  It causes a lot of background noise for people on the Spark
>> list who probably aren’t interested in Hive performance.
>>
>> Alan.
>>
>>
>>
>


Verifying Hive execution engine used within a session

2016-07-13 Thread Mich Talebzadeh
Maybe a naive question:

Is there any parameter to display the default execution engine used by Hive,
say MR, Spark or Tez, in the current session?

Of course one can find it out by running a job in a session. However, is
there a setting like

show execution.engine

This should be dynamic, as one can switch engines with

set hive.execution.engine=tez;


Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Verifying Hive execution engine used within a session

2016-07-13 Thread Mich Talebzadeh
Nice one Shaw

hive> set hive.execution.engine;
hive.execution.engine=mr
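
For completeness, a minimal sketch of checking and then switching the engine
within one session (the output lines are illustrative; tez, and spark if Hive
on Spark is configured, are the other accepted values):

hive> set hive.execution.engine;
hive.execution.engine=mr
hive> set hive.execution.engine=tez;
hive> set hive.execution.engine;
hive.execution.engine=tez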

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 13 July 2016 at 11:09, 刘虓  wrote:

> Hi,
> try set hive.execution.engine;
>
> 2016-07-13 18:08 GMT+08:00 Mich Talebzadeh :
>
>> May be a naive question
>>
>> Is there any parameter to display the default execution engine used by
>> Hive say, MR, Spark or Tez in the current session.
>>
>> Of course one can find it out by running a job in a session. However, is
>> there such setting like
>>
>> show execution.engine
>>
>> This should be dynamic as one can switch the engines
>>
>> set hive.execution.engine=tez;
>>
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: Verifying Hive execution engine used within a session

2016-07-13 Thread Mich Talebzadeh
To unsubscribe, please send a brief message to user-unsubscr...@hive.apache.org
as described here <https://hive.apache.org/mailing_lists.html>

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 13 July 2016 at 11:31, Robin Jain  wrote:

> Unsubscribe
>
> Sincerely,
>
> Robin Jain
>
> On Jul 13, 2016, at 3:22 AM, Mich Talebzadeh 
> wrote:
>
> Nice one Shaw
>
> hive> set hive.execution.engine;
> hive.execution.engine=mr
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 13 July 2016 at 11:09, 刘虓  wrote:
>
>> Hi,
>> try set hive.execution.engine;
>>
>> 2016-07-13 18:08 GMT+08:00 Mich Talebzadeh :
>>
>>> May be a naive question
>>>
>>> Is there any parameter to display the default execution engine used by
>>> Hive say, MR, Spark or Tez in the current session.
>>>
>>> Of course one can find it out by running a job in a session. However, is
>>> there such setting like
>>>
>>> show execution.engine
>>>
>>> This should be dynamic as one can switch the engines
>>>
>>> set hive.execution.engine=tez;
>>>
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>
>


Re: can't start up hive 2.1 hiveserver2/metastore services

2016-07-13 Thread Mich Talebzadeh
Can hardly read the image :)

Did you start the metastore before?

$HIVE_HOME/bin/hive --service metastore &

Assuming it is running on the default port 9083, do you see the process?

netstat -alnp|egrep 'Local|9083'


and then the same for Hive thrift server

$HIVE_HOME/bin/hiveserver2 &

By default that runs on port 10000


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 13 July 2016 at 17:33, Qiuzhuang Lian  wrote:

> We download hive 2.1 and run into errors when starting to
> metastore/hiveserver2 services, here is the errors,
>
> Any clues?
>
> Regards,
> Qiuzhuang
>
> [image: Inline image 1]
>
>
>


Re: Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-13 Thread Mich Talebzadeh
Hi Wenli,

You mentioned:

Coming to HoS, I think the main problem now is many optimization should be
done , but seems no progress.  Like conditional task , union sql cann’t
convert to mapjoin(hive-9044)   etc, so many optimize feature is pending,
no one working on them.



Is this issue specific to Hive on Spark, or does it apply equally to Hive on
MapReduce as well? In other words, is it a general issue with the Hive
optimizer, as in the case of HIVE-9044?


Thanks





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 July 2016 at 01:56, Wangwenli  wrote:

> Seems LLAP like tachyon,  which purpose is also cache data between
> applications.
>
>
>
> Coming to HoS, I think the main problem now is many optimization should be
> done , but seems no progress.  Like conditional task , union sql cann’t
> convert to mapjoin(hive-9044)   etc, so many optimize feature is pending,
> no one working on them.
>
>
>
> On contrast, sparksql is improve  very fast
>
>
>
> Regards
>
> wenli
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* 13 July 2016 7:21
> *To:* user
> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its execution
> engine
>
>
>
> I just read further notes on LLAP.
>
>
>
> As Gopal explained LLAP has more to do that just in-memory and I quote
> Gopal:
>
>
>
> "...  LLAP is designed to be hammered by multiple user sessions running
> different queries, designed to automate the cache eviction & selection
> process. There's no user visible explicit .cache() to remember - it's
> automatic and concurrent. ..."
>
>
>
> Sounds like what Oracle classic or SAP ASE do in terms of buffer
> management strategy. As I understand Spark does not have this concept of
> hot area (MRU/LRU chain). It loads data into its memory if needed and gets
> rid of it. if ten users read the same table those blocks from that table
> will be loaded 10 times which is not efficient.
>
>
>
>  LLAP is more intelligent in this respect. So somehow it maintains a Most
> Recently Used (MRU), Least Recently Used (LRU) chain. It maintains this
> buffer management strategy throughout the cluster. It must be using some
> clever algorithm to do so.
>
>
>
> Cheers
>
>
>
> .
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 12 July 2016 at 15:59, Mich Talebzadeh 
> wrote:
>
> Thanks Alan. Point taken.
>
>
>
> In mitigation, here are members in Spark forum who have shown (interest)
> in using Hive directly and I quote one:
>
>
>
> "Did you have any benchmark for using Spark as backend engine for Hive vs
> using Spark thrift server (and run spark code for hive queries)? We are
> using later but it will be very useful to remove thriftserver, if we can. "
>
>
>
> Cheers,
>
>
>
> Mich
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 12 July 2016 at 15:39, Alan Gates  wrote:
>
>
> > On Jul 11, 2016, at 16

Re: can't start up hive 2.1 hiveserver2/metastore services

2016-07-13 Thread Mich Talebzadeh
Hi Qiuzhuang,

hive is a bash script.

try running it with

sh -x $HIVE_HOME/bin/hive --service metastore


To see where it is failing.

In general I source the environment file before running the query. I am on
Hive 2 and it works. Have not tried Hive 2.1

#!/bin/ksh
# Source the environment file that sets HIVE_HOME, LOGDIR etc.
ENVFILE=/home/hduser/dba/bin/environment.ksh
if [[ -f $ENVFILE ]]
then
        . $ENVFILE
else
        echo "Abort: $0 failed. No environment file ( $ENVFILE ) found"
        exit 1
fi
# Start a fresh log file for this run
FILE_NAME=`basename $0 .ksh`
LOG_FILE=${LOGDIR}/${FILE_NAME}.log
[ -f ${LOG_FILE} ] && rm -f ${LOG_FILE}
echo `date` " ""=== Starting hiveserver metastore ===" >> ${LOG_FILE}
# Start the metastore in the background and confirm it is listening on 9083
$HIVE_HOME/bin/hive --service metastore &
netstat -alnp|egrep 'Local|9083'
echo `date` " ""=== Started hiveserver2  metastore ===" >> ${LOG_FILE}
exit

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 July 2016 at 03:01, Qiuzhuang Lian  wrote:

> Hi Mich,
>
>
> I have to use this command:
>
> bin/hive --service metastore
>
> then it works.
>
> While setting HIVE_HOME and add hive bin path, then issue command as
> follows,
>
> hive --service metastore
>
> it would breaks.
>
> I think shell under bin should be improved to address this?
>
> Regards,
> qiuzhuang
>
>
>
> On Thu, Jul 14, 2016 at 12:39 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Can hardly read the image :)
>>
>> Did you start the metastore before?
>>
>> $HIVE_HOME/bin/hive --service metastore &
>>
>> Assuming it is running on default port 9083 do you see  the process
>>
>> netstat -alnp|egrep 'Local|9083'
>>
>>
>> and then the same for Hive thrift server
>>
>> $HIVE_HOME/bin/hiveserver2 &
>>
>> By default that runs on port 10000
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 13 July 2016 at 17:33, Qiuzhuang Lian 
>> wrote:
>>
>>> We download hive 2.1 and run into errors when starting to
>>> metastore/hiveserver2 services, here is the errors,
>>>
>>> Any clues?
>>>
>>> Regards,
>>> Qiuzhuang
>>>
>>> [image: Inline image 1]
>>>
>>>
>>>
>>
>


Re: Re: Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-14 Thread Mich Talebzadeh
Which version of Hive and Spark, please?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 July 2016 at 07:35, Wangwenli  wrote:

> It is specific to HoS
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* 14 July 2016 11:55
> *To:* user
> *Subject:* Re: Re: Using Spark on Hive with Hive also using Spark as its
> execution engine
>
>
>
> Hi Wenli,
>
>
>
> You mentioned:
>
>
>
> Coming to HoS, I think the main problem now is many optimization should be
> done , but seems no progress.  Like conditional task , union sql cann’t
> convert to mapjoin(hive-9044)   etc, so many optimize feature is pending,
> no one working on them.
>
>
>
> Is this issue specific to Hive on Spark or they apply equally to Hive on
> MapReduce as well. In other words a general issue with Hive optimizer  case
> hive-9044?
>
>
>
> Thanks
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 14 July 2016 at 01:56, Wangwenli  wrote:
>
> Seems LLAP like tachyon,  which purpose is also cache data between
> applications.
>
>
>
> Coming to HoS, I think the main problem now is many optimization should be
> done , but seems no progress.  Like conditional task , union sql cann’t
> convert to mapjoin(hive-9044)   etc, so many optimize feature is pending,
> no one working on them.
>
>
>
> On contrast, sparksql is improve  very fast
>
>
>
> Regards
>
> wenli
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* 13 July 2016 7:21
> *To:* user
> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its execution
> engine
>
>
>
> I just read further notes on LLAP.
>
>
>
> As Gopal explained LLAP has more to do that just in-memory and I quote
> Gopal:
>
>
>
> "...  LLAP is designed to be hammered by multiple user sessions running
> different queries, designed to automate the cache eviction & selection
> process. There's no user visible explicit .cache() to remember - it's
> automatic and concurrent. ..."
>
>
>
> Sounds like what Oracle classic or SAP ASE do in terms of buffer
> management strategy. As I understand Spark does not have this concept of
> hot area (MRU/LRU chain). It loads data into its memory if needed and gets
> rid of it. if ten users read the same table those blocks from that table
> will be loaded 10 times which is not efficient.
>
>
>
>  LLAP is more intelligent in this respect. So somehow it maintains a Most
> Recently Used (MRU), Least Recently Used (LRU) chain. It maintains this
> buffer management strategy throughout the cluster. It must be using some
> clever algorithm to do so.
>
>
>
> Cheers
>
>
>
> .
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 12 July 2016 at 15:59, Mich Talebzadeh 
> wrote:
>
> Thanks Alan. Point taken.
>
>
>
> In mitigation, here are members in Sp

Re: Re: Re: Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-14 Thread Mich Talebzadeh
Fine. Which version of Spark are you using for the Hive execution/query
engine, please?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 July 2016 at 08:05, Wangwenli  wrote:

> I using 1.x latest,but ,I checked  the master branch(2.x),  the
> latest code,  no update.
>
>
>
> Regards
>
> Wenli
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* 14 July 2016 15:02
> *To:* user
> *Subject:* Re: Re: Re: Using Spark on Hive with Hive also using Spark as its
> execution engine
>
>
>
> Which version of Hive and Spark, please?
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 14 July 2016 at 07:35, Wangwenli  wrote:
>
> It is specific to HoS
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* 14 July 2016 11:55
> *To:* user
> *Subject:* Re: Re: Using Spark on Hive with Hive also using Spark as its
> execution engine
>
>
>
> Hi Wenli,
>
>
>
> You mentioned:
>
>
>
> Coming to HoS, I think the main problem now is many optimization should be
> done , but seems no progress.  Like conditional task , union sql cann’t
> convert to mapjoin(hive-9044)   etc, so many optimize feature is pending,
> no one working on them.
>
>
>
> Is this issue specific to Hive on Spark or they apply equally to Hive on
> MapReduce as well. In other words a general issue with Hive optimizer  case
> hive-9044?
>
>
>
> Thanks
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 14 July 2016 at 01:56, Wangwenli  wrote:
>
> Seems LLAP like tachyon,  which purpose is also cache data between
> applications.
>
>
>
> Coming to HoS, I think the main problem now is many optimization should be
> done , but seems no progress.  Like conditional task , union sql cann’t
> convert to mapjoin(hive-9044)   etc, so many optimize feature is pending,
> no one working on them.
>
>
>
> On contrast, sparksql is improve  very fast
>
>
>
> Regards
>
> wenli
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* 13 July 2016 7:21
> *To:* user
> *Subject:* Re: Using Spark on Hive with Hive also using Spark as its execution
> engine
>
>
>
> I just read further notes on LLAP.
>
>
>
> As Gopal explained LLAP has more to do that just in-memory and I quote
> Gopal:
>
>
>
> "...  LLAP is designed to be hammered by multiple user sessions running
> different queries, designed to automate the cache eviction & selection
> process. There's no user visible explicit .cache() to remember - it's
> automatic and concurrent. ..."
>
>
>
> Sounds like what Oracle classic or SAP ASE do in terms of buffer
> management strategy. As I understand Spark does not have this concept of
> hot area (MRU/LRU chain). It loads data into its memory if needed and gets
> rid of it. if ten users read the same table those blocks from that table
> will be loaded 

A dedicated Web UI interface for Hive

2016-07-14 Thread Mich Talebzadeh
Hi Gopal,

If I recall correctly, you were working on UI support for Hive. Currently the
one available is the standard Hadoop one on port 8088.

Do you have any timelines for which release of Hive is going to have this
facility?

Thanks,


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: A dedicated Web UI interface for Hive

2016-07-15 Thread Mich Talebzadeh
Hi Marcin,

Which two web interfaces are these? I know the usual one on 8088; is there
any other one?

I want something in line with what Spark provides. I thought Gopal had got
something:

[image: Inline images 1]


Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 July 2016 at 23:29, Marcin Tustin  wrote:

> What do you want it to do? There are at least two web interfaces I can
> think of.
>
> On Thu, Jul 14, 2016 at 6:04 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Gopal,
>>
>> If I recall you were working on a UI support for Hive. Currently the one
>> available is the standard Hadoop one on port 8088.
>>
>> Do you have any timelines which release of Hive is going to have this
>> facility?
>>
>> Thanks,
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>
> Want to work at Handy? Check out our culture deck and open roles
> <http://www.handy.com/careers>
> Latest news <http://www.handy.com/press> at Handy
> Handy just raised $50m
> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
>  led
> by Fidelity
>
>


Re: A dedicated Web UI interface for Hive

2016-07-15 Thread Mich Talebzadeh
Hi Marcin,

For Hive on Spark I can use the Spark 1.3.1 UI, which does not have the DAG
diagram (later versions like 1.6.1 have it). But yes, you are correct.

However, I was certain that Gopal was working on a UI interface, if my
memory serves me right.

Cheers,

Mich



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 July 2016 at 16:08, Marcin Tustin  wrote:

> I was thinking of query and admin interfaces.
>
> There's ambari, which has plugins for introspecting what's up with tez
> sessions. I can't use those because I don't use the yarn history server (I
> find it very flaky).
>
> There's also hue, which is a query interface.
>
> If you're running on spark as the execution engine, can you not use the
> spark UI for those applications to see what's up with hive?
>
> On Fri, Jul 15, 2016 at 3:19 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Marcin,
>>
>> Which two web interfaces are these. I know the usual one on 8088 any
>> other one?
>>
>> I want something in line with what Spark provides. I thought Gopal has
>> got something:
>>
>> [image: Inline images 1]
>>
>>
>> Cheers
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 14 July 2016 at 23:29, Marcin Tustin  wrote:
>>
>>> What do you want it to do? There are at least two web interfaces I can
>>> think of.
>>>
>>> On Thu, Jul 14, 2016 at 6:04 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi Gopal,
>>>>
>>>> If I recall you were working on a UI support for Hive. Currently the
>>>> one available is the standard Hadoop one on port 8088.
>>>>
>>>> Do you have any timelines which release of Hive is going to have this
>>>> facility?
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>
>>>
>>> Want to work at Handy? Check out our culture deck and open roles
>>> <http://www.handy.com/careers>
>>> Latest news <http://www.handy.com/press> at Handy
>>> Handy just raised $50m
>>> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
>>>  led
>>> by Fidelity
>>>
>>>
>>
>
> Want to work at Handy? Check out our culture deck and open roles
> <http://www.handy.com/careers>
> Latest news <http://www.handy.com/press> at Handy
> Handy just raised $50m
> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
>  led
> by Fidelity
>
>


Re: Hive on TEZ + LLAP

2016-07-16 Thread Mich Talebzadeh
Hi,

This is interesting. Are there any recent presentations of Hive on Tez and
Hive on Tez with LLAP?

Also, have there been any simple benchmarks to compare:


   1. Hive on MR
   2. Hive on Tez
   3. Hive on Tez with LLAP

It would be interesting to see how these three fare (a rough way to run such a
comparison is sketched below).
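
A minimal sketch, assuming a table such as store_sales is already loaded and
that Tez and LLAP are configured (the query is only a placeholder; the LLAP
mode settings are the ones Gopal uses later in this thread):

-- 1. Hive on MR (note the elapsed time Hive reports for each run)
set hive.execution.engine=mr;
select count(*) from store_sales;

-- 2. Hive on Tez
set hive.execution.engine=tez;
set hive.llap.execution.mode=none;
select count(*) from store_sales;

-- 3. Hive on Tez with LLAP
set hive.llap.execution.mode=all;
select count(*) from store_sales;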

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 16 July 2016 at 00:06, Gopal Vijayaraghavan  wrote:

>
> > I have also heard about Hortonworks with Tez + LLAP but that is a distro?
>
> Yes. AFAIK, during Hadoop Summit there was a HDP 2.5 techpreview sandbox
> instance which shipped Hive2 (scroll down all the way to end in the
> downloads page).
>
> Enable the "interactive mode" in Ambari for a HiveServer2 config group &
> HiveServer2 switches over to LLAP.
>
> Though if you're interested in measuring performance, I debate the
> usefulness of an in-memory buffer-cache for a 1-node & cpu/memory
> constrained VM.
>
> > Is it a complicated work to build it with Do It Yourself so to speak?
>
> Complicated enough that I have automated it (at least for myself & most of
> the devs).
>
> https://github.com/t3rmin4t0r/tez-autobuild/blob/llap/README.md
>
> That setup should work as long as you have a base Apache compatible
> hadoop-2.7.1 install.
>
> Because the way to deploy LLAP is a "yarn jar" & then have YARN run the
> instances, no part of the actual deploy requires root on any worker node.
>
> All you need is access to the metastore db (new features in the metastore)
> and a single Zk ensemble to register LLAP onto.
>
> That makes it really easy to "drop into" an existing YARN cluster where
> you're not an admin, but the LLAP install is then tied to a single user
> (you).
>
> That's set up a bit unconventionally since LLAP was never meant to hijack
> a user like this and allow access from the CLI.
>
> The real reason for that is so that I can do hive --debug and debug the
> CLI from remote much more easily than HiveServer2's massive number of
> threads.
>
> I did put up a demo GIF earlier during the Summit, which should give you
> an idea of how fast/slow LLAP is with S3 data (which is when the
> read-through cache really comes into the limelight).
>
> <https://twitter.com/t3rmin4t0r/status/748630764959338497/photo/1>
>
>
> Cheers,
> Gopal
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


Re: Hive External Storage Handlers

2016-07-18 Thread Mich Talebzadeh
Hi,

You can move up to Hive 2, which works fine and is pretty stable. You can opt
for Hive 1.2.1 if you wish.

If you want to use Spark (the replacement for Shark) as the execution
engine for Hive, then the version that I have managed to make work with
Hive is Spark 1.3.1, which you will need to build from source.

It works and it is stable.

Otherwise you may decide to use the Spark Thrift Server (STS), which allows
JDBC access to Spark SQL (through beeline, Squirrel, Zeppelin) and has a Hive
SQL context built into it, as if you were using the Hive Thrift Server (HSS)

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 18 July 2016 at 21:38, Lavelle, Shawn  wrote:

> Hello,
>
>
>
> I am working with an external storage handler written for Hive 0.11
> and run on a Shark execution engine.  I’d like to move forward and upgrade
> to hive 1.2.1 on spark 1.6 or even 2.0.
>
>This storage has a need to run queries across tables existing in
> different databases in the external data store, so existing drivers that
> map hive to external storage in 1 to 1 mappings are insufficient. I have
> attempted this upgrade already, but found out that predicate pushdown was
> not occurring.  Was this changed in 1.2?
>
>Can I update and use the same storage handler in Hive or has this
> concept been replaced by the RDDs and DataFrame API?
>
>
>Are these questions better for the Spark list?
>
>
>
>Thank you,
>
>
>
> ~ Shawn M Lavelle
>
>
>
>
>
>
> Shawn Lavelle
> Software Development
>
> 4101 Arrowhead Drive
> Medina, Minnesota 55340-9457
> Phone: 763 551 0559
> Fax: 763 551 0750
> *Email:* shawn.lave...@osii.com
> *Website: **www.osii.com* <http://www.osii.com>
>


Re: ORC does not support type conversion from INT to STRING.

2016-07-18 Thread Mich Talebzadeh
Hi Matthew,

In layman's terms: if I create the source ORC table column as INT, then
create a target ORC table in which that column is defined as STRING, and do
an INSERT/SELECT from the source table, how is the data stored internally?

Is it implicitly converted into the new format using the CAST function, or is
it stored as-is and just masked?

The version of Hive I am using is 2, and it works OK for primitive data
types (INSERT/SELECT from INT to STRING); see the sketch below.
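
A minimal sketch of the primitive-type test I mean (table and column names are
made up for illustration):

-- source ORC table with an INT column
CREATE TABLE src_orc (id INT, payload STRING) STORED AS ORC;
INSERT INTO TABLE src_orc VALUES (1, 'a'), (2, 'b');

-- target ORC table where the same column is declared as STRING
CREATE TABLE tgt_orc (id STRING, payload STRING) STORED AS ORC;

-- the INSERT/SELECT in question works without an explicit CAST
INSERT INTO TABLE tgt_orc SELECT id, payload FROM src_orc;
SELECT * FROM tgt_orc;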

However, I believe Mahender is referring to Complex types?

Thanks



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 18 July 2016 at 22:31, Matthew McCline  wrote:

>
> Hi Mahender,
>
>
> Schema Evolution is available on the latest recent version of Hive.
>
>
> For example, if you set
> hive.metastore.disallow.incompatible.col.type.changes=false;​ on master
> (i.e. hive2) it will support INT to STRING conversion.
>
>
> If you need to remain on an older version, then you are out of luck.
>
>
> Thanks,
>
> Matt
>
>
> --
> *From:* Mahender Sarangam 
> *Sent:* Monday, July 18, 2016 1:59 PM
> *To:* user@hive.apache.org
> *Subject:* Re: ORC does not support type conversion from INT to STRING.
>
>
> Hi Mich,
>
> Sorry for delay in responding. here is the scenario,
>
> We have created new cluster  and we have moved all ORC File data into new
> cluster. We have re-created table pointing to ORC location. We have
> modified data type of ORC table from *INT *to *String.* From then onward,
> we were unable to fire select statement against this ORC table, hive keep
> throwing exception, "Orc table select. Unable to convert Int to String".
> Looks like it is bug in ORC table only. Where in we modify the datatype
> from *int to string,* is causing problem with ORC reading/select
> statement, it throws exceptio. Please let me know if there are any
> workaround for this scenario. Is this behavior expected previously also.
>
>
> */Mahender*
>
>
>
>
>
>
> On 6/14/2016 11:47 AM, Mich Talebzadeh wrote:
>
> you must excuse my ignorance
>
> can you please elaborate on this as there seems something has gone wrong
> somewhere?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 19:42, Mahender Sarangam 
> wrote:
>
>> Yes Mich. We have restored cluster from metastore.
>>
>> On 6/14/2016 11:35 AM, Mich Talebzadeh wrote:
>>
>> Hi Mahendar,
>>
>>
>> Did you load the meta-data DB/schema from backup and now seeing this error
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 19:04, Mahender Sarangam > > wrote:
>>
>>> ping.
>>>
>>> On 6/13/2016 1:19 PM, Mahender Sarangam wrote:
>>>
>>> Hi,
>>>
>>> We are facing issue while reading data from ORC table. We have created
>>> ORC table and dumped data into it. We have deleted cluster due to some
>>> reason. When we recreated cluster (using Metastore) and table pointing to
>>> same location. When we perform reading from ORC table. We see below error.
>>>
>>> SELECT col2, Col1,
>>>   reflect("java.util.UUID", "randomUUID") AS ID,
>>>   Source,
>>>  1 ,
>>> SDate,
>>> EDate
>>> FROM Table ORC  JOIN Table2 _surr;
>>>
>>> ERROR : Vertex failed, vertexName=Map 1,
>>> vertexId=vertex_1465411930667_0212_1_01, diagnostics=[Task failed,
>>> taskId=task_1465411930667_0212_1_01_00, diagnostics=[TaskAttempt 0
>>> failed, info=[Error: Failure while running task:java.lang.RuntimeException:
>>> java.lang.RuntimeException: java.io.IOException: java.io.IOException: ORC
>>> does not support type conversion from INT to STRING.
>>>
>>>
>>> I think issue is reflect("java.util.UUID", "randomUUID") AS ID
>>>
>>>
>>> I know there is Bug raised while reading data from ORC table. Is there
>>> any workaround apart from reloading data.
>>>
>>> -MS
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>


Re: Hive on TEZ + LLAP

2016-07-18 Thread Mich Talebzadeh
These look pretty impressive. What execution mode were you running these in?
YARN client maybe?

Query                          MR/sec     Tez/sec    Tez+LLAP/sec
                              203.317      13.681           3.809
Order of magnitude faster         ---    15 times        53 times


My calculations on Hive 2 on Spark 1.3.1 (obviously we are comparing
different bases, but it is interesting as a sample) reflect the following:

Table       MR/sec   Spark/sec   Order of magnitude faster
Parquet    239.532       14.38   16 times
ORC        202.333       17.77   11 times

So the hybrid engine seems to make a big difference: if I just compare Tez
only with Tez + LLAP, the gain is more than 3 times.

Cheers,


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 18 July 2016 at 23:53, Gopal Vijayaraghavan  wrote:

>
> > Also has there been simple benchmarks to compare:
> >
> > 1. Hive on MR
> > 2. Hine on Tez
> > 3. Hive on Tez with LLAP
>
> I ran one today, with a small BI query in my test suite against a 1Tb
> data-set.
>
> TL;DR - MRv2 (203.317 seconds), Tez (13.681s), LLAP (3.809s).
>
> *Warning*: This is not a historical view, all engines are using the same
> new & improved vectorized operators from 2.2.0-SNAPSHOT, only the physical
> planner and the physical scheduling is different between runs.
>
> The difference between pre-Stinger, Stinger and Stinger.next is much much
> larger than this.
>
> <https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query55.sql>
>
>
> select  i_brand_id brand_id, i_brand brand,
> sum(ss_ext_sales_price) ext_price
>  from date_dim, store_sales, item
>  where date_dim.d_date_sk = store_sales.ss_sold_date_sk
> and store_sales.ss_item_sk = item.i_item_sk
> and i_manager_id=36
> and d_moy=12
> and d_year=2001
>  group by i_brand, i_brand_id
>  order by ext_price desc, i_brand_id
> limit 100 ;
>
>
> =MRv2==
>
>
> set hive.execution.engine=mr;
>
> ...
> 2016-07-18 22:22:57 Uploaded 1 File to:
> file:/tmp/gopal/b58a60d6-ff05-47bc-ad02-428aaa15779d/hive_2016-07-18_22-22-
> 43_389_3112118969207749230-1/-local-10007/HashTable-Stage-3/MapJoin-mapfile
> 131--.hashtable (914 bytes)
>
> 2016-07-18 22:22:57 End of local task; Time Taken: 2.47 sec.
> ...
> Time taken: 203.317 seconds, Fetched: 100 row(s)
>
> =Tez===
>
>
>
> set hive.execution.engine=tez;
> set hive.llap.execution.mode=none;
>
> Time taken: 13.681 seconds, Fetched: 100 row(s)
>
> =LLAP==
>
>
> set hive.llap.execution.mode=all;
>
>
>
> Task Execution Summary
> ---
> ---
>   VERTICES     DURATION(ms)  CPU_TIME(ms)  GC_TIME(ms)  INPUT_RECORDS  OUTPUT_RECORDS
> ---
> ---
>  Map 1              1016.00             0            0     93,123,704           9,048
>  Map 4                 0.00             0            0         10,000              31
>  Map 5                 0.00             0            0        296,344           2,675
>  Reducer 2            207.00            0            0          9,048             100
>  Reducer 3              0.00            0            0            100               0
> ---
> ---
>
>
> Query Execution Summary
> ---
> ---
> OPERATION                 DURATION
> ---
> ---
> Compile Query                1.64s
> Prepare Plan                 0.32s
> Submit Plan                  0.57s
> Start DAG                    0.21s
> Run DAG                      1.02s
> ---

Re: Hive External Storage Handlers

2016-07-18 Thread Mich Talebzadeh
"So not use a self-compiled hive or Spark version, but only the ones
supplied by distributions (cloudera, Hortonworks, Bigtop...) You will face
performance problems, strange errors etc when building and testing your
code using self-compiled versions."

This comment does not make sense and is meaningless without any evidence.
Either provide evidence that you have done this work and encountered such
errors, or better not to mention it. It sounds like scaremongering.








Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 19 July 2016 at 06:51, Jörn Franke  wrote:

> So do not use a self-compiled Hive or Spark version, but only the ones
> supplied by distributions (Cloudera, Hortonworks, Bigtop...). You will face
> performance problems, strange errors etc. when building and testing your
> code using self-compiled versions.
>
> If you use the Hive APIs then the engine should not be relevant for your
> storage handler. Nevertheless, the APIs of the storage handler might have
> changed.
>
> However, I wonder why a 1-1 mapping does not work for you.
>
> On 18 Jul 2016, at 22:46, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> You can move up to Hive 2, which works fine and is pretty stable. You can opt
> for Hive 1.2.1 if you wish.
>
> If you want to use Spark (the replacement for Shark) as the execution
> engine for Hive, then the version that I have managed to make work with
> Hive is Spark 1.3.1, which you will need to build from source.
>
> It works and it is stable.
>
> Otherwise you may decide to use Spark Thrift Server (STS) that allows JDBC
> access to Spark SQL (through beeline, Squirrel , Zeppelin) that has Hive
> SQL context built into it as if you were using Hive Thrift Server (HSS)
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 18 July 2016 at 21:38, Lavelle, Shawn  wrote:
>
>> Hello,
>>
>>
>>
>> I am working with an external storage handler written for Hive 0.11
>> and run on a Shark execution engine.  I’d like to move forward and upgrade
>> to hive 1.2.1 on spark 1.6 or even 2.0.
>>
>>This storage has a need to run queries across tables existing in
>> different databases in the external data store, so existing drivers that
>> map hive to external storage in 1 to 1 mappings are insufficient. I have
>> attempted this upgrade already, but found out that predicate pushdown was
>> not occurring.  Was this changed in 1.2?
>>
>>Can I update and use the same storage handler in Hive or has this
>> concept been replaced by the RDDs and DataFrame API?
>>
>>
>>Are these questions better for the Spark list?
>>
>>
>>
>>Thank you,
>>
>>
>>
>> ~ Shawn M Lavelle
>>
>>
>>
>>
>> 
>>
>> Shawn Lavelle
>> Software Development
>>
>> 4101 Arrowhead Drive
>> Medina, Minnesota 55340-9457
>> Phone: 763 551 0559
>> Fax: 763 551 0750
>> *Email:* shawn.lave...@osii.com
>> *Website: **www.osii.com* <http://www.osii.com>
>>
>
>


Re: hive external table on gzip

2016-07-19 Thread Mich Talebzadeh
Pretty simple:

--1 Move the gz file or files into HDFS. Multiple files can go into the staging
directory: hdfs dfs -copyFromLocal /*.gz hdfs://rhes564:9000/data/stg/
--2 Create an external table; just one will do: CREATE EXTERNAL TABLE stg_t2
... STORED AS TEXTFILE LOCATION '/data/stg/'
--3 Create the internal Hive table: CREATE TABLE t2 ( ... ) STORED AS ORC
TBLPROPERTIES ( "orc.compress"="SNAPPY" )
--4 Insert the data from the external table into the Hive table: INSERT INTO
TABLE t2 SELECT ... FROM stg_t2
--5 Remove the gz files if needed once processed: hdfs dfs -rm
hdfs://rhes564:9000/data/stg/*.gz
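
Putting the steps together, a minimal sketch might look like this (the column
list, table names and paths are only illustrative; the dfs commands are run
from the Hive CLI here but can equally be run from the OS shell as hdfs dfs):

-- 1. stage the gzipped files in HDFS
dfs -copyFromLocal /tmp/*.gz hdfs://rhes564:9000/data/stg/;

-- 2. external table over the staging directory (Hive reads .gz text files transparently)
CREATE EXTERNAL TABLE stg_t2 (col1 STRING, col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/stg/';

-- 3. internal ORC table with SNAPPY compression
CREATE TABLE t2 (col1 STRING, col2 INT)
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY");

-- 4. load from the external table into the ORC table
INSERT INTO TABLE t2 SELECT col1, col2 FROM stg_t2;

-- 5. clean up the staged gz files once processed
dfs -rm hdfs://rhes564:9000/data/stg/*.gz;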

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 19 July 2016 at 12:03, Amatucci, Mario, Vodafone Group <
mario.amatu...@vodafone.com> wrote:

>
>
> Hi, I have huge gzip files on HDFS and I'd like to create an external table
> on top of them.
>
> Any code example? Cheers
>
> Ps
>
> I cannot use snappy or lzo for some constraints
>
>
>
> --
>
> Kind regards
>
> Mario Amatucci
> CG TB PS GDC PRAGUE THINK BIG
>
>
>


Re: Hive on TEZ + LLAP

2016-07-19 Thread Mich Talebzadeh
Thanks

In this sample query

select  i_brand_id brand_id, i_brand brand,
sum(ss_ext_sales_price) ext_price
 from date_dim, store_sales, item
 where date_dim.d_date_sk = store_sales.ss_sold_date_sk
and store_sales.ss_item_sk = item.i_item_sk
and i_manager_id=36
and d_moy=12
and d_year=2001
 group by i_brand, i_brand_id
 order by ext_price desc, i_brand_id
limit 100 ;

What was the type (Parquet, text, ORC etc.) and the row count for each of
the three tables above?

thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 19 July 2016 at 02:17, Gopal Vijayaraghavan  wrote:

>
> > These looks pretty impressive. What execution mode were you running
> >these? Yarn client may be?
>
> There is no other mode - everything runs on YARN.
>
> > 53 times
>
>
> The factor is actually bigger in actual execution.
>
> The MRv2 version takes 2.47s to prep a query, while the LLAP version takes
> 1.64s.
>
> The MRv2 version takes 200.319s to execute the query, while the LLAP
> version takes 1.02s.
>
> The execution factor is nearly ~200x, but the compile becomes significant
> as you scale down the latencies.
>
> > My calculations on Hive 2 on Spark 1.3.1
>
> Not sure where Hive2-on-Spark is going - the last commit to SparkCompiler
> was late last year, before there was a Hive2.
>
> On the speed front, I'm pretty sure you have got most of the Hive2
> optimizations disabled, even the most basic of the Stinger optimizations
> might be missing for you.
>
> Check if you have
>
> set hive.vectorized.execution.enabled=true;
>
>
> Some of these new optimizations don't work on H-o-S, because Hive-on-Spark
> does not implement a true broadcast join - instead it uses a
> SparkHashTableSinkOperator, which actually writes to HDFS instead of sending
> it directly to the downstream task.
>
>
> I don't understand why that is the case instead of RDD brodcast, but that
> prevents the JOIN optimizations which convert the 34 sec query into a 3.8
> sec query from applying to Spark execution.
>
> A couple of examples would be
>
> set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
> set hive.vectorized.execution.mapjoin.minmax.enabled=true;
>
> Those two make easy work of joins in LLAP, particularly semi-joins which
> are common in BI queries.
>
>
> Once LLAP is out of tech preview, we can enable most of them by default
> for Tez+LLAP, but that would not mean all of it applies to
> Hive-on-(Spark/MR).
>
> Getting these new features onto another engine takes active effort from
> the engine's devs.
>
> Cheers,
> Gopal
>
>
>
>
>
>
>
>
>
>
>


Re: Hive on TEZ + LLAP

2016-07-19 Thread Mich Talebzadeh
Sounds like, if I am correct, joining a fact table (store_sales) with two
dimensions?

cool

thanks



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 19 July 2016 at 18:31, Gopal Vijayaraghavan  wrote:

> > What was the type (Parquet, text, ORC etc) and row count for each three
> >tables above?
>
> I always use ORC for flat columnar data.
>
> ORC is designed to be ideal if you have measures/dimensions normalized into
> tables - most SQL workloads don't start with an indefinite-depth tree.
>
> hive> select count(1) from store_sales;
> OK
> 2879987999
> Time taken: 2.603 seconds, Fetched: 1 row(s)
> hive> select count(1) from store;
> OK
> 1002
> Time taken: 0.213 seconds, Fetched: 1 row(s)
> hive> select count(1) from date_dim;
> OK
> 73049
> Time taken: 0.186 seconds, Fetched: 1 row(s)
> hive>
>
> The DPP semi-join for date_dim is very fast, so out of the ~2.8 billion
> records only 93 million are read into the cache.
>
> Standard TPC-DS data-set at 1000 scale - same layout you can get from
> hive-testbench && ./tpcds-setup.sh 1000;
>
> Cheers,
> Gopal
>
>
>


Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-19 Thread Mich Talebzadeh
Hi all,

This will be in London tomorrow, Wednesday 20th July, starting at 18:00 for
refreshments with kick-off at 18:30, a 5-minute walk from Canary Wharf
Station (Jubilee Line).

If you wish you can register and get more info here
<http://www.meetup.com/futureofdata-london/>

It will be at La Tasca, West India Docks Road, E14
<http://www.meetup.com/futureofdata-london/events/232423292/>

and especially if you like Spanish food :)

Regards,




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 15 July 2016 at 11:06, Joaquin Alzola  wrote:

> It is on the 20th (Wednesday) next week.
>
>
>
> *From:* Marco Mistroni [mailto:mmistr...@gmail.com]
> *Sent:* 15 July 2016 11:04
> *To:* Mich Talebzadeh 
> *Cc:* user @spark ; user 
> *Subject:* Re: Presentation in London: Running Spark on Hive or Hive on
> Spark
>
>
>
> Dr Mich
>
>   do you have any slides or videos available for the presentation you did
> @Canary Wharf?
>
> kindest regards
>
>  marco
>
>
>
> On Wed, Jul 6, 2016 at 10:37 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
> Dear forum members
>
>
>
> I will be presenting on the topic of "Running Spark on Hive or Hive on
> Spark, your mileage varies" in Future of Data: London
> <http://www.meetup.com/futureofdata-london/events/232423292/>
>
> *Details*
>
> *Organized by: Hortonworks <http://hortonworks.com/>*
>
> *Date: Wednesday, July 20, 2016, 6:00 PM to 8:30 PM *
>
> *Place: London*
>
> *Location: One Canada Square, Canary Wharf,  London E14 5AB.*
>
> *Nearest Underground:  Canary Wharf (map
> <https://maps.google.com/maps?f=q&hl=en&q=One+Canada+Square%2C+Canary+Wharf%2C+E14+5AB%2C+London%2C+gb>)
> *
>
> If you are interested please register here
> <http://www.meetup.com/futureofdata-london/events/232423292/>
>
> Looking forward to seeing those who can make it to have an interesting
> discussion and leverage your experience.
>
> Regards,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> This email is confidential and may be subject to privilege. If you are not
> the intended recipient, please do not copy or disclose its content but
> contact the sender immediately upon receipt.
>


Re: ORC does not support type conversion from INT to STRING.

2016-07-19 Thread Mich Talebzadeh
In Hive 2, I don't see this issue when doing an INSERT/SELECT from an INT
column into a STRING column.
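
A minimal sketch of the sort of test I ran (table and column names here are
only illustrative):

create table src_int (id int)    stored as orc;
create table tgt_str (id string) stored as orc;

insert into table src_int values (1), (2), (3);
insert into table tgt_str select id from src_int;   -- implicit INT -> STRING, no error on Hive 2

select * from tgt_str;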

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 19 July 2016 at 20:39, Mahender Sarangam 
wrote:

>
> Thanks Matthew,
>
> Currently we are on Hive 1.2 only. Is there any setting like
> "hive.metastore.disallow.incompatible.col.type.changes=false;" in Hive 1.2,
> or any workaround apart from reloading the entire table data? As a quick
> workaround, we are reloading the entire data.
> Can you please share with us the JIRA for Schema Evolution?
>
>
> @Mich: Currently we have only primitive types, but I'm also interested to
> know how the behavior will be with complex types.
>
>
> /Mahender
>
>
> On 7/18/2016 3:55 PM, Mich Talebzadeh wrote:
>
> Hi Mathew,
>
> In layman's terms, if I create the source ORC table column as INT and then
> create a target ORC table where that column has now been defined as STRING,
> and do an INSERT/SELECT from the source table, how is the data internally stored?
>
> Is it implicitly converted to the new type using a CAST, or is it
> stored as-is and just masked?
>
> The version of Hive I am using is 2 and it works OK for primitive data
> types (insert/select from INT to String)
>
> However, I believe Mahender is referring to Complex types?
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 18 July 2016 at 22:31, Matthew McCline 
> wrote:
>
>>
>> Hi Mahender,
>>
>>
>> Schema Evolution is available in the most recent versions of Hive.
>>
>>
>> For example, if you set
>> hive.metastore.disallow.incompatible.col.type.changes=false; on master
>> (i.e. hive2) it will support INT to STRING conversion.
>>
>>
>> If you need to remain on an older version, then you are out of luck.
>>
>>
>> Thanks,
>>
>> Matt
>>
>>
>> --
>> *From:* Mahender Sarangam 
>> *Sent:* Monday, July 18, 2016 1:59 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: ORC does not support type conversion from INT to STRING.
>>
>>
>> Hi Mich,
>>
>> Sorry for the delay in responding. Here is the scenario:
>>
>> We created a new cluster and moved all the ORC file data into it. We
>> re-created the table pointing to the ORC location, and we changed the data
>> type of one column from INT to STRING. From then on we were unable to run a
>> SELECT against this ORC table; Hive keeps throwing the exception "Orc table
>> select. Unable to convert Int to String". It looks like a bug specific to
>> ORC tables: changing the data type from INT to STRING breaks the ORC
>> read/SELECT path, which throws an exception. Please let me know if there is
>> any workaround for this scenario. Was this behavior also expected
>> previously?
>>
>>
>> */Mahender*
>>
>>
>>
>>
>>
>>
>> On 6/14/2016 11:47 AM, Mich Talebzadeh wrote:
>>
>> You must excuse my ignorance.
>>
>> Can you please elaborate on this, as it seems something has gone wrong
>> somewhere?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 19:42, M

Re: ORC does not support type conversion from INT to STRING.

2016-07-19 Thread Mich Talebzadeh
Is that a distro from Hortonworks? In that case what Matthew mentioned may
be valid, unless you go through the pain of inserting with a CAST.
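
Something along these lines is what I had in mind (a sketch only; table and
column names are illustrative, and it assumes you can afford a one-off
rewrite of the data):

-- rebuild the table with the column declared as STRING, casting on the way in
create table mytable_new stored as orc as
select cast(id as string) as id, col2, col3
from mytable;

-- then swap the names so existing queries hit the STRING version
alter table mytable     rename to mytable_old;
alter table mytable_new rename to mytable;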

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 19 July 2016 at 22:19, Mahender Sarangam 
wrote:

> But we are using Hive 1.2 version
>
>
> On 7/19/2016 12:43 PM, Mich Talebzadeh wrote:
>
> In Hive 2, I don't see this issue when doing an INSERT/SELECT from an INT
> column into a STRING column.
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 19 July 2016 at 20:39, Mahender Sarangam 
> wrote:
>
>>
>> Thanks Matthew,
>>
>> Currently we are on Hive 1.2 only. Is there any setting like
>> "hive.metastore.disallow.incompatible.col.type.changes=false;" in Hive 1.2,
>> or any workaround apart from reloading the entire table data? As a quick
>> workaround, we are reloading the entire data.
>> Can you please share with us the JIRA for Schema Evolution?
>>
>>
>> @Mich: Currently we have only primitive types, but I'm also interested to
>> know how the behavior will be with complex types.
>>
>>
>> /Mahender
>>
>>
>> On 7/18/2016 3:55 PM, Mich Talebzadeh wrote:
>>
>> Hi Mathew,
>>
>> In layman's terms, if I create the source ORC table column as INT and then
>> create a target ORC table where that column has now been defined as STRING,
>> and do an INSERT/SELECT from the source table, how is the data internally stored?
>>
>> Is it implicitly converted to the new type using a CAST, or is it
>> stored as-is and just masked?
>>
>> The version of Hive I am using is 2 and it works OK for primitive data
>> types (insert/select from INT to String)
>>
>> However, I believe Mahender is referring to Complex types?
>>
>> Thanks
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 18 July 2016 at 22:31, Matthew McCline 
>> wrote:
>>
>>>
>>> Hi Mahender,
>>>
>>>
>>> Schema Evolution is available in the most recent versions of Hive.
>>>
>>>
>>> For example, if you set
>>> hive.metastore.disallow.incompatible.col.type.changes=false; on master
>>> (i.e. hive2) it will support INT to STRING conversion.
>>>
>>>
>>> If you need to remain on an older version, then you are out of luck.
>>>
>>>
>>> Thanks,
>>>
>>> Matt
>>>
>>>
>>> --
>>> *From:* Mahender Sarangam 
>>> *Sent:* Monday, July 18, 2016 1:59 PM
>>> *To:* user@hive.apache.org
>>> *Subject:* Re: ORC does not support type conversion from INT to STRING.
>>>
>>>
>>> Hi Mich,
>>>
>>> Sorry for the delay in responding. Here is the scenario:
>>>
>>> We created a new cluster and moved all the ORC file data 

Re: Hive on spark

2016-07-27 Thread Mich Talebzadeh
You mean you want to run Hive using Spark as the execution engine, which in
turn uses YARN by default?


Something like below

hive> select max(id) from oraclehadoop.dummy_parquet;
Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24   Stage-3_0: 0/1
2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24   Stage-3_0: 0/1
2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24  Stage-3_0: 0/1
2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24  Stage-3_0: 0/1
2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24  Stage-3_0: 0/1
2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24  Stage-3_0: 0/1
2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24  Stage-3_0: 0/1
2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1 Finished
Status: Finished successfully in 13.14 seconds
OK
1
Time taken: 13.426 seconds, Fetched: 1 row(s)
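
In case it helps, the minimal session settings for this look roughly like the
following (a sketch only; the memory and instance figures and the event-log
path are illustrative and need tuning for your cluster, and the Spark jars
must already be visible to Hive):

set hive.execution.engine=spark;
set spark.master=yarn-client;
set spark.executor.memory=4g;        -- illustrative figure
set spark.executor.instances=4;      -- illustrative figure
set spark.eventLog.enabled=true;
set spark.eventLog.dir=hdfs:///tmp/spark-events;   -- illustrative path

select max(id) from oraclehadoop.dummy_parquet;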


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 July 2016 at 20:31, Mudit Kumar  wrote:

> Hi All,
>
> I need to configure hive cluster based on spark engine (yarn).
> I already have a running hadoop cluster.
>
> Can someone point me to relevant documentation?
>
> TIA.
>
> Thanks,
> Mudit
>


Re: Hive on spark

2016-07-27 Thread Mich Talebzadeh
Hi,

I made a presentation in London on 20th July on this subject. In it I
explained how to make Spark work as an execution engine for Hive:

Query Engines for Hive, MR, Spark, Tez and LLAP – Considerations
<http://www.meetup.com/futureofdata-london/events/232423292/>!

I will see if I can send the presentation.

Cheers


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 July 2016 at 04:24, Mudit Kumar  wrote:

> Yes Mich, exactly.
>
> Thanks,
> Mudit
>
> From: Mich Talebzadeh 
> Reply-To: 
> Date: Thursday, July 28, 2016 at 1:08 AM
> To: user 
> Subject: Re: Hive on spark
>
> You mean you want to run Hive using Spark as the execution engine, which in
> turn uses YARN by default?
>
>
> Something like below
>
> hive> select max(id) from oraclehadoop.dummy_parquet;
> Starting Spark Job = 8218859d-1d7c-419c-adc7-4de175c3ca6d
> Query Hive on Spark job[1] stages:
> 2
> 3
> Status: Running (Hive on Spark job[1])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-07-27 20:38:17,269 Stage-2_0: 0(+8)/24   Stage-3_0: 0/1
> 2016-07-27 20:38:20,298 Stage-2_0: 8(+4)/24   Stage-3_0: 0/1
> 2016-07-27 20:38:22,309 Stage-2_0: 11(+1)/24  Stage-3_0: 0/1
> 2016-07-27 20:38:23,330 Stage-2_0: 12(+8)/24  Stage-3_0: 0/1
> 2016-07-27 20:38:26,360 Stage-2_0: 17(+7)/24  Stage-3_0: 0/1
> 2016-07-27 20:38:27,386 Stage-2_0: 20(+4)/24  Stage-3_0: 0/1
> 2016-07-27 20:38:28,391 Stage-2_0: 21(+3)/24  Stage-3_0: 0/1
> 2016-07-27 20:38:29,395 Stage-2_0: 24/24 Finished   Stage-3_0: 1/1 Finished
> Status: Finished successfully in 13.14 seconds
> OK
> 1
> Time taken: 13.426 seconds, Fetched: 1 row(s)
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 July 2016 at 20:31, Mudit Kumar  wrote:
>
>> Hi All,
>>
>> I need to configure hive cluster based on spark engine (yarn).
>> I already have a running hadoop cluster.
>>
>> Can someone point me to relevant documentation?
>>
>> TIA.
>>
>> Thanks,
>> Mudit
>>
>
>


Fwd: Building Spark 2 from source that does not include the Hive jars

2016-07-28 Thread Mich Talebzadeh
Does anyone in the Hive forum know about this?

Thanks

This has worked before, including with Spark 1.6.1 etc.

The aim is to build Spark without the Hive jars, the idea being to use Spark
as the Hive execution engine.

There are some notes on Hive on Spark: Getting Started
<https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started>

The usual process is to do

dev/make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided"

However, now I am getting this warning
[INFO] BUILD SUCCESS
[INFO]

[INFO] Total time: 10:08 min (Wall Clock)
[INFO] Finished at: 2016-07-27T15:07:11+01:00
[INFO] Final Memory: 98M/1909M
[INFO]

+ rm -rf /data6/hduser/spark-2.0.0/dist
+ mkdir -p /data6/hduser/spark-2.0.0/dist/jars
+ echo 'Spark [WARNING] The requested profile "parquet-provided" could not
be activated because it does not exist. built for Hadoop [WARNING] The
requested profile "parquet-provided" could not be activated because it does
not exist.'
+ echo 'Build flags: -Pyarn,hadoop-provided,hadoop-2.6,parquet-provided'


And this is the only tgz file I see

./spark-[WARNING] The requested profile "parquet-provided" could not be
activated because it does not exist.-bin-hadoop2-without-hive.tgz

Any clues as to what is happening, and the correct way of creating the build?

My interest is to extract from the build a jar file similar to the one below:

 spark-assembly-1.3.1-hadoop2.4.0.jar

Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Updating ORC table fails with [Error 10122]: Bucketized tables do not support INSERT INTO:

2016-07-29 Thread Mich Talebzadeh
The table was created as an ORC transactional table:

CREATE TABLE `payees`(
  `transactiondescription` string,
  `hits` int,
  `hashtag` string)
CLUSTERED BY (
  transactiondescription)
INTO 256 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://rhes564:9000/user/hive/warehouse/accounts.db/payees'
TBLPROPERTIES (
  'last_modified_by'='hduser',
  'last_modified_time'='1469818104',
  'numFiles'='256',
  'numRows'='620',
  'orc.compress'='ZLIB',
  'ransactional'='true',
  'rawDataSize'='0',
  'totalSize'='113650',
  'transient_lastDdlTime'='1469818104')


Updating column hashtag in this table based on column transactiondescription
fails:

hive> update payees set hashtag = "HARRODS" here transactiondescription
like "%HARRODS%";
FAILED: ParseException line 1:38 missing EOF at 'here' near '"HARRODS"'
hive> update payees set hashtag = "HARRODS" where transactiondescription
like "%HARRODS%";

FAILED: SemanticException [Error 10122]: Bucketized tables do not support
INSERT INTO: Table: accounts.payees
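
For reference, my understanding of the usual prerequisites for UPDATE on an
ORC table is roughly the following (a sketch only, not verified against this
table; note the TBLPROPERTIES above show 'ransactional' rather than
'transactional', which by itself would leave the table non-ACID):

set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.enforce.bucketing=true;          -- Hive 1.x only
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.compactor.initiator.on=true;     -- metastore side
set hive.compactor.worker.threads=1;      -- metastore side

-- assuming the version allows switching an existing ORC bucketed table to ACID;
-- the property key must be spelled 'transactional'
alter table accounts.payees set tblproperties ('transactional'='true');

update accounts.payees set hashtag = 'HARRODS'
where transactiondescription like '%HARRODS%';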


What would be the least painful solution, without resorting to anything elaborate?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

