Erroneous column selection in subquery

2017-09-11 Thread Nitin Vijayvargiya
Hi all,
Noticed something extremely strange and concerning regarding subquery column
selection. Here's the query, and the sample results in presto vs hive
SELECTipm.ip,ipm.master_id AS mid,COALESCE(ad.cookie, '') AS cid,
COALESCE(ad.url_cid, '') AS campaign_id,COALESCE(ad.url_sid, '') AS
segment_id,CASEWHEN ad.url_event='clk'THEN 'media_click' 
ELSE 'media_impression'END AS event,count(*) AS count,CASE
WHEN COALESCE(o.name, '') != ''THEN 'first_party_media'
ELSE ot.token_type END AS campaign_type,ad.dt,
lower(COALESCE(ot.organization_name, o.name)) AS orgFROM(SELECT
base_ip,url_event,url_orgid,
url_token, url_cid,url_sid,cookie,
url,date_time,dtFROM
tag.pixelWHEREdt BETWEEN '2017-08-23' AND
'2017-08-23'AND user_agent NOT LIKE 'ELB-HealthChecker%'
GROUP BY1, 2, 3, 4, 5, 6, 7, 8, 9, 10) adJOIN
default.ip_to_master_v2 ipmON ipm.ip = ad.base_iplEFT JOIN
default.organization_token otON ot.token = ad.url_tokenLEFT JOIN
aa.organization oON ad.url_orgid = o.idWHERE 
COALESCE(ot.organization_name, o.name, '') != '' GROUP BY1, 2, 3, 4, 5, 6,
8, 9, 10LIMIT 10
Presto   ip                 | mid                                          
            | cid                                                              
             | campaign_id | segment_id | event                      | count |
campaign_type    | dt                | org (ommitted)

 216.110.25.2     | 93c8a694f6c0d39-                               |
420ad81704196549e258d903aba10300 | 50                 | 2              
   | media_impression | 1        | first_party_media | 2017-08-23 | **
 14.140.116.135 | 6dfbca80b99c365-60934b2c332bcaa |
96453917a70dcbeb59598e01fe970400  | 43                 | 7              
    | media_impression | 2        | first_party_media | 2017-08-23 | **
Hive103.4.16.84 51 | ba82f9c1e7ef7-51ba82f9c1e7ef7       | 50    | 50    | 2   |
media_impression | 2   | first_party_media | 2017-08-23 | 5012.175.247.66  |
c4028d27f0ebbb2-ef86eff7815a004   | 109  | 109  | 21 | media_impression | 23 |
first_party_media | 2017-08-23 | 109116.197.188.12 |
e78268e3ce3e13c-19ffd91bd5236a8 | 43    | 43    | 7   | media_impression | 4   |
first_party_media | 2017-08-23 | 43
Alright, so clearly the results are very different in presto compared to hive. 
What's interesting is that columns 3, 4, and 10 show the same stat: campaign_id.
I tried a bunch of different things (different aliasing, grouping, etc.), but
nothing worked. The thing that did the trick was pulling out the subquery into
its own table
DROP TABLE IF EXISTS aa.tmp_subquery;CREATE TABLE IF NOT EXISTS aa.tmp_subquery
ASSELECTbase_ip,url_event,url_orgid,url_token, url_cid,
url_sid,cookie,url,date_time,dtFROMtag.pixelWHEREdt
BETWEEN '{start--mm-dd}' AND '{end--mm-dd}'AND user_agent NOT LIKE
'ELB-HealthChecker%'GROUP BY1, 2, 3, 4, 5, 6, 7, 8, 9, 10;
INSERT OVERWRITE TABLE aa.campaign_accounts PARTITION (campaign_type, dt, org)
SELECTipm.ip,ipm.master_id AS mid,COALESCE(ad.cookie, '') AS cid,
COALESCE(ad.url_cid, '') AS campaign_id,COALESCE(ad.url_sid, '') AS
segment_id,CASEWHEN ad.url_event='clk'THEN 'media_click' 
ELSE 'media_impression'END AS event,count(*) AS count,CASE
WHEN COALESCE(o.name, '') != ''THEN 'first_party_media'
ELSE ot.token_type END AS campaign_type,ad.dt,
lower(COALESCE(ot.organization_name, o.name)) AS orgFROMaa.tmp_subquery
adJOIN default.ip_to_master_v2 ipmON ipm.ip = ad.base_iplEFT
JOIN default.organization_token otON ot.token = ad.url_tokenLEFT
JOIN aa.organization oON ad.url_orgid = o.idWHERE 
COALESCE(ot.organization_name, o.name, '') != '' GROUP BY1, 2, 3, 4, 5, 6,
8, 9, 10DISTRIBUTE BYcampaign_type, dt, orgSORT BYmid;
The hive output matches presto's.Seems like a bug that's easy to miss and can
cause some major damage.
Anyone encountered this before?
Thanks,
Nitin

Re: Hive LLAP service is not starting

2017-09-11 Thread Gopal Vijayaraghavan
> java.util.concurrent.ExecutionException: java.io.FileNotFoundException: 
> /tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory)

LLAP expects to find a tarball where tez.lib.uris is - looks like you've got a 
directory?

Cheers,
Gopal




Re: Why is a single INSERT very slow in Hive?

2017-09-11 Thread Jörn Franke
Why do you want to do single inserts? 
It has been more designed for bulk loads.
In any case newer version of Hive 2 using TEZ +llap improve it significantly 
(also for bulk analysis). Nevertheless, it is good practice to not use single 
inserts in an analysis systems, but try to combine and bulk-load them.

> On 11. Sep 2017, at 21:01, Jinhui Qin  wrote:
> 
>  
> 
> Hi, 
> I am new to Hive. I just created a simple table in hive and inserted two 
> records, the first insertion took 16.4 sec, while the second took 14.3 sec. 
> Why is that very slow? is this the normal performance you get in Hive using 
> INSERT ? Is there a way to improve the performance of a single "insert" in 
> Hive? Any help would be really appreciated. Thanks!
> 
> Here is the record from a terminal in Hive shell:
> 
> =
> 
> hive> show tables;
> OK
> Time taken: 2.758 seconds
> hive> create table people(id int, name string, age int);
> OK
> Time taken: 0.283 seconds
> hive> insert into table people(1,'Tom A', 20);
> Query ID = hive_20170911134052_04680c79-432a-43e0-827b-29a4212fbbc0
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1505146047428_0098, Tracking URL = 
> http://iop-hadoop-bi.novalocal:8088/proxy/application_1505146047428_0098/
> Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job  -kill 
> job_1505146047428_0098
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: > 0
> 2017-09-11 13:41:01,492 Stage-1 map = 0%,  reduce = 0%
> 2017-09-11 13:41:06,940 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.7 
> sec
> MapReduce Total cumulative CPU time: 2 seconds 700 msec
> Ended Job = job_1505146047428_0098
> Stage-4 is selected by condition resolver.
> Stage-3 is filtered out by condition resolver.
> Stage-5 is filtered out by condition resolver.
> Moving data to: 
> hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/warehouse/people/.hive-staging_hive_2017-09-11_13-40-52_106_462156758110461544
> 1-1/-ext-1
> Loading data to table default.people
> Table default.people stats: [numFiles=1, numRows=1, totalSize=11, 
> rawDataSize=10]
> MapReduce Jobs Launched: 
> Stage-Stage-1: Map: 1   Cumulative CPU: 2.7 sec   HDFS Read: 3836 HDFS Write: 
> 81 SUCCESS
> Total MapReduce CPU Time Spent: 2 seconds 700 msec
> OK
> Time taken: 16.417 seconds
> hive> insert into table people values(1,'Tom A', 20);
> Query ID = hive_20170911134128_c8f46977-7718-4496-9a98-cce0f89ced79
> Total jobs = 3
> Launching Job 1 out of 3
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1505146047428_0099, Tracking URL = 
> http://iop-hadoop-bi.novalocal:8088/proxy/application_1505146047428_0099/
> Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job  -kill 
> job_1505146047428_0099
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: > 0
> 2017-09-11 13:41:36,289 Stage-1 map = 0%,  reduce = 0%
> 2017-09-11 13:41:40,721 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.28 
> sec
> MapReduce Total cumulative CPU time: 2 seconds 280 msec
> Ended Job = job_1505146047428_0099
> Stage-4 is selected by condition resolver.
> Stage-3 is filtered out by condition resolver.
> Stage-5 is filtered out by condition resolver.
> Moving data to: 
> hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/warehouse/people/.hive-staging_hive_2017-09-11_13-41-28_757_445847252207124056
> 7-1/-ext-1
> Loading data to table default.people
> Table default.people stats: [numFiles=2, numRows=2, totalSize=22, 
> rawDataSize=20]
> MapReduce Jobs Launched: 
> Stage-Stage-1: Map: 1   Cumulative CPU: 2.28 sec   HDFS Read: 3924 HDFS 
> Write: 81 SUCCESS
> Total MapReduce CPU Time Spent: 2 seconds 280 msec
> OK
> Time taken: 14.288 seconds
> hive> exit;
> =
> 
>  
> Jinhui
> 
> 


Why is a single INSERT very slow in Hive?

2017-09-11 Thread Jinhui Qin
Hi,

I am new to Hive. I just created a simple table in hive and inserted two
records, the first insertion took 16.4 sec, while the second took 14.3 sec.
Why is that very slow? is this the normal performance you get in Hive using
INSERT ? Is there a way to improve the performance of a single "insert" in
Hive? Any help would be really appreciated. Thanks!

Here is the record from a terminal in Hive shell:

=

hive> show tables;
OK
Time taken: 2.758 seconds
hive> create table people(id int, name string, age int);
OK
Time taken: 0.283 seconds
hive> insert into table people(1,'Tom A', 20);
Query ID = hive_20170911134052_04680c79-432a-43e0-827b-29a4212fbbc0
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1505146047428_0098, Tracking URL = http://iop-hadoop-bi.
novalocal:8088/proxy/application_1505146047428_0098/
Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job  -kill
job_1505146047428_0098
Hadoop job information for Stage-1: number of mappers: 1; number of
reducers: 0
2017-09-11 13:41:01,492 Stage-1 map = 0%,  reduce = 0%
2017-09-11 13:41:06,940 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
2.7 sec
MapReduce Total cumulative CPU time: 2 seconds 700 msec
Ended Job = job_1505146047428_0098
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/
warehouse/people/.hive-staging_hive_2017-09-11_13-40-
52_106_462156758110461544
1-1/-ext-1
Loading data to table default.people
Table default.people stats: [numFiles=1, numRows=1, totalSize=11,
rawDataSize=10]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.7 sec   HDFS Read: 3836 HDFS
Write: 81 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 700 msec
OK
Time taken: 16.417 seconds
hive> insert into table people values(1,'Tom A', 20);
Query ID = hive_20170911134128_c8f46977-7718-4496-9a98-cce0f89ced79
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1505146047428_0099, Tracking URL = http://iop-hadoop-bi.
novalocal:8088/proxy/application_1505146047428_0099/
Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job  -kill
job_1505146047428_0099
Hadoop job information for Stage-1: number of mappers: 1; number of
reducers: 0
2017-09-11 13:41:36,289 Stage-1 map = 0%,  reduce = 0%
2017-09-11 13:41:40,721 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
2.28 sec
MapReduce Total cumulative CPU time: 2 seconds 280 msec
Ended Job = job_1505146047428_0099
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/
warehouse/people/.hive-staging_hive_2017-09-11_13-41-
28_757_445847252207124056
7-1/-ext-1
Loading data to table default.people
Table default.people stats: [numFiles=2, numRows=2, totalSize=22,
rawDataSize=20]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 2.28 sec   HDFS Read: 3924 HDFS
Write: 81 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 280 msec
OK
Time taken: 14.288 seconds
hive> exit;
=


Jinhui


Re: Hive LLAP service is not starting

2017-09-11 Thread Premal Shah
Hi Rajesh,
I thought the recommended way to start/stop LLAP was to use Slider

this is what we do to start

slider create llap --resources $BASEDIR/resources.json --template
$BASEDIR/appConfig.json

and to stop before you can start again

slider stop llap
slider destroy llap --force || slider destroy llap


On Mon, Sep 11, 2017 at 5:36 AM, Rajesh Narayanan <
rajesh.r.naraya...@oracle.com> wrote:

> Hi All,
>
> I am using Hadoop 2.7.3 and Hive 2.2.0 and TEZ 0.9.0 in linux. and able to
> execute hive query in yarn as TEZ type.
>
> When I start hive LLAP service  -./hive --service llap --name @llap
> --instances 1 --cache 1024m --xmx 2048m --size 3225m --loglevel DEBUG
> --args " -XX:+UseG1GC -XX:+ResizeTLAB -XX:+UseNUMA -XX:-ResizePLAB"
>
> Getting below exception. Could you please suggest what missing here.
>
>
>
> Failed: java.io.FileNotFoundException: 
> /tmp/staging-slider-HHIwk3/lib/tez.tar.gz
> (Is a directory)
> java.util.concurrent.ExecutionException: java.io.FileNotFoundException:
> /tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory)
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.hadoop.hive.llap.cli.LlapServiceDriver.run(
> LlapServiceDriver.java:605)
> at org.apache.hadoop.hive.llap.cli.LlapServiceDriver.main(
> LlapServiceDriver.java:113)
> Caused by: java.io.FileNotFoundException: 
> /tmp/staging-slider-HHIwk3/lib/tez.tar.gz
> (Is a directory)
> at java.io.FileInputStream.open0(Native Method)
> at java.io.FileInputStream.open(FileInputStream.java:195)
> at java.io.FileInputStream.(FileInputStream.java:138)
> at org.apache.hadoop.hive.common.CompressionUtils.unTar(
> CompressionUtils.java:152)
> at org.apache.hadoop.hive.llap.cli.LlapServiceDriver$1.call(
> LlapServiceDriver.java:361)
> at org.apache.hadoop.hive.llap.cli.LlapServiceDriver$1.call(
> LlapServiceDriver.java:348)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> INFO cli.LlapServiceDriver: LLAP service driver finished
>
>
>
> Thanks
>
> Rajesh
>



-- 
Regards,
Premal Shah.


Hive LLAP service is not starting

2017-09-11 Thread Rajesh Narayanan
Hi All,

I am using Hadoop 2.7.3 and Hive 2.2.0 and TEZ 0.9.0 in linux. and able to 
execute hive query in yarn as TEZ type.

When I start hive LLAP service  -./hive --service llap --name @llap --instances 
1 --cache 1024m --xmx 2048m --size 3225m --loglevel DEBUG --args " -XX:+UseG1GC 
-XX:+ResizeTLAB -XX:+UseNUMA -XX:-ResizePLAB"

Getting below exception. Could you please suggest what missing here.

 

Failed: java.io.FileNotFoundException: 
/tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory)
java.util.concurrent.ExecutionException: java.io.FileNotFoundException: 
/tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory)
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.hadoop.hive.llap.cli.LlapServiceDriver.run(LlapServiceDriver.java:605)
at 
org.apache.hadoop.hive.llap.cli.LlapServiceDriver.main(LlapServiceDriver.java:113)
Caused by: java.io.FileNotFoundException: 
/tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.(FileInputStream.java:138)
at 
org.apache.hadoop.hive.common.CompressionUtils.unTar(CompressionUtils.java:152)
at 
org.apache.hadoop.hive.llap.cli.LlapServiceDriver$1.call(LlapServiceDriver.java:361)
at 
org.apache.hadoop.hive.llap.cli.LlapServiceDriver$1.call(LlapServiceDriver.java:348)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
INFO cli.LlapServiceDriver: LLAP service driver finished

 

Thanks

Rajesh