Erroneous column selection in subquery
Hi all, Noticed something extremely strange and concerning regarding subquery column selection. Here's the query, and the sample results in presto vs hive SELECTipm.ip,ipm.master_id AS mid,COALESCE(ad.cookie, '') AS cid, COALESCE(ad.url_cid, '') AS campaign_id,COALESCE(ad.url_sid, '') AS segment_id,CASEWHEN ad.url_event='clk'THEN 'media_click' ELSE 'media_impression'END AS event,count(*) AS count,CASE WHEN COALESCE(o.name, '') != ''THEN 'first_party_media' ELSE ot.token_type END AS campaign_type,ad.dt, lower(COALESCE(ot.organization_name, o.name)) AS orgFROM(SELECT base_ip,url_event,url_orgid, url_token, url_cid,url_sid,cookie, url,date_time,dtFROM tag.pixelWHEREdt BETWEEN '2017-08-23' AND '2017-08-23'AND user_agent NOT LIKE 'ELB-HealthChecker%' GROUP BY1, 2, 3, 4, 5, 6, 7, 8, 9, 10) adJOIN default.ip_to_master_v2 ipmON ipm.ip = ad.base_iplEFT JOIN default.organization_token otON ot.token = ad.url_tokenLEFT JOIN aa.organization oON ad.url_orgid = o.idWHERE COALESCE(ot.organization_name, o.name, '') != '' GROUP BY1, 2, 3, 4, 5, 6, 8, 9, 10LIMIT 10 Presto ip | mid | cid | campaign_id | segment_id | event | count | campaign_type | dt | org (ommitted) 216.110.25.2 | 93c8a694f6c0d39- | 420ad81704196549e258d903aba10300 | 50 | 2 | media_impression | 1 | first_party_media | 2017-08-23 | ** 14.140.116.135 | 6dfbca80b99c365-60934b2c332bcaa | 96453917a70dcbeb59598e01fe970400 | 43 | 7 | media_impression | 2 | first_party_media | 2017-08-23 | ** Hive103.4.16.84 51 | ba82f9c1e7ef7-51ba82f9c1e7ef7 | 50 | 50 | 2 | media_impression | 2 | first_party_media | 2017-08-23 | 5012.175.247.66 | c4028d27f0ebbb2-ef86eff7815a004 | 109 | 109 | 21 | media_impression | 23 | first_party_media | 2017-08-23 | 109116.197.188.12 | e78268e3ce3e13c-19ffd91bd5236a8 | 43 | 43 | 7 | media_impression | 4 | first_party_media | 2017-08-23 | 43 Alright, so clearly the results are very different in presto compared to hive. What's interesting is that columns 3, 4, and 10 show the same stat: campaign_id. I tried a bunch of different things (different aliasing, grouping, etc.), but nothing worked. The thing that did the trick was pulling out the subquery into its own table DROP TABLE IF EXISTS aa.tmp_subquery;CREATE TABLE IF NOT EXISTS aa.tmp_subquery ASSELECTbase_ip,url_event,url_orgid,url_token, url_cid, url_sid,cookie,url,date_time,dtFROMtag.pixelWHEREdt BETWEEN '{start--mm-dd}' AND '{end--mm-dd}'AND user_agent NOT LIKE 'ELB-HealthChecker%'GROUP BY1, 2, 3, 4, 5, 6, 7, 8, 9, 10; INSERT OVERWRITE TABLE aa.campaign_accounts PARTITION (campaign_type, dt, org) SELECTipm.ip,ipm.master_id AS mid,COALESCE(ad.cookie, '') AS cid, COALESCE(ad.url_cid, '') AS campaign_id,COALESCE(ad.url_sid, '') AS segment_id,CASEWHEN ad.url_event='clk'THEN 'media_click' ELSE 'media_impression'END AS event,count(*) AS count,CASE WHEN COALESCE(o.name, '') != ''THEN 'first_party_media' ELSE ot.token_type END AS campaign_type,ad.dt, lower(COALESCE(ot.organization_name, o.name)) AS orgFROMaa.tmp_subquery adJOIN default.ip_to_master_v2 ipmON ipm.ip = ad.base_iplEFT JOIN default.organization_token otON ot.token = ad.url_tokenLEFT JOIN aa.organization oON ad.url_orgid = o.idWHERE COALESCE(ot.organization_name, o.name, '') != '' GROUP BY1, 2, 3, 4, 5, 6, 8, 9, 10DISTRIBUTE BYcampaign_type, dt, orgSORT BYmid; The hive output matches presto's.Seems like a bug that's easy to miss and can cause some major damage. Anyone encountered this before? Thanks, Nitin
Re: Hive LLAP service is not starting
> java.util.concurrent.ExecutionException: java.io.FileNotFoundException: > /tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory) LLAP expects to find a tarball where tez.lib.uris is - looks like you've got a directory? Cheers, Gopal
Re: Why is a single INSERT very slow in Hive?
Why do you want to do single inserts? It has been more designed for bulk loads. In any case newer version of Hive 2 using TEZ +llap improve it significantly (also for bulk analysis). Nevertheless, it is good practice to not use single inserts in an analysis systems, but try to combine and bulk-load them. > On 11. Sep 2017, at 21:01, Jinhui Qinwrote: > > > > Hi, > I am new to Hive. I just created a simple table in hive and inserted two > records, the first insertion took 16.4 sec, while the second took 14.3 sec. > Why is that very slow? is this the normal performance you get in Hive using > INSERT ? Is there a way to improve the performance of a single "insert" in > Hive? Any help would be really appreciated. Thanks! > > Here is the record from a terminal in Hive shell: > > = > > hive> show tables; > OK > Time taken: 2.758 seconds > hive> create table people(id int, name string, age int); > OK > Time taken: 0.283 seconds > hive> insert into table people(1,'Tom A', 20); > Query ID = hive_20170911134052_04680c79-432a-43e0-827b-29a4212fbbc0 > Total jobs = 3 > Launching Job 1 out of 3 > Number of reduce tasks is set to 0 since there's no reduce operator > Starting Job = job_1505146047428_0098, Tracking URL = > http://iop-hadoop-bi.novalocal:8088/proxy/application_1505146047428_0098/ > Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job -kill > job_1505146047428_0098 > Hadoop job information for Stage-1: number of mappers: 1; number of reducers: > 0 > 2017-09-11 13:41:01,492 Stage-1 map = 0%, reduce = 0% > 2017-09-11 13:41:06,940 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.7 > sec > MapReduce Total cumulative CPU time: 2 seconds 700 msec > Ended Job = job_1505146047428_0098 > Stage-4 is selected by condition resolver. > Stage-3 is filtered out by condition resolver. > Stage-5 is filtered out by condition resolver. > Moving data to: > hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/warehouse/people/.hive-staging_hive_2017-09-11_13-40-52_106_462156758110461544 > 1-1/-ext-1 > Loading data to table default.people > Table default.people stats: [numFiles=1, numRows=1, totalSize=11, > rawDataSize=10] > MapReduce Jobs Launched: > Stage-Stage-1: Map: 1 Cumulative CPU: 2.7 sec HDFS Read: 3836 HDFS Write: > 81 SUCCESS > Total MapReduce CPU Time Spent: 2 seconds 700 msec > OK > Time taken: 16.417 seconds > hive> insert into table people values(1,'Tom A', 20); > Query ID = hive_20170911134128_c8f46977-7718-4496-9a98-cce0f89ced79 > Total jobs = 3 > Launching Job 1 out of 3 > Number of reduce tasks is set to 0 since there's no reduce operator > Starting Job = job_1505146047428_0099, Tracking URL = > http://iop-hadoop-bi.novalocal:8088/proxy/application_1505146047428_0099/ > Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job -kill > job_1505146047428_0099 > Hadoop job information for Stage-1: number of mappers: 1; number of reducers: > 0 > 2017-09-11 13:41:36,289 Stage-1 map = 0%, reduce = 0% > 2017-09-11 13:41:40,721 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.28 > sec > MapReduce Total cumulative CPU time: 2 seconds 280 msec > Ended Job = job_1505146047428_0099 > Stage-4 is selected by condition resolver. > Stage-3 is filtered out by condition resolver. > Stage-5 is filtered out by condition resolver. > Moving data to: > hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/warehouse/people/.hive-staging_hive_2017-09-11_13-41-28_757_445847252207124056 > 7-1/-ext-1 > Loading data to table default.people > Table default.people stats: [numFiles=2, numRows=2, totalSize=22, > rawDataSize=20] > MapReduce Jobs Launched: > Stage-Stage-1: Map: 1 Cumulative CPU: 2.28 sec HDFS Read: 3924 HDFS > Write: 81 SUCCESS > Total MapReduce CPU Time Spent: 2 seconds 280 msec > OK > Time taken: 14.288 seconds > hive> exit; > = > > > Jinhui > >
Why is a single INSERT very slow in Hive?
Hi, I am new to Hive. I just created a simple table in hive and inserted two records, the first insertion took 16.4 sec, while the second took 14.3 sec. Why is that very slow? is this the normal performance you get in Hive using INSERT ? Is there a way to improve the performance of a single "insert" in Hive? Any help would be really appreciated. Thanks! Here is the record from a terminal in Hive shell: = hive> show tables; OK Time taken: 2.758 seconds hive> create table people(id int, name string, age int); OK Time taken: 0.283 seconds hive> insert into table people(1,'Tom A', 20); Query ID = hive_20170911134052_04680c79-432a-43e0-827b-29a4212fbbc0 Total jobs = 3 Launching Job 1 out of 3 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_1505146047428_0098, Tracking URL = http://iop-hadoop-bi. novalocal:8088/proxy/application_1505146047428_0098/ Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job -kill job_1505146047428_0098 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0 2017-09-11 13:41:01,492 Stage-1 map = 0%, reduce = 0% 2017-09-11 13:41:06,940 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.7 sec MapReduce Total cumulative CPU time: 2 seconds 700 msec Ended Job = job_1505146047428_0098 Stage-4 is selected by condition resolver. Stage-3 is filtered out by condition resolver. Stage-5 is filtered out by condition resolver. Moving data to: hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/ warehouse/people/.hive-staging_hive_2017-09-11_13-40- 52_106_462156758110461544 1-1/-ext-1 Loading data to table default.people Table default.people stats: [numFiles=1, numRows=1, totalSize=11, rawDataSize=10] MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Cumulative CPU: 2.7 sec HDFS Read: 3836 HDFS Write: 81 SUCCESS Total MapReduce CPU Time Spent: 2 seconds 700 msec OK Time taken: 16.417 seconds hive> insert into table people values(1,'Tom A', 20); Query ID = hive_20170911134128_c8f46977-7718-4496-9a98-cce0f89ced79 Total jobs = 3 Launching Job 1 out of 3 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_1505146047428_0099, Tracking URL = http://iop-hadoop-bi. novalocal:8088/proxy/application_1505146047428_0099/ Kill Command = /usr/iop/4.1.0.0/hadoop/bin/hadoop job -kill job_1505146047428_0099 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0 2017-09-11 13:41:36,289 Stage-1 map = 0%, reduce = 0% 2017-09-11 13:41:40,721 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.28 sec MapReduce Total cumulative CPU time: 2 seconds 280 msec Ended Job = job_1505146047428_0099 Stage-4 is selected by condition resolver. Stage-3 is filtered out by condition resolver. Stage-5 is filtered out by condition resolver. Moving data to: hdfs://iop-hadoop-bi.novalocal:8020/apps/hive/ warehouse/people/.hive-staging_hive_2017-09-11_13-41- 28_757_445847252207124056 7-1/-ext-1 Loading data to table default.people Table default.people stats: [numFiles=2, numRows=2, totalSize=22, rawDataSize=20] MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Cumulative CPU: 2.28 sec HDFS Read: 3924 HDFS Write: 81 SUCCESS Total MapReduce CPU Time Spent: 2 seconds 280 msec OK Time taken: 14.288 seconds hive> exit; = Jinhui
Re: Hive LLAP service is not starting
Hi Rajesh, I thought the recommended way to start/stop LLAP was to use Slider this is what we do to start slider create llap --resources $BASEDIR/resources.json --template $BASEDIR/appConfig.json and to stop before you can start again slider stop llap slider destroy llap --force || slider destroy llap On Mon, Sep 11, 2017 at 5:36 AM, Rajesh Narayanan < rajesh.r.naraya...@oracle.com> wrote: > Hi All, > > I am using Hadoop 2.7.3 and Hive 2.2.0 and TEZ 0.9.0 in linux. and able to > execute hive query in yarn as TEZ type. > > When I start hive LLAP service -./hive --service llap --name @llap > --instances 1 --cache 1024m --xmx 2048m --size 3225m --loglevel DEBUG > --args " -XX:+UseG1GC -XX:+ResizeTLAB -XX:+UseNUMA -XX:-ResizePLAB" > > Getting below exception. Could you please suggest what missing here. > > > > Failed: java.io.FileNotFoundException: > /tmp/staging-slider-HHIwk3/lib/tez.tar.gz > (Is a directory) > java.util.concurrent.ExecutionException: java.io.FileNotFoundException: > /tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory) > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at org.apache.hadoop.hive.llap.cli.LlapServiceDriver.run( > LlapServiceDriver.java:605) > at org.apache.hadoop.hive.llap.cli.LlapServiceDriver.main( > LlapServiceDriver.java:113) > Caused by: java.io.FileNotFoundException: > /tmp/staging-slider-HHIwk3/lib/tez.tar.gz > (Is a directory) > at java.io.FileInputStream.open0(Native Method) > at java.io.FileInputStream.open(FileInputStream.java:195) > at java.io.FileInputStream.(FileInputStream.java:138) > at org.apache.hadoop.hive.common.CompressionUtils.unTar( > CompressionUtils.java:152) > at org.apache.hadoop.hive.llap.cli.LlapServiceDriver$1.call( > LlapServiceDriver.java:361) > at org.apache.hadoop.hive.llap.cli.LlapServiceDriver$1.call( > LlapServiceDriver.java:348) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.util.concurrent.ThreadPoolExecutor.runWorker( > ThreadPoolExecutor.java:1142) > at java.util.concurrent.ThreadPoolExecutor$Worker.run( > ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > INFO cli.LlapServiceDriver: LLAP service driver finished > > > > Thanks > > Rajesh > -- Regards, Premal Shah.
Hive LLAP service is not starting
Hi All, I am using Hadoop 2.7.3 and Hive 2.2.0 and TEZ 0.9.0 in linux. and able to execute hive query in yarn as TEZ type. When I start hive LLAP service -./hive --service llap --name @llap --instances 1 --cache 1024m --xmx 2048m --size 3225m --loglevel DEBUG --args " -XX:+UseG1GC -XX:+ResizeTLAB -XX:+UseNUMA -XX:-ResizePLAB" Getting below exception. Could you please suggest what missing here. Failed: java.io.FileNotFoundException: /tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory) java.util.concurrent.ExecutionException: java.io.FileNotFoundException: /tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory) at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:192) at org.apache.hadoop.hive.llap.cli.LlapServiceDriver.run(LlapServiceDriver.java:605) at org.apache.hadoop.hive.llap.cli.LlapServiceDriver.main(LlapServiceDriver.java:113) Caused by: java.io.FileNotFoundException: /tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.(FileInputStream.java:138) at org.apache.hadoop.hive.common.CompressionUtils.unTar(CompressionUtils.java:152) at org.apache.hadoop.hive.llap.cli.LlapServiceDriver$1.call(LlapServiceDriver.java:361) at org.apache.hadoop.hive.llap.cli.LlapServiceDriver$1.call(LlapServiceDriver.java:348) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) INFO cli.LlapServiceDriver: LLAP service driver finished Thanks Rajesh