Get argument names in Hive's UDF
Hi all,

Is there any API to retrieve the parameters' column names in a GenericUDF? For example:

SELECT UDFTEST(columnA, columnB) FROM test;

I want to get the column names ("columnA" and "columnB") in UDFTEST's initialize function via the ObjectInspector, but I have not found any viable solution.
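For context, here is a minimal GenericUDF sketch showing where such a lookup would have to happen. The class name follows the poster's example, the pass-through behavior is my own illustration, and as far as I can tell the ObjectInspector array only exposes argument types, not the source column names:

    import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

    public class UDFTest extends GenericUDF {
      @Override
      public ObjectInspector initialize(ObjectInspector[] arguments)
          throws UDFArgumentException {
        // Each inspector describes only the TYPE of the argument expression
        // (e.g. "string"), not the column it came from, which is why
        // "columnA"/"columnB" cannot be recovered at this point.
        for (ObjectInspector oi : arguments) {
          System.err.println("argument type: " + oi.getTypeName());
        }
        return arguments[0]; // pass the first argument through, just to keep the sketch compact
      }

      @Override
      public Object evaluate(DeferredObject[] args) throws HiveException {
        return args[0].get();
      }

      @Override
      public String getDisplayString(String[] children) {
        // Only used when Hive prints the expression (e.g. in EXPLAIN).
        return "UDFTEST(" + java.util.Arrays.toString(children) + ")";
      }
    }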
Possible to specify reducers for each stage?
Hi all,

Is it possible to specify the number of reducers for each stage of a query? If so, how?

Thanks!
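For reference, a sketch of the knobs I know of; as far as I can tell they apply to every MapReduce stage a query spawns rather than to individual stages, and the values below are only illustrative:

    -- Pin the reducer count for all stages of subsequent queries:
    set mapred.reduce.tasks=30;

    -- Or let Hive estimate the count per stage, bounded by these settings:
    set mapred.reduce.tasks=-1;
    set hive.exec.reducers.bytes.per.reducer=1000000000;
    set hive.exec.reducers.max=999;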
Re: Performance difference between tuning reducer count and partitioning the table
Hi Dean,

Thanks for your reply. If I don't set the number of reducers in the 1st run, the number of reducers is much smaller and the performance is worse. The total output file size is about 200MB, and I see that many reduce output files are empty; only 10 of them have data.

Another question: is there any documentation about the job-specific parameters of MapReduce and Hive?

2013/6/29 Dean Wampler
> What happens if you don't set the number of reducers in the 1st run? How
> many reducers are executed? If it's a much smaller number, the extra
> overhead could matter. Another clue is the size of the files the first run
> produced, i.e., do you have 30 small (much less than a block size) files?
>
> On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐 wrote:
>
>> Hi Stephen,
>>
>> My query is actually more complex; Hive generates 2 MapReduce jobs.
>> In the first solution, it runs 17 mappers / 30 reducers and 10 mappers /
>> 30 reducers (the reducer number is set manually).
>> In the second solution, it runs 6 mappers / 1 reducer and 4 mappers /
>> 1 reducer for each partition.
>>
>> I do not know whether they could achieve the same performance if the
>> number of reducers were set properly.
>>
>> 2013/6/29 Stephen Sprague
>>
>>> Great question. Your parallelization seems to trump Hadoop's. I guess
>>> I'd ask what the _total_ numbers of mappers and reducers are that run
>>> on your cluster for these two scenarios. I'd be curious whether they
>>> are the same.
>>>
>>> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 wrote:
>>>
>>>> Hi all,
>>>>
>>>> Here is the scenario: suppose I have 2 tables A and B, and I would
>>>> like to perform a simple join on them.
>>>>
>>>> We can do it like this:
>>>>
>>>> INSERT OVERWRITE TABLE C
>>>> SELECT ... FROM A JOIN B ON A.id=B.id
>>>>
>>>> In order to speed up this query, since tables A and B have lots of
>>>> data, another approach is:
>>>>
>>>> Say I partition tables A and B into 10 partitions respectively, and
>>>> write the query like this:
>>>>
>>>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>>>> SELECT ... FROM A JOIN B ON A.id=B.id WHERE A.pid=1 AND B.pid=1
>>>>
>>>> Then I run this query 10 times concurrently (pid ranges from 1 to 10).
>>>>
>>>> My question is: in my observation of some more complex queries, the
>>>> second solution is about 15% faster than the first solution.
>>>> Is it simply because the setting of the reducer number is not optimal?
>>>> If resources are not a limit and it is possible to set the proper
>>>> reducer numbers in the first solution, can they achieve the same
>>>> performance?
>>>> Is there any other factor that can cause a performance difference
>>>> between them (non-partitioned vs. partitioned+concurrent) besides the
>>>> job parameter issues?
>>>>
>>>> Thanks!
>
> --
> Dean Wampler, Ph.D.
> @deanwampler
> http://polyglotprogramming.com
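On the documentation question, the Hive wiki's Configuration Properties page and Hadoop's mapred-default.xml cover most of these parameters. The empty reduce outputs suggest the fixed count of 30 exceeds what the data needs; a sketch of two alternatives I believe would help (values illustrative):

    -- Let Hive derive the reducer count from the input size instead of a fixed 30:
    set mapred.reduce.tasks=-1;
    set hive.exec.reducers.bytes.per.reducer=256000000;

    -- Or keep the fixed count but merge the small/empty reduce outputs afterwards:
    set hive.merge.mapredfiles=true;
    set hive.merge.size.per.task=256000000;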
Re: Performance difference between tuning reducer count and partitioning the table
Hi Stephen,

My query is actually more complex; Hive generates 2 MapReduce jobs.
In the first solution, it runs 17 mappers / 30 reducers and 10 mappers / 30 reducers (the reducer number is set manually).
In the second solution, it runs 6 mappers / 1 reducer and 4 mappers / 1 reducer for each partition.

I do not know whether they could achieve the same performance if the number of reducers were set properly.

2013/6/29 Stephen Sprague
> Great question. Your parallelization seems to trump Hadoop's. I guess
> I'd ask what the _total_ numbers of mappers and reducers are that run
> on your cluster for these two scenarios. I'd be curious whether they
> are the same.
>
> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 wrote:
>
>> Hi all,
>>
>> Here is the scenario: suppose I have 2 tables A and B, and I would
>> like to perform a simple join on them.
>>
>> We can do it like this:
>>
>> INSERT OVERWRITE TABLE C
>> SELECT ... FROM A JOIN B ON A.id=B.id
>>
>> In order to speed up this query, since tables A and B have lots of
>> data, another approach is:
>>
>> Say I partition tables A and B into 10 partitions respectively, and
>> write the query like this:
>>
>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>> SELECT ... FROM A JOIN B ON A.id=B.id WHERE A.pid=1 AND B.pid=1
>>
>> Then I run this query 10 times concurrently (pid ranges from 1 to 10).
>>
>> My question is: in my observation of some more complex queries, the
>> second solution is about 15% faster than the first solution.
>> Is it simply because the setting of the reducer number is not optimal?
>> If resources are not a limit and it is possible to set the proper
>> reducer numbers in the first solution, can they achieve the same
>> performance?
>> Is there any other factor that can cause a performance difference
>> between them (non-partitioned vs. partitioned+concurrent) besides the
>> job parameter issues?
>>
>> Thanks!
Performance difference between tuning reducer count and partitioning the table
Hi all,

Here is the scenario: suppose I have 2 tables A and B, and I would like to perform a simple join on them.

We can do it like this:

INSERT OVERWRITE TABLE C
SELECT ... FROM A JOIN B ON A.id=B.id

In order to speed up this query, since tables A and B have lots of data, another approach is:

Say I partition tables A and B into 10 partitions respectively, and write the query like this:

INSERT OVERWRITE TABLE C PARTITION(pid=1)
SELECT ... FROM A JOIN B ON A.id=B.id WHERE A.pid=1 AND B.pid=1

Then I run this query 10 times concurrently (pid ranges from 1 to 10).

My question is: in my observation of some more complex queries, the second solution is about 15% faster than the first solution. Is it simply because the setting of the reducer number is not optimal? If resources are not a limit and it is possible to set the proper reducer numbers in the first solution, can they achieve the same performance? Is there any other factor that can cause a performance difference between them (non-partitioned vs. partitioned+concurrent) besides the job parameter issues?

Thanks!
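One way to probe whether the gap is purely a reducer-count effect is to compare the planned stages of the two variants and then pin the single-query run to the same aggregate parallelism as the concurrent runs. A sketch, where A.id stands in for the select list elided above:

    EXPLAIN
    INSERT OVERWRITE TABLE C
    SELECT A.id FROM A JOIN B ON A.id = B.id;

    -- e.g. match 10 concurrent partitions x 1 reducer each:
    set mapred.reduce.tasks=10;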
Re: Overwrite by selected data from table itself?
I've done it this way many times; there must be some error in your script. You may paste your script here.

2013/5/30 Stephen Sprague
> I think it's a clever idea. Can you reproduce this behavior via a simple
> example and show it here? I ran a test on Hive 0.8.0 and it worked as you
> would expect.
>
> Regards,
> Stephen.
>
> hisql> select * from junk;
> +-----+
> | _c0 |
> +-----+
> | 1   |
> +-----+
> 1 affected
>
> hisql> insert overwrite table junk select * from junk where '_c0' != 1;
>
> hisql> select count(*) from junk;
> +-----+
> | _c0 |
> +-----+
> | 0   |
> +-----+
> 1 affected
>
> On Wed, May 29, 2013 at 2:16 AM, Bhathiya Jayasekara <tobhathi...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have this scenario to remove certain rows from a Hive table. As far as
>> I understand, Hive doesn't provide that functionality.
>>
>> So, I'm trying to select the inverse of what I want to delete and
>> overwrite the table with that. What do you think of this approach?
>>
>> I tried it, but it seems it doesn't work as I expect. All the data is
>> still available in the table. What's the reason, if this is the correct
>> behavior?
>>
>> Thanks.
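For reference, the pattern under discussion looks like this (the table name junk and the predicate are illustrative). One thing worth checking in scripts that appear to do nothing: Stephen's test filters on the string literal '_c0' rather than the column, and to my understanding comparing a non-numeric string with 1 yields NULL, which silently drops every row:

    -- Keep only the inverse of what you want to delete; note the column
    -- reference in backticks, not the string literal '_c0':
    INSERT OVERWRITE TABLE junk
    SELECT * FROM junk WHERE `_c0` != 1;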
How to change the separator of INSERT OVERWRITE LOCAL DIRECTORY
Hi all,

I am wondering how to change the field separator of INSERT OVERWRITE LOCAL DIRECTORY. Does anyone have experience doing this?

Thanks!
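Two approaches I believe work, depending on the Hive version (the output path and the ',' delimiter are illustrative):

    -- Hive 0.11 and later accept a row format on the directory insert
    -- (to my knowledge):
    INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT * FROM test;

    -- On older releases, a common workaround is to build the line yourself
    -- (non-string columns need a cast to string first):
    INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out'
    SELECT concat_ws(',', columnA, columnB) FROM test;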
Re: How to change the separator of input record in TRANSFORM of Hive
Oh sorry, I found the solution on the wiki: https://cwiki.apache.org/Hive/languagemanual-transform.html, by specifying the inRowFormat and outRowFormat.

2013/5/24 Felix.徐
> Hi all,
>
> I am trying to use TRANSFORM in Hive, but I cannot find a way to change
> the separator between fields of the input records of TRANSFORM.
>
> I created a table A by specifying ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'.
>
> However, while using
>
> SELECT TRANSFORM(id, name) USING 'python script.py'
> AS (id, name)
> FROM A
>
> I find that the separator of the input is TAB instead of '\0'.
> Does anyone know how to change it to '\0'? Thanks.
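Concretely, the wiki's inRowFormat/outRowFormat clauses applied to this query would look roughly like this (my sketch of the documented syntax, using the table and script from the post):

    SELECT TRANSFORM(id, name)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'   -- inRowFormat: what the script reads
      USING 'python script.py'
      AS (id, name)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'   -- outRowFormat: what Hive reads back
    FROM A;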
How to change the separator of input record in TRANSFORM of Hive
Hi all,

I am trying to use TRANSFORM in Hive, but I cannot find a way to change the separator between fields of the input records of TRANSFORM.

I created a table A by specifying ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'.

However, while using

SELECT TRANSFORM(id, name) USING 'python script.py'
AS (id, name)
FROM A

I find that the separator of the input is TAB instead of '\0'. Does anyone know how to change it to '\0'? Thanks.
Bugs exist in SEMI JOIN?
Hi,

I am using version 0.9.0 and my tables are the same as the TPC-H benchmark. Here is a simple query (works correctly):

*Q1*
INSERT OVERWRITE TABLE customer_orders_statistics
SELECT C_CUSTKEY FROM CUSTOMER
LEFT SEMI JOIN (
SELECT O_CUSTKEY FROM ORDERS WHERE unix_timestamp(O_ORDERDATE, 'yyyy-MM-dd') > unix_timestamp('1995-12-31', 'yyyy-MM-dd')
) tempTable ON tempTable.O_CUSTKEY=CUSTOMER.C_CUSTKEY

It inserts the keys of customers who have had orders since 1995-12-31 into another table. But if I write the query like this:

*Q2*
INSERT OVERWRITE TABLE customer_orders_statistics
SELECT C_CUSTKEY FROM CUSTOMER
LEFT SEMI JOIN ORDERS ON CUSTOMER.C_CUSTKEY=ORDERS.O_CUSTKEY
*AND* unix_timestamp(ORDERS.O_ORDERDATE, 'yyyy-MM-dd') > unix_timestamp('1995-12-31', 'yyyy-MM-dd')

I get an exception from Hive:

FAILED: Hive Internal Error: java.lang.NullPointerException(null)
java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:1566)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.pushJoinFilters(SemanticAnalyzer.java:5254)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6754)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7531)
    at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Also, if I write the query like this:

*Q3*
INSERT OVERWRITE TABLE customer_orders_statistics
SELECT C_CUSTKEY FROM CUSTOMER
LEFT SEMI JOIN ORDERS ON CUSTOMER.C_CUSTKEY=ORDERS.O_CUSTKEY
*WHERE* unix_timestamp(ORDERS.O_ORDERDATE, 'yyyy-MM-dd') > unix_timestamp('1995-12-31', 'yyyy-MM-dd')

then the query executes (I am surprised that the right-hand table of a SEMI JOIN can be referenced in the WHERE clause here), but the result is wrong (compared to *Q1*; *Q1*'s result matches MySQL).
What is the rule of job name generation in Hive?
Hi all,

I find that the job names of Hive look like this: "INSERT OVERWRITE TABLE u...userID,neighborid(Stage-4)". What is the rule for generating such a name?
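From what I can tell, the name is just a truncated prefix of the query text with the stage id appended. A sketch of the settings that influence it (values illustrative, and the override behavior is my understanding rather than a documented guarantee):

    -- Length of the query prefix embedded in the job name:
    set hive.jobname.length=50;

    -- Supplying an explicit MapReduce job name should override the generated one:
    set mapred.job.name=my_custom_name;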
Re: How to get job names and stages of a query?
I actually want to get the job names of the stages via an API.

On March 20, 2012 at 2:23 PM, Manish Bhoge wrote:
> Whenever you submit an SQL query, a job id gets generated. You can open the
> job tracker at localhost:50030/jobtracker.jsp.
> It shows the jobs that are running and the rest of the details.
> Thanks,
> Manish
> Sent from my BlackBerry, pls excuse typo
> ------
> *From:* Felix.徐
> *Date:* Tue, 20 Mar 2012 12:58:53 +0800
> *To:*
> *ReplyTo:* user@hive.apache.org
> *Subject:* How to get job names and stages of a query?
>
> Hi all,
> I want to track the progress of a query. How can I get the job names,
> including the stages, of a query?
How to track query status in Hive via Thrift or anything else?
Hi all,

I didn't find any helpful API in ThriftHive that can track the execution status of Hive (or job progress). I want to get the execution progress of queries from Hive. How can I do that? Thanks!
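One workaround I can think of, outside the Hive Thrift interface entirely, is to poll the JobTracker with Hadoop's plain JobClient API and match the running jobs back to queries via Hive's generated job names. A sketch under the assumption that mapred.job.tracker is set in the classpath configuration:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobStatus;
    import org.apache.hadoop.mapred.RunningJob;

    public class HiveJobProgress {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        for (JobStatus status : client.getAllJobs()) {
          RunningJob job = client.getJob(status.getJobID());
          if (job == null || job.isComplete()) {
            continue;
          }
          // Hive embeds a prefix of the query text in the job name, so
          // the name can be matched back to the query being tracked.
          System.out.printf("%s  %s  map %.0f%%  reduce %.0f%%%n",
              job.getID(), job.getJobName(),
              job.mapProgress() * 100, job.reduceProgress() * 100);
        }
      }
    }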
Re: Is it possible to get the progress of a query through the Thrift server?
Can you provide some references? Thanks very much!

On March 12, 2012 at 11:28 PM, Edward Capriolo wrote:
> Yes. You have access to the job counters through Thrift, as well as a
> method to test whether a query is done.
>
> Edward
>
> On Mon, Mar 12, 2012 at 11:12 AM, Felix.徐 wrote:
> > Hi all,
> > I want to build a website to monitor the execution of queries sent to
> > Hive; is there any way to realize it?
Is it possible to get the progress of a query through the Thrift server?
Hi all,

I want to build a website to monitor the execution of queries sent to Hive. Is there any way to realize it?
Re: Showing wrong count after importing table in Hive
Hi,

I met the same problem once; after I changed the number of imported columns it worked fine. Sometimes blank rows are generated by Sqoop. I do not actually know what the problem really is.

2012/2/9 Bhavesh Shah
> Hello All,
>
> I have imported about 10 tables into Hive from MS SQL Server. But when I
> tried to cross-check the records in Hive in one of the tables, I found
> more records when I ran the query (select count(*) from tblName;).
>
> Then I dropped that table and imported it into Hive again. I observed in
> the console logs that (Retrieved 203 records). Then I tried
> (select count(*) from tblName;) again and I got a count of 298.
>
> I don't understand why this happens. Is anything wrong in the query, or
> does it happen due to some incorrect sqoop-import command?
>
> All the other tables' records are fine.
>
> I got stuck here and have spent much time searching for this. Please help
> me out with this.
>
> --
> Thanks and Regards,
> Bhavesh Shah