Get arguments' names in Hive's UDF

2013-07-21 Thread Felix .
Hi all,

Is there any API to retrieve an argument's column name in GenericUDF?
For example:

Select UDFTEST(columnA,columnB) from test;

I want to get the column names ("columnA" and "columnB") in
UDFTEST's initialize function via the ObjectInspectors, but I have not found
a viable solution.
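For context, a minimal GenericUDF sketch (the class name UdfTest is hypothetical) showing what initialize actually receives: the ObjectInspectors describe only the types of the arguments, not the columns they came from. The closest thing to the column names, as far as I can tell, is the argument expression strings passed to getDisplayString, which is only invoked for plan display.

import java.util.Arrays;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class UdfTest extends GenericUDF {

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments)
      throws UDFArgumentException {
    // Each ObjectInspector exposes type information only; there is no
    // reference back to the column ("columnA"/"columnB") it came from.
    for (ObjectInspector oi : arguments) {
      System.err.println(oi.getTypeName() + " / " + oi.getCategory());
    }
    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    return "dummy";  // placeholder result for the sketch
  }

  @Override
  public String getDisplayString(String[] children) {
    // children holds the argument expression strings, but this method is
    // only called when rendering the query plan, not at evaluation time.
    return "UDFTEST" + Arrays.toString(children);
  }
}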


Possible to specify reducers for each stage?

2013-07-02 Thread Felix .
Hi all,

Is it possible to specify the number of reducers for each stage? If so, how?

Thanks!
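For reference, a sketch of the relevant knobs in Hive of this era (property names worth double-checking against your release). As far as I know, mapred.reduce.tasks applies to every MapReduce stage of a query rather than per stage, while the estimator settings let Hive size each stage from its own input:

-- Force a fixed reducer count for every MapReduce stage of the query:
SET mapred.reduce.tasks=30;

-- Or let Hive estimate the count per stage from the input size:
SET mapred.reduce.tasks=-1;
SET hive.exec.reducers.bytes.per.reducer=256000000;  -- input bytes handled per reducer
SET hive.exec.reducers.max=99;                       -- cap on the estimated count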


Re: Performance difference between tuning reducer num and partition table

2013-06-30 Thread Felix .
Hi Dean,

Thanks for your reply. If I don't set the number of reducers in the 1st run,
Hive chooses far fewer reducers and the performance is worse. The total
output size is about 200 MB, and I see that many reduce output files are
empty; only 10 of them contain data.

Another question: is there any documentation about the job-specific
parameters of MapReduce and Hive?
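On the empty-output-file symptom, a hedged sketch of the small-file merge settings from Hive of this vintage (names and defaults worth verifying in hive-default.xml for your release):

-- Merge the small files left behind by map-only jobs and by reducers:
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;  -- target size of each merged file, in bytes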




2013/6/29 Dean Wampler 

> What happens if you don't set the number of reducers in the 1st run? How
> many reducers are executed? If it's a much smaller number, the extra
> overhead could matter. Another clue is the size of the files the first run
> produced, i.e., do you have 30 small (much less than a block size) files?
>
> On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐  wrote:
>
>> Hi Stephen,
>>
>> My query is actually more complex; Hive will generate 2 MapReduce jobs.
>> In the first solution it runs 17 mappers / 30 reducers and 10 mappers /
>> 30 reducers (the reducer count is set manually);
>> in the second solution it runs 6 mappers / 1 reducer and 4 mappers / 1
>> reducer for each partition.
>>
>> I do not know whether they could achieve the same performance if the
>> reducer count is set properly.
>>
>>
>> 2013/6/29 Stephen Sprague 
>>
>>> great question.  your parallelization seems to trump hadoop's. I guess
>>> I'd ask: what are the _total_ numbers of mappers and reducers that run
>>> on your cluster for these two scenarios? I'd be curious if they are the
>>> same.
>>>
>>>
>>>
>>>
>>> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐  wrote:
>>>
>>>> Hi all,
>>>>
>>>> Here is the scenario: suppose I have 2 tables A and B, and I would like
>>>> to perform a simple join on them.
>>>>
>>>> We can do it like this:
>>>>
>>>> INSERT OVERWRITE TABLE C
>>>> SELECT  FROM A JOIN B on A.id=B.id
>>>>
>>>> In order to speed up this query, since tables A and B have lots of data,
>>>> another approach is:
>>>>
>>>> Say I partition table A and B into 10 partitions respectively, and
>>>> write the query like this
>>>>
>>>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>>>> SELECT  FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>>>>
>>>> Then I run this query 10 times concurrently (pid ranges from 1 to 10).
>>>>
>>>> My question: in my observation of some more complex queries, the
>>>> second solution is about 15% faster than the first. Is that simply
>>>> because the reducer count is not optimal?
>>>> If resources are not a limit and it is possible to set the proper
>>>> reducer counts in the first solution, can they achieve the same
>>>> performance? Is there any other factor, besides the job parameters,
>>>> that can cause a performance difference between them (non-partitioned
>>>> vs. partitioned + concurrent)?
>>>>
>>>> Thanks!
>>>>
>>>
>>>
>>
>
>
> --
> Dean Wampler, Ph.D.
> @deanwampler
> http://polyglotprogramming.com


Re: Performance difference between tuning reducer num and partition table

2013-06-28 Thread Felix .
Hi Stephen,

My query is actually more complex; Hive will generate 2 MapReduce jobs.
In the first solution it runs 17 mappers / 30 reducers and 10 mappers /
30 reducers (the reducer count is set manually);
in the second solution it runs 6 mappers / 1 reducer and 4 mappers / 1
reducer for each partition.

I do not know whether they could achieve the same performance if the
reducer count is set properly.


2013/6/29 Stephen Sprague 

> great question.  your parallelization seems to trump hadoop's. I guess
> I'd ask: what are the _total_ numbers of mappers and reducers that run on
> your cluster for these two scenarios? I'd be curious if they are the same.
>
>
>
>
> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐  wrote:
>
>> Hi all,
>>
>> Here is the scenario: suppose I have 2 tables A and B, and I would like
>> to perform a simple join on them.
>>
>> We can do it like this:
>>
>> INSERT OVERWRITE TABLE C
>> SELECT  FROM A JOIN B on A.id=B.id
>>
>> In order to speed up this query, since tables A and B have lots of data,
>> another approach is:
>>
>> Say I partition table A and B into 10 partitions respectively, and write
>> the query like this
>>
>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>> SELECT  FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>>
>> Then I run this query 10 times concurrently (pid ranges from 1 to 10).
>>
>> My question: in my observation of some more complex queries, the second
>> solution is about 15% faster than the first. Is that simply because the
>> reducer count is not optimal?
>> If resources are not a limit and it is possible to set the proper reducer
>> counts in the first solution, can they achieve the same performance? Is
>> there any other factor, besides the job parameters, that can cause a
>> performance difference between them (non-partitioned vs. partitioned +
>> concurrent)?
>>
>> Thanks!
>>
>
>


Performance difference between tuning reducer num and partition table

2013-06-28 Thread Felix .
Hi all,

Here is the scenario: suppose I have 2 tables A and B, and I would like to
perform a simple join on them.

We can do it like this:

INSERT OVERWRITE TABLE C
SELECT  FROM A JOIN B on A.id=B.id

In order to speed up this query, since tables A and B have lots of data,
another approach is:

Say I partition table A and B into 10 partitions respectively, and write
the query like this

INSERT OVERWRITE TABLE C PARTITION(pid=1)
SELECT  FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1

Then I run this query 10 times concurrently (pid ranges from 1 to 10).

My question: in my observation of some more complex queries, the second
solution is about 15% faster than the first. Is that simply because the
reducer count is not optimal?
If resources are not a limit and it is possible to set the proper reducer
counts in the first solution, can they achieve the same performance? Is
there any other factor, besides the job parameters, that can cause a
performance difference between them (non-partitioned vs. partitioned +
concurrent)?

Thanks!
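If the reducer count is indeed the only difference, the first solution can be pinned to the same total parallelism before it runs. A minimal sketch, assuming the 30-reducer figure from this thread (the original column list did not survive in the archive, so A.id stands in for it):

SET mapred.reduce.tasks=30;

INSERT OVERWRITE TABLE C
SELECT A.id FROM A JOIN B ON (A.id = B.id);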


Re: Overwrite by selected data from table itself?

2013-05-29 Thread Felix .
I've done it this way many times; there must be some error in your script.
You may paste your script here.


2013/5/30 Stephen Sprague 

> I think it's a clever idea.   Can you reproduce this behavior via a simple
> example and show it here?   I ran a test on Hive 0.8.0 and it worked as you
> would expect.
>
> Regards,
> Stephen.
>
> hisql>select * from junk;
> +-+
> | _c0 |
> +-+
> | 1   |
> +-+
> 1 affected
>
> hisql>insert overwrite table junk select * from junk where _c0 != 1;
>
> hisql>select count(*) from junk;
> +-+
> | _c0 |
> +-+
> | 0   |
> +-+
> 1 affected
>
>
> On Wed, May 29, 2013 at 2:16 AM, Bhathiya Jayasekara <
> tobhathi...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have this scenario to remove certain rows from a Hive table. As far as
>> I understand, Hive doesn't provide that functionality.
>>
>> So, I'm trying to select inverse of what I want to delete and overwrite
>> the table with that. What do you think of this approach?
>>
>> I tried it, but it seems it doesn't work as I expect: all the data is
>> still in the table. If this is the correct behavior, what's the reason?
>>
>> Thanks.
>>
>
>


How to change the separator of INSERT OVERWRITE LOCAL DIRECTORY

2013-05-29 Thread Felix .
Hi all,

I am wondering how to change the field separator of INSERT OVERWRITE LOCAL
DIRECTORY. Does anyone have experience doing this? Thanks!
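For later readers: I believe Hive 0.11 (HIVE-3682) added ROW FORMAT support to this statement; on earlier releases a common workaround was to build the delimited line yourself with concat_ws. A sketch, assuming a table src with string columns col1 and col2:

-- Hive 0.11 and later:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT col1, col2 FROM src;

-- Earlier releases: concatenate the fields explicitly
-- (cast non-string columns to STRING first):
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out'
SELECT concat_ws(',', col1, col2) FROM src;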


Re: How to change the separator of input record in TRANSFORM of Hive

2013-05-23 Thread Felix .
Oh sorry, I found the solution on the wiki:
https://cwiki.apache.org/Hive/languagemanual-transform.html, by specifying
the inRowFormat and outRowFormat.
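Concretely, a sketch of that syntax applied to the query quoted below ('\0' matching the table's DDL; the script name is the one assumed in the original mail):

SELECT TRANSFORM(id, name)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'
  USING 'python script.py'
  AS (id, name)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'
FROM A;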


2013/5/24 Felix.徐 

> Hi all,
>
> I am trying to use TRANSFORM in Hive, but I cannot find a way to change
> the separator between fields of the input records in TRANSFORM.
>
> I create a table A by specifying ROW FORMAT DELIMITED FIELDS TERMINATED BY
> '\0'.
>
> However, while using
> SELECT TRANSFORM(id,name) USING 'python script.py'
> AS (id,name)
> FROM A
>
> I find that the input separator is TAB instead of '\0'.
> Does anyone know how to change it to '\0'? Thanks.
>


How to change the separator of input record in TRANSFORM of Hive

2013-05-23 Thread Felix .
Hi all,

I am trying to use TRANSFORM in Hive, but I cannot find a way to change the
separator between fields of the input records in TRANSFORM.

I create a table A by specifying ROW FORMAT DELIMITED FIELDS TERMINATED BY
'\0'.

However, while using
SELECT TRANSFORM(id,name) USING 'python script.py'
AS (id,name)
FROM A

I find that the input separator is TAB instead of '\0'.
Does anyone know how to change it to '\0'? Thanks.


Bugs exist in SEMI JOIN?

2012-11-21 Thread Felix .
Hi,
I am using version 0.9.0 and my tables are the same as the TPC-H
benchmark's.

Here is a simple query (it works correctly):

*Q1*
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN(
  SELECT O_CUSTKEY FROM ORDERS WHERE unix_timestamp(O_ORDERDATE,
'yyyy-MM-dd') > unix_timestamp('1995-12-31','yyyy-MM-dd')
 ) tempTable ON tempTable.O_CUSTKEY=CUSTOMER.C_CUSTKEY

It inserts the keys of customers who have orders after 1995-12-31
into another table.
But if I write the query like this:

*Q2*
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN ORDERS
 ON CUSTOMER.C_CUSTKEY=ORDERS.O_CUSTKEY
 *AND* unix_timestamp(ORDERS.O_ORDERDATE, 'yyyy-MM-dd') >
unix_timestamp('1995-12-31','yyyy-MM-dd')

I then get this exception from Hive:


FAILED: Hive Internal Error: java.lang.NullPointerException(null)
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:1566)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.pushJoinFilters(SemanticAnalyzer.java:5254)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6754)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7531)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


Also, if I write the query like this:
*Q3*
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN ORDERS
 ON CUSTOMER.C_CUSTKEY=ORDERS.O_CUSTKEY
 *WHERE* unix_timestamp(ORDERS.O_ORDERDATE, 'yyyy-MM-dd') >
unix_timestamp('1995-12-31','yyyy-MM-dd')

Then this query can be executed (so the right-hand side of a SEMI JOIN can
now be referenced in the WHERE clause?), but the result is wrong (compared
to *Q1*, whose result matches MySQL's).


What is the rule of job name generation in Hive?

2012-03-22 Thread Felix .
Hi all, I find that Hive's job names look like this: "INSERT OVERWRITE
TABLE u...userID,neighborid(Stage-4)".
What is the rule for generating such a name?
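As far as I know, the name is just a truncated rendering of the query text plus the stage id; the truncation length should be controlled by a property (hive.jobname.length, if memory serves; worth verifying for your release):

-- Allow more of the query text into the generated MapReduce job name:
SET hive.jobname.length=100;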


Re: How to get job names and stages of a query?

2012-03-20 Thread Felix .
I actually want to get the job names of the stages via an API.

On 2012-03-20 at 2:23 PM, Manish Bhoge wrote:

> Whenever you submit a SQL query, a job ID gets generated. You can open the
> job tracker at localhost:50030/jobtracker.jsp;
> it shows the jobs that are running and the rest of the details.
> Thanks,
> Manish
> Sent from my BlackBerry, pls excuse typo
> ------
> *From: * Felix.徐 
> *Date: *Tue, 20 Mar 2012 12:58:53 +0800
> *To: *
> *ReplyTo: * user@hive.apache.org
> *Subject: *How to get job names and stages of a query?
>
> Hi all,
> I want to track the progress of a query. How can I get the job names,
> including the stages, of a query?
>


How to track query status in hive via thrift or anything else?

2012-03-14 Thread Felix .
Hi all,
I didn't find any helpful API in ThriftHive that can track Hive's execution
status (or job progress). I want to get the execution progress of queries
from Hive. How can I do that? Thanks!


Re: Is it possible to get the progress of a query through thrift server?

2012-03-12 Thread Felix .
Can you provide some references? Thanks very much!

On 2012-03-12 at 11:28 PM, Edward Capriolo wrote:

> Yes. You have access to the job counters through thrift, as well as a
> method to test if a query is done.
>
> Edward
>
> On Mon, Mar 12, 2012 at 11:12 AM, Felix.徐  wrote:
> > Hi all,
> > I want to build a website to monitor the execution of queries sent to
> > Hive; is there any way to realize this?
>


Is it possible to get the progress of a query through thrift server?

2012-03-12 Thread Felix .
Hi all,
I want to build a website to monitor the execution of queries sent to Hive;
is there any way to realize this?


Re: Showing wrong count after importing table in Hive

2012-02-08 Thread Felix .
Hi, I met the same problem once; after I changed the number of imported
columns, it worked fine. Sometimes blank rows are generated by Sqoop. I do
not actually know what the real problem is.

2012/2/9 Bhavesh Shah 

> Hello All,
>
> I have imported about 10 tables into Hive from MS SQL Server. But when I
> tried to cross-check the records in Hive, in one of the tables I found
> more records when I ran the query (select count(*) from tblName;).
>
> Then I dropped that table and imported it into Hive again. I observed in
> the console logs "Retrieved 203 records", but when I tried
> (select count(*) from tblName;) again I got a count of 298.
>
> I don't understand why this happens. Is anything wrong in the query, or
> does it happen due to some incorrect sqoop-import command?
>
> All the other tables' records are fine.
>
> I am stuck here and have spent much time searching for this. Please help
> me out.
>
>
> --
> Thanks and Regards,
> Bhavesh Shah
>
>