Get arguments' names in Hive's UDF

2013-07-21 Thread Felix .
Hi all,

Is there any API to retrieve an argument's column name in GenericUDF?
For example:

Select UDFTEST(columnA,columnB) from test;

I want to get the column names ("columnA" and "columnB") in
UDFTEST's initialize function via the ObjectInspectors, but I have not found
a viable solution.
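For context, a minimal GenericUDF sketch (the class name UdfTest is hypothetical) showing what initialize actually receives: the ObjectInspectors describe only the types of the arguments, not the columns they came from. The closest thing to the column names, as far as I can tell, is the argument expression strings passed to getDisplayString, which is only invoked for plan display.

import java.util.Arrays;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class UdfTest extends GenericUDF {

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments)
      throws UDFArgumentException {
    // Each ObjectInspector exposes type information only; there is no
    // reference back to the column ("columnA"/"columnB") it came from.
    for (ObjectInspector oi : arguments) {
      System.err.println(oi.getTypeName() + " / " + oi.getCategory());
    }
    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    return "dummy";  // placeholder result for the sketch
  }

  @Override
  public String getDisplayString(String[] children) {
    // children holds the argument expression strings, but this method is
    // only called when rendering the query plan, not at evaluation time.
    return "UDFTEST" + Arrays.toString(children);
  }
}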


Possible to specify reducers for each stage?

2013-07-02 Thread Felix .
Hi all,

Is it possible to specify the number of reducers for each stage? If so, how?

Thanks!
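For reference, a sketch of the relevant knobs in Hive of this era (property names worth double-checking against your release). As far as I know, mapred.reduce.tasks applies to every MapReduce stage of a query rather than per stage, while the estimator settings let Hive size each stage from its own input:

-- Force a fixed reducer count for every MapReduce stage of the query:
SET mapred.reduce.tasks=30;

-- Or let Hive estimate the count per stage from the input size:
SET mapred.reduce.tasks=-1;
SET hive.exec.reducers.bytes.per.reducer=256000000;  -- input bytes handled per reducer
SET hive.exec.reducers.max=99;                       -- cap on the estimated count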


Re: Performance difference between tuning reducer num and partition table

2013-06-30 Thread Felix .
Hi Dean,

Thanks for your reply. If I don't set the number of reducers in the 1st run,
Hive chooses far fewer reducers and the performance is worse. The total
output size is about 200 MB, and I see that many reduce output files are
empty; only 10 of them contain data.

Another question: is there any documentation about the job-specific
parameters of MapReduce and Hive?
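On the empty-output-file symptom, a hedged sketch of the small-file merge settings from Hive of this vintage (names and defaults worth verifying in hive-default.xml for your release):

-- Merge the small files left behind by map-only jobs and by reducers:
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;  -- target size of each merged file, in bytes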




2013/6/29 Dean Wampler 

> What happens if you don't set the number of reducers in the 1st run? How
> many reducers are executed? If it's a much smaller number, the extra
> overhead could matter. Another clue is the size of the files the first run
> produced, i.e., do you have 30 small (much less than a block size) files?
>
> On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐  wrote:
>
>> Hi Stephen,
>>
>> My query is actually more complex; Hive will generate 2 MapReduce jobs.
>> In the first solution it runs 17 mappers / 30 reducers and 10 mappers /
>> 30 reducers (the reducer count is set manually);
>> in the second solution it runs 6 mappers / 1 reducer and 4 mappers / 1
>> reducer for each partition.
>>
>> I do not know whether they could achieve the same performance if the
>> reducer count is set properly.
>>
>>
>> 2013/6/29 Stephen Sprague 
>>
>>> great question.  your parallelization seems to trump hadoop's. I guess
>>> I'd ask: what are the _total_ numbers of mappers and reducers that run
>>> on your cluster for these two scenarios? I'd be curious if they are the
>>> same.
>>>
>>>
>>>
>>>
>>> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐  wrote:
>>>
>>>> Hi all,
>>>>
>>>> Here is the scenario: suppose I have 2 tables A and B, and I would like
>>>> to perform a simple join on them.
>>>>
>>>> We can do it like this:
>>>>
>>>> INSERT OVERWRITE TABLE C
>>>> SELECT  FROM A JOIN B on A.id=B.id
>>>>
>>>> In order to speed up this query, since tables A and B have lots of data,
>>>> another approach is:
>>>>
>>>> Say I partition table A and B into 10 partitions respectively, and
>>>> write the query like this
>>>>
>>>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>>>> SELECT  FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>>>>
>>>> Then I run this query 10 times concurrently (pid ranges from 1 to 10).
>>>>
>>>> My question: in my observation of some more complex queries, the
>>>> second solution is about 15% faster than the first. Is that simply
>>>> because the reducer count is not optimal?
>>>> If resources are not a limit and it is possible to set the proper
>>>> reducer counts in the first solution, can they achieve the same
>>>> performance? Is there any other factor, besides the job parameters,
>>>> that can cause a performance difference between them (non-partitioned
>>>> vs. partitioned + concurrent)?
>>>>
>>>> Thanks!
>>>>
>>>
>>>
>>
>
>
> --
> Dean Wampler, Ph.D.
> @deanwampler
> http://polyglotprogramming.com


Re: Performance difference between tuning reducer num and partition table

2013-06-28 Thread Felix .
Hi Stephen,

My query is actually more complex; Hive will generate 2 MapReduce jobs.
In the first solution it runs 17 mappers / 30 reducers and 10 mappers /
30 reducers (the reducer count is set manually);
in the second solution it runs 6 mappers / 1 reducer and 4 mappers / 1
reducer for each partition.

I do not know whether they could achieve the same performance if the
reducer count is set properly.


2013/6/29 Stephen Sprague 

> great question.  your parallelization seems to trump hadoop's. I guess
> I'd ask: what are the _total_ numbers of mappers and reducers that run on
> your cluster for these two scenarios? I'd be curious if they are the same.
>
>
>
>
> On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐  wrote:
>
>> Hi all,
>>
>> Here is the scenario: suppose I have 2 tables A and B, and I would like
>> to perform a simple join on them.
>>
>> We can do it like this:
>>
>> INSERT OVERWRITE TABLE C
>> SELECT  FROM A JOIN B on A.id=B.id
>>
>> In order to speed up this query, since tables A and B have lots of data,
>> another approach is:
>>
>> Say I partition table A and B into 10 partitions respectively, and write
>> the query like this
>>
>> INSERT OVERWRITE TABLE C PARTITION(pid=1)
>> SELECT  FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1
>>
>> Then I run this query 10 times concurrently (pid ranges from 1 to 10).
>>
>> My question: in my observation of some more complex queries, the second
>> solution is about 15% faster than the first. Is that simply because the
>> reducer count is not optimal?
>> If resources are not a limit and it is possible to set the proper reducer
>> counts in the first solution, can they achieve the same performance? Is
>> there any other factor, besides the job parameters, that can cause a
>> performance difference between them (non-partitioned vs. partitioned +
>> concurrent)?
>>
>> Thanks!
>>
>
>


Performance difference between tuning reducer num and partition table

2013-06-28 Thread Felix .
Hi all,

Here is the scenario: suppose I have 2 tables A and B, and I would like to
perform a simple join on them.

We can do it like this:

INSERT OVERWRITE TABLE C
SELECT  FROM A JOIN B on A.id=B.id

In order to speed up this query, since tables A and B have lots of data,
another approach is:

Say I partition table A and B into 10 partitions respectively, and write
the query like this

INSERT OVERWRITE TABLE C PARTITION(pid=1)
SELECT  FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1

Then I run this query 10 times concurrently (pid ranges from 1 to 10).

My question: in my observation of some more complex queries, the second
solution is about 15% faster than the first. Is that simply because the
reducer count is not optimal?
If resources are not a limit and it is possible to set the proper reducer
counts in the first solution, can they achieve the same performance? Is
there any other factor, besides the job parameters, that can cause a
performance difference between them (non-partitioned vs. partitioned +
concurrent)?

Thanks!
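If the reducer count is indeed the only difference, the first solution can be pinned to the same total parallelism before it runs. A minimal sketch, assuming the 30-reducer figure from this thread (the original column list did not survive in the archive, so A.id stands in for it):

SET mapred.reduce.tasks=30;

INSERT OVERWRITE TABLE C
SELECT A.id FROM A JOIN B ON (A.id = B.id);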


Re: Overwrite by selected data from table itself?

2013-05-29 Thread Felix .
I've done it this way many times; there must be some error in your script.
You may paste your script here.


2013/5/30 Stephen Sprague 

> I think it's a clever idea.   Can you reproduce this behavior via a simple
> example and show it here?   I ran a test on Hive 0.8.0 and it worked as you
> would expect.
>
> Regards,
> Stephen.
>
> hisql>select * from junk;
> +-+
> | _c0 |
> +-+
> | 1   |
> +-+
> 1 affected
>
> hisql>insert overwrite table junk select * from junk where _c0 != 1;
>
> hisql>select count(*) from junk;
> +-+
> | _c0 |
> +-+
> | 0   |
> +-+
> 1 affected
>
>
> On Wed, May 29, 2013 at 2:16 AM, Bhathiya Jayasekara <
> tobhathi...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have this scenario to remove certain rows from a Hive table. As far as
>> I understand, Hive doesn't provide that functionality.
>>
>> So, I'm trying to select inverse of what I want to delete and overwrite
>> the table with that. What do you think of this approach?
>>
>> I tried it, but it seems it doesn't work as I expect: all the data is
>> still in the table. If this is the correct behavior, what's the reason?
>>
>> Thanks.
>>
>
>


How to change the separator of INSERT OVERWRITE LOCAL DIRECTORY

2013-05-29 Thread Felix .
Hi all,

I am wondering how to change the field separator of INSERT OVERWRITE LOCAL
DIRECTORY. Does anyone have experience doing this? Thanks!
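For later readers: I believe Hive 0.11 (HIVE-3682) added ROW FORMAT support to this statement; on earlier releases a common workaround was to build the delimited line yourself with concat_ws. A sketch, assuming a table src with string columns col1 and col2:

-- Hive 0.11 and later:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT col1, col2 FROM src;

-- Earlier releases: concatenate the fields explicitly
-- (cast non-string columns to STRING first):
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out'
SELECT concat_ws(',', col1, col2) FROM src;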


Re: How to change the separator of input record in TRANSFORM of Hive

2013-05-23 Thread Felix .
Oh sorry, I found the solution on the wiki:
https://cwiki.apache.org/Hive/languagemanual-transform.html, by specifying
the inRowFormat and outRowFormat.
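Concretely, a sketch of that syntax applied to the query quoted below ('\0' matching the table's DDL; the script name is the one assumed in the original mail):

SELECT TRANSFORM(id, name)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'
  USING 'python script.py'
  AS (id, name)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'
FROM A;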


2013/5/24 Felix.徐 

> Hi all,
>
> I am trying to use TRANSFORM in Hive, but I cannot find a way to change
> the separator between fields of the input records in TRANSFORM.
>
> I create a table A by specifying ROW FORMAT DELIMITED FIELDS TERMINATED BY
> '\0'.
>
> However, while using
> SELECT TRANSFORM(id,name) USING 'python script.py'
> AS (id,name)
> FROM A
>
> I find that the input separator is TAB instead of '\0'.
> Does anyone know how to change it to '\0'? Thanks.
>


How to change the separator of input record in TRANSFORM of Hive

2013-05-23 Thread Felix .
Hi all,

I am trying to use TRANSFORM in Hive, but I cannot find a way to change the
separator between fields of the input records in TRANSFORM.

I create a table A by specifying ROW FORMAT DELIMITED FIELDS TERMINATED BY
'\0'.

However, while using
SELECT TRANSFORM(id,name) USING 'python script.py'
AS (id,name)
FROM A

I find that the input separator is TAB instead of '\0'.
Does anyone know how to change it to '\0'? Thanks.


Bugs exist in SEMI JOIN?

2012-11-21 Thread Felix .
Hi,
I am using version 0.9.0 and my tables are the same as the TPC-H
benchmark's.

Here is a simple query (it works correctly):

*Q1*
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN(
  SELECT O_CUSTKEY FROM ORDERS WHERE unix_timestamp(O_ORDERDATE,
'yyyy-MM-dd') > unix_timestamp('1995-12-31','yyyy-MM-dd')
 ) tempTable ON tempTable.O_CUSTKEY=CUSTOMER.C_CUSTKEY

It inserts the keys of customers who have orders after 1995-12-31
into another table.
But if I write the query like this:

*Q2*
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN ORDERS
 ON CUSTOMER.C_CUSTKEY=ORDERS.O_CUSTKEY
 *AND* unix_timestamp(ORDERS.O_ORDERDATE, 'yyyy-MM-dd') >
unix_timestamp('1995-12-31','yyyy-MM-dd')

I then get this exception from Hive:


FAILED: Hive Internal Error: java.lang.NullPointerException(null)
java.lang.NullPointerException
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:1566)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.pushJoinFilters(SemanticAnalyzer.java:5254)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6754)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7531)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:689)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


Also, if I write the query like this:
*Q3*
INSERT OVERWRITE TABLE customer_orders_statistics
 SELECT C_CUSTKEY FROM CUSTOMER
 LEFT SEMI JOIN ORDERS
 ON CUSTOMER.C_CUSTKEY=ORDERS.O_CUSTKEY
 *WHERE* unix_timestamp(ORDERS.O_ORDERDATE, 'yyyy-MM-dd') >
unix_timestamp('1995-12-31','yyyy-MM-dd')

Then this query can be executed (so the right-hand side of a SEMI JOIN can
now be referenced in the WHERE clause?), but the result is wrong (compared
to *Q1*, whose result matches MySQL's).


What is the rule of job name generation in Hive?

2012-03-22 Thread Felix .
Hi all, I find that Hive's job names look like this: "INSERT OVERWRITE
TABLE u...userID,neighborid(Stage-4)".
What is the rule for generating such a name?
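As far as I know, the name is just a truncated rendering of the query text plus the stage id; the truncation length should be controlled by a property (hive.jobname.length, if memory serves; worth verifying for your release):

-- Allow more of the query text into the generated MapReduce job name:
SET hive.jobname.length=100;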


Re: How to get job names and stages of a query?

2012-03-20 Thread Felix .
I actually want to get the job names of the stages via an API.

On 2012-03-20 at 2:23 PM, Manish Bhoge wrote:

> Whenever you submit a SQL query, a job ID gets generated. You can open the
> job tracker at localhost:50030/jobtracker.jsp;
> it shows the jobs that are running and the rest of the details.
> Thanks,
> Manish
> Sent from my BlackBerry, pls excuse typo
> ------
> *From: * Felix.徐 
> *Date: *Tue, 20 Mar 2012 12:58:53 +0800
> *To: *
> *ReplyTo: * user@hive.apache.org
> *Subject: *How to get job names and stages of a query?
>
> Hi all,
> I want to track the progress of a query. How can I get the job names,
> including the stages, of a query?
>


How to track query status in hive via thrift or anything else?

2012-03-14 Thread Felix .
Hi all,
I didn't find any helpful API in ThriftHive that can track Hive's execution
status (or job progress). I want to get the execution progress of queries
from Hive. How can I do that? Thanks!


Re: Is it possible to get the progress of a query through thrift server?

2012-03-12 Thread Felix .
Can you provide some references? Thanks very much!

On 2012-03-12 at 11:28 PM, Edward Capriolo wrote:

> Yes. You have access to the job counters through thrift, as well as a
> method to test if a query is done.
>
> Edward
>
> On Mon, Mar 12, 2012 at 11:12 AM, Felix.徐  wrote:
> > Hi all,
> > I want to build a website to monitor the execution of queries sent to
> > Hive; is there any way to realize this?
>


Is it possible to get the progress of a query through thrift server?

2012-03-12 Thread Felix .
Hi all,
I want to build a website to monitor the execution of queries sent to Hive;
is there any way to realize this?


Re: Showing wrong count after importing table in Hive

2012-02-08 Thread Felix .
Hi, I met the same problem once; after I changed the number of imported
columns, it worked fine. Sometimes blank rows are generated by Sqoop. I do
not actually know what the real problem is.

2012/2/9 Bhavesh Shah 

> Hello All,
>
> I have imported about 10 tables into Hive from MS SQL Server. But when I
> tried to cross-check the records in Hive, in one of the tables I found
> more records when I ran the query (select count(*) from tblName;).
>
> Then I dropped that table and imported it into Hive again. I observed in
> the console logs "Retrieved 203 records", but when I tried
> (select count(*) from tblName;) again I got a count of 298.
>
> I don't understand why this happens. Is anything wrong in the query, or
> does it happen due to some incorrect sqoop-import command?
>
> All the other tables' records are fine.
>
> I am stuck here and have spent much time searching for this. Please help
> me out.
>
>
> --
> Thanks and Regards,
> Bhavesh Shah
>
>