Re: question on create database

2015-04-02 Thread Chen Song
Got it. Thanks.

On Thu, Apr 2, 2015 at 11:29 AM, Alan Gates  wrote:

> When someone creates a table in your 'abc' database it should by default
> be in '/my/preferred/directory/_tablename_'.  However, users can specify
> locations for their tables which may not be in that directory.  AFAIK
> there's no way to prevent that.
>
> Alan.
>
>   Chen Song 
>  April 2, 2015 at 8:15
> I have a dumb question about the DDL statement "create database".
>
> Say if I create a database
> CREATE DATABASE abc
> LOCATION '/my/preferred/directory';
> Later on, when someone needs to create a table in this database, is there
> a way to force the table's location to be under /my/preferred/directory?
>
> I searched around but could not find a way to enforce this.
>
>
> --
> Chen Song
>
>


-- 
Chen Song


question on create database

2015-04-02 Thread Chen Song
I have a dumb question about the DDL statement "create database".

Say if I create a database

CREATE DATABASE abc
LOCATION '/my/preferred/directory';

When someone later needs to create a table in this database, is there a way
to force the table's location to be under /my/preferred/directory?

I searched around but could not find a way to enforce this.
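For reference, a sketch of the DDL involved (paths and table names hypothetical):

```sql
-- Create a database rooted at an explicit directory.
CREATE DATABASE abc
LOCATION '/my/preferred/directory';

-- By default, a new table lands under the database directory,
-- e.g. /my/preferred/directory/t1
CREATE TABLE abc.t1 (id INT);

-- But a user can still point a table elsewhere with an explicit
-- LOCATION; there is no built-in way to forbid this:
CREATE TABLE abc.t2 (id INT)
LOCATION '/somewhere/else/t2';
```

This matches the answer given in this thread: the database LOCATION only sets the default parent directory for its tables, not a constraint on them.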


-- 
Chen Song


Re: question on HIVE-5891

2014-11-18 Thread Chen Song
I haven't found a workaround yet.

On Thu, Nov 13, 2014 at 11:25 AM, Stéphane Verlet  wrote:

> Chen
>
> Did you find a workaround? Anybody else have a suggestion?
>
> Thank you
>
> Stephane
>
> On Mon, Aug 4, 2014 at 9:00 AM, Chen Song  wrote:
>
>> I am using the CDH5 distribution, and it doesn't look like this JIRA
>>
>> https://issues.apache.org/jira/browse/HIVE-5891
>>
>> has been backported into CDH 5.1.0.
>>
>> Is there a workaround to modify the query that is subject to this problem?
>>
>> --
>> Chen Song
>>
>>
>


-- 
Chen Song


Re: mapreduce.job.queuename doesn't work with hiveserver2

2014-10-10 Thread Chen Song
Thanks for the reply. I am using Hue so I may need to find a way to hack
this into Hue.

On Fri, Oct 10, 2014 at 7:39 AM, Navdeep Agrawal <
navdeep_agra...@symantec.com> wrote:

> You can change mapreduce.job.queuename by appending it to the connection
> string after ‘?’, like jdbc:hive2://url:1/db?mapreduce.job.queuename=q1
>
>
>
> *From:* Chen Song [mailto:chen.song...@gmail.com]
> *Sent:* Friday, October 10, 2014 2:38 AM
> *To:* user
> *Subject:* mapreduce.job.queuename doesn't work with hiveserver2
>
>
>
> By setting mapreduce.job.queuename in hive-site.xml, all queries submitted
> via HiveServer or the Hive CLI go to the specified queue.
>
>
> However, HiveServer2 doesn't appear to respect this value at all. The
> queue is chosen based on the user who submitted the query. I tried with
> beeline and JDBC.
>
>
>
> Is there a way I can configure a specific queue for queries going through
> hiveserver2?
>
>
>
> --
> Chen Song
>
>
>



-- 
Chen Song


mapreduce.job.queuename doesn't work with hiveserver2

2014-10-09 Thread Chen Song
By setting mapreduce.job.queuename in hive-site.xml, all queries submitted
via HiveServer or the Hive CLI go to the specified queue.

However, HiveServer2 doesn't appear to respect this value at all. The queue
is chosen based on the user who submitted the query. I tried with beeline
and JDBC.

Is there a way I can configure a specific queue for queries going through
hiveserver2?
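For illustration, a sketch of the workarounds discussed in this thread (queue name hypothetical):

```sql
-- Per-session override once connected, e.g. from beeline:
SET mapreduce.job.queuename=q1;
```

Alternatively, the property can reportedly be appended to the HiveServer2 JDBC connection string after '?', so it is applied when the session is opened.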

-- 
Chen Song


Re: hive auto join conversion

2014-08-12 Thread Chen Song
Yeah, I was trying the same thing, though it's a little bit ugly.

My query needs to left-join/join with multiple tables. When there are 1 or 2
joins, the rewriting works, but when there are > 3 tables, I got the same
exception, triggered by the following bug.

https://issues.apache.org/jira/browse/HIVE-5891

Chen


On Wed, Jul 30, 2014 at 10:07 PM, Eugene Koifman 
wrote:

> would manually rewriting the query from (T1 union all T2) LOJ S to
> equivalent (T1 LOJ S) union all (T2 LOJ S) help work around this issue?
>
>
> On Wed, Jul 30, 2014 at 6:19 PM, Chen Song  wrote:
>
>> I tried that and I got the following error.
>>
>> FAILED: SemanticException [Error 10227]: Not all clauses are supported
>> with mapjoin hint. Please remove mapjoin hint.
>>
>> I then tried turning off auto join conversion.
>>
>> set hive.auto.convert.join=false
>>
>> But no luck, same error.
>>
>> Looks like it is a known issue,
>>
>>
>> http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/bk_releasenotes_hdp_2.0/content/ch_relnotes-hdp2.0.0.2-5-2.html
>>
>> Chen
>>
>>
>>
>>
>> On Wed, Jul 30, 2014 at 9:10 PM, Navis류승우  wrote:
>>
>>> Could you do it with hive.ignore.mapjoin.hint=false? Mapjoin hint is
>>> ignored from hive-0.11.0 by default (see
>>> https://issues.apache.org/jira/browse/HIVE-4042)
>>>
>>> Thanks,
>>> Navis
>>>
>>>
>>> 2014-07-31 10:04 GMT+09:00 Chen Song :
>>>
>>> I am using cdh5 with hive 0.12. We have some hive jobs migrated from
>>>> hive 0.10 and they are written like below:
>>>>
>>>> select /*+ MAPJOIN(sup) */ c1, c2, sup.c
>>>> from
>>>> (
>>>> select key, c1, c2 from table1
>>>> union all
>>>> select key, c1, c2 from table2
>>>> ) table
>>>> left outer join
>>>> sup
>>>> on (table.c1 = sup.key)
>>>> distribute by c1
>>>>
>>>> In Hive 0.10 (CDH4), Hive translates the left outer join into a map
>>>> join (map only job), followed by a regular MR job for distribute by.
>>>>
>>>> In Hive 0.12 (CDH5), Hive is not able to convert the join into a map
>>>> join. Instead it launches a common map reduce for the join, followed by
>>>> another mr for distribute by. However, when I take out the union all
>>>> operator, Hive seems to be able to create a single MR job, with map join on
>>>> map phase, and reduce for distribute by.
>>>>
>>>> I read a bit on
>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
>>>> and found out that there are some restrictions on map side join
>>>> starting Hive 0.11. The following are not supported.
>>>>
>>>>
>>>>    - Union Followed by a MapJoin
>>>>    - Lateral View Followed by a MapJoin
>>>>    - Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By)
>>>>    Followed by MapJoin
>>>>    - MapJoin Followed by Union
>>>>    - MapJoin Followed by Join
>>>>    - MapJoin Followed by MapJoin
>>>>
>>>>
>>>> So if one side of the table (big side) is a union of some tables and
>>>> the other side is a small table, Hive would not be able to do a map join at
>>>> all? Is that correct?
>>>>
>>>> If correct, what should I do to make the job backward compatible?
>>>>
>>>> --
>>>> Chen Song
>>>>
>>>>
>>>
>>
>>
>> --
>> Chen Song
>>
>>
>
>
> --
>
> Thanks,
> Eugene
>




-- 
Chen Song


question on HIVE-5891

2014-08-04 Thread Chen Song
I am using the CDH5 distribution, and it doesn't look like this JIRA

https://issues.apache.org/jira/browse/HIVE-5891

has been backported into CDH 5.1.0.

Is there a workaround to modify the query that is subject to this problem?

-- 
Chen Song


Re: hive auto join conversion

2014-07-30 Thread Chen Song
I tried that and I got the following error.

FAILED: SemanticException [Error 10227]: Not all clauses are supported with
mapjoin hint. Please remove mapjoin hint.

I then tried turning off auto join conversion.

set hive.auto.convert.join=false

But no luck, same error.

Looks like it is a known issue,

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/bk_releasenotes_hdp_2.0/content/ch_relnotes-hdp2.0.0.2-5-2.html

Chen




On Wed, Jul 30, 2014 at 9:10 PM, Navis류승우  wrote:

> Could you do it with hive.ignore.mapjoin.hint=false? Mapjoin hint is
> ignored from hive-0.11.0 by default (see
> https://issues.apache.org/jira/browse/HIVE-4042)
>
> Thanks,
> Navis
>
>
> 2014-07-31 10:04 GMT+09:00 Chen Song :
>
> I am using cdh5 with hive 0.12. We have some hive jobs migrated from hive
>> 0.10 and they are written like below:
>>
>> select /*+ MAPJOIN(sup) */ c1, c2, sup.c
>> from
>> (
>> select key, c1, c2 from table1
>> union all
>> select key, c1, c2 from table2
>> ) table
>> left outer join
>> sup
>> on (table.c1 = sup.key)
>> distribute by c1
>>
>> In Hive 0.10 (CDH4), Hive translates the left outer join into a map join
>> (map only job), followed by a regular MR job for distribute by.
>>
>> In Hive 0.12 (CDH5), Hive is not able to convert the join into a map
>> join. Instead it launches a common map reduce for the join, followed by
>> another mr for distribute by. However, when I take out the union all
>> operator, Hive seems to be able to create a single MR job, with map join on
>> map phase, and reduce for distribute by.
>>
>> I read a bit on
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
>> and found out that there are some restrictions on map side join starting
>> Hive 0.11. The following are not supported.
>>
>>
>>- Union Followed by a MapJoin
>>- Lateral View Followed by a MapJoin
>>- Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By)
>>Followed by MapJoin
>>- MapJoin Followed by Union
>>- MapJoin Followed by Join
>>- MapJoin Followed by MapJoin
>>
>>
>> So if one side of the table (big side) is a union of some tables and the
>> other side is a small table, Hive would not be able to do a map join at
>> all? Is that correct?
>>
>> If correct, what should I do to make the job backward compatible?
>>
>> --
>> Chen Song
>>
>>
>


-- 
Chen Song


hive auto join conversion

2014-07-30 Thread Chen Song
I am using cdh5 with hive 0.12. We have some hive jobs migrated from hive
0.10 and they are written like below:

select /*+ MAPJOIN(sup) */ c1, c2, sup.c
from
(
select key, c1, c2 from table1
union all
select key, c1, c2 from table2
) table
left outer join
sup
on (table.c1 = sup.key)
distribute by c1

In Hive 0.10 (CDH4), Hive translates the left outer join into a map join
(map only job), followed by a regular MR job for distribute by.

In Hive 0.12 (CDH5), Hive is not able to convert the join into a map join.
Instead it launches a common map reduce for the join, followed by another
mr for distribute by. However, when I take out the union all operator, Hive
seems to be able to create a single MR job, with map join on map phase, and
reduce for distribute by.

I read a bit on
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
and found out that there are some restrictions on map side join starting
Hive 0.11. The following are not supported.


   - Union Followed by a MapJoin
   - Lateral View Followed by a MapJoin
   - Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By) Followed
   by MapJoin
   - MapJoin Followed by Union
   - MapJoin Followed by Join
   - MapJoin Followed by MapJoin


So if one side of the table (big side) is a union of some tables and the
other side is a small table, Hive would not be able to do a map join at
all? Is that correct?

If correct, what should I do to make the job backward compatible?
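As a sketch of one possible rewrite (pushing the small-table join into each union branch so each side remains eligible for a map join; aliases follow the query above, and behavior may vary by Hive version):

```sql
SELECT /*+ MAPJOIN(sup) */ t1.c1, t1.c2, sup.c
FROM (SELECT key, c1, c2 FROM table1) t1
LEFT OUTER JOIN sup ON (t1.c1 = sup.key)
UNION ALL
SELECT /*+ MAPJOIN(sup) */ t2.c1, t2.c2, sup.c
FROM (SELECT key, c1, c2 FROM table2) t2
LEFT OUTER JOIN sup ON (t2.c1 = sup.key)
```

The DISTRIBUTE BY then needs to move into an outer SELECT wrapping the union. This is essentially the rewrite suggested elsewhere in this thread, and per HIVE-5891 it can still fail when more tables are involved.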

-- 
Chen Song


php thrift client for hiveserver2

2014-05-05 Thread Chen Song
I am using CDH5 with hive 0.12.0.

I dug around a bit on the internet but could not find sample PHP code to
connect to HiveServer2. I know there are PHP files generated by Thrift, but
they seem broken, as shown below.

php -l /usr/lib/hive/lib/php/packages/hive_service/TCLIService.php
PHP Parse error:  syntax error, unexpected ';', expecting T_STRING or
T_NS_SEPARATOR or '{' in
/usr/lib/hive/lib/php/packages/hive_service/TCLIService.php on line 2

Parse error: syntax error, unexpected ';', expecting T_STRING or
T_NS_SEPARATOR or '{' in
/usr/lib/hive/lib/php/packages/hive_service/TCLIService.php on line 2
Errors parsing /usr/lib/hive/lib/php/packages/hive_service/TCLIService.php

Is PHP Thrift no longer supported by HiveServer2?

Chen


Re: the php client seems to be broken

2014-04-23 Thread Chen Song
I am also experiencing this syntax issue in
/usr/lib/hive/lib/php/packages/hive_service/ThriftHive.php. Does anyone know
how to fix this?

[Wed Apr 23 16:23:23 2014] [error] [client 127.0.0.1] PHP Parse error:
 syntax error, unexpected ';', expecting T_STRING or T_NS_SEPARATOR or '{'
in /usr/lib/hive/lib/php/packages/hive_service/ThriftHive.php on line 2

The first few lines of the PHP file contain a bare namespace; statement (no
namespace name), which is what triggers the parse error.

 wrote:

>  I have been researching for a week now on how to use the PHP hive client.
> https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-PHP
> but the $GLOBALS['THRIFT_ROOT'] = '/lib/php/'; setup is broken. One thing
> is that the autoloader is not generated properly. Second,
> packages/hive_service/ThriftHive.php has a bare namespace; at the top, so a
> namespace is not even defined.
> Third, if I just start including files instead of using the autoloader, I
> get the error "Fatal error: Class 'Thrift\Exception\TException' not found in
> /lib/php/packages/hive_metastore/metastore/Types.php on line 4891" even
> though the file is included.
>
> Any one know how to resolve these issues?
>
>
>
> On 03/17/2014 11:45 AM, Jeremy wrote:
>
> I am trying to use the PHP Hive Thrift client and there are namespace and
> autoload issues.
> The autoloader is not generated, and I have not found any instructions on
> how to properly generate it.
> Also, some files contain a bare namespace; which is a syntax error.
> Is anyone aware of these issues, and does anyone know how to fix them?
>
>
>
>
>


-- 
Chen Song


Re: single MR stage for join and group by

2013-08-02 Thread Chen Song
We are currently using Cloudera's CDH 4.2.0, based on Hive 0.10.0, so
neither optimization was incorporated.

Thank you all for the valuable feedback and this is really helpful. I will
look further into the details of the JIRA.







On Fri, Aug 2, 2013 at 12:14 AM, Yin Huai  wrote:

> If the join is a reduce side join,
> https://issues.apache.org/jira/browse/HIVE-2206 will optimize this query
> and generate a single MR job. The optimizer introduced by HIVE-2206 is in
> trunk. Currently, it only handles the same column(s).
>
> If the join is a MapJoin, hive 0.11 can generate a single MR job (In this
> case, if join and group by use the same column(s) does not matter). To
> enable it, you need to ...
> set hive.auto.convert.join=true
> set hive.auto.convert.join.noconditionaltask=true;
> set hive.optimize.mapjoin.mapreduce=true;
> and also make sure hive.auto.convert.join.noconditionaltask.size is larger
> than the size of the small table.
> For hive trunk, https://issues.apache.org/jira/browse/HIVE-4827 drops the
> flag of "hive.optimize.mapjoin.mapreduce". So, in future release, you will
> not need to set hive.optimize.mapjoin.mapreduce.
>
> Thanks,
>
> Yin
>
>
> On Thu, Aug 1, 2013 at 5:32 PM, Stephen Sprague wrote:
>
>> And what version of Hive are you running your test on? I do believe -
>> not certain - that Hive 0.11 includes the optimization you seek.
>>
>>
>> On Thu, Aug 1, 2013 at 10:19 AM, Chen Song wrote:
>>
>>> Suppose we have 2 simple tables
>>>
>>> A
>>> id int
>>> value string
>>>
>>> B
>>> id
>>>
>>> When hive translates the following query
>>>
>>> select max(A.value), A.id from A join B on A.id = B.id group by A.id;
>>>
>>> It launches 2 stages, one for the join and one for the group by.
>>>
>>> My understanding is that if the join key set is a subset of the group
>>> by key set, it can be achieved in the same map reduce job. If that is
>>> correct in theory, could it be a feature in hive?
>>>
>>> Chen
>>>
>>>
>>
>


-- 
Chen Song


single MR stage for join and group by

2013-08-01 Thread Chen Song
Suppose we have 2 simple tables

A
id int
value string

B
id

When hive translates the following query

select max(A.value), A.id from A join B on A.id = B.id group by A.id;

It launches 2 stages, one for the join and one for the group by.

My understanding is that if the join key set is a subset of the group by
key set, it can be achieved in the same map reduce job. If that is correct
in theory, could it be a feature in hive?
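For reference, the settings reported later in this thread that let Hive 0.11 plan such a query as a single MR job when the join is a map join:

```sql
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.optimize.mapjoin.mapreduce=true;
-- hive.auto.convert.join.noconditionaltask.size must also exceed
-- the size of the small table.
```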

Chen


Re: Hive producing difference outputs

2012-12-19 Thread Chen Song
resent as no one has replied yet :)

On Thu, Nov 15, 2012 at 11:18 AM, Chen Song  wrote:

> Hi Folks
>
> We are getting inconsistent output when running some semantically
> identical Hive queries. We are using *CDH3u3* with *Hive 0.7.1.*
>
> The query I am running performs a map-side join of two subqueries (s1 and
> s2)--s1 is a multi-table join and s2 is a union of a table against itself.
> If we add a map-side join to s1 (which is necessary for performance), the
> query produces different (and incorrect) results.
>
> Basically, s1 returns one row and s2 returns 2 rows. By joining them, it
> should give us two rows. See below for output and queries (I abridged the
> queries a bit for readability).
>
> NOTE: I searched online and found another thread reporting a similar issue
> as what I have seen,
> http://mail-archives.apache.org/mod_mbox/hive-user/201207.mbox/%3CCADejTpz5rSf-Sxi9HNoCTRHVazOk=F+bu6XPL=k7tbknqj9...@mail.gmail.com%3E.
> Unfortunately, this thread never got answered. It looks like a bug in Hive.
> Greatly appreciate if anyone give thoughts on this issue.
>
>
> Q1 output (incorrect, missing one row):
> 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094
> 2012-08-15 18:56:38 1315 0 US 728 90 0 97525 1 0 NULL 1 0 NULL
>
> Q2 output (correct):
> 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094
> 2012-08-15 18:56:38 1315 0 US 728 90 0 97525 1 0 NULL 10 NULL 0
> 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094
> 2012-08-15 18:56:38 1315 0 US 728 90 0 97525 1 0 NULL 10 NULL 1
>
> Q1 and Q2 differ only in that s1 in Q1 has a map-side join.
>
>- Q1 (incorrect results, highlighted portion is the only difference
>from Q2):
>
>
>  SELECT /*+MAPJOIN(table) */ *
> FROM
> (
>   SELECT /*+ MAPJOIN(tableA,tableB) */
>id,
>...
>   FROM main_table ctg
>   JOIN sample smp
> ON ctg.id = smp.id
>   LEFT OUTER JOIN tableA
> ON smp.publisher_id = tableA.id
>   LEFT OUTER JOIN tableB
> ON smp.inventory_source_id = tableB.id
>   WHERE ctg.date_time >= '2012-08-15 00:00:00' and
> ctg.date_time <= '2012-08-15 23:59:59' and
> ctg.dd = '2012-08-15' and
> smp.date_time >= '2012-08-15 00:00:00' and
> smp.date_time <= '2012-08-15 23:59:59' and
> smp.dd = '2012-08-15'
> )sub
> JOIN
>  (
>   select
> id,
> parent_category_id,
> cast(0 as tinyint) as is_parent
>   from table_category
>   where is_system = 1
>   union all
>   select
> id,
> parent_category_id,
> cast(1 as tinyint) as is_parent
>   from table_category
>   where is_system = 1
> )table
>  ON table.id = sub.id
>
>
>- Q2 (correct results):
>
>  SELECT /*+MAPJOIN(table) */ *
> FROM
> (
>   SELECT
>id,
>...
>   FROM main_table ctg
>   JOIN sample smp
> ON ctg.id = smp.id
>   LEFT OUTER JOIN tableA
> ON smp.publisher_id = tableA.id
>   LEFT OUTER JOIN tableB
> ON smp.inventory_source_id = tableB.id
>   WHERE ctg.date_time >= '2012-08-15 00:00:00' and
> ctg.date_time <= '2012-08-15 23:59:59' and
> ctg.dd = '2012-08-15' and
> smp.date_time >= '2012-08-15 00:00:00' and
> smp.date_time <= '2012-08-15 23:59:59' and
> smp.dd = '2012-08-15'
> )sub
> JOIN
>  (
>   select
> id,
> parent_category_id,
> cast(0 as tinyint) as is_parent
>   from table_category
>   where is_system = 1
>   union all
>   select
> id,
> parent_category_id,
> cast(1 as tinyint) as is_parent
>   from table_category
>   where is_system = 1
> )table
>  ON table.id = sub.id
>
> --
> Chen Song
>
>
>


-- 
Chen Song


Re: map side join with group by

2012-12-13 Thread Chen Song
Thanks Nitin. This is all I want to clarify :)

Chen

On Thu, Dec 13, 2012 at 2:30 PM, Nitin Pawar wrote:

> to improve the speed of the job they created map-only joins so that all
> the records associated with a key fall to a mapper; reducers slow it down.
> If the reducer has to do some more work, then they launch another job.
>
> Bear in mind, when we say map-only join we are absolutely sure that speed
> will increase only when the data in one of the tables is in the
> few-hundred-MB range. If a reducer gets involved, the processing logic
> completely changes and it also slows down.
>
> Launching a new job for group by is a neat way to measure how much time
> you spent on just the join and how much on the group by, so you can easily
> see the two different things.
>
> There is no way you can ask a mapjoin to launch a reducer, as it is not
> supposed to.
>
> If you have such a case (maybe if you think that it will improve
> performance), please feel free to raise a JIRA and get it reviewed. If it's
> valid I think people will provide more ideas.
>
>
> On Fri, Dec 14, 2012 at 12:42 AM, Chen Song wrote:
>
>> Nitin
>>
>> Yeah. My original question is that is there a way to force Hive (or
>> rather to say, is it possible) to execute map side join at mapper phase and
>> group by in reduce phase. So instead of launching a map only job (join) and
>> map reduce job (group by), doing it altogether in a single MR job. This is
>> obviously not what Hive does but I am wondering if it is a nice feature to
>> have.
>>
>> The point you made (different keys in join and group by) only matters
>> when it is the time in reduce phase, right? As map side join takes care of
>> join at mapper phase, it sounds to me natural that group by can be done in
>> the reduce phase in the same job. The only hassle that I can think of is
>> that the map output has to be re-sorted (based on the group by keys).
>>
>> Chen
>>
>> On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar wrote:
>>
>>> chen in mapside join .. there are no reducers .. its MAP ONLY job
>>>
>>>
>>> On Thu, Dec 13, 2012 at 11:54 PM, Chen Song wrote:
>>>
>>>> Understood the fact that it is impossible in the same MR job if both
>>>> join and group by are gonna happen in the reduce phase (because the join
>>>> keys and group by keys are different). But for map side join, the joins
>>>> would be complete by the end of the map phase, and outputs should be ready
>>>> to be distributed to reducers based on group by keys.
>>>>
>>>> Chen
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar 
>>>> wrote:
>>>>
>>>>> That's because for the first job the join keys and for the second job
>>>>> the group-by keys are different; you just can't assume the join keys
>>>>> and group-by keys will be the same, so they are two different jobs
>>>>>
>>>>>
>>>>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song wrote:
>>>>>
>>>>>> Yeah, my abridged version of the query might be a little broken but my
>>>>>> point is that when a query has a map join and group by, even in its
>>>>>> simplified incarnation, it will launch two jobs. I was just wondering why
>>>>>> map join and group by cannot be accomplished in one MR job.
>>>>>>
>>>>>> Best,
>>>>>> Chen
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <
>>>>>> nitinpawar...@gmail.com> wrote:
>>>>>>
>>>>>>> I think Chen wanted to know why this is two phased query if I
>>>>>>> understood it correctly
>>>>>>>
>>>>>>> When you run a mapside join .. it just performs the join query ..
>>>>>>> after that to execute the group by part it launches the second job.
>>>>>>> I may be wrong but this is how I saw it whenever I executed group by
>>>>>>> queries
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>>>>> grover.markgro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Chen,
>>>>>>>> I think we would need some more information.
>>>>>>>>
>>>>>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>>>>>> there is no such table in the query.

Re: map side join with group by

2012-12-13 Thread Chen Song
Nitin

Yeah. My original question is that is there a way to force Hive (or rather
to say, is it possible) to execute map side join at mapper phase and group
by in reduce phase. So instead of launching a map only job (join) and map
reduce job (group by), doing it altogether in a single MR job. This is
obviously not what Hive does but I am wondering if it is a nice feature to
have.

The point you made (different keys in join and group by) only matters when
it is the time in reduce phase, right? As map side join takes care of join
at mapper phase, it sounds to me natural that group by can be done in the
reduce phase in the same job. The only hassle that I can think of is that
the map output has to be re-sorted (based on the group by keys).

Chen

On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar wrote:

> chen in mapside join .. there are no reducers .. its MAP ONLY job
>
>
> On Thu, Dec 13, 2012 at 11:54 PM, Chen Song wrote:
>
>> Understood the fact that it is impossible in the same MR job if both
>> join and group by are gonna happen in the reduce phase (because the join
>> keys and group by keys are different). But for map side join, the joins
>> would be complete by the end of the map phase, and outputs should be ready
>> to be distributed to reducers based on group by keys.
>>
>> Chen
>>
>>
>> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar wrote:
>>
>>> That's because for the first job the join keys and for the second job
>>> the group-by keys are different; you just can't assume the join keys and
>>> group-by keys will be the same, so they are two different jobs
>>>
>>>
>>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song wrote:
>>>
>>>> Yeah, my abridged version of the query might be a little broken but my
>>>> point is that when a query has a map join and group by, even in its
>>>> simplified incarnation, it will launch two jobs. I was just wondering why
>>>> map join and group by cannot be accomplished in one MR job.
>>>>
>>>> Best,
>>>> Chen
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar 
>>>> wrote:
>>>>
>>>>> I think Chen wanted to know why this is two phased query if I
>>>>> understood it correctly
>>>>>
>>>>> When you run a mapside join .. it just performs the join query ..
>>>>> after that to execute the group by part it launches the second job.
>>>>> I may be wrong but this is how I saw it whenever I executed group by
>>>>> queries
>>>>>
>>>>>
>>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>>> grover.markgro...@gmail.com> wrote:
>>>>>
>>>>>> Hi Chen,
>>>>>> I think we would need some more information.
>>>>>>
>>>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>>>> there is no such table in the query. Moreover, map joins only make
>>>>>> sense when the right table is the one being "mapped" (in other words,
>>>>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>>>>> left table is the one being "mapped" in case of a Right Outer Join.
>>>>>> Let me know if this is not clear, I'd be happy to offer a better
>>>>>> explanation.
>>>>>>
>>>>>> In your query, the where clause on a column called "hour", at this
>>>>>> point I am unsure if that's a column of table1 or table2. If it's
>>>>>> a column of table1, that predicate would get pushed up (if you have
>>>>>> hive.optimize.ppd property set to true), so it could possibly be done
>>>>>> in 1 MR job (I am not sure if that's presently the case, you will have
>>>>>> to check the explain plan). If however, the where clause is on a
>>>>>> column in the right table (table2 in your example), it can't be pushed
>>>>>> up since a column of the right table can have different values before
>>>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>>>>> to be applied in a separate MR job.
>>>>>>
>>>>>> This is just my understanding; the foolproof answer would lie in
>>>>>> checking out the explain plans and the Semantic Analyzer code.
>>>>>>
>>>>>> And for completeness, there is a conditional task (starting Hive 0.7)
>>>>>> that will convert your joins automatically to map joins where
>>>>>> applicable. This can be enabled by enabling the hive.auto.convert.join
>>>>>> property.

Re: map side join with group by

2012-12-13 Thread Chen Song
Understood the fact that it is impossible in the same MR job if both the join
and group by are gonna happen in the reduce phase (because the join keys
and group by keys are different). But for map side join, the joins would be
complete by the end of the map phase, and outputs should be ready to be
distributed to reducers based on group by keys.
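A minimal sketch of the pattern being discussed (table and column names hypothetical):

```sql
SELECT /*+ MAPJOIN(small) */ big.a, SUM(big.b)
FROM big
JOIN small ON big.id = small.id
GROUP BY big.a;
-- In the Hive versions discussed here, this plans as two jobs: a map-only
-- job for the map join, then a separate MR job for the group by.
```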

Chen

On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar wrote:

> That's because for the first job the join keys and for the second job the
> group-by keys are different; you just can't assume the join keys and
> group-by keys will be the same, so they are two different jobs
>
>
> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song  wrote:
>
>> Yeah, my abridged version of the query might be a little broken but my point
>> is that when a query has a map join and group by, even in its simplified
>> incarnation, it will launch two jobs. I was just wondering why map join and
>> group by cannot be accomplished in one MR job.
>>
>> Best,
>> Chen
>>
>>
>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar wrote:
>>
>>> I think Chen wanted to know why this is two phased query if I understood
>>> it correctly
>>>
>>> When you run a mapside join .. it just performs the join query .. after
>>> that to execute the group by part it launches the second job.
>>> I may be wrong but this is how I saw it whenever I executed group by
>>> queries
>>>
>>>
>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>> grover.markgro...@gmail.com> wrote:
>>>
>>>> Hi Chen,
>>>> I think we would need some more information.
>>>>
>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>> there is no such table in the query. Moreover, map joins only make
>>>> sense when the right table is the one being "mapped" (in other words,
>>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>>> left table is the one being "mapped" in case of a Right Outer Join.
>>>> Let me know if this is not clear, I'd be happy to offer a better
>>>> explanation.
>>>>
>>>> In your query, the where clause on a column called "hour", at this
>>>> point I am unsure if that's a column of table1 or table2. If it's
>>>> a column of table1, that predicate would get pushed up (if you have
>>>> hive.optimize.ppd property set to true), so it could possibly be done
>>>> in 1 MR job (I am not sure if that's presently the case, you will have
>>>> to check the explain plan). If however, the where clause is on a
>>>> column in the right table (table2 in your example), it can't be pushed
>>>> up since a column of the right table can have different values before
>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>>> to be applied in a separate MR job.
>>>>
>>>> This is just my understanding; the foolproof answer would lie in
>>>> checking out the explain plans and the Semantic Analyzer code.
>>>>
>>>> And for completeness, there is a conditional task (starting Hive 0.7)
>>>> that will convert your joins automatically to map joins where
>>>> applicable. This can be enabled by enabling hive.auto.convert.join
>>>> property.
>>>>
>>>> Mark
>>>>
>>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song 
>>>> wrote:
>>>> > I have a silly question on how Hive interprets a simple query with
>>>> > both map side join and group by.
>>>> >
>>>> > The query below will translate into two jobs, with the 1st one as a
>>>> > map-only job doing the join and storing the output in an intermediate
>>>> > location, and the 2nd one as a map-reduce job taking the output of the
>>>> > 1st job as input and doing the group by.
>>>> >
>>>> > SELECT
>>>> > /*+ MAPJOIN(d) */
>>>> > table.a, sum(table2.b)
>>>> > from table
>>>> > LEFT OUTER JOIN table2
>>>> > ON table.id = table2.id
>>>> > where hour = '2012-12-11 11'
>>>> > group by table.a
>>>> >
>>>> > Why can't this be done within a single map-reduce job? All I can see
>>>> > from the query plan is that the 2nd job's mappers do nothing but take
>>>> > the 1st job's mapper output.
>>>> >
>>>> > --
>>>> > Chen Song
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>>
>> --
>> Chen Song
>>
>>
>>
>
>
> --
> Nitin Pawar
>



-- 
Chen Song


Re: map side join with group by

2012-12-13 Thread Chen Song
Yeah, my abridged version of the query might be a little broken, but my point
is that when a query has a map join and a group by, even in its simplest
incarnation, it launches two jobs. I was just wondering why a map join and a
group by cannot be accomplished in one MR job.
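For reference, the two stages are visible in the explain plan; a sketch using the query from the original message (note the MAPJOIN hint should name the joined table's alias, table2 here, rather than the d from the abridged query):

```
EXPLAIN
SELECT /*+ MAPJOIN(table2) */ table.a, sum(table2.b)
FROM table
LEFT OUTER JOIN table2 ON table.id = table2.id
WHERE hour = '2012-12-11 11'
GROUP BY table.a;
```

The output should show the first stage as the map-only join and the second stage as the group-by with a reduce phase.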

Best,
Chen

On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar wrote:

> I think Chen wanted to know why this is two phased query if I understood
> it correctly
>
> When you run a mapside join .. it just performs the join query .. after
> that to execute the group by part it launches the second job.
> I may be wrong but this is how I saw it whenever I executed group by
> queries
>
>
> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover 
> wrote:
>
>> Hi Chen,
>> I think we would need some more information.
>>
>> The query is referring to a table called "d" in the MAPJOIN hint but
>> there is not such table in the query. Moreover, Map joins only make
>> sense when the right table is the one being "mapped" (in other words,
>> being kept in memory) in case of a Left Outer Join, similarly if the
>> left table is the one being "mapped" in case of a Right Outer Join.
>> Let me know if this is not clear, I'd be happy to offer a better
>> explanation.
>>
>> In your query, the where clause on a column called "hour", at this
>> point I am unsure if that's a column of table1 or table2. If it's
>> column on table1, that predicate would get pushed up (if you have
>> hive.optimize.ppd property set to true), so it could possibly be done
>> in 1 MR job (I am not sure if that's presently the case, you will have
>> to check the explain plan). If however, the where clause is on a
>> column in the right table (table2 in your example), it can't be pushed
>> up since a column of the right table can have different values before
>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>> to be applied in a separate MR job.
>>
>> This is just my understanding, the full proof answer would lie in
>> checking out the explain plans and the Semantic Analyzer code.
>>
>> And for completeness, there is a conditional task (starting Hive 0.7)
>> that will convert your joins automatically to map joins where
>> applicable. This can be enabled by enabling hive.auto.convert.join
>> property.
>>
>> Mark
>>
>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song 
>> wrote:
>> > I have a silly question on how Hive interpretes a simple query with
>> both map
>> > side join and group by.
>> >
>> > Below query will translate into two jobs, with the 1st one as a map
>> only job
>> > doing the join and storing the output in a intermediary location, and
>> the
>> > 2nd one as a map-reduce job taking the output of the 1st job as input
>> and
>> > doing the group by.
>> >
>> > SELECT
>> > /*+ MAPJOIN(d) */
>> > table.a, sum(table2.b)
>> > from table
>> > LEFT OUTER JOIN table2
>> > ON table.id = table2.id
>> > where hour = '2012-12-11 11'
>> > group by table.a
>> >
>> > Why can't this be done within a single map reduce job? As what I can see
>> > from the query plan is that all 2nd job mapper do is taking the 1st
>> job's
>> > mapper output.
>> >
>> > --
>> > Chen Song
>> >
>> >
>>
>
>
>
> --
> Nitin Pawar
>



-- 
Chen Song


map side join with group by

2012-12-12 Thread Chen Song
I have a silly question on how Hive interprets a simple query with both a
map-side join and a group by.

The query below translates into two jobs: the first is a map-only job that
performs the join and stores its output in an intermediate location, and the
second is a map-reduce job that takes the first job's output as input and
performs the group by.

SELECT
/*+ MAPJOIN(d) */
table.a, sum(table2.b)
from table
LEFT OUTER JOIN table2
ON table.id = table2.id
where hour = '2012-12-11 11'
group by table.a

Why can't this be done within a single map-reduce job? From the query plan,
all the second job's mappers do is read the first job's mapper output.

-- 
Chen Song


Re: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

2012-10-29 Thread Chen Song
Is there anything interesting in hive.log?

If you are running the Hive shell as the root user, hive.log should be in
/tmp/root/hive.log by default.

BTW, are you using MySQL as your metastore?

On Mon, Oct 29, 2012 at 12:22 PM, Robin Verlangen  wrote:

> I have to add, while trying to do other operations on hive by hand, it
> also fails.
>
> For example:
> *hive> drop table my_other_table;*
> *FAILED: Error in metadata: java.lang.reflect.UndeclaredThrowableException
> *
> *FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask*
>
>
> Best regards,
>
> Robin Verlangen
> *Software engineer*
> *
> *
> W http://www.robinverlangen.nl
> E ro...@us2.nl
>
> <http://goo.gl/Lt7BC>
>
> Disclaimer: The information contained in this message and attachments is
> intended solely for the attention and use of the named addressee and may be
> confidential. If you are not the intended recipient, you are reminded that
> the information remains the property of the sender. You must not use,
> disclose, distribute, copy, print or rely on this e-mail. If you have
> received this message in error, please contact the sender immediately and
> irrevocably delete this message and any copies.
>
>
>
> 2012/10/29 Robin Verlangen 
>
>> Hi Chen,
>>
>> The user that ran this job is root and all hdfs folders are also owned by
>> root.
>>
>> Best regards,
>>
>> Robin Verlangen
>> *Software engineer*
>> *
>> *
>> W http://www.robinverlangen.nl
>> E ro...@us2.nl
>>
>> <http://goo.gl/Lt7BC>
>>
>>
>>
>>
>> 2012/10/29 Chen Song 
>>
>>> Looks to me the permission issue.
>>>
>>> Can you check if the user (which ran the hive query) has write
>>> permission on */user/hive/warehouse/mydatabase.db/mytable?*
>>>
>>> On Mon, Oct 29, 2012 at 8:38 AM, Robin Verlangen  wrote:
>>>
>>>> Hi there,
>>>>
>>>> Since today our Hive jobs suddenly fail (nothing changed actually). The
>>>> end looks like this:
>>>>
>>>> *MapReduce Total cumulative CPU time: 38 minutes 31 seconds 260 msec*
>>>> *Ended Job = job_201210291304_0015*
>>>> *Loading data to table mydatabase.mytable*
>>>> *rmr: DEPRECATED: Please use 'rm -r' instead.*
>>>> *Deleted /user/hive/warehouse/mydatabase.db/mytable*
>>>> *Failed with exception null*
>>>> *FAILED: Execution Error, return code 1 from
>>>> org.apache.hadoop.hive.ql.exec.MoveTask*
>>>> *MapReduce Jobs Launched:*
>>>> *Job 0: Map: 196  Reduce: 3   Cumulative CPU: 2311.26 sec   HDFS Read:
>>>> 0 HDFS Write: 0 SUCCESS*
>>>> *Total MapReduce CPU Time Spent: 38 minutes 31 seconds 260 msec*
>>>>
>>>> Does anyone have a clue how to resolve this?
>>>>
>>>> Best regards,
>>>>
>>>> Robin Verlangen
>>>> *Software engineer*
>>>> *
>>>> *
>>>> W http://www.robinverlangen.nl
>>>> E ro...@us2.nl
>>>>
>>>> <http://goo.gl/Lt7BC>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>>>
>>
>


-- 
Chen Song


Re: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

2012-10-29 Thread Chen Song
This looks to me like a permission issue.

Can you check whether the user that ran the Hive query has write permission
on */user/hive/warehouse/mydatabase.db/mytable*?

On Mon, Oct 29, 2012 at 8:38 AM, Robin Verlangen  wrote:

> Hi there,
>
> Since today our Hive jobs suddenly fail (nothing changed actually). The
> end looks like this:
>
> *MapReduce Total cumulative CPU time: 38 minutes 31 seconds 260 msec*
> *Ended Job = job_201210291304_0015*
> *Loading data to table mydatabase.mytable*
> *rmr: DEPRECATED: Please use 'rm -r' instead.*
> *Deleted /user/hive/warehouse/mydatabase.db/mytable*
> *Failed with exception null*
> *FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.MoveTask*
> *MapReduce Jobs Launched:*
> *Job 0: Map: 196  Reduce: 3   Cumulative CPU: 2311.26 sec   HDFS Read: 0
> HDFS Write: 0 SUCCESS*
> *Total MapReduce CPU Time Spent: 38 minutes 31 seconds 260 msec*
>
> Does anyone have a clue how to resolve this?
>
> Best regards,
>
> Robin Verlangen
> *Software engineer*
> *
> *
> W http://www.robinverlangen.nl
> E ro...@us2.nl
>
> <http://goo.gl/Lt7BC>
>
>
>


-- 
Chen Song


Re: general question on hive client log

2012-10-25 Thread Chen Song
Thanks Nitin.

I also found this wiki, but it only documents the Hive client log. What about
the other log files created during a query?

Chen

On Thu, Oct 25, 2012 at 7:30 AM, Nitin Pawar wrote:

> by default hive reads the logging related configs from
> hive-log4j.default in conf directory.
> if you want the cwiki page, here it is
>
> https://cwiki.apache.org/Hive/languagemanual-cli.html#LanguageManualCli-Logging
>
>
>
>
> On Thu, Oct 25, 2012 at 1:15 AM, Chen Song  wrote:
> > I have searched online but could not find a comprehensive introduction on
> > how Hive organize logs and naming convention. By that, I mean hive client
> > specifically, not hadoop logs.
> >
> > I have seen various logs as below.
> >
> > 1. Hive shell log
> > /tmp/{user.name}/hive.log
> >
> > 2. Hive history file for a job
> > /tmp/{user.name}/hive_job_log_{job id}.txt
> >
> > 3. Hive execution log
> > /tmp/{user.name}/hive_{some random unique number}.log
> >
> > 4. Other logs
> > /tmp/{user.name}/hive_{some random unique number}.pipeout but always
> empty.
> >
> > Can anyone point me to some page where this is documented in details?
> Thanks
> > very much.
> >
> > --
> > Chen Song
> >
> >
>
>
>
> --
> Nitin Pawar
>



-- 
Chen Song


Re: Error in semantic analysis: Unable to fetch table

2012-10-19 Thread Chen Song
Hi Keith

Can you get anything when you run "show databases" from php?

Chen

On Fri, Oct 19, 2012 at 2:48 AM, Keith Wiley  wrote:

> I never needed the -S.  I would just call it with -e, but yeah, that's the
> basic idea.  The output is gathered in the remaining optional arguments to
> exec().
>
> On Oct 18, 2012, at 22:15 , MiaoMiao wrote:
>
> > What php, you are using exec('hive -S -e "select * from some_table"')
> > or something like that in php?
> >
> > On Fri, Oct 19, 2012 at 4:46 AM, Keith Wiley 
> wrote:
> >> If I run a select command on the command line with -e it works.  If I
> run the same command inside hive (without -e) it still works...but if I
> exec it with a -e from php (so obviously the apache user), I get "Error in
> semantic analysis: Unable to fetch table [tablename]".  Bear in mind that
> this overall system worked perfectly for weeks.  I haven't modified the
> code or anything.  What did change was I had to reinstall hadoop and hive
> and I'm sure some configuration parameter is not correct now.  I found some
> Google references to this error but no clear-but solutions.
> >>
> >> What does this error mean and how do I fix it?
> >>
> >> Thanks.
>
>
>
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
>
> "Luminous beings are we, not this crude matter."
>--  Yoda
>
> 
>
>


-- 
Chen Song


Re: can i define an udf which can process more than one argument?

2012-10-19 Thread Chen Song
A UDF can process multiple arguments; you just define the evaluate signature
accordingly.

For example,

evaluate(String s1, String s2, boolean t)
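A minimal sketch of such a multi-argument UDF, modeled on Hive's built-in if(testCondition, valueTrue, valueFalseOrNull). In a real deployment the class extends org.apache.hadoop.hive.ql.exec.UDF; the extends clause is commented out here so the sketch compiles without Hive on the classpath.

```java
// Sketch of a multi-argument Hive UDF. Hive picks the evaluate overload
// that matches the argument types used in the query.
public class ConditionalUDF /* extends UDF */ {
    public String evaluate(boolean testCondition, String valueTrue, String valueFalseOrNull) {
        return testCondition ? valueTrue : valueFalseOrNull;
    }
}
```

For a truly variable number of arguments (like COALESCE), a GenericUDF is the usual route, since its initialize() receives an ObjectInspector[] of arbitrary length.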


On Thu, Oct 18, 2012 at 11:36 PM, Chris Gong  wrote:

> **
> i find that UDF can only process one argument like
>public long evaluate(String ip) {
>
>   }
>
> can i define a udf like  COALESCE(T v1, T v2, …) or  if(boolean
> testCondition, T valueTrue, T valueFalseOrNull)?
>
> ------
> Chris Gong
>



-- 
Chen Song


Re: ERROR: Hive subquery showing

2012-09-27 Thread Chen Song
Sorry that I misunderstood the question. I think Phil's query will do the
trick.

On Thu, Sep 27, 2012 at 4:46 PM, Philip Tromans
wrote:

> How about:
> select name from ABC order by grp desc limit 1?
>
> Phil.
> On Sep 27, 2012 9:02 PM, "yogesh dhari"  wrote:
>
>>  Hi Bejoy,
>>
>> I tried this one also but here it throws horrible error:
>>
>> i.e:
>>
>> hive: select name from ABD where grp=MAX(grp);
>>
>> FAILED: Hive Internal Error: java.lang.NullPointerException(null)
>> java.lang.NullPointerException
>> at
>> org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:214)
>> at
>> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:684)
>> at
>> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:805)
>> at
>> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
>> at
>> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
>> at
>> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
>> at
>> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
>> at
>> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:161)
>> at
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7506)
>> at
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7464)
>> at
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:1513)
>> at
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:1494)
>> at
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:5886)
>> at
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6524)
>> at
>> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7282)
>> at
>> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243)
>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430)
>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337)
>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889)
>> at
>> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
>> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
>> at
>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
>> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:601)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>> Regards
>> Yogesh Kumar
>>
>> --
>> Subject: Re: ERROR: Hive subquery showing
>> To: user@hive.apache.org
>> From: bejoy...@yahoo.com
>> Date: Thu, 27 Sep 2012 19:48:25 +
>>
>> Hi yogesh
>>
>> What about a query like this
>> select name from ABC WHERE grp=MAX(grp);
>>
>> Regards
>> Bejoy KS
>>
>> Sent from handheld, please excuse typos.
>> --
>> *From: * Chen Song 
>> *Date: *Thu, 27 Sep 2012 15:33:11 -0400
>> *To: *
>> *ReplyTo: * user@hive.apache.org
>> *Subject: *Re: ERROR: Hive subquery showing
>>
>> Can you try this?
>> *
>> *
>> *select name from ( select MAX(grp) as name from ABC ) gy ;*
>>
>> On Thu, Sep 27, 2012 at 3:29 PM, yogesh dhari wrote:
>>
>>  Hi all,
>>
>> I have a table called ABC, like
>>
>> namegrp
>> A 1
>> B 2
>> C 4
>> D 8
>>
>> I want the output like the name having greatest grp i.e D;
>>
>> I wrote a query:
>> *
>> select name from ( select MAX(grp) from ABC ) gy ;*
>>
>> but it gives error
>>
>> FAILED: Error in semantic analysis: Line 1:7 Invalid table alias or
>> column reference 'name': (possible column names are: _col0)
>>
>> Please help and suggest why it is so, and what would be the query;
>>
>>
>> Thanks & regards
>> Yogesh Kumar
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Chen Song
>>
>>
>>


-- 
Chen Song


Re: ERROR: Hive subquery showing

2012-09-27 Thread Chen Song
Can you try this?
*
*
*select name from ( select MAX(grp) as name from ABC ) gy ;*
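For completeness (and as worked out elsewhere in this thread), the aliased subquery only returns the maximum grp value; to get the name that carries it, the direct form is simpler:

```
-- alias the aggregate so the outer query can reference it:
SELECT mx FROM (SELECT MAX(grp) AS mx FROM ABC) gy;

-- to get the name with the largest grp, skip the subquery entirely:
SELECT name FROM ABC ORDER BY grp DESC LIMIT 1;
```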

On Thu, Sep 27, 2012 at 3:29 PM, yogesh dhari  wrote:

>  Hi all,
>
> I have a table called ABC, like
>
> namegrp
> A 1
> B 2
> C 4
> D 8
>
> I want the output like the name having greatest grp i.e D;
>
> I wrote a query:
> *
> select name from ( select MAX(grp) from ABC ) gy ;*
>
> but it gives error
>
> FAILED: Error in semantic analysis: Line 1:7 Invalid table alias or column
> reference 'name': (possible column names are: _col0)
>
> Please help and suggest why it is so, and what would be the query;
>
>
> Thanks & regards
> Yogesh Kumar
>
>
>
>
>
>


-- 
Chen Song


Re: Re: size of RCFile in hive

2012-09-27 Thread Chen Song
You can force a reduce phase by adding a DISTRIBUTE BY or ORDER BY clause to
your select query.
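A hypothetical sketch (column names are placeholders) that forces a reduce phase and caps the output at a single file:

```
SET mapred.reduce.tasks=1;
INSERT OVERWRITE TABLE rctable
SELECT col1, col2 FROM source_table
DISTRIBUTE BY col1;
```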

On Thu, Sep 27, 2012 at 2:03 PM, 王锋  wrote:

> but it's map only job
>
>
> At 2012-09-27 05:39:39,"Chen Song"  wrote:
>
> As far as I know, the number of files emitted would be determined by the
> number of mappers for a map only job and the number of reducers for a map
> reduce job.
>
> So it totally depends how your query translates into a MR job.
>
> You can enforce it by setting the property
>
> *mapred.reduce.tasks=1*
>
> Chen
>
> On Wed, Sep 19, 2012 at 11:25 PM, 王锋  wrote:
>
>> Hi
>>I tried to convert and merge many small text files using RCFiles using
>> hivesql,but hive  produced some small rcfiles.
>> set hive.exec.compress.output=true;
>> set mapred.output.compress=true;
>> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
>> set io.compression.codecs=com.hadoop.compression.lzo.LzoCodec;
>> hive.merge.mapfiles=true
>> hive.merge.mapredfiles=true
>> hive.merge.size.per.task=64000
>> hive.merge.size.smallfiles.avgsize=8000
>> insert  overwrite table rctable select .
>>
>>
>>   the settings:
>> hive.merge.mapfiles=true
>> hive.merge.mapredfiles=true
>> hive.merge.size.per.task=64000
>> hive.merge.size.smallfiles.avgsize=8000
>> didn't work.
>>
>>
>> who could tell me how to solve it?
>
>
>
>
> --
> Chen Song
>
>
>
>
>


-- 
Chen Song


Re: size of RCFile in hive

2012-09-26 Thread Chen Song
As far as I know, the number of files emitted is determined by the number of
mappers for a map-only job, and by the number of reducers for a map-reduce job.

So it entirely depends on how your query translates into an MR job.

You can enforce it by setting the property

*mapred.reduce.tasks=1*

Chen

On Wed, Sep 19, 2012 at 11:25 PM, 王锋  wrote:

> Hi
>I tried to convert and merge many small text files using RCFiles using
> hivesql,but hive  produced some small rcfiles.
> set hive.exec.compress.output=true;
> set mapred.output.compress=true;
> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
> set io.compression.codecs=com.hadoop.compression.lzo.LzoCodec;
> hive.merge.mapfiles=true
> hive.merge.mapredfiles=true
> hive.merge.size.per.task=64000
> hive.merge.size.smallfiles.avgsize=8000
> insert  overwrite table rctable select .
>
>
>   the settings:
> hive.merge.mapfiles=true
> hive.merge.mapredfiles=true
> hive.merge.size.per.task=64000
> hive.merge.size.smallfiles.avgsize=8000
> didn't work.
>
>
> who could tell me how to solve it?




-- 
Chen Song


Re: How can I get the constant value from the ObjectInspector in the UDF

2012-09-26 Thread Chen Song
> on the value end user specified here.
> 5) I looks around the javadoc of ObjectInspector class. I know at run time
> the real class of the 2nd parameter is WritableIntObjectInspector. I can
> get the type, but how I can get the real value of it?
> 6) This is kind of ConstantsObjectInspector, should be able to give the
> value to me, as it already knows the type is int. What how?
> 7) I don't want to try to get the value at the evaluate stage. Can I get
> this value at the initialize stage?
>
> Thanks
>
> Yong
>
>
>
>
> --
> Chen Song
>
>
>


-- 
Chen Song


Re: How can I get the constant value from the ObjectInspector in the UDF

2012-09-26 Thread Chen Song
Hi Yong

The way GenericUDF works is as follows.

*ObjectInspector initialize(ObjectInspector[] arguments)* is called only once
per GenericUDF instance used in your Hive query. This phase is for the UDF's
preparation steps, such as argument checking and type inference.

*Object evaluate(DeferredObject[] arguments)* is called to evaluate the
function against the actual arguments. This is where the per-row calculation
happens and where you can get the real values you asked about.
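To make the two phases concrete, below is a stripped-down model of the moving-sum computation following that contract: the window size (your constant second argument) is fixed once at initialize time, and evaluate runs per row. A real implementation extends org.apache.hadoop.hive.ql.udf.generic.GenericUDF and works through ObjectInspectors; plain Java types are used here so the sketch stands alone.

```java
// Minimal model of the GenericUDF lifecycle: initialize() runs once per query
// to validate and record the constant arguments; evaluate() runs once per row.
public class MovingSumSketch {
    private int window;   // fixed at initialize time
    private long[] ring;  // circular buffer of the last `window` values
    private int filled, pos;
    private long sum;

    // Called once: check and record the constant window size.
    public void initialize(int windowSize) {
        if (windowSize <= 0) throw new IllegalArgumentException("window must be > 0");
        this.window = windowSize;
        this.ring = new long[windowSize];
    }

    // Called per row: fold the new value into the moving sum.
    public long evaluate(long value) {
        if (filled == window) sum -= ring[pos]; else filled++;
        ring[pos] = value;
        sum += value;
        pos = (pos + 1) % window;
        return sum;
    }
}
```

In later Hive versions, a literal argument arrives at initialize() wrapped in a ConstantObjectInspector, whose getWritableConstantValue() exposes the literal before any rows are evaluated; whether that interface is available on Hive 0.7.1 would need checking against its javadoc.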

Thanks,
Chen

On Wed, Sep 26, 2012 at 4:17 AM, java8964 java8964 wrote:

>  Hi, I am using Cloudera release cdh3u3, which has the hive 0.71 version.
>
> I am trying to write a hive UDF function as to calculate the moving sum.
> Right now, I am having trouble to get the constrant value passed in in the
> initialization stage.
>
> For example, let's assume the function is like the following format:
>
> msum(salary, 10) - salary is a int type column
>
> which means the end user wants to calculate the last 10 rows of salary.
>
> I kind of know how to implement this UDF. But I have one problem right now.
>
> 1) This is not a UDAF, as each row will return one data back as the moving
> sum.
> 2) I create an UDF class extends from the GenericUDF.
> 3) I can get the column type from the ObjectInspector[] passed to me in
> the initialize() method to verify that 'salary' and 10 both needs to be
> numeric type (later one needs to be integer)
> 4) But I also want to get the real value of 10, in this case, in the
> initialize() stage, so I can create the corresponding data structure based
> on the value end user specified here.
> 5) I looks around the javadoc of ObjectInspector class. I know at run time
> the real class of the 2nd parameter is WritableIntObjectInspector. I can
> get the type, but how I can get the real value of it?
> 6) This is kind of ConstantsObjectInspector, should be able to give the
> value to me, as it already knows the type is int. What how?
> 7) I don't want to try to get the value at the evaluate stage. Can I get
> this value at the initialize stage?
>
> Thanks
>
> Yong
>



-- 
Chen Song


Question on escaped characters in Hive Shell

2012-03-01 Thread Chen Song
Hi All

I have a question on quoted hive query when executed with 'hive -e'. The 
command I ran looks like:

> hive -e "select regexp_extract(col1, '\\d+') from A where col2='some value' 
> limit 5"

When the query gets passed to Hive, it is interpreted as

> select regexp_extract(col1, '\d+') from A where col2='some value' limit 5

As you can see, the regex pattern is translated from '\\d+' to '\d+' because of
the escape character.

By using single quotes at the outer level and double quotes within the query
itself, I can get around this problem. Putting the query in a file and running
it with 'hive -f' also works. But both approaches need amendments to the
application and feel a little tactical.

My questions are:

1) Does Hive handle the escaping, or does the Linux shell interpret it before
passing the query to Hive?
2) Other than the two workarounds I mentioned, is there any other option?
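For what it's worth, substituting printf for hive suggests it is the shell, not Hive, that collapses the backslashes before the query is ever passed along:

```shell
# Inside double quotes, the shell rewrites \\ as \ before the program runs:
printf '%s\n' "select regexp_extract(col1, '\\d+') from A"
# -> select regexp_extract(col1, '\d+') from A

# Inside single quotes, the text is passed through untouched:
printf '%s\n' 'select regexp_extract(col1, "\\d+") from A'
# -> select regexp_extract(col1, "\\d+") from A
```

This would explain why swapping the quote levels, or using 'hive -f', avoids the problem.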

Thank you for your feedback in advance.

- Chen

Re: question on hive multiple insert

2012-01-20 Thread Chen Song
Please disregard my previous email. I figured out that the correct syntax 
should be

FROM a JOIN b on a.col1 = b.col1
INSERT a.col1, a.col2
INSERT ...
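Spelled out in full (target tables t1 and t2 are hypothetical placeholders), the multi-insert form with a join as the shared source looks like:

```
FROM a JOIN b ON a.col1 = b.col1
INSERT OVERWRITE TABLE t1 SELECT a.col1, a.col2
INSERT OVERWRITE TABLE t2 SELECT a.col1, b.col2;
```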

Thanks
Chen



 From: Chen Song 
To: hive user list  
Sent: Friday, January 20, 2012 10:47 AM
Subject: question on hive multiple insert
 

Hi

I am reading Hive's multiple-insert syntax manual and wondering whether it is
possible to use a join in any individual insert, e.g.,

FROM a
INSERT a.col1, a.col2 join b on a.col1 = b.col1
INSERT ...

Apparently, Hive doesn't like this query and gives a syntax error.

In other words, my general question is: for a multiple insert, can we specify
multiple tables as data sources, or only one table from which the data is
pulled (the benefit, of course, being a one-pass scan over that one table)?

Best,
Chen

question on hive multiple insert

2012-01-20 Thread Chen Song
Hi

I am reading Hive's multiple-insert syntax manual and wondering whether it is
possible to use a join in any individual insert, e.g.,

FROM a
INSERT a.col1, a.col2 join b on a.col1 = b.col1
INSERT ...

Apparently, Hive doesn't like this query and gives a syntax error.

In other words, my general question is: for a multiple insert, can we specify
multiple tables as data sources, or only one table from which the data is
pulled (the benefit, of course, being a one-pass scan over that one table)?

Best,
Chen

Re: pass entire row as parameter in hive UDF

2011-11-01 Thread Chen Song
Can this only be used in a regular select statement, or also as arguments to a
UDF? In that case, how should I define my UDF/GenericUDF method signature to
accept columns in this form? Will Hive automatically expand the column list and
pass the columns to a customized UDF?


If there is any example, that would be very helpful.

Thanks
Chen



From: Steven Wong 
To: "user@hive.apache.org" ; hive dev list 

Sent: Tuesday, November 1, 2011 10:20 PM
Subject: RE: pass entire row as parameter in hive UDF


Would 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification
 work for you in the meantime?
 
 
From:Chen Song [mailto:chens_alb...@yahoo.com] 
Sent: Monday, October 31, 2011 9:15 AM
To: hive dev list; hive user list
Subject: pass entire row as parameter in hive UDF
 
Hi All
 
In Hive, I would like to write a UDF that accepts a sequence of parameters.
Because the number of parameters is large and the particular function I am
writing is specific to a set of tables (joined in some way in the SQL), I am
wondering if there is a way to pass the entire row as a wildcard parameter and
then query its fields inside the UDF, as shown in the example below.
 
select my_function(*) as my_column from t1, t2, etc where [a set of join 
conditions].
 
I did some investigation and found there was a JIRA opened for this.
 
https://issues.apache.org/jira/browse/HIVE-1459
 
This ticket was opened as a follow-up to HIVE-287 to support star expansion in
general, and it still seems to be open. If anyone knows a way to pass the
entire row as context to a UDF, that would be very helpful.
 
Regards,
Chen

pass entire row as parameter in hive UDF

2011-10-31 Thread Chen Song
Hi All

In Hive, I would like to write a UDF that accepts a sequence of parameters.
Because the number of parameters is large and the particular function I am
writing is specific to a set of tables (joined in some way in the SQL), I am
wondering if there is a way to pass the entire row as a wildcard parameter and
then query its fields inside the UDF, as shown in the example below.


select my_function(*) as my_column from t1, t2, etc where [a set of join 
conditions].

I did some investigation and found there was a JIRA opened for this.

https://issues.apache.org/jira/browse/HIVE-1459

This ticket was opened as a follow-up to HIVE-287 to support star expansion in
general, and it still seems to be open. If anyone knows a way to pass the
entire row as context to a UDF, that would be very helpful.


Regards,
Chen