Re: question on create database
Got it. Thanks. On Thu, Apr 2, 2015 at 11:29 AM, Alan Gates wrote: > When someone creates a table in your 'abc' database it should by default > be in '/my/preferred/directory/_tablename_'. However, users can specify > locations for their tables which may not be in that directory. AFAIK > there's no way to prevent that. > > Alan. > > Chen Song > April 2, 2015 at 8:15 > I have a dumb question on DDL statement "create database" > > Say if I create a database > CREATE DATABASE abc > LOCATION '/my/preferred/directory'; > When later on someone needs to create a table in this database, is there a > way to force the location of the table to be under > /my/preferred/directory? > > I searched around but could not find a way to enforce this. > > > -- > Chen Song > > -- Chen Song
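Alan's point can be sketched with a hypothetical example (database, table, and path names are made up): the database LOCATION only sets the default parent directory for its tables, and any per-table LOCATION clause overrides it.

```sql
-- Default case: the table directory is created under the database location.
CREATE DATABASE abc LOCATION '/my/preferred/directory';
CREATE TABLE abc.t1 (id INT);   -- data lands in /my/preferred/directory/t1

-- But a user can still point a table anywhere they have write access:
CREATE TABLE abc.t2 (id INT)
LOCATION '/some/other/path';    -- outside the database directory
```

Restricting this would have to happen outside the DDL, e.g. via HDFS permissions on the paths users are allowed to write to.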
question on create database
I have a dumb question on the DDL statement "create database". Say I create a database:

CREATE DATABASE abc
LOCATION '/my/preferred/directory';

When later on someone needs to create a table in this database, is there a way to force the location of the table to be under /my/preferred/directory? I searched around but could not find a way to enforce this. -- Chen Song
Re: question on HIVE-5891
I haven't found a workaround yet. On Thu, Nov 13, 2014 at 11:25 AM, Stéphane Verlet wrote: > Chen > > Did you find a workaround? Anybody else have a suggestion? > > Thank you > > Stephane > > On Mon, Aug 4, 2014 at 9:00 AM, Chen Song wrote: > >> I am using the cdh5 distribution and it doesn't look like this jira >> >> https://issues.apache.org/jira/browse/HIVE-5891 >> >> was backported into cdh 5.1.0. >> >> Is there a workaround to modify the query that is subject to this problem? >> >> -- >> Chen Song >> >> > -- Chen Song
Re: mapreduce.job.queuename doesn't work with hiveserver2
Thanks for the reply. I am using Hue so I may need to find a way to hack this into Hue. On Fri, Oct 10, 2014 at 7:39 AM, Navdeep Agrawal < navdeep_agra...@symantec.com> wrote: > You can change the mapreduce.job.queuename by appending it to the connection > string after ‘?’ like jdbc:hive2://url:1/db?mapreduce.job.queuename=q1 > > > > *From:* Chen Song [mailto:chen.song...@gmail.com] > *Sent:* Friday, October 10, 2014 2:38 AM > *To:* user > *Subject:* mapreduce.job.queuename doesn't work with hiveserver2 > > > > By setting mapreduce.job.queuename in hive-site.xml, all queries submitted > via hiveserver or hive cli go to the specified queue. > > > However, hiveserver2 doesn't appear to respect this value at all. The > queue is chosen based on the user who submitted the query. I tried with beeline > and JDBC. > > > > Is there a way I can configure a specific queue for queries going through > hiveserver2? > > > > -- > Chen Song > > > -- Chen Song
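The connection-string workaround from the reply, plus a per-session SET variant, collected as a sketch (host, port, database, and queue name are placeholders):

```sql
-- Per-session override, issued from beeline or any JDBC client
-- right after connecting:
SET mapreduce.job.queuename=q1;

-- Or append the property to the connection string, e.g.:
-- jdbc:hive2://hiveserver-host:10000/db?mapreduce.job.queuename=q1
```

With the URL form the property applies only to sessions opened through that connection string.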
mapreduce.job.queuename doesn't work with hiveserver2
By setting mapreduce.job.queuename in hive-site.xml, all queries submitted via hiveserver or hive cli go to the specified queue. However, hiveserver2 doesn't appear to respect this value at all. The queue is chosen based on the user who submitted the query. I tried with beeline and JDBC. Is there a way I can configure a specific queue for queries going through hiveserver2? -- Chen Song
Re: hive auto join conversion
Yeah, I was trying the same thing, though a little bit ugly. My query needs to LJ/J with multiple tables. When there are 1 or 2 LJ/Js, the rewriting works, but when there are more than 3 tables I got the same exception, triggered by the following bug. https://issues.apache.org/jira/browse/HIVE-5891 Chen On Wed, Jul 30, 2014 at 10:07 PM, Eugene Koifman wrote: > would manually rewriting the query from (T1 union all T2) LOJ S to > equivalent (T1 LOJ S) union all (T2 LOJ S) help work around this issue? > > > On Wed, Jul 30, 2014 at 6:19 PM, Chen Song wrote: > >> I tried that and I got the following error. >> >> FAILED: SemanticException [Error 10227]: Not all clauses are supported >> with mapjoin hint. Please remove mapjoin hint. >> >> I then tried turning off auto join conversion. >> >> set hive.auto.convert.join=false >> >> But no luck, same error. >> >> Looks like it is a known issue, >> >> >> http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/bk_releasenotes_hdp_2.0/content/ch_relnotes-hdp2.0.0.2-5-2.html >> >> Chen >> >> >> >> >> On Wed, Jul 30, 2014 at 9:10 PM, Navis류승우 wrote: >> >>> Could you do it with hive.ignore.mapjoin.hint=false? Mapjoin hint is >>> ignored from hive-0.11.0 by default (see >>> https://issues.apache.org/jira/browse/HIVE-4042) >>> >>> Thanks, >>> Navis >>> >>> >>> 2014-07-31 10:04 GMT+09:00 Chen Song : >>> >>> I am using cdh5 with hive 0.12. We have some hive jobs migrated from >>>> hive 0.10 and they are written like below: >>>> >>>> select /*+ MAPJOIN(sup) */ c1, c2, sup.c >>>> from >>>> ( >>>> select key, c1, c2 from table1 >>>> union all >>>> select key, c1, c2 from table2 >>>> ) table >>>> left outer join >>>> sup >>>> on (table.c1 = sup.key) >>>> distribute by c1 >>>> >>>> In Hive 0.10 (CDH4), Hive translates the left outer join into a map >>>> join (map only job), followed by a regular MR job for distribute by. >>>> >>>> In Hive 0.12 (CDH5), Hive is not able to convert the join into a map >>>> join. 
Instead it launches a common map reduce for the join, followed by >>>> another mr for distribute by. However, when I take out the union all >>>> operator, Hive seems to be able to create a single MR job, with map join on >>>> map phase, and reduce for distribute by. >>>> >>>> I read a bit on >>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins >>>> and found out that there are some restrictions on map side join >>>> starting Hive 0.11. The following are not supported. >>>> >>>> >>>> - Union Followed by a MapJoin >>>> - Lateral View Followed by a MapJoin >>>> - Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By) >>>> Followed by MapJoin >>>> - MapJoin Followed by Union >>>> - MapJoin Followed by Join >>>> - MapJoin Followed by MapJoin >>>> >>>> >>>> So if one side of the table (big side) is a union of some tables and >>>> the other side is a small table, Hive would not be able to do a map join at >>>> all? Is that correct? >>>> >>>> If correct, what should I do to make the job backward compatible? >>>> >>>> -- >>>> Chen Song >>>> >>>> >>> >> >> >> -- >> Chen Song >> >> > > > -- > > Thanks, > Eugene -- Chen Song
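Eugene's suggested rewrite pushes the join below the union so that each branch can be map-joined on its own; a sketch based on the query in this thread (untested, and the distribute by is left out for brevity):

```sql
-- Instead of (table1 UNION ALL table2) LEFT OUTER JOIN sup ...
-- join each branch with the small table first, then union the results:
SELECT /*+ MAPJOIN(sup) */ t.key, t.c1, t.c2, sup.c
FROM table1 t
LEFT OUTER JOIN sup ON (t.c1 = sup.key)
UNION ALL
SELECT /*+ MAPJOIN(sup) */ t.key, t.c1, t.c2, sup.c
FROM table2 t
LEFT OUTER JOIN sup ON (t.c1 = sup.key);
```

As Chen notes above, this kind of rewrite worked with 1 or 2 joins but still hit HIVE-5891 once more tables were involved.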
question on HIVE-5891
I am using the cdh5 distribution and it doesn't look like this jira https://issues.apache.org/jira/browse/HIVE-5891 was backported into cdh 5.1.0. Is there a workaround to modify the query that is subject to this problem? -- Chen Song
Re: hive auto join conversion
I tried that and I got the following error. FAILED: SemanticException [Error 10227]: Not all clauses are supported with mapjoin hint. Please remove mapjoin hint. I then tried turning off auto join conversion. set hive.auto.convert.join=false But no luck, same error. Looks like it is a known issue, http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/bk_releasenotes_hdp_2.0/content/ch_relnotes-hdp2.0.0.2-5-2.html Chen On Wed, Jul 30, 2014 at 9:10 PM, Navis류승우 wrote: > Could you do it with hive.ignore.mapjoin.hint=false? Mapjoin hint is > ignored from hive-0.11.0 by default (see > https://issues.apache.org/jira/browse/HIVE-4042) > > Thanks, > Navis > > > 2014-07-31 10:04 GMT+09:00 Chen Song : > > I am using cdh5 with hive 0.12. We have some hive jobs migrated from hive >> 0.10 and they are written like below: >> >> select /*+ MAPJOIN(sup) */ c1, c2, sup.c >> from >> ( >> select key, c1, c2 from table1 >> union all >> select key, c1, c2 from table2 >> ) table >> left outer join >> sup >> on (table.c1 = sup.key) >> distribute by c1 >> >> In Hive 0.10 (CDH4), Hive translates the left outer join into a map join >> (map only job), followed by a regular MR job for distribute by. >> >> In Hive 0.12 (CDH5), Hive is not able to convert the join into a map >> join. Instead it launches a common map reduce for the join, followed by >> another mr for distribute by. However, when I take out the union all >> operator, Hive seems to be able to create a single MR job, with map join on >> map phase, and reduce for distribute by. >> >> I read a bit on >> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins >> and found out that there are some restrictions on map side join starting >> Hive 0.11. The following are not supported. 
>> >> >>- Union Followed by a MapJoin >>- Lateral View Followed by a MapJoin >>- Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By) >>Followed by MapJoin >>- MapJoin Followed by Union >>- MapJoin Followed by Join >>- MapJoin Followed by MapJoin >> >> >> So if one side of the table (big side) is a union of some tables and the >> other side is a small table, Hive would not be able to do a map join at >> all? Is that correct? >> >> If correct, what should I do to make the job backward compatible? >> >> -- >> Chen Song >> >> > -- Chen Song
hive auto join conversion
I am using cdh5 with hive 0.12. We have some hive jobs migrated from hive 0.10 and they are written like below:

select /*+ MAPJOIN(sup) */ c1, c2, sup.c
from
(
  select key, c1, c2 from table1
  union all
  select key, c1, c2 from table2
) table
left outer join sup
on (table.c1 = sup.key)
distribute by c1

In Hive 0.10 (CDH4), Hive translates the left outer join into a map join (map only job), followed by a regular MR job for distribute by.

In Hive 0.12 (CDH5), Hive is not able to convert the join into a map join. Instead it launches a common map reduce job for the join, followed by another MR job for distribute by. However, when I take out the union all operator, Hive seems to be able to create a single MR job, with the map join in the map phase and a reduce for distribute by.

I read a bit on https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins and found out that there are some restrictions on map side join starting Hive 0.11. The following are not supported.

- Union Followed by a MapJoin
- Lateral View Followed by a MapJoin
- Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By) Followed by MapJoin
- MapJoin Followed by Union
- MapJoin Followed by Join
- MapJoin Followed by MapJoin

So if one side of the join (the big side) is a union of some tables and the other side is a small table, Hive would not be able to do a map join at all? Is that correct?

If correct, what should I do to make the job backward compatible?

-- Chen Song
php thrift client for hiveserver2
I am using CDH5 with hive 0.12.0. I dug around a bit on the internet but could not find sample php code to connect to hiveserver2. I know there are php files generated by thrift but they seem broken, as shown below.

php -l /usr/lib/hive/lib/php/packages/hive_service/TCLIService.php
PHP Parse error: syntax error, unexpected ';', expecting T_STRING or T_NS_SEPARATOR or '{' in /usr/lib/hive/lib/php/packages/hive_service/TCLIService.php on line 2
Parse error: syntax error, unexpected ';', expecting T_STRING or T_NS_SEPARATOR or '{' in /usr/lib/hive/lib/php/packages/hive_service/TCLIService.php on line 2
Errors parsing /usr/lib/hive/lib/php/packages/hive_service/TCLIService.php

Is php thrift no longer supported by hiveserver2? Chen
Re: the php client seems to be broken
I am also experiencing this syntax issue in /usr/lib/hive/lib/php/packages/hive_service/ThriftHive.php. Does anyone know how to fix this? [Wed Apr 23 16:23:23 2014] [error] [client 127.0.0.1] PHP Parse error: syntax error, unexpected ';', expecting T_STRING or T_NS_SEPARATOR or '{' in /usr/lib/hive/lib/php/packages/hive_service/ThriftHive.php on line 2 The first few lines in the php file read as wrote: > I have been researching for a week now on how to use the PHP hive client. > https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-PHP > but the $GLOBALS['THRIFT_ROOT'] = '/lib/php/'; is broken. One thing is > the autoloader is not generated properly. Second, > packages/hive_service/ThriftHive.php' has namespace; at the top, so a > namespace is not even defined. > Third, if I just start including files instead of using the autoloader I > get the error " Fatal error: Class 'Thrift\Exception\TException' not found in > /lib/php/packages/hive_metastore/metastore/Types.php on line 4891" even > though the file is included. > > Anyone know how to resolve these issues? > > > > On 03/17/2014 11:45 AM, Jeremy wrote: > > I am trying to use the php hive thrift client and there are namespace and > autoload issues. > Autoload is not generated and I have not found any instructions on how to > properly generate it. > Also some files contain namespace; which is a syntax issue. > Is anyone aware of these issues and know how to fix them? > > > > > -- Chen Song
Re: single MR stage for join and group by
We are currently using cloudera's CDH4.2.0 based on Hive 0.10.0, so neither optimization was incorporated. Thank you all for the valuable feedback, this is really helpful. I will look further into the details of the JIRA. On Fri, Aug 2, 2013 at 12:14 AM, Yin Huai wrote: > If the join is a reduce side join, > https://issues.apache.org/jira/browse/HIVE-2206 will optimize this query > and generate a single MR job. The optimizer introduced by HIVE-2206 is in > trunk. Currently, it only handles the same column(s). > > If the join is a MapJoin, hive 0.11 can generate a single MR job (in this > case, whether join and group by use the same column(s) does not matter). To > enable it, you need to ... > set hive.auto.convert.join=true; > set hive.auto.convert.join.noconditionaltask=true; > set hive.optimize.mapjoin.mapreduce=true; > and also make sure hive.auto.convert.join.noconditionaltask.size is larger > than the size of the small table. > For hive trunk, https://issues.apache.org/jira/browse/HIVE-4827 drops the > flag "hive.optimize.mapjoin.mapreduce". So, in a future release, you will > not need to set hive.optimize.mapjoin.mapreduce. > > Thanks, > > Yin > > > On Thu, Aug 1, 2013 at 5:32 PM, Stephen Sprague wrote: >> and what version of hive are you running your test on? i do believe - >> not certain - that hive 0.11 includes the optimization you seek. >> >> >> On Thu, Aug 1, 2013 at 10:19 AM, Chen Song wrote: >>> Suppose we have 2 simple tables >>> >>> A >>> id int >>> value string >>> >>> B >>> id >>> >>> When hive translates the following query >>> >>> select max(A.value), A.id from A join B on A.id = B.id group by A.id; >>> >>> It launches 2 stages, one for the join and one for the group by. >>> >>> My understanding is that if the join key set is a subset of the group >>> by key set, it can be achieved in the same map reduce job. If that is >>> correct in theory, could it be a feature in hive? >>> >>> Chen >>> >>> >> > -- Chen Song
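Yin's MapJoin recipe above, gathered into one snippet (the size threshold is a made-up example; it must exceed the small table's size in bytes, and hive.optimize.mapjoin.mapreduce was later dropped by HIVE-4827):

```sql
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- hypothetical 100 MB threshold; must be larger than the small table:
set hive.auto.convert.join.noconditionaltask.size=100000000;
set hive.optimize.mapjoin.mapreduce=true;  -- unnecessary on trunk after HIVE-4827
```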
single MR stage for join and group by
Suppose we have 2 simple tables:

A
  id int
  value string

B
  id

When hive translates the following query

select max(A.value), A.id from A join B on A.id = B.id group by A.id;

It launches 2 stages, one for the join and one for the group by.

My understanding is that if the join key set is a subset of the group by key set, it can be achieved in the same map reduce job. If that is correct in theory, could it be a feature in hive?

Chen
Re: Hive producing different outputs
resent as no one has replied yet :) On Thu, Nov 15, 2012 at 11:18 AM, Chen Song wrote: > Hi Folks > > We are getting inconsistent output when running some semantically same > Hive queries. We are using *CDH3u3* with *Hive 0.7.1.* > > The query I am running performs a map-side join of two subqueries (s1 and > s2)--s1 is a multi-table join and s2 is a union of a table against itself. > If we add a mapside join to s1 (which is necessary for performance) the > query produces different (and incorrect) results. > > Basically, s1 returns one row and s2 returns 2 rows. By joining them, it > should give us two rows. See below for output and queries (I abridged the > queries a bit for readability). > > NOTE: I searched online and found another thread reporting a similar issue > to what I have seen, > http://mail-archives.apache.org/mod_mbox/hive-user/201207.mbox/%3CCADejTpz5rSf-Sxi9HNoCTRHVazOk=F+bu6XPL=k7tbknqj9...@mail.gmail.com%3E. > Unfortunately, this thread never got answered. It looks like a bug in Hive. > I would greatly appreciate it if anyone can give thoughts on this issue. > > > Q1 output (incorrect, missing one row): > 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094 > 2012-08-15 18:56:38 1315 0 US 728 90 0 97525 1 0 NULL 1 0 NULL > > Q2 output (correct): > 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094 > 2012-08-15 18:56:38 1315 0 US 728 90 0 97525 1 0 NULL 10 NULL 0 > 2012-08-15 18:56:38 2258929231696355094 1 10 2258929231696355094 > 2012-08-15 18:56:38 1315 0 US 728 90 0 97525 1 0 NULL 10 NULL 1 > > Q1 and Q2 only differ in that s1 in Q1 has a mapside join. > >- Q1 (incorrect results, highlighted portion is the only difference >from Q2): > > > SELECT /*+MAPJOIN(table) */ * > FROM > ( > SELECT /*+ MAPJOIN(tableA,tableB) */ >id, >... 
> FROM main_table ctg > JOIN sample smp > ON ctg.id = smp.id > LEFT OUTER JOIN tableA > ON smp.publisher_id = tableA.id > LEFT OUTER JOIN tableB > ON smp.inventory_source_id = tableB.id > WHERE ctg.date_time >= '2012-08-15 00:00:00' and > ctg.date_time <= '2012-08-15 23:59:59' and > ctg.dd = '2012-08-15' and > smp.date_time >= '2012-08-15 00:00:00' and > smp.date_time <= '2012-08-15 23:59:59' and > smp.dd = '2012-08-15' > )sub > JOIN > ( > select > id, > parent_category_id, > cast(0 as tinyint) as is_parent > from table_category > where is_system = 1 > union all > select > id, > parent_category_id, > cast(1 as tinyint) as is_parent > from table_category > where is_system = 1 > )table > ON table.id = sub.id > > >- Q2 (correct results): > > SELECT /*+MAPJOIN(table) */ * > FROM > ( > SELECT >id, >... > FROM main_table ctg > JOIN sample smp > ON ctg.id = smp.id > LEFT OUTER JOIN tableA > ON smp.publisher_id = tableA.id > LEFT OUTER JOIN tableB > ON smp.inventory_source_id = tableB.id > WHERE ctg.date_time >= '2012-08-15 00:00:00' and > ctg.date_time <= '2012-08-15 23:59:59' and > ctg.dd = '2012-08-15' and > smp.date_time >= '2012-08-15 00:00:00' and > smp.date_time <= '2012-08-15 23:59:59' and > smp.dd = '2012-08-15' > )sub > JOIN > ( > select > id, > parent_category_id, > cast(0 as tinyint) as is_parent > from table_category > where is_system = 1 > union all > select > id, > parent_category_id, > cast(1 as tinyint) as is_parent > from table_category > where is_system = 1 > )table > ON table.id = sub.id > > -- > Chen Song > > > -- Chen Song
Re: map side join with group by
Thanks Nitin. This is all I want to clarify :) Chen On Thu, Dec 13, 2012 at 2:30 PM, Nitin Pawar wrote: > to improve the speed of the job they created map only joins so that all > the records associated with a key fall to a map .. reducers slows it down. > If the reducer has to do some more job then they launch another job. > > bear in mind, when we say map only join we are absolutely sure that speed > will increase in case data in one of the tables is in the few hundred MB > ranges. If this has to do with reduce in hand, the processing logic > completely changes and it also slows down. > > Launching a new job for group by is a neat way to measure how much time > you spent on just join and another on group by so you can easily see two > different things. > > There is no way you can ask a mapjoin to launch a reducer as it is not > supposed to do. > > If you have such case (may be if you think that it will improve > performance), please feel free to raise a jira and get it reviewed. if its > valid I think people will provide more ideas > > > On Fri, Dec 14, 2012 at 12:42 AM, Chen Song wrote: > >> Nitin >> >> Yeah. My original question is that is there a way to force Hive (or >> rather to say, is it possible) to execute map side join at mapper phase and >> group by in reduce phase. So instead of launching a map only job (join) and >> map reduce job (group by), doing it altogether in a single MR job. This is >> obviously not what Hive does but I am wondering if it is a nice feature to >> have. >> >> The point you made (different keys in join and group by) only matters >> when it is the time in reduce phase, right? As map side join takes care of >> join at mapper phase, it sounds to me natural that group by can be done in >> the reduce phase in the same job. The only hassle that I can think of is >> that map output have to be resorted (based on group by keys). >> >> Chen >> >> On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar wrote: >> >>> chen in mapside join .. 
there are no reducers .. its MAP ONLY job >>> >>> >>> On Thu, Dec 13, 2012 at 11:54 PM, Chen Song wrote: >>> >>>> Understood that fact that it is impossible in the same MR job if both >>>> join and group by are gonna happen in the reduce phase (because the join >>>> keys and group by keys are different). But for map side join, the joins >>>> would be complete by the end of the map phase, and outputs should be ready >>>> to be distributed to reducers based on group by keys. >>>> >>>> Chen >>>> >>>> >>>> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar >>>> wrote: >>>> >>>>> Thats because for the first job the join keys are different and second >>>>> job group by keys are different, you just cant assume join keys and group >>>>> keys will be same so they are two different jobs >>>>> >>>>> >>>>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song wrote: >>>>> >>>>>> Yeah, my abridged version of query might be a little broken but my >>>>>> point is that when a query has a map join and group by, even in its >>>>>> simplified incarnation, it will launch two jobs. I was just wondering why >>>>>> map join and group by cannot be accomplished in one MR job. >>>>>> >>>>>> Best, >>>>>> Chen >>>>>> >>>>>> >>>>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar < >>>>>> nitinpawar...@gmail.com> wrote: >>>>>> >>>>>>> I think Chen wanted to know why this is two phased query if I >>>>>>> understood it correctly >>>>>>> >>>>>>> When you run a mapside join .. it just performs the join query .. >>>>>>> after that to execute the group by part it launches the second job. >>>>>>> I may be wrong but this is how I saw it whenever I executed group by >>>>>>> queries >>>>>>> >>>>>>> >>>>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover < >>>>>>> grover.markgro...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi Chen, >>>>>>>> I think we would need some more information. >>>>>>>> >>>>>>>> The query is referring to a table called "d" in the MAPJOIN hint but
Re: map side join with group by
Nitin Yeah. My original question is that is there a way to force Hive (or rather to say, is it possible) to execute map side join at mapper phase and group by in reduce phase. So instead of launching a map only job (join) and map reduce job (group by), doing it altogether in a single MR job. This is obviously not what Hive does but I am wondering if it is a nice feature to have. The point you made (different keys in join and group by) only matters when it is the time in reduce phase, right? As map side join takes care of join at mapper phase, it sounds to me natural that group by can be done in the reduce phase in the same job. The only hassle that I can think of is that map output have to be resorted (based on group by keys). Chen On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar wrote: > chen in mapside join .. there are no reducers .. its MAP ONLY job > > > On Thu, Dec 13, 2012 at 11:54 PM, Chen Song wrote: > >> Understood that fact that it is impossible in the same MR job if both >> join and group by are gonna happen in the reduce phase (because the join >> keys and group by keys are different). But for map side join, the joins >> would be complete by the end of the map phase, and outputs should be ready >> to be distributed to reducers based on group by keys. >> >> Chen >> >> >> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar wrote: >> >>> Thats because for the first job the join keys are different and second >>> job group by keys are different, you just cant assume join keys and group >>> keys will be same so they are two different jobs >>> >>> >>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song wrote: >>> >>>> Yeah, my abridged version of query might be a little broken but my >>>> point is that when a query has a map join and group by, even in its >>>> simplified incarnation, it will launch two jobs. I was just wondering why >>>> map join and group by cannot be accomplished in one MR job. 
>>>> >>>> Best, >>>> Chen >>>> >>>> >>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar >>>> wrote: >>>> >>>>> I think Chen wanted to know why this is two phased query if I >>>>> understood it correctly >>>>> >>>>> When you run a mapside join .. it just performs the join query .. >>>>> after that to execute the group by part it launches the second job. >>>>> I may be wrong but this is how I saw it whenever I executed group by >>>>> queries >>>>> >>>>> >>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover < >>>>> grover.markgro...@gmail.com> wrote: >>>>> >>>>>> Hi Chen, >>>>>> I think we would need some more information. >>>>>> >>>>>> The query is referring to a table called "d" in the MAPJOIN hint but >>>>>> there is not such table in the query. Moreover, Map joins only make >>>>>> sense when the right table is the one being "mapped" (in other words, >>>>>> being kept in memory) in case of a Left Outer Join, similarly if the >>>>>> left table is the one being "mapped" in case of a Right Outer Join. >>>>>> Let me know if this is not clear, I'd be happy to offer a better >>>>>> explanation. >>>>>> >>>>>> In your query, the where clause on a column called "hour", at this >>>>>> point I am unsure if that's a column of table1 or table2. If it's >>>>>> column on table1, that predicate would get pushed up (if you have >>>>>> hive.optimize.ppd property set to true), so it could possibly be done >>>>>> in 1 MR job (I am not sure if that's presently the case, you will have >>>>>> to check the explain plan). If however, the where clause is on a >>>>>> column in the right table (table2 in your example), it can't be pushed >>>>>> up since a column of the right table can have different values before >>>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need >>>>>> to be applied in a separate MR job. >>>>>> >>>>>> This is just my understanding, the full proof answer would lie in >>>>>> checking out the explain plans and the Semantic Analyzer code. 
>>>>>> >>>>>> And for completeness, there is a conditional task (starting Hive 0
Re: map side join with group by
Understood that fact that it is impossible in the same MR job if both join and group by are gonna happen in the reduce phase (because the join keys and group by keys are different). But for map side join, the joins would be complete by the end of the map phase, and outputs should be ready to be distributed to reducers based on group by keys. Chen On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar wrote: > Thats because for the first job the join keys are different and second job > group by keys are different, you just cant assume join keys and group keys > will be same so they are two different jobs > > > On Thu, Dec 13, 2012 at 8:26 PM, Chen Song wrote: > >> Yeah, my abridged version of query might be a little broken but my point >> is that when a query has a map join and group by, even in its simplified >> incarnation, it will launch two jobs. I was just wondering why map join and >> group by cannot be accomplished in one MR job. >> >> Best, >> Chen >> >> >> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar wrote: >> >>> I think Chen wanted to know why this is two phased query if I understood >>> it correctly >>> >>> When you run a mapside join .. it just performs the join query .. after >>> that to execute the group by part it launches the second job. >>> I may be wrong but this is how I saw it whenever I executed group by >>> queries >>> >>> >>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover < >>> grover.markgro...@gmail.com> wrote: >>> >>>> Hi Chen, >>>> I think we would need some more information. >>>> >>>> The query is referring to a table called "d" in the MAPJOIN hint but >>>> there is not such table in the query. Moreover, Map joins only make >>>> sense when the right table is the one being "mapped" (in other words, >>>> being kept in memory) in case of a Left Outer Join, similarly if the >>>> left table is the one being "mapped" in case of a Right Outer Join. >>>> Let me know if this is not clear, I'd be happy to offer a better >>>> explanation. 
>>>> >>>> In your query, the where clause on a column called "hour", at this >>>> point I am unsure if that's a column of table1 or table2. If it's >>>> column on table1, that predicate would get pushed up (if you have >>>> hive.optimize.ppd property set to true), so it could possibly be done >>>> in 1 MR job (I am not sure if that's presently the case, you will have >>>> to check the explain plan). If however, the where clause is on a >>>> column in the right table (table2 in your example), it can't be pushed >>>> up since a column of the right table can have different values before >>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need >>>> to be applied in a separate MR job. >>>> >>>> This is just my understanding, the full proof answer would lie in >>>> checking out the explain plans and the Semantic Analyzer code. >>>> >>>> And for completeness, there is a conditional task (starting Hive 0.7) >>>> that will convert your joins automatically to map joins where >>>> applicable. This can be enabled by enabling hive.auto.convert.join >>>> property. >>>> >>>> Mark >>>> >>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song >>>> wrote: >>>> > I have a silly question on how Hive interpretes a simple query with >>>> both map >>>> > side join and group by. >>>> > >>>> > Below query will translate into two jobs, with the 1st one as a map >>>> only job >>>> > doing the join and storing the output in a intermediary location, and >>>> the >>>> > 2nd one as a map-reduce job taking the output of the 1st job as input >>>> and >>>> > doing the group by. >>>> > >>>> > SELECT >>>> > /*+ MAPJOIN(d) */ >>>> > table.a, sum(table2.b) >>>> > from table >>>> > LEFT OUTER JOIN table2 >>>> > ON table.id = table2.id >>>> > where hour = '2012-12-11 11' >>>> > group by table.a >>>> > >>>> > Why can't this be done within a single map reduce job? As what I can >>>> see >>>> > from the query plan is that all 2nd job mapper do is taking the 1st >>>> job's >>>> > mapper output. 
>>>> > >>>> > -- >>>> > Chen Song >>>> > >>>> > >>>> >>> >>> >>> >>> -- >>> Nitin Pawar >>> >> >> >> >> -- >> Chen Song >> >> >> > > > -- > Nitin Pawar > -- Chen Song
Re: map side join with group by
Yeah, my abridged version of query might be a little broken but my point is that when a query has a map join and group by, even in its simplified incarnation, it will launch two jobs. I was just wondering why map join and group by cannot be accomplished in one MR job. Best, Chen On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar wrote: > I think Chen wanted to know why this is two phased query if I understood > it correctly > > When you run a mapside join .. it just performs the join query .. after > that to execute the group by part it launches the second job. > I may be wrong but this is how I saw it whenever I executed group by > queries > > > On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover > wrote: > >> Hi Chen, >> I think we would need some more information. >> >> The query is referring to a table called "d" in the MAPJOIN hint but >> there is not such table in the query. Moreover, Map joins only make >> sense when the right table is the one being "mapped" (in other words, >> being kept in memory) in case of a Left Outer Join, similarly if the >> left table is the one being "mapped" in case of a Right Outer Join. >> Let me know if this is not clear, I'd be happy to offer a better >> explanation. >> >> In your query, the where clause on a column called "hour", at this >> point I am unsure if that's a column of table1 or table2. If it's >> column on table1, that predicate would get pushed up (if you have >> hive.optimize.ppd property set to true), so it could possibly be done >> in 1 MR job (I am not sure if that's presently the case, you will have >> to check the explain plan). If however, the where clause is on a >> column in the right table (table2 in your example), it can't be pushed >> up since a column of the right table can have different values before >> and after the LEFT OUTER JOIN. Therefore, the where clause would need >> to be applied in a separate MR job. 
>> >> This is just my understanding, the full proof answer would lie in >> checking out the explain plans and the Semantic Analyzer code. >> >> And for completeness, there is a conditional task (starting Hive 0.7) >> that will convert your joins automatically to map joins where >> applicable. This can be enabled by enabling hive.auto.convert.join >> property. >> >> Mark >> >> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song >> wrote: >> > I have a silly question on how Hive interpretes a simple query with >> both map >> > side join and group by. >> > >> > Below query will translate into two jobs, with the 1st one as a map >> only job >> > doing the join and storing the output in a intermediary location, and >> the >> > 2nd one as a map-reduce job taking the output of the 1st job as input >> and >> > doing the group by. >> > >> > SELECT >> > /*+ MAPJOIN(d) */ >> > table.a, sum(table2.b) >> > from table >> > LEFT OUTER JOIN table2 >> > ON table.id = table2.id >> > where hour = '2012-12-11 11' >> > group by table.a >> > >> > Why can't this be done within a single map reduce job? As what I can see >> > from the query plan is that all 2nd job mapper do is taking the 1st >> job's >> > mapper output. >> > >> > -- >> > Chen Song >> > >> > >> > > > > -- > Nitin Pawar > -- Chen Song
map side join with group by
I have a silly question about how Hive interprets a simple query with both a map-side join and a group by. The query below translates into two jobs: the 1st is a map-only job that does the join and stores its output in an intermediate location, and the 2nd is a map-reduce job that takes the 1st job's output as input and does the group by. SELECT /*+ MAPJOIN(d) */ table.a, sum(table2.b) from table LEFT OUTER JOIN table2 ON table.id = table2.id where hour = '2012-12-11 11' group by table.a Why can't this be done within a single map-reduce job? From what I can see in the query plan, all the 2nd job's mappers do is take the 1st job's mapper output. -- Chen Song
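For reference, here is a hedged, cleaned-up sketch of the query (the MAPJOIN hint has to name an alias that actually appears in the FROM clause; the aliases t1/t2 are mine, not from the original mail), with EXPLAIN prepended to inspect how many stages Hive plans:

```sql
-- EXPLAIN shows the stage plan without running anything.
-- For a LEFT OUTER JOIN only the right-hand table (t2 here) can be
-- the in-memory ("mapped") side, so the hint names t2.
EXPLAIN
SELECT /*+ MAPJOIN(t2) */ t1.a, SUM(t2.b)
FROM table1 t1
LEFT OUTER JOIN table2 t2 ON t1.id = t2.id
WHERE t1.hour = '2012-12-11 11'
GROUP BY t1.a;
```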
Re: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
Is there anything interesting on hive.log? If you are running Hive shell with root user, by default, the hive.log should be in /tmp/hive.log. BTW, are you using MySQL as your metastore? On Mon, Oct 29, 2012 at 12:22 PM, Robin Verlangen wrote: > I have to add, while trying to do other operations on hive by hand, it > also fails. > > For example: > *hive> drop table my_other_table;* > *FAILED: Error in metadata: java.lang.reflect.UndeclaredThrowableException > * > *FAILED: Execution Error, return code 1 from > org.apache.hadoop.hive.ql.exec.DDLTask* > > > Best regards, > > Robin Verlangen > *Software engineer* > * > * > W http://www.robinverlangen.nl > E ro...@us2.nl > > <http://goo.gl/Lt7BC> > > Disclaimer: The information contained in this message and attachments is > intended solely for the attention and use of the named addressee and may be > confidential. If you are not the intended recipient, you are reminded that > the information remains the property of the sender. You must not use, > disclose, distribute, copy, print or rely on this e-mail. If you have > received this message in error, please contact the sender immediately and > irrevocably delete this message and any copies. > > > > 2012/10/29 Robin Verlangen > >> Hi Chen, >> >> The user that ran this job is root and all hdfs folders are also owned by >> root. >> >> Best regards, >> >> Robin Verlangen >> *Software engineer* >> * >> * >> W http://www.robinverlangen.nl >> E ro...@us2.nl >> >> <http://goo.gl/Lt7BC> >> >> Disclaimer: The information contained in this message and attachments is >> intended solely for the attention and use of the named addressee and may be >> confidential. If you are not the intended recipient, you are reminded that >> the information remains the property of the sender. You must not use, >> disclose, distribute, copy, print or rely on this e-mail. 
If you have >> received this message in error, please contact the sender immediately and >> irrevocably delete this message and any copies. >> >> >> >> 2012/10/29 Chen Song >> >>> Looks to me the permission issue. >>> >>> Can you check if the user (which ran the hive query) has write >>> permission on */user/hive/warehouse/mydatabase.db/mytable?* >>> >>> On Mon, Oct 29, 2012 at 8:38 AM, Robin Verlangen wrote: >>> >>>> Hi there, >>>> >>>> Since today our Hive jobs suddenly fail (nothing changed actually). The >>>> end looks like this: >>>> >>>> *MapReduce Total cumulative CPU time: 38 minutes 31 seconds 260 msec* >>>> *Ended Job = job_201210291304_0015* >>>> *Loading data to table mydatabase.mytable* >>>> *rmr: DEPRECATED: Please use 'rm -r' instead.* >>>> *Deleted /user/hive/warehouse/mydatabase.db/mytable* >>>> *Failed with exception null* >>>> *FAILED: Execution Error, return code 1 from >>>> org.apache.hadoop.hive.ql.exec.MoveTask* >>>> *MapReduce Jobs Launched:* >>>> *Job 0: Map: 196 Reduce: 3 Cumulative CPU: 2311.26 sec HDFS Read: >>>> 0 HDFS Write: 0 SUCCESS* >>>> *Total MapReduce CPU Time Spent: 38 minutes 31 seconds 260 msec* >>>> >>>> Does anyone have a clue how to resolve this? >>>> >>>> Best regards, >>>> >>>> Robin Verlangen >>>> *Software engineer* >>>> * >>>> * >>>> W http://www.robinverlangen.nl >>>> E ro...@us2.nl >>>> >>>> <http://goo.gl/Lt7BC> >>>> >>>> Disclaimer: The information contained in this message and attachments >>>> is intended solely for the attention and use of the named addressee and may >>>> be confidential. If you are not the intended recipient, you are reminded >>>> that the information remains the property of the sender. You must not use, >>>> disclose, distribute, copy, print or rely on this e-mail. If you have >>>> received this message in error, please contact the sender immediately and >>>> irrevocably delete this message and any copies. >>>> >>>> >>> >>> >>> -- >>> Chen Song >>> >>> >>> >> > -- Chen Song
Re: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
Looks to me like a permission issue. Can you check whether the user (who ran the hive query) has write permission on */user/hive/warehouse/mydatabase.db/mytable*? On Mon, Oct 29, 2012 at 8:38 AM, Robin Verlangen wrote: > Hi there, > > Since today our Hive jobs suddenly fail (nothing changed actually). The > end looks like this: > > *MapReduce Total cumulative CPU time: 38 minutes 31 seconds 260 msec* > *Ended Job = job_201210291304_0015* > *Loading data to table mydatabase.mytable* > *rmr: DEPRECATED: Please use 'rm -r' instead.* > *Deleted /user/hive/warehouse/mydatabase.db/mytable* > *Failed with exception null* > *FAILED: Execution Error, return code 1 from > org.apache.hadoop.hive.ql.exec.MoveTask* > *MapReduce Jobs Launched:* > *Job 0: Map: 196 Reduce: 3 Cumulative CPU: 2311.26 sec HDFS Read: 0 > HDFS Write: 0 SUCCESS* > *Total MapReduce CPU Time Spent: 38 minutes 31 seconds 260 msec* > > Does anyone have a clue how to resolve this? > > Best regards, > > Robin Verlangen > *Software engineer* > * > * > W http://www.robinverlangen.nl > E ro...@us2.nl > > <http://goo.gl/Lt7BC> > > Disclaimer: The information contained in this message and attachments is > intended solely for the attention and use of the named addressee and may be > confidential. If you are not the intended recipient, you are reminded that > the information remains the property of the sender. You must not use, > disclose, distribute, copy, print or rely on this e-mail. If you have > received this message in error, please contact the sender immediately and > irrevocably delete this message and any copies. > > -- Chen Song
Re: general question on hive client log
Thanks Nitin. I also found this wiki, but it only documents the hive client log. What about the other log files created during a query? Chen On Thu, Oct 25, 2012 at 7:30 AM, Nitin Pawar wrote: > by default hive reads the logging related configs from > hive-log4j.default in conf directory. > if you want the cwiki page, here it is > > https://cwiki.apache.org/Hive/languagemanual-cli.html#LanguageManualCli-Logging > > > > > On Thu, Oct 25, 2012 at 1:15 AM, Chen Song wrote: > > I have searched online but could not find a comprehensive introduction on > > how Hive organize logs and naming convention. By that, I mean hive client > > specifically, not hadoop logs. > > > > I have seen various logs as below. > > > > 1. Hive shell log > > /tmp/{user.name}/hive.log > > > > 2. Hive history file for a job > > /tmp/{user.name}/hive_job_log_{job id}.txt > > > > 3. Hive execution log > > /tmp/{user.name}/hive_{some random unique number}.log > > > > 4. Other logs > > /tmp/{user.name}/hive_{some random unique number}.pipeout but always > empty. > > > > Can anyone point me to some page where this is documented in details? > Thanks > > very much. > > > > -- > > Chen Song > > > > > > > > -- > Nitin Pawar > -- Chen Song
Re: Error in semantic analysis: Unable to fetch table
Hi Keith Can you get anything when you run "show databases" from php? Chen On Fri, Oct 19, 2012 at 2:48 AM, Keith Wiley wrote: > I never needed the -S. I would just call it with -e, but yeah, that's the > basic idea. The output is gathered in the remaining optional arguments to > exec(). > > On Oct 18, 2012, at 22:15 , MiaoMiao wrote: > > > What php, you are using exec('hive -S -e "select * from some_table"') > > or something like that in php? > > > > On Fri, Oct 19, 2012 at 4:46 AM, Keith Wiley > wrote: > >> If I run a select command on the command line with -e it works. If I > run the same command inside hive (without -e) it still works...but if I > exec it with a -e from php (so obviously the apache user), I get "Error in > semantic analysis: Unable to fetch table [tablename]". Bear in mind that > this overall system worked perfectly for weeks. I haven't modified the > code or anything. What did change was I had to reinstall hadoop and hive > and I'm sure some configuration parameter is not correct now. I found some > Google references to this error but no clear-but solutions. > >> > >> What does this error mean and how do I fix it? > >> > >> Thanks. > > > > > Keith Wiley kwi...@keithwiley.com keithwiley.com > music.keithwiley.com > > "Luminous beings are we, not this crude matter." >-- Yoda > > > > -- Chen Song
Re: can i define an udf which can process more than one argument?
A UDF can process multiple arguments; you just need to declare the evaluate signature accordingly. For example: evaluate(String s1, String s2, boolean t) On Thu, Oct 18, 2012 at 11:36 PM, Chris Gong wrote: > i find that UDF can only process one argument like > public long evaluate(String ip) { > > } > > can i define a udf like COALESCE(T v1, T v2, …) or if(boolean > testCondition, T valueTrue, T valueFalseOrNull)? > > ------ > Chris Gong > -- Chen Song
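To make the point concrete, here is a minimal sketch of a multi-argument evaluate() method. A real Hive UDF would extend org.apache.hadoop.hive.ql.exec.UDF; that superclass is left out here so the sketch compiles and runs standalone, and the class and argument names are made up for illustration.

```java
// Sketch of a Hive UDF whose evaluate() takes several arguments.
public class JoinIfUdfSketch {
    // Hive matches the call site against this signature, e.g.
    //   SELECT join_if(col_a, col_b, true) FROM t;
    public String evaluate(String s1, String s2, Boolean upper) {
        if (s1 == null || s2 == null || upper == null) {
            return null; // pass NULLs through, as Hive UDFs usually do
        }
        String joined = s1 + "-" + s2;
        return upper ? joined.toUpperCase() : joined;
    }

    public static void main(String[] args) {
        JoinIfUdfSketch udf = new JoinIfUdfSketch();
        System.out.println(udf.evaluate("foo", "bar", true));  // FOO-BAR
        System.out.println(udf.evaluate("foo", "bar", false)); // foo-bar
    }
}
```

Overloaded evaluate() methods with different arities are also allowed; Hive picks the one matching the call.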
Re: ERROR: Hive subquery showing
Sorry that I misunderstood the question. I think Phil's query will do the trick. On Thu, Sep 27, 2012 at 4:46 PM, Philip Tromans wrote: > How about: > select name from ABC order by grp desc limit 1? > > Phil. > On Sep 27, 2012 9:02 PM, "yogesh dhari" wrote: > >> Hi Bejoy, >> >> I tried this one also but here it throws horrible error: >> >> i.e: >> >> hive: select name from ABD where grp=MAX(grp); >> >> FAILED: Hive Internal Error: java.lang.NullPointerException(null) >> java.lang.NullPointerException >> at >> org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:214) >> at >> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:684) >> at >> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:805) >> at >> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89) >> at >> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88) >> at >> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125) >> at >> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102) >> at >> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:161) >> at >> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7506) >> at >> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7464) >> at >> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:1513) >> at >> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:1494) >> at >> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:5886) >> at >> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6524) >> at >> 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7282) >> at >> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243) >> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:430) >> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:337) >> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:889) >> at >> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255) >> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212) >> at >> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403) >> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671) >> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554) >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> at java.lang.reflect.Method.invoke(Method.java:601) >> at org.apache.hadoop.util.RunJar.main(RunJar.java:156) >> >> Regards >> Yogesh Kumar >> >> -- >> Subject: Re: ERROR: Hive subquery showing >> To: user@hive.apache.org >> From: bejoy...@yahoo.com >> Date: Thu, 27 Sep 2012 19:48:25 + >> >> Hi yogesh >> >> What about a query like this >> select name from ABC WHERE grp=MAX(grp); >> >> Regards >> Bejoy KS >> >> Sent from handheld, please excuse typos. >> -- >> *From: * Chen Song >> *Date: *Thu, 27 Sep 2012 15:33:11 -0400 >> *To: * >> *ReplyTo: * user@hive.apache.org >> *Subject: *Re: ERROR: Hive subquery showing >> >> Can you try this? 
>> * >> * >> *select name from ( select MAX(grp) as name from ABC ) gy ;* >> >> On Thu, Sep 27, 2012 at 3:29 PM, yogesh dhari wrote: >> >> Hi all, >> >> I have a table called ABC, like >> >> namegrp >> A 1 >> B 2 >> C 4 >> D 8 >> >> I want the output like the name having greatest grp i.e D; >> >> I wrote a query: >> * >> select name from ( select MAX(grp) from ABC ) gy ;* >> >> but it gives error >> >> FAILED: Error in semantic analysis: Line 1:7 Invalid table alias or >> column reference 'name': (possible column names are: _col0) >> >> Please help and suggest why it is so, and what would be the query; >> >> >> Thanks & regards >> Yogesh Kumar >> >> >> >> >> >> >> >> >> -- >> Chen Song >> >> >> -- Chen Song
Re: ERROR: Hive subquery showing
Can you try this? * * *select name from ( select MAX(grp) as name from ABC ) gy ;* On Thu, Sep 27, 2012 at 3:29 PM, yogesh dhari wrote: > Hi all, > > I have a table called ABC, like > > namegrp > A 1 > B 2 > C 4 > D 8 > > I want the output like the name having greatest grp i.e D; > > I wrote a query: > * > select name from ( select MAX(grp) from ABC ) gy ;* > > but it gives error > > FAILED: Error in semantic analysis: Line 1:7 Invalid table alias or column > reference 'name': (possible column names are: _col0) > > Please help and suggest why it is so, and what would be the query; > > > Thanks & regards > Yogesh Kumar > > > > > > -- Chen Song
Re: Re: size of RCFile in hive
You can force a reduce phase by adding a DISTRIBUTE BY or ORDER BY clause after your SELECT query. On Thu, Sep 27, 2012 at 2:03 PM, 王锋 wrote: > but it's map only job > > > At 2012-09-27 05:39:39,"Chen Song" wrote: > > As far as I know, the number of files emitted would be determined by the > number of mappers for a map only job and the number of reducers for a map > reduce job. > > So it totally depends how your query translates into a MR job. > > You can enforce it by setting the property > > *mapred.reduce.tasks=1* > > Chen > > On Wed, Sep 19, 2012 at 11:25 PM, 王锋 wrote: > >> Hi >>I tried to convert and merge many small text files using RCFiles using >> hivesql,but hive produced some small rcfiles. >> set hive.exec.compress.output=true; >> set mapred.output.compress=true; >> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec; >> set io.compression.codecs=com.hadoop.compression.lzo.LzoCodec; >> hive.merge.mapfiles=true >> hive.merge.mapredfiles=true >> hive.merge.size.per.task=64000 >> hive.merge.size.smallfiles.avgsize=8000 >> insert overwrite table rctable select . >> >> >> the settings: >> hive.merge.mapfiles=true >> hive.merge.mapredfiles=true >> hive.merge.size.per.task=64000 >> hive.merge.size.smallfiles.avgsize=8000 >> didn't work. >> >> >> who could tell me how to solve it? > > > > > -- > Chen Song > > > > > -- Chen Song
Re: size of RCFile in hive
As far as I know, the number of files emitted is determined by the number of mappers for a map-only job and the number of reducers for a map-reduce job. So it depends entirely on how your query translates into an MR job. You can cap the output at a single file by setting the property *mapred.reduce.tasks=1* Chen On Wed, Sep 19, 2012 at 11:25 PM, 王锋 wrote: > Hi >I tried to convert and merge many small text files using RCFiles using > hivesql,but hive produced some small rcfiles. > set hive.exec.compress.output=true; > set mapred.output.compress=true; > set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec; > set io.compression.codecs=com.hadoop.compression.lzo.LzoCodec; > hive.merge.mapfiles=true > hive.merge.mapredfiles=true > hive.merge.size.per.task=64000 > hive.merge.size.smallfiles.avgsize=8000 > insert overwrite table rctable select . > > > the settings: > hive.merge.mapfiles=true > hive.merge.mapredfiles=true > hive.merge.size.per.task=64000 > hive.merge.size.smallfiles.avgsize=8000 > didn't work. > > > who could tell me how to solve it? -- Chen Song
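As a hedged illustration (the table and column names are made up, and the size values are examples only, not recommendations — note the merge sizes are in bytes, so 64000 is only about 64 KB), the merge properties need to be issued as SET commands in the same session as the INSERT, and a DISTRIBUTE BY forces a reduce phase so the reducer count bounds the file count:

```sql
-- Merge sizes are in bytes; the values below are illustrative only.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.size.smallfiles.avgsize=16000000;
-- Alternatively, force a single reducer so exactly one file is written:
SET mapred.reduce.tasks=1;
INSERT OVERWRITE TABLE rctable
SELECT col_a, col_b FROM small_text_table DISTRIBUTE BY col_a;
```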
Re: How can I get the constant value from the ObjectInspector in the UDF
Hi Yong The way GenericUDF works is as follows. *ObjectInspector initialize(ObjectInspector[] arguments) *is called only once for one GenericUDF instance used in your Hive query. This phase is for preparation steps of UDF, such as syntax check and type inference. *Object evaluate(DeferredObject[] arguments)* is called to evaluate against actual arguments. This should be where the actual calculation happens and where you can get the real values you talked about. Thanks, Chen On Wed, Sep 26, 2012 at 4:17 AM, java8964 java8964 wrote: > Hi, I am using Cloudera release cdh3u3, which has the hive 0.71 version. > > I am trying to write a hive UDF function as to calculate the moving sum. > Right now, I am having trouble to get the constrant value passed in in the > initialization stage. > > For example, let's assume the function is like the following format: > > msum(salary, 10) - salary is a int type column > > which means the end user wants to calculate the last 10 rows of salary. > > I kind of know how to implement this UDF. But I have one problem right now. > > 1) This is not a UDAF, as each row will return one data back as the moving > sum. > 2) I create an UDF class extends from the GenericUDF. > 3) I can get the column type from the ObjectInspector[] passed to me in > the initialize() method to verify that 'salary' and 10 both needs to be > numeric type (later one needs to be integer) > 4) But I also want to get the real value of 10, in this case, in the > initialize() stage, so I can create the corresponding data structure based > on the value end user specified here. > 5) I looks around the javadoc of ObjectInspector class. I know at run time > the real class of the 2nd parameter is WritableIntObjectInspector. I can > get the type, but how I can get the real value of it? > 6) This is kind of ConstantsObjectInspector, should be able to give the > value to me, as it already knows the type is int. What how? 
> 7) I don't want to try to get the value at the evaluate stage. Can I get > this value at the initialize stage? > > Thanks > > Yong > -- Chen Song
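Regardless of where the constant is read, once the window size is known the per-row state the UDF needs is just a fixed-size circular buffer. A Hive-free sketch of that data structure (class and method names are mine, purely illustrative):

```java
// A fixed-size circular buffer that yields a moving sum per row; the
// buffer size is the constant the UDF receives (e.g. msum(salary, 10)).
public class MovingSum {
    private final long[] window;
    private int next = 0;   // slot to overwrite next
    private int count = 0;  // values seen so far, capped at window length
    private long sum = 0;

    public MovingSum(int size) {
        window = new long[size];
    }

    // Called once per row; returns the sum of the last `size` values.
    public long add(long value) {
        if (count == window.length) {
            sum -= window[next];  // evict the oldest value
        } else {
            count++;
        }
        window[next] = value;
        next = (next + 1) % window.length;
        sum += value;
        return sum;
    }

    public static void main(String[] args) {
        MovingSum ms = new MovingSum(3);
        for (long v : new long[] {1, 2, 3, 4}) {
            System.out.println(ms.add(v)); // prints 1, 3, 6, 9
        }
    }
}
```

With a window of 3, feeding 1, 2, 3, 4 produces 1, 3, 6, 9: the fourth call evicts the 1 before adding the 4.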
Question on escaped characters in Hive Shell
Hi All I have a question on quoted Hive queries when executed with 'hive -e'. The command I ran looks like: > hive -e "select regexp_extract(col1, '\\d+') from A where col2='some value' > limit 5" When the query gets passed into hive, it is interpreted as > select regexp_extract(col1, '\d+') from A where col2='some value' limit 5 As you can see, the regex pattern is translated from '\\d+' to '\d+' due to the escape character. By using single quotes at the outer level and double quotes within the query itself, I can get around this problem. It also works to put the query in a file and run it with 'hive -f'. But both approaches need amendments to the application and feel a little tactical. My questions are 1) Does Hive handle the escaping, or does the Linux shell interpret it before passing the query to Hive? 2) Other than the 2 workarounds I mentioned, is there any other option? Thank you for your feedback in advance. - Chen
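The first question can be checked without Hive at all: inside double quotes the shell consumes one level of backslash escaping before the string ever reaches 'hive -e', while single quotes pass it through verbatim. printf shows the difference (the query text here is just a sample string, not run against any cluster):

```shell
# Double quotes: the shell collapses \\ to \ before hive sees the string.
printf '%s\n' "select regexp_extract(col1, '\\d+') from A"
# -> select regexp_extract(col1, '\d+') from A

# Single outer quotes: the string is passed through untouched.
printf '%s\n' 'select regexp_extract(col1, "\\d+") from A'
# -> select regexp_extract(col1, "\\d+") from A
```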
Re: question on hive multiple insert
Please disregard my previous email. I figured out that the correct syntax should be FROM a JOIN b on a.col1 = b.col1 INSERT a.col1, a.col2 INSERT ... Thanks Chen From: Chen Song To: hive user list Sent: Friday, January 20, 2012 10:47 AM Subject: question on hive multiple insert Hi I am reading on Hive's multiple insert syntax manual and wondering if it is possible to utilize join in any individual insert. e.g., FROM a INSERT a.col1, a.col2 join b on a.col1 = b.col1 INSERT ... Apparently, Hive doesn't like this query and give syntax error. In other words, my question in general is, for multiple insert, can we specify multiple tables as data sources (of course, can benefit from one pass scan on only one table), or only one table from which the data can be pulled from? Best, Chen
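For the record, a hedged sketch of a multi-insert with the join done once in the shared FROM clause (the output table names out1/out2 and the second branch are made up for illustration):

```sql
-- The join is evaluated once in the shared FROM clause; each INSERT
-- branch then consumes the joined rows, so the sources are scanned once.
FROM a JOIN b ON a.col1 = b.col1
INSERT OVERWRITE TABLE out1 SELECT a.col1, a.col2
INSERT OVERWRITE TABLE out2 SELECT b.col1, COUNT(*) GROUP BY b.col1;
```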
question on hive multiple insert
Hi I am reading Hive's multiple-insert syntax manual and wondering if it is possible to use a join in any individual insert, e.g.: FROM a INSERT a.col1, a.col2 join b on a.col1 = b.col1 INSERT ... Apparently, Hive doesn't like this query and gives a syntax error. In other words, my question in general is: for a multiple insert, can we specify multiple tables as data sources (the benefit, of course, being a one-pass scan over a single table), or only one table from which the data can be pulled? Best, Chen
Re: pass entire row as parameter in hive UDF
Can this be only used in regular select statement or also as arguments to UDF? In this case, how shall I define my UDF/GenericUDF method signature to accept column in this form? Will Hive automatically expand the column list and pass them to customized UDF? If there is any example, that would be very helpful. Thanks Chen From: Steven Wong To: "user@hive.apache.org" ; hive dev list Sent: Tuesday, November 1, 2011 10:20 PM Subject: RE: pass entire row as parameter in hive UDF Would https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-REGEXColumnSpecification work for you in the meantime? From:Chen Song [mailto:chens_alb...@yahoo.com] Sent: Monday, October 31, 2011 9:15 AM To: hive dev list; hive user list Subject: pass entire row as parameter in hive UDF Hi All In HIVE, I would like to write a UDF that accepts a sequence of parameters. Due to that the number of parameters is large and the particular function that I am writing is specific to a set of tables (joined in some way in the SQL), I am wondering if there is a way to pass the entire row as a wildcard parameter and then queried for its fields in UDF within this context, as shown in the below example. select my_function(*) as my_column from t1, t2, etc where [a set of join conditions]. I did some investigation and found there was a JIRA opened for this. https://issues.apache.org/jira/browse/HIVE-1459 This ticket is opened as a follow up to ticket HIVE-287 to support star expansion in general and seems still open. If anyone knows a way to pass the entire row as a context in UDF, that would be very helpful. Regards, Chen
pass entire row as parameter in hive UDF
Hi All In Hive, I would like to write a UDF that accepts a sequence of parameters. Because the number of parameters is large and the particular function I am writing is specific to a set of tables (joined in some way in the SQL), I am wondering if there is a way to pass the entire row as a wildcard parameter and then query it for its fields within the UDF, as shown in the example below. select my_function(*) as my_column from t1, t2, etc where [a set of join conditions]. I did some investigation and found there was a JIRA opened for this: https://issues.apache.org/jira/browse/HIVE-1459 This ticket was opened as a follow-up to HIVE-287 to support star expansion in general and still seems open. If anyone knows a way to pass the entire row as a context in a UDF, that would be very helpful. Regards, Chen