problems with beeline and "add file"
Hi,
I'm running hiveserver2 on machine X and beeline on machine Y. Everything works fine in terms of queries etc. However, if I try "add file blah.py" via the beeline CLI, it tries to find the file on machine X instead of shipping it over from Y. Is this the expected behavior, or have I misconfigured something?

Regards,
-- Rohan Monga
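With HiveServer2, `ADD FILE` paths are resolved on the server side, so what's described here appears to be the expected behavior: the path is interpreted on machine X, not shipped from the Beeline client on Y. A hedged sketch of two workarounds (all paths are illustrative):

```sql
-- Option 1 (assumes the script has been copied to the HiveServer2 host):
ADD FILE /path/on/machineX/blah.py;

-- Option 2 (assumes the script has been uploaded to HDFS first,
-- e.g. with `hdfs dfs -put blah.py /tmp/`):
ADD FILE hdfs:///tmp/blah.py;
```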
Re: casting complex data types for outputs of custom scripts
sorry for the delayed response. yes, the python script follows that.
-- Rohan Monga

On Tue, Jan 14, 2014 at 4:31 PM, Stephen Sprague wrote:
> @OP - the first thing i'd ask is: does your python script obey the ^A, ^B, ^C, ^D etc. nesting delimiter pattern? Given that your create table does not specify delimiters, those are the defaults. nb. ^A == control-A == \001
>
> Cheers,
> Stephen.
>
> On Tue, Jan 14, 2014 at 3:11 PM, Andre Araujo wrote:
>> I had a similar issue in the past when trying to cast an empty array to array(). By default Hive assumes it's an array().
>> I don't think there's currently a Hive syntax to cast values to complex data types. If there's one, I'd love to know what it is :)
>>
>> On 14 January 2014 10:22, rohan monga wrote:
>>> Hi,
>>>
>>> I have a table that is of the following format:
>>>
>>> create table t1 ( f1 int, f2 array<struct<...>> );
>>>
>>> Now I have a custom script that does some computation and generates the value for f2, like so:
>>>
>>> from (
>>>   from randomtable r
>>>   map r.g1, r.g2, r.g3
>>>   using '/bin/cat' as g1, g2, g3
>>>   cluster by g1 ) m
>>> insert overwrite table t1
>>> reduce m.g1, m.g2, m.g3
>>> using 'python customScript.py' as ( f1, f2 );
>>>
>>> however f2 is not being loaded properly into t1; it comes up broken or null. What should I do so that f2 is loaded as an array of structs?
>>>
>>> Thanks,
>>> -- Rohan Monga
>>
>> --
>> André Araújo
>> Big Data Consultant/Solutions Architect
>> The Pythian Group - Australia - www.pythian.com
>>
>> Office (calls from within Australia): 1300 366 021 x1270
>> Office (international): +61 2 8016 7000 x270 OR +1 613 565 8696 x1270
>> Mobile: +61 410 323 559
>> Fax: +61 2 9805 0544
>> IM: pythianaraujo @ AIM/MSN/Y! or ara...@pythian.com @ GTalk
>>
>> “Success is not about standing at the top, it's the steps you leave behind.” — Iker Pou (rock climber)
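To make Stephen's delimiter question concrete, here is a minimal, hypothetical sketch of the serialization a reduce script would need to produce for f2 (the struct member names and exact nesting depths are assumptions; the thread only establishes the ^A/^B/^C/^D nesting pattern):

```python
# Hypothetical helper for a TRANSFORM/reduce script emitting (f1, f2),
# where f2 is an array of structs. With default delimiters, array
# elements nest one level below the column and struct members one
# level below that.
ITEM_SEP = "\002"   # ^B: between elements of the array
FIELD_SEP = "\003"  # ^C: between members of each struct

def emit_row(f1, structs):
    """structs: a list of tuples, one tuple per struct in the array."""
    f2 = ITEM_SEP.join(FIELD_SEP.join(str(v) for v in s) for s in structs)
    # Columns emitted by a transform script are tab-separated by default.
    return "%s\t%s" % (f1, f2)
```

In the real script this would be called once per output row, writing the result to stdout.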
casting complex data types for outputs of custom scripts
Hi,

I have a table that is of the following format:

create table t1 ( f1 int, f2 array<struct<...>> );

Now I have a custom script that does some computation and generates the value for f2, like so:

from (
  from randomtable r
  map r.g1, r.g2, r.g3
  using '/bin/cat' as g1, g2, g3
  cluster by g1 ) m
insert overwrite table t1
reduce m.g1, m.g2, m.g3
using 'python customScript.py' as ( f1, f2 );

However, f2 is not being loaded properly into t1; it comes up broken or null. What should I do so that f2 is loaded as an array of structs?

Thanks,
-- Rohan Monga
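A sketch of the same table with its (default) delimiters spelled out explicitly, which makes the contract the custom script must follow visible in the DDL. The struct definition was lost in the archive, so the member names and types here are placeholders:

```sql
CREATE TABLE t1 (
  f1 INT,
  f2 ARRAY<STRUCT<g2:STRING, g3:STRING>>  -- hypothetical members
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'             -- ^A, the default
  COLLECTION ITEMS TERMINATED BY '\002'   -- ^B, the default
  MAP KEYS TERMINATED BY '\003';          -- ^C, the default
```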
Re: Invoke a UDAF inside another UDAF
thanks Mark, I ended up going the custom reducer way. I will try out the query you have sent. Regards, -- Rohan Monga On Wed, Feb 1, 2012 at 11:06 AM, Mark Grover wrote: > Rohan, > You could do it one of the following ways: > 1) Write a UDAF that does the avg(f2 - avg_f2) computation. > 2) Write a custom reducer that does the avg(f2 - avg_f2) computation. > 3) Do it with multiple passes over the data. Something like this > (untested): > > select > table.f1, > avg_table.avg_f2, > avg(table.f2-avg_table.avg_f2) > from > ( > select > f1, > avg(f2) as avg_f2 > from > table > group by > f1)avg_table > join > table > ON (avg_table.f1=table.f1) > group by > table.f1, > avg_table.avg_f2; > > Mark > > Mark Grover, Business Intelligence Analyst > OANDA Corporation > > www: oanda.com www: fxtrade.com > e: mgro...@oanda.com > > "Best Trading Platform" - World Finance's Forex Awards 2009. > "The One to Watch" - Treasury Today's Adam Smith Awards 2009. > > > - Original Message - > From: "rohan monga" > To: user@hive.apache.org > Sent: Friday, January 20, 2012 6:00:54 PM > Subject: Re: Invoke a UDAF inside another UDAF > > my bad, i hastily converted the query to a wrong example. > > it should be like this > > select f1, avg(f2) as avg_f2, avg(f2 - avg_f2) from table group by f1; > > In essence, I just want to use the value generated by one UDAF ( in this > case avg(f2) ) as a single number and then apply that value to the group > inside a different UDAF. > For e.g. if I were to use a streaming reducer, it would be something like > this > > avg1 = computeSum(list) / len(list) > return computeSum(x-avg1 for x in list) / len(list) > > As I write this I realize why this might not be possible [ the group > computation being done in one step and the information being lost ] :) > > But why the nullpointer exception? 
> > Regards, > -- > Rohan Monga > > > > On Fri, Jan 20, 2012 at 2:32 PM, Edward Capriolo < edlinuxg...@gmail.com> > wrote: > > > IMHO You can not possibly nest the percentile calculation because the > results would be meaningless. percentile has to aggregate a set and > pick the Nth element, But if you nest then the inner percentile only > returns one result to the outer percentile, and that is pretty > meaningless. > > (I think someone talked about this on list in the last month or so). > Without seeing your input data and your expected results, i can not > understand what your query wants to do, and suggest an alternative. > > > > > > On 1/20/12, rohan monga < monga.ro...@gmail.com > wrote: > > thanks edward that seems to work :) > > > > However, I have another query is like this > > > > select a, avg(b) as avg_b, percentile_approx( avg_b - percentile_approx( > b, > > .5), .5 ) from table1 group by a > > > > Here I will loose the group info if I include the inner query in the FROM > > clause, is there a way to get this to work? > > > > Thanks, > > -- > > Rohan Monga > > > > > > On Fri, Jan 20, 2012 at 12:51 PM, Edward Capriolo > > < edlinuxg...@gmail.com >wrote: > > > >> I think if you are grouping by b, b has to be in your select list. Try > >> this. > >> FROM ( > >> select b,count(a) as theCount from table one group by b > >> ) a select mean(theCount); > >> > >> I think that should work. 
> >> > >> On 1/20/12, rohan monga < monga.ro...@gmail.com > wrote: > >> > Hi, > >> > I am trying to run a query like > >> > "select mean(count(a)) from table1 group by b;" > >> > > >> > I am getting the following error > >> > > >> > FAILED: Hive Internal Error: java.lang.NullPointerException(null) > >> > java.lang.NullPointerException > >> > at > >> > > >> > org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:151) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:656) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:777) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89) > >> > at > >> > > >>
Re: Invoke a UDAF inside another UDAF
my bad, i hastily converted the query to a wrong example. it should be like this select f1, avg(f2) as avg_f2, avg(f2 - avg_f2) from table group by f1; In essence, I just want to use the value generated by one UDAF ( in this case avg(f2) ) as a single number and then apply that value to the group inside a different UDAF. For e.g. if I were to use a streaming reducer, it would be something like this avg1 = computeSum(list) / len(list) return computeSum(x-avg1 for x in list) / len(list) As I write this I realize why this might not be possible [ the group computation being done in one step and the information being lost ] :) But why the nullpointer exception? Regards, -- Rohan Monga On Fri, Jan 20, 2012 at 2:32 PM, Edward Capriolo wrote: > IMHO You can not possibly nest the percentile calculation because the > results would be meaningless. percentile has to aggregate a set and > pick the Nth element, But if you nest then the inner percentile only > returns one result to the outer percentile, and that is pretty > meaningless. > > (I think someone talked about this on list in the last month or so). > Without seeing your input data and your expected results, i can not > understand what your query wants to do, and suggest an alternative. > > > > On 1/20/12, rohan monga wrote: > > thanks edward that seems to work :) > > > > However, I have another query is like this > > > > select a, avg(b) as avg_b, percentile_approx( avg_b - percentile_approx( > b, > > .5), .5 ) from table1 group by a > > > > Here I will loose the group info if I include the inner query in the FROM > > clause, is there a way to get this to work? > > > > Thanks, > > -- > > Rohan Monga > > > > > > On Fri, Jan 20, 2012 at 12:51 PM, Edward Capriolo > > wrote: > > > >> I think if you are grouping by b, b has to be in your select list. Try > >> this. > >> FROM ( > >> select b,count(a) as theCount from table one group by b > >> ) a select mean(theCount); > >> > >> I think that should work. 
> >> > >> On 1/20/12, rohan monga wrote: > >> > Hi, > >> > I am trying to run a query like > >> > "select mean(count(a)) from table1 group by b;" > >> > > >> > I am getting the following error > >> > > >> > FAILED: Hive Internal Error: java.lang.NullPointerException(null) > >> > java.lang.NullPointerException > >> > at > >> > > >> > org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:151) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:656) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:777) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:157) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7447) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7405) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2747) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:3365) > >> > at > >> > > >> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:5858) > >> > at > >> > > >> > 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6480) > >> > at > >> > > >>
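Rohan's streaming-reducer pseudocode above can be written out as a runnable sketch (per-group stdin wiring omitted). Note that the mean of deviations from the mean is always zero up to floating-point error, which hints at why the query as stated is degenerate anyway:

```python
def mean_of_deviations(values):
    # avg1 = computeSum(list) / len(list)
    avg1 = sum(values) / len(values)
    # computeSum(x - avg1 for x in list) / len(list)
    return sum(x - avg1 for x in values) / len(values)
```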
Re: Invoke a UDAF inside another UDAF
thanks edward, that seems to work :)

However, I have another query like this:

select a, avg(b) as avg_b, percentile_approx( avg_b - percentile_approx( b, .5), .5 ) from table1 group by a

Here I will lose the group info if I include the inner query in the FROM clause; is there a way to get this to work?

Thanks,
-- Rohan Monga

On Fri, Jan 20, 2012 at 12:51 PM, Edward Capriolo wrote:
> I think if you are grouping by b, b has to be in your select list. Try this.
> FROM (
>   select b, count(a) as theCount from table1 group by b
> ) a select mean(theCount);
>
> I think that should work.
>
> On 1/20/12, rohan monga wrote:
> > Hi,
> > I am trying to run a query like
> > "select mean(count(a)) from table1 group by b;"
> >
> > I am getting the following error
> >
> > FAILED: Hive Internal Error: java.lang.NullPointerException(null)
> > java.lang.NullPointerException
> > at org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:151)
> > at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:656)
> > at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:777)
> > at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
> > at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
> > at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
> > at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
> > at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:157)
> > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7447)
> > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7405)
> > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2747)
> > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:3365)
> > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:5858)
> > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6480)
> > at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7223)
> > at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:240)
> > at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:428)
> > at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
> > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
> > at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
> > at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:210)
> > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:401)
> > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
> > at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:635)
> > at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:552)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > at java.lang.reflect.Method.invoke(Method.java:597)
> > at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
> >
> > Is there a workaround? I have tried with hive 0.7.1 and 0.8
> >
> > Thanks
> > -- Rohan Monga
Invoke a UDAF inside another UDAF
Hi,
I am trying to run a query like

"select mean(count(a)) from table1 group by b;"

I am getting the following error:

FAILED: Hive Internal Error: java.lang.NullPointerException(null)
java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:151)
        at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:656)
        at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:777)
        at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
        at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:157)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7447)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7405)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2747)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:3365)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:5858)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6480)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7223)
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:240)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:428)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:210)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:401)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:635)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:552)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:197)

Is there a workaround? I have tried with hive 0.7.1 and 0.8.

Thanks
-- Rohan Monga
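The fix Edward suggests in the replies, restated as a sketch (`mean` is assumed to be a custom UDAF; the corresponding Hive built-in is `avg`):

```sql
-- Aggregate once in a subquery, then aggregate the aggregate:
SELECT avg(the_count)
FROM (
  SELECT b, count(a) AS the_count
  FROM table1
  GROUP BY b
) t;
```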
Re: Confusion about IN clause
Yup, sub-queries don't work in the WHERE clause. Is there another document and/or JIRA that I can look at for the implementation of the IN clause?

Regards,
-- Rohan Monga

On Mon, Dec 12, 2011 at 10:19 PM, Igor Tatarinov wrote:
> I think the doc refers to an IN subquery:
> WHERE x IN (SELECT blah FROM ...)
>
> The simple WHERE col IN ('x', 'y', 'z') works fine.
>
> I imagine none of these work:
> http://www.dba-oracle.com/sql/t_subquery_not_in_exists.htm
>
> igor
> decide.com
>
> On Mon, Dec 12, 2011 at 10:09 PM, rohan monga wrote:
>> Hi,
>> I thought that the 'IN' clause was not supported by hive ( version 0.7 ) according to the documentation:
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-WHEREClause
>>
>> but a friend of mine showed me that queries like the following
>>
>> select * from table where row in (1,2,3);
>>
>> work. And I found this JIRA, https://issues.apache.org/jira/browse/HIVE-801 , which kinda shows that it should work.
>>
>> Was this feature deprecated in the later version ( 0.7.1 ) or is the documentation out of sync?
>>
>> Thanks,
>> -- Rohan Monga
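A sketch of the distinction Igor draws, as of Hive 0.7/0.8 (table and column names are illustrative):

```sql
-- Supported: IN over a literal list
SELECT * FROM t WHERE id IN (1, 2, 3);

-- Not supported at the time: IN with a subquery
--   SELECT * FROM t WHERE id IN (SELECT id FROM other);
-- The documented workaround is a LEFT SEMI JOIN:
SELECT t.* FROM t LEFT SEMI JOIN other o ON (t.id = o.id);
```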
Re: New to hive
I would also recommend the Cloudera tutorials.
-- Rohan Monga

On Tue, Dec 13, 2011 at 3:50 AM, Mohammad Tariq wrote:
> Hi Praveenesh,
>
> If you want something other than the wiki, you can use books like "Hadoop: The Definitive Guide" or "Hadoop in Action"...These books have dedicated chapters on hive..In addition, you can also have a look at this presentation - "http://www.slideshare.net/cwsteinbach/hive-quick-start-tutorial"
>
> Regards,
> Mohammad Tariq
>
> On Tue, Dec 13, 2011 at 5:04 PM, praveenesh kumar wrote:
> > Guys,
> >
> > I have been using hadoop for 3-4 months.
> > Tried Map Reduce, HBase, Pig.
> > Now wanted to learn HIVE.
> > I hope it must be as exciting as other hadoop tools.
> > Can someone suggest me some good tutorials, apart from the HIVE wiki, which I can utilize to understand and work on hive?
> >
> > Thanks,
> > Praveenesh
Confusion about IN clause
Hi,
I thought that the 'IN' clause was not supported by hive ( version 0.7 ) according to the documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-WHEREClause

but a friend of mine showed me that queries like the following

select * from table where row in (1,2,3);

work. And I found this JIRA, https://issues.apache.org/jira/browse/HIVE-801 , which kinda shows that it should work.

Was this feature deprecated in the later version ( 0.7.1 ) or is the documentation out of sync?

Thanks,
-- Rohan Monga
Re: Severely hit by "curse of last reducer"
Hi Mark, Apologies for the thin details on the query :) Here is the error log http://pastebin.com/pqxh4d1u the job tracker doesn't show any errors. I am using hive-0.7, I did set a threshold for the query and sadly i couldn't find any more documentation on skewjoins other than the wiki. Thanks, -- Rohan Monga On Thu, Nov 17, 2011 at 2:02 PM, Mark Grover wrote: > Rohan, > The short answer is: I don't know:-) If you could paste the log, I or someone > else of the mailing list could be able to help. > > BTW, What version of Hive were you using? Did you set the threshold before > running the query? Try to find some documentation online if can tell what all > properties need to be set before Skew Join. My understanding was that the 2 > properties I mentioned below should suffice. > > Mark > > - Original Message - > From: "rohan monga" > To: user@hive.apache.org > Cc: "Ayon Sinha" > Sent: Thursday, November 17, 2011 4:44:17 PM > Subject: Re: Severely hit by "curse of last reducer" > > Hi Mark, > I have tried setting hive.optimize.skewjoin=true, but it get a > NullPointerException after the first stage of the query completes. > Why does that happen? > > Thanks, > -- > Rohan Monga > > > > On Thu, Nov 17, 2011 at 1:37 PM, Mark Grover wrote: >> Ayon, >> I see. From what you explained, skew join seems like what you want. Have you >> tried that already? >> >> Details on how skew join works are in this presentation. Jump to 15 minute >> mark if you want to just listen about skew joins. >> http://www.youtube.com/watch?v=OB4H3Yt5VWM >> >> I bet you could also find something in the mail list archives related to >> Skew Join. >> >> In a nutshell (from the video), >> set hive.optimize.skewjoin=true >> set hive.skewjoin.key= >> >> should do the trick for you. Threshold, I believe, is the number of records >> you consider a large number to defer till later. >> >> Good luck! 
>> Mark >> >> - Original Message - >> From: "Ayon Sinha" >> To: "Mark Grover" , user@hive.apache.org >> Sent: Wednesday, November 16, 2011 10:53:19 PM >> Subject: Re: Severely hit by "curse of last reducer" >> >> >> >> Only one reducer is always stuck. My table2 is small but using a Mapjoin >> makes my mappers run out of memory. My max reducers is 32 (also max reduce >> capacity). I tried setting num reducers to higher number (even 6000, which >> is appx. combination of dates & names I have) only to have lots of reducers >> with no data. >> So I am quite sure its is some key in stage-1 thats is doing this. >> >> -Ayon >> See My Photos on Flickr >> Also check out my Blog for answers to commonly asked questions. >> >> >> >> >> From: Mark Grover >> To: user@hive.apache.org; Ayon Sinha >> Sent: Wednesday, November 16, 2011 6:54 PM >> Subject: Re: Severely hit by "curse of last reducer" >> >> Hi Ayon, >> Is it one particular reduce task that is slow or the entire reduce phase? >> How many reduce tasks did you have, anyways? >> >> Looking into what the reducer key was might only make sense if a particular >> reduce task was slow. >> >> If your table2 is small enough to fit in memory, you might want to try a map >> join. >> More details at: >> http://www.facebook.com/note.php?note_id=470667928919 >> >> Let me know what you find. >> >> Mark >> >> - Original Message - >> From: "Ayon Sinha" < ayonsi...@yahoo.com > >> To: "Hive Mailinglist" < user@hive.apache.org > >> Sent: Wednesday, November 16, 2011 9:03:23 PM >> Subject: Severely hit by "curse of last reducer" >> >> >> >> Hi, >> Where do I find the log of what reducer key is causing the last reducer to >> go on for hours? The reducer logs don't say much about the key its >> processing. Is there a way to enable a debug mode where it would log the key >> it's processing? 
>> >> >> My query looks like: >> >> >> select partner_name, dates, sum(coins_granted) from table1 u join table2 p >> on u.partner_id=p.id group by partner_name, dates >> >> >> >> My uncompressed size of table1 is about 30GB. >> >> -Ayon >> See My Photos on Flickr >> Also check out my Blog for answers to commonly asked questions. >> >> >> >
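The settings from Mark's reply, collected into one sketch against Ayon's query (the threshold value is illustrative; the thread leaves it blank, and 100000 is the documented default):

```sql
SET hive.optimize.skewjoin=true;
-- keys with more than this many rows are deferred to a follow-up job
SET hive.skewjoin.key=100000;

SELECT partner_name, dates, sum(coins_granted)
FROM table1 u JOIN table2 p ON (u.partner_id = p.id)
GROUP BY partner_name, dates;
```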
Re: Severely hit by "curse of last reducer"
Hi Mark,
I have tried setting hive.optimize.skewjoin=true, but I get a NullPointerException after the first stage of the query completes. Why does that happen?

Thanks,
-- Rohan Monga

On Thu, Nov 17, 2011 at 1:37 PM, Mark Grover wrote:
> Ayon,
> I see. From what you explained, skew join seems like what you want. Have you tried that already?
>
> Details on how skew join works are in this presentation. Jump to the 15-minute mark if you want to just listen about skew joins.
> http://www.youtube.com/watch?v=OB4H3Yt5VWM
>
> I bet you could also find something in the mailing list archives related to Skew Join.
>
> In a nutshell (from the video),
> set hive.optimize.skewjoin=true
> set hive.skewjoin.key=
>
> should do the trick for you. Threshold, I believe, is the number of records you consider a large number to defer till later.
>
> Good luck!
> Mark
>
> - Original Message -
> From: "Ayon Sinha"
> To: "Mark Grover" , user@hive.apache.org
> Sent: Wednesday, November 16, 2011 10:53:19 PM
> Subject: Re: Severely hit by "curse of last reducer"
>
> Only one reducer is always stuck. My table2 is small, but using a Mapjoin makes my mappers run out of memory. My max reducers is 32 (also max reduce capacity). I tried setting num reducers to a higher number (even 6000, which is appx. the combination of dates & names I have) only to have lots of reducers with no data.
> So I am quite sure it is some key in stage-1 that is doing this.
>
> -Ayon
> See My Photos on Flickr
> Also check out my Blog for answers to commonly asked questions.
>
> From: Mark Grover
> To: user@hive.apache.org; Ayon Sinha
> Sent: Wednesday, November 16, 2011 6:54 PM
> Subject: Re: Severely hit by "curse of last reducer"
>
> Hi Ayon,
> Is it one particular reduce task that is slow or the entire reduce phase? How many reduce tasks did you have, anyways?
>
> Looking into what the reducer key was might only make sense if a particular reduce task was slow.
>
> If your table2 is small enough to fit in memory, you might want to try a map join.
> More details at:
> http://www.facebook.com/note.php?note_id=470667928919
>
> Let me know what you find.
>
> Mark
>
> - Original Message -
> From: "Ayon Sinha" < ayonsi...@yahoo.com >
> To: "Hive Mailinglist" < user@hive.apache.org >
> Sent: Wednesday, November 16, 2011 9:03:23 PM
> Subject: Severely hit by "curse of last reducer"
>
> Hi,
> Where do I find the log of what reducer key is causing the last reducer to go on for hours? The reducer logs don't say much about the key it's processing. Is there a way to enable a debug mode where it would log the key it's processing?
>
> My query looks like:
>
> select partner_name, dates, sum(coins_granted) from table1 u join table2 p on u.partner_id=p.id group by partner_name, dates
>
> My uncompressed size of table1 is about 30GB.
>
> -Ayon
> See My Photos on Flickr
> Also check out my Blog for answers to commonly asked questions.
Re: problems with indices
Hi Prajakta, I get it now. But are there plans to support groupbys of that format, or would I be able to write a UDF of sorts to do that? Also, could you please give me an example of accelerating equi-joins with indices, I haven't been able to get that to work either :( Regards, -- Rohan Monga On Wed, Sep 21, 2011 at 6:15 PM, Prajakta Kalmegh wrote: > Hi Rohan > As of now, we support queries with aggregate function count on columns which > are the same as index key column and are used in group-by construct. For > example, > CREATE TABLE tbl(key int, value int); > CREATE INDEX tbl_key_idx ON TABLE tbl(key) AS > 'org.apache.hadoop.hive.ql.index.AggregateIndexHandler' WITH DEFERRED > REBUILD IDXPROPERTIES("AGGREGATES"="count(key)"); > ALTER INDEX tbl_key_idx ON tbl REBUILD; > It is valid to use the following query: > select key, count(key) from tbl where key < 1000 group by key; > But not valid to use: > select value, count(key) from tbl group by value; (I think this is the same > as your case) > As of now, you cannot use the aggregate index in case of join queries. > For the semantic analysis error, I apologize for the typo in the code. It > worked for us as it was uniform in all the classes :) > You forgot to change it in a few more instances and hence the semantic > error. The aggregate index itself creates the column references with a > 'count_Of..' in it. We have fixed the issue and should be updated in the > trunk soon. Thanks for noticing it though. :) > Regards, > Prajakta > > > > > > > > On Wed, Sep 21, 2011 at 12:28 AM, rohan monga wrote: >> >> Hi Prajakta, >> Thanks, I was able to generate a plan which was using indices by >> following your advise. >> However, I want to group on one key and count the others >> "select count(id2) from table_t1 group by id1", would that not be possible >> ? >> >> Also, could you give me an example of using indices to speed up joins? 
>> >> And for the semantic analysis thing, I sent the error in my first >> email, here are the lines that I modified to get it to work. Basically >> changed "_count_Of" to "_count_of" >> >> Regards, >> -- >> Rohan Monga >> >> >> >> On Wed, Sep 21, 2011 at 4:22 AM, Prajakta Kalmegh >> wrote: >> > Hi Rohan >> > I run your queries on the same version from trunk and did not get any >> > errors. Not sure why you are getting a semantic analysis error. It would >> > be >> > good if you could send me a snapshot of the error message. >> > About the second issue of getting the same plan even if you >> > set hive.optimize.index.groupby to true or false, we do not optimize for >> > cases where the count (aggregate functions) are applied on keys other >> > than >> > the group-by keys. Since you are trying to get a count of id2 and have >> > the >> > index key (as well as the group-by key) on id1, the optimization is not >> > applied. Hence the same plan. >> > Hope this helps. Please let me know if you have any questions. 
>> > Regards, >> > Prajakta >> > >> > >> >> From: rohan monga >> >> Date: September 19, 2011 11:26:29 PM PDT >> >> To: >> >> Subject: problems with indices >> >> Reply-To: >> >> >> >> Hi, >> >> I have a table and index that look like >> >> >> >> >> >> CREATE TABLE table_t1(id1 int, id2 int) >> >> CREATE INDEX table_t1_idx ON TABLE table_t1(id1) AS >> >> 'org.apache.hadoop.hive.ql.index.AggregateIndexHandler' WITH DEFERRED >> >> REBUILD IDXPROPERTIES("AGGREGATES"="count(id2)"); >> >> ALTER INDEX table_t1_idx ON table_t1 REBUILD; >> >> >> >> >> >> Now, I am seeing 2 problems with the query 'set >> >> hive.optimize.index.groupby=false;select count(id2) from table_t1 >> >> group by id1', >> >> firstly, >> >> >> >> >> >> FAILED: Error in semantic analysis: Line 1:11 Invalid table alias or >> >> column reference '`_count_Of_id2`': (possible column names are: id1, >> >> _bucketname, _offsets, _count_of_id2) >> >> >> >> >> >> So, I assumed that it was a typo or something, and I change all 'Of' >> >> to 'of' in the relevant places in the code. >> >> >> >> Then I ran the query again, with the option set to true and then to >> >> false, the run times were almost the same. Also 'explain' on the >> >> queries show identical query plans. >> >> I am using hive revision 1172989 from trunk >> >> >> >> Could someone help me with this? >> >> >> >> Regards, >> >> >> >> -- >> >> Rohan Monga >> > >> > >> > > >
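On the equi-join question: since the aggregate index handler does not apply to joins, one hedged starting point (a sketch, not something the thread confirms works) would be a compact index on the join key; whether the planner actually uses it depends on the Hive version and on index-related settings:

```sql
CREATE INDEX table_t1_join_idx ON TABLE table_t1 (id1)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX table_t1_join_idx ON table_t1 REBUILD;

-- lets the optimizer rewrite eligible queries to use indexes automatically
SET hive.optimize.index.filter=true;
```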
Re: problems with indices
Hi Prajakta, Thanks, I was able to generate a plan which was using indices by following your advise. However, I want to group on one key and count the others "select count(id2) from table_t1 group by id1", would that not be possible ? Also, could you give me an example of using indices to speed up joins? And for the semantic analysis thing, I sent the error in my first email, here are the lines that I modified to get it to work. Basically changed "_count_Of" to "_count_of" Regards, -- Rohan Monga On Wed, Sep 21, 2011 at 4:22 AM, Prajakta Kalmegh wrote: > Hi Rohan > I run your queries on the same version from trunk and did not get any > errors. Not sure why you are getting a semantic analysis error. It would be > good if you could send me a snapshot of the error message. > About the second issue of getting the same plan even if you > set hive.optimize.index.groupby to true or false, we do not optimize for > cases where the count (aggregate functions) are applied on keys other than > the group-by keys. Since you are trying to get a count of id2 and have the > index key (as well as the group-by key) on id1, the optimization is not > applied. Hence the same plan. > Hope this helps. Please let me know if you have any questions. 
> Regards, > Prajakta > > >> From: rohan monga >> Date: September 19, 2011 11:26:29 PM PDT >> To: >> Subject: problems with indices >> Reply-To: >> >> Hi, >> I have a table and index that look like >> >> >> CREATE TABLE table_t1(id1 int, id2 int) >> CREATE INDEX table_t1_idx ON TABLE table_t1(id1) AS >> 'org.apache.hadoop.hive.ql.index.AggregateIndexHandler' WITH DEFERRED >> REBUILD IDXPROPERTIES("AGGREGATES"="count(id2)"); >> ALTER INDEX table_t1_idx ON table_t1 REBUILD; >> >> >> Now, I am seeing 2 problems with the query 'set >> hive.optimize.index.groupby=false;select count(id2) from table_t1 >> group by id1', >> firstly, >> >> >> FAILED: Error in semantic analysis: Line 1:11 Invalid table alias or >> column reference '`_count_Of_id2`': (possible column names are: id1, >> _bucketname, _offsets, _count_of_id2) >> >> >> So, I assumed that it was a typo or something, and I change all 'Of' >> to 'of' in the relevant places in the code. >> >> Then I ran the query again, with the option set to true and then to >> false, the run times were almost the same. Also 'explain' on the >> queries show identical query plans. >> I am using hive revision 1172989 from trunk >> >> Could someone help me with this? >> >> Regards, >> >> -- >> Rohan Monga > > > [Attachment: rohan.diff]
problems with indices
Hi, I have a table and index that look like: CREATE TABLE table_t1(id1 int, id2 int); CREATE INDEX table_t1_idx ON TABLE table_t1(id1) AS 'org.apache.hadoop.hive.ql.index.AggregateIndexHandler' WITH DEFERRED REBUILD IDXPROPERTIES("AGGREGATES"="count(id2)"); ALTER INDEX table_t1_idx ON table_t1 REBUILD; Now, I am seeing two problems with the query 'set hive.optimize.index.groupby=false; select count(id2) from table_t1 group by id1'. Firstly: FAILED: Error in semantic analysis: Line 1:11 Invalid table alias or column reference '`_count_Of_id2`': (possible column names are: id1, _bucketname, _offsets, _count_of_id2) So, I assumed that it was a typo or something, and I changed all 'Of' to 'of' in the relevant places in the code. Then I ran the query again, with the option set to true and then to false; the run times were almost the same. Also, 'explain' on the queries shows identical query plans. I am using hive revision 1172989 from trunk. Could someone help me with this? Regards, -- Rohan Monga
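As a conceptual sketch of what the aggregate index in the thread above is doing (this is illustrative Python, not Hive internals; all names besides `_count_of_id2` are made up): the index table stores a precomputed count per index key, so a group-by count on the indexed key can in principle be answered from the index alone, without rescanning the base table.

```python
# Conceptual sketch of an aggregate index like table_t1_idx above.
# The index keeps one row per id1 with a precomputed _count_of_id2,
# which is why the possible column names in the error message include
# id1, _bucketname, _offsets, _count_of_id2.
from collections import defaultdict

def build_aggregate_index(rows):
    """Mimic the index table: one entry per id1 holding count(id2)."""
    index = defaultdict(int)
    for id1, id2 in rows:
        if id2 is not None:          # count(id2) skips NULLs
            index[id1] += 1
    return dict(index)

def count_id2_by_id1_via_index(index):
    """Answer 'select count(id2) ... group by id1' from the index only."""
    return dict(index)

base_table = [(1, 10), (1, 11), (2, 20), (2, None), (3, 30)]
idx = build_aggregate_index(base_table)
print(count_id2_by_id1_via_index(idx))  # {1: 2, 2: 1, 3: 1}
```

Whether the optimizer actually rewrites a given query against the index is a separate question, which is what the reply in this thread addresses.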
Re: CREATE INDEX
Yup, I'd be interested in this too. I'd like to know how the indices are applied/used and for which queries they can be useful. Thanks, -- Rohan Monga On Mon, Sep 12, 2011 at 6:16 PM, Adriaan Tijsseling wrote: > Does anyone know the proper documentation for using the "CREATE INDEX" > command in hive? The hive homepage doesn't have the information. > > Cheers, > > Adriaan > >
Problem with custom map-reduce scripts
Hi, I have a hive query like this: from ( from mytable map mytable.c1, mytable.c2, mytable.c3 using '/bin/cat' as c1, c2, c3 cluster by c1 ) mo insert overwrite table mytable2 reduce mo.c1, mo.c2, mo.c3 using 'python convert.py' as x1, x2; Now what is happening is that some of the records from 'mytable' are not making their way into 'mytable2'. However, with: from ( from (select * from mytable where c1=1) mt map mt.c1, mt.c2, mt.c3 using '/bin/cat' as c1, c2, c3 cluster by c1 ) mo insert overwrite table mytable2 reduce mo.c1, mo.c2, mo.c3 using 'python convert.py' as x1, x2; say the missing record was for c1=1; then with this second query, the record does show up in mytable2. Is there something I am missing? Regards, -- Rohan Monga
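One common cause of silently dropped records in streaming MAP/REDUCE pipelines like the one above is the script itself mishandling a row: Hive feeds the script tab-delimited rows on stdin and reads tab-delimited rows from stdout, so a row the script skips (or dies on) simply vanishes from the output table. As a hedged sketch only, with a made-up transformation since the real convert.py is not shown, a defensive reducer might look like:

```python
#!/usr/bin/env python
# Hypothetical sketch of a defensive Hive streaming reducer ("convert.py"
# in the query above is not shown, so this transformation is invented).
# Hive sends one tab-delimited row per line on stdin and reads
# tab-delimited output rows from stdout; a row the script raises on, or
# writes nothing for, is silently missing from the destination table.
import sys

def convert(line):
    """Turn an input row (c1, c2, c3) into an output row (x1, x2)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None                      # skip malformed rows explicitly
    c1, c2, c3 = fields
    return (c1, c2 + "," + c3)           # example transformation only

def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        row = convert(line)
        if row is not None:
            stdout.write("\t".join(row) + "\n")

if __name__ == "__main__":
    main()
```

Logging skipped rows to stderr (which ends up in the task logs) is a cheap way to find out whether the script, rather than the query plan, is where records disappear.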
Comments in hive scripts
Hi, I am using hive version 0.7 and I am having an odd issue with comments. The following works: use default; -- this is good -- select blah from blahness ... However: use default; -- this is good -- select * from blahness; ... doesn't work. Is this because hive is splitting on ';' before removing comments, or am I missing something? Thanks, -- Rohan Monga
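The symptom is consistent with the client splitting the script on ';' before stripping '--' comments, so the ';' inside the commented-out query ends a statement early and the rest of the comment line leaks into the next statement. A small sketch of the two orderings (plain Python, simplified; a real splitter must also respect ';' inside quoted strings):

```python
# Sketch of why a ';' inside a "--" comment can break a naive script
# splitter: splitting on ';' first cuts the comment in half, so the text
# after the ';' is no longer recognized as part of the comment.

def strip_comments(text):
    """Drop everything from '--' to end of line."""
    return "\n".join(line.split("--", 1)[0] for line in text.splitlines())

def naive_split(script):
    """Split on ';' first, then strip comments (the buggy order)."""
    return [strip_comments(s).strip() for s in script.split(";") if s.strip()]

def comment_aware_split(script):
    """Strip comments first, then split on ';' (the correct order)."""
    return [s.strip() for s in strip_comments(script).split(";") if s.strip()]

script = "select 1; -- was: select * from blahness; this text is still comment\nselect 2"

# The naive order produces a garbage "statement" made of comment text;
# the comment-aware order yields exactly the two real statements.
print(naive_split(script))
print(comment_aware_split(script))
```

With this ordering bug, everything after the ';' on the comment line gets prepended to the following statement, which matches the parse error you would see from the second script.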