problems with beeline and "add file"

2014-10-27 Thread rohan monga
Hi,
I'm running hiveserver2 on machine X and beeline on machine Y.
Everything works fine in terms of queries, etc.
However, if I try "add file blah.py" via the Beeline CLI, it tries to find
the file on machine X instead of shipping it over from Y.

Is this the expected behavior, or have I misconfigured something?

Regards,
--
Rohan Monga


Re: casting complex data types for outputs of custom scripts

2014-01-19 Thread rohan monga
Sorry for the delayed response.

Yes, the Python script follows that convention.
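For concreteness, a transform script that honors those default delimiters could format each output row like this (a minimal sketch; the column layout and struct fields are illustrative assumptions, not the actual customScript.py):

```python
# Hedged sketch: how a Hive TRANSFORM/REDUCE script can emit a row whose
# second column is an array of structs using Hive's default delimiters.
SOH = "\001"  # ^A: separates top-level columns
STX = "\002"  # ^B: separates array elements
ETX = "\003"  # ^C: separates struct fields within an element

def format_row(f1, structs):
    """structs is a list of tuples; each tuple becomes one struct element."""
    f2 = STX.join(ETX.join(str(field) for field in s) for s in structs)
    return f"{f1}{SOH}{f2}"

row = format_row(42, [(1, 0.5), (2, 0.75)])
# row == "42" + SOH + "1" + ETX + "0.5" + STX + "2" + ETX + "0.75"
```

The script would print one such line per output row to stdout.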

--
Rohan Monga


On Tue, Jan 14, 2014 at 4:31 PM, Stephen Sprague  wrote:
> @OP - first thing I'd ask is: does your python script obey the ^A, ^B, ^C, ^D
> etc. nesting delimiter pattern? Given that your create table does not
> specify delimiters, those are the defaults. NB: ^A == control-A == \001
>
> Cheers,
> Stephen.
>
>
> On Tue, Jan 14, 2014 at 3:11 PM, Andre Araujo  wrote:
>>
>> I had a similar issue in the past when trying to cast an empty array to
>> array(). By default Hive assumes it's an array().
>> I don't think there's currently a Hive syntax to cast values to complex
>> data types. If there's one, I'd love to know what it is :)
>>
>>
>> On 14 January 2014 10:22, rohan monga  wrote:
>>>
>>> Hi,
>>>
>>> I have a table that is of the following format
>>>
>>> create table t1 ( f1 int, f2 array> );
>>>
>>> Now I have a custom script that does some computation and generates
>>> the value for f2
>>> like so
>>>
>>> from (
>>> from randomtable r
>>> map r.g1, r.g2, r.g3
>>> using '/bin/cat' as g1, g2, g3
>>> cluster by g1 ) m
>>> insert overwrite table t1
>>> reduce m.g1, m.g2, m.g3
>>> using 'python customScript.py' as ( f1 , f2 );
>>>
>>> However, f2 is not being loaded properly into t1; it comes up broken or
>>> null. What should I do so that f2 is loaded as an array of structs?
>>>
>>>
>>> Thanks,
>>>
>>> --
>>> Rohan Monga
>>
>>
>>
>>
>> --
>> André Araújo
>> Big Data Consultant/Solutions Architect
>> The Pythian Group - Australia - www.pythian.com
>>
>> Office (calls from within Australia): 1300 366 021 x1270
>> Office (international): +61 2 8016 7000  x270 OR +1 613 565 8696   x1270
>> Mobile: +61 410 323 559
>> Fax: +61 2 9805 0544
>> IM: pythianaraujo @ AIM/MSN/Y! or ara...@pythian.com @ GTalk
>>
>> “Success is not about standing at the top, it's the steps you leave
>> behind.” — Iker Pou (rock climber)
>>
>> --
>>
>>
>>
>


casting complex data types for outputs of custom scripts

2014-01-13 Thread rohan monga
Hi,

I have a table that is of the following format

create table t1 ( f1 int, f2 array> );

Now I have a custom script that does some computation and generates
the value for f2, like so:

from (
from randomtable r
map r.g1, r.g2, r.g3
using '/bin/cat' as g1, g2, g3
cluster by g1 ) m
insert overwrite table t1
reduce m.g1, m.g2, m.g3
using 'python customScript.py' as ( f1 , f2 );

However, f2 is not being loaded properly into t1; it comes up broken or
null. What should I do so that f2 is loaded as an array of structs?


Thanks,

--
Rohan Monga


Re: Invoke a UDAF inside another UDAF

2012-02-01 Thread rohan monga
Thanks, Mark.
I ended up going the custom-reducer route. I will try out the query you have
sent.
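For intuition, option 3 (two passes: group averages first, then averages of the deviations) can be sketched in plain Python with made-up data; the column names mirror Mark's query:

```python
from collections import defaultdict

# Hedged sketch of the two-pass avg(f2 - avg_f2) computation, expressed
# in plain Python instead of HiveQL; rows are made-up (f1, f2) pairs.
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)]

# Pass 1: per-group average of f2 (the avg_table subquery).
sums, counts = defaultdict(float), defaultdict(int)
for f1, f2 in rows:
    sums[f1] += f2
    counts[f1] += 1
avg_f2 = {k: sums[k] / counts[k] for k in sums}

# Pass 2: per-group average of (f2 - avg_f2), i.e. the outer aggregation.
dev_sums = defaultdict(float)
for f1, f2 in rows:
    dev_sums[f1] += f2 - avg_f2[f1]
avg_dev = {k: dev_sums[k] / counts[k] for k in dev_sums}
```

Note that the mean deviation from a group's own mean is always 0 (up to floating-point error), which is worth keeping in mind when sanity-checking the Hive output.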

Regards,
--
Rohan Monga


On Wed, Feb 1, 2012 at 11:06 AM, Mark Grover  wrote:

> Rohan,
> You could do it one of the following ways:
> 1) Write a UDAF that does the avg(f2 - avg_f2) computation.
> 2) Write a custom reducer that does the avg(f2 - avg_f2) computation.
> 3) Do it with multiple passes over the data. Something like this
> (untested):
>
> select
>   table.f1,
>   avg_table.avg_f2,
>   avg(table.f2-avg_table.avg_f2)
> from
> (
> select
>   f1,
>   avg(f2) as avg_f2
> from
>   table
> group by
>   f1)avg_table
> join
> table
> ON (avg_table.f1=table.f1)
> group by
>   table.f1,
>   avg_table.avg_f2;
>
> Mark
>
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com www: fxtrade.com
> e: mgro...@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>
> - Original Message -
> From: "rohan monga" 
> To: user@hive.apache.org
> Sent: Friday, January 20, 2012 6:00:54 PM
> Subject: Re: Invoke a UDAF inside another UDAF
>
> my bad, i hastily converted the query to a wrong example.
>
> it should be like this
>
> select f1, avg(f2) as avg_f2, avg(f2 - avg_f2) from table group by f1;
>
> In essence, I just want to use the value generated by one UDAF ( in this
> case avg(f2) ) as a single number and then apply that value to the group
> inside a different UDAF.
> For e.g. if I were to use a streaming reducer, it would be something like
> this
>
> avg1 = computeSum(list) / len(list)
> return computeSum(x-avg1 for x in list) / len(list)
>
> As I write this I realize why this might not be possible [ the group
> computation being done in one step and the information being lost ] :)
>
> But why the nullpointer exception?
>
> Regards,
> --
> Rohan Monga
>
>
>
> On Fri, Jan 20, 2012 at 2:32 PM, Edward Capriolo < edlinuxg...@gmail.com> 
> wrote:
>
>
> IMHO You can not possibly nest the percentile calculation because the
> results would be meaningless. percentile has to aggregate a set and
> pick the Nth element, But if you nest then the inner percentile only
> returns one result to the outer percentile, and that is pretty
> meaningless.
>
> (I think someone talked about this on list in the last month or so).
> Without seeing your input data and your expected results, i can not
> understand what your query wants to do, and suggest an alternative.
>
>
>
>
>
> On 1/20/12, rohan monga < monga.ro...@gmail.com > wrote:
> > thanks edward that seems to work :)
> >
> > However, I have another query like this
> >
> > select a, avg(b) as avg_b, percentile_approx( avg_b - percentile_approx(
> b,
> > .5), .5 ) from table1 group by a
> >
> > Here I will lose the group info if I include the inner query in the FROM
> > clause, is there a way to get this to work?
> >
> > Thanks,
> > --
> > Rohan Monga
> >
> >
> > On Fri, Jan 20, 2012 at 12:51 PM, Edward Capriolo
> > < edlinuxg...@gmail.com >wrote:
> >
> >> I think if you are grouping by b, b has to be in your select list. Try
> >> this.
> >> FROM (
> >> select b,count(a) as theCount from table one group by b
> >> ) a select mean(theCount);
> >>
> >> I think that should work.
> >>
> >> On 1/20/12, rohan monga < monga.ro...@gmail.com > wrote:
> >> > Hi,
> >> > I am trying to run a query like
> >> > "select mean(count(a)) from table1 group by b;"
> >> >
> >> > I am getting the following error
> >> > 
> >> > FAILED: Hive Internal Error: java.lang.NullPointerException(null)
> >> > java.lang.NullPointerException
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:151)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:656)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:777)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
> >> > at
> >> >
> >>

Re: Invoke a UDAF inside another UDAF

2012-01-20 Thread rohan monga
My bad, I hastily converted the query to a wrong example.

it should be like this

select f1, avg(f2) as avg_f2, avg(f2 - avg_f2) from table group by f1;

In essence, I just want to use the value generated by one UDAF ( in this
case avg(f2) ) as a single number and then apply that value to the group
inside a different UDAF.
For e.g. if I were to use a streaming reducer, it would be something like
this

avg1 = computeSum(list) / len(list)
return computeSum(x-avg1 for x in list) / len(list)
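As a runnable version of that pseudocode (assuming computeSum is an ordinary sum over the group's values):

```python
# Runnable sketch of the streaming-reducer pseudocode above;
# compute_sum is assumed to be a plain sum over the group's values.
def compute_sum(values):
    return sum(values)

def avg_of_deviations(values):
    avg1 = compute_sum(values) / len(values)
    deviations = [x - avg1 for x in values]
    return compute_sum(deviations) / len(values)

# Deviations from a group's own mean cancel out, so this is always 0
# (up to floating-point error).
```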

As I write this I realize why this might not be possible [ the group
computation being done in one step and the information being lost ] :)

But why the nullpointer exception?

Regards,
--
Rohan Monga


On Fri, Jan 20, 2012 at 2:32 PM, Edward Capriolo wrote:

> IMHO You can not possibly nest the percentile calculation because the
> results would be meaningless. percentile has to aggregate a set and
> pick the Nth element, But if you nest then the inner percentile only
> returns one result to the outer percentile, and that is pretty
> meaningless.
>
> (I think someone talked about this on list in the last month or so).
> Without seeing your input data and your expected results, i can not
> understand what your query wants to do, and suggest an alternative.
>
>
>
> On 1/20/12, rohan monga  wrote:
> > thanks edward that seems to work  :)
> >
> > However, I have another query like this
> >
> > select a, avg(b) as avg_b, percentile_approx( avg_b - percentile_approx(
> b,
> > .5), .5 ) from table1 group by a
> >
> > Here I will lose the group info if I include the inner query in the FROM
> > clause, is there a way to get this to work?
> >
> > Thanks,
> > --
> > Rohan Monga
> >
> >
> > On Fri, Jan 20, 2012 at 12:51 PM, Edward Capriolo
> > wrote:
> >
> >> I think if you are grouping by b, b has to be in your select list. Try
> >> this.
> >> FROM (
> >> select b,count(a) as theCount from table one group by b
> >> ) a select mean(theCount);
> >>
> >> I think that should work.
> >>
> >> On 1/20/12, rohan monga  wrote:
> >> > Hi,
> >> > I am trying to run a query like
> >> > "select mean(count(a)) from table1 group by b;"
> >> >
> >> > I am getting the following error
> >> > 
> >> > FAILED: Hive Internal Error: java.lang.NullPointerException(null)
> >> > java.lang.NullPointerException
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:151)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:656)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:777)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:157)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7447)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7405)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2747)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:3365)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:5858)
> >> > at
> >> >
> >>
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6480)
> >> > at
> >> >
> >>

Re: Invoke a UDAF inside another UDAF

2012-01-20 Thread rohan monga
Thanks Edward, that seems to work :)

However, I have another query like this:

select a, avg(b) as avg_b, percentile_approx( avg_b - percentile_approx( b,
.5), .5 ) from table1 group by a

Here I will lose the group info if I include the inner query in the FROM
clause, is there a way to get this to work?

Thanks,
--
Rohan Monga


On Fri, Jan 20, 2012 at 12:51 PM, Edward Capriolo wrote:

> I think if you are grouping by b, b has to be in your select list. Try
> this.
> FROM (
> select b,count(a) as theCount from table one group by b
> ) a select mean(theCount);
>
> I think that should work.
>
> On 1/20/12, rohan monga  wrote:
> > Hi,
> > I am trying to run a query like
> > "select mean(count(a)) from table1 group by b;"
> >
> > I am getting the following error
> > 
> > FAILED: Hive Internal Error: java.lang.NullPointerException(null)
> > java.lang.NullPointerException
> > at
> >
> org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:151)
> > at
> >
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:656)
> > at
> >
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:777)
> > at
> >
> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
> > at
> >
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
> > at
> >
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
> > at
> >
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
> > at
> >
> org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:157)
> > at
> >
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7447)
> > at
> >
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7405)
> > at
> >
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2747)
> > at
> >
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:3365)
> > at
> >
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:5858)
> > at
> >
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6480)
> > at
> >
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7223)
> > at
> >
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:240)
> > at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:428)
> > at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
> > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
> > at
> > org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
> > at
> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:210)
> > at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:401)
> > at
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
> > at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:635)
> > at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:552)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > at java.lang.reflect.Method.invoke(Method.java:597)
> > at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
> > 
> >
> > Is there a workaround? I have tried with Hive 0.7.1 and 0.8.
> >
> > Thanks
> > --
> > Rohan Monga
> >
>


Invoke a UDAF inside another UDAF

2012-01-20 Thread rohan monga
Hi,
I am trying to run a query like
"select mean(count(a)) from table1 group by b;"

I am getting the following error

FAILED: Hive Internal Error: java.lang.NullPointerException(null)
java.lang.NullPointerException
at
org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc.newInstance(ExprNodeGenericFuncDesc.java:151)
at
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:656)
at
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:777)
at
org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
at
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:88)
at
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:125)
at
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:102)
at
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:157)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7447)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:7405)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:2747)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggr1MR(SemanticAnalyzer.java:3365)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:5858)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6480)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7223)
at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:240)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:428)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
at
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:210)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:401)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:635)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:552)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)


Is there a workaround? I have tried with Hive 0.7.1 and 0.8.
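As the replies note, the usual workaround is to stage the inner aggregation in a FROM subquery; the equivalent two-stage computation, sketched in plain Python with toy data:

```python
from collections import Counter

# Hedged sketch of the two-stage aggregation the FROM-subquery workaround
# performs: first count(a) per b, then the mean over those counts.
rows = [("b1", "a1"), ("b1", "a2"), ("b1", "a3"), ("b2", "a1")]  # (b, a)

counts_per_b = Counter(b for b, _ in rows)  # inner: count(a) ... group by b
mean_count = sum(counts_per_b.values()) / len(counts_per_b)  # outer: mean(...)
# counts_per_b == {"b1": 3, "b2": 1}; mean_count == 2.0
```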

Thanks
--
Rohan Monga


Re: Confusion about IN clause

2011-12-13 Thread rohan monga
Yup, sub-queries don't work in the WHERE clause.
Is there another document and/or JIRA that I can look at for the
implementation of the IN clause?

Regards,
--
Rohan Monga


On Mon, Dec 12, 2011 at 10:19 PM, Igor Tatarinov  wrote:

> I think the doc refers to an IN subquery
> WHERE x IN (SELECT blah FROM ...)
>
> the simple WHERE col IN ('x', 'y', 'z') works fine.
>
> I  imagine none of these work:
> http://www.dba-oracle.com/sql/t_subquery_not_in_exists.htm
>
>
> igor
> decide.com
>
>
> On Mon, Dec 12, 2011 at 10:09 PM, rohan monga wrote:
>
>> Hi,
>> I thought that the 'IN' clause was not supported by Hive (version 0.7)
>> according to the documentation
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-WHEREClause
>>
>> but a friend of mine showed me that queries like the following
>> 
>> select * from table where row in (1,2,3);
>> 
>> work, and I found this JIRA,
>> https://issues.apache.org/jira/browse/HIVE-801 which suggests that it
>> should work.
>>
>> Was this feature deprecated in the later version ( 0.7.1) or is the
>> documentation out of sync?
>>
>> Thanks,
>> --
>> Rohan Monga
>>
>
>


Re: New to hive

2011-12-13 Thread rohan monga
I would also recommend the Cloudera tutorials.

--
Rohan Monga


On Tue, Dec 13, 2011 at 3:50 AM, Mohammad Tariq  wrote:

> Hi Praveenesh,
>
> If you want something other than the wiki, you can use books
> like "Hadoop definitive guide" or "Hadoop in action"...These books
> have dedicated chapters on hive..In addition, you can also have a look
> at this presentation -
> "http://www.slideshare.net/cwsteinbach/hive-quick-start-tutorial";
>
> Regards,
> Mohammad Tariq
>
>
>
> On Tue, Dec 13, 2011 at 5:04 PM, praveenesh kumar 
> wrote:
> > Guys,
> >
> > I have been using hadoop for 3,4 months.
> > Tried Map Reduce, HBase, Pig.
> > Now wanted to learn HIVE.
> > I hope it must be as exciting as other hadoop tools.
> > Can someone suggest me some good tutorials, apart from HIVE wiki,
> > which I can utilize to understand and work on hive ?
> >
> > Thanks,
> > Praveenesh
>


Confusion about IN clause

2011-12-12 Thread rohan monga
Hi,
I thought that the 'IN' clause was not supported by Hive (version 0.7)
according to the documentation
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-WHEREClause

but a friend of mine showed me that queries like the following

select * from table where row in (1,2,3);

work, and I found this JIRA,
https://issues.apache.org/jira/browse/HIVE-801 which suggests that
it should work.

Was this feature deprecated in a later version (0.7.1), or is the
documentation out of sync?

Thanks,
--
Rohan Monga


Re: Severely hit by "curse of last reducer"

2011-11-17 Thread rohan monga
Hi Mark,
Apologies for the thin details on the query :)
Here is the error log: http://pastebin.com/pqxh4d1u (the job tracker
doesn't show any errors).
I am using hive-0.7; I did set a threshold for the query, and sadly I
couldn't find any more documentation on skew joins other than the wiki.
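For what it's worth, one way to locate the hot key behind a stuck last reducer is to count join-key frequencies over the data (normally via a group-by query, or over a sample of the raw files); a minimal Python sketch with made-up keys:

```python
from collections import Counter

# Minimal sketch: rank join keys by frequency to spot skew candidates.
# In Hive this would be something like (hypothetical query):
#   select partner_id, count(*) c from table1 group by partner_id
#   order by c desc limit 10;
def top_keys(keys, n=1):
    return Counter(keys).most_common(n)

sample = ["p1", "p1", "p1", "p1", "p2", "p3", "p1"]
# top_keys(sample) -> [("p1", 5)]
```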

Thanks,
--
Rohan Monga



On Thu, Nov 17, 2011 at 2:02 PM, Mark Grover  wrote:
> Rohan,
> The short answer is: I don't know :-) If you could paste the log, I or someone
> else on the mailing list might be able to help.
>
> BTW, what version of Hive were you using? Did you set the threshold before
> running the query? Try to find some documentation online that says what
> properties need to be set before a skew join. My understanding was that the 2
> properties I mentioned below should suffice.
>
> Mark
>
> - Original Message -
> From: "rohan monga" 
> To: user@hive.apache.org
> Cc: "Ayon Sinha" 
> Sent: Thursday, November 17, 2011 4:44:17 PM
> Subject: Re: Severely hit by "curse of last reducer"
>
> Hi Mark,
> I have tried setting hive.optimize.skewjoin=true, but I get a
> NullPointerException after the first stage of the query completes.
> Why does that happen?
>
> Thanks,
> --
> Rohan Monga
>
>
>
> On Thu, Nov 17, 2011 at 1:37 PM, Mark Grover  wrote:
>> Ayon,
>> I see. From what you explained, skew join seems like what you want. Have you 
>> tried that already?
>>
>> Details on how skew join works are in this presentation. Jump to 15 minute 
>> mark if you want to just listen about skew joins.
>> http://www.youtube.com/watch?v=OB4H3Yt5VWM
>>
>> I bet you could also find something in the mail list archives related to 
>> Skew Join.
>>
>> In a nutshell (from the video),
>> set hive.optimize.skewjoin=true
>> set hive.skewjoin.key=
>>
>> should do the trick for you. Threshold, I believe, is the number of records 
>> you consider a large number to defer till later.
>>
>> Good luck!
>> Mark
>>
>> - Original Message -
>> From: "Ayon Sinha" 
>> To: "Mark Grover" , user@hive.apache.org
>> Sent: Wednesday, November 16, 2011 10:53:19 PM
>> Subject: Re: Severely hit by "curse of last reducer"
>>
>>
>>
>> Only one reducer is always stuck. My table2 is small but using a Mapjoin 
>> makes my mappers run out of memory. My max reducers is 32 (also max reduce 
>> capacity). I tried setting num reducers to higher number (even 6000, which 
>> is appx. combination of dates & names I have) only to have lots of reducers 
>> with no data.
>> So I am quite sure it is some key in stage-1 that is doing this.
>>
>> -Ayon
>> See My Photos on Flickr
>> Also check out my Blog for answers to commonly asked questions.
>>
>>
>>
>>
>> From: Mark Grover 
>> To: user@hive.apache.org; Ayon Sinha 
>> Sent: Wednesday, November 16, 2011 6:54 PM
>> Subject: Re: Severely hit by "curse of last reducer"
>>
>> Hi Ayon,
>> Is it one particular reduce task that is slow or the entire reduce phase? 
>> How many reduce tasks did you have, anyways?
>>
>> Looking into what the reducer key was might only make sense if a particular 
>> reduce task was slow.
>>
>> If your table2 is small enough to fit in memory, you might want to try a map 
>> join.
>> More details at:
>> http://www.facebook.com/note.php?note_id=470667928919
>>
>> Let me know what you find.
>>
>> Mark
>>
>> - Original Message -
>> From: "Ayon Sinha" < ayonsi...@yahoo.com >
>> To: "Hive Mailinglist" < user@hive.apache.org >
>> Sent: Wednesday, November 16, 2011 9:03:23 PM
>> Subject: Severely hit by "curse of last reducer"
>>
>>
>>
>> Hi,
>> Where do I find the log of what reducer key is causing the last reducer to 
>> go on for hours? The reducer logs don't say much about the key its 
>> processing. Is there a way to enable a debug mode where it would log the key 
>> it's processing?
>>
>>
>> My query looks like:
>>
>>
>> select partner_name, dates, sum(coins_granted) from table1 u join table2 p 
>> on u.partner_id=p.id group by partner_name, dates
>>
>>
>>
>> My uncompressed size of table1 is about 30GB.
>>
>> -Ayon
>> See My Photos on Flickr
>> Also check out my Blog for answers to commonly asked questions.
>>
>>
>>
>


Re: Severely hit by "curse of last reducer"

2011-11-17 Thread rohan monga
Hi Mark,
I have tried setting hive.optimize.skewjoin=true, but I get a
NullPointerException after the first stage of the query completes.
Why does that happen?

Thanks,
--
Rohan Monga



On Thu, Nov 17, 2011 at 1:37 PM, Mark Grover  wrote:
> Ayon,
> I see. From what you explained, skew join seems like what you want. Have you 
> tried that already?
>
> Details on how skew join works are in this presentation. Jump to 15 minute 
> mark if you want to just listen about skew joins.
> http://www.youtube.com/watch?v=OB4H3Yt5VWM
>
> I bet you could also find something in the mail list archives related to Skew 
> Join.
>
> In a nutshell (from the video),
> set hive.optimize.skewjoin=true
> set hive.skewjoin.key=
>
> should do the trick for you. Threshold, I believe, is the number of records 
> you consider a large number to defer till later.
>
> Good luck!
> Mark
>
> - Original Message -
> From: "Ayon Sinha" 
> To: "Mark Grover" , user@hive.apache.org
> Sent: Wednesday, November 16, 2011 10:53:19 PM
> Subject: Re: Severely hit by "curse of last reducer"
>
>
>
> Only one reducer is always stuck. My table2 is small but using a Mapjoin 
> makes my mappers run out of memory. My max reducers is 32 (also max reduce 
> capacity). I tried setting num reducers to higher number (even 6000, which is 
> appx. combination of dates & names I have) only to have lots of reducers with 
> no data.
> So I am quite sure it is some key in stage-1 that is doing this.
>
> -Ayon
> See My Photos on Flickr
> Also check out my Blog for answers to commonly asked questions.
>
>
>
>
> From: Mark Grover 
> To: user@hive.apache.org; Ayon Sinha 
> Sent: Wednesday, November 16, 2011 6:54 PM
> Subject: Re: Severely hit by "curse of last reducer"
>
> Hi Ayon,
> Is it one particular reduce task that is slow or the entire reduce phase? How 
> many reduce tasks did you have, anyways?
>
> Looking into what the reducer key was might only make sense if a particular 
> reduce task was slow.
>
> If your table2 is small enough to fit in memory, you might want to try a map 
> join.
> More details at:
> http://www.facebook.com/note.php?note_id=470667928919
>
> Let me know what you find.
>
> Mark
>
> - Original Message -
> From: "Ayon Sinha" < ayonsi...@yahoo.com >
> To: "Hive Mailinglist" < user@hive.apache.org >
> Sent: Wednesday, November 16, 2011 9:03:23 PM
> Subject: Severely hit by "curse of last reducer"
>
>
>
> Hi,
> Where do I find the log of what reducer key is causing the last reducer to go 
> on for hours? The reducer logs don't say much about the key its processing. 
> Is there a way to enable a debug mode where it would log the key it's 
> processing?
>
>
> My query looks like:
>
>
> select partner_name, dates, sum(coins_granted) from table1 u join table2 p on 
> u.partner_id=p.id group by partner_name, dates
>
>
>
> My uncompressed size of table1 is about 30GB.
>
> -Ayon
> See My Photos on Flickr
> Also check out my Blog for answers to commonly asked questions.
>
>
>


Re: problems with indices

2011-09-22 Thread rohan monga
Hi Prajakta,
I get it now. But are there plans to support group-bys of that format,
or would I be able to write a UDF of sorts to do that?

Also, could you please give me an example of accelerating equi-joins
with indices? I haven't been able to get that to work either :(

Regards,
--
Rohan Monga



On Wed, Sep 21, 2011 at 6:15 PM, Prajakta Kalmegh  wrote:
> Hi Rohan
> As of now, we support queries with aggregate function count on columns which
> are the same as index key column and are used in group-by construct. For
> example,
> CREATE TABLE tbl(key int, value int);
> CREATE INDEX tbl_key_idx ON TABLE tbl(key) AS
> 'org.apache.hadoop.hive.ql.index.AggregateIndexHandler' WITH DEFERRED
> REBUILD IDXPROPERTIES("AGGREGATES"="count(key)");
> ALTER INDEX tbl_key_idx ON tbl REBUILD;
> It is valid to use the following query:
> select key, count(key) from tbl where key < 1000 group by key;
> But not valid to use:
> select value, count(key) from tbl group by value; (I think this is the same
> as your case)
> As of now, you cannot use the aggregate index in case of join queries.
> For the semantic analysis error, I apologize for the typo in the code. It
> worked for us as it was uniform in all the classes :)
> You forgot to change it in a few more instances and hence the semantic
> error. The aggregate index itself creates the column references with a
> 'count_Of..' in it. We have fixed the issue and should be updated in the
> trunk soon. Thanks for noticing it though. :)
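Conceptually, the aggregate index materializes one row per index key with a precomputed count, which is why only counts grouped by the index key can be answered from it; a hedged sketch of that idea in plain Python:

```python
# Hedged sketch of what the aggregate index conceptually materializes:
# one row per index key with a precomputed count, so a query like
# "select key, count(key) from tbl where key < 1000 group by key"
# can be answered from the index table instead of the base table.
rows = [(1,), (1,), (2,), (1,), (3,)]  # made-up tbl(key) values

index = {}
for (key,) in rows:
    index[key] = index.get(key, 0) + 1  # the _count_of_key column

# Answering the group-by from the index alone:
result = {k: c for k, c in index.items() if k < 1000}
```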
> Regards,
> Prajakta
>
>
>
>
>
>
>
> On Wed, Sep 21, 2011 at 12:28 AM, rohan monga  wrote:
>>
>> Hi Prajakta,
>> Thanks, I was able to generate a plan that uses indices by
>> following your advice.
>> However, I want to group on one key and count the others:
>> "select count(id2) from table_t1 group by id1" - would that not be possible?
>>
>> Also, could you give me an example of using indices to speed up joins?
>>
>> And for the semantic analysis thing, I sent the error in my first
>> email, here are the lines that I modified to get it to work. Basically
>> changed "_count_Of" to "_count_of"
>>
>> Regards,
>> --
>> Rohan Monga
>>
>>
>>
>> On Wed, Sep 21, 2011 at 4:22 AM, Prajakta Kalmegh 
>> wrote:
>> > Hi Rohan
>> > I ran your queries on the same version from trunk and did not get any
>> > errors. Not sure why you are getting a semantic analysis error. It would
>> > be
>> > good if you could send me a snapshot of the error message.
>> > About the second issue of getting the same plan even if you
>> > set hive.optimize.index.groupby to true or false, we do not optimize for
>> > cases where the count (aggregate functions) are applied on keys other
>> > than
>> > the group-by keys. Since you are trying to get a count of id2 and have
>> > the
>> > index key (as well as the group-by key) on id1, the optimization is not
>> > applied. Hence the same plan.
>> > Hope this helps. Please let me know if you have any questions.
>> > Regards,
>> > Prajakta
>> >
>> >
>> >> From: rohan monga 
>> >> Date: September 19, 2011 11:26:29 PM PDT
>> >> To: 
>> >> Subject: problems with indices
>> >> Reply-To: 
>> >>
>> >> Hi,
>> >> I have a table and index that look like
>> >>
>> >> 
>> >> CREATE TABLE table_t1(id1 int,  id2 int)
>> >> CREATE INDEX table_t1_idx ON TABLE table_t1(id1) AS
>> >> 'org.apache.hadoop.hive.ql.index.AggregateIndexHandler' WITH DEFERRED
>> >> REBUILD IDXPROPERTIES("AGGREGATES"="count(id2)");
>> >> ALTER INDEX table_t1_idx ON table_t1 REBUILD;
>> >> 
>> >>
>> >> Now, I am seeing 2 problems with the query 'set
>> >> hive.optimize.index.groupby=false;select count(id2) from table_t1
>> >> group by id1',
>> >> firstly,
>> >>
>> >> 
>> >> FAILED: Error in semantic analysis: Line 1:11 Invalid table alias or
>> >> column reference '`_count_Of_id2`': (possible column names are: id1,
>> >> _bucketname, _offsets, _count_of_id2)
>> >> 
>> >>
>> >> So, I assumed that it was a typo or something, and I changed all 'Of'
>> >> to 'of' in the relevant places in the code.
>> >>
>> >> Then I ran the query again, with the option set to true and then to
>> >> false, the run times were almost the same. Also 'explain' on the
>> >> queries shows identical query plans.
>> >> I am using hive revision 1172989 from trunk
>> >>
>> >> Could someone help me with this?
>> >>
>> >> Regards,
>> >>
>> >> --
>> >> Rohan Monga
>> >
>> >
>> >
>
>


Re: problems with indices

2011-09-20 Thread rohan monga
Hi Prajakta,
Thanks, I was able to generate a plan that uses indices by
following your advice.
However, I want to group on one key and count another, e.g.
"select count(id2) from table_t1 group by id1"; would that not be possible?

Also, could you give me an example of using indices to speed up joins?

And for the semantic analysis issue: I sent the error in my first
email; here are the lines that I modified to get it to work. Basically, I
changed "_count_Of" to "_count_of".

Regards,
--
Rohan Monga



On Wed, Sep 21, 2011 at 4:22 AM, Prajakta Kalmegh  wrote:
> Hi Rohan
> I ran your queries on the same version from trunk and did not get any
> errors. Not sure why you are getting a semantic analysis error. It would be
> good if you could send me a snapshot of the error message.
> About the second issue of getting the same plan even if you
> set hive.optimize.index.groupby to true or false, we do not optimize for
> cases where the count (aggregate functions) are applied on keys other than
> the group-by keys. Since you are trying to get a count of id2 and have the
> index key (as well as the group-by key) on id1, the optimization is not
> applied. Hence the same plan.
> Hope this helps. Please let me know if you have any questions.
> Regards,
> Prajakta
>
>
>> From: rohan monga 
>> Date: September 19, 2011 11:26:29 PM PDT
>> To: 
>> Subject: problems with indices
>> Reply-To: 
>>
>> Hi,
>> I have a table and index that look like
>>
>> 
>> CREATE TABLE table_t1(id1 int,  id2 int)
>> CREATE INDEX table_t1_idx ON TABLE table_t1(id1) AS
>> 'org.apache.hadoop.hive.ql.index.AggregateIndexHandler' WITH DEFERRED
>> REBUILD IDXPROPERTIES("AGGREGATES"="count(id2)");
>> ALTER INDEX table_t1_idx ON table_t1 REBUILD;
>> 
>>
>> Now, I am seeing 2 problems with the query 'set
>> hive.optimize.index.groupby=false;select count(id2) from table_t1
>> group by id1',
>> firstly,
>>
>> 
>> FAILED: Error in semantic analysis: Line 1:11 Invalid table alias or
>> column reference '`_count_Of_id2`': (possible column names are: id1,
>> _bucketname, _offsets, _count_of_id2)
>> 
>>
>> So, I assumed that it was a typo or something, and I changed all 'Of'
>> to 'of' in the relevant places in the code.
>>
>> Then I ran the query again, with the option set to true and then to
>> false, the run times were almost the same. Also 'explain' on the
>> queries shows identical query plans.
>> I am using hive revision 1172989 from trunk
>>
>> Could someone help me with this?
>>
>> Regards,
>>
>> --
>> Rohan Monga
>
>
>


rohan.diff
Description: Binary data
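
For reference, per Prajakta's explanation above, the group-by index rewrite only fires when the aggregate is taken over the group-by/index key itself. A query shape it targets would look roughly like this (a sketch only; whether the rewrite actually fires depends on the Hive build):

```sql
-- Sketch: the aggregate is over the indexed group-by key (id1),
-- so the optimizer can answer it from the aggregate index.
SET hive.optimize.index.groupby=true;
SELECT id1, count(id1)
FROM table_t1
GROUP BY id1;
```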


problems with indices

2011-09-19 Thread rohan monga
Hi,
I have a table and index that look like


CREATE TABLE table_t1(id1 int,  id2 int)
CREATE INDEX table_t1_idx ON TABLE table_t1(id1) AS
'org.apache.hadoop.hive.ql.index.AggregateIndexHandler' WITH DEFERRED
REBUILD IDXPROPERTIES("AGGREGATES"="count(id2)");
ALTER INDEX table_t1_idx ON table_t1 REBUILD;


Now, I am seeing 2 problems with the query 'set
hive.optimize.index.groupby=false;select count(id2) from table_t1
group by id1',
firstly,


FAILED: Error in semantic analysis: Line 1:11 Invalid table alias or
column reference '`_count_Of_id2`': (possible column names are: id1,
_bucketname, _offsets, _count_of_id2)


So, I assumed that it was a typo or something, and I changed all 'Of'
to 'of' in the relevant places in the code.

Then I ran the query again with the option set to true and then to
false; the run times were almost the same. Also, 'explain' on the
queries shows identical query plans.
I am using hive revision 1172989 from trunk

Could someone help me with this?

Regards,

--
Rohan Monga


Re: CREATE INDEX

2011-09-12 Thread rohan monga
Yup, I'd be interested in this too. I'd like to know how the indices
are applied/used and for which queries they can be useful.

Thanks,
--
Rohan Monga



On Mon, Sep 12, 2011 at 6:16 PM, Adriaan Tijsseling
 wrote:
> Does anyone know the proper documentation for using the "CREATE INDEX"
> command in Hive? The Hive homepage doesn't have the information.
>
> Cheers,
>
> Adriaan
>
>


Problem with custom map-reduce scripts

2011-08-31 Thread rohan monga
Hi,
I have a hive query like this

from (
 from mytable
 map mytable.c1, mytable.c2, mytable.c3
 using '/bin/cat' as c1,c2,c3
 cluster by c1) mo
 insert overwrite table mytable2
 reduce mo.c1, mo.c2, mo.c3
 using 'python convert.py'
 as x1, x2;

Now what is happening is that some of the records from 'mytable' are
not making their way into 'mytable2'. However,

from (
 from (select * from mytable where c1=1) mytable
 map mytable.c1, mytable.c2, mytable.c3
 using '/bin/cat' as c1,c2,c3
 cluster by c1) mo
 insert overwrite table mytable2
 reduce mo.c1, mo.c2, mo.c3
 using 'python convert.py'
 as x1, x2;

Say the missing record was for c1=1; with this query, the record
shows up in mytable2. Is there something I am missing?

Regards,
--
Rohan Monga
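
For what it's worth, the MAP/REDUCE (TRANSFORM) streaming contract is plain text on stdin/stdout: one row per line, columns separated by tabs, and a script that drops lines or mis-terminates them loses records silently. A minimal reducer in that style (a sketch only; the real convert.py's logic is unknown, and the sum transform below is invented for illustration):

```python
import sys

def main(stream=sys.stdin, out=sys.stdout):
    # Hive streams one row per line, columns separated by tabs.
    for line in stream:
        c1, c2, c3 = line.rstrip("\n").split("\t")
        # Emit exactly the declared output columns (x1, x2),
        # tab-separated and newline-terminated; the sum is a
        # made-up placeholder transform.
        out.write("%s\t%d\n" % (c1, int(c2) + int(c3)))

if __name__ == "__main__":
    main()
```

Rows with an unexpected column count raise an exception here and kill the script mid-stream; logging or counting such rows to stderr is one way to check whether the "missing" records are malformed input rather than rows Hive dropped.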


Comments in hive scripts

2011-08-05 Thread rohan monga
Hi,
I am using Hive version 0.7, and I am seeing an odd issue with comments.

The following works

use default;
-- this is good
-- select blah from blahness
...


However,

use default;
-- this is good
-- select * from blahness;
...

doesn't work. Is this because Hive is parsing ';' before removing
comments, or am I missing something?

Thanks,
--
Rohan Monga
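
A toy model (not Hive's actual code) of a processor that splits input on ';' before recognizing '--' comments shows why the second script would break: the ';' inside the comment ends the "statement" early, so the comment text gets handed to the parser as a command of its own.

```python
def split_commands(script):
    # Toy model: statements are split on ';' BEFORE '--' comments are
    # considered, so a ';' inside a comment terminates the statement.
    return [c.strip() for c in script.split(";") if c.strip()]

ok = "use default;\n-- select blah from blahness\nselect count(*) from t;"
bad = "use default;\n-- select * from blahness;\nselect count(*) from t;"

# In 'ok', the comment stays attached to the following statement and can
# be stripped later; in 'bad', the comment's ';' makes the comment line a
# standalone "command" that a comment-unaware parser would reject.
print(split_commands(ok))
print(split_commands(bad))
```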