Re: UDF reflect

2014-04-03 Thread John Meagher
It's probably not as pretty as the new built-in version, but this
allows scripted UDFs in any javax.script language:
https://github.com/livingsocial/HiveSwarm/blob/master/src/main/java/com/livingsocial/hive/udf/ScriptedUDF.java

On Thu, Apr 3, 2014 at 4:35 PM, Andy Srine andy.sr...@gmail.com wrote:
 Thanks Edward. But inline Groovy is only available in Hive 0.13, right? I am
 using an older version.

 Best,
 Andy



 On Thu, Apr 3, 2014 at 11:37 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 You can write UDFs in Groovy now, which pretty much means you can just
 write a quick method inline. That makes the reflect UDF much less useful.
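
 For reference, a rough sketch of that Hive 0.13 feature, going from memory
 of the dynamic-compilation example in the language manual (check the docs
 for your exact version; the Pyth class is purely illustrative):

 compile `import org.apache.hadoop.hive.ql.exec.UDF \;
 public class Pyth extends UDF {
   // illustrative UDF body: hypotenuse of a right triangle
   public double evaluate(double a, double b) {
     return Math.sqrt((a*a) + (b*b));
   }
 } ` AS GROOVY NAMED Pyth.groovy;
 CREATE TEMPORARY FUNCTION Pyth as 'Pyth';
 SELECT Pyth(3, 4) FROM src LIMIT 1;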


 On Thu, Apr 3, 2014 at 2:22 PM, Andy Srine andy.sr...@gmail.com wrote:

 Thanks Szehon and Peyman. I want to call hashCode() on the UUID object.
 This object is stored in the table as a string, but I can convert it to a
 UUID. That's not the problem.

 Basically the question is: how do we call the reflect UDF on methods
 that take no arguments? How do I do the following using the reflect UDF?

 (UUID.fromString(uid_str)).hashCode()

 Thanks,
 Andy




 On Thu, Apr 3, 2014 at 6:18 AM, Peyman Mohajerian mohaj...@gmail.com
 wrote:

 Maybe your intention is the following:
 reflect('java.util.UUID', 'randomUUID')


 On Thu, Apr 3, 2014 at 2:33 AM, Szehon Ho sze...@cloudera.com wrote:

 Hi, according to the description of the reflect UDF, you are trying to
 call java.util.UUID.hashCode(uidString), which doesn't seem to be an
 existing method in either Java 6 or 7.

 http://docs.oracle.com/javase/7/docs/api/java/util/UUID.html#hashCode()

 Thanks
 Szehon




 On Wed, Apr 2, 2014 at 2:13 PM, Andy Srine andy.sr...@gmail.com
 wrote:

 Hi guys,


 I am trying to use the reflect UDF for a UUID method and am getting
 an exception. I believe this function should be available in Java 1.6.0_31,
 which the system is running.


 select reflect('java.util.UUID', 'hashCode', uid_str) my_uid,

 ...


 My suspicion is that this is because the Hive column I am calling this on
 is a string and not a UUID. So I nested the reflects as shown below to go
 from a string to a UUID first and then hashCode it.


 reflect('java.util.UUID', 'hashCode', reflect('java.util.UUID',
 'fromString', uid_str)) my_uid,


 In either case, I always get the exception below, though the row of
 data it prints shows no null for the uid_str column. Any ideas?


 at
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:565)

 at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:143)

 ... 8 more

 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
 UDFReflect getMethod

 at
 org.apache.hadoop.hive.ql.udf.generic.GenericUDFReflect.evaluate(GenericUDFReflect.java:164)

 at
 org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.evaluate(ExprNodeGenericFuncEvaluator.java:163)

 at
 org.apache.hadoop.hive.ql.exec.KeyWrapperFactory$ListKeyWrapper.getNewKey(KeyWrapperFactory.java:113)

 at
 org.apache.hadoop.hive.ql.exec.GroupByOperator.processOp(GroupByOperator.java:794)

 at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)

 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)

 at
 org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)

 at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)

 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)

 at
 org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)

 at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)

 at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)

 at
 org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:548)

 ... 9 more

 Caused by: java.lang.NoSuchMethodException:
 java.util.UUID.hashCode(null)

 at java.lang.Class.getMethod(Class.java:1605)

 at
 org.apache.hadoop.hive.ql.udf.generic.GenericUDFReflect.evaluate(GenericUDFReflect.java:160)


 Thanks,

 Andy







 --
 Thanks,
 Andy





 --
 Thanks,
 Andy



Re: Help with simple UDF error..

2014-02-13 Thread John Meagher
Try changing ArrayList to just List as the argument to the evaluate function.

On Thu, Feb 13, 2014 at 12:15 PM, Manish maa...@gmail.com wrote:
 Hello:
 While running the hive query :

 hive> select 'Actor', count(*) from metrics where
 array_contains(type_tuple, 'Actor') and checkAttributes(attribute_tuple,
 'p1', 'Actor');

 I get this error :

 FAILED: SemanticException [Error 10014]: Line 1:103 Wrong arguments
 'Actor': No matching method for class yk.HiveCheckAllAttributes with
 (array<string>, string, string). Possible choices: _FUNC_(string, string,
 string)


 The metrics Hive table schema looks like:

 hive> describe metrics;
 OK
 col_name          data_type       comment
 entity_uuid       string
 type_tuple        array<string>
 attribute_tuple   array<string>


 The class yk.HiveCheckAllAttributes extends UDF and has the evaluate method
 defined as:

 --snip--
 public boolean evaluate(ArrayList<String> payload, String pType, String
 type) {
 ...
 }

 I have tried the Text type and still got the same error:
 --snip--
 public boolean evaluate(Text payload, String pType, String type) {
 ...
 }



 How can I pass a Hive array<string> as an argument to a simple UDF?

 /Manish


Re: Can data be passed to the final mode init call in a UDAF?

2014-02-12 Thread John Meagher
The code I'm having problems with is
https://github.com/livingsocial/HiveSwarm/blob/WIP_scripting/src/main/java/com/livingsocial/hive/udf/ScriptedUDAF.java.

There is a FIXME comment at the problem area where the return type for
the init method cannot be determined.

On Mon, Feb 10, 2014 at 4:26 PM, John Meagher john.meag...@gmail.com wrote:
 I'm working on a UDAF that takes in a constant string that defines
 what the final output of the UDAF will be.  In the mode=PARTIAL1 call
 to the init function all the parameters are available and the constant
 can be read so the output ObjectInspector can be built.  I haven't
 found a way to pass this same constant data through so it is available
 to the mode=FINAL call to init.

 Is there a way to pass constant parameters through the UDAF so it is
 available in the mode=FINAL init call?

 I have tried having the PARTIAL1 init call return a
 StructObjectInspector containing constant object inspectors with the
 data, but when that is read during the FINAL call the incoming
 ObjectInspectors are no longer constants.

 I've looked for ways to access the JobConf object for the job to try
 to pass this parameter along, but I can't find a way to access that
 from Hive.  I can pass the data along with the data coming from
 terminatePartial, but that data is not available until merge is called,
 and init in FINAL mode runs before merge.

 Note: this is with Hive 0.10.  While I'd like to get this working
 under 0.10, I'd also like to know if this is a feature available in newer
 releases.

 Thanks,
 John


Re: FUNCTION HIVE to DAYS OF WEEK

2014-02-11 Thread John Meagher
The pom.xml file in the top level specifies the version of Hadoop and Hive
it is compiling against.  There are 2 ways to get things working:
1) Modify the HiveSwarm pom.xml file to reference the version of Hadoop and
Hive you are using and recompile
2) Copy the day of the week UDF over into your own project

Either approach should work.
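
A third option that avoids recompiling anything: from_unixtime() formats with
Java SimpleDateFormat patterns, and 'EEEE' is the full day name. A minimal
sketch (table and column names are made up, and it assumes the column is a
'yyyy-MM-dd HH:mm:ss' string):

select from_unixtime(unix_timestamp(ts_col), 'EEEE') as day_of_week
from my_table;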



On Tue, Feb 11, 2014 at 5:47 AM, Eduardo Parra Valdes | BEEVA 
eduardo.pa...@beeva.com wrote:

 Hi, John

 Thanks very much for your response. This project is very interesting.

 I tested your jar, but it does not work. I think that is because I have a
 different version of Hadoop: I have installed Hadoop 2.2.0 with YARN.

 The error is:

 Diagnostic Messages for this Task:
 Error: java.lang.RuntimeException: Error in configuring object
 at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
 at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
 at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:425)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
 Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
 ... 9 more
 Caused by: java.lang.RuntimeException: Error in configuring object
 at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
 at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
 at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
 at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
 ... 14 more
 Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
 ... 17 more
 Caused by: java.lang.RuntimeException: Map operator initialization failed
 at
 org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:134)
 ... 22 more
 Caused by: java.lang.NullPointerException
 at
 org.apache.hadoop.hive.ql.exec.FunctionRegistry.isStateful(FunctionRegistry.java:1422)
 at
 org.apache.hadoop.hive.ql.exec.FunctionRegistry.isDeterministic(FunctionRegistry.java:1385)
 at
 org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.isDeterministic(ExprNodeGenericFuncEvaluator.java:132)
 at
 org.apache.hadoop.hive.ql.exec.ExprNodeEvaluatorFactory.iterate(ExprNodeEvaluatorFactory.java:83)
 at
 org.apache.hadoop.hive.ql.exec.ExprNodeEvaluatorFactory.toCachedEval(ExprNodeEvaluatorFactory.java:73)
 at
 org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:57)
 at
 org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:377)
 at
 org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:453)
 at
 org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:409)
 at
 org.apache.hadoop.hive.ql.exec.FilterOperator.initializeOp(FilterOperator.java:82)
 at
 org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:377)
 at
 org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:453)
 at
 org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:409)
 at
 org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:188)
 at
 org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:377)
 at
 org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:425)
 at
 org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:377)
 at
 org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:113)
 ... 22 more

 I'll try doing other tests with another version of Hadoop.

 If you can see why your jar doesn't work, I would appreciate it if you told
 me.

 Thanks a lot!




 2014-02-10 22:43 GMT+01:00 John Meagher john.meag...@gmail.com

Can data be passed to the final mode init call in a UDAF?

2014-02-10 Thread John Meagher
I'm working on a UDAF that takes in a constant string that defines
what the final output of the UDAF will be.  In the mode=PARTIAL1 call
to the init function all the parameters are available and the constant
can be read so the output ObjectInspector can be built.  I haven't
found a way to pass this same constant data through so it is available
to the mode=FINAL call to init.

Is there a way to pass constant parameters through the UDAF so it is
available in the mode=FINAL init call?

I have tried having the PARTIAL1 init call return a
StructObjectInspector containing constant object inspectors with the
data, but when that is read during the FINAL call the incoming
ObjectInspectors are no longer constants.

I've looked for ways to access the JobConf object for the job to try
to pass this parameter along, but I can't find a way to access that
from Hive.  I can pass the data along with the data coming from
terminatePartial, but that data is not available until merge is called,
and init in FINAL mode runs before merge.

Note: this is with Hive 0.10.  While I'd like to get this working
under 0.10, I'd also like to know if this is a feature available in newer
releases.

Thanks,
John


Re: FUNCTION HIVE to DAYS OF WEEK

2014-02-10 Thread John Meagher
Here's one implementation of it:
https://github.com/livingsocial/HiveSwarm#dayofweekdate.

The code for it is pretty straightforward:
https://github.com/livingsocial/HiveSwarm/blob/master/src/main/java/com/livingsocial/hive/udf/DayOfWeek.java


On Mon, Feb 10, 2014 at 4:38 PM, Stephen Sprague sprag...@gmail.com wrote:

 oddly enough i don't see one here:
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions

 however, you're not the only one finding something like this useful. cf.
 https://issues.apache.org/jira/browse/HIVE-6046

 in the meantime it appears as though you'll have to write your own UDF or
 perhaps set up a streaming job via the transform() function.

 Cheers,
 Stephen.

 PS. The transform() function is _really_ easy to use so don't be scared
 off.  cf.
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
 (the last example is the simplest.)


 On Mon, Feb 10, 2014 at 8:33 AM, Eduardo Parra Valdes | BEEVA 
 eduardo.pa...@beeva.com wrote:

 Hello! all
 I wish to find a Hive function that returns the day of the week (Monday,
 Tuesday, etc.) given a parameter (timestamp).
 Does anyone have an idea of how to do it?

 --
 Eduardo Parra Valdés
 eduardo.pa...@beeva.com

 BEE OUR CLIENT
 WWW.BEEVA.COM http://www.beeva.com/
 Clara del Rey 26, 4ª planta.
 [28002] Madrid.





Re: Formatting hive queries

2014-01-22 Thread John Meagher
I use vim and https://github.com/vim-scripts/SQLUtilities to do it.
It's not hive specific.  Any SQL formatting tool will work.

On Tue, Jan 21, 2014 at 11:23 PM, pandees waran pande...@gmail.com wrote:
 Hi,

 I would like to come up with code which automatically formats your hql
 files.
 Formatting is one of the more tedious tasks, and I would like to come up
 with a utility for it.
 Please let me know whether any specific utilities already exist for
 formatting hive queries.

 --
 Thanks,
 Pandeeswaran


Re: Problem with min function in HiveQL

2013-08-29 Thread John Meagher
Aggregate functions need to go in a HAVING clause instead of the WHERE
clause.  WHERE clauses are applied prior to aggregation, HAVING is
applied post aggregation.

select ...
from ...
where <some row-level filter>
group by ...
having <some aggregate-level filter>
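
In this particular query, though, the min() comes from the joined CALENDRIER
table, so another option is to compute it once in a subquery and join against
that. A rough, untested sketch against the tables from the query below (cross
join needs Hive 0.10+):

select a.type_entite, a.code_entite, sum(a.MT_CA_NET_TTC) as MT_CA_NET_TTC
from default.THM_CA_RGRP_PRODUITS_JOUR a
cross join (select min(dt_jour) as min_jour from default.CALENDRIER) m
where a.dt_jour >= m.min_jour
group by a.type_entite, a.code_entite;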


On Thu, Aug 29, 2013 at 2:49 PM, Jason Dere jd...@hortonworks.com wrote:
 Looks like the issue is the use of min() within the WHERE clause - the place
 where the exception is being thrown has the following comment:
 // UDAF in filter condition, group-by caluse, param of funtion, etc.


 On Aug 29, 2013, at 3:01 AM, Jérôme Verdier verdier.jerom...@gmail.com
 wrote:

 Hi everybody,

 I am coding some HiveQL script to do some calculations.

 I have a problem with the min() function.

 My hive script is below :

 INSERT INTO TABLE default.THM_CA_RGRP_PRODUITS_SEM

 SELECT
 '${hiveconf:in_co_societe}'   as co_societe,
 '${hiveconf:in_co_an_semaine}'as co_an_semaine,
 a.type_entite as type_entite,
 a.code_entite as code_entite,
 a.type_rgrp_produits  as type_rgrp_produits,
 a.co_rgrp_produitsas co_rgrp_produits,
 SUM(a.MT_CA_NET_TTC)  as MT_CA_NET_TTC,
 SUM(a.MT_OBJ_CA_NET_TTC)  as MT_OBJ_CA_NET_TTC,
 SUM(a.NB_CLIENTS) as NB_CLIENTS,
 SUM(a.MT_CA_NET_TTC_COMP) as MT_CA_NET_TTC_COMP,
 SUM(a.MT_OBJ_CA_NET_TTC_COMP) as MT_OBJ_CA_NET_TTC_COMP,
 SUM(a.NB_CLIENTS_COMP)as NB_CLIENTS_COMP
   from
 default.THM_CA_RGRP_PRODUITS_JOUR a
   JOIN default.CALENDRIER b
   -- To verify
   WHERE CAST(a.dt_jour AS TIMESTAMP) >=
 CAST(min(UNIX_TIMESTAMP(b.dt_jour)) AS BIGINT)
   AND CAST(a.dt_jour AS TIMESTAMP) < CAST(min(b.dt_jour)+1 AS TIMESTAMP)
   AND a.co_societe = '${hiveconf:in_co_societe}'
   AND ('${hiveconf:flg_reprise_entite}' = 0 OR a.type_entite <> 'MAG')
   GROUP BY
 a.type_entite,
 a.code_entite,
 a.type_rgrp_produits,
 a.co_rgrp_produits;

 And when I try to launch this, I get this error:

 FAILED: SemanticException [Error 10128]: Line 20:62 Not yet supported place
 for UDAF 'min'

 Obviously, there is a problem with the min() function.

 How can I get past this error?

 Thanks for your help





Re: Java Courses for Scripters/Big Data Geeks

2013-07-17 Thread John Meagher
The Data Science course on Coursera has a pretty good overview of map
reduce, Hive, and Pig without going into the Java side of things.
https://www.coursera.org/course/datasci.  It's not in depth, but it is
enough to get started.

On Wed, Jul 17, 2013 at 3:52 PM, John Omernik j...@omernik.com wrote:
 Hey all -

 I was wondering if there were any shortcut Java courses out there.  As in,
 I am not looking for a holistic learn-everything-about-Java course, but more
 of a "So you are a big data/Hive geek and you get Python/Perl pretty well,
 but when you try to understand Java your head explodes and it feels like you
 are missing something entry-level and basic, thus you need these basic things
 and you'll be fine" course.


 Any thoughts?




Re: Hive - max rows limit (int limit = 2^31). need Help (looks like a bug)

2013-05-29 Thread John Meagher
What is the data type of the p1 column?  I've used Hive with
partitions containing far above 2 billion rows without having any
problems like this.

On Wed, May 29, 2013 at 2:41 PM, Gabi Kazav gabi.ka...@pursway.com wrote:
 Hi,



 We are working with a Hive DB on our Hadoop cluster.

 We are now facing an issue when joining a big partition with more than 2^31
 rows.

 When the partition has more than 2147483648 rows (even 2147483649), the
 output of the join is a single row.
 When the partition has fewer than 2147483648 rows (even 2147483647), the
 output is correct.

 Our test case: create a table with 2147483649 rows in a partition with the
 value 1, and join this table to another table with a single row and a single
 column with the value 1 on the partition_key. Then delete 2 rows and run the
 same join.
 1st run: only a single row is produced.
 2nd run: 2147483647 rows.

 The query we ran to test the case is:



 create table output_rows_over as

 select a.s1

 from  max_sint_rows a join small_table b

 on (a.p1=b.p1);



 On more than 2^31 rows we got the following in the reducer log:

 2013-05-27 21:51:14,186 INFO
 org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:1

 On fewer than 2^31 rows we got the following in the reducer log:

 2013-05-27 23:43:14,681 INFO
 org.apache.hadoop.hive.ql.exec.FileSinkOperator:
 TABLE_ID_1_ROWCOUNT:2147483647





 Has anyone faced this issue?

 Does Hive have a workaround for it?

 I have huge partitions I need to work on and I cannot use Hive for that.



 Thanks,





 Gabi Kazav

 Infrastructure Team Leader, Pursway.com






Re: How to do a calculation among 2 rows?

2013-05-15 Thread John Meagher
There's a new release candidate version of Hive that includes
windowing functions that address this.  RC1 is available at
http://people.apache.org/~hashutosh/hive-0.11.0-rc1/.  Also see
https://issues.apache.org/jira/browse/HIVE-896

Without 0.11 the approach from Paul will work, but will be slow.
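
With 0.11, a minimal windowing sketch (reusing the hypothetical t1/a/b/c
names from Paul's example below; untested):

select a, b, c,
       c - lag(c) over (partition by a order by b) as diff_from_prev_row
from t1;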

On Wed, May 15, 2013 at 7:53 AM, Paul COURTOIS p...@pole-conseils.com wrote:
 Hi Bhathiya,

 What about a cartesian product?

 The idea would be something like this:
 select a, b, c from t1
 join t1 as t2 on t2.a = t1.a
 where t2.b < t1.b

 it's not very elegant, but should work ;-)

 Paul


 2013/5/15 Bhathiya Jayasekara tobhathi...@gmail.com

 Hi all,

 Can someone please give an idea for this? I'm totally stuck.


 http://stackoverflow.com/questions/16558438/in-hive-how-to-do-a-calculation-among-2-rows

 Thanks.

 --Bhathiya




Re: Hive Group By Limitations

2013-05-06 Thread John Meagher
Not quite sure but I think each group by will give another M/R job.

It will be done in a single M/R job no matter how many fields are in
the GROUP BY clause.
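
To keep the other fields in the output without grouping on them, wrap them
in aggregates. A sketch against Peter's some_table (max() and collect_set()
chosen arbitrarily):

select st.a, max(st.b) as b, collect_set(st.c) as all_c
from some_table st
group by st.a;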

On Mon, May 6, 2013 at 2:07 PM, Peter Chu pete@outlook.com wrote:
 In Hive, I cannot perform a SELECT with GROUP BY on fields not in the
 GROUP BY clause.

 Example: SELECT st.a, st.b, st.c, st.d FROM some_table st GROUP BY st.a;
 -- This does not work.

 To make it work, I would need to add the other fields to the GROUP BY
 clause.

 Not quite sure but I think each group by will give another M/R job.

 Wondering if there is any other way / better way to do group by.

 Peter


Re: Using Reflect: A thread for ideas

2013-02-19 Thread John Meagher
Another option for this functionality would be to use the Java scripting
API.  The basic structure of the call would be...

select script( scriptLanguage, scriptToRun, args... )

I haven't seen that in Hive, but something similar is available for Pig.
 Documentation for that is available on
http://pig.apache.org/docs/r0.9.2/udf.html#js-udfs.  There's also a
variation in Jira https://issues.apache.org/jira/browse/PIG-1777.
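
Purely hypothetical usage, if such a script() function existed in Hive
(nothing here is a real Hive built-in, and the table/column are invented):

select script('javascript',
              'function f(s) { return s.toUpperCase(); } f(name);',
              name) as upper_name
from users;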



On Wed, Feb 13, 2013 at 11:38 PM, John Omernik j...@omernik.com wrote:

  I stumbled across the little-documented reflect function today. I've
  always known about it, but Java scares me if it's not in a cup, so I didn't
  dig.  Well, today I dug, and found an awesome use case for reflect (for me)
  and wanted to share.  I also thought it would be nice to validate some
  thoughts I had on reflect, and how we could possibly share ideas on reflect
  so that folks could get more use out of this great feature of hive.

 Here's my example: A simple URL decode function:

 select url, reflect('java.net.URLDecoder', 'decode', url, 'utf-8') as
 decoded_url from logs
 Basically I am using the decode function of the java.net.URLDecoder class.
  Pretty awesome, works great, no files to distribute either.  Even works
 through JDBC!

  Ok, that being said, I realize now that the function I am trying to call
  has to return data in a simple data type.  For example, I struggle to come
  up with a simple reflect() for making a hex MD5 out of a string, because
  the built-in functions return an object whose methods can return what I am
  looking for. Which is great, but then I have to compile Java code,
  distribute a jar, and then run the code. I am looking for simple, like
  the URLDecoding function.
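
  One reflect() that does fit the simple-type constraint, assuming Apache
  commons-codec is on the Hive classpath (it normally ships with Hadoop),
  since DigestUtils.md5Hex is a static method returning a plain String:

  select reflect('org.apache.commons.codec.digest.DigestUtils', 'md5Hex',
                 url) as url_md5
  from logs;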

  I love this reflect feature, but I think it's probably underutilized due
  to the perceived usability issues for beginners.  So that leads me to my
  next thought: what if we brainstorm handy Java functions that are not
  included in the standard Hive language, that translate well to Hive via
  the reflect function, and then show an example of their use? I went first
  with my URLDecode, and obviously will be looking for more, but have you
  seen some examples that were neat and worked well for you? Can you
  share?

 Perhaps if we get enough examples we could roll some of these into a wiki
 page on the hive wiki that folks can use to get over the perceived
 complexity of using java reflect?

 Thanks to those who have worked hard to implement features like this, it
 is truly awesome.



Re: Help in hive query

2012-10-30 Thread John Meagher
The WHERE part in the approvals can be moved up to be an IF in the SELECT...


SELECT client_id, receive_dd, receive_hh,
  receive_hh+1,
  COUNT(1) AS transaction_count,
  SUM(IF(response = '00', 1, 0)) AS approval_count,
  SUM(IF(response = '00', 1, 0)) / COUNT(1) * 100 AS percent
FROM sale_test
GROUP BY client_id, receive_dd, receive_hh, receive_hh+1;



On Tue, Oct 30, 2012 at 1:51 AM, dyuti a hadoop.hiv...@gmail.com wrote:
 Hi All,

 I want to compute (no. of approvals in an hour / no. of transactions in
 that hour) * 100.

 //COUNT(1) AS cnt  gives total transactions in an hour
 SELECT client_id,receive_dd,receive_hh,receive_hh+1,COUNT(1) AS cnt FROM
 sale_test group by fdc_client_id,receive_dd,receive_hh,receive_hh+1;

 GETREGREOS  23  16  17  5969
 GETREGREOS  23  21  22  2602
 GETREGREOS  24  3   4   114

 // Approved transactions where response='00'
 SELECT client_id,receive_dd,receive_hh,receive_hh+1,COUNT(1) AS cnt FROM
 sale_test where response='00' group by
 client_id,receive_dd,receive_hh,receive_hh+1;
 GETREGREOS  23  16  17  5775
 GETREGREOS  23  21  22  2515
 GETREGREOS  24  3   4   103


 I want to compute 100 * (5775/5969), 100 * (2515/2602), 100 * (103/114),
 and the same for all other clients for each hour, i.e. (no. of approvals in
 an hour / no. of transactions in that hour) * 100.

 Please help me out as how to achieve this in hive.


 Thanks & Regards,
 dti


Re: Hive directory permissions

2012-08-16 Thread John Meagher
Creating the /user/hive/warehouse folder is a one-time setup step that can
be done as the hdfs user.  With g+w permissions any user can then create
and read the tables.


On Thu, Aug 16, 2012 at 9:57 AM, Connell, Chuck chuck.conn...@nuance.com wrote:

  I have no doubt that works, but surely a Hive user should not need sudo
 privileges!  I am also looking for best practices, since we have run into
 the same.


 From: Himanish Kushary [mailto:himan...@gmail.com]
 Sent: Thursday, August 16, 2012 9:51 AM
 To: user@hive.apache.org
 Subject: Re: Hive directory permissions

 We usually start the shell through sudo, otherwise we get a "Permission
 denied" while creating Hive tables.


 But this is a good point. Any suggestions/best practices from the user
 community?

 Thanks

 On Thu, Aug 16, 2012 at 9:37 AM, Connell, Chuck chuck.conn...@nuance.com
 wrote:

 I have run into similar problems. Thanks for the suggestions. One concern:
 isn't hdfs a highly privileged user within the Hadoop cluster? So do we
 really want it to be standard practice for all Hive users to su to hdfs?

 Chuck Connell

 Nuance R&D Data Team

 Burlington, MA

 From: Himanish Kushary [mailto:himan...@gmail.com]
 Sent: Thursday, August 16, 2012 9:30 AM
 To: user@hive.apache.org
 Subject: Re: Hive directory permissions

 Hi Sean,

 From the Hive language manual: "Moreover, we strongly advise users to
 create the HDFS directories /tmp and /user/hive/warehouse
 (aka hive.metastore.warehouse.dir) and set them chmod g+w before tables
 are created in Hive."

 My warehouse directory has the following permissions:

 Name       Type  Modification Time  Permission  Owner  Group
 warehouse  dir   2012-08-15 15:51   rwxrwxr-x   hdfs   supergroup

 Also I make sure that I run the hive CLI as the 'hdfs' user.

 Hope this helps.

 Thanks,
 Himanish

 On Thu, Aug 16, 2012 at 1:50 AM, Sean Bigdatafun 
 sean.bigdata...@gmail.com wrote:

 I ran the following command trying to set up my hive directory, but got the
 following error:

 root@cdh-client:/# hadoop fs -mkdir /user/hive/warehouse

 mkdir: org.apache.hadoop.security.AccessControlException: Permission
 denied: user=root, access=WRITE, inode=/user:hdfs:hadoop:drwxr-xr-x

 Can someone tell me the correct ownership for hive directories?
 My /user directory has the following ownership:

 user  dir  2012-08-14 16:51  rwxr-xr-x  hdfs  hadoop

 Should I create a sub-directory under /user with the following permissions?

 hive  dir  2012-08-14 16:51  rwxr-xr-x  hive  hadoop


 Thanks,

 Sean

 --
 Thanks & Regards
 Himanish



 

 --
 Thanks & Regards
 Himanish



Re: Converting rows into dynamic columns in Hive

2012-08-09 Thread John Meagher
R is another programming language geared to statistical analysis.  It
has some features that make it easy to write this sort of data
manipulation.  http://www.r-project.org/

Here's a sample R function that does something similar to what you
need (note: I'm not a real R programmer, so this might be ugly, but it
works):

abCountsToMatrix <- function(csvFilename, outputmatrixfile) {

   data <- read.csv(csvFilename, as.is=T, check.names=F, header=F)
   cat("Loaded", nrow(data), "rows.\n")
   idrows <- sort(unique(c(data[ , 1])))
   idcols <- sort(unique(c(data[ , 2])))

   cat("Found", length(idrows), " X ", length(idcols), ".")
   output <- array(0, c(length(idrows), length(idcols)))
   rownames(output) <- idrows
   colnames(output) <- idcols

   for (i in 1:nrow(data)) {

     if (i %% 100 == 0) {
       cat("On ", i, " of ", nrow(data), "\n")
     }

     rowidx <- which(idrows == data[i, 1])
     colidx <- which(idcols == data[i, 2])

     count <- data[i, 3]

     output[rowidx, colidx] <- count
   }

   write.csv(output, file=outputmatrixfile, quote=F)

}
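
If a map column is acceptable instead of true dynamic columns, a partly
Hive-side alternative is to fold each group's rows into a string and parse
it back into a map. A rough, untested sketch using the alpha/beta/gamma
names from the question (the table name t is made up):

select alpha,
       str_to_map(concat_ws(',',
           collect_set(concat(beta, ':', cast(gamma as string))))) as beta_map
from t
group by alpha;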



On Wed, Aug 8, 2012 at 8:58 PM,  richin.j...@nokia.com wrote:
 John,

 What is R?

 -Original Message-
 From: ext John Meagher [mailto:john.meag...@gmail.com]
 Sent: Wednesday, August 08, 2012 4:34 PM
 To: user@hive.apache.org
 Subject: Re: Converting rows into dynamic colums in Hive

 I don't think having dynamic columns is possible in Hive.  I've always output 
 from Hive a structure like your query output and used R to convert it into a 
 dynamic column structure.


 On Wed, Aug 8, 2012 at 3:56 PM,  richin.j...@nokia.com wrote:
 Thanks Ashish, that gives an idea.

 But I am not sure about the outer select: I have to know all the values
 in the Beta column beforehand to do a max on each value.

 Is there a better way?



 Richin



 From: ext Ashish Thusoo [mailto:athu...@qubole.com]
 Sent: Tuesday, August 07, 2012 5:05 PM
 To: user@hive.apache.org
 Subject: Re: Converting rows into dynamic colums in Hive



  you should be able to do this in hive using a group by on alpha and
  then using a combination of max and if statements... something along
  the following lines

  select alpha, max(abc), max(pqr), ...
  from (
    select alpha, if (beta == 'abc', Gamma, NULL) as abc, if (beta ==
  'pqr', Gamma, NULL) as pqr, ...
    from table
  ) t
  group by alpha

 something on those lines...

 Ashish

 On Tue, Aug 7, 2012 at 1:57 PM, richin.j...@nokia.com wrote:

 Hi All,



  One of my query outputs looks like this:



  Alpha   Beta   Gamma
  123     xyz    1.0
  123     abc    0.5
  123     pqr    1.3
  123     def    2.1
  456     xyz    0.1
  456     abc    0.6
  456     pqr    1.9
  456     def    3.2
  456     lmn    1.1
  456     sdf    1.2



  I want the output, for data visualization purposes, to look like this
  (basically taking rows from one table and making them columns in another
  table):

  Alpha   xyz   abc   pqr   def   lmn   sdf
  123     1.0   0.5   1.3   2.1
  456     0.1   0.6   1.9   3.2   1.1   1.2



  Question - Can it be done in Hive? If not, any suggestions?



 Thanks,

 Richin








Re: Converting rows into dynamic columns in Hive

2012-08-08 Thread John Meagher
I don't think having dynamic columns is possible in Hive.  I've always
output from Hive a structure like your query output and used R to
convert it into a dynamic column structure.


On Wed, Aug 8, 2012 at 3:56 PM,  richin.j...@nokia.com wrote:
 Thanks Ashish, that gives an idea.

 But I am not sure about the outer select: I have to know all the values
 in the Beta column beforehand to do a max on each value.

 Is there a better way?



 Richin



 From: ext Ashish Thusoo [mailto:athu...@qubole.com]
 Sent: Tuesday, August 07, 2012 5:05 PM
 To: user@hive.apache.org
 Subject: Re: Converting rows into dynamic colums in Hive



 you should be able to do this in hive using a group by on alpha and then
 using a combination of max and if statements... something along the
 following lines

 select alpha, max(abc), max(pqr), ...
 from (
   select alpha, if (beta == 'abc', Gamma, NULL) as abc, if (beta == 'pqr',
 Gamma, NULL) as pqr, ...
   from table
 ) t
 group by alpha

 something on those lines...

 Ashish

 On Tue, Aug 7, 2012 at 1:57 PM, richin.j...@nokia.com wrote:

 Hi All,



 One of my query outputs looks like this:



 Alpha   Beta   Gamma
 123     xyz    1.0
 123     abc    0.5
 123     pqr    1.3
 123     def    2.1
 456     xyz    0.1
 456     abc    0.6
 456     pqr    1.9
 456     def    3.2
 456     lmn    1.1
 456     sdf    1.2



 I want the output, for data visualization purposes, to look like this
 (basically taking rows from one table and making them columns in another
 table):

 Alpha   xyz   abc   pqr   def   lmn   sdf
 123     1.0   0.5   1.3   2.1
 456     0.1   0.6   1.9   3.2   1.1   1.2



 Question - Can it be done in Hive? If not, any suggestions?



 Thanks,

 Richin