RE: Converting rows into dynamic columns in Hive
John, what is R?

-----Original Message-----
From: ext John Meagher [mailto:john.meag...@gmail.com]
Sent: Wednesday, August 08, 2012 4:34 PM
To: user@hive.apache.org
Subject: Re: Converting rows into dynamic columns in Hive

I don't think having dynamic columns is possible in Hive. I've always had Hive output a structure like your query output, then used R to convert it into a dynamic column structure.

On Wed, Aug 8, 2012 at 3:56 PM, wrote:
> Thanks Ashish, that gives me an idea.
>
> But I am not sure about the outer select loop; I would have to know all the
> values in the Beta column beforehand to do a max on each value.
>
> Is there a better way?
>
> Richin
>
> From: ext Ashish Thusoo [mailto:athu...@qubole.com]
> Sent: Tuesday, August 07, 2012 5:05 PM
> To: user@hive.apache.org
> Subject: Re: Converting rows into dynamic columns in Hive
>
> You should be able to do this in Hive using a group by on alpha and then
> a combination of max and if, something along the following lines:
>
> select alpha, max(abc), max(pqr), ...
> from (
>   select alpha,
>          if (beta == 'abc', Gamma, NULL) as abc,
>          if (beta == 'pqr', Gamma, NULL) as pqr,
>          ...
>   from table
> ) t
> group by alpha
>
> something on those lines...
>
> Ashish
>
> On Tue, Aug 7, 2012 at 1:57 PM, wrote:
>> Hi All,
>>
>> One of my query outputs looks like this:
>>
>> Alpha  Beta  Gamma
>> 123    xyz   1.0
>> 123    abc   0.5
>> 123    pqr   1.3
>> 123    def   2.1
>> 456    xyz   0.1
>> 456    abc   0.6
>> 456    pqr   1.9
>> 456    def   3.2
>> 456    lmn   1.1
>> 456    sdf   1.2
>>
>> For data visualization purposes I want the output to look like this
>> (basically taking rows from one table and making them columns in
>> another table):
>>
>> Alpha  xyz  abc  pqr  def  lmn  sdf
>> 123    1.0  0.5  1.3  2.1
>> 456    0.1  0.6  1.9  3.2  1.1  1.2
>>
>> Question: can it be done in Hive? If not, any suggestions?
>>
>> Thanks,
>> Richin
RE: Custom UserDefinedFunction in Hive
Actually, as the custom UDF "yesterday()" mentioned below is NOT marked with the annotation @UDFType(deterministic = false), partition pruning should work in practice. The PartitionPruner has logic around this annotation to check whether a generic UDF is deterministic, and skips partition pruning if it finds any non-deterministic function:

http://svn.apache.org/repos/asf/hive/branches/branch-0.9/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionPruner.java

I guess this check is implemented to avoid inconsistencies in the result set that might arise in scenarios where the predicate is: part_col = f(time) and regular_col = f(time). The expression involving "part_col" is evaluated at compile time and the expression involving "regular_col" is evaluated at run time, so the function yesterday() might return different values if the query is executed around midnight.

Thanks,
Venkatesh

Date: Wed, 8 Aug 2012 03:49:56 -0700
From: bejoy...@yahoo.com
Subject: Re: Custom UserDefinedFunction in Hive
To: user@hive.apache.org

Hi Raihan

UDFs are evaluated at run time when the query is executed, but it is the Hive parser, during query parse time, that decides the boundary of the data to be used for the query, i.e. which partitions have to be processed. Because of this the entire table will be scanned for your query.

Regards,
Bejoy KS

From: Raihan Jamal
To: user@hive.apache.org
Cc: d...@hive.apache.org
Sent: Tuesday, August 7, 2012 10:50 PM
Subject: Re: Custom UserDefinedFunction in Hive

Hi Jan,

I figured that out; it is working fine for me now. The only question I have is: if I do this

SELECT * FROM REALTIME where dt = yesterdaydate('yyyyMMdd') LIMIT 10;

then the above query will be evaluated as below, right?

SELECT * FROM REALTIME where dt = '20120806' LIMIT 10;

So that means it will look for data in the corresponding dt partition (20120806) only, right, as the above table is partitioned on the dt column? It will not scan the whole table, right?
Raihan Jamal

On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár wrote:

Hi Jamal,

Check if the function really returns what it should and that your data really are in yyyyMMdd format. You can do this with a simple query like:

SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;

I don't see anything wrong with the function itself; it works well for me (although I tested it in Hive 0.7.1). The only thing I would change would be to optimize it by calling 'new' only at construction time and reusing the object when the function is called, but that should not affect the functionality at all.

Best regards,
Jan

On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal wrote:

Problem

I created the below UserDefinedFunction to get yesterday's date in the format I want, as I will be passing the format into the method from the query.

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import org.apache.hadoop.hive.ql.exec.UDF;

public final class YesterdayDate extends UDF {

    public String evaluate(final String format) {
        DateFormat dateFormat = new SimpleDateFormat(format);
        Calendar cal = Calendar.getInstance();
        cal.add(Calendar.DATE, -1);
        return dateFormat.format(cal.getTime());
    }
}

So whenever I try to run a query like the one below, after adding the jar to the classpath and creating the temporary function yesterdaydate, I always get zero results back:

hive> create temporary function yesterdaydate as 'com.example.hive.udf.YesterdayDate';
OK
Time taken: 0.512 seconds

Below is the query I am running:

hive> SELECT * FROM REALTIME where dt = yesterdaydate('yyyyMMdd') LIMIT 10;
OK

And I always get zero results back, but the data is there in that table for Aug 5th.

What am I doing wrong? Any suggestions will be appreciated.

NOTE: As I am working with Hive 0.6, it doesn't support variable substitution, so I cannot use hiveconf here, and the above table has been partitioned on the dt (date) column.
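Jan's optimization suggestion (calling 'new' only once and reusing the object) could be sketched as below. This is an illustrative rewrite, not the poster's actual code: since the pattern arrives as an argument, one formatter is cached per pattern string, and the extends-UDF part is commented out so the sketch compiles without the Hive jars.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.HashMap;
import java.util.Map;

// Sketch only: the real UDF would extend org.apache.hadoop.hive.ql.exec.UDF;
// that dependency is left out so the example stands alone.
public final class YesterdayDate /* extends UDF */ {

    // Reuse one SimpleDateFormat per pattern instead of constructing a new
    // one on every evaluate() call, as Jan suggests.
    private final Map<String, SimpleDateFormat> formatters = new HashMap<>();

    public String evaluate(final String format) {
        SimpleDateFormat df =
            formatters.computeIfAbsent(format, SimpleDateFormat::new);
        Calendar cal = Calendar.getInstance();
        cal.add(Calendar.DATE, -1);   // move back one day: yesterday
        return df.format(cal.getTime());
    }
}
```

(Note that SimpleDateFormat is not thread-safe, so this caching assumes one UDF instance per task, which matches how Hive instantiates UDFs per mapper.)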
Increase max tablename length?
The default seems to be 128. Can it be increased? I haven't found a configuration parameter for that yet.

Keith Wiley
kwi...@keithwiley.com
keithwiley.com
music.keithwiley.com

"Luminous beings are we, not this crude matter." -- Yoda
Re: Converting rows into dynamic columns in Hive
I don't think having dynamic columns is possible in Hive. I've always had Hive output a structure like your query output, then used R to convert it into a dynamic column structure.

On Wed, Aug 8, 2012 at 3:56 PM, wrote:
> Thanks Ashish, that gives me an idea.
>
> But I am not sure about the outer select loop; I would have to know all the
> values in the Beta column beforehand to do a max on each value.
>
> Is there a better way?
>
> Richin
>
> From: ext Ashish Thusoo [mailto:athu...@qubole.com]
> Sent: Tuesday, August 07, 2012 5:05 PM
> To: user@hive.apache.org
> Subject: Re: Converting rows into dynamic columns in Hive
>
> You should be able to do this in Hive using a group by on alpha and then
> a combination of max and if, something along the following lines:
>
> select alpha, max(abc), max(pqr), ...
> from (
>   select alpha,
>          if (beta == 'abc', Gamma, NULL) as abc,
>          if (beta == 'pqr', Gamma, NULL) as pqr,
>          ...
>   from table
> ) t
> group by alpha
>
> something on those lines...
>
> Ashish
>
> On Tue, Aug 7, 2012 at 1:57 PM, wrote:
>> Hi All,
>>
>> One of my query outputs looks like this:
>>
>> Alpha  Beta  Gamma
>> 123    xyz   1.0
>> 123    abc   0.5
>> 123    pqr   1.3
>> 123    def   2.1
>> 456    xyz   0.1
>> 456    abc   0.6
>> 456    pqr   1.9
>> 456    def   3.2
>> 456    lmn   1.1
>> 456    sdf   1.2
>>
>> For data visualization purposes I want the output to look like this
>> (basically taking rows from one table and making them columns in
>> another table):
>>
>> Alpha  xyz  abc  pqr  def  lmn  sdf
>> 123    1.0  0.5  1.3  2.1
>> 456    0.1  0.6  1.9  3.2  1.1  1.2
>>
>> Question: can it be done in Hive? If not, any suggestions?
>>
>> Thanks,
>> Richin
RE: Converting rows into dynamic columns in Hive
Thanks Ashish, that gives me an idea.

But I am not sure about the outer select loop; I would have to know all the values in the Beta column beforehand to do a max on each value.

Is there a better way?

Richin

From: ext Ashish Thusoo [mailto:athu...@qubole.com]
Sent: Tuesday, August 07, 2012 5:05 PM
To: user@hive.apache.org
Subject: Re: Converting rows into dynamic columns in Hive

You should be able to do this in Hive using a group by on alpha and then a combination of max and if, something along the following lines:

select alpha, max(abc), max(pqr), ...
from (
  select alpha,
         if (beta == 'abc', Gamma, NULL) as abc,
         if (beta == 'pqr', Gamma, NULL) as pqr,
         ...
  from table
) t
group by alpha

something on those lines...

Ashish

On Tue, Aug 7, 2012 at 1:57 PM, <richin.j...@nokia.com> wrote:

Hi All,

One of my query outputs looks like this:

Alpha  Beta  Gamma
123    xyz   1.0
123    abc   0.5
123    pqr   1.3
123    def   2.1
456    xyz   0.1
456    abc   0.6
456    pqr   1.9
456    def   3.2
456    lmn   1.1
456    sdf   1.2

For data visualization purposes I want the output to look like this (basically taking rows from one table and making them columns in another table):

Alpha  xyz  abc  pqr  def  lmn  sdf
123    1.0  0.5  1.3  2.1
456    0.1  0.6  1.9  3.2  1.1  1.2

Question: can it be done in Hive? If not, any suggestions?

Thanks,
Richin
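For intuition, the max/if pivot that Ashish sketches can be mirrored outside Hive. The standalone Java program below (an illustration with assumed names, not Hive code) groups rows by alpha and keeps each gamma under its beta key, which is exactly what max(if(beta == 'abc', Gamma, NULL)) computes per generated column:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class Pivot {
    // One input row (alpha, beta, gamma), mirroring the example table.
    record Row(String alpha, String beta, double gamma) {}

    // Group rows by alpha; within each group keep the max gamma per beta
    // key. Missing beta keys simply stay absent, like the NULL cells in
    // the desired output.
    public static Map<String, Map<String, Double>> pivot(List<Row> rows) {
        Map<String, Map<String, Double>> out = new TreeMap<>();
        for (Row r : rows) {
            out.computeIfAbsent(r.alpha(), k -> new TreeMap<>())
               .merge(r.beta(), r.gamma(), Math::max);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("123", "xyz", 1.0), new Row("123", "abc", 0.5),
            new Row("456", "xyz", 0.1), new Row("456", "lmn", 1.1));
        System.out.println(pivot(rows));
    }
}
```

The one thing the sketch makes obvious is Richin's concern: the set of output columns (the inner map's keys) is only known after scanning the data, which is why the Hive query has to enumerate the beta values up front.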
Status of HIVE-1434
Hi,

I'm trying to integrate Hive into our Cassandra-Hadoop stack, but I'm finding it difficult to understand the right approach. As far as I can see, there are two open JIRA issues (one in Hive, one in Cassandra):

https://issues.apache.org/jira/browse/CASSANDRA-4131
https://issues.apache.org/jira/browse/HIVE-1434

but neither of them is committed. Is patching either Hive or Cassandra the only way to integrate them? Is there any plan to commit HIVE-1434 to the Hive trunk?

Thanks,
--
Filippo Diotalevi
Question about querying JSON data
I have a large amount of JSON data that was generated using periods in the key names, e.g., {"category.field": "value"}. I know that's not the best way to do JSON, but for better or worse, it's the data I have to deal with.

I have tried using get_json_object, but I am concerned that its JSON path expressions interpret "." as a special character. I am also concerned about the overhead of repeatedly parsing each record (each record is about 2K, so not tiny, but not huge either).

I have tried using Hive-JSON-Serde, but it seems to require that my columns be named the same as my JSON fields. I had heard that there is a SerDe somewhere that allows me to specify a JSON path to map to each specific field name, but other than vague references on the mailing list, I haven't found any concrete info about it.

I would prefer to use existing code, but I can write my own SerDe if I have to. What do you recommend?

Thanks in advance!
--Tom
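Tom's worry about "." being special can be made concrete with a toy example. In the sketch below (plain Java; a Map stands in for a parsed JSON record, and the helper name is made up), a naive path lookup splits on ".", the way a JSON-path-style accessor would, so a literal dotted key is never found, while a direct key lookup succeeds:

```java
import java.util.Map;

public class DottedKeys {
    // Naive JSON-path-style lookup: split the path on '.' and descend one
    // nesting level per segment. "category.field" is read as two levels.
    static Object lookupByPath(Map<String, Object> json, String path) {
        Object cur = json;
        for (String part : path.split("\\.")) {
            if (!(cur instanceof Map<?, ?> m)) return null;
            cur = m.get(part);
        }
        return cur;
    }

    public static void main(String[] args) {
        // A record whose key literally contains a period, as in Tom's data.
        Map<String, Object> flat = Map.of("category.field", "value");
        // The dot in the key name collides with the path separator:
        System.out.println(lookupByPath(flat, "category.field"));

        // The same path works when the data really is nested:
        Map<String, Object> nested = Map.of("category", Map.of("field", "value"));
        System.out.println(lookupByPath(nested, "category.field"));
    }
}
```

Any SerDe or path expression that wants to address such keys needs some way to quote or escape the dot, or to treat the whole key as opaque.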
Re: Hive and joins
Thanks Bejoy. What if those values a, b are not static and are housed in a table, much like a type 2 dimension? Is the current solution still to put it in the WHERE clause?

Thanks,
Ranjith

From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Wednesday, August 08, 2012 06:55 AM
To: user@hive.apache.org
Subject: Re: Hive and joins

Hi Ranjith

BETWEEN a AND b can be implemented as >= a, <= b. Since that is not an equality condition you cannot use it in the ON clause; you need to move it to the WHERE condition in your query.

Regards,
Bejoy KS

From: "Raghunath, Ranjith"
To: "'user@hive.apache.org'"
Sent: Wednesday, August 8, 2012 8:39 AM
Subject: Hive and joins

Since Hive (0.7.1) only supports equi-joins, how are people using it for joins employing the BETWEEN clause?

Thanks,
Ranjith
Re: Hive and joins
Hi Ranjith

BETWEEN a AND b can be implemented as >= a, <= b. Since that is not an equality condition you cannot use it in the ON clause; you need to move it to the WHERE condition in your query.

Regards,
Bejoy KS

From: "Raghunath, Ranjith"
To: "'user@hive.apache.org'"
Sent: Wednesday, August 8, 2012 8:39 AM
Subject: Hive and joins

Since Hive (0.7.1) only supports equi-joins, how are people using it for joins employing the BETWEEN clause?

Thanks,
Ranjith
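Bejoy's rewrite can be pictured as a two-stage evaluation: an equality join followed by a range filter. This small Java sketch (illustrative types and names, not Hive internals) mimics what moving BETWEEN from the ON clause to the WHERE clause amounts to:

```java
import java.util.ArrayList;
import java.util.List;

public class BetweenJoin {
    record Fact(String key, int value) {}
    record Range(String key, int lo, int hi) {}

    // Equi-join on key (the only condition Hive 0.7.1 accepts in ON),
    // then apply the range check afterwards, as a WHERE filter would.
    public static List<Fact> joinThenFilter(List<Fact> facts, List<Range> ranges) {
        List<Fact> out = new ArrayList<>();
        for (Fact f : facts)
            for (Range r : ranges)
                if (f.key().equals(r.key())        // ON  f.key = r.key
                        && f.value() >= r.lo()
                        && f.value() <= r.hi())    // WHERE f.value BETWEEN r.lo AND r.hi
                    out.add(f);
        return out;
    }

    public static void main(String[] args) {
        List<Fact> facts = List.of(new Fact("a", 5), new Fact("a", 50), new Fact("b", 5));
        List<Range> ranges = List.of(new Range("a", 1, 10));
        System.out.println(joinThenFilter(facts, ranges)); // only ("a", 5) survives
    }
}
```

For an inner join the two forms give the same rows; the distinction matters for outer joins, where a condition in WHERE filters after NULL-padding rather than before.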
Re: Custom UserDefinedFunction in Hive
Hi Raihan

UDFs are evaluated at run time when the query is executed, but it is the Hive parser, during query parse time, that decides the boundary of the data to be used for the query, i.e. which partitions have to be processed. Because of this the entire table will be scanned for your query.

Regards,
Bejoy KS

From: Raihan Jamal
To: user@hive.apache.org
Cc: d...@hive.apache.org
Sent: Tuesday, August 7, 2012 10:50 PM
Subject: Re: Custom UserDefinedFunction in Hive

Hi Jan,

I figured that out; it is working fine for me now. The only question I have is: if I do this

SELECT * FROM REALTIME where dt = yesterdaydate('yyyyMMdd') LIMIT 10;

then the above query will be evaluated as below, right?

SELECT * FROM REALTIME where dt = '20120806' LIMIT 10;

So that means it will look for data in the corresponding dt partition (20120806) only, right, as the above table is partitioned on the dt column? It will not scan the whole table, right?

Raihan Jamal

On Mon, Aug 6, 2012 at 10:56 PM, Jan Dolinár wrote:

> Hi Jamal,
>
> Check if the function really returns what it should and that your data
> really are in yyyyMMdd format. You can do this with a simple query like:
>
> SELECT dt, yesterdaydate('yyyyMMdd') FROM REALTIME LIMIT 1;
>
> I don't see anything wrong with the function itself; it works well for me
> (although I tested it in Hive 0.7.1). The only thing I would change would
> be to optimize it by calling 'new' only at construction time and reusing
> the object when the function is called, but that should not affect the
> functionality at all.
>
> Best regards,
> Jan
>
> On Tue, Aug 7, 2012 at 3:39 AM, Raihan Jamal wrote:
>
>> Problem
>>
>> I created the below UserDefinedFunction to get yesterday's date in the
>> format I want, as I will be passing the format into the method from the
>> query.
>> public final class YesterdayDate extends UDF {
>>
>>     public String evaluate(final String format) {
>>         DateFormat dateFormat = new SimpleDateFormat(format);
>>         Calendar cal = Calendar.getInstance();
>>         cal.add(Calendar.DATE, -1);
>>         return dateFormat.format(cal.getTime());
>>     }
>> }
>>
>> So whenever I try to run a query like the one below, after adding the jar
>> to the classpath and creating the temporary function yesterdaydate, I
>> always get zero results back:
>>
>> hive> create temporary function yesterdaydate as 'com.example.hive.udf.YesterdayDate';
>> OK
>> Time taken: 0.512 seconds
>>
>> Below is the query I am running:
>>
>> hive> SELECT * FROM REALTIME where dt = yesterdaydate('yyyyMMdd') LIMIT 10;
>> OK
>>
>> And I always get zero results back, but the data is there in that table
>> for Aug 5th.
>>
>> What am I doing wrong? Any suggestions will be appreciated.
>>
>> NOTE: As I am working with Hive 0.6, it doesn't support variable
>> substitution, so I cannot use hiveconf here, and the above table has been
>> partitioned on the dt (date) column.
RE: Special character replaced by '?'
Any comments or information on the issue below would be highly appreciated. :)

From: Balaraman, Anand [mailto:anand_balara...@syntelinc.com]
Sent: 06 August 2012 18:25
To: user@hive.apache.org
Subject: Special character replaced by '?'

Hi,

I am facing an issue while viewing special characters (such as é) using Hive. If I view the file in HDFS (using the hadoop fs -cat command), the character is displayed correctly as 'é', but when I select the data using Hive, this character alone gets replaced by a question mark. Do we have any solution for handling this?

Regards,
Anand B