Re: Mysql - Hive Sync

2014-09-03 Thread Muthu Pandi
This can't be done, since INSERT, UPDATE and DELETE are not supported in Hive.

The MySQL Applier for Hadoop package serves the same purpose as the prototype
tool I intended to develop.

Link for the MySQL Applier for Hadoop:
http://dev.mysql.com/tech-resources/articles/mysql-hadoop-applier.html
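
Since Hive cannot apply row-level updates, one common pattern with the Applier is
to land the change stream as append-only rows and reconstruct the current state
of a table through a view. A rough sketch of that pattern (the table and column
names below are assumptions for illustration, not anything from this thread;
ROW_NUMBER needs Hive 0.11 or later):

CREATE VIEW orders_current AS
SELECT order_id, amount, vendor
FROM (
    SELECT order_id, amount, vendor, op,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY change_ts DESC) AS rn
    FROM mysql_orders_log   -- append-only change log written by the applier
) latest
WHERE rn = 1 AND op <> 'DELETE';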



*Regards Muthupandi.K*




On Wed, Sep 3, 2014 at 10:35 AM, Muthu Pandi muthu1...@gmail.com wrote:

 Yeah, but we can't make it work as near real time. Also, my table doesn't
 have a column like 'ID' to use for --check-column, which is why I opted out of Sqoop.



 *Regards Muthupandi.K*




 On Wed, Sep 3, 2014 at 10:28 AM, Nitin Pawar nitinpawar...@gmail.com
 wrote:

 have you looked at sqoop?


 On Wed, Sep 3, 2014 at 10:15 AM, Muthu Pandi muthu1...@gmail.com wrote:

 Dear All

  I am developing a prototype for syncing tables from MySQL to Hive
 using Python and JDBC. Is JDBC a good choice for this purpose?

 My use case is generating sales reports with Hive, on data
 pulled from MySQL by the prototype tool. The data volume will be around 2 GB/day.
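
 For context, a sketch of how such a daily load is often staged on the Hive side;
 the table name, columns and paths here are assumptions for illustration, not
 something from this thread:

 CREATE EXTERNAL TABLE sales_staging (
   txid   BIGINT,
   uid    STRING,
   amt    DOUBLE,
   vendor STRING
 )
 PARTITIONED BY (load_date STRING)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE
 LOCATION '/data/staging/sales';

 -- Register a new day's extract after the sync tool has written the files.
 ALTER TABLE sales_staging ADD PARTITION (load_date = '2014-09-03')
 LOCATION '/data/staging/sales/load_date=2014-09-03';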



 *Regards Muthupandi.K*





 --
 Nitin Pawar





Error Configuring Object In Hive Joins

2014-09-03 Thread Sreenath
Hi all,

I have two Hive tables pointing to the same location (folder). When I join
these tables, Hive fails in map-reduce with the error:

java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 9 more


Any idea why this would happen?
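
For reference, a minimal setup along the lines described above, with made-up table
names and path, that points two tables at the same folder and then joins them:

CREATE EXTERNAL TABLE t_a (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/shared/data';

CREATE EXTERNAL TABLE t_b (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/shared/data';

-- Joining the two tables over the shared location is what triggers the
-- map-side failure shown above.
SELECT a.id, a.val, b.val
FROM t_a a JOIN t_b b ON a.id = b.id;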


-- 
Sreenath S Kamath
Bangalore
Ph No:+91-9590989106


Re: MatchPath UDF usage info ?

2014-09-03 Thread Muhammad Asif Abbasi
Hi,

Many thanks for sending these links. Looking forward to more documentation
around this.

BTW, why does hive-exec-0.13.0.2.1.1.0-385.jar not have any class files
for the MatchPath UDF? Have they been moved out to a separate JAR file?
I can see that hive-exec-0.13.0.jar has the appropriate class files, and I
have tried to use them. They work well with the demo data set, but we
certainly need more documentation around this.

Regards,
Asif Abbasi




On Tue, Aug 26, 2014 at 6:42 AM, Lefty Leverenz leftylever...@gmail.com
wrote:

 Thanks for pointing out that we still need documentation for this in the
 wiki.  (I've added a doc comment to HIVE-5087
 https://issues.apache.org/jira/browse/HIVE-5087.)  In the meantime,
 googling Hive npath turned up these sources of information:

   - https://github.com/hbutani/SQLWindowing/wiki
   - http://www.slideshare.net/Hadoop_Summit/analytical-queries-with-hive (slides 20-21)
   - http://www.justinjworkman.com/big-data/using-npath-with-apache-hive/


 -- Lefty


 On Mon, Aug 25, 2014 at 8:27 AM, Muhammad Asif Abbasi 
 asif.abb...@gmail.com wrote:

 Hi All,

 I am trying to use MatchPath UDF (Previously called NPath). Does anybody
 have a document around its syntax and usage?

 Regards,
 Asif





Re: Reading and Writing with Hive 0.13 from a Yarn application

2014-09-03 Thread Ashutosh Chauhan
Hi Nathan,

This was done in https://issues.apache.org/jira/browse/HIVE-6248. The reasoning
was to minimize the API surface area exposed to users, so that they are immune
to incompatible changes in internal classes, which makes the API easier for them
to consume without worrying about version upgrades. It seems that in
the process some of the functionality went away.
Which info are you looking for exactly? Is it the String[] getBlockLocations()
equivalent of InputSplit? If so, we can consider adding that to
ReaderContext, since that one need not expose any Hadoop or Hive
classes.

Thanks,
Ashutosh


On Tue, Sep 2, 2014 at 5:26 PM, Nathan Bamford nathan.bamf...@redpoint.net
wrote:

  Hi,

   My company has been working on a Yarn application for a couple of
 years-- we essentially take the place of MapReduce and split our data and
 processing ourselves.

   One of the things we've been working to support is Hive access, and the
 HCatalog interfaces and API seemed perfect. Using this information:
 https://hive.apache.org/javadocs/hcat-r0.5.0/readerwriter.html and
 TestReaderWriter.java from the source code, I was able to create and use
 HCatSplits to allow balanced data local parallel reading (using the size
 and locations methods available from each HCatSplit).

   Much to my dismay, 0.13 removes a lot of that functionality. The
 ReaderContext class is now an interface that only exposes numSplits,
 whereas all of the other methods are in the inaccessible (package
 only) ReaderContextImpl class.

   Since I no longer have access to the actual HCatSplits from the
 ReaderContext, I am unable to process them and send them to our yarn app on
 the data local nodes.  My only choice seems to be to partition out the
 splits to slave nodes more or less at random.

   Does anyone know if, as of 0.13, this is the intended way to interface
 with Hive via non-Hadoop yarn applications? Is the underlying HCatSplit
 only intended for internal use, now?


  Thanks,


  Nathan Bamford



Re: MatchPath UDF usage info ?

2014-09-03 Thread Furcy Pin
Hi Muhammad,

From what I googled a few months ago on the subject, the MatchPath UDF has
been removed from the Cloudera and Hortonworks releases because Teradata
claims it violates one of their patents (apparently renaming it did not
suffice).

I guess that if you really need it, it might be possible to add it yourself
as an external UDF, since the code is still available out there, but I have
no idea whether Teradata would have the right to come after you (or not?) if you do.

By the way, if anyone has news on the current situation with MatchPath and
Teradata, that would be welcome.

Furcy




2014-09-03 17:18 GMT+02:00 Muhammad Asif Abbasi asif.abb...@gmail.com:

 Hi,

 Many thanks for sending these links. Looking forward to more documentation
 around this.

 BTW, why does  hive-exec-0.13.0.2.1.1.0-385.jar not have any class
 files for MatchPath UDF ? Have they been chucked out to a separate JAR
 file?
 I can see that  hive-exec-0.13.0.jar has the appropriate class files,
 and have tried to use them. They work well with the demo data set but we
 certainly need more documentation around this.

 Regards,
 Asif Abbasi




 On Tue, Aug 26, 2014 at 6:42 AM, Lefty Leverenz leftylever...@gmail.com
 wrote:

 Thanks for pointing out that we still need documentation for this in the
 wiki.  (I've added a doc comment to HIVE-5087
 https://issues.apache.org/jira/browse/HIVE-5087.)  In the meantime,
 googling Hive npath turned up these sources of information:

   - https://github.com/hbutani/SQLWindowing/wiki
   - http://www.slideshare.net/Hadoop_Summit/analytical-queries-with-hive (slides 20-21)
   - http://www.justinjworkman.com/big-data/using-npath-with-apache-hive/


 -- Lefty


 On Mon, Aug 25, 2014 at 8:27 AM, Muhammad Asif Abbasi 
 asif.abb...@gmail.com wrote:

 Hi All,

 I am trying to use MatchPath UDF (Previously called NPath). Does anybody
 have a document around its syntax and usage?

 Regards,
 Asif






Re: MatchPath UDF usage info ?

2014-09-03 Thread Muhammad Asif Abbasi
Hi Furcy,

Many thanks for your email :)

My latest info was that the rename took place due to objections by
Teradata, but I didn't know whether they had actually requested to take it off the
distribution entirely.

Does anybody else have an idea on the licensing aspect of this? What
exactly has Teradata patented? Is it the technique of parsing the rows in
such a manner? Any tips/techniques would be highly appreciated.

Regards,
Asif Abbasi




On Wed, Sep 3, 2014 at 5:30 PM, Furcy Pin furcy@flaminem.com wrote:

 Hi Muhammad,

 From what I've googled a few months ago on the subject, MatchPath UDF has
 been removed from Cloudera and Hortonworks releases because TeraData
 claims it violates one of their patent (apparently renaming it did not
 suffice).

 I guess that if you really need it, it might be possible to add it
 yourself as an external UDF since the code is still available out there,
 but I have no idea
 whether TeraData would have the right to come after you (or not?) if you
 do.

 By the way, if anyone has news on the current situation with MatchPath and
 TerraData, that would be welcome.

 Furcy




 2014-09-03 17:18 GMT+02:00 Muhammad Asif Abbasi asif.abb...@gmail.com:

 Hi,

 Many thanks for sending these links. Looking forward to more
 documentation around this.

 BTW, why does  hive-exec-0.13.0.2.1.1.0-385.jar not have any class
 files for MatchPath UDF ? Have they been chucked out to a separate JAR
 file?
 I can see that  hive-exec-0.13.0.jar has the appropriate class files,
 and have tried to use them. They work well with the demo data set but we
 certainly need more documentation around this.

 Regards,
 Asif Abbasi




 On Tue, Aug 26, 2014 at 6:42 AM, Lefty Leverenz leftylever...@gmail.com
 wrote:

 Thanks for pointing out that we still need documentation for this in the
 wiki.  (I've added a doc comment to HIVE-5087
 https://issues.apache.org/jira/browse/HIVE-5087.)  In the meantime,
 googling Hive npath turned up these sources of information:

   - https://github.com/hbutani/SQLWindowing/wiki
   - http://www.slideshare.net/Hadoop_Summit/analytical-queries-with-hive (slides 20-21)
   - http://www.justinjworkman.com/big-data/using-npath-with-apache-hive/


 -- Lefty


 On Mon, Aug 25, 2014 at 8:27 AM, Muhammad Asif Abbasi 
 asif.abb...@gmail.com wrote:

 Hi All,

 I am trying to use MatchPath UDF (Previously called NPath). Does
 anybody have a document around its syntax and usage?

 Regards,
 Asif







Hive hwi setup windows 8 using cygwin

2014-09-03 Thread Prashant Sharma

Dear All,

	Kindly help with the HWI setup. I am using Cygwin on Windows 8, JDK 1.7, and Ant
1.9. I have successfully built Hive and am able to start the HWI war using the
command hive --service hwi. It shows Started
SocketConnector@0.0.0.0: in the console.


The problem I am facing is that when I open the browser and try to access the HWI
url (localhost:/hwi), I get the error below in the console, and the webpage also
displays the stack trace of the same error.


Error :-

2014-09-03 22:48:36,475 ERROR [1231565732@qtp-362433913-0] compiler.Compiler (AntCompiler.java:generateClass(225)) - Javac exception

Unable to find a javac compiler;
com.sun.tools.javac.Main is not on the classpath.
Perhaps JAVA_HOME does not point to the JDK.
It is currently set to C:\Program Files\Java\jdk1.7.0_51\jre
    at org.apache.tools.ant.taskdefs.compilers.CompilerAdapterFactory.getCompiler(CompilerAdapterFactory.java:129)



I have tried all the suggested PATH-related fixes for Ant, yet the problem is
still there. I have also checked Ant; it works fine in both Cygwin and Windows,
and I am able to run all the lifecycle tasks from Ant.


Please help.

Thanks
Prashant


RE: Reading and Writing with Hive 0.13 from a Yarn application

2014-09-03 Thread Nathan Bamford
Hi Ashutosh,

  Thanks for the reply!

  Well, we are a yarn app that is essentially doing the same things mapreduce 
does. For regular files in Hadoop, we get the block locations and sizes and 
perform some internal sorting and load balancing on the master which then 
creates the slave yarn apps on individual nodes for reading. We strive for data 
locality, as much as possible.

  To interface with Hive, the HCatalog api seemed like the appropriate 
interface.  It does a lot of things we want via the ReadEntity, allowing us to 
query and read the Hive tables at a high level.

  I used the readerwriter example (from Hive 0.12) to get things running, but I 
was using HCatSplit just like our internal split classes. I retrieved them from 
the ReaderContext and ran them through the same sorting algorithms, then 
serialized them and sent them to the individual yarn apps, etc.

  I understand the rationale for the smaller api, which is why I wondered if 
there's another avenue I should be pursuing as a yarn app (metadata vs. 
HCatalog, for instance).

  All that being said :), the ability to get the block locations (and sizes, if 
possible) would certainly solve my problems.


Thanks,


Nathan




From: Ashutosh Chauhan hashut...@apache.org
Sent: Wednesday, September 3, 2014 9:16 AM
To: user@hive.apache.org
Subject: Re: Reading and Writing with Hive 0.13 from a Yarn application

Hi Nathan,

This was done in https://issues.apache.org/jira/browse/HIVE-6248 Reasoning was 
to minimize api surface area to users so that they are immune of incompatible 
changes in internal classes and thus making it easier for them to consume this 
and not get worried about version upgrade. Seems like in the process some of 
the functionality went away.
Which info you are looking for exactly? Is it String[] getBlockLocations() 
equivalent of InputSplit? If so, we can consider adding that in ReaderContext() 
since that one need not to expose any hadoop or hive classes.

Thanks,
Ashutosh


On Tue, Sep 2, 2014 at 5:26 PM, Nathan Bamford
nathan.bamf...@redpoint.net wrote:

Hi,

  My company has been working on a Yarn application for a couple of years-- we 
essentially take the place of MapReduce and split our data and processing 
ourselves.

  One of the things we've been working to support is Hive access, and the 
HCatalog interfaces and API seemed perfect. Using this information: 
https://hive.apache.org/javadocs/hcat-r0.5.0/readerwriter.html and
TestReaderWriter.java from the source code, I was able to create and use 
HCatSplits to allow balanced data local parallel reading (using the size and 
locations methods available from each HCatSplit).

  Much to my dismay, 0.13 removes a lot of that functionality. The 
ReaderContext class is now an interface that only exposes numSplits, whereas 
all of the other methods are in the inaccessible (package only) 
ReaderContextImpl class.

  Since I no longer have access to the actual HCatSplits from the 
ReaderContext, I am unable to process them and send them to our yarn app on the 
data local nodes.  My only choice seems to be to partition out the splits to 
slave nodes more or less at random.

  Does anyone know if, as of 0.13, this is the intended way to interface with 
Hive via non-Hadoop yarn applications? Is the underlying HCatSplit only 
intended for internal use, now?


Thanks,


Nathan Bamford



Re: Reading and Writing with Hive 0.13 from a Yarn application

2014-09-03 Thread Ashutosh Chauhan
This API is designed precisely for use cases like yours. So I would say the API is
failing if it cannot service what you are trying to do with it. I would
encourage you to use this API and to consider the current shortcoming a missing
feature.
Feel free to file a JIRA requesting the addition of these methods to
ReaderContext. Patches are welcome too :)

Hope it helps,
Ashutosh


On Wed, Sep 3, 2014 at 11:12 AM, Nathan Bamford nathan.bamf...@redpoint.net
 wrote:

  Hi Ashutosh,

   Thanks for the reply!

   Well, we are a yarn app that is essentially doing the same things
 mapreduce does. For regular files in Hadoop, we get the block locations and
 sizes and perform some internal sorting and load balancing on the master
 which then creates the slave yarn apps on individual nodes for reading. We
 strive for data locality, as much as possible.

   To interface with Hive, the HCatalog api seemed like the appropriate
 interface.  It does a lot of things we want via the ReadEntity, allowing us
 to query and read the Hive tables at a high level.

   I used the readerwriter example (from Hive 0.12) to get things running,
 but I was using HCatSplit just like our internal split classes. I retrieved
 them from the ReaderContext and ran them through the same sorting
 algorithms, then serialized them and sent them to the individual yarn apps,
 etc.

   I understand the rationale for the smaller api, which is why I wondered
 if there's another avenue I should be pursuing as a yarn app (metadata vs.
 HCatalog, for instance).

   All that being said :), the ability to get the block locations (and
 sizes, if possible) would certainly solve my problems.


  Thanks,


  Nathan


  --
 *From:* Ashutosh Chauhan hashut...@apache.org
 *Sent:* Wednesday, September 3, 2014 9:16 AM
 *To:* user@hive.apache.org
 *Subject:* Re: Reading and Writing with Hive 0.13 from a Yarn application

   Hi Nathan,

  This was done in https://issues.apache.org/jira/browse/HIVE-6248
 Reasoning was to minimize api surface area to users so that they are immune
 of incompatible changes in internal classes and thus making it easier for
 them to consume this and not get worried about version upgrade. Seems like
 in the process some of the functionality went away.
 Which info you are looking for exactly? Is it String[] getBlockLocations()
 equivalent of InputSplit? If so, we can consider adding that in
 ReaderContext() since that one need not to expose any hadoop or hive
 classes.

  Thanks,
 Ashutosh


 On Tue, Sep 2, 2014 at 5:26 PM, Nathan Bamford 
 nathan.bamf...@redpoint.net wrote:

  Hi,

   My company has been working on a Yarn application for a couple of
 years-- we essentially take the place of MapReduce and split our data and
 processing ourselves.

   One of the things we've been working to support is Hive access, and the
 HCatalog interfaces and API seemed perfect. Using this information:
 https://hive.apache.org/javadocs/hcat-r0.5.0/readerwriter.html and
 TestReaderWriter.java from the source code, I was able to create and use
 HCatSplits to allow balanced data local parallel reading (using the size
 and locations methods available from each HCatSplit).

   Much to my dismay, 0.13 removes a lot of that functionality. The
 ReaderContext class is now an interface that only exposes numSplits,
 whereas all of the other methods are in the inaccessible (package
 only) ReaderContextImpl class.

   Since I no longer have access to the actual HCatSplits from the
 ReaderContext, I am unable to process them and send them to our yarn app on
 the data local nodes.  My only choice seems to be to partition out the
 splits to slave nodes more or less at random.

   Does anyone know if, as of 0.13, this is the intended way to interface
 with Hive via non-Hadoop yarn applications? Is the underlying HCatSplit
 only intended for internal use, now?


  Thanks,


  Nathan Bamford





How to use joins and averages both in the same hive query

2014-09-03 Thread Mohit Durgapal
I have two tables in hive:

Table1: uid, txid, amt, vendor
Table2: uid, txid

Now I need to join the tables on txid, which basically confirms that a
transaction was finally recorded. Some transactions will
be present only in Table1 and not in Table2.

I need to find the average transaction match rate per user (uid)
per vendor. Then I need to find the average of these averages by adding up all the
averages and dividing by the number of unique users per vendor.

Let's say I have the data:

Table1:

u1,120,44,vend1
u1,199,33,vend1
u1,100,23,vend1
u1,101,24,vend1
u2,200,34,vend1
u2,202,32,vend2

Table2:

u1,100
u1,101
u2,200
u2,202

Example For vendor vend1:

u1 - Avg transaction find rate = 2 (matches found in both tables, Table1 and
Table2) / 4 (total occurrences in Table1) = 0.5

u2 - Avg transaction find rate = 1/1 = 1

Avg of avgs = (0.5 + 1) (sum of avgs) / 2 (total unique users) = 0.75

Required output:

vend1,0.75
vend2,1

I can't seem to get both the count of matches and the count of total occurrences in
Table1, per user per vendor, in one Hive query. I have gotten as far as this query
and can't figure out how to take it further.

SELECT A.vendor, A.uid, count(*) AS totalmatchesperuser FROM Table1 A JOIN
Table2 B ON A.uid = B.uid AND B.txid = A.txid GROUP BY A.vendor, A.uid

Any help would be great.


Re: Collect_set() of non-primitive types

2014-09-03 Thread Nishant Kelkar
I don't know of anything like what you want, at least up to Hive 0.11.

However, you could try something like this:

INSERT OVERWRITE TABLE rollup
SELECT id, start_time, collect_set(concat_ws(',', objects.name,
objects.value, objects.type)) AS product_details
FROM bar
GROUP BY id, start_time;


It's a bit hacky, but it does the trick. It basically concatenates all the
fields you care about into a string, and then collects a set of those strings. Then,
when you want to extract a field from the product_details array, just do
this (say you want to extract the very first product's name in each array):

SELECT split(a.product_details[0], ',')[0] AS first_name FROM rollup a;


Hope that helps!

Best Regards,
Nishant Kelkar



On Wed, Sep 3, 2014 at 1:47 PM, anusha Mangina anusha.mang...@gmail.com
wrote:

 I have a table defined as:

 CREATE TABLE foo (
   id INT,
   start_time STRING,
   name STRING,
   value STRING,
   type STRING
 )

 The 'name', 'value' and 'type' fields actually describe another object, and
 I'd like to turn these into STRUCTs, something like:

 CREATE TABLE bar (
   id INT,
   start_time STRING,
   object STRUCT<name: STRING,
                 value: STRING,
                 type: STRING>
 )

 However, I'd also like to create a rollup table containing the most recent
 result for a given 'id' field. There can be multiple different 'object'
 structs for a given combination of 'id' and 'start_time', so I thought I'd
 turn this into an array of structs. My ideal rollup table would look like:

 CREATE TABLE rollup (
   id INT,
   start_time STRING,
   objects ARRAY<STRUCT<name: STRING,
                        value: STRING,
                        type: STRING>>
 )

 However, I can't do this because the following query fails:

 INSERT OVERWRITE TABLE rollup
 SELECT id, start_time, collect_set(object)
 FROM bar
 GROUP BY id, start_time

 Here's the error I get:

 FAILED: UDFArgumentTypeException Only primitive type arguments are accepted
 but struct<name:string,value:string,type:string> was passed as parameter 1.

 Is there any way I can do this?




Re: Collect_set() of non-primitive types

2014-09-03 Thread Nishant Kelkar
Geez, too many mistakes for the day :P

I meant the following above

CREATE TABLE rollup_new AS
SELECT id, start_time, collect_set(concat_ws(',', object.name,
object.value, object.type)) AS product_details
FROM bar
GROUP BY id, start_time;

The change is in the table creation query, since an INSERT OVERWRITE would
expect a struct but get a string instead, thus throwing errors, right?
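
If the fields need to come back as a struct when reading, one possible follow-up
is to rebuild one from the concatenated entries. This is only a sketch building on
the workaround above; named_struct and split are standard Hive UDFs, and the
column names follow the rollup_new query:

SELECT id,
       named_struct(
           'name',  split(product_details[0], ',')[0],
           'value', split(product_details[0], ',')[1],
           'type',  split(product_details[0], ',')[2]
       ) AS first_object
FROM rollup_new;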



On Wed, Sep 3, 2014 at 2:06 PM, Nishant Kelkar nishant@gmail.com
wrote:

 Sorry, I meant the following in my example:

 INSERT OVERWRITE TABLE rollup
 SELECT id, start_time, collect_set(concat_ws(',', object.name, object.value, object.type)) AS product_details
 FROM bar
 GROUP BY id, start_time;

 object instead of objects, as you have in your example :)

 Best Regards,
 Nishant Kelkar


 On Wed, Sep 3, 2014 at 2:03 PM, Nishant Kelkar nishant@gmail.com
 wrote:

 I don't know of anything like what you want atleast until Hive 0.11.

 However, you could try something like this:

 INSERT OVERWRITE TABLE rollup
 SELECT id, start_time, collect_set(concat_ws(,, objects.name, 
 objects.value, objects.type)) AS product_details
 FROM bar
 GROUP BY id, start_time;


 It's a bit hacky, but it does the trick. It basically concats all the
 objects you care about in a string, and then collects a set of that. Then,
 when you want to extract a field from the product_details array, just do
 this (say you want to extract the very first product's name in each array):

 SELECT SPLIT(a[0].product_details)[0] AS first_name FROM rollup a;


 Hope that helps!

 Best Regards,
 Nishant Kelkar



 On Wed, Sep 3, 2014 at 1:47 PM, anusha Mangina anusha.mang...@gmail.com
 wrote:

 I have a table defined as:

 CREATE TABLE foo (
   id INT,
   start_time STRING,
   name STRING,
   value STRING,
   type STRING
 )

 The 'name', 'value' and 'type' fields actually describe another object, and
 I'd like to turn these into STRUCTs, something like:

  CREATE TABLE bar (
    id INT,
    start_time STRING,
    object STRUCT<name: STRING,
                  value: STRING,
                  type: STRING>
  )

 However, I'd also like to create a rollup table containing the most recent
 result for a given 'id' field. There can be multiple different 'object'
 structs for a given combination of 'id' and 'start_time', so I thought I'd
 turn this into an array of structs. My ideal rollup table would look like:

  CREATE TABLE rollup (
    id INT,
    start_time STRING,
    objects ARRAY<STRUCT<name: STRING,
                         value: STRING,
                         type: STRING>>
  )

 However, I can't do this because the following query fails:

 INSERT OVERWRITE TABLE rollup
 SELECT id, start_time, collect_set(object)
 FROM bar
 GROUP BY id, start_time

 Here's the error I get:

 FAILED: UDFArgumentTypeException Only primitive type arguments are accepted
  but struct<name:string,value:string,type:string> was passed as parameter 1.

 Is there any way I can do this?






Re: How to use joins and averages both in the same hive query

2014-09-03 Thread 丁桂涛(桂花)
try this:


SELECT
    t.vendor,
    AVG(t.totalmatchesperuser) AS avgmatches
FROM
    (SELECT
        A.vendor,
        A.uid,
        count(*) AS totalmatchesperuser
    FROM
        Table1 A INNER JOIN
        Table2 B
        ON A.uid = B.uid AND B.txid = A.txid
    GROUP BY
        A.vendor,
        A.uid
    ) t
GROUP BY
    t.vendor;


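To also count the Table1 rows that have no match in Table2 (the 2/4 = 0.5 case
from the original question), a LEFT OUTER JOIN variant along these lines should
work; this is a sketch, not something tested against the poster's data:

SELECT
    t.vendor,
    AVG(t.match_rate) AS avg_of_avgs
FROM
    (SELECT
        A.vendor,
        A.uid,
        SUM(CASE WHEN B.txid IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS match_rate
    FROM
        Table1 A LEFT OUTER JOIN Table2 B
        ON A.uid = B.uid AND A.txid = B.txid
    GROUP BY
        A.vendor,
        A.uid
    ) t
GROUP BY
    t.vendor;
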
On Thu, Sep 4, 2014 at 3:38 AM, Mohit Durgapal durgapalmo...@gmail.com
wrote:

 I have two tables in hive:

 Table1: uid,txid,amt,vendor Table2: uid,txid

 Now I need to join the tables on txid which basically confirms a
 transaction is finally recorded. There will be some transactions which will
 be present only in Table1 and not in Table2.

 I need to find out number of avg of transaction matches found per
 user(uid) per vendor. Then I need to find the avg of these averages by
 adding all the averages and divide them by the number of unique users per
 vendor.

 Let's say I have the data:

 Table1:

 u1,120,44,vend1
 u1,199,33,vend1
 u1,100,23,vend1
 u1,101,24,vend1
 u2,200,34,vend1
 u2,202,32,vend2

 Table2:

 u1,100
 u1,101
 u2,200
 u2,202

 Example For vendor vend1:

 u1- Avg transaction find rate = 2(matches found in both Tables,Table1 and
 Table2)/4(total occurrence in Table1) =0.5

 u2 - Avg transaction find rate = 1/1 = 1

 Avg of avgs = 0.5+1(sum of avgs)/2(total unique users) = 0.75

 Required output:

 vend1,0.75
 vend2,1

 I can't seem to find count of both matches and occurrence in just Table1
 in one hive query per user per vendor. I have reached to this query and
 can't find how to change it further.

 SELECT A.vendor,A.uid,count(*) as totalmatchesperuser FROM Table1 A JOIN
 Table2 B ON A.uid = B.uid AND B.txid =A.txid group by vendor,A.uid

 Any help would be great.




-- 
丁桂涛