Handling hierarchical data in Hive

2014-03-25 Thread Saumitra Shahapure (Vizury)
Hello,

We are using Hive to query S3 data. For one of our tables, named analyze, we
generate data hierarchically. The first level of the hierarchy is the date
and the second level is a field named *generated_by*, e.g. for 20 March we
may have S3 directories such as
s3://analyze/20140320/111/
s3://analyze/20140320/222/
s3://analyze/20140320/333/
The size of the files in each folder is typically small.

Until now we have been using static partitioning so that queries on a
specific date and *generated_by* would be faster.
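
As a rough sketch of the kind of DDL this implies (the column name 'event'
is only illustrative, not our actual schema):

CREATE EXTERNAL TABLE analyze (event STRING)
PARTITIONED BY (dt STRING, generated_by STRING)
LOCATION 's3://analyze/';

-- and every new folder needs an explicit registration, e.g.:
ALTER TABLE analyze ADD PARTITION (dt='20140320', generated_by='111')
LOCATION 's3://analyze/20140320/111/';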

The problem now is that the number of *generated_by* folders has grown into
the thousands. Every day we end up adding thousands of partitions to Hive,
so queries on analyze over one month have slowed down.

Is there any way to get rid of the partitions and at the same time maintain
good performance for queries that are fired on a specific day and
*generated_by*?
--
Regards,
Saumitra Shahapure


Re: Handling hierarchical data in Hive

2014-03-25 Thread Nitin Pawar
see if this is what you are looking for https://github.com/sskaje/hive_merge

--
Nitin Pawar


Re: Handling hierarchical data in Hive

2014-03-25 Thread Saumitra Shahapure (Vizury)
Hi Nitin,

We are not facing the small-files problem, since the data is in S3, and we
do not want to merge files: merging, say, one day's files into a single
large analyze table would slow down queries fired on a specific day and
*generated_by*.

Let me explain my problem in other words.
Right now we are over-partitioning our table. Over-partitioning gives us
the benefit that a query on 1-2 partitions is very fast. Its side effect
is that if we try to query a large number of partitions, the query is very
slow. Is there a way to get good performance in both scenarios?

--
Regards,
Saumitra Shahapure




Re: Handling hierarchical data in Hive

2014-03-25 Thread Nitin Pawar
In general, when you have a large number of partitions, your Hive query
performance drops. This has been significantly addressed in recent
releases, but you can still see performance issues. Sadly, I currently do
not have a dataset large enough to need that many partitions.

Last time I checked, this issue was caused by
ObjectStore.getPartitionsByNames. I am not sure whether that is still the
implementation.

When you have a large number of partitions, the actual time spent on query
planning increases; one way to reduce it was to set
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

You can also check the value of datanucleus.connectionPool.maxActive in
your Hive config, to see whether you can increase the number of
connections to your metastore DB.
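
As a minimal sketch (the input format can be set per session; the
connection-pool property is a metastore-side setting in hive-site.xml, and
the value shown is only an example):

SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- in hive-site.xml on the metastore side, something like:
--   datanucleus.connectionPool.maxActive = 50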

We normally used to merge historical data into a single partition column
and then, if required, do a join between the new data set and the old data
sets: kind of a rolling data table and a historical data table.




-- 
Nitin Pawar


RE: Does hive instantiate new udf object for each record

2014-03-25 Thread java8964
The reason you saw that is that when you provide only the no-argument
evaluate() method, you didn't specify the type of column it can be used
with. So Hive will just create the test instance again and again for every
new row, as it doesn't know how, or to which column, to apply your UDF.
I changed your code as below:
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class test extends UDF {
    private Text t;

    // With a typed argument, Hive can bind the UDF to a column and
    // reuse the same instance for every row in the task.
    public Text evaluate(String s) {
        if (t == null) {
            t = new Text("initialization");
        } else {
            t = new Text("OK");
        }
        return t;
    }

    public Text evaluate() {
        if (t == null) {
            t = new Text("initialization");
        } else {
            t = new Text("OK");
        }
        return t;
    }
}
Now, if you invoke your UDF like this:
select test(colA) from AnyTable;
you should see one initialization and the rest OK. Makes sense?
Yong
From: sky880883...@hotmail.com
To: user@hive.apache.org
Subject: RE: Does hive instantiate new udf object for each record
Date: Tue, 25 Mar 2014 10:17:46 +0800




I have implemented a simple udf as a test:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class test extends UDF {
    private Text t;

    public Text evaluate() {
        if (t == null) {
            t = new Text("initialization");
        } else {
            t = new Text("OK");
        }
        return t;
    }
}

And the test query: select test() from AnyTable;
I got
initialization
initialization
initialization
...

I have also implemented a similar GenericUDF, and got a similar result.

What's wrong with my code?

Best Regards,
ypg

From: java8...@hotmail.com
To: user@hive.apache.org
Subject: RE: Does hive instantiate new udf object for each record
Date: Mon, 24 Mar 2014 16:58:49 -0400




Your UDF object will only be initialized once per map or reduce task.
When you say your UDF object is being initialized for each row, why do you
think so? Do you have logs that make you think that way?
If so, please provide more information so we can help you, like your
example code, logs, etc.
Yong

Date: Tue, 25 Mar 2014 00:30:21 +0800
From: sky880883...@hotmail.com
To: user@hive.apache.org
Subject: Does hive instantiate new udf object for each record


Hi all,
I'm trying to implement a udf which makes use of some data structures like
a binary tree. However, it seems that hive instantiates a new udf object
for each row in the table. Then the data structures would also be
initialized again and again for each row. Whereas, in the book Programming
Hive, a geoip function is taken as an example showing that a LookupService
object is saved in a reference, so it only needs to be initialized once in
the lifetime of the map or reduce task that initializes it. The code for
this function can be found here
(https://github.com/edwardcapriolo/hive-geoip/).
Could anyone give me some ideas on how to make the udf object initialize
once in the lifetime of a map or reduce task?

Best Regards,
ypg



  

Re: Handling hierarchical data in Hive

2014-03-25 Thread Prasan Samtani
Hi Saumitra,

You might want to look into clustering within the partition. That is,
partition by day, but cluster by generated_by (within those partitions),
and see if that improves performance. Refer to the CLUSTER BY command in
the Hive Language Manual.
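
A minimal sketch of the idea, with illustrative column names:

CREATE TABLE analyze_clustered (generated_by INT, payload STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (generated_by) INTO 100 BUCKETS;

-- a day/generated_by query then touches a single date partition:
SELECT * FROM analyze_clustered
WHERE dt = '20140320' AND generated_by = 111;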

-Prasan



Re: Permission denied creating external table

2014-03-25 Thread Abdelrahman Shettia
Hi Oliver,

Try to set these properties in core-site.xml (using * as the value lets
hive impersonate members of any group); the cluster needs to be restarted
afterwards.

<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>users</value>
  <description>Allow the superuser hive to impersonate any members of
the group users. Required only when installing Hive.
  </description>
</property>

where *$HIVE_USER* is the user owning the Hive services; for example, hive.

<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>$Hive_Hostname_FQDN</value>
  <description>Hostname from where the superuser hive can connect.
Required only when installing Hive.
  </description>
</property>



Also, enable the following configuration in hive-site.xml:

hive.metastore.execute.setugi

In addition, please use the directory path only (not a file path) when
creating the table, and it would be better to have 'hadoop' as the
supergroup.
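
For example, taking the CREATE statement from below but pointing LOCATION
at the directory rather than the file, so Hive does not try to create
anything under it:

CREATE EXTERNAL TABLE mylogs (line STRING) STORED AS SEQUENCEFILE
LOCATION 'hdfs:///logs/2014/03-24/';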

Hope this helps.

Thanks



On Mon, Mar 24, 2014 at 9:59 AM, Oliver ohook...@gmail.com wrote:

 Hi Rahman,

 On 24 March 2014 16:45, Abdelrahman Shettia ashet...@hortonworks.com wrote:

 Hi Oliver,

 Can you perform a simple test of hadoop fs -cat
 hdfs:///logs/2014/03-24/actual_log_file_name.seq as the same user? Also,
 what are the configuration settings for the following?


 Yes, I can access that file with the same user using hadoop fs -cat as
 well as other tools (I've been using Pig up until this point).



 hive.metastore.execute.setugi


 I'm not setting this explicitly anywhere.



 hive.metastore.warehouse.dir


 I have this set in my HiveQL script:
 SET hive.metastore.warehouse.dir=/user/oliver/warehouse;

 This directory already exists, since I created it before running the
 script.



 hive.metastore.uris


 Not explicitly set anywhere to my knowledge.



  Thanks,
 Rahman

 On Mar 24, 2014, at 8:17 AM, Oliver ohook...@gmail.com wrote:

 Hi,

 I have a bunch of data already in place in a directory on HDFS containing
 many different logs of different types, so I'm attempting to load these
 externally like so:

 CREATE EXTERNAL TABLE mylogs (line STRING) STORED AS SEQUENCEFILE
 LOCATION 'hdfs:///logs/2014/03-24/actual_log_file_name.seq';

 However I get this error back when doing so:

 FAILED: Execution Error, return code 1 from
 org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got
 exception: org.apache.hadoop.security.AccessControlException Permission
 denied: user=oliver, access=WRITE,
 inode=/logs/2014/03-24:logs:supergroup:drwxr-xr-x
  at
 org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:224)
 at
 org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:204)
  at
 org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:149)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:4716)
  at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:4698)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:4672)
  at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3035)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:2999)
  at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:2980)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:648)
  at
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:419)
 at
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44970)
  at
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1701)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1697)
  at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Unknown Source)
  at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1695)
 )

 This directory is intentionally read-only for regular users, who only
 want to read the logs and analyse them. Am I missing some Hive
 configuration that will tell it to store its metadata elsewhere? I
 already have hive.metastore.warehouse.dir set to another location where I
 have write permission.

 Best Regards,
 Oliver




Re: Handling hierarchical data in Hive

2014-03-25 Thread Saumitra Shahapure (Vizury)
Hi Nitin/Prasan,

Thanks for your replies, I appreciate your help :)

Clustering looks to be quite close to what we want. However, one main gap
is that we would need to fire a Hive query to populate the clusters. In
our case the clustered data is already there, so the computation in the
Hive query would be redundant. If

CREATE TABLE analyze (generated_by INT, other_representative_field INT)
PARTITIONED BY (dt STRING)
CLUSTERED BY (generated_by) INTO 100 BUCKETS;

just accepted the S3 directory hierarchy that we have (as explained in the
first mail), our problem would be solved.

Another interesting solution seems to be creating a partition on the dt
field and a Hive index or view on the *generated_by* field.
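
For the index option, a sketch of what we would try (using Hive's compact
index handler; untested on our side):

CREATE INDEX analyze_generated_by_idx ON TABLE analyze (generated_by)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

ALTER INDEX analyze_generated_by_idx ON analyze REBUILD;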

If anyone has insights around these, they would be really helpful.
Meanwhile, we will try to solve our problem with buckets/indices.


--
Regards,
Saumitra Shahapure



Re: Buildfile: build.xml does not exist!

2014-03-25 Thread Chinna Rao Lalam
Hi,

  Hive is mavenized, so please follow this link to build it:


https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-MakingChanges
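
At the time of writing, the build boils down to something like
"mvn clean install -DskipTests -Phadoop-2" (the exact profile names may
differ between releases, so please check the wiki page above).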


Hope it helps,



On Tue, Mar 25, 2014 at 9:12 PM, Nagarjuna Vissarapu 
nagarjuna.v...@gmail.com wrote:


 Hi,

 Can you please help me find build.xml for the installation of Hive? I
 used the following link for installation:
 *https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation*.
 When trying to use the *ant package* command, I could not find a
 build.xml file. Thanks in advance; please help me.

 Please find the attachment.

 --
 With Thanks & Regards
 Nagarjuna Vissarapu
 9052179339




Re: Handling hierarchical data in Hive

2014-03-25 Thread Nitin Pawar
Bucketing is certainly helpful when you have a finite number of values in
a column other than the partitioned column.
Bucketing does mean, though, that when you load data into the table it
can't be a straightforward LOAD DATA INPATH; you will need to run it via
Hive queries (which does not seem to be a problem, at least from the look
of it).

The number of buckets used to be a power of 2, like 2, 4, 8, 16, etc. Not
sure if that has changed now.
Also, while loading data into a bucketed table, it is advised that you set
hive.enforce.bucketing = true;
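
A sketch of such a load (the staging table name is made up for
illustration; the column names follow the CREATE TABLE earlier in the
thread):

SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE analyze PARTITION (dt = '20140320')
SELECT generated_by, other_representative_field
FROM analyze_staging
WHERE dt = '20140320';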

I have rarely used indexing in Hive, but I do remember that Hive indexes
used to provide better data access for certain queries, and the storage
layout helps in improving search and lookup of the data.

It would be really helpful if you could note down the performance you get
after fine-tuning these parameters.



-- 
Nitin Pawar


Delta or incremental loading for Hbase table

2014-03-25 Thread Manjula mohapatra
We have an HBase table.
Each time we aggregate the table based on some columns, we do a full scan
of the entire table.
What are some ideas for extracting just the delta, or the increments, from
the last load?


Right now I am following this approach (a rough sketch follows the list),
but I want some better ideas:
- Mount the HBase table into a Hive table.
- The rowkey of the HBase table is mapped to a key column in the Hive
table.
- Extract the timestamp from the rowkey and pull out yesterday's data.
- There is also a (non-key) timestamp column; I am extracting the previous
day's data and aggregating it.
- Then merge the incrementally aggregated data into the target aggregate
table using a full outer join.
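
The merge step looks roughly like this (all table and column names are
placeholders):

INSERT OVERWRITE TABLE agg_merged
SELECT
  COALESCE(d.rowkey, t.rowkey) AS rowkey,
  COALESCE(d.cnt, 0) + COALESCE(t.cnt, 0) AS cnt
FROM delta_agg d
FULL OUTER JOIN agg_target t
  ON d.rowkey = t.rowkey;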


Questions:
1) Any better suggestions for incremental loading?
2) Does the use of the key column from Hive give any performance benefit?
I don't see much change in terms of timing.


RE: Does hive instantiate new udf object for each record

2014-03-25 Thread sky88088
It works! 
I really appreciate your help!


Best Regards,
ypg

Writing to Hive tables programmatically

2014-03-25 Thread Jiahua Wang
Hello,

I've been looking for good ways to create and write to Hive tables from
Java code. So far, I've considered the following options:

1. Create the Hive table using the JDBC client, write data to HDFS using
bare HDFS operations, and load that data into the Hive table using the
JDBC client.
2. Use HCatalog.

I didn't like #1 since I'd have to write a lot of code to handle various
file types myself, which I'm guessing has already been done. Using
HCatalog (#2) looks really simple from Pig and MapReduce, but I wasn't
able to figure out how to write to a Hive table outside of a MapReduce
job.
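
For reference, the statements I would issue over JDBC in option #1 would
look roughly like this (the table name and path are placeholders):

CREATE TABLE IF NOT EXISTS events (line STRING) STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/staged/events' INTO TABLE events;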

Any help would be greatly appreciated!

Thanks,
Alvin


Fwd: Future date getting converted to epoch date with windowing function

2014-03-25 Thread Akansha Jain
Hi,
I am trying to use Hive windowing functions for a business use case. The
Hive version is Apache Hive 0.11.
I have a table with a column end_date whose value is 2999-12-31. While
using a Hive windowing function with this value, Hive converts it to a
1970s date.

*Query used is:*

SELECT account_id,
   device_id,
   status,
   LEAD (status) OVER (PARTITION BY device_id ORDER BY start_date DESC)
prev_status,
   start_date,
   end_date
from my_table;

*Sample data:*

account_id  device_id  status  primary_min  start_date           end_date
9           111        2       111          2012-08-29 00:00:00  2013-08-14 00:00:00
9           111        5       111          2013-08-15 00:00:00  2013-08-15 00:00:00
9           111        4       111          2013-08-16 00:00:00  2013-11-30 00:00:00
9           111        4       111          2013-12-01 00:00:00  2013-12-01 00:00:00
9           111        4       111          2013-12-02 00:00:00  2014-01-15 00:00:00
9           111        4       111          2014-01-16 00:00:00  2999-12-31 00:00:00

*Output:*

account_id  device_id  status  prev_status  start_date           end_date
9           111        2       NULL         2012-08-29 00:00:00  2013-08-14 00:00:00
9           111        5       2            2013-08-15 00:00:00  2013-08-15 00:00:00
9           111        4       5            2013-08-16 00:00:00  2013-11-30 00:00:00
9           111        4       4            2013-12-01 00:00:00  2013-12-01 00:00:00
9           111        4       4            2013-12-02 00:00:00  2014-01-15 00:00:00
9           111        4       4            2014-01-16 00:00:00  1979-03-26 23:28:00

Here, the date 2999-12-31 got converted to 1979-03-26. I have tried
converting the date type to String, but that did not help.

Please let me know if anyone has faced the same issue and resolved it.

Thanks in advance,

Akansha