Handling hierarchical data in Hive
Hello,

We are using Hive to query data in S3. For one of our tables, named analyze, we generate data hierarchically: the first level of the hierarchy is the date, and the second level is a field named generated_by. For example, for 20 March we may have S3 directories such as:

  s3://analyze/20140320/111/
  s3://analyze/20140320/222/
  s3://analyze/20140320/333/

The files in each folder are typically small. Until now we have been using static partitioning so that queries on a specific date and generated_by would be fast. The problem now is that the number of generated_by folders has grown into the thousands. Every day we end up adding thousands of partitions to Hive, so queries over one month of analyze data have slowed down.

Is there any way to get rid of the partitions and at the same time maintain good performance for queries fired on a specific day and generated_by?

-- Regards, Saumitra Shahapure
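For reference, a sketch of the static-partitioning setup described above; column names other than dt and generated_by are assumed for illustration:

  CREATE EXTERNAL TABLE analyze (event STRING)
  PARTITIONED BY (dt STRING, generated_by STRING)
  LOCATION 's3://analyze/';

  -- one such statement per generated_by directory, thousands per day:
  ALTER TABLE analyze ADD PARTITION (dt='20140320', generated_by='111')
  LOCATION 's3://analyze/20140320/111/';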
Re: Handling hierarchical data in Hive
See if this is what you are looking for: https://github.com/sskaje/hive_merge

-- Nitin Pawar
Re: Handling hierarchical data in Hive
Hi Nitin,

We are not facing a small-files problem, since the data is in S3, and we do not want to merge files: merging files to create one large analyze table for, say, a day would slow down queries fired on a specific day and generated_by.

Let me explain my problem in other words. Right now we are over-partitioning our table. Over-partitioning gives us the benefit that a query on 1-2 partitions is very fast. Its side effect is that if we try to query a large number of partitions, the query is very slow. Is there a way to get good performance in both scenarios?

-- Regards, Saumitra Shahapure
Re: Handling hierarchical data in Hive
In general, when you have a large number of partitions, your Hive query performance drops. This has been significantly addressed in current releases, but you can still see performance issues. Sadly, I currently do not have a dataset large enough to need that many partitions. Last time I checked, this issue was caused by ObjectStore.getPartitionsByNames; I am not sure whether the implementation is still the same.

When you have a large number of partitions, the time actually spent on query planning increases. One mitigation was to set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. You can also check the value of datanucleus.connectionPool.maxActive in your Hive config, to see if you can increase the number of connections to your metastore DB.

We normally used to merge historical data into a single partition column and then, if required, do a join between the new data set and the old data sets: in effect, a rolling data table and a historical data table.

-- Nitin Pawar
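A sketch of where those two settings live; the maxActive value below is only illustrative:

  -- per session, in the Hive CLI:
  SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

  -- datanucleus.connectionPool.maxActive is a metastore-side setting
  -- and belongs in hive-site.xml, e.g.:
  --   <property>
  --     <name>datanucleus.connectionPool.maxActive</name>
  --     <value>50</value>
  --   </property>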
RE: Does hive instantiate new udf object for each record
The reason you saw that is that when you provided the evaluate() method, you didn't specify the type of column it can be applied to. So Hive just creates the test instance again and again for every new row, as it doesn't know how, or to which column, to apply your UDF. I changed your code as below:

  import org.apache.hadoop.hive.ql.exec.UDF;
  import org.apache.hadoop.io.Text;

  public class test extends UDF {
      private Text t;

      // typed overload, so Hive knows the UDF applies to a string column
      public Text evaluate(String s) {
          if (t == null) {
              t = new Text("initialization");
          } else {
              t = new Text("OK");
          }
          return t;
      }

      public Text evaluate() {
          if (t == null) {
              t = new Text("initialization");
          } else {
              t = new Text("OK");
          }
          return t;
      }
  }

Now, if you invoke your UDF like this:

  select test(colA) from AnyTable;

you should see one "initialization" and the rest "OK". Make sense?

Yong

From: sky880883...@hotmail.com
To: user@hive.apache.org
Subject: RE: Does hive instantiate new udf object for each record
Date: Tue, 25 Mar 2014 10:17:46 +0800

I have implemented a simple UDF for testing:

  public class test extends UDF {
      private Text t;

      public Text evaluate() {
          if (t == null) {
              t = new Text("initialization");
          } else {
              t = new Text("OK");
          }
          return t;
      }
  }

And the test query:

  select test() from AnyTable;

I got:

  initialization
  initialization
  initialization
  ...

I have also implemented a similar GenericUDF and got a similar result. What's wrong with my code?

Best Regards,
ypg

From: java8...@hotmail.com
To: user@hive.apache.org
Subject: RE: Does hive instantiate new udf object for each record
Date: Mon, 24 Mar 2014 16:58:49 -0400

Your UDF object will only be initialized once per mapper or reducer. When you say your UDF object is being initialized for each row, why do you think so? Do you have a log that makes you think that way? If so, please provide more information so we can help you: your example code, logs, etc.

Yong

Date: Tue, 25 Mar 2014 00:30:21 +0800
From: sky880883...@hotmail.com
To: user@hive.apache.org
Subject: Does hive instantiate new udf object for each record

Hi all,

I'm trying to implement a UDF which makes use of data structures like a binary tree. However, it seems that Hive instantiates a new UDF object for each row in the table, so the data structures would also be initialized again and again, once per row. Whereas in the book Programming Hive, a geoip function is given as an example showing that a LookupService object is saved in a reference, so it only needs to be initialized once in the lifetime of the map or reduce task that initializes it. The code for this function can be found at https://github.com/edwardcapriolo/hive-geoip/.

Could anyone give me some ideas on how to make the UDF object initialize once in the lifetime of a map or reduce task?

Best Regards,
ypg
Re: Handling hierarchical data in Hive
Hi Saumitra,

You might want to look into clustering within the partition. That is, partition by day, but cluster by generated_by within those partitions, and see if that improves performance. Refer to the CLUSTER BY command in the Hive Language Manual.

-Prasan
Re: Permission denied creating external table
Hi Oliver,

Try setting these properties in core-site.xml. Using * will allow everyone to impersonate Hive, and the cluster needs to be restarted afterwards:

  <property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>users</value>
    <description>Allow the superuser hive to impersonate any members of the group users. Required only when installing Hive.</description>
  </property>

  <property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>$Hive_Hostname_FQDN</value>
    <description>Hostname from where superuser hive can connect. Required only when installing Hive.</description>
  </property>

where $HIVE_USER is the user owning the Hive services, for example hive.

Also, enable the following configuration in hive-site.xml: hive.metastore.execute.setugi.

In addition, please use the directory path only while you are creating the table, and it would be better to have 'hadoop' as the supergroup. Hope this helps.

Thanks

On Mon, Mar 24, 2014 at 9:59 AM, Oliver ohook...@gmail.com wrote:

Hi Rahman,

On 24 March 2014 16:45, Abdelrahman Shettia ashet...@hortonworks.com wrote:
> Hi Oliver, can you perform a simple test of hadoop fs -cat hdfs:///logs/2014/03-24/actual_log_file_name.seq as the same user? Also, what are the configuration settings for the following?

Yes, I can access that file as the same user using hadoop fs -cat, as well as with other tools (I've been using Pig up until this point).

> hive.metastore.execute.setugi

I'm not setting this explicitly anywhere.

> hive.metastore.warehouse.dir

I have this set in my HiveQL script:

  SET hive.metastore.warehouse.dir=/user/oliver/warehouse;

This directory already exists, since I created it before running the script.

> hive.metastore.uris

Not explicitly set anywhere to my knowledge.

> Thanks, Rahman

On Mar 24, 2014, at 8:17 AM, Oliver ohook...@gmail.com wrote:

Hi,

I have a bunch of data already in place in a directory on HDFS containing many different logs of different types, so I'm attempting to load these externally like so:

  CREATE EXTERNAL TABLE mylogs (line STRING)
  STORED AS SEQUENCEFILE
  LOCATION 'hdfs:///logs/2014/03-24/actual_log_file_name.seq';

However, I get this error back when doing so:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=oliver, access=WRITE, inode=/logs/2014/03-24:logs:supergroup:drwxr-xr-x
  at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:224)
  at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:204)
  at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:149)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:4716)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:4698)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:4672)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3035)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:2999)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:2980)
  at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:648)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:419)
  at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44970)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1701)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1697)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Unknown Source)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1695)
)

This directory is intentionally read-only for regular users, who want to read the logs and analyse them. Am I missing some configuration for Hive that will tell it to store metadata elsewhere? I already have hive.metastore.warehouse.dir set to another location where I have write permission.

Best Regards,
Oliver
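Following Rahman's suggestion to use the directory path only, the DDL would look something like the sketch below; whether it avoids the AccessControlException depends on whether the metastore still tries to create directories under the location:

  CREATE EXTERNAL TABLE mylogs (line STRING)
  STORED AS SEQUENCEFILE
  LOCATION 'hdfs:///logs/2014/03-24/';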
Re: Handling hierarchical data in Hive
Hi Nitin/Prasan,

Thanks for your replies, I appreciate your help :)

Clustering looks quite close to what we want. However, one main gap is that we would need to fire a Hive query to populate the clusters. In our case the clustered data is already there, so the computation in the Hive query would be redundant. If

  CREATE TABLE analyze (generated_by INT, other_representative_field INT)
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (generated_by) INTO 100 BUCKETS;

just accepted the S3 directory hierarchy that we have (as explained in the first mail), our problem would be solved.

Another interesting solution seems to be creating a partition on the dt field and creating a Hive index/view on the generated_by field. If anyone has insights around these, they would be really helpful. Meanwhile we will try to solve our problem with buckets/indexes.

-- Regards, Saumitra Shahapure
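A sketch of the index idea mentioned above, using Hive's compact index handler; the index name is assumed:

  CREATE INDEX analyze_generated_by_idx
  ON TABLE analyze (generated_by)
  AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
  WITH DEFERRED REBUILD;

  ALTER INDEX analyze_generated_by_idx ON analyze REBUILD;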
Re: Buildfile: build.xml does not exist!
Hi,

Hive is mavenized now, so please follow this link to build it: https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-MakingChanges

Hope it helps you,

On Tue, Mar 25, 2014 at 9:12 PM, Nagarjuna Vissarapu nagarjuna.v...@gmail.com wrote:

Hi,

Can you please help me find build.xml for the installation of Hive? I used the following link for installation: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation. When running the ant package command, I did not find a build.xml file. Thanks in advance; please find the attachment.

-- With Thanks & Regards, Nagarjuna Vissarapu, 9052179339
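With the mavenized build, the command from that page looks something like the following; the profile depends on your target Hadoop version, so treat this as a sketch rather than the exact incantation:

  mvn clean install -DskipTests -Phadoop-2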
Re: Handling hierarchical data in Hive
Bucketing is certainly helpful when you have a finite number of values in a non-partition column. Bucketing does mean, though, that loading data into the table can't be a straightforward LOAD DATA INPATH; you will need to load it via Hive queries (which does not seem to be a problem, at least from the look of it).

Bucket counts used to be chosen in powers of 2, like 2, 4, 8, 16, etc.; not sure if that has changed now. Also, while loading data into a bucketed table it is advised that you set:

  set hive.enforce.bucketing = true;

I have rarely used indexing in Hive, but I do remember that Hive indexes used to provide better data access for certain queries, and the storage layout helps improve search and lookup of the data. It would be really helpful if you could note down the performance you get after fine-tuning the parameters.

-- Nitin Pawar
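A sketch of the query-based load Nitin describes, assuming a staging table (analyze_staging, with a dt column) declared over the raw S3 data:

  SET hive.enforce.bucketing = true;

  INSERT OVERWRITE TABLE analyze PARTITION (dt='20140320')
  SELECT generated_by, other_representative_field
  FROM analyze_staging
  WHERE dt = '20140320';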
Delta or incremental loading for Hbase table
We have an HBase table. Each time we aggregate the table based on some columns, we are doing a full scan of the entire table. What are some ideas for extracting just the delta, or the increments, since the last load?

Right now I am following this approach, but I want some better ideas:

- Mount the HBase table as a Hive table; the rowkey of the HBase table is mapped to the key column in the Hive table.
- Extract the timestamp from the rowkey and extract yesterday's data.
- There is also a (non-key) timestamp column; I extract the previous day's data and aggregate it.
- Then merge the incremental aggregated data into the target aggregate table using a full outer join.

Questions:
1) Any better suggestions for incremental loading?
2) Does the use of the key column from Hive give any performance benefit? I don't see much change in terms of timing.
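A sketch of the Hive-over-HBase mapping in the first step; the table, column-family, and qualifier names here are assumed:

  CREATE EXTERNAL TABLE hbase_events (key STRING, event_ts STRING, payload STRING)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:event_ts,cf:payload')
  TBLPROPERTIES ('hbase.table.name' = 'events');

  -- yesterday's slice, assuming the rowkey starts with a yyyyMMdd prefix:
  SELECT * FROM hbase_events WHERE substr(key, 1, 8) = '20140324';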
RE: Does hive instantiate new udf object for each record
It works! I really appreciate your help!

Best Regards,
ypg
Writing to Hive tables programmatically
Hello,

I've been looking for good ways to create and write to Hive tables from Java code. So far, I've considered the following options:

1. Create the Hive table using the JDBC client, write data to HDFS using bare HDFS operations, and load that data into the Hive table using the JDBC client. I didn't like this since I'd have to write a lot of code to handle various file types myself, which I'm guessing has already been done.

2. Use HCatalog. Using HCatalog looks really simple from Pig and MapReduce, but I wasn't able to figure out how to write to a Hive table outside of a MapReduce job.

Any help would be greatly appreciated!

Thanks,
Alvin
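For reference, the statements a JDBC client would issue for option 1 might look like this; the table name, format, and staging path are assumed:

  CREATE TABLE events (line STRING)
  STORED AS TEXTFILE;

  LOAD DATA INPATH '/user/alvin/staging/events.txt' INTO TABLE events;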
Fwd: Future date getting converted to epoch date with windowing function
Hi,

I am trying to use Hive windowing functions for a business use case. The Hive version is Apache Hive 0.11. I have a table with a column end_date whose value is 2999-12-31. When this value passes through a Hive windowing function, Hive converts it to a 1970s date.

Query used:

  SELECT account_id, device_id, status,
         LEAD(status) OVER (PARTITION BY device_id ORDER BY start_date DESC) prev_status,
         start_date, end_date
  FROM my_table;

Sample data:

  account_id  device_id  status  primary_min  start_date           end_date
  9           111        2       111          2012-08-29 00:00:00  2013-08-14 00:00:00
  9           111        5       111          2013-08-15 00:00:00  2013-08-15 00:00:00
  9           111        4       111          2013-08-16 00:00:00  2013-11-30 00:00:00
  9           111        4       111          2013-12-01 00:00:00  2013-12-01 00:00:00
  9           111        4       111          2013-12-02 00:00:00  2014-01-15 00:00:00
  9           111        4       111          2014-01-16 00:00:00  2999-12-31 00:00:00

Output:

  account_id  device_id  status  prev_status  start_date           end_date
  9           111        2       NULL         2012-08-29 00:00:00  2013-08-14 00:00:00
  9           111        5       2            2013-08-15 00:00:00  2013-08-15 00:00:00
  9           111        4       5            2013-08-16 00:00:00  2013-11-30 00:00:00
  9           111        4       4            2013-12-01 00:00:00  2013-12-01 00:00:00
  9           111        4       4            2013-12-02 00:00:00  2014-01-15 00:00:00
  9           111        4       4            2014-01-16 00:00:00  1979-03-26 23:28:00

Here, the date 2999-12-31 got converted to 1979-03-26. I have tried converting the date type to String, but that did not help. Please let me know if anyone has faced the same issue and resolved it.

Thanks in advance,
Akansha
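One untested workaround sketch, given the symptom above: carry end_date through the windowed subquery as a bigint epoch value, assuming the corruption happens when the windowing path serializes the timestamp column itself, and convert it back afterwards:

  SELECT account_id, device_id, status, prev_status, start_date,
         from_unixtime(end_ts) AS end_date
  FROM (
    SELECT account_id, device_id, status,
           LEAD(status) OVER (PARTITION BY device_id ORDER BY start_date DESC) prev_status,
           start_date,
           unix_timestamp(end_date) AS end_ts
    FROM my_table
  ) t;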