Trying jdbc:hive client
I am trying Hive from a Java JDBC client. I can execute simple queries such as SELECT * FROM table and SELECT * FROM table WHERE something = something, but when I run a join query it throws the following error.

In my NetBeans IDE this is the exception:

    Running: SELECT * FROM sampletab1 sp1 JOIN sampletab12 sp2 ON (sp1.id = sp2.id) limit 10
    Exception in thread "main" java.sql.SQLException: Query returned non-zero code: 9, cause: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
        at org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:192)
        at witsmlstore.HiveJdbcJava.main(HiveJdbcJava.java:77)
    Java Result: 1

And at the server console this is the output:

    WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
    Hive history file=/tmp/shashwat/hive_job_log_shashwat_201203191222_772980239.txt
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    Starting Job = job_201203191156_0003, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201203191156_0003
    Kill Command = /home/shashwat/Hadoop/hadoop-0.20.205/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:9001 -kill job_201203191156_0003
    2012-03-19 12:23:07,239 Stage-1 map = 0%, reduce = 0%
    2012-03-19 12:24:01,453 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_201203191156_0003 with errors
    FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

When I checked the task tracker (http://localhost:50030/taskdetails.jsp?tipid=task_201203191156_0003_m_00) the error was:

    java.io.IOException: Cannot create an instance of InputSplit class = org.apache.hadoop.hive.hbase.HBaseSplit:org.apache.hadoop.hive.hbase.HBaseSplit
        at org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit.readFields(HiveInputFormat.java:145)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:396)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:412)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)

What may be the probable cause?

When I try embedded mode it throws the following error:

    12/03/19 12:27:39 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
    12/03/19 12:27:39 INFO metastore.ObjectStore: ObjectStore, initialize called
    12/03/19 12:27:39 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
    12/03/19 12:27:39 INFO DataNucleus.Persistence: Property javax.jdo.option.NonTransactionalRead unknown - will be ignored
    12/03/19 12:27:39 INFO DataNucleus.Persistence: === Persistence Configuration ===
    12/03/19 12:27:39 INFO DataNucleus.Persistence: DataNucleus Persistence Factory - Vendor: "DataNucleus" Version: "2.0.3"
    12/03/19 12:27:39 INFO DataNucleus.Persistence: DataNucleus Persistence Factory initialised for datastore URL="jdbc:derby:;databaseName=metastore_db;create=true" driver="org.apache.derby.jdbc.EmbeddedDriver" userName="APP"
    12/03/19 12:27:39 INFO DataNucleus.Persistence: ===
    12/03/19 12:27:42 INFO Datastore.Schema: Initialising Catalog , Schema APP using None auto-start option
    12/03/19 12:27:42 INFO Datastore.Schema: Catalog , Schema APP initialised - managing 0 classes
    12/03/19 12:27:42 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
    12/03/19 12:27:42 INFO DataNucleus.MetaData: Registering listener for metadata initialisation
    12/03/19 12:27:42 INFO metastore.ObjectStore: Initialized ObjectStore
    12/03/19 12:27:42 WARN DataNucleus.MetaData: MetaData Parser encountered an error in file jar:file:/home/shashwat/Hadoop/hive-0.7.1/lib/hive-metastore-0.7.1.jar!/package.jdo at line 11, column 6 : cvc-elt.1: Cannot
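For what it's worth, the HBaseSplit deserialization failure in the task logs usually means the map tasks cannot load the Hive HBase storage-handler classes: the handler jars sit on the client's classpath, so queries Hive can answer locally succeed, but they are not shipped with the MapReduce job the join compiles to. A minimal sketch of one common fix, passing the jars as aux jars when starting the server; the exact jar versions below are assumptions (only hive-0.7.1 is confirmed by the log path), so adjust them to what is actually in lib/:

    # Ship the HBase storage handler and its dependencies with every
    # MapReduce job Hive launches (versions here are illustrative).
    HIVE_HOME=/home/shashwat/Hadoop/hive-0.7.1
    hive --auxpath $HIVE_HOME/lib/hive-hbase-handler-0.7.1.jar,$HIVE_HOME/lib/hbase-0.90.0.jar,$HIVE_HOME/lib/zookeeper-3.3.1.jar \
         --service hiveserver

Equivalently, the same comma-separated list can be set as hive.aux.jars.path in hive-site.xml so it applies without the command-line flag.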
how is number of mappers determined in mapside join?
Hi there,

When I'm executing the following queries in Hive:

    set hive.auto.convert.join = true;
    CREATE TABLE IDAP_ROOT AS
    SELECT a.*, b.acnt_no
    FROM idap_pi_root a
    LEFT OUTER JOIN idap_pi_root_acnt b ON a.acnt_id = b.acnt_id;

the number of mappers run in the map-side join is 3. How is it determined? When launching a job in Hadoop MapReduce, I know the split size is determined by the function

    max(minSplitSize, min(maxSplitSize, hdfsBlockSize))

which in my configuration is max(1B, min(256MB, 32MB)) = 32MB, and the two tables are 460MB and 1.5MB respectively. Thus I expected around 15 mappers to be launched, which is not the case.

Thanks
Bruce
Re: how is number of mappers determined in mapside join?
Hi Bruce

In a map-side join the smaller table is loaded into memory, so the number of mappers depends only on the data in the larger table. Say CombineHiveInputFormat is used and we have an HDFS block size of 32 MB, a min split size of 1 B and a max split size of 256 MB. That means one mapper processes a chunk of data no smaller than 1 B and no larger than 256 MB. Mappers are triggered based on that, so one possibility in your case is:

    mapper 1 - 200 MB
    mapper 2 - 120 MB
    mapper 3 - 140 MB

Every mapper processes data whose size is between 1 B and 256 MB, totaling 460 MB, your table size. I'm not sure of the formula you posted here; can you point me to the document you got it from?

Regards
Bejoy
Re: how is number of mappers determined in mapside join?
Hi Bejoy,

Thanks for your reply. The function is from the book Hadoop: The Definitive Guide, 2nd edition. On page 203 it says the split size is calculated by the formula (see the computeSplitSize() method in FileInputFormat):

    max(minimumSize, min(maximumSize, blockSize))

By default minimumSize < blockSize < maximumSize, so the split size is blockSize. And I've actually used the HDFS block size to control the number of mappers launched before.

As to your response, do you mean that any size between 1 B and 256 MB is acceptable for a mapper to process? Then the only way I can think of to increase the number of mappers is to reduce the max split size.

Regards,
Bruce
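For reference, the knob Bruce mentions can be set per session. A minimal sketch, assuming a Hive 0.7/0.8-era setup where CombineHiveInputFormat honours mapred.max.split.size; the 32 MB cap and table names mirror the thread and are illustrative:

    # Cap each combined split at 32 MB so the 460 MB table yields ~15 mappers.
    hive -e "
      set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
      set mapred.max.split.size=33554432;
      set hive.auto.convert.join=true;
      CREATE TABLE idap_root AS
      SELECT a.*, b.acnt_no
      FROM idap_pi_root a
      LEFT OUTER JOIN idap_pi_root_acnt b ON a.acnt_id = b.acnt_id;
    "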
Hive CLI and Standalone Server : Need Suggestion
Hi

I need to schedule my Hive scripts, which need to process incoming weblogs on an hourly basis. Currently I process my weblog files by executing the scripts from the Hive command line interface. Now I want to keep my scripts in a file and invoke them at regular intervals. I came to know that the Hive command line options provide a facility to pass a .sql file as input for execution. Is that the right approach for a production environment? Or should I run my Hive server in standalone mode and invoke my Hive scripts using the JDBC API?

Request you to suggest the best approach.

Regards,
LK
Re: Hive CLI and Standalone Server : Need Suggestion
Hi LakshmiKanth

In production systems, if you have a sequence of commands to be executed, pack them in order into a file and then execute it as:

    hive -f filename

For simplicity, you can use a cron job to run it on a schedule: put the command in a .sh file and call that file from cron. In fact you can use any scheduler that can trigger a .sh file. But for Hadoop-based workflows the preferred workflow manager is Oozie, and I recommend Oozie for Hadoop jobs.

Regards
Bejoy KS
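As a concrete illustration of the cron route Bejoy describes (the file names, paths and schedule below are made up):

    #!/bin/sh
    # run_weblogs.sh -- hypothetical wrapper around the hourly Hive script.
    /usr/local/hive/bin/hive -f /opt/etl/process_weblogs.sql \
        >> /var/log/etl/weblogs.log 2>&1

    # crontab entry (crontab -e): run the wrapper at five past every hour.
    5 * * * * /opt/etl/run_weblogs.sh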
RE: Hive CLI and Standalone Server : Need Suggestion
Great topic, as I was wondering a similar thing this morning... I want to use Oozie to execute my Hive job, but I have to pass job parameters that I generate with a shell script. Some of the literature I've seen says that Oozie may or may not allow calling shell scripts. Is that true?

Thanks
Carla
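For comparison, one common pattern outside Oozie: a shell script computes the parameters and hands them to Hive on the command line. A minimal sketch, assuming a Hive version with variable substitution (0.7+); the script and variable names are hypothetical:

    #!/bin/sh
    # Compute a run date and pass it to the Hive script as a hiveconf variable.
    DT=$(date +%Y-%m-%d)
    hive -hiveconf run_date="$DT" -f /opt/etl/daily_report.sql
    # Inside daily_report.sql the value is read back as ${hiveconf:run_date},
    # e.g.  SELECT ... WHERE ds = '${hiveconf:run_date}';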
Re: Hive CLI and Standalone Server : Need Suggestion
This is a bit of a problem. Oozie is great for workflow scheduling, but it does not have actions for everything, and adding actions is non-trivial in current versions. I have created some bootleg/generic Oozie actions that make it easy to exec pretty much anything and treat it as an action: https://github.com/edwardcapriolo/m6d_oozie
Re: Hive CLI and Standalone Server : Need Suggestion
Eduardo,

Besides the mapreduce/streaming/hive/pig/sqoop/distcp actions, Oozie has a JAVA action (to execute a Java main class in the cluster), an SSH action (to execute a script via SSH on a remote host), and a SHELL action (to execute a script in the cluster). Would you mind explaining what your m6d extension does that the JAVA, SSH or SHELL actions cannot do in a similar way?

Thanks.

Alejandro
Re: Hive CLI and Standalone Server : Need Suggestion
I am not trying to knock Oozie, but:

MapReduce action: would be great, but the Hadoop docs taught me that the proper way to write Hadoop programs is with Tool and Configured, and 90% of our legacy jobs are Tools. The MapReduce action cannot launch Tools, so back to JavaMain...

SSH action: something I would never allow on our network. Super bootleg and insecure.

Hive action: requires the entire Hive fat client, which is not easy since our RDBMS needs to be configured to allow every possible tasktracker to access its metastore. It would be better if HiveAction were HiveThriftAction; then it would only need minimal jars and a host/port pair. Again back to JavaMain...

Not sure about the shell action; it may not have been around when I put this framework together.

My main point is that Oozie in its current form is not very flexible. What if I want to add an RDBMS action? Beg developers to patch it in? Just having to patch in actions is detracting. (I know there is a JIRA open on this.)

The reason I wrote the library was: https://github.com/edwardcapriolo/m6d_oozie/blob/master/src/main/java/com/m6d/oozie/RunShellProps.java

The problem I was facing with the Shell and Java Main actions is that if you want to extract any output to be used in the next phase of the job, it is not easy to get at. I wrote a JavaMain that is capture-output friendly.
Hive performance vs. SQL?
I haven't had an opportunity to set up a huge Hive database yet because exporting CSV files from our SQL database is, in itself, a rather laborious task. I was just curious how I might expect Hive to perform vs. SQL on large databases and large queries. I realize Hive is pretty latent since it builds and runs MapReduce jobs for even the simplest queries, but that is precisely why I think it might perform better on long queries against large (external CSV) databases.

Would you expect Hive to ever outperform SQL on a single machine (standalone or pseudo-distributed mode)? I am entirely open to the possibility that the answer is no, that Hive could never compete with SQL on a single machine. Is this true? If so, how large (how parallel) do you think the underlying Hadoop cluster needs to be before Hive overtakes SQL? 2X? 10X? Where is the crossover point where Hive actually outperforms SQL?

Along similar lines, might Hive never outperform SQL on a database small enough for SQL to run on a single machine, say 10s to 100s of GBs? Must the database itself be so large that SQL is effectively crippled and the data must be distributed before Hive offers significant gains?

I am really just trying to get a basic feel for how I might anticipate Hive's behavior vs. SQL once I get a large system up and running. Thanks.

Keith Wiley    kwi...@keithwiley.com    keithwiley.com    music.keithwiley.com

"I used to be with it, but then they changed what it was. Now, what I'm with isn't it, and what's it seems weird and scary to me."
  -- Abe (Grandpa) Simpson
Re: Hive performance vs. SQL?
From my experience, if you can fit your data in a SQL database without sharding or anything, don't ever think twice: Hive is not even comparable. I would rather describe Hive as a nice SQL interface over Hadoop M/R than as a SQL replacement. If you are running a DWH in SQL and you don't need to grow your data to at least a couple of TB, then keep SQL.

The very nice feature of Hadoop/Hive is that your DWH can grow (almost) horizontally without much trouble by buying new servers, and most of your queries scale with the number of servers too. You have to know that a SELECT count(1) FROM t, where t is ~1 GB, can take more time to start and stop the M/R job (which has huge overhead) than to actually count; a simple wc -l takes about a second on any normal PC.
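To make the overhead point concrete, a quick experiment one could run; the file and table names are made up, and the timings will vary by machine:

    # ~1 GB of text on local disk: counting lines takes on the order of a second.
    time wc -l /data/weblogs.csv

    # The same count through Hive pays the full MapReduce job
    # submission/startup/teardown cost -- typically tens of seconds --
    # no matter how small the data is.
    time hive -e "SELECT count(1) FROM weblogs;"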
LOAD DATA problem
Is there a way to prevent LOAD DATA LOCAL INPATH from appending _copy_1 to logs that already exist in a partition? If the log is already in HDFS/Hive, I'd rather it fail and give me a return code or output saying that the log already exists. For example, if I run these queries:

    /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
    /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
    /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
    /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"

I end up with:

    test_a.bz2
    test_b.bz2
    test_b_copy_1.bz2
    test_b_copy_2.bz2

However, if I use OVERWRITE it will nuke all the data in the partition (including test_a.bz2) and I end up with just:

    test_b.bz2

I recall that older versions of Hive would not do this. How do I handle this case? Is there a safe, atomic way to do it?

Sean
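One workaround, absent a built-in fail-on-duplicate mode, is to test for the file in the partition's directory before loading. A minimal sketch, assuming the table lives under the default /user/hive/warehouse layout (adjust the path to your install):

    #!/bin/sh
    # Refuse to load a log that is already present in the target partition.
    FILE=test_b.bz2
    PART_DIR=/user/hive/warehouse/logs/ds=2012-03-19/hr=23
    if hadoop fs -test -e "$PART_DIR/$FILE"; then
        echo "$FILE already exists in $PART_DIR, skipping" >&2
        exit 1
    fi
    /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH '$FILE' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"

Note that this check-then-load is not atomic: a concurrent loader could still slip in between the test and the load, so it only protects against re-running the same job.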