Gabi-

Glad to know I'm not the only one scratching my head on this one!  The changed 
behavior caught us off guard.

I haven't found a solution in my sleuthing tonight.  Indeed, any help would be 
greatly appreciated on this!

Sean

From: Gabi D <gabi...@gmail.com>
Reply-To: <user@hive.apache.org>
Date: Tue, 20 Mar 2012 10:03:04 +0200
To: <user@hive.apache.org>
Subject: Re: LOAD DATA problem

Hi Vikas,
we are facing the same problem Sean reported, and we have also noticed that 
this behavior changed in a newer version of Hive. Previously, when you loaded 
a file with the same name into a partition/table, Hive would fail the request 
(with yet another of its cryptic messages, an issue in itself); now it loads 
the file and appends a _copy_N suffix to the name.
I should say that we do not normally check whether a file with the same name 
already exists in our HDFS directories. Our files arrive with unique names, 
and if we try to load the same file again it is because of a failure in one of 
the steps of our flow (e.g., files that had already been handled and loaded 
into Hive were not removed from our work directory for some reason, so the 
next run of our load process reloaded them). We do not want to add a step that 
checks whether a file with the same name already exists in HDFS - this is 
costly and most of the time (hopefully all of the time) unnecessary. What we 
would like is to get some 'duplicate file' error that we can then disregard, 
knowing that the file is already safely in place.
Note that duplicate files cause us to double count rows, which is unacceptable 
for many applications.
Moreover, we use gz files, and since this behavior changes the file's suffix 
(from gz to gz_copy_N), when this happens we seem to get all sorts of strange 
data: Hadoop no longer recognizes the file as compressed and does not 
decompress it before reading it ...
Any help or suggestions on this issue would be much appreciated; we have been 
unable to find any so far.


On Tue, Mar 20, 2012 at 9:29 AM, hadoop hive <hadooph...@gmail.com> wrote:
hey Sean,

It's because you are appending a file into the same partition with the same 
name (which is not possible); you must change the file name before appending 
it into the same partition.

AFAIK, I don't think there is any other way to do it: you can either change 
the partition name or the file name.

Thanks
Vikas Srivastava


On Tue, Mar 20, 2012 at 6:45 AM, Sean McNamara <sean.mcnam...@webtrends.com> wrote:
Is there a way to prevent LOAD DATA LOCAL INPATH from appending _copy_1 to logs 
that already exist in a partition?  If the log is already in HDFS/Hive I'd 
rather it fail and give me a return code or output saying that the log already 
exists.

For example, if I run these queries:
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"

I end up with:
test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2

However, if I use OVERWRITE it will nuke all the data in the partition 
(including test_a.bz2) and I end up with just:
test_b.bz2

I recall that older versions of hive would not do this.  How do I handle this 
case?  Is there a safe atomic way to do this?
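One workaround (not from this thread, and exactly the kind of pre-check Gabi 
would rather avoid) is to test HDFS for the file's basename before loading. 
This is a sketch only: it assumes the default warehouse location 
/user/hive/warehouse and the ds=/hr= partition directory layout from the 
queries above, and it is not atomic - a concurrent load between the check and 
the LOAD DATA can still slip through.

```shell
#!/usr/bin/env bash
# Hypothetical guard: refuse to LOAD DATA if a file with the same basename
# already exists in the target partition directory.
# WAREHOUSE is an assumption -- adjust it to your hive.metastore.warehouse.dir.
WAREHOUSE=${WAREHOUSE:-/user/hive/warehouse}

load_if_absent() {
  local file=$1 tbl=$2 ds=$3 hr=$4
  local dest="$WAREHOUSE/$tbl/ds=$ds/hr=$hr/$(basename "$file")"
  # hdfs dfs -test -e exits 0 if the path exists
  if hdfs dfs -test -e "$dest"; then
    echo "duplicate: $dest already exists, skipping load" >&2
    return 1
  fi
  hive -e "LOAD DATA LOCAL INPATH '$file' INTO TABLE $tbl PARTITION(ds='$ds', hr='$hr')"
}
```

With this, the third and fourth loads of test_b.bz2 would return 1 and print a 
'duplicate' message instead of silently creating test_b_copy_N.bz2, which can 
then be treated as the ignorable 'duplicate file' error Gabi described.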

Sean
