Re: LOAD DATA problem

2012-03-21 Thread Sean McNamara
I have filed a JIRA that describes the desired 'IF NOT EXISTS' functionality:

https://issues.apache.org/jira/browse/HIVE-2889




Re: LOAD DATA problem

2012-03-21 Thread Gabi D
We also do the check before loading the file into hive, but we're not very
happy with this solution. A hack on the backend is better since a hack on
the front end has to happen for every file while a hack on the backend
would actually happen only for duplicate files, so performance-wise the
backend is better (though impossible at the moment).
'if not exists' would have been great...



Re: LOAD DATA problem

2012-03-20 Thread Edward Capriolo
The syntax would be 'LOAD DATA [IF NOT EXISTS] INPATH'. It is a good suggestion.

In hindsight it would have been better to add new syntax for the
file-renaming feature rather than changing the current behaviour. Although
the change of behaviour sucks for you (and I am sorry about that), I
believe the new behaviour is the better default.

Either you need a 'hack' on the front end before you load the file, or
a 'hack' on the back end to catch the exception after the conflict, or
you have to expand hive's syntax to support both (also unattractive
for a couple of reasons).

Our hive 'workflows' are wrapped in a good amount of Groovy. We have
contemplated just going crazy and writing some domain-specific
language and teaching it to hive, but we just hacked up some Groovy and
went on with our stuff.
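
A sketch of how such a statement might read (hypothetical syntax only; at
the time of this thread Hive's LOAD DATA grammar has no IF NOT EXISTS
clause, which is exactly what HIVE-2889 asks for):

```sql
-- Hypothetical: skip the load (or fail with a distinct code) when a file
-- of the same name is already in the target partition, instead of
-- silently writing a _copy_N duplicate.
LOAD DATA IF NOT EXISTS LOCAL INPATH 'test_b.bz2'
INTO TABLE logs PARTITION (ds='2012-03-19', hr='23');
```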



Re: LOAD DATA problem

2012-03-20 Thread Sean McNamara
> Still, what I think Sean is asking for, as well as am I, is the option to 
> tell Hive to reject duplicate files altogether

Exactly this.


I would expect the default behavior of LOAD DATA LOCAL INPATH to either:

  *   Throw an error if the file already exists in hive/hdfs and return an exit 
code (what it used to do)
  *   Re-copy over the existing file (less preferable, but it would be nice
if there were a flag to do this)

For now as a hack I first check if the file already exists in hdfs before I 
load in the data. Something that is built-in and atomic would be ideal.

Sean
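
The hack described above can be sketched as a small wrapper. This is an
illustration, not Hive code: `load_if_absent` and the in-memory "HDFS" set
are invented for the example, and in a real pipeline the two injected
callables would shell out to `hadoop fs -test -e <path>` and
`hive -e "LOAD DATA LOCAL INPATH ..."` respectively:

```python
# Illustrative front-end duplicate check, not Hive code: refuse to load a
# file whose basename already exists in the target partition directory.
import os

def load_if_absent(local_path, partition_dir, exists, load):
    """Load local_path unless its basename is already in partition_dir.

    exists(hdfs_path) -> bool and load(local_path, hdfs_path) are injected
    so the policy can be shown without a cluster; really they would wrap
    `hadoop fs -test -e` and `hive -e "LOAD DATA LOCAL INPATH ..."`.
    """
    target = partition_dir.rstrip("/") + "/" + os.path.basename(local_path)
    if exists(target):
        return False  # duplicate name: report it, do not reload
    load(local_path, target)
    return True

# In-memory stand-in for HDFS, to demonstrate the behaviour:
fs = set()
first = load_if_absent("test_b.bz2", "/logs/ds=2012-03-19/hr=23",
                       exists=lambda p: p in fs,
                       load=lambda local, tgt: fs.add(tgt))
second = load_if_absent("test_b.bz2", "/logs/ds=2012-03-19/hr=23",
                        exists=lambda p: p in fs,
                        load=lambda local, tgt: fs.add(tgt))
# first is True (loaded), second is False (rejected as a duplicate)
```

Note that the check and the load remain two separate steps, so the pair is
not atomic; that gap is why something built into the LOAD statement itself
would be preferable.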



Re: LOAD DATA problem

2012-03-20 Thread Gabi D
Hi Edward,
thanks for looking into this.
What the fix for HIVE-2296 does is not so good. It kind of messes with my
filename, so better to concatenate it as *.*copy_n.gz (rather than
*_*copy_n.gz), but that request might be considered petty...
Still, what I think Sean is asking for, as well as am I, is the option to
tell Hive to reject duplicate files altogether (returning an error code
preferably). Could be by some addition to the syntax or a hive setup
parameter, doesn't really matter.
Will also look into hive query hooks as you suggested.

On Tue, Mar 20, 2012 at 3:05 PM, Edward Capriolo wrote:

> The copy_n should have been fixed in 0.8.0
>
> https://issues.apache.org/jira/browse/HIVE-2296
>
> On Tue, Mar 20, 2012 at 4:12 AM, Sean McNamara
>  wrote:
> > Gabi-
> >
> > Glad to know I'm not the only one scratching my head on this one!  The
> > changed behavior caught us off guard.
> >
> > I haven't found a solution in my sleuthing tonight.  Indeed, any help
> would
> > be greatly appreciated on this!
> >
> > Sean
> >
> > From: Gabi D 
> > Reply-To: 
> > Date: Tue, 20 Mar 2012 10:03:04 +0200
> > To: 
> > Subject: Re: LOAD DATA problem
> >
> > Hi Vikas,
> > we are facing the same problem that Sean reported and have also noticed
> that
> > this behavior changed with a newer version of hive. Previously, when you
> > inserted a file with the same name into a partition/table, hive would
> fail
> > the request (with yet another of its cryptic messages, an issue in
> itself)
> > while now it does load the file and adds the _copy_N addition to the
> suffix.
> > I have to say that, normally, we do not check for existence of a file
> with
> > the same name in our hdfs directories. Our files arrive with unique names
> > and if we try to insert the same file again it is because of some
> failure in
> > one of the steps in our flow (e.g., files that were handled and loaded
> into
> > hive have not been removed from our work directory for some reason hence
> in
> > the next run of our load process they were reloaded). We do not want to
> add
> > a step that checks whether a file with the same name already exists in
> hdfs
> > - this is costly and most of the time (hopefully all of it) unnecessary.
> > What we would like is to get some 'duplicate file' error and be able to
> > disregard it, knowing that the file is already safely in its place.
> > Note, that having duplicate files causes us to double count rows which is
> > unacceptable for many applications.
> > Moreover, we use gz files and since this behavior changes the suffix of
> the
> > file (from gz to gz_copy_N) when this happens we seem to be getting all
> > sorts of strange data since hadoop can't recognize that this is a zipped
> > file and does not decompress it before reading it ...
> > Any help or suggestions on this issue would be much appreciated, we have
> > been unable to find any so far.
> >
> >
> > On Tue, Mar 20, 2012 at 9:29 AM, hadoop hive 
> wrote:
> >>
> >> hey Sean,
> >>
> >> it's because you are appending a file into the same partition with the
> >> same name (which is not possible); you must change the file name before
> >> appending into the same partition.
> >>
> >> AFAIK, I don't think there is any other way to do that; you can either
> >> change the partition name or the file name.
> >>
> >> Thanks
> >> Vikas Srivastava
> >>
> >>
> >> On Tue, Mar 20, 2012 at 6:45 AM, Sean McNamara
> >>  wrote:
> >>>
> >>> Is there a way to prevent LOAD DATA LOCAL INPATH from appending _copy_1
> >>> to logs that already exist in a partition?  If the log is already in
> >>> hdfs/hive I'd rather it fail and give me a return code or output
> saying
> >>> that the log already exists.
> >>>
> >>> For example, if I run these queries:
> >>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO
> >>> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
> >>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO
> >>> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
> >>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO
> >>> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
> >>> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO
> >>> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
> >>>
> >>> I end up with:
> >>> test_a.bz2
> >>> test_b.bz2
> >>> test_b_copy_1.bz2
> >>> test_b_copy_2.bz2
> >>>
> >>> However, If I use OVERWRITE it will nuke all the data in the partition
> >>> (including test_a.bz2) and I end up with just:
> >>> test_b.bz2
> >>>
> >>> I recall that older versions of hive would not do this.  How do I
> handle
> >>> this case?  Is there a safe atomic way to do this?
> >>>
> >>> Sean
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >
>


Re: LOAD DATA problem

2012-03-20 Thread Edward Capriolo
The copy_n should have been fixed in 0.8.0

https://issues.apache.org/jira/browse/HIVE-2296



Re: LOAD DATA problem

2012-03-20 Thread Edward Capriolo
By now you all have realized that the load-file semantics have
changed. I cannot find the exact issue, but here is a related change.


   * [HIVE-306] - Support "INSERT [INTO] destination"

I do not see a way out of this without code. Maybe you could code up a
hive query hook for this.

It definitely makes a good point that appending _copy_n after the .gz
is bad, since that will confuse TextInputFormat, which relies on the
file extension to choose a decompressor. I will open an issue on that.
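
The point about the extension can be shown with a toy model of
suffix-based codec selection (the lookup table below is illustrative, not
Hadoop's actual codec registry):

```python
# Hadoop picks a decompression codec from the file-name suffix, so moving
# the extension away from the end of the name disables decompression.
CODECS = {".gz": "GzipCodec", ".bz2": "BZip2Codec"}

def codec_for(filename):
    for suffix, codec in CODECS.items():
        if filename.endswith(suffix):
            return codec
    return None  # no codec matched: compressed bytes get read as plain text

codec_for("events.gz")          # "GzipCodec" - decompressed correctly
codec_for("events.gz_copy_1")   # None - raw gzip bytes read as garbage
codec_for("events_copy_1.gz")   # "GzipCodec" - renaming before the
                                # extension keeps codec detection working
```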



Re: LOAD DATA problem

2012-03-20 Thread Sean McNamara
Gabi-

Glad to know I'm not the only one scratching my head on this one!  The changed 
behavior caught us off guard.

I haven't found a solution in my sleuthing tonight.  Indeed, any help would be 
greatly appreciated on this!

Sean

From: Gabi D mailto:gabi...@gmail.com>>
Reply-To: mailto:user@hive.apache.org>>
Date: Tue, 20 Mar 2012 10:03:04 +0200
To: mailto:user@hive.apache.org>>
Subject: Re: LOAD DATA problem


Re: LOAD DATA problem

2012-03-20 Thread Gabi D
Hi Vikas,
we are facing the same problem that Sean reported, and we have also noticed
that this behavior changed with a newer version of Hive. Previously, when you
inserted a file with the same name into a partition/table, Hive would fail the
request (with yet another of its cryptic messages, an issue in itself); now it
loads the file and appends _copy_N to the name.
I should say that we do not normally check for the existence of a file with
the same name in our HDFS directories. Our files arrive with unique names, and
if we try to insert the same file again it is because of some failure in one
of the steps of our flow (e.g., files that were already handled and loaded
into Hive were not removed from our work directory for some reason, so the
next run of our load process reloaded them). We do not want to add a step that
checks whether a file with the same name already exists in HDFS - it is costly
and most of the time (hopefully all of it) unnecessary. What we would like is
to get some 'duplicate file' error that we can disregard, knowing that the
file is already safely in place.
Note that having duplicate files causes us to double-count rows, which is
unacceptable for many applications.
Moreover, we use gz files, and since this behavior changes the suffix of the
file (from gz to gz_copy_N), Hadoop no longer recognizes it as a compressed
file and does not decompress it before reading, so we seem to get all sorts of
strange data.
Any help or suggestions on this issue would be much appreciated; we have been
unable to find any so far.
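
(For the record, a minimal front-end guard of the kind described above could
look like the sketch below. The warehouse and partition paths are assumptions
about a typical layout, not something Hive prescribes.)

```shell
#!/bin/sh
# Sketch of the per-file front-end check (the step Gabi would rather avoid).
# The warehouse/partition paths below are assumptions about a typical layout.
FILE="test_b.bz2"
PART_DIR="/user/hive/warehouse/logs/ds=2012-03-19/hr=23"

# 'hadoop fs -test -e' exits 0 when the path exists
if hadoop fs -test -e "$PART_DIR/$FILE"; then
    echo "duplicate file: $PART_DIR/$FILE already exists, skipping load" >&2
    exit 1
fi

/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH '$FILE' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
```

This trades one extra HDFS round trip per file for never producing a _copy_N
duplicate - exactly the cost the thread is debating.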




Re: LOAD DATA problem

2012-03-20 Thread hadoop hive
hey Sean,

it's because you are appending a file with the same name into the same
partition (which is not possible); you must change the file name before
appending it into the same partition.

AFAIK there is no other way to do it: you can change either the partition
name or the file name.

Thanks
Vikas Srivastava
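
(A sketch of the rename-before-load approach suggested above; the
epoch-timestamp suffix is just one hypothetical way to make the name unique
while keeping the compression extension intact:)

```shell
#!/bin/sh
# Rename the local file to a unique name before LOAD DATA, so a re-run
# never collides with a file already sitting in the partition directory.
SRC="test_b.bz2"
# Preserve the .bz2 extension so Hadoop still treats the file as compressed
# (Gabi's gz_copy_N files lost their recognizable extension).
UNIQ="${SRC%.bz2}_$(date +%s).bz2"
mv "$SRC" "$UNIQ"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH '$UNIQ' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
```

Note the tradeoff: this sidesteps the _copy_N suffix problem, but a re-run
still loads the data twice, so it does not help with the double counting
discussed earlier in the thread.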


On Tue, Mar 20, 2012 at 6:45 AM, Sean McNamara
wrote:

>  Is there a way to prevent LOAD DATA LOCAL INPATH from appending _copy_1
> to logs that already exist in a partition?  If the log is already in
> hdfs/hive I'd rather it fail and give me a return code or output saying
> that the log already exists.
>
>  For example, if I run these queries:
> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO
> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO
> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO
> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
> /usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO
> TABLE logs PARTITION(ds='2012-03-19', hr='23')"
>
>  I end up with:
> test_a.bz2
> test_b.bz2
> test_b_copy_1.bz2
> test_b_copy_2.bz2
>
>  However, if I use OVERWRITE it will nuke all the data in the partition
> (including test_a.bz2) and I end up with just:
> test_b.bz2
>
>  I recall that older versions of hive would not do this.  How do I handle
> this case?  Is there a safe atomic way to do this?
>
>  Sean