[jira] [Updated] (HIVE-2927) Allow escape character in get_json_object

2012-04-05 Thread Sean McNamara (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean McNamara updated HIVE-2927:


  Description: 
*Background:*
get_json_object extracts json objects from a json string based on a specified 
path.


*Problem:*
The current implementation of get_json_object can't see keys with a '.' in 
them.  Our data contains '.' in the keys, so we have to filter our json keys 
through a streaming script to replace '.' with '_'.
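
The workaround described above could look roughly like the following Python sketch. It is not the reporter's actual script: the function names and the one-JSON-document-per-line input format are assumptions made for illustration. Note that renaming can collide when a document already contains both "c.d" and "c_d".

```python
import json


def rename_keys(value):
    """Recursively replace '.' with '_' in every object key."""
    if isinstance(value, dict):
        return {key.replace(".", "_"): rename_keys(child)
                for key, child in value.items()}
    if isinstance(value, list):
        return [rename_keys(item) for item in value]
    return value


def filter_lines(lines):
    """Rewrite each non-empty input line (assumed: one JSON document per line)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.dumps(rename_keys(json.loads(line)))
```

A streaming job would feed it standard input, for example `for out in filter_lines(sys.stdin): print(out)`, after which the renamed key is reachable with the ordinary path "$.c_d".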


*Example:*
{{json = {"a":{"b": 1}, "c.d": 2}}}

{{get_json_object(json, "$.a.b") returns: 1}}
{{get_json_object(json, "$.c.d") returns: NULL}}

In the present implementation of get_json_object, c.d is not addressable.


*Proposal:*
The desired behavior would be to allow the JSON path to be escape-able, like so:

{{get_json_object(json, '$.c\\\.d') would return: 2}}

  was:
*Background:*
get_json_object extracts json objects from a json string based on a specified 
path.


*Problem:*
The current implementation of get_json_object can't see keys with a '.' in 
them.  Our data contains '.' in the keys, so we have to filter our json keys 
through a streaming script to replace '.' with '_'.


*Example:*
{{json = {"a":{"b": 1}, "c.d": 2}}}

{{get_json_object(json, "a.b") returns: 1}}
{{get_json_object(json, "c.d") returns: NULL}}

In the present implementation of get_json_object, c.d is not addressable.


*Proposal:*
The desired behavior would be to allow the JSON path to be escape-able, like so:

{{get_json_object(json, '$.c\\\.d') would return: 2}}

Affects Version/s: 0.8.1
Fix Version/s: 0.9.0

> Allow escape character in get_json_object
> -
>
> Key: HIVE-2927
> URL: https://issues.apache.org/jira/browse/HIVE-2927
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 0.8.1
>Reporter: Sean McNamara
> Fix For: 0.9.0
>
> Attachments: HIVE-2927.1.patch.txt
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> *Background:*
> get_json_object extracts json objects from a json string based on a specified 
> path.
> *Problem:*
> The current implementation of get_json_object can't see keys with a '.' in 
> them.  Our data contains '.' in the keys, so we have to filter our json keys 
> through a streaming script to replace '.' with '_'.
> *Example:*
> {{json = {"a":{"b": 1}, "c.d": 2}}}
> {{get_json_object(json, "$.a.b") returns: 1}}
> {{get_json_object(json, "$.c.d") returns: NULL}}
> In the present implementation of get_json_object, c.d is not addressable.
> *Proposal:*
> The desired behavior would be to allow the JSON path to be escape-able, like 
> so:
> {{get_json_object(json, '$.c\\\.d') would return: 2}}
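
The proposed behavior can be illustrated with a small Python sketch. This is not the code in HIVE-2927.1.patch.txt and not Hive's implementation: split_path and lookup are hypothetical names, the sketch handles only plain object keys (no array subscripts or wildcards), and it assumes the splitter receives a single backslash before the dot after any string-literal unescaping has already happened.

```python
import json


def split_path(path):
    """Split a '$.'-rooted path on dots that are not escaped with a backslash."""
    if not path.startswith("$."):
        raise ValueError("path must start with '$.'")
    keys = []
    current = []
    chars = iter(path[2:])
    for ch in chars:
        if ch == "\\":
            # Escape: take the next character literally (a trailing
            # backslash is kept as a literal backslash).
            current.append(next(chars, "\\"))
        elif ch == ".":
            # Unescaped dot: end of the current key.
            keys.append("".join(current))
            current = []
        else:
            current.append(ch)
    keys.append("".join(current))
    return keys


def lookup(json_text, path):
    """Walk the parsed document key by key; return None when a key is missing."""
    node = json.loads(json_text)
    for key in split_path(path):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node
```

With the example document from the description, lookup(doc, "$.a.b") finds 1, lookup(doc, "$.c.d") finds nothing because the path is split into "c" and "d", and lookup(doc, r"$.c\.d") finds 2 because the escaped dot stays inside the key "c.d".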

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HIVE-2927) Allow escape character in get_json_object

2012-04-05 Thread Sean McNamara (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean McNamara updated HIVE-2927:


Component/s: (was: Security)
 Serializers/Deserializers

> Allow escape character in get_json_object
> -
>
> Key: HIVE-2927
> URL: https://issues.apache.org/jira/browse/HIVE-2927
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Reporter: Sean McNamara
> Attachments: HIVE-2927.1.patch.txt
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> *Background:*
> get_json_object extracts json objects from a json string based on a specified 
> path.
> *Problem:*
> The current implementation of get_json_object can't see keys with a '.' in 
> them.  Our data contains '.' in the keys, so we have to filter our json keys 
> through a streaming script to replace '.' with '_'.
> *Example:*
> {{json = {"a":{"b": 1}, "c.d": 2}}}
> {{get_json_object(json, "a.b") returns: 1}}
> {{get_json_object(json, "c.d") returns: NULL}}
> In the present implementation of get_json_object, c.d is not addressable.
> *Proposal:*
> The desired behavior would be to allow the JSON path to be escape-able, like 
> so:
> {{get_json_object(json, '$.c\\\.d') would return: 2}}





[jira] [Updated] (HIVE-2927) Allow escape character in get_json_object

2012-04-05 Thread Sean McNamara (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean McNamara updated HIVE-2927:



Needs code review. thnx!

> Allow escape character in get_json_object
> -
>
> Key: HIVE-2927
> URL: https://issues.apache.org/jira/browse/HIVE-2927
> Project: Hive
>  Issue Type: Improvement
>  Components: Security
>Reporter: Sean McNamara
> Attachments: HIVE-2927.1.patch.txt
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> *Background:*
> get_json_object extracts json objects from a json string based on a specified 
> path.
> *Problem:*
> The current implementation of get_json_object can't see keys with a '.' in 
> them.  Our data contains '.' in the keys, so we have to filter our json keys 
> through a streaming script to replace '.' with '_'.
> *Example:*
> {{json = {"a":{"b": 1}, "c.d": 2}}}
> {{get_json_object(json, "a.b") returns: 1}}
> {{get_json_object(json, "c.d") returns: NULL}}
> In the present implementation of get_json_object, c.d is not addressable.
> *Proposal:*
> The desired behavior would be to allow the JSON path to be escape-able, like 
> so:
> {{get_json_object(json, '$.c\\\.d') would return: 2}}





[jira] [Updated] (HIVE-2927) Allow escape character in get_json_object

2012-04-05 Thread Sean McNamara (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean McNamara updated HIVE-2927:


Attachment: HIVE-2927.1.patch.txt

Patch adds the ability to escape '.' in JSON keys by writing it as '\\.' in the path.

> Allow escape character in get_json_object
> -
>
> Key: HIVE-2927
> URL: https://issues.apache.org/jira/browse/HIVE-2927
> Project: Hive
>  Issue Type: Improvement
>  Components: Security
>Reporter: Sean McNamara
> Attachments: HIVE-2927.1.patch.txt
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> *Background:*
> get_json_object extracts json objects from a json string based on a specified 
> path.
> *Problem:*
> The current implementation of get_json_object can't see keys with a '.' in 
> them.  Our data contains '.' in the keys, so we have to filter our json keys 
> through a streaming script to replace '.' with '_'.
> *Example:*
> {{json = {"a":{"b": 1}, "c.d": 2}}}
> {{get_json_object(json, "a.b") returns: 1}}
> {{get_json_object(json, "c.d") returns: NULL}}
> In the present implementation of get_json_object, c.d is not addressable.
> *Proposal:*
> The desired behavior would be to allow the JSON path to be escape-able, like 
> so:
> {{get_json_object(json, '$.c\\\.d') would return: 2}}





[jira] [Updated] (HIVE-2927) Allow escape character in get_json_object

2012-04-05 Thread Sean McNamara (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean McNamara updated HIVE-2927:


Description: 
*Background:*
get_json_object extracts json objects from a json string based on a specified 
path.


*Problem:*
The current implementation of get_json_object can't see keys with a '.' in 
them.  Our data contains '.' in the keys, so we have to filter our json keys 
through a streaming script to replace '.' with '_'.


*Example:*
{{json = {"a":{"b": 1}, "c.d": 2}}}

{{get_json_object(json, "a.b") returns: 1}}
{{get_json_object(json, "c.d") returns: NULL}}

In the present implementation of get_json_object, c.d is not addressable.


*Proposal:*
The desired behavior would be to allow the JSON path to be escape-able, like so:

{{get_json_object(json, '$.c\\\.d') would return: 2}}

  was:
*Background:*
get_json_object extracts json objects from a json string based on a specified 
path.


*Problem:*
The current implementation of get_json_object can't see keys with a '.' in 
them.  Our data contains '.' in the keys, so we have to filter our json keys 
through a streaming script to replace '.' with '_'.


*Example:*
{{json = {"a":{"b": 1}, "c.d": 2}}}

{{get_json_object(json, "a.b") returns: 1}}
{{get_json_object(json, "c.d") returns: NULL}}

In the present implementation of get_json_object, c.d is not addressable.


*Proposal:*
The desired behavior would be to allow the JSON path to be escape-able, like so:

{{get_json_object(json, '$.c.d') would return: 2}}


> Allow escape character in get_json_object
> -
>
> Key: HIVE-2927
> URL: https://issues.apache.org/jira/browse/HIVE-2927
> Project: Hive
>  Issue Type: Improvement
>  Components: Security
>Reporter: Sean McNamara
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> *Background:*
> get_json_object extracts json objects from a json string based on a specified 
> path.
> *Problem:*
> The current implementation of get_json_object can't see keys with a '.' in 
> them.  Our data contains '.' in the keys, so we have to filter our json keys 
> through a streaming script to replace '.' with '_'.
> *Example:*
> {{json = {"a":{"b": 1}, "c.d": 2}}}
> {{get_json_object(json, "a.b") returns: 1}}
> {{get_json_object(json, "c.d") returns: NULL}}
> In the present implementation of get_json_object, c.d is not addressable.
> *Proposal:*
> The desired behavior would be to allow the JSON path to be escape-able, like 
> so:
> {{get_json_object(json, '$.c\\\.d') would return: 2}}





[jira] [Updated] (HIVE-2927) Allow escape character in get_json_object

2012-04-05 Thread Sean McNamara (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean McNamara updated HIVE-2927:


  Component/s: (was: Serializers/Deserializers)
   Security
  Description: 
*Background:*
get_json_object extracts json objects from a json string based on a specified 
path.


*Problem:*
The current implementation of get_json_object can't see keys with a '.' in 
them.  Our data contains '.' in the keys, so we have to filter our json keys 
through a streaming script to replace '.' with '_'.


*Example:*
{{json = {"a":{"b": 1}, "c.d": 2}}}

{{get_json_object(json, "a.b") returns: 1}}
{{get_json_object(json, "c.d") returns: NULL}}

In the present implementation of get_json_object, c.d is not addressable.


*Proposal:*
The desired behavior would be to allow the JSON path to be escape-able, like so:

{{get_json_object(json, '$.c.d') would return: 2}}

  was:
*Background:*
get_json_object extracts json objects from a json string based on a specified 
path.


*Problem:*
The current implementation of get_json_object can't see keys with a '.' in 
them.  Our data contains '.' in the keys, so we have to filter our json keys 
through a streaming script to replace '.' with '_'.


*Example:*
{{json = {"a":{"b": 1}, "c.d": 2}}}

{{get_json_object(json, "a.b") returns: 1}}
{{get_json_object(json, "c.d") returns: NULL}}

In the present implementation of get_json_object, c.d is not addressable.


*Proposal:*
The desired behavior would be to allow the JSON path to be escape-able, like so:

{{get_json_object(json, '$.c\\.d') would return: 2}}

Affects Version/s: (was: 0.8.1)
Fix Version/s: (was: 0.9.0)

> Allow escape character in get_json_object
> -
>
> Key: HIVE-2927
> URL: https://issues.apache.org/jira/browse/HIVE-2927
> Project: Hive
>  Issue Type: Improvement
>  Components: Security
>Reporter: Sean McNamara
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> *Background:*
> get_json_object extracts json objects from a json string based on a specified 
> path.
> *Problem:*
> The current implementation of get_json_object can't see keys with a '.' in 
> them.  Our data contains '.' in the keys, so we have to filter our json keys 
> through a streaming script to replace '.' with '_'.
> *Example:*
> {{json = {"a":{"b": 1}, "c.d": 2}}}
> {{get_json_object(json, "a.b") returns: 1}}
> {{get_json_object(json, "c.d") returns: NULL}}
> In the present implementation of get_json_object, c.d is not addressable.
> *Proposal:*
> The desired behavior would be to allow the JSON path to be escape-able, like 
> so:
> {{get_json_object(json, '$.c.d') would return: 2}}





[jira] [Updated] (HIVE-2889) LOAD DATA IF NOT EXISTS functionality

2012-03-21 Thread Sean McNamara (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean McNamara updated HIVE-2889:


Description: 
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give you an 
error when trying to copy in a log that already existed.  Now it re-names the 
file with copy_1 so the file always goes into hdfs.


*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E


*Issue:*
There is no longer an atomic way to insert files into hive and guarantee that 
the file won't go in twice.  Using OVERWRITE will cause other logs in the 
table/partition to be deleted.


*Example:*
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2}}
{{test_b.bz2}}
{{test_b_copy_1.bz2}}
{{test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this 
instance._


*Proposal:*
Add an _IF NOT EXISTS_ flag to indicate copy semantics.  If the log file does 
not exist in the table/partition, the log would go in normally.  If the log 
does exist in the table/partition, hive would return an error and a non-zero 
exit code.


*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs 
PARTITION(ds='2012-03-19', hr='23')}}
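
The difference between the current rename-on-clash behavior and the proposed semantics can be sketched in Python. This is only an illustration under stated assumptions: a local directory stands in for the table/partition directory in HDFS, and load_with_rename and load_if_not_exists are hypothetical names, not Hive functions.

```python
import os
import shutil


def load_with_rename(src, partition_dir):
    """Mimic the behavior described above: never fail, add _copy_N on a name clash."""
    base = os.path.basename(src)
    stem, ext = os.path.splitext(base)
    dest = os.path.join(partition_dir, base)
    n = 0
    while os.path.exists(dest):
        n += 1
        dest = os.path.join(partition_dir, f"{stem}_copy_{n}{ext}")
    shutil.copyfile(src, dest)
    return dest


def load_if_not_exists(src, partition_dir):
    """Proposed semantics: copy only if the name is free, otherwise raise."""
    dest = os.path.join(partition_dir, os.path.basename(src))
    # Mode "xb" creates the file exclusively and raises FileExistsError if it
    # already exists, so the existence check and the creation are one step.
    with open(src, "rb") as reader, open(dest, "xb") as writer:
        shutil.copyfileobj(reader, writer)
    return dest
```

Calling load_with_rename three times with test_b.bz2 leaves test_b.bz2, test_b_copy_1.bz2 and test_b_copy_2.bz2 in the directory, matching the result listed above. A second call to load_if_not_exists with the same file raises instead, which a wrapper script could turn into a non-zero exit code.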

  was:
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give you an 
error when trying to copy in a log that already existed.  Now it re-names the 
file with copy_1 so the file always goes into hdfs.


*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E


*Issue:*
There is no longer an atomic way to insert files into hive and guarantee that 
the file won't go in twice.  Using OVERWRITE will cause other logs in the 
table/partition to be deleted.


*Example:*
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this 
instance._


*Proposal:*
Add an _IF NOT EXISTS_ flag to indicate copy semantics.  If the log file does 
not exist in the table/partition, the log would go in normally.  If the log 
does exist in the table/partition, hive would return an error and a non-zero 
exit code.


*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs 
PARTITION(ds='2012-03-19', hr='23')}}


> LOAD DATA IF NOT EXISTS functionality
> -
>
> Key: HIVE-2889
> URL: https://issues.apache.org/jira/browse/HIVE-2889
> Project: Hive
>  Issue Type: Improvement
>  Components: Import/Export
>Affects Versions: 0.8.1
>Reporter: Sean McNamara
> Fix For: 0.9.0
>
>
> *Background:*
> The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give you an 
> error when trying to copy in a log that already existed.  Now it re-names the 
> file with copy_1 so the file always goes into hdfs.
> *Original discussion:*
> http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E
> *Issue:*
> There is no longer an atomic way to insert files into hive and guarantee that 
> the file won't go in twice.  Using OVERWRITE will cause other logs in the 
> table/partition to be deleted.
> *Example:*
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE 
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> *Result:*
> {{test_a.bz2}}
> {{test_

[jira] [Updated] (HIVE-2889) LOAD DATA IF NOT EXISTS functionality

2012-03-21 Thread Sean McNamara (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean McNamara updated HIVE-2889:


Description: 
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give you an 
error when trying to copy in a log that already existed.  Now it re-names the 
file with copy_1 so the file always goes into hdfs.


*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E


*Issue:*
There is no longer an atomic way to insert files into hive and guarantee that 
the file won't go in twice.  Using OVERWRITE will cause other logs in the 
table/partition to be deleted.


*Example:*
{{usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this 
instance._


*Proposal:*
Add an _IF NOT EXISTS_ flag to indicate copy semantics.  If the log file does 
not exist in the table/partition, the log would go in normally.  If the log 
does exist in the table/partition, hive would return an error and a non-zero 
exit code.


*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs 
PARTITION(ds='2012-03-19', hr='23')}}

  was:
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give you an 
error when trying to copy in a log that already existed.  Now it re-names the 
file with copy_1 so the file always goes into hdfs.


*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E


*Issue:*
There is no longer an atomic way to insert files into hive and guarantee that 
the file won't go in twice.  Using OVERWRITE will cause other logs in the 
table/partition to be deleted.


*Example:*
{{usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this 
instance._


*Proposal:*
Add an _IF NOT EXISTS_ flag to indicate copy semantics.  If the log file does 
not exist in the table/partition, the log would go in normally.  If the log 
does exist in the table/partition, hive would return an error and a non-zero 
exit code.


*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs 
PARTITION(ds='2012-03-19', hr='23')}}



> LOAD DATA IF NOT EXISTS functionality
> -
>
> Key: HIVE-2889
> URL: https://issues.apache.org/jira/browse/HIVE-2889
> Project: Hive
>  Issue Type: Improvement
>  Components: Import/Export
>Affects Versions: 0.8.1
>Reporter: Sean McNamara
> Fix For: 0.9.0
>
>
> *Background:*
> The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give you an 
> error when trying to copy in a log that already existed.  Now it re-names the 
> file with copy_1 so the file always goes into hdfs.
> *Original discussion:*
> http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E
> *Issue:*
> There is no longer an atomic way to insert files into hive and guarantee that 
> the file won't go in twice.  Using OVERWRITE will cause other logs in the 
> table/partition to be deleted.
> *Example:*
> {{usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE 
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> *Result:*
> {{test_a.bz2
> test_b.bz2
> test_b_copy_1.bz2
> te

[jira] [Updated] (HIVE-2889) LOAD DATA IF NOT EXISTS functionality

2012-03-21 Thread Sean McNamara (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean McNamara updated HIVE-2889:


Description: 
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give you an 
error when trying to copy in a log that already existed.  Now it re-names the 
file with copy_1 so the file always goes into hdfs.


*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E


*Issue:*
There is no longer an atomic way to insert files into hive and guarantee that 
the file won't go in twice.  Using OVERWRITE will cause other logs in the 
table/partition to be deleted.


*Example:*
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this 
instance._


*Proposal:*
Add an _IF NOT EXISTS_ flag to indicate copy semantics.  If the log file does 
not exist in the table/partition, the log would go in normally.  If the log 
does exist in the table/partition, hive would return an error and a non-zero 
exit code.


*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs 
PARTITION(ds='2012-03-19', hr='23')}}

  was:
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give you an 
error when trying to copy in a log that already existed.  Now it re-names the 
file with copy_1 so the file always goes into hdfs.


*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E


*Issue:*
There is no longer an atomic way to insert files into hive and guarantee that 
the file won't go in twice.  Using OVERWRITE will cause other logs in the 
table/partition to be deleted.


*Example:*
{{usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this 
instance._


*Proposal:*
Add an _IF NOT EXISTS_ flag to indicate copy semantics.  If the log file does 
not exist in the table/partition, the log would go in normally.  If the log 
does exist in the table/partition, hive would return an error and a non-zero 
exit code.


*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs 
PARTITION(ds='2012-03-19', hr='23')}}


> LOAD DATA IF NOT EXISTS functionality
> -
>
> Key: HIVE-2889
> URL: https://issues.apache.org/jira/browse/HIVE-2889
> Project: Hive
>  Issue Type: Improvement
>  Components: Import/Export
>Affects Versions: 0.8.1
>Reporter: Sean McNamara
> Fix For: 0.9.0
>
>
> *Background:*
> The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give you an 
> error when trying to copy in a log that already existed.  Now it re-names the 
> file with copy_1 so the file always goes into hdfs.
> *Original discussion:*
> http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E
> *Issue:*
> There is no longer an atomic way to insert files into hive and guarantee that 
> the file won't go in twice.  Using OVERWRITE will cause other logs in the 
> table/partition to be deleted.
> *Example:*
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE 
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE 
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> *Result:*
> {{test_a.bz2
> test_b.bz2
> test_b_co