[jira] [Updated] (HIVE-2927) Allow escape character in get_json_object
[ https://issues.apache.org/jira/browse/HIVE-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean McNamara updated HIVE-2927:

Description:
*Background:*
get_json_object extracts json objects from a json string based on a specified path.

*Problem:*
The current implementation of get_json_object can't see keys with a '.' in them. Our data contains '.' in the keys, so we have to filter our json keys through a streaming script to replace '.' with '_'.

*Example:*
{{json = {"a":{"b": 1}, "c.d": 2}}}
{{get_json_object(json, "$.a.b") returns: 1}}
{{get_json_object(json, "$.c.d") returns: NULL}}

In the present implementation of get_json_object, c.d is not addressable.

*Proposal:*
The desired behavior would be to allow the JSON path to be escape-able, like so:
{{get_json_object(json, '$.c\\\.d') would return: 2}}

was:
*Background:*
get_json_object extracts json objects from a json string based on a specified path.

*Problem:*
The current implementation of get_json_object can't see keys with a '.' in them. Our data contains '.' in the keys, so we have to filter our json keys through a streaming script to replace '.' with '_'.

*Example:*
{{json = {"a":{"b": 1}, "c.d": 2}}}
{{get_json_object(json, "a.b") returns: 1}}
{{get_json_object(json, "c.d") returns: NULL}}

In the present implementation of get_json_object, c.d is not addressable.

*Proposal:*
The desired behavior would be to allow the JSON path to be escape-able, like so:
{{get_json_object(json, '$.c\\\.d') would return: 2}}

Affects Version/s: 0.8.1
Fix Version/s: 0.9.0

> Allow escape character in get_json_object
> -
>
> Key: HIVE-2927
> URL: https://issues.apache.org/jira/browse/HIVE-2927
> Project: Hive
> Issue Type: Improvement
> Components: Serializers/Deserializers
> Affects Versions: 0.8.1
> Reporter: Sean McNamara
> Fix For: 0.9.0
>
> Attachments: HIVE-2927.1.patch.txt
>
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> *Background:*
> get_json_object extracts json objects from a json string based on a specified
> path.
> *Problem:*
> The current implementation of get_json_object can't see keys with a '.' in
> them. Our data contains '.' in the keys, so we have to filter our json keys
> through a streaming script to replace '.' with '_'.
> *Example:*
> {{json = {"a":{"b": 1}, "c.d": 2}}}
> {{get_json_object(json, "$.a.b") returns: 1}}
> {{get_json_object(json, "$.c.d") returns: NULL}}
> In the present implementation of get_json_object, c.d is not addressable.
> *Proposal:*
> The desired behavior would be to allow the JSON path to be escape-able, like
> so:
> {{get_json_object(json, '$.c\\\.d') would return: 2}}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
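The escaping proposed in HIVE-2927 can be illustrated with a small Python sketch. This is not Hive's implementation and not the attached patch: the function names `split_path` and `get_json_value` are hypothetical, the path is assumed to start with `$.`, only object keys are handled (no array indexing), and a single backslash before a dot is taken as the escape.

```python
import json


def split_path(path):
    """Split a '$.'-rooted path on dots that are not preceded by a backslash."""
    if not path.startswith("$."):
        raise ValueError("path must start with '$.'")
    keys, current, i = [], [], 2
    while i < len(path):
        ch = path[i]
        if ch == "\\" and i + 1 < len(path) and path[i + 1] == ".":
            current.append(".")            # escaped dot stays inside the key
            i += 2
        elif ch == ".":
            keys.append("".join(current))  # unescaped dot ends the key
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    keys.append("".join(current))
    return keys


def get_json_value(json_text, path):
    """Return the value at path, or None (standing in for NULL) if absent."""
    node = json.loads(json_text)
    for key in split_path(path):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


doc = '{"a":{"b": 1}, "c.d": 2}'
print(get_json_value(doc, "$.a.b"))    # 1
print(get_json_value(doc, "$.c.d"))    # None: looks for key "c", then "d"
print(get_json_value(doc, r"$.c\.d"))  # 2: the escaped dot is part of the key
```

How many backslashes a query has to contain in practice depends on the quoting layers between the shell, HiveQL string literals, and the function itself, which this sketch does not model.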
[jira] [Updated] (HIVE-2927) Allow escape character in get_json_object
[ https://issues.apache.org/jira/browse/HIVE-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean McNamara updated HIVE-2927:

Component/s: (was: Security) Serializers/Deserializers

> Allow escape character in get_json_object
> -
>
> Key: HIVE-2927
> URL: https://issues.apache.org/jira/browse/HIVE-2927
> Project: Hive
> Issue Type: Improvement
> Components: Serializers/Deserializers
> Reporter: Sean McNamara
> Attachments: HIVE-2927.1.patch.txt
>
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> *Background:*
> get_json_object extracts json objects from a json string based on a specified
> path.
> *Problem:*
> The current implementation of get_json_object can't see keys with a '.' in
> them. Our data contains '.' in the keys, so we have to filter our json keys
> through a streaming script to replace '.' with '_'.
> *Example:*
> {{json = {"a":{"b": 1}, "c.d": 2}}}
> {{get_json_object(json, "a.b") returns: 1}}
> {{get_json_object(json, "c.d") returns: NULL}}
> In the present implementation of get_json_object, c.d is not addressable.
> *Proposal:*
> The desired behavior would be to allow the JSON path to be escape-able, like
> so:
> {{get_json_object(json, '$.c\\\.d') would return: 2}}
[jira] [Updated] (HIVE-2927) Allow escape character in get_json_object
[ https://issues.apache.org/jira/browse/HIVE-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean McNamara updated HIVE-2927:

Needs code review. thnx!

> Allow escape character in get_json_object
> -
>
> Key: HIVE-2927
> URL: https://issues.apache.org/jira/browse/HIVE-2927
> Project: Hive
> Issue Type: Improvement
> Components: Security
> Reporter: Sean McNamara
> Attachments: HIVE-2927.1.patch.txt
>
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> *Background:*
> get_json_object extracts json objects from a json string based on a specified
> path.
> *Problem:*
> The current implementation of get_json_object can't see keys with a '.' in
> them. Our data contains '.' in the keys, so we have to filter our json keys
> through a streaming script to replace '.' with '_'.
> *Example:*
> {{json = {"a":{"b": 1}, "c.d": 2}}}
> {{get_json_object(json, "a.b") returns: 1}}
> {{get_json_object(json, "c.d") returns: NULL}}
> In the present implementation of get_json_object, c.d is not addressable.
> *Proposal:*
> The desired behavior would be to allow the JSON path to be escape-able, like
> so:
> {{get_json_object(json, '$.c\\\.d') would return: 2}}
[jira] [Updated] (HIVE-2927) Allow escape character in get_json_object
[ https://issues.apache.org/jira/browse/HIVE-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean McNamara updated HIVE-2927:

Attachment: HIVE-2927.1.patch.txt

Patch adds ability to escape '.' in JSON keys using \\.

> Allow escape character in get_json_object
> -
>
> Key: HIVE-2927
> URL: https://issues.apache.org/jira/browse/HIVE-2927
> Project: Hive
> Issue Type: Improvement
> Components: Security
> Reporter: Sean McNamara
> Attachments: HIVE-2927.1.patch.txt
>
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> *Background:*
> get_json_object extracts json objects from a json string based on a specified
> path.
> *Problem:*
> The current implementation of get_json_object can't see keys with a '.' in
> them. Our data contains '.' in the keys, so we have to filter our json keys
> through a streaming script to replace '.' with '_'.
> *Example:*
> {{json = {"a":{"b": 1}, "c.d": 2}}}
> {{get_json_object(json, "a.b") returns: 1}}
> {{get_json_object(json, "c.d") returns: NULL}}
> In the present implementation of get_json_object, c.d is not addressable.
> *Proposal:*
> The desired behavior would be to allow the JSON path to be escape-able, like
> so:
> {{get_json_object(json, '$.c\\\.d') would return: 2}}
[jira] [Updated] (HIVE-2927) Allow escape character in get_json_object
[ https://issues.apache.org/jira/browse/HIVE-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean McNamara updated HIVE-2927:

Description:
*Background:*
get_json_object extracts json objects from a json string based on a specified path.

*Problem:*
The current implementation of get_json_object can't see keys with a '.' in them. Our data contains '.' in the keys, so we have to filter our json keys through a streaming script to replace '.' with '_'.

*Example:*
{{json = {"a":{"b": 1}, "c.d": 2}}}
{{get_json_object(json, "a.b") returns: 1}}
{{get_json_object(json, "c.d") returns: NULL}}

In the present implementation of get_json_object, c.d is not addressable.

*Proposal:*
The desired behavior would be to allow the JSON path to be escape-able, like so:
{{get_json_object(json, '$.c\\\.d') would return: 2}}

was:
*Background:*
get_json_object extracts json objects from a json string based on a specified path.

*Problem:*
The current implementation of get_json_object can't see keys with a '.' in them. Our data contains '.' in the keys, so we have to filter our json keys through a streaming script to replace '.' with '_'.

*Example:*
{{json = {"a":{"b": 1}, "c.d": 2}}}
{{get_json_object(json, "a.b") returns: 1}}
{{get_json_object(json, "c.d") returns: NULL}}

In the present implementation of get_json_object, c.d is not addressable.

*Proposal:*
The desired behavior would be to allow the JSON path to be escape-able, like so:
{{get_json_object(json, '$.c.d') would return: 2}}

> Allow escape character in get_json_object
> -
>
> Key: HIVE-2927
> URL: https://issues.apache.org/jira/browse/HIVE-2927
> Project: Hive
> Issue Type: Improvement
> Components: Security
> Reporter: Sean McNamara
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> *Background:*
> get_json_object extracts json objects from a json string based on a specified
> path.
> *Problem:*
> The current implementation of get_json_object can't see keys with a '.' in
> them. Our data contains '.' in the keys, so we have to filter our json keys
> through a streaming script to replace '.' with '_'.
> *Example:*
> {{json = {"a":{"b": 1}, "c.d": 2}}}
> {{get_json_object(json, "a.b") returns: 1}}
> {{get_json_object(json, "c.d") returns: NULL}}
> In the present implementation of get_json_object, c.d is not addressable.
> *Proposal:*
> The desired behavior would be to allow the JSON path to be escape-able, like
> so:
> {{get_json_object(json, '$.c\\\.d') would return: 2}}
[jira] [Updated] (HIVE-2927) Allow escape character in get_json_object
[ https://issues.apache.org/jira/browse/HIVE-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean McNamara updated HIVE-2927:

Component/s: (was: Serializers/Deserializers) Security

Description:
*Background:*
get_json_object extracts json objects from a json string based on a specified path.

*Problem:*
The current implementation of get_json_object can't see keys with a '.' in them. Our data contains '.' in the keys, so we have to filter our json keys through a streaming script to replace '.' with '_'.

*Example:*
{{json = {"a":{"b": 1}, "c.d": 2}}}
{{get_json_object(json, "a.b") returns: 1}}
{{get_json_object(json, "c.d") returns: NULL}}

In the present implementation of get_json_object, c.d is not addressable.

*Proposal:*
The desired behavior would be to allow the JSON path to be escape-able, like so:
{{get_json_object(json, '$.c.d') would return: 2}}

was:
*Background:*
get_json_object extracts json objects from a json string based on a specified path.

*Problem:*
The current implementation of get_json_object can't see keys with a '.' in them. Our data contains '.' in the keys, so we have to filter our json keys through a streaming script to replace '.' with '_'.

*Example:*
{{json = {"a":{"b": 1}, "c.d": 2}}}
{{get_json_object(json, "a.b") returns: 1}}
{{get_json_object(json, "c.d") returns: NULL}}

In the present implementation of get_json_object, c.d is not addressable.

*Proposal:*
The desired behavior would be to allow the JSON path to be escape-able, like so:
{{get_json_object(json, '$.c\\.d') would return: 2}}

Affects Version/s: (was: 0.8.1)
Fix Version/s: (was: 0.9.0)

> Allow escape character in get_json_object
> -
>
> Key: HIVE-2927
> URL: https://issues.apache.org/jira/browse/HIVE-2927
> Project: Hive
> Issue Type: Improvement
> Components: Security
> Reporter: Sean McNamara
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> *Background:*
> get_json_object extracts json objects from a json string based on a specified
> path.
> *Problem:*
> The current implementation of get_json_object can't see keys with a '.' in
> them. Our data contains '.' in the keys, so we have to filter our json keys
> through a streaming script to replace '.' with '_'.
> *Example:*
> {{json = {"a":{"b": 1}, "c.d": 2}}}
> {{get_json_object(json, "a.b") returns: 1}}
> {{get_json_object(json, "c.d") returns: NULL}}
> In the present implementation of get_json_object, c.d is not addressable.
> *Proposal:*
> The desired behavior would be to allow the JSON path to be escape-able, like
> so:
> {{get_json_object(json, '$.c.d') would return: 2}}
[jira] [Updated] (HIVE-2889) LOAD DATA IF NOT EXISTS functionality
[ https://issues.apache.org/jira/browse/HIVE-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean McNamara updated HIVE-2889:

Description:
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed. It used to give you an error when trying to copy in a log that already existed. Now it re-names the file with copy_1 so the file always goes into hdfs.

*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E

*Issue:*
There is no longer an atomic way to insert files into hive and guarantee that the file won't go in twice. Using OVERWRITE will cause other logs in the table/partition to be deleted.

*Example:*
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2}}
{{test_b.bz2}}
{{test_b_copy_1.bz2}}
{{test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this instance._

*Proposal:*
Add _IF NOT EXISTS_ flag to indicate copy semantics. If the log file does not exist in the table/partition, the log would go in normally. If the log does exist in the table/partition, hive would return an error and return an exit code.

*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')}}

was:
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed. It used to give you an error when trying to copy in a log that already existed. Now it re-names the file with copy_1 so the file always goes into hdfs.

*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E

*Issue:*
There is no longer an atomic way to insert files into hive and guarantee that the file won't go in twice. Using OVERWRITE will cause other logs in the table/partition to be deleted.

*Example:*
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this instance._

*Proposal:*
Add _IF NOT EXISTS_ flag to indicate copy semantics. If the log file does not exist in the table/partition, the log would go in normally. If the log does exist in the table/partition, hive would return an error and return an exit code.

*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')}}

> LOAD DATA IF NOT EXISTS functionality
> -
>
> Key: HIVE-2889
> URL: https://issues.apache.org/jira/browse/HIVE-2889
> Project: Hive
> Issue Type: Improvement
> Components: Import/Export
> Affects Versions: 0.8.1
> Reporter: Sean McNamara
> Fix For: 0.9.0
>
>
> *Background:*
> The behavior of LOAD DATA LOCAL INPATH has changed. It used to give you an
> error when trying to copy in a log that already existed. Now it re-names the
> file with copy_1 so the file always goes into hdfs.
> *Original discussion:*
> http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E
> *Issue:*
> There is no longer an atomic way to insert files into hive and guarantee that
> the file won't go in twice. Using OVERWRITE will cause other logs in the
> table/partition to be deleted.
> *Example:*
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> *Result:*
> {{test_a.bz2}}
> {{test_
[jira] [Updated] (HIVE-2889) LOAD DATA IF NOT EXISTS functionality
[ https://issues.apache.org/jira/browse/HIVE-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean McNamara updated HIVE-2889:

Description:
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed. It used to give you an error when trying to copy in a log that already existed. Now it re-names the file with copy_1 so the file always goes into hdfs.

*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E

*Issue:*
There is no longer an atomic way to insert files into hive and guarantee that the file won't go in twice. Using OVERWRITE will cause other logs in the table/partition to be deleted.

*Example:*
{{usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this instance._

*Proposal:*
Add _IF NOT EXISTS_ flag to indicate copy semantics. If the log file does not exist in the table/partition, the log would go in normally. If the log does exist in the table/partition, hive would return an error and return an exit code.

*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')}}

was:
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed. It used to give you an error when trying to copy in a log that already existed. Now it re-names the file with copy_1 so the file always goes into hdfs.

*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E

*Issue:*
There is no longer an atomic way to insert files into hive and guarantee that the file won't go in twice. Using OVERWRITE will cause other logs in the table/partition to be deleted.

*Example:*
{{usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this instance._

*Proposal:*
Add _IF NOT EXISTS_ flag to indicate copy semantics. If the log file does not exist in the table/partition, the log would go in normally. If the log does exist in the table/partition, hive would return an error and return an exit code.

*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')}}

> LOAD DATA IF NOT EXISTS functionality
> -
>
> Key: HIVE-2889
> URL: https://issues.apache.org/jira/browse/HIVE-2889
> Project: Hive
> Issue Type: Improvement
> Components: Import/Export
> Affects Versions: 0.8.1
> Reporter: Sean McNamara
> Fix For: 0.9.0
>
>
> *Background:*
> The behavior of LOAD DATA LOCAL INPATH has changed. It used to give you an
> error when trying to copy in a log that already existed. Now it re-names the
> file with copy_1 so the file always goes into hdfs.
> *Original discussion:*
> http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E
> *Issue:*
> There is no longer an atomic way to insert files into hive and guarantee that
> the file won't go in twice. Using OVERWRITE will cause other logs in the
> table/partition to be deleted.
> *Example:*
> {{usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> *Result:*
> {{test_a.bz2
> test_b.bz2
> test_b_copy_1.bz2
> te
[jira] [Updated] (HIVE-2889) LOAD DATA IF NOT EXISTS functionality
[ https://issues.apache.org/jira/browse/HIVE-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean McNamara updated HIVE-2889:

Description:
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed. It used to give you an error when trying to copy in a log that already existed. Now it re-names the file with copy_1 so the file always goes into hdfs.

*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E

*Issue:*
There is no longer an atomic way to insert files into hive and guarantee that the file won't go in twice. Using OVERWRITE will cause other logs in the table/partition to be deleted.

*Example:*
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this instance._

*Proposal:*
Add _IF NOT EXISTS_ flag to indicate copy semantics. If the log file does not exist in the table/partition, the log would go in normally. If the log does exist in the table/partition, hive would return an error and return an exit code.

*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')}}

was:
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed. It used to give you an error when trying to copy in a log that already existed. Now it re-names the file with copy_1 so the file always goes into hdfs.

*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E

*Issue:*
There is no longer an atomic way to insert files into hive and guarantee that the file won't go in twice. Using OVERWRITE will cause other logs in the table/partition to be deleted.

*Example:*
{{usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this instance._

*Proposal:*
Add _IF NOT EXISTS_ flag to indicate copy semantics. If the log file does not exist in the table/partition, the log would go in normally. If the log does exist in the table/partition, hive would return an error and return an exit code.

*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')}}

> LOAD DATA IF NOT EXISTS functionality
> -
>
> Key: HIVE-2889
> URL: https://issues.apache.org/jira/browse/HIVE-2889
> Project: Hive
> Issue Type: Improvement
> Components: Import/Export
> Affects Versions: 0.8.1
> Reporter: Sean McNamara
> Fix For: 0.9.0
>
>
> *Background:*
> The behavior of LOAD DATA LOCAL INPATH has changed. It used to give you an
> error when trying to copy in a log that already existed. Now it re-names the
> file with copy_1 so the file always goes into hdfs.
> *Original discussion:*
> http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E
> *Issue:*
> There is no longer an atomic way to insert files into hive and guarantee that
> the file won't go in twice. Using OVERWRITE will cause other logs in the
> table/partition to be deleted.
> *Example:*
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> {{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE
> logs PARTITION(ds='2012-03-19', hr='23')"}}
> *Result:*
> {{test_a.bz2
> test_b.bz2
> test_b_co
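
The difference between the copy_N renaming described in HIVE-2889 and the proposed IF NOT EXISTS semantics can be sketched in Python. This is a toy model, not Hive code: the partition is an in-memory set of file names, and `load_with_rename`, `load_if_not_exists`, and `FileExistsInPartition` are hypothetical names chosen for illustration.

```python
import os


class FileExistsInPartition(Exception):
    """Raised when a file with the same name is already in the partition."""


def load_with_rename(partition, filename):
    """Mimic the reported behavior: never fail, append _copy_N on a clash."""
    if filename not in partition:
        partition.add(filename)
        return filename
    stem, ext = os.path.splitext(filename)
    n = 1
    while f"{stem}_copy_{n}{ext}" in partition:
        n += 1
    renamed = f"{stem}_copy_{n}{ext}"
    partition.add(renamed)
    return renamed


def load_if_not_exists(partition, filename):
    """Mimic the proposal: refuse to load a file name that is already there."""
    if filename in partition:
        raise FileExistsInPartition(filename)
    partition.add(filename)
    return filename


# Reported behavior: loading test_b.bz2 three times stores it three times.
current = set()
for name in ["test_a.bz2", "test_b.bz2", "test_b.bz2", "test_b.bz2"]:
    load_with_rename(current, name)
print(sorted(current))

# Proposed behavior: the second load of the same name is rejected.
proposed = set()
load_if_not_exists(proposed, "test_b.bz2")
try:
    load_if_not_exists(proposed, "test_b.bz2")
except FileExistsInPartition as err:
    print("rejected duplicate:", err)
```

The sketch leaves out the part the issue cares about most: in a real filesystem the existence check and the move would have to happen atomically, otherwise two concurrent loaders could both pass the check.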