[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300839#comment-14300839 ]

Liao, Xiaoge commented on HIVE-6131:

How was this bug fixed?

> New columns after table alter result in null values despite data
>
>                 Key: HIVE-6131
>                 URL: https://issues.apache.org/jira/browse/HIVE-6131
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.11.0, 0.12.0, 0.13.0
>            Reporter: James Vaughan
>            Priority: Minor
>         Attachments: HIVE-6131.1.patch
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding
> columns to tables with Partitions using 'REPLACE COLUMNS'. I dug through the
> Jira a little bit and didn't see anything for it so hopefully this isn't just
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to
> that partition, it doesn't seem to recognize the extra data that actually
> exists in HDFS, as in, it returns NULL values on the new column despite having
> the data and recognizing the new column in the metadata.
> Here are some steps to reproduce using a basic table:
> 1. Run this hive command: CREATE TABLE jvaughan_test (col1 string)
> partitioned by (day string);
> 2. Create a simple file on the system with a couple of entries, something
> like "hi" and "hi2" separated by newlines.
> 3. Run this hive command, pointing it at the file: LOAD DATA LOCAL INPATH
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4. Confirm the data with: SELECT * FROM jvaughan_test WHERE day =
> '2014-01-02';
> 5. Alter the column definitions: ALTER TABLE jvaughan_test REPLACE COLUMNS
> (col1 string, col2 string);
> 6. Edit your file and add a second column using the default separator
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the
> first row and "hi4" on the second
> 7. Run step 3 again
> 8. Check the data again like in step 4
> For me, these are the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hi      NULL    2014-01-02
> hi2     NULL    2014-01-02
> This is despite the fact that there is data in the file stored by the
> partition in HDFS.
> Let me know if you need any other information. The only workaround for me
> currently is to drop partitions for any I'm replacing data in and THEN
> reupload the new data file.
> Thanks,
> -James

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
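The symptom in the report can be reproduced with a toy model. The Python sketch below is not Hive code. It assumes, as the discussion in this issue suggests, that the partition keeps the column list it was created with, that rows are deserialized with that list, and that the result is then padded with NULL up to the table's column list. The names `read_partition` and `FIELD_SEP` are hypothetical.

```python
FIELD_SEP = "\x01"  # Hive's default field delimiter (Ctrl-A)


def read_partition(lines, partition_cols, table_cols):
    """Toy reader: deserialize with the partition's columns, then project onto the table's."""
    rows = []
    for line in lines:
        fields = line.split(FIELD_SEP)
        # Only as many fields as the partition schema knows about are kept.
        record = dict(zip(partition_cols, fields))
        # Columns missing from the partition schema come back as None (NULL).
        rows.append([record.get(col) for col in table_cols])
    return rows


# The file from step 6: two rows, two fields each.
data = ["hi" + FIELD_SEP + "hi3", "hi2" + FIELD_SEP + "hi4"]

# Partition schema still has one column, table schema has two.
stale = read_partition(data, ["col1"], ["col1", "col2"])
# Partition schema matches the table schema.
fresh = read_partition(data, ["col1", "col2"], ["col1", "col2"])

print(stale)
print(fresh)
```

In this model the second call stands in for the reporter's workaround: a partition that is dropped and recreated is assumed to pick up the current table columns, so the second field is no longer discarded.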
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974492#comment-13974492 ]

Szehon Ho commented on HIVE-6131:

There is a related proposal over at HIVE-6835 to pass both the table and partition schemas to the serde and let it choose. However, it doesn't seem to solve the issue with the RCFile serde.

I agree it's a mess. Unfortunately, a case like changing a column type for RCFile makes it possible for the partition and table schemas to go out of sync in more than just the number of columns, which makes it hard to do the 'merge' you proposed. I think it would be nice not to support them going out of sync this way, but I'm not sure it's possible to change that at this point.
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967001#comment-13967001 ]

Pala M Muthaia commented on HIVE-6131:

You are right: the types of existing columns may change, so the partition schema may never be the same as the table schema, and we cannot simply pick one or the other.

Let's say we support an add-columns DDL at the partition level. What can be allowed? Can users add arbitrarily different columns compared to the table, or should they only add columns that are present at the table level but missing at the partition level, in the same order?

E.g.: Initial schema: Table t (A, B, C, D), Partition p (A', B'). Can users only execute 'alter table t partition (p) add columns C, D'? Or can they also do something else, such as 'alter table t partition (p) add columns E, F, G'?

If it is only the former, then we can still do the same programmatically, by 'merging' the partition and table schemas at runtime. However, if the table schema itself can be wildly different from the partition schema, then yes, DDL is the only option, and users have to manage it themselves.
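The 'merge' idea in the comment above could look roughly like the following Python sketch. It covers only the restricted case where the partition's columns are a prefix, by name, of the table's columns. It is not Hive's implementation; `merge_schemas`, the (name, type) tuple representation, and the concrete types are made up for illustration.

```python
def merge_schemas(partition_cols, table_cols):
    """Keep the partition's own column definitions, then append the table columns it lacks."""
    part_names = [name for name, _ in partition_cols]
    table_names = [name for name, _ in table_cols]
    # The merge is only well defined if the partition columns are a prefix of the table's.
    if part_names != table_names[:len(part_names)]:
        raise ValueError("partition columns are not a prefix of table columns")
    return list(partition_cols) + list(table_cols[len(partition_cols):])


# Table t (A, B, C, D); partition p (A', B') has the same names but older types.
table_t = [("A", "int"), ("B", "int"), ("C", "string"), ("D", "string")]
partition_p = [("A", "string"), ("B", "string")]

merged = merge_schemas(partition_p, table_t)
print(merged)
```

A partition that had gained unrelated columns such as E, F, G would be rejected by the prefix check, which corresponds to the case where users would have to manage the schema through DDL themselves.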
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13966352#comment-13966352 ]

Szehon Ho commented on HIVE-6131:

Yeah, that's what I meant: 'alter table partition (spec) add column', etc. I wonder why it's not natural? IMO it would give more flexibility to the user.

My concern with your approach is that it locks the user into only one choice per serde. For example, take RCFile's serdes: which would you pick? If it is 'table-schema', then you hit the partition..11.q issue (error after column-type change). If it is 'partition-schema', then you hit this JIRA's issue (add column, and new data loaded in that column is null) because the partition schema is never updated. There might be users interested in both cases (we ourselves are interested in the latter use case for RCFile).
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13966111#comment-13966111 ]

Pala M Muthaia commented on HIVE-6131:

Do you mean allowing schema update statements at the partition level as well? A more specific 'sync partition schema to table schema' command would probably be better, but even that, I think, is not natural.

I think the best approach is to change the implementation: use the table schema whenever a serde can support the latest schema correctly, otherwise use the partition schema.
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964571#comment-13964571 ]

Szehon Ho commented on HIVE-6131:

Makes sense. I had fixed LazyBinaryColumnarSerde to handle one of those cases in HIVE-5788 (add column), but it did not fix handling a column-type change, as you observed. Maybe we can try fixing that, but I guess it's not easily supported by all the serdes.

An alternate thought I had is to introduce an 'alter partition' statement, to allow the user to specify what schema to use on read.
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964556#comment-13964556 ]

Pala M Muthaia commented on HIVE-6131:

Yes, one case where a partition column differs from the table column is when the type of an existing column is changed (e.g. from string to int) at the table level after a partition is created. So column1 is string on the partition, but int on the table. That is the cause of the failure of the unit test 'org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_partition_wise_fileformat11' you see above. So we can't alter the partition schema to match the table schema, because the underlying data in LazyBinaryColumnarSerde was written to adhere to the old schema. If you just change the schema, but not the data, the data will be read incorrectly.

Yes, serdes can handle additional columns by treating them as null valued. However, we are trying to address a more complex case. E.g.: the table and partition have 3 columns. We add one more column at the table level. Then, at the partition, we append (not overwrite) data with a 4th column too. Now the partition has some data with 3 columns and some data with 4 columns. LazySimpleSerde can work in this scenario, but LazyBinaryColumnarSerde doesn't.
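To illustrate why a row-oriented, delimited text format can tolerate the mixed 3-column and 4-column case described above, here is a toy Python reader that pads short rows with NULL. It is a simplified model for illustration only, not LazySimpleSerde's actual code. The function name, column names, and sample values are made up.

```python
FIELD_SEP = "\x01"  # Hive's default field delimiter (Ctrl-A)


def read_text_rows(lines, table_cols):
    """Toy delimited-text reader driven by the latest (table) column list."""
    rows = []
    for line in lines:
        fields = line.split(FIELD_SEP)
        # Pad rows written before the column was added; ignore any surplus fields.
        padded = (fields + [None] * len(table_cols))[:len(table_cols)]
        rows.append(padded)
    return rows


# One row written with 3 columns, one appended later with a 4th column.
old_row = FIELD_SEP.join(["x1", "y1", "z1"])
new_row = FIELD_SEP.join(["x2", "y2", "z2", "w2"])

result = read_text_rows([old_row, new_row], ["c1", "c2", "c3", "c4"])
print(result)
```

Each text row carries its own field count, so one schema can serve both generations of data. The comment above reports that LazyBinaryColumnarSerde does not work in this scenario.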
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963728#comment-13963728 ]

Szehon Ho commented on HIVE-6131:

Hm, I understand the file format may differ between partition and table; that was the point of HIVE-3833. But just for my understanding, did you find any use for the partition columns being different from the table columns (being the original)?

In my experience, I have seen that LazyBinaryColumnarSerde (and other serdes) can use a schema with more columns to deserialize data than the data was written with. If that's the case, can't we make the column set the same for partition and table during 'alter table'?
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963641#comment-13963641 ]

Pala M Muthaia commented on HIVE-6131:

I looked into the failures above and revisited HIVE-3833 with more context now:
1. LazyBinaryColumnarSerde requires partition-level metadata to read existing data; it needs the exact metadata used when serializing the data. So it cannot use table-level metadata, which could have changed.
2. Other serdes/formats, which support schema change, need the updated schema to support newly appended data with new columns.

So it seems we should pass the table metadata or the partition metadata selectively, depending on what the storage/serde supports.

Is there a way to identify the serdes/formats that do not support a newer schema programmatically? I don't see anything obvious. The alternatives are to:
a. Add such metadata to the serde info and populate it for all serdes. This may have been discussed briefly in HIVE-3833, and it looks like this will be a large change because it essentially modifies the interface for a plugin.
b. Hardcode a whitelist or blacklist of serdes and pass table- or partition-level metadata accordingly.

[~ashutoshc], [~szehon], any thoughts on the above? In particular, are there other alternatives?
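Alternative (b) above could be sketched as follows. This is a hypothetical Python illustration, not the attached patch or any Hive API. The set contents simply mirror the serde named in this comment, and `schema_for_read` is an invented helper.

```python
# Hypothetical hardcoded list: serdes that must be given the exact schema
# their existing data was serialized with.
NEEDS_PARTITION_SCHEMA = {"LazyBinaryColumnarSerde"}


def schema_for_read(serde_name, table_schema, partition_schema):
    """Pick partition-level or table-level metadata depending on the serde."""
    if serde_name in NEEDS_PARTITION_SCHEMA:
        return partition_schema
    # Serdes that support schema change get the latest (table) schema.
    return table_schema


table_schema = ["col1", "col2"]
partition_schema = ["col1"]

columnar_choice = schema_for_read("LazyBinaryColumnarSerde", table_schema, partition_schema)
text_choice = schema_for_read("LazySimpleSerde", table_schema, partition_schema)
print(columnar_choice, text_choice)
```

The trade-off is the one raised in the comment: a hardcoded list is a small change but has to be maintained by hand, whereas alternative (a) would let each serde declare its own capability at the cost of changing the plugin interface.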
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955954#comment-13955954 ]

Hive QA commented on HIVE-6131:

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12637927/HIVE-6131.1.patch

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 5514 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_partition_wise_fileformat11
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_partition_wise_fileformat12
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_partition_wise_fileformat13
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_partition_wise_fileformat14
org.apache.hadoop.hive.metastore.TestRetryingHMSHandler.testRetryingHMSHandler
{noformat}

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/2054/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/2054/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12637927
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955772#comment-13955772 ]

Szehon Ho commented on HIVE-6131:

It should be there. Are you looking at http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/? I think it's one of builds 2054-2056, depending on when it was uploaded. Let's wait for those and see.
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955767#comment-13955767 ]

Pala M Muthaia commented on HIVE-6131:

[~szehon], I have re-uploaded the patch with the expected name. I still don't see a job in progress for the Jenkins Hive precommit build. Let me know if something else is needed.
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955734#comment-13955734 ]

Szehon Ho commented on HIVE-6131:

Hi Pala, you can just re-upload the same patch again. The Jenkins job will pick it up automatically. I think the first patch you uploaded got missed by the Jenkins job during an outage.
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955727#comment-13955727 ]

Pala M Muthaia commented on HIVE-6131:

Thanks, will do [~ashutoshc]. However, I need a login for Apache Jenkins. Could you or somebody else add a login for me?
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955685#comment-13955685 ]

Ashutosh Chauhan commented on HIVE-6131:

There are lots of test cases in Hive covering these scenarios. It would be good to get this patch tested against those and see where we stand. You can follow https://cwiki.apache.org/confluence/display/Hive/Hive+PreCommit+Patch+Testing to get Hive QA to kick off a CI build.
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955570#comment-13955570 ]

Pala M Muthaia commented on HIVE-6131:

[~ashutoshc], any thoughts on this?
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945824#comment-13945824 ]

Szehon Ho commented on HIVE-6131:

bq. That thought did cross my mind. However, I wasn't sure if there are other dependencies that actually need the table schema snapshot during partition creation (which is stored as partition level schema).

You mean a use case for keeping partition.sd() as the original table.sd()? That doesn't seem likely, but I could be wrong.

bq. A related question is why store partition specific column schema if it is always identical to current table column schema? Does it represent current table schema, and if so, why do we keep a copy separate from table schema. If not, then it represents schema during partition creation, so it is possible that it is out of date, leading to the inconsistency you describe.

Yeah, I wonder about that as well. That's why I had originally assumed it was by design that partition columns can differ from table columns. Otherwise, why spend the metastore's memory and storage on keeping the schema in two places? But again, I don't have the full context. Does [~ashutoshc] or anyone else have some background on whether there is a use case for what Pala described?
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945773#comment-13945773 ]

Pala M Muthaia commented on HIVE-6131:

That thought did cross my mind. However, I wasn't sure if there are other dependencies that actually need the table schema snapshot taken during partition creation (which is stored as the partition-level schema).

A related question is why we store a partition-specific column schema if it is always identical to the current table column schema. In that case, partition metadata should not include a column schema at all and should always pick it up from the table metadata.

I guess I am unsure about the semantics of the partition schema. Does it represent the current table schema? If so, why do we keep a copy separate from the table schema? If not, then it represents the schema at partition creation, so it can be out of date, leading to the inconsistency you describe.
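The two readings of "partition schema" raised in this comment can be sketched with a toy Python model. This is not Hive code: `ToyTable`, the `propagate` flag, and `stale_partitions` are hypothetical names invented for illustration. Under the snapshot reading, a partition's columns go stale after an alter; propagating the new columns at alter time keeps them in step with the table.

```python
class ToyTable:
    """Toy metastore entry: table columns plus a per-partition column snapshot."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.partition_columns = {}  # partition name -> list of column names

    def add_partition(self, name):
        # Snapshot semantics: copy the table's columns as they are right now.
        self.partition_columns[name] = list(self.columns)

    def replace_columns(self, new_columns, propagate=False):
        self.columns = list(new_columns)
        if propagate:
            # Hypothetical alternative: rewrite every partition's snapshot too.
            for name in self.partition_columns:
                self.partition_columns[name] = list(new_columns)

    def stale_partitions(self):
        # Partitions whose snapshot no longer matches the table.
        return sorted(
            name for name, cols in self.partition_columns.items()
            if cols != self.columns
        )


snapshot = ToyTable(["col1"])
snapshot.add_partition("day=2014-01-02")
snapshot.replace_columns(["col1", "col2"])
print(snapshot.stale_partitions())  # ['day=2014-01-02']

propagated = ToyTable(["col1"])
propagated.add_partition("day=2014-01-02")
propagated.replace_columns(["col1", "col2"], propagate=True)
print(propagated.stale_partitions())  # []
```

Whether any other code depends on the creation-time snapshot is exactly the open question in the thread; the model only shows how the two interpretations diverge.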
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945746#comment-13945746 ]

Szehon Ho commented on HIVE-6131:

Hi, I'm wondering if it's possible to instead set the correct column metadata on the partition during 'alter table'?

This patch changes Hive to use table metadata instead of partition metadata when initializing the deserializer for a partition. While it does return the correct query result, the partition metadata (for example, if you run 'describe partition' after 'alter table add columns') still shows the old columns from before the alter, which is now inconsistent with the results returned.
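To make the trade-off in this comment concrete, here is a toy Python model. It is not Hive code, and every class and function name in it is invented for illustration. It assumes a partition keeps its own snapshot of the column list taken at creation time, and compares deserializing a reloaded row with that snapshot against deserializing it with the current table columns.

```python
from dataclasses import dataclass, field

SEP = "\x01"  # ctrl+a, the default field separator mentioned in the report


@dataclass
class Table:
    columns: list
    partitions: dict = field(default_factory=dict)  # name -> column snapshot

    def add_partition(self, name):
        # The partition snapshots the table's columns at creation time.
        self.partitions[name] = list(self.columns)

    def replace_columns(self, new_columns):
        # Only table-level metadata changes; partitions keep their snapshot.
        self.columns = list(new_columns)


def deserialize(line, columns):
    # Map fields to columns by position; missing fields become None,
    # fields beyond the known columns are dropped.
    fields = line.split(SEP)
    return {c: (fields[i] if i < len(fields) else None)
            for i, c in enumerate(columns)}


def read_row(table, line, serde_columns):
    # `serde_columns` stands in for the schema handed to the deserializer;
    # the result is always projected onto the table's current columns.
    parsed = deserialize(line, serde_columns)
    return [parsed.get(c) for c in table.columns]


t = Table(columns=["col1"])
t.add_partition("day=2014-01-02")
t.replace_columns(["col1", "col2"])
line = "hi" + SEP + "hi3"  # data reloaded after the alter

stale = read_row(t, line, t.partitions["day=2014-01-02"])  # partition schema
fresh = read_row(t, line, t.columns)                       # table schema
print(stale)  # ['hi', None]
print(fresh)  # ['hi', 'hi3']
print(t.partitions["day=2014-01-02"])  # ['col1']
```

In this model, reading with the table columns returns the data, but the partition's own snapshot still lists only the old column, which is the inconsistency the comment points out.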
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945589#comment-13945589 ]

Pala M Muthaia commented on HIVE-6131:

The above was intended to be the code review description. [~ashutoshc], can you please review or have the right owner look at this change? Thanks.
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943226#comment-13943226 ]

Szehon Ho commented on HIVE-6131:

We saw this issue earlier while working on schema evolution for Parquet and other SerDes, but had thought it was expected behavior (that different partitions keep their old column schema after HIVE-3833). This will be a good fix to have.
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942429#comment-13942429 ]

Ashutosh Chauhan commented on HIVE-6131:

Currently, Hive only allows adding new columns in newer partitions via alter table add column, so Hive should insert nulls for new columns that are absent in old partitions. SerDes should honor this. On the other hand, if the user is doing alter table replace columns, then all bets are off and they should know what they are doing. Hive doesn't handle those scenarios and, as far as I can see, it's very hard to support them.
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942410#comment-13942410 ]

Pala M Muthaia commented on HIVE-6131:

OK, I'll work on the patch. [~ashutoshc], my fix assumes that all SerDes can work with the latest table schema even though the partition data is in an older schema. Is that valid?
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941305#comment-13941305 ]

Ashutosh Chauhan commented on HIVE-6131:

[~pala] Your suggested fix makes sense.
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941289#comment-13941289 ]

Pala M Muthaia commented on HIVE-6131:

Browsing the code, it seems that this was introduced by the fix for HIVE-3833 in Hive 0.11. In that patch, the partition schema is used to read results, instead of the table schema as before. Since the partition schema is a snapshot of the table schema at the time the partition was created, it doesn't contain the columns added later. So the result is read using a stale schema and does not contain the new column values, even though they are present in the underlying data.

Clearly the intent of the HIVE-3833 patch is to use partition-specific metadata, to allow for multiple SerDes for partitions of a table (as I understand it). This issue seems to be a regression introduced by that patch.

One possible fix is to use the partition metadata, except for the column list, which would be updated from the table metadata. It is quite possible that, while this will work, it may not be the 'right' fix.

[~namitjain] [~ashutoshc], any thoughts on this?
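To make the analysis above concrete, here is a toy model of the stale-schema read and of the proposed fix. It is only a sketch in Python and not Hive code: `Metadata`, `read_row`, and `with_table_columns` are hypothetical names invented for illustration, the serde value is just a label, and the positional parsing is an assumption about how a delimited-text reader behaves.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple

FIELD_SEP = "\x01"  # Ctrl-A, the default field separator mentioned in the report


@dataclass(frozen=True)
class Metadata:
    """Hypothetical stand-in for table or partition metadata."""
    serde: str
    columns: Tuple[str, ...]


def read_row(line: str, reader: Metadata,
             output_columns: Tuple[str, ...]) -> List[Optional[str]]:
    # Parse fields positionally against the reader's column list;
    # zip() silently drops fields beyond the columns the reader knows about.
    parsed = dict(zip(reader.columns, line.split(FIELD_SEP)))
    # Columns the reader did not parse come back as None (NULL).
    return [parsed.get(name) for name in output_columns]


def with_table_columns(partition: Metadata, table: Metadata) -> Metadata:
    # Sketch of the proposed fix: keep the partition-specific metadata
    # (here, only the serde) but take the column list from the table.
    return replace(partition, columns=table.columns)


# Table after REPLACE COLUMNS; the partition still holds the old snapshot.
table = Metadata(serde="LazySimpleSerDe", columns=("col1", "col2"))
stale_partition = Metadata(serde="LazySimpleSerDe", columns=("col1",))
patched_partition = with_table_columns(stale_partition, table)

new_row = FIELD_SEP.join(["hi", "hi3"])  # row reloaded with the extra field

print(read_row(new_row, stale_partition, table.columns))    # ['hi', None]
print(read_row(new_row, patched_partition, table.columns))  # ['hi', 'hi3']
print(read_row("hi", patched_partition, table.columns))     # ['hi', None]
```

The last call models an old partition whose files were never reloaded: with the table's column list, the missing field is simply returned as NULL.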
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920691#comment-13920691 ]

Ning Zhang commented on HIVE-6131:

Verified this issue with Apache Hive 0.10, 0.11, and 0.12. The conclusion is that Hive 0.11 and 0.12 have this issue, while Hive 0.10 doesn't. That is, the issue was introduced in Hive 0.11.
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863242#comment-13863242 ]

James Vaughan commented on HIVE-6131:

Editing the local file. Step 7 is supposed to handle re-uploading the file using an OVERWRITE command like you say.
[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data
[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863231#comment-13863231 ]

Xuefu Zhang commented on HIVE-6131:

Can you please clarify whether, in your step #6, you're editing the local file or the HDFS file? If you're editing your local file, you will need to load the data again as you do in step #3, or manually replace the file on HDFS with your locally edited file.