[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2015-02-01 Thread Liao, Xiaoge (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300839#comment-14300839
 ] 

Liao, Xiaoge commented on HIVE-6131:


how did this bug fix?

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-04-18 Thread Szehon Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974492#comment-13974492
 ] 

Szehon Ho commented on HIVE-6131:
-

There is a related proposal over at HIVE-6835 to pass both to the serde and let 
them choose.

However, it doesnt seem to solve the issue about RCFile serde.   I agree its a 
mess, unfortunately the case like change column type for RCFile makes it 
possible for partition + table schemas to go out of sync with more than just 
#cols, and hard to do the 'merge' as you proposed.  I think it would be nice 
not to support it going out of sync this way, not sure if its possible to 
change at this point.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-04-11 Thread Pala M Muthaia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967001#comment-13967001
 ] 

Pala M Muthaia commented on HIVE-6131:
--

You are right, types of existing columns may change so partition schema may 
never be same as table schema, so cannot pick one or the other. 

Let's say we support add columns DDL at partition level. What can be allowed? 
Can users add arbitrarily different columns compared to table, or should they 
only add columns that are present in table level, but are missing at partition 
level, in the same order? 

e.g: Initial schema: Table t (A, B, C, D), Partition p (A', B'). Can users only 
execute 'Alter table t partition (p) add columns C,D'? Or can they do something 
else also 'alter table t partition (p) add columns E, F, G'? 

If it is only the former, then we still can do the same programmatically, by 
'merging' the partition and table schema at runtime. However, if the table 
schema itself can be wildly different compared to partition schema, then yes, 
DDL is the only option, and users have to manage it themselves.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-04-11 Thread Szehon Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13966352#comment-13966352
 ] 

Szehon Ho commented on HIVE-6131:
-

Yea thats what I meant, 'alter table partition (spec) add column', etc, I 
wonder why its not natural?  Imo it would gives more flexibility to user.

My concern with your approach is it locking user to have only one choice per 
Serde.  For example take Rcfile's serdes, which would you pick?  If it is 
'table-schema' , then you hit the partition..11.q issue (error after 
column-type change).  If it is 'partition-schema', then you hit the JIRA's 
issue (add column, new data loaded in that column is null) because partition 
schema is never updated.  There might be users interested in both cases (we 
ourselves are interested to get the latter use case for RCFile).









> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-04-10 Thread Pala M Muthaia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13966111#comment-13966111
 ] 

Pala M Muthaia commented on HIVE-6131:
--

Do you mean allowing schema update statements also at partition level? Probably 
a more specific 'sync partition schema to table schema' command would be 
better, but even that i think is not natural.

I think best approach is to change implementation - use table schema whenever a 
serde can support latest schema correctly, otherwise use partition schema.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-04-09 Thread Szehon Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964571#comment-13964571
 ] 

Szehon Ho commented on HIVE-6131:
-

Makes sense.  I had fixed LazyBinaryColumnarSerde to handle one of those case 
in HIVE-5788 (add column), but it did not fix handling change column-type as 
you observed.  Maybe we can try fixing that, but I guess its not easily 
supported by all the serdes.

An alternate thought I had is to introduce an 'alter partition' statement, to 
allow the user specify what schema to use on read.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-04-09 Thread Pala M Muthaia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964556#comment-13964556
 ] 

Pala M Muthaia commented on HIVE-6131:
--

Yes, one case where partition column is different than table column is the type 
of an existing column is changed (e.g: from string to int) at the table level, 
after a partition is created. So column1 is string on partition, but int on 
table.

That is the cause of failure for the unit test 
'org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_partition_wise_fileformat11'
 you see above.

So we can't alter the partition schema to match table schema, because 
underlying data in LazyBinaryColumnarSerde was written to adhere to old schema. 
If you just change the schema, but not the data, the data will be read 
incorrectly.

Yes serdes can handle additional columns by treating them as null valued. 
However, we are trying to address a more complex case: e.g: Table and partition 
has 3 columns. We add one more column at table level. Then at the partition, we 
append (not overwrite) data with 4th column too. Now the partition has some 
data with 3 columns, but some data with 4 columns. LazySimpleSerde can work in 
this scenario, but LazyBinaryColumnarSerde doesn't.  

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-04-08 Thread Szehon Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963728#comment-13963728
 ] 

Szehon Ho commented on HIVE-6131:
-

Hm I understand file-format may differ between partition and table, that was 
the point of HIVE-3833.   But just for my understanding, did you find any use 
for the partition-columns being different from table-columns (being the 
original)? 

In my experience, I had seen that LazyBinaryColumnarSerde (and other serde) can 
use a schema with more columns to de-serialize data than what the data was 
written with.  If thats the case, cant we make the column set same for 
partition and table, during 'alter table'?

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-04-08 Thread Pala M Muthaia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963641#comment-13963641
 ] 

Pala M Muthaia commented on HIVE-6131:
--

I looked into the failures above and revisited HIVE-3833 with more context now:
1. LazyBinaryColumnarSerde requires partition level metadata to read existing 
data, it needs exact metadata used when serializing the data. So cannot use 
table level metadata which could have changed.
2. Other serdes/format, which support schema change, needs updated schema to 
support newly appended data with new columns.

So, it seems we should pass the table metadata or partition metadata 
selectively, depending on what the storage/serde supports. Is there a way to 
identify the serdes/format that do not support newer schema, programmatically? 
I don't see anything obvious. Alternative is to
a.  Add such metadata to serde info and populate that for all serdes. This may 
have been discussed briefly in HIVE-3833, and looks like this will be a large 
change because it essentially modifies interface for a plugin.
b.  Hardcode a white or blacklist of serdes and pass table/partition level 
metadata accordingly.

[~ashutoshc], [~szehon], any thoughts on the above, particularly are there 
other alternatives?



> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-31 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955954#comment-13955954
 ] 

Hive QA commented on HIVE-6131:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12637927/HIVE-6131.1.patch

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 5514 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_partition_wise_fileformat11
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_partition_wise_fileformat12
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_partition_wise_fileformat13
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_partition_wise_fileformat14
org.apache.hadoop.hive.metastore.TestRetryingHMSHandler.testRetryingHMSHandler
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/2054/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/2054/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12637927

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-31 Thread Szehon Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955772#comment-13955772
 ] 

Szehon Ho commented on HIVE-6131:
-

It should be there, are you looking at 
[http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/|http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/]?
   I think its either build 2054-2056 depending on when it was uploaded.  Let's 
wait for those and see.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-31 Thread Pala M Muthaia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955767#comment-13955767
 ] 

Pala M Muthaia commented on HIVE-6131:
--

[~szehon], i have reuploaded the patch with expected name. I still don't see a 
job in progress for jenkins Hive precommit build. Let me know if something else 
is needed. 

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-31 Thread Szehon Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955734#comment-13955734
 ] 

Szehon Ho commented on HIVE-6131:
-

Hi Pala, you can just re-upload the same patch again.  Jenkins job will pick it 
up automatically.  I think the first patch you uploaded got missed by the 
jenkins job during an outage.


> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Assignee: Szehon Ho
>Priority: Minor
> Attachments: HIVE-6131.1.patch.txt
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-31 Thread Pala M Muthaia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955727#comment-13955727
 ] 

Pala M Muthaia commented on HIVE-6131:
--

Thanks, will do [~ashutoshc]. However, i need a login with apache jenkins. 
Could you or somebody else add a login for me?

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch.txt
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-31 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955685#comment-13955685
 ] 

Ashutosh Chauhan commented on HIVE-6131:


There are lots of test cases in Hive covering these scenarios. It will be good 
to get this patch tested on those and see where we stand. You can follow 
https://cwiki.apache.org/confluence/display/Hive/Hive+PreCommit+Patch+Testing 
to get Hive QA to kick a CI build.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch.txt
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-31 Thread Pala M Muthaia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955570#comment-13955570
 ] 

Pala M Muthaia commented on HIVE-6131:
--

[~ashutoshc], any thoughts on this?

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch.txt
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-24 Thread Szehon Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945824#comment-13945824
 ] 

Szehon Ho commented on HIVE-6131:
-

bq. That thought did cross my mind. However, I wasn't sure if there are other 
dependencies that actually need the table schema snapshot during partition 
creation (which is stored as partition level schema).

You mean a use-case to keep partition.sd() as the original table.sd()?  Doesnt 
seem likely, but I could be wrong.

bq. A related question is why store partition specific column schema if it is 
always identical to current table column schema?  Does it represent current 
table schema, and if so, why we keep a copy separate from table schema. If not, 
then it represents schema during partition creation, so it is possible that it 
is out of date, leading to the inconsistency you describe.

Yea I wonder that as well, thats why I had originally assumed it was by design 
that partition columns can be different than table column.  Otherwise why waste 
all the metastore's memory/storage and store it in different places?   But 
again I dont have the full context, does [~ashutoshc] or others have some 
background, if theres a use-case of what Pala described?

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch.txt
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-24 Thread Pala M Muthaia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945773#comment-13945773
 ] 

Pala M Muthaia commented on HIVE-6131:
--

That thought did cross my mind. However, I wasn't sure if there are other 
dependencies that actually need the table schema snapshot during partition 
creation (which is stored as partition level schema). 

A related question is why store partition specific column schema if it is 
always identical to current table column schema? Then partition metadata should 
not include column schema at all and always pick it up from table metadata. 

I guess i am unsure about semantics of partition schema. Does it represent 
current table schema, and if so, why we keep a copy separate from table schema. 
If not, then it represents schema during partition creation, so it is possible 
that it is out of date, leading to the inconsistency you describe.


> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch.txt
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-24 Thread Szehon Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945746#comment-13945746
 ] 

Szehon Ho commented on HIVE-6131:
-

Hi , I'm wondering if its possible to instead set the correct column metadata 
on the partition during 'alter table'? 

This patch changes hive to use table instead of partition metadata on 
initializing the de-serializer for partition.  While it works to return correct 
query result, the partition metadata (for example if you do 'describe 
partition' after 'alter table add columns') still shows the old columns before 
the alter, which is now inconsistent with the results returned.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.11.0, 0.12.0, 0.13.0
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch.txt
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-24 Thread Pala M Muthaia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945589#comment-13945589
 ] 

Pala M Muthaia commented on HIVE-6131:
--

The above was intended to be code review description. [~ashutoshc], can you 
please review or have the right owner look at this change? Thanks.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Reporter: James Vaughan
>Priority: Minor
> Attachments: HIVE-6131.1.patch.txt
>
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-21 Thread Szehon Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943226#comment-13943226
 ] 

Szehon Ho commented on HIVE-6131:
-

We saw this issue earlier while working with schema evolution for parquet and 
other serde, but had thought it was expected behavior (that different partition 
keep old column schema after HIVE-3833).  This will be a good fix to have.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Reporter: James Vaughan
>Priority: Minor
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942429#comment-13942429
 ] 

Ashutosh Chauhan commented on HIVE-6131:


Currently, hive only allows to add new columns in newer partition via alter 
table add column, so Hive should insert nulls for new columns which are absent 
in old partitions. Serdes should honor this. 
On the other hand, if user is doing alter table replace columns than all bets 
are off and they should know what they are doing, Hive doesnt handle those 
scenarios and as far as I see it, its very hard to support that.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Reporter: James Vaughan
>Priority: Minor
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-20 Thread Pala M Muthaia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942410#comment-13942410
 ] 

Pala M Muthaia commented on HIVE-6131:
--

Ok, i'll work on the patch.

[~ashutoshc], my fix assumes all serdes can work with latest table schema, even 
though partition data is in older schema. Is that valid? 

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Reporter: James Vaughan
>Priority: Minor
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941305#comment-13941305
 ] 

Ashutosh Chauhan commented on HIVE-6131:


[~pala] Your suggested fix makes sense.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Reporter: James Vaughan
>Priority: Minor
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-19 Thread Pala M Muthaia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941289#comment-13941289
 ] 

Pala M Muthaia commented on HIVE-6131:
--

Browsing the code, it seems like this was introduced by fix for HIVE-3833 in 
Hive 0.11. In that patch, the partition schema is used to read results, instead 
of the table schema as it was before. Since partition schema is a snapshot of 
table schema at the time of partition creation, it doesn't contain all the new 
columns added later. So, the result is read using stale schema and thus do not 
contain new column values, even though they are present in underlying data.

Clearly the intent of the 3833 patch is to use partition specific metadata, to 
allow for multiple serdes for partitions of a table (as i understand it). This 
issue seems to be a regression introduced by that patch.

One possible fix is to use partition metadata, except update column list from 
table metadata. It is quite possible that while this will work, this may not be 
the 'right' fix. [~namitjain]] [~ashutoshc], any thoughts on this?



> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Reporter: James Vaughan
>Priority: Minor
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-03-05 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920691#comment-13920691
 ] 

Ning Zhang commented on HIVE-6131:
--

Verified this issue with Apache Hive 0.10, 0.11, 0.12

The conclusion is: Hive 0.11 and 0.12 have this issue, while Hive 0.10 doesn't. 
That is to say, starting from Hive 0.11, the issue was introduced.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Reporter: James Vaughan
>Priority: Minor
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-01-06 Thread James Vaughan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863242#comment-13863242
 ] 

James Vaughan commented on HIVE-6131:
-

Editing the local file.  Step 7 is supposed to handle re-uploading the file 
using an OVERWRITE command like you say.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Reporter: James Vaughan
>Priority: Minor
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HIVE-6131) New columns after table alter result in null values despite data

2014-01-06 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863231#comment-13863231
 ] 

Xuefu Zhang commented on HIVE-6131:
---

Can you please clarify in your step #6 whether you're editing local file or 
HDFS file? If you're editing your local file, then you will need to load the 
data again as you do in step #3, or manually replace the file on HDFS with your 
local edited file.

> New columns after table alter result in null values despite data
> 
>
> Key: HIVE-6131
> URL: https://issues.apache.org/jira/browse/HIVE-6131
> Project: Hive
>  Issue Type: Bug
>Reporter: James Vaughan
>Priority: Minor
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding 
> columns to tables with Partitions using 'REPLACE COLUMNS'.  I dug through the 
> Jira a little bit and didn't see anything for it so hopefully this isn't just 
> noise on the radar.
> Basically, when you alter a table with partitions and then reupload data to 
> that partition, it doesn't seem to recognize the extra data that actually 
> exists in HDFS- as in, returns NULL values on the new column despite having 
> the data and recognizing the new column in the metadata.
> Here's some steps to reproduce using a basic table:
> 1.  Run this hive command:  CREATE TABLE jvaughan_test (col1 string) 
> partitioned by (day string);
> 2.  Create a simple file on the system with a couple of entries, something 
> like "hi" and "hi2" separated by newlines.
> 3.  Run this hive command, pointing it at the file:  LOAD DATA LOCAL INPATH 
> '' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4.  Confirm the data with:  SELECT * FROM jvaughan_test WHERE day = 
> '2014-01-02';
> 5.  Alter the column definitions:  ALTER TABLE jvaughan_test REPLACE COLUMNS 
> (col1 string, col2 string);
> 6.  Edit your file and add a second column using the default separator 
> (ctrl+v, then ctrl+a in Vim) and add two more entries, such as "hi3" on the 
> first row and "hi4" on the second
> 7.  Run step 3 again
> 8.  Check the data again like in step 4
> For me, this is the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-01';
> OK
> hiNULL2014-01-02
> hi2   NULL2014-01-02
> This is despite the fact that there is data in the file stored by the 
> partition in HDFS.
> Let me know if you need any other information.  The only workaround for me 
> currently is to drop partitions for any I'm replacing data in and THEN 
> reupload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)