[ 
https://issues.apache.org/jira/browse/IMPALA-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489375#comment-16489375
 ] 

Philip Zeyliger commented on IMPALA-6119:
-----------------------------------------

Does it make sense to allow two partitions to have the same location? Is 
Impala's behavior consistent with Hive and Spark when this happens?

> Inconsistent file metadata updates when multiple partitions point to the same 
> path
> ----------------------------------------------------------------------------------
>
>                 Key: IMPALA-6119
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6119
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0
>            Reporter: bharath v
>            Assignee: Gabor Kaszab
>            Priority: Critical
>              Labels: correctness, ramp-up
>
> The following steps can give inconsistent results.
> {noformat}
> // Create a partitioned table
> create table test(a int) partitioned by (b int);
> // Create two partitions b=1/b=2 mapped to the same HDFS location.
> insert into test partition(b=1) values (1);
> alter table test add partition (b=2) location 'hdfs://localhost:20500/test-warehouse/test/b=1/';
> [localhost:21000] > show partitions test;
> Query: show partitions test
> +-------+-------+--------+------+--------------+-------------------+--------+-------------------+------------------------------------------------+
> | b     | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location                                       |
> +-------+-------+--------+------+--------------+-------------------+--------+-------------------+------------------------------------------------+
> | 1     | -1    | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | false             | hdfs://localhost:20500/test-warehouse/test/b=1 |
> | 2     | -1    | 1      | 2B   | NOT CACHED   | NOT CACHED        | TEXT   | false             | hdfs://localhost:20500/test-warehouse/test/b=1 |
> | Total | -1    | 2      | 4B   | 0B           |                   |        |                   |                                                |
> +-------+-------+--------+------+--------------+-------------------+--------+-------------------+------------------------------------------------+
> // Insert new data into one of the partitions
> insert into test partition(b=1) values (2);
> // The newly added file is reflected only in partition b=1, where it was inserted.
> show files in test;
> Query: show files in test
> +----------------------------------------------------------------------------------------------------+------+-----------+
> | Path                                                                                               | Size | Partition |
> +----------------------------------------------------------------------------------------------------+------+-----------+
> | hdfs://localhost:20500/test-warehouse/test/b=1/2e44cd49e8c3d30d-572fc97800000000_627280230_data.0. | 2B   | b=1       |
> | hdfs://localhost:20500/test-warehouse/test/b=1/e44245ad5c0ef020-a08716d00000000_1244237483_data.0. | 2B   | b=1       |
> | hdfs://localhost:20500/test-warehouse/test/b=1/e44245ad5c0ef020-a08716d00000000_1244237483_data.0. | 2B   | b=2       |
> +----------------------------------------------------------------------------------------------------+------+-----------+
> invalidate metadata test;
> // After invalidation, the newly added file shows up in both partitions.
> show files in test;
> Query: show files in test
> +----------------------------------------------------------------------------------------------------+------+-----------+
> | Path                                                                                               | Size | Partition |
> +----------------------------------------------------------------------------------------------------+------+-----------+
> | hdfs://localhost:20500/test-warehouse/test/b=1/2e44cd49e8c3d30d-572fc97800000000_627280230_data.0. | 2B   | b=1       |
> | hdfs://localhost:20500/test-warehouse/test/b=1/e44245ad5c0ef020-a08716d00000000_1244237483_data.0. | 2B   | b=1       |
> | hdfs://localhost:20500/test-warehouse/test/b=1/2e44cd49e8c3d30d-572fc97800000000_627280230_data.0. | 2B   | b=2       |
> | hdfs://localhost:20500/test-warehouse/test/b=1/e44245ad5c0ef020-a08716d00000000_1244237483_data.0. | 2B   | b=2       |
> +----------------------------------------------------------------------------------------------------+------+-----------+
> {noformat}
> So, depending on whether the user invalidates the table, they can see 
> different results. The bug is in the following code.
> {noformat}
> private FileMetadataLoadStats resetAndLoadFileMetadata(
>       Path partDir, List<HdfsPartition> partitions) throws IOException {
>     FileMetadataLoadStats loadStats = new FileMetadataLoadStats(partDir);
> ....
> ....
> ....
>     for (HdfsPartition partition: partitions)
>       partition.setFileDescriptors(newFileDescs);  <======
> {noformat}
> We update the added file metadata only for the new partition (in a 
> copy-on-write fashion). Instead, we should update the source descriptors so 
> that the change is reflected in the other partitions too.
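
To make the symptom concrete, here is a minimal sketch of the two behaviours. 
It does not use Impala's catalog code: SharedLocationSketch, Partition, and 
both refresh methods are hypothetical stand-ins, and it assumes that the 
refresh after an insert reaches only the partition that was written to.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; these are NOT Impala's real catalog classes.
class SharedLocationSketch {

  // Stand-in for a partition holding its own list of file descriptors.
  static class Partition {
    final String name;
    List<String> fileDescriptors = new ArrayList<>();

    Partition(String name) {
      this.name = name;
    }
  }

  // Assumed buggy behaviour: only the partition that received the insert
  // gets a fresh copy of the directory listing.
  static void refreshWrittenPartitionOnly(Partition written, List<String> filesOnDisk) {
    written.fileDescriptors = new ArrayList<>(filesOnDisk);
  }

  // Suggested direction: every partition mapped to the same directory
  // gets the refreshed listing.
  static void refreshAllPartitionsAtLocation(
      List<Partition> sameLocation, List<String> filesOnDisk) {
    for (Partition p : sameLocation) {
      p.fileDescriptors = new ArrayList<>(filesOnDisk);
    }
  }

  public static void main(String[] args) {
    Partition b1 = new Partition("b=1");
    Partition b2 = new Partition("b=2");

    // Both partitions initially see the single file in the shared directory.
    List<String> disk = new ArrayList<>(List.of("file_1"));
    refreshAllPartitionsAtLocation(List.of(b1, b2), disk);

    // An insert into b=1 adds a second file to the shared directory.
    disk.add("file_2");
    refreshWrittenPartitionOnly(b1, disk);
    System.out.println("after insert: b=1 sees " + b1.fileDescriptors.size()
        + " file(s), b=2 sees " + b2.fileDescriptors.size());

    // Reloading every partition at the location makes them agree again.
    refreshAllPartitionsAtLocation(List.of(b1, b2), disk);
    System.out.println("after reload: b=1 sees " + b1.fileDescriptors.size()
        + " file(s), b=2 sees " + b2.fileDescriptors.size());
  }
}
```

In this model the two partitions disagree until a full reload, which mirrors 
the difference between the two "show files" outputs above.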



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
