[jira] [Comment Edited] (HIVE-22077) Inserting overwrite partitions clause does not clean directories while partitions' info is not stored in metadata

2020-05-22 Thread Jeffrey(Xilang) Yan (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113829#comment-17113829
 ] 

Jeffrey(Xilang) Yan edited comment on HIVE-22077 at 5/22/20, 7:57 AM:
--

We hit exactly the same issue in production. An INSERT OVERWRITE statement 
failed because of a Hive metastore lock, and retrying the statement did not 
remove the old data, which left a lot of duplicate data in HDFS. It is a 
nightmare now: we have to find every partition that has duplicate data.
Could someone help review this patch?

[~kgyrtkirk] [~jcamachorodriguez] [~mgergely] [~ashutoshc]



> Inserting overwrite partitions clause does not clean directories while 
> partitions' info is not stored in metadata
> -
>
> Key: HIVE-22077
> URL: https://issues.apache.org/jira/browse/HIVE-22077
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 1.1.1, 4.0.0, 2.3.4
> Reporter: Hui An
> Assignee: Hui An
> Priority: Major
> Attachments: HIVE-22077.patch.1
>
>
> INSERT OVERWRITE into a static partition may not clean the related HDFS 
> location if the partition's info is not stored in the metadata.
> Steps to reproduce this issue:
> 
> 1. Create a managed table:
> 
> {code:sql}
> CREATE TABLE `test`(
>   `id` string)
> PARTITIONED BY (
>   `dayno` string)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> LOCATION
>   'hdfs://test-dev-hdfs/user/hive/warehouse/test.db/test'
> TBLPROPERTIES (
>   'transient_lastDdlTime'='1564731656')
> {code}
> 
> 2. Create the partition's directory and put some data in it:
> 
> {code:bash}
> hdfs dfs -mkdir hdfs://test-dev-hdfs/user/hive/warehouse/test.db/test/dayno=20190802
> hdfs dfs -put test.data hdfs://test-dev-hdfs/user/hive/warehouse/test.db/test/dayno=20190802
> {code}
> 
> 3. Run INSERT OVERWRITE on partition dayno=20190802:
> 
> {code:sql}
> INSERT OVERWRITE TABLE test PARTITION(dayno='20190802')
> SELECT "some value";
> {code}
> 
> 4. The test.data file under the partition directory has not been deleted.
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HIVE-22077) Inserting overwrite partitions clause does not clean directories while partitions' info is not stored in metadata

2019-09-03 Thread Hui An (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921275#comment-16921275
 ] 

Hui An edited comment on HIVE-22077 at 9/3/19 9:21 AM:
---

[~kgyrtkirk] Could you please review this patch?





[jira] [Comment Edited] (HIVE-22077) Inserting overwrite partitions clause does not clean directories while partitions' info is not stored in metadata

2019-08-04 Thread Hui An (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898685#comment-16898685
 ] 

Hui An edited comment on HIVE-22077 at 8/5/19 1:46 AM:
---

This issue is caused by the loadPartitionInternal method in Hive.java:
{code:java}
Path oldPartPath = (oldPart != null) ? oldPart.getDataLocation() : null;
Path newPartPath = null;

if (inheritLocation) {
  newPartPath = genPartPathFromTable(tbl, partSpec, tblDataLocationPath);

  if (oldPart != null) {
    /*
     * If we are moving the partition across filesystem boundaries
     * inherit from the table properties. Otherwise (same filesystem) use the
     * original partition location.
     *
     * See: HIVE-1707 and HIVE-2117 for background
     */
    FileSystem oldPartPathFS = oldPartPath.getFileSystem(getConf());
    FileSystem loadPathFS = loadPath.getFileSystem(getConf());
    if (FileUtils.equalsFileSystem(oldPartPathFS, loadPathFS)) {
      newPartPath = oldPartPath;
    }
  }
} else {
  newPartPath = oldPartPath == null
      ? genPartPathFromTable(tbl, partSpec, tblDataLocationPath) : oldPartPath;
}
{code}
Actually, oldPart being null does not mean that the partition path does not 
exist in HDFS, but the code simply sets oldPartPath to null and passes that 
null value to the subsequent replaceFiles call.
I think we could just assign newPartPath to oldPartPath when oldPart is null. 
Might this cause other problems? Or should we check the partition's directory 
before the MR job runs and throw an error to the end user if there are files 
under it?
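
For illustration, here is a minimal sketch of the first option (fall back to 
the target partition path when the metastore has no entry for the partition). 
It is a simplified model that uses java.nio.file on a local temp directory 
instead of Hadoop's FileSystem. The names OverwriteFallbackSketch, 
resolveOldPartPath and cleanDirectory are hypothetical; this is not Hive code 
and may well differ from the attached patch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Simplified local-filesystem model of the overwrite cleanup; not Hive code. */
public class OverwriteFallbackSketch {

  /**
   * Chooses the directory to clean before an overwrite.
   * registeredPartPath is null when the metastore has no entry for the partition.
   */
  public static Path resolveOldPartPath(Path registeredPartPath, Path newPartPath) {
    // Assumed fallback: an unregistered partition may still have a directory
    // on disk, so clean the target directory instead of skipping the cleanup.
    return registeredPartPath != null ? registeredPartPath : newPartPath;
  }

  /** Deletes the regular files directly under dir and returns how many were removed. */
  public static int cleanDirectory(Path dir) throws IOException {
    if (dir == null || !Files.isDirectory(dir)) {
      return 0; // nothing to clean
    }
    List<Path> children;
    try (Stream<Path> listing = Files.list(dir)) {
      children = listing.collect(Collectors.toList());
    }
    int deleted = 0;
    for (Path child : children) {
      if (Files.isRegularFile(child)) {
        Files.delete(child);
        deleted++;
      }
    }
    return deleted;
  }

  public static void main(String[] args) throws IOException {
    // Recreate the reported situation: a partition directory with data in it
    // that was never registered as a partition.
    Path table = Files.createTempDirectory("test_table");
    Path partition = Files.createDirectories(table.resolve("dayno=20190802"));
    Files.write(partition.resolve("test.data"),
        "stale row\n".getBytes(StandardCharsets.UTF_8));

    // No registered partition, so the first argument is null.
    Path oldPartPath = resolveOldPartPath(null, partition);
    System.out.println("stale files removed: " + cleanDirectory(oldPartPath));
  }
}
```

In this model, a null oldPartPath would make the cleanup a no-op and leave 
test.data in place, which mirrors the reported behavior. Whether the same 
fallback is safe inside Hive for other load paths is the open question above.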




[jira] [Comment Edited] (HIVE-22077) Inserting overwrite partitions clause does not clean directories while partitions' info is not stored in metadata

2019-08-02 Thread Hui An (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898685#comment-16898685
 ] 

Hui An edited comment on HIVE-22077 at 8/2/19 8:14 AM:
---

This issue is caused by the loadPartitionInternal method in Hive.java:
{code:java}
Path oldPartPath = (oldPart != null) ? oldPart.getDataLocation() : null;
Path newPartPath = null;

if (inheritLocation) {
  newPartPath = genPartPathFromTable(tbl, partSpec, tblDataLocationPath);

  if (oldPart != null) {
    /*
     * If we are moving the partition across filesystem boundaries
     * inherit from the table properties. Otherwise (same filesystem) use the
     * original partition location.
     *
     * See: HIVE-1707 and HIVE-2117 for background
     */
    FileSystem oldPartPathFS = oldPartPath.getFileSystem(getConf());
    FileSystem loadPathFS = loadPath.getFileSystem(getConf());
    if (FileUtils.equalsFileSystem(oldPartPathFS, loadPathFS)) {
      newPartPath = oldPartPath;
    }
  }
} else {
  newPartPath = oldPartPath == null
      ? genPartPathFromTable(tbl, partSpec, tblDataLocationPath) : oldPartPath;
}
{code}
Actually, oldPart being null does not mean that the partition path does not 
exist in HDFS, but the code simply sets oldPartPath to null and passes that 
null value to the subsequent replaceFiles call.
I think we could just assign newPartPath to oldPartPath when oldPart is null. 
Might this cause other problems? Or should we check the partition's directory 
before the MR job runs and throw an error to the end user if there are files 
under it?




[jira] [Comment Edited] (HIVE-22077) Inserting overwrite partitions clause does not clean directories while partitions' info is not stored in metadata

2019-08-02 Thread Hui An (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-22077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898685#comment-16898685
 ] 

Hui An edited comment on HIVE-22077 at 8/2/19 8:12 AM:
---

This issue is caused by the loadPartitionInternal method in Hive.java:
{code:java}
Path oldPartPath = (oldPart != null) ? oldPart.getDataLocation() : null;
Path newPartPath = null;

if (inheritLocation) {
  newPartPath = genPartPathFromTable(tbl, partSpec, tblDataLocationPath);

  if (oldPart != null) {
    /*
     * If we are moving the partition across filesystem boundaries
     * inherit from the table properties. Otherwise (same filesystem) use the
     * original partition location.
     *
     * See: HIVE-1707 and HIVE-2117 for background
     */
    FileSystem oldPartPathFS = oldPartPath.getFileSystem(getConf());
    FileSystem loadPathFS = loadPath.getFileSystem(getConf());
    if (FileUtils.equalsFileSystem(oldPartPathFS, loadPathFS)) {
      newPartPath = oldPartPath;
    }
  }
} else {
  newPartPath = oldPartPath == null
      ? genPartPathFromTable(tbl, partSpec, tblDataLocationPath) : oldPartPath;
}
{code}
Actually, oldPart being null does not mean that the partition path does not 
exist in HDFS, but the code simply sets oldPartPath to null and passes that 
null value to the subsequent replaceFiles call.
I think we could just assign newPartPath to oldPartPath when oldPart is null, 
but might this cause other problems? Or should we check the partition's 
directory before the MR job runs and throw an error to the end user if there 
are files under it?


