[jira] [Comment Edited] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter
[ https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455520#comment-16455520 ]

Aaron Fabbri edited comment on HIVE-16295 at 4/26/18 11:08 PM:
---------------------------------------------------------------

This is a really cool prototype [~stakiar], thank you for doing this. I don't have much Hive knowledge but will try to spend some more time looking at the code. I'm also happy to work w/ [~ste...@apache.org] on stabilizing the _SUCCESS file manifest (which enumerates the files committed) if that works for your dynamic partitioning problem.

edit: need more coffee.

was (Author: fabbri):
This is a really cool prototype [~stakiar], thank you for doing this. I don't have much Hive knowledge but will try to spend some more time looking at the code. I'm also happy to work w/ [~ste...@apache.org] on stabilizing the _SUCCESS file manifest (which enumerates the uploaded-but-not-completed multipart uploads to S3) if that works for your dynamic partitioning problem.

> Add support for using Hadoop's S3A OutputCommitter
> --------------------------------------------------
>
>                 Key: HIVE-16295
>                 URL: https://issues.apache.org/jira/browse/HIVE-16295
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-16295.1.WIP.patch, HIVE-16295.2.WIP.patch
>
> Hive doesn't have integration with Hadoop's {{OutputCommitter}}; it uses a
> {{NullOutputCommitter}} and its own commit logic spread across
> {{FileSinkOperator}}, {{MoveTask}}, and {{Hive}}.
>
> The Hadoop community is building an {{OutputCommitter}} that integrates with
> S3Guard and does a safe, coordinated commit of data on S3 inside individual
> tasks (HADOOP-13786). If Hive can integrate with this new {{OutputCommitter}},
> there would be a lot of benefits for Hive-on-S3:
> * Data is only written once; directly committing data at the task level means
> no renames are necessary
> * The commit is done safely, in a coordinated manner; duplicate tasks (from
> task retries or speculative execution) should not step on each other
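For context on the manifest discussed above, here is a minimal sketch of how a downstream consumer (for example, Hive's dynamic-partition handling) might read the committed-file list back out of the _SUCCESS manifest. It assumes the Hadoop trunk code from HADOOP-13786, specifically the {{org.apache.hadoop.fs.s3a.commit.files.SuccessData}} class and its {{load}}/{{getFilenames}} accessors; the bucket and table path are made-up placeholders. Treat it as an illustration of the idea, not the final, stabilized manifest format.

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.commit.files.SuccessData;

public class SuccessManifestReader {
  public static void main(String[] args) throws Exception {
    // Placeholder destination; substitute the job's real output directory.
    Path successFile = new Path("s3a://example-bucket/warehouse/my_table/_SUCCESS");

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);

    // SuccessData is the JSON manifest the S3A committers write in place of
    // the traditional zero-byte _SUCCESS marker (HADOOP-13786).
    SuccessData manifest = SuccessData.load(fs, successFile);

    System.out.println("Committer: " + manifest.getCommitter());
    // The files the job committed; a dynamic-partitioning consumer could
    // derive the set of partition directories from these paths.
    for (String file : manifest.getFilenames()) {
      System.out.println("Committed file: " + file);
    }
  }
}
{code}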
[jira] [Commented] (HIVE-16295) Add support for using Hadoop's S3A OutputCommitter
[ https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455520#comment-16455520 ]

Aaron Fabbri commented on HIVE-16295:
--------------------------------------

This is a really cool prototype [~stakiar], thank you for doing this. I don't have much Hive knowledge but will try to spend some more time looking at the code. I'm also happy to work w/ [~ste...@apache.org] on stabilizing the _SUCCESS file manifest (which enumerates the uploaded-but-not-completed multipart uploads to S3) if that works for your dynamic partitioning problem.
[jira] [Commented] (HIVE-16295) Add support for using Hadoop's OutputCommitter
[ https://issues.apache.org/jira/browse/HIVE-16295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273948#comment-16273948 ]

Aaron Fabbri commented on HIVE-16295:
--------------------------------------

Just FYI for watchers: the S3 Output Committer has been merged to trunk in Hadoop Common (HADOOP-13786).
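As a rough illustration of what opting into the merged committer looks like on the client side, the sketch below sets the configuration keys that the trunk hadoop-aws module documents for routing s3a output through the new committer factory ({{mapreduce.outputcommitter.factory.scheme.s3a}}, {{fs.s3a.committer.name}}, {{fs.s3a.committer.staging.conflict-mode}}). The exact property names and values should be double-checked against the hadoop-aws docs for the release actually in use, and Hive would still need the integration work tracked in this JIRA to drive its own commit path through such a committer.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class S3ACommitterConfig {
  public static Configuration withS3ACommitter(Configuration conf) {
    // Route committer creation for s3a:// destinations through the S3A
    // committer factory instead of the default FileOutputCommitter.
    conf.set("mapreduce.outputcommitter.factory.scheme.s3a",
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
    // Pick a concrete committer; "directory" and "partitioned" are the
    // staging variants, "magic" writes directly to the destination.
    conf.set("fs.s3a.committer.name", "directory");
    // What to do if output already exists under the destination
    // (relevant when re-running over existing partitions).
    conf.set("fs.s3a.committer.staging.conflict-mode", "append");
    return conf;
  }
}
{code}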
[jira] [Commented] (HIVE-13778) DROP TABLE PURGE on S3A table with too many files does not delete the files
[ https://issues.apache.org/jira/browse/HIVE-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319577#comment-15319577 ]

Aaron Fabbri commented on HIVE-13778:
--------------------------------------

Thanks. You could also resolve this as "duplicated by".

> DROP TABLE PURGE on S3A table with too many files does not delete the files
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-13778
>                 URL: https://issues.apache.org/jira/browse/HIVE-13778
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>            Reporter: Sailesh Mukil
>            Priority: Critical
>              Labels: metastore, s3
>
> I've noticed that when we do a DROP TABLE tablename PURGE on a table on S3A
> that has many files, the files never get deleted. However, the Hive metastore
> logs do say that the path was deleted:
> "Not moving [path] to trash"
> "Deleted the diretory [path]"
> I initially thought that this was due to the eventually consistent nature of
> S3 deletes; however, a week later, the files still exist.
[jira] [Commented] (HIVE-13778) DROP TABLE PURGE on S3A table with too many files does not delete the files
[ https://issues.apache.org/jira/browse/HIVE-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303387#comment-15303387 ]

Aaron Fabbri commented on HIVE-13778:
--------------------------------------

[~sailesh], can you assign this to me please? I will resolve it.
[jira] [Comment Edited] (HIVE-13778) DROP TABLE PURGE on S3A table with too many files does not delete the files
[ https://issues.apache.org/jira/browse/HIVE-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303380#comment-15303380 ]

Aaron Fabbri edited comment on HIVE-13778 at 5/27/16 3:01 AM:
---------------------------------------------------------------

Note this is the same as [IMPALA-3558|https://issues.cloudera.org/projects/IMPALA/issues/IMPALA-3558]. See that issue for my explanation that this is expected behavior.

was (Author: fabbri):
Note this is the same as [IMPALA-3558|https://issues.cloudera.org/projects/IMPALA/issues/IMPALA-3558]
[jira] [Commented] (HIVE-13778) DROP TABLE PURGE on S3A table with too many files does not delete the files
[ https://issues.apache.org/jira/browse/HIVE-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303380#comment-15303380 ]

Aaron Fabbri commented on HIVE-13778:
--------------------------------------

Note this is the same as [IMPALA-3558|https://issues.cloudera.org/projects/IMPALA/issues/IMPALA-3558].
[jira] [Commented] (HIVE-13778) DROP TABLE PURGE on S3A table with too many files does not delete the files
[ https://issues.apache.org/jira/browse/HIVE-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294575#comment-15294575 ]

Aaron Fabbri commented on HIVE-13778:
--------------------------------------

Thanks for the details [~sailesh]. The NameNode should not be involved with s3a paths. Can you re-run with some s3a logging on? i.e. {{org.apache.hadoop.fs.s3a=DEBUG}}
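The usual way to turn that on is a log4j.properties line such as {{log4j.logger.org.apache.hadoop.fs.s3a=DEBUG}}. Purely as an illustrative sketch, and assuming the process is using the Log4j 1.x that Hadoop of this vintage bundles, the programmatic equivalent would be:

{code:java}
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class S3ADebugLogging {
  public static void enable() {
    // Equivalent of adding "log4j.logger.org.apache.hadoop.fs.s3a=DEBUG"
    // to log4j.properties; enables request-level tracing in the S3A client.
    Logger.getLogger("org.apache.hadoop.fs.s3a").setLevel(Level.DEBUG);
  }
}
{code}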