[ 
https://issues.apache.org/jira/browse/IMPALA-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar resolved IMPALA-9293.
----------------------------------
    Fix Version/s: Impala 3.4.0
       Resolution: Fixed

> Impala Doc: Revise explanation of HDFS trashcan usage on S3
> -----------------------------------------------------------
>
>                 Key: IMPALA-9293
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9293
>             Project: IMPALA
>          Issue Type: Task
>          Components: Docs
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>             Fix For: Impala 3.4.0
>
>
> The Impala docs state:
> {quote}
> By default, when you drop an internal (managed) table, the data files are 
> moved to the HDFS trashcan. This operation is expensive for tables that 
> reside on the Amazon S3 filesystem. Therefore, for S3 tables, prefer to use 
> DROP TABLE table_name PURGE rather than the default DROP TABLE statement. The 
> PURGE clause makes Impala delete the data files immediately, skipping the 
> HDFS trashcan.
> {quote}
> and
> {quote}
> The default DROP TABLE/PARTITION is slow because Impala copies the files to 
> the HDFS trash folder, and Impala waits until all the data is moved. DROP 
> TABLE/PARTITION .. PURGE is a fast delete operation, and the Impala statement 
> finishes quickly even though the change might not have propagated fully 
> throughout S3.
> {quote}
> The confusing part is "Impala copies the files to the HDFS trash folder". 
> Users might think that when a managed Impala table on S3 is dropped, Impala 
> actually copies the data from S3 to a trashcan folder *stored on HDFS*. This 
> isn't true. The term "HDFS trashcan" is used to refer to a feature of HDFS 
> where all deleted data is moved to a trash folder rather than being deleted 
> immediately. See 
> https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#File+Deletes+and+Undeletes
>  for details.
> What actually happens is that there is a trashcan folder on S3 itself, and 
> when an S3 managed table is dropped, the data is copied from the managed 
> table folder to the trashcan folder *stored on S3*.
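
To make the distinction concrete, here is a minimal Python sketch of the point above: the trash folder lives on the same filesystem as the table data. It is not Impala or Hadoop code. The function name trash_location, the bucket and host names, and the user "impala" are hypothetical, and the /user/<user>/.Trash/Current layout is assumed from the default Hadoop trash convention.

```python
from urllib.parse import urlsplit, urlunsplit


def trash_location(table_path: str, user: str) -> str:
    """Return an illustrative trash path on the same filesystem as table_path.

    Assumption: the trash root is /user/<user>/.Trash/Current on whichever
    filesystem holds the data, with the original path appended beneath it.
    """
    parts = urlsplit(table_path)
    trash_path = f"/user/{user}/.Trash/Current{parts.path}"
    # Scheme and authority (bucket or namenode) are carried over unchanged,
    # so data on S3 stays on S3 and data on HDFS stays on HDFS.
    return urlunsplit((parts.scheme, parts.netloc, trash_path, "", ""))


# Hypothetical table locations, one on S3 and one on HDFS.
print(trash_location("s3a://my-bucket/warehouse/sales", "impala"))
print(trash_location("hdfs://nn:8020/warehouse/sales", "impala"))
```

In this sketch the S3 table maps to a trash path inside the same bucket, never to an hdfs:// location. DROP TABLE table_name PURGE skips that step entirely.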



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
