[ https://issues.apache.org/jira/browse/IMPALA-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545825#comment-16545825 ]
Pranay Singh commented on IMPALA-6994:
--------------------------------------

Here is a description of the problem that this JIRA aims to fix.

Components in the Hadoop ecosystem are loosely coupled, so it is difficult to guarantee atomicity of operations that span multiple components (there are no distributed transactions across components). An INSERT in Impala may modify both HDFS and the Hive Meta Store. Unfortunately, if one of the steps in an INSERT fails, it may require human intervention to clean up.

Impala's flow of operations during an INSERT:
--------------------------------------------------------------
a) Process the SELECT portion and write the results into temporary, invisible HDFS files in parallel. Once the SELECT portion has completed and all temporary files have been written, the coordinator moves the temporary files into their permanent location in HDFS (the resulting files are no longer hidden).
b) The coordinator contacts the catalogd to update Impala's metadata cache with the new files and/or partitions.

On the catalogd:
1) The file and block metadata of existing partitions that were modified is refreshed.
2) New partitions are created in the Hive Meta Store, if necessary.
3) The table metadata (schema, location, etc.) is refreshed from the Hive Meta Store. This step is not needed if the INSERT writes to an existing partition.

So we can reduce the number of these error scenarios by not calling the Hive Meta Store in the cases where it is not needed.

> Avoid reloading a table's HMS data for file-only operations
> -----------------------------------------------------------
>
>                 Key: IMPALA-6994
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6994
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 2.12.0
>            Reporter: Balazs Jeszenszky
>            Assignee: Pranay Singh
>            Priority: Major
>
> Reloading file metadata for HDFS tables (e.g. as a final step in an 'insert')
> is done via
> https://github.com/apache/impala/blob/branch-2.12.0/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L628
> , which calls
> https://github.com/apache/impala/blob/branch-2.12.0/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L1243
> HdfsTable.load has no option to only load file metadata. HMS metadata will
> also be reloaded every time, which is an unnecessary overhead (and potential
> point of failure) when adding files to existing locations.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
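
For illustration only, the idea described in this issue can be sketched in Java as below: after an INSERT, always refresh file metadata, but contact the Hive Meta Store only when the INSERT created a partition the cache does not know about yet. This is a toy model under stated assumptions, not Impala's implementation. The class CachedTableSketch and the methods refreshAfterInsert, reloadFromHms and reloadFileMetadata are hypothetical names and do not correspond to the CatalogOpExecutor or HdfsTable APIs. The two counters stand in for the real HMS and HDFS calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a cached table; NOT Impala's actual classes.
class CachedTableSketch {
    int hmsLoads = 0;          // number of simulated Hive Meta Store round trips
    int fileMetadataLoads = 0; // number of simulated file/block metadata refreshes
    final List<String> partitions = new ArrayList<>();

    // Stand-in for reloading table metadata from the Hive Meta Store.
    private void reloadFromHms() {
        hmsLoads++;
    }

    // Stand-in for listing files and block locations in HDFS.
    private void reloadFileMetadata() {
        fileMetadataLoads++;
    }

    // Refresh the cache after an INSERT into the given partition.
    // HMS is contacted only when the partition is new to the cache;
    // file metadata is refreshed in every case.
    void refreshAfterInsert(String partition) {
        boolean isNewPartition = !partitions.contains(partition);
        if (isNewPartition) {
            partitions.add(partition);
            reloadFromHms();
        }
        reloadFileMetadata();
    }

    public static void main(String[] args) {
        CachedTableSketch table = new CachedTableSketch();
        table.refreshAfterInsert("year=2018"); // new partition: HMS and files
        table.refreshAfterInsert("year=2018"); // existing partition: files only
        System.out.println("HMS loads: " + table.hmsLoads
                + ", file metadata loads: " + table.fileMetadataLoads);
    }
}
```

In this sketch the second INSERT into the same partition does not touch the simulated HMS at all, which is the overhead and failure point the issue proposes to avoid.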