[ https://issues.apache.org/jira/browse/IMPALA-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong updated IMPALA-8561:
----------------------------------
Description:

The file handle cache relies on the mtime to distinguish between different versions of a file. For example, if file X exists with mtime=1, is then overwritten, and the metadata is updated so that it is now at mtime=2, the file handle cache treats the two versions as completely different files and never uses a single file handle to serve both. However, some codepaths generate ScanRanges with an mtime of -1. This removes the ability to distinguish the two versions of a file and can lead to consistency problems.

A specific example is the code that reads the Parquet footer, [HdfsParquetScanner::ProcessFooter()|https://github.com/apache/impala/blob/832c9de7810b47b5f782bccb761e07264e7548e5/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1354]. We don't know ahead of time how big the Parquet footer is, so we read 100KB (determined by [FOOTER_SIZE|https://github.com/apache/impala/blob/449fe73d2145bd22f0f857623c3652a097f06d73/be/src/exec/hdfs-scanner.h#L331]). If the footer size encoded in the last few bytes of the file indicates that the footer is larger than that ([code here|https://github.com/apache/impala/blob/832c9de7810b47b5f782bccb761e07264e7548e5/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1414]), we issue a separate read for the actual size of the footer. That separate read does not inherit the mtime of the original read and instead uses an mtime of -1. I verified this by adding tracing and issuing a select against functional_parquet.widetable_1000_cols.

A failure scenario associated with this: we read the last 100KB using a ScanRange with mtime=2, then find that the footer is larger than 100KB and issue a ScanRange with mtime=-1. This can use a cached file handle from a previous version of the file, equivalent to mtime=1.
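For reference, the Parquet format places the footer at the end of the file: the last 8 bytes are a 4-byte little-endian metadata length followed by the 4-byte magic "PAR1". A minimal sketch of how a reader decides whether the initial speculative read covered the footer (the constant and function names here are illustrative, not Impala's actual identifiers):

```cpp
#include <cstdint>
#include <cstring>

// Illustrative constant: the reader speculatively fetches this much
// from the end of the file, hoping it covers the whole footer.
constexpr int64_t kSpeculativeFooterSize = 100 * 1024;

// Parquet file tail layout: [metadata][4-byte LE metadata_len]["PAR1"].
// Given the last bytes of the file, return the encoded metadata length,
// or -1 if the magic value does not match.
int64_t ParseFooterMetadataLen(const uint8_t* tail, int64_t tail_len) {
  if (tail_len < 8) return -1;
  if (memcmp(tail + tail_len - 4, "PAR1", 4) != 0) return -1;
  uint32_t metadata_len;
  // Assumes a little-endian host, which matches the on-disk encoding.
  memcpy(&metadata_len, tail + tail_len - 8, 4);
  return metadata_len;
}

// True if the speculative read did NOT cover the footer, so a second,
// larger read is required -- in Impala this is the read that should
// inherit the original ScanRange's mtime instead of using -1.
bool NeedsSecondFooterRead(int64_t metadata_len) {
  return metadata_len + 8 > kSpeculativeFooterSize;
}
```

In the failure scenario above, that second read is the one that can be served by a stale cached handle, because its mtime=-1 matches any cached version of the file.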
The data it is reading may not come from the end of the file, or it may be at the end of the file but with a footer of a different length. (There is no validation on the new read to check the magic value or the metadata size reported by the new buffer.) Either case results in a failure to deserialize the thrift for the footer. For example, a problem case could produce an error message like:

{noformat}
File hdfs://test-warehouse/example_file.parq of length 1048576 bytes has invalid file metadata at file offset 462017. Error = couldn't deserialize thrift msg: TProtocolException: Invalid data .
{noformat}

To fix this, we should examine all locations that can produce ScanRanges with mtime=-1 and eliminate any that we can. For example, the HdfsParquetScanner::ProcessFooter() code should create a ScanRange that inherits the mtime from the original footer ScanRange. Also, the file handle cache should refuse to cache file handles with mtime=-1.

The code in HdfsParquetScanner::ProcessFooter() should also validate the magic value and metadata size when reading a footer larger than 100KB, to verify that we are reading something valid. The thrift deserialization failure gives some information, but catching this case more specifically would provide a better error message.

h2. Workarounds
* This is most often caused by overwriting files in place (e.g. INSERT OVERWRITE from Hive) without refreshing the metadata. You can avoid the issue by avoiding these in-place rewrites or by consistently running REFRESH <tbl> in Impala after the modifications.
* Setting --max_cached_file_handles=0 in the impalad startup options works around the issue, at the cost of performance.
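The extra validation proposed above can be sketched as follows. This is a hypothetical helper, not Impala's actual code: before handing the second, larger footer read to the thrift deserializer, it re-checks the magic value and confirms that the metadata length encoded in the new buffer matches the one that triggered the re-read, so a stale read is caught with a specific error instead of a generic thrift failure:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical validation for the second (larger) footer read. A
// mismatch here suggests a cached file handle served data from a
// different version of the file (e.g. an mtime=-1 handle).
bool ValidateRereadFooter(const uint8_t* buf, int64_t buf_len,
                          uint32_t expected_metadata_len,
                          std::string* err) {
  if (buf_len < 8 || memcmp(buf + buf_len - 4, "PAR1", 4) != 0) {
    *err = "missing PAR1 magic: stale or non-Parquet data";
    return false;
  }
  uint32_t metadata_len;
  memcpy(&metadata_len, buf + buf_len - 8, 4);  // little-endian host assumed
  if (metadata_len != expected_metadata_len) {
    *err = "footer length changed between reads: inconsistent file version";
    return false;
  }
  return true;
}
```

Either check failing would let the scanner report a targeted consistency error rather than the opaque "couldn't deserialize thrift msg" message shown above.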
> ScanRanges with mtime=-1 can lead to inconsistent reads when using the file
> handle cache
> ----------------------------------------------------------------------------------------
>
>                 Key: IMPALA-8561
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8561
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.3.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Blocker
>             Fix For: Impala 3.3.0
--
This message was sent by Atlassian Jira
(v8.3.4#803005)