[ 
https://issues.apache.org/jira/browse/IMPALA-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-8561:
----------------------------------
    Description: 
The file handle cache relies on the mtime to distinguish between different 
versions of a file. For example, if file X exists with mtime=1 and is then 
overwritten, with the metadata updated so that it is now at mtime=2, the file 
handle cache treats the two versions as completely different and never uses a 
single file handle to serve both. However, some codepaths generate ScanRanges 
with an mtime of -1. This removes the ability to distinguish the two versions 
of a file and can lead to consistency problems.
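As a sketch of the mechanism (illustrative only, with hypothetical names, not 
Impala's actual FileHandleCache API): the cache is effectively keyed on 
(filename, mtime), so two versions of the same path live under different keys, 
and an mtime of -1 gives up that discrimination:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch: keying handles on (filename, mtime) keeps two
// versions of the same path in separate entries. An mtime of -1 is just
// another key value, so it cannot tell the versions apart -- a handle cached
// under -1 could have come from any version of the file.
using CacheKey = std::pair<std::string, int64_t>;  // (filename, mtime)

struct FileHandleCacheSketch {
  std::map<CacheKey, int> handles_;  // int stands in for a cached handle

  void Put(const std::string& name, int64_t mtime, int handle) {
    handles_[{name, mtime}] = handle;
  }

  // Returns the cached handle, or -1 if there is no entry for this version.
  int Get(const std::string& name, int64_t mtime) const {
    auto it = handles_.find({name, mtime});
    return it == handles_.end() ? -1 : it->second;
  }
};
```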

A specific example is the code that reads the Parquet footer, 
[HdfsParquetScanner::ProcessFooter()|https://github.com/apache/impala/blob/832c9de7810b47b5f782bccb761e07264e7548e5/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1354].
 We don't know ahead of time how big the Parquet footer is, so we read the 
last 100KB of the file (determined by 
[FOOTER_SIZE|https://github.com/apache/impala/blob/449fe73d2145bd22f0f857623c3652a097f06d73/be/src/exec/hdfs-scanner.h#L331]).
 If the footer size encoded in the last few bytes of the file indicates that 
the footer is larger than that ([code 
here|https://github.com/apache/impala/blob/832c9de7810b47b5f782bccb761e07264e7548e5/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1414]),
 then we issue a separate read for the actual size of the footer. That 
separate read does not inherit the mtime of the original read and instead 
uses an mtime of -1. I verified this by adding tracing and running a SELECT 
against functional_parquet.widetable_1000_cols.
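For context, the footer layout this two-step read relies on is fixed by the 
Parquet format: a file ends with a 4-byte little-endian metadata length 
followed by the 4-byte magic "PAR1". A minimal sketch of the decision 
(hypothetical helper names, assuming a little-endian host):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Parquet files end with a 4-byte little-endian metadata length followed by
// the magic bytes "PAR1". Returns the metadata length parsed from the tail
// of the buffer, or -1 if the magic bytes are wrong.
inline int64_t ParseFooterLength(const uint8_t* buf, size_t len) {
  if (len < 8) return -1;
  if (std::memcmp(buf + len - 4, "PAR1", 4) != 0) return -1;
  uint32_t metadata_len;
  std::memcpy(&metadata_len, buf + len - 8, 4);  // little-endian host assumed
  return static_cast<int64_t>(metadata_len);
}

// True if the initial tail read (e.g. the 100KB FOOTER_SIZE buffer) already
// contains the whole footer; false means a second, larger read is required
// -- the read that currently loses the mtime.
inline bool FooterFitsInBuffer(int64_t metadata_len, size_t buffer_len) {
  return metadata_len >= 0 &&
         static_cast<uint64_t>(metadata_len) + 8 <= buffer_len;
}
```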

A failure scenario: we read the last 100KB using a ScanRange with mtime=2, 
then find that the footer is larger than 100KB and issue a second ScanRange 
with mtime=-1. Because mtime=-1 cannot be matched to a file version, this 
read may use a file handle from a previous version of the file (equivalent to 
mtime=1). The data it reads may not come from the end of the file, or it may 
be at the end of the file but with a footer of a different length. (There is 
no validation on the new read to check the magic value or the metadata size 
reported by the new buffer.) Either case results in a failure to deserialize 
the Thrift for the footer. For example, a problem case could produce an error 
message like:

 
{noformat}
File hdfs://test-warehouse/example_file.parq of length 1048576 bytes has 
invalid file metadata at file offset 462017. Error = couldn't deserialize 
thrift msg:
TProtocolException: Invalid data
.{noformat}
To fix this, we should examine all locations that can result in ScanRanges with 
mtime=-1 and eliminate any that we can. For example, the 
HdfsParquetScanner::ProcessFooter() code should create a ScanRange that 
inherits the mtime from the original footer ScanRange. Also, the file handle 
cache should refuse to cache file handles with mtime=-1.
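The two proposed fixes could look roughly like this (names are hypothetical 
sketches, not Impala's actual ScanRange or cache API):

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for a scan range descriptor.
struct ScanRangeSketch {
  std::string filename;
  int64_t offset;
  int64_t len;
  int64_t mtime;  // -1 means "file version unknown"
};

// Fix 1: the follow-up footer read inherits the mtime of the range that
// triggered it, instead of defaulting to -1.
inline ScanRangeSketch MakeFooterRange(const ScanRangeSketch& original,
                                       int64_t offset, int64_t len) {
  return {original.filename, offset, len, original.mtime};
}

// Fix 2: the file handle cache declines to cache handles whose version is
// unknown, so a stale handle can never be served for an mtime=-1 read.
inline bool ShouldCacheHandle(int64_t mtime) { return mtime != -1; }
```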

The code in HdfsParquetScanner::ProcessFooter() should also validate the 
magic value and the metadata size when reading a footer larger than 100KB, to 
verify that we are reading something valid. The Thrift deserialization 
failure gives some information, but catching this case more specifically 
would produce a better error message.
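A minimal sketch of such a check (hypothetical helper, assuming a 
little-endian host):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical pre-deserialization check for the follow-up footer read:
// before handing the buffer to the Thrift deserializer, confirm that it
// still ends in the "PAR1" magic and that the metadata length embedded in
// the buffer matches the length computed from the first read. A stale or
// misdirected read then fails fast with a targeted error rather than a
// generic Thrift deserialization failure.
inline bool ValidateFooterBuffer(const uint8_t* buf, size_t len,
                                 uint32_t expected_metadata_len) {
  if (len < 8) return false;
  if (std::memcmp(buf + len - 4, "PAR1", 4) != 0) return false;
  uint32_t metadata_len;
  std::memcpy(&metadata_len, buf + len - 8, 4);
  return metadata_len == expected_metadata_len;
}
```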

h2. Workarounds
* This is most often caused by overwriting files in-place (e.g. INSERT 
OVERWRITE from Hive) without refreshing the metadata. You can avoid the issue 
by avoiding these in-place rewrites or by consistently running REFRESH <tbl> in 
Impala after the modifications.
* Setting --max_cached_file_handles=0 in the impalad startup options can work 
around the issue, at the cost of performance.



 

> ScanRanges with mtime=-1 can lead to inconsistent reads when using the file 
> handle cache
> ----------------------------------------------------------------------------------------
>
>                 Key: IMPALA-8561
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8561
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.3.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Blocker
>             Fix For: Impala 3.3.0
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
