[jira] [Commented] (HADOOP-16109) Parquet reading S3AFileSystem causes EOF

Steve Loughran (JIRA) Thu, 28 Feb 2019 08:13:15 -0800


    [ 
https://issues.apache.org/jira/browse/HADOOP-16109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780686#comment-16780686
 ]


Steve Loughran commented on HADOOP-16109:
-----------------------------------------

yeah, I can replicate this in a test
{code:java}
[ERROR] 
testReadPastReadahead(org.apache.hadoop.fs.contract.s3a.ITestS3AContractRandomSeek)
  Time elapsed: 3.061 s  <<< ERROR!
java.io.EOFException: End of file reached before reading fully.
        at 
org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:707)
        at 
org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:121)
        at 
org.apache.hadoop.fs.contract.s3a.ITestS3AContractSeek.testReadPastReadahead(ITestS3AContractSeek.java:79)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
        at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
        at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
        at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
        at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
        at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.lang.Thread.run(Thread.java:745)

[INFO] 

 {code}

> Parquet reading S3AFileSystem causes EOF
> ----------------------------------------
>
>                 Key: HADOOP-16109
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16109
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.1.0
>            Reporter: Dave Christianson
>            Assignee: Steve Loughran
>            Priority: Blocker
>
> When using S3AFileSystem to read Parquet files a specific set of 
> circumstances causes an  EOFException that is not thrown when reading the 
> same file from local disk
> Note this has only been observed under specific circumstances:
>   - when the reader is doing a projection (will cause it to do a seek 
> backwards and put the filesystem into random mode)
>  - when the file is larger than the readahead buffer size
>  - when the seek behavior of the Parquet reader causes the reader to seek 
> towards the end of the current input stream without reopening, such that the 
> next read on the currently open stream will read past the end of the 
> currently open stream.
> Exception from Parquet reader is as follows:
> {code}
> Caused by: java.io.EOFException: Reached the end of stream with 51 bytes left 
> to read
>  at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>  at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>  at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>  at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>  at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
>  at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>  at 
> org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormatBase.fetchNext(HadoopInputFormatBase.java:206)
>  at 
> org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormatBase.reachedEnd(HadoopInputFormatBase.java:199)
>  at 
> org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:190)
>  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
> The following example program generate the same root behavior (sans finding a 
> Parquet file that happens to trigger this condition) by purposely reading 
> past the already active readahead range on any file >= 1029 bytes in size.. 
> {code:java}
> final Configuration conf = new Configuration();
> conf.set("fs.s3a.readahead.range", "1K");
> conf.set("fs.s3a.experimental.input.fadvise", "random");
> final FileSystem fs = FileSystem.get(path.toUri(), conf);
> // forward seek reading across readahead boundary
> try (FSDataInputStream in = fs.open(path)) {
>     final byte[] temp = new byte[5];
>     in.readByte();
>     in.readFully(1023, temp); // <-- works
> }
> // forward seek reading from end of readahead boundary
> try (FSDataInputStream in = fs.open(path)) {
>  final byte[] temp = new byte[5];
>  in.readByte();
>  in.readFully(1024, temp); // <-- throws EOFException
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

[jira] [Commented] (HADOOP-16109) Parquet reading S3AFileSystem causes EOF

Reply via email to