[ 
https://issues.apache.org/jira/browse/PARQUET-408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060737#comment-15060737
 ] 

Ryan Blue commented on PARQUET-408:
-----------------------------------

[~tu...@avast.com], thank you for the very detailed report of what you're 
seeing. Bug reports like this are much easier to work with.

I'm not sure I'm convinced that this is a bug in Parquet. If I understand 
correctly, you're using a Java 7 try-with-resources block to instantiate the 
parquet-avro writer, which calls close on that writer when it exits. That 
happens only when the thread is signaled via the AtomicBoolean {{finished}}. 
{{finished}} is set in your shutdown hook, so when your hook runs, it 
gracefully shuts down the writing thread, which closes the file. But the error 
you're seeing comes from closing the Parquet file after the underlying file 
stream has already been closed. Parquet doesn't register a shutdown hook that 
would cause that; I think [HDFS 
does|https://github.com/apache/hadoop/blob/cbc7b6bf97a80c39d4bbb3005e42dacae6726baf/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L2873].
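
Roughly, I'm picturing something like this (a minimal sketch only; the writer 
setup and the helper names {{path}}, {{schema}} and {{nextRecord()}} are 
illustrative, not taken from your attachment):

{noformat}
// Sketch of the reported setup: a writer thread that stops when an
// AtomicBoolean is flipped by the application's JVM shutdown hook.
final AtomicBoolean finished = new AtomicBoolean(false);

Thread writerThread = new Thread(() -> {
    // try-with-resources closes the ParquetWriter when the loop exits
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
            .<GenericRecord>builder(path)   // path, schema, nextRecord()
            .withSchema(schema)             // are placeholders
            .build()) {
        while (!finished.get()) {
            writer.write(nextRecord());
        }
    } catch (IOException e) {
        // writer.close() can fail here if HDFS has already shut down
        e.printStackTrace();
    }
});
writerThread.start();

Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    finished.set(true);          // signal the writer loop to stop
    try {
        writerThread.join();     // wait until the writer has closed the file
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}));
{noformat}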

I think the problem is that your hook gets called only after the HDFS client 
has already closed all of its connections. That's solved by your work-around, 
which I think isn't a work-around so much as the only way to do what you want 
with HDFS. :)

Does that sound reasonable? I think the logging issues are unrelated, too.

> Shutdown hook in parquet-avro library corrupts data and disables logging
> ------------------------------------------------------------------------
>
>                 Key: PARQUET-408
>                 URL: https://issues.apache.org/jira/browse/PARQUET-408
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro
>    Affects Versions: 1.8.1
>            Reporter: Michal Turek
>             Fix For: 1.9.0
>
>         Attachments: parquet-broken-shutdown_2015-12-16.tar.gz
>
>
> Parquet-avro, and probably also other Parquet libraries, is not well 
> behaved. It registers a shutdown hook that bypasses the application's 
> shutdown sequence, corrupts data written to currently open Parquet file(s) 
> and disables or reconfigures the slf4j/logback logger so that no further 
> log messages are visible.
> h3. Scope
> Our application is a microservice that handles a stop request in the form 
> of the SIGTERM signal, i.e. a JVM shutdown hook. When it arrives, the 
> application closes all open files (writers), releases all other resources 
> and shuts down gracefully. We are switching from sequence files to Parquet 
> at the moment and are using the Maven dependency 
> {{org.apache.parquet:parquet-avro:1.8.1}}, which is the current latest 
> version. We use {{Runtime.getRuntime().addShutdownHook()}} to handle 
> SIGTERM.
> h3. Example code
> See the archive in the attachment.
> - Optionally update the version of {{hadoop-client}} in {{pom.xml}} to 
> match your Hadoop.
> - Use {{mvn package}} to compile.
> - Copy the Hadoop configuration XMLs to the {{config}} directory.
> - Update the configuration at the top of the {{ParquetBrokenShutdown}} 
> class.
> - Execute the {{ParquetBrokenShutdown}} class.
> - Send SIGTERM to shut down the application ({{kill PID}}).
> h3. Initial analysis
> The Parquet library tries to take care of application shutdown itself, but 
> this introduces more issues than it solves. If the application is writing 
> to a file and the library asynchronously decides to close the underlying 
> writer, data loss occurs: the handle is simply closed and the remaining 
> records can't be written.
> {noformat}
> Writing to HDFS/Parquet failed
> java.io.IOException: can not write PageHeader(type:DICTIONARY_PAGE, uncompressed_page_size:14, compressed_page_size:34, dictionary_page_header:DictionaryPageHeader(num_values:1, encoding:PLAIN))
>       at org.apache.parquet.format.Util.write(Util.java:224)
>       at org.apache.parquet.format.Util.writePageHeader(Util.java:61)
>       at org.apache.parquet.format.converter.ParquetMetadataConverter.writeDictionaryPageHeader(ParquetMetadataConverter.java:760)
>       at org.apache.parquet.hadoop.ParquetFileWriter.writeDictionaryPage(ParquetFileWriter.java:307)
>       at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:179)
>       at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:238)
>       at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:165)
>       at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
>       at org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:297)
>       at com.avast.bugreport.parquet.ParquetBrokenShutdown.writeParquetFile(ParquetBrokenShutdown.java:86)
>       at com.avast.bugreport.parquet.ParquetBrokenShutdown.run(ParquetBrokenShutdown.java:53)
>       at com.avast.bugreport.parquet.ParquetBrokenShutdown.main(ParquetBrokenShutdown.java:153)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:497)
>       at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> Caused by: parquet.org.apache.thrift.transport.TTransportException: java.nio.channels.ClosedChannelException
>       at parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
>       at parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:105)
>       at parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:424)
>       at parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:431)
>       at parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:194)
>       at parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBegin(TCompactProtocol.java:176)
>       at org.apache.parquet.format.InterningProtocol.writeFieldBegin(InterningProtocol.java:74)
>       at org.apache.parquet.format.PageHeader.write(PageHeader.java:918)
>       at org.apache.parquet.format.Util.write(Util.java:222)
>       ... 16 more
> Caused by: java.nio.channels.ClosedChannelException
>       at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1635)
>       at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:104)
>       at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
>       at java.io.DataOutputStream.write(DataOutputStream.java:107)
>       at parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
>       ... 24 more
> {noformat}
> The stored file is also corrupted in this case. It contains only the 
> four-byte "PAR1" header, so effectively all data are lost.
> On top of that, the whole application's slf4j/logback logging is 
> reconfigured or completely disabled, so no further application log messages 
> appear in the output. This may be related to PARQUET-369 and will hopefully 
> be fixed by PARQUET-401. No such issue was observed when sequence files 
> were used in the past.
> h3. Known workaround
> Java executes shutdown hook threads independently and in parallel, but it 
> is possible for the application to force its own shutdown hook to run 
> before Parquet's. The Parquet library uses Hadoop's {{ShutdownHookManager}} 
> to register its hook, and the manager allows a priority to be defined.
> {noformat}
> if (!USE_WORKAROUND) {
>       Runtime.getRuntime().addShutdownHook(shutdownHook);
> } else {
>       // Execute the application hook before the hook in the Hadoop library
>       int maxPriority = Integer.MAX_VALUE;
>       ShutdownHookManager.get().addShutdownHook(shutdownHook, maxPriority);
> }
> {noformat}
> This workaround would be impossible if there were two or more such nasty 
> libraries.
> No workaround has been found so far for the second issue, the disabled 
> logging.
> h3. Suggestions for fix
> - Never touch logging configuration in libraries.
> -- Use the slf4j-api interfaces and just emit log messages.
> -- The application is responsible for configuring logging.
> - Do not register any shutdown hook in a library that releases resources 
> which are possibly still in use.
> -- Instead, provide an API that lets the application close/free the 
> allocated resources (already present).
> -- The application is responsible for calling close on all Closeables 
> during shutdown, in the correct order and at the correct time; see the 
> sketch after this list.
> -- No library has enough information to do that.
> - Be a well-behaved library; do not try to be smarter than the application, 
> it isn't possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
