[ https://issues.apache.org/jira/browse/COMPRESS-450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459610#comment-16459610 ]
Stefan Bodewig commented on COMPRESS-450:
-----------------------------------------

[~tijmen] could you have a look at the last two commits in the COMPRESS-450 branch - [https://github.com/apache/commons-compress/commits/COMPRESS-450] and verify this would work for you?

Also, could you provide a test case for the desired behavior? If not, I'll have to corrupt a test archive myself :)

> Enable skipping past invalid tar header entries
> -----------------------------------------------
>
>                 Key: COMPRESS-450
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-450
>             Project: Commons Compress
>          Issue Type: Improvement
>          Components: Archivers
>    Affects Versions: 1.16.1
>            Reporter: Tijmen R
>            Priority: Minor
>              Labels: newbie
>         Attachments: TarArchiveInputStream.java
>
>
> In TarArchiveInputStream::getNextTarEntry(), if reading and parsing the header
> fails, an IOException is thrown. State (e.g. currEntry) is not cleared, so
> getting any further entries/data from the archive is not possible.
> In our use case, we sometimes encounter corrupt tar archives where the data
> following a header (that specifies a non-zero data size) is completely or
> partly missing; for example, as for hdr_b in the stream:
>
> {noformat}
> ...[hdr_a][data_a1]...[data_an][hdr_b][hdr_c][data_c1][data_c2]...[data_cn]...{noformat}
>
> We have no influence on how these archives are created, so we cannot fix it on
> that side. However, it would be nice to be able to at least pick up reading
> the tar file at the next valid header it finds, so that most of the data
> can be retrieved. In other words, similar to the behaviour of GNU tar:
> * If reading/parsing the header fails, and no header was read successfully
> before, or the previous header read attempt failed as well, then fail
> completely.
> * Otherwise, if reading/parsing the header fails, throw an error. The next call
> to getNextTarEntry will read blocks until it finds one that has a valid
> header checksum, and try to parse that as a header.
>
> The attached version of TarArchiveInputStream does this.
>
> Some issues with this approach:
> * In the example stream given above, hdr_c and the subsequent blocks
> (depending on the data size specified in hdr_b) will already have been
> returned/read as data for b. However, that is also the case in the current
> version of TarArchiveInputStream.
> * So (at least) file c is lost, and the next entry to be picked up will
> likely be hdr_d (or even later). Data blocks that look like a tar header at
> first sight but (in the current context) actually aren't might be
> misinterpreted as headers (this can occur, for example, with a tar archive
> stored inside a main tar archive).
> * Currently, the code just throws an IOException with a different error
> message, as I didn't want to change the behaviour too much. But it would be a
> lot better to have a different exception (a child of IOException) for a "header
> parse" error, to distinguish it from a general IO exception when reading the
> underlying stream.
> * I'm not too sure about what to do in case of a "fatal" error (skip to the
> end of file?)
>
> Still, the above has been useful for us, and maybe this benefits others as
> well.
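
For illustration, the "read blocks until one has a valid header checksum" step described in the issue could be sketched roughly as below. This is not the attached TarArchiveInputStream.java, nor the code in the COMPRESS-450 branch. The class and method names (TarHeaderScan, hasValidChecksum, skipToNextHeader) are hypothetical and are not part of Commons Compress. The sketch assumes the standard tar layout of 512-byte blocks with an 8-byte octal checksum field at offset 148, and it only checks the unsigned byte sum.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Commons Compress.
public class TarHeaderScan {
    public static final int BLOCK_SIZE = 512;   // tar block size
    public static final int CHKSUM_OFFSET = 148; // offset of the checksum field
    public static final int CHKSUM_LEN = 8;      // length of the checksum field

    /**
     * Returns true if the octal value stored in the checksum field equals the
     * unsigned sum of all header bytes, with the checksum field itself counted
     * as eight spaces. (Assumption: signed-sum variants are ignored here.)
     */
    public static boolean hasValidChecksum(byte[] block) {
        if (block.length != BLOCK_SIZE) {
            return false;
        }
        int end = CHKSUM_OFFSET + CHKSUM_LEN;
        int i = CHKSUM_OFFSET;
        while (i < end && block[i] == ' ') {
            i++; // skip leading spaces
        }
        long stored = 0;
        int digits = 0;
        for (; i < end; i++) {
            byte b = block[i];
            if (b < '0' || b > '7') {
                break; // NUL, space or garbage terminates the number
            }
            stored = stored * 8 + (b - '0');
            digits++;
        }
        if (digits == 0) {
            return false; // e.g. an all-zero block or plain data
        }
        long sum = 0;
        for (int j = 0; j < BLOCK_SIZE; j++) {
            boolean inField = j >= CHKSUM_OFFSET && j < end;
            sum += inField ? ' ' : (block[j] & 0xFF);
        }
        return sum == stored;
    }

    /**
     * Reads whole blocks from the stream until one passes the checksum test.
     * Returns that block, or null if the stream ends first.
     */
    public static byte[] skipToNextHeader(InputStream in) throws IOException {
        byte[] block = new byte[BLOCK_SIZE];
        while (in.readNBytes(block, 0, BLOCK_SIZE) == BLOCK_SIZE) {
            if (hasValidChecksum(block)) {
                return block;
            }
        }
        return null;
    }
}
```

A block that passes this test is only a candidate header. As noted in the issue, data blocks from a tar archive nested inside the main archive would pass it as well, so a caller would still need to parse the block and handle a failure.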