[ 
https://issues.apache.org/jira/browse/COMPRESS-450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459610#comment-16459610
 ] 

Stefan Bodewig commented on COMPRESS-450:
-----------------------------------------

[~tijmen] could you have a look at the last two commits in the COMPRESS-450 
branch - [https://github.com/apache/commons-compress/commits/COMPRESS-450] and 
verify this would work for you? Also, could you provide a test case for the 
desired behavior? If not, I'll have to corrupt a test archive myself :)
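
For illustration, a minimal sketch of how such a corrupt test archive could be built with the plain JDK. It writes [hdr_a][data_a][hdr_b][hdr_c][data_c], where hdr_b declares two data blocks that are never written. The class and helper names (CorruptTarFixture, header, put, build) are hypothetical and not part of Commons Compress, and the header carries only the basic ustar fields.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical fixture builder; not part of Commons Compress.
public class CorruptTarFixture {
    static final int BLOCK = 512;

    /** Copies an ASCII string into dst starting at off. */
    static void put(byte[] dst, int off, String s) {
        byte[] b = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(b, 0, dst, off, b.length);
    }

    /** Builds one 512-byte ustar header for a regular file of the given size. */
    static byte[] header(String name, long size) {
        byte[] h = new byte[BLOCK];
        put(h, 0, name);                              // name (100 bytes)
        put(h, 100, "0000644\0");                     // mode
        put(h, 108, "0000000\0");                     // uid
        put(h, 116, "0000000\0");                     // gid
        put(h, 124, String.format("%011o\0", size));  // size in octal
        put(h, 136, String.format("%011o\0", 0L));    // mtime
        h[156] = '0';                                 // typeflag: regular file
        put(h, 257, "ustar\0");                       // magic
        put(h, 263, "00");                            // version
        // The checksum is the byte sum of the header with the checksum
        // field itself (offset 148, 8 bytes) counted as spaces.
        Arrays.fill(h, 148, 156, (byte) ' ');
        int sum = 0;
        for (byte b : h) {
            sum += b & 0xff;
        }
        put(h, 148, String.format("%06o\0 ", sum));
        return h;
    }

    /** Returns [hdr_a][data_a][hdr_b][hdr_c][data_c] plus the end-of-archive marker. */
    static byte[] build() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] data = new byte[BLOCK];
        Arrays.fill(data, (byte) 'x');
        out.write(header("a.txt", BLOCK), 0, BLOCK);
        out.write(data, 0, BLOCK);
        out.write(header("b.txt", 2 * BLOCK), 0, BLOCK); // data for b deliberately missing
        out.write(header("c.txt", BLOCK), 0, BLOCK);
        out.write(data, 0, BLOCK);
        out.write(new byte[2 * BLOCK], 0, 2 * BLOCK);    // two zero blocks end the archive
        return out.toByteArray();
    }
}
```

Wrapping the result of CorruptTarFixture.build() in a ByteArrayInputStream would give a test input matching the stream layout described in the issue below.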

> Enable skipping past invalid tar header entries
> -----------------------------------------------
>
>                 Key: COMPRESS-450
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-450
>             Project: Commons Compress
>          Issue Type: Improvement
>          Components: Archivers
>    Affects Versions: 1.16.1
>            Reporter: Tijmen R
>            Priority: Minor
>              Labels: newbie
>         Attachments: TarArchiveInputStream.java
>
>
> In TarArchiveInputStream::getNextTarEntry(), if reading and parsing the header 
> fails, an IOException is thrown. State (e.g. currEntry) is not cleared, and 
> trying to get any further entries/data from the archive is thus not possible.
> In our use case, we sometimes encounter corrupt tar archives where the data 
> following a header (that specifies a non-zero data size) is completely or 
> partly missing, as for hdr_b in the following stream:
>  
> {noformat}
> ...[hdr_a][data_a1]...[data_an][hdr_b][hdr_c][data_c1][data_c2]...[data_cn]...{noformat}
>  
> We have no influence on how these archives are created, so cannot fix it on 
> that side. However, it would be nice to be able to at least pick up reading 
> the tar file at the next valid header it finds, so at least most of the data 
> can be retrieved. In other words, similar to the behaviour of GNU tar:
>  * If reading/parsing the header fails, and no header was read successfully 
> before, or the previous header read attempt failed as well, then fail 
> completely
>  * Otherwise if reading/parsing the header fails, throw an error. A next call 
> to getNextTarEntry will read blocks until it finds one that has a valid 
> header checksum, and try to parse that as a header.
> The attached version of TarArchiveInputStream does this.
> Some issues with this approach:
>  * In the example stream given above, the hdr_c and subsequent blocks 
> (depending on the data size specified in hdr_b) will already have been 
> returned/read as data for b. However, that is also the case in the current 
> version of TarArchiveInputStream.
>  * So, (at least) file c is lost, and the next entry to be picked up will 
> likely be hdr_d (or even later). Data blocks that look like a tar header at 
> first sight but actually (in the current context) aren't, might be 
> misinterpreted to be headers (this can occur for example with a tar archive 
> stored inside a main tar archive).
>  * Currently, the code just throws an IOException with a different error 
> message, as I didn't want to change the behaviour too much. But it would be a 
> lot better to have a different exception (child of IOException) for a "header 
> parse" error, to distinguish it from a general IO exception reading the 
> underlying stream.
>  * I'm not too sure about what to do in case of a "fatal" error (skip to the 
> end of file?)
> Still, the above has been useful for us, and maybe this benefits others as 
> well.
>  
>  
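
The resynchronisation step described above ("read blocks until it finds one that has a valid header checksum") could be sketched roughly as follows. This is not the attached TarArchiveInputStream and not the code on the COMPRESS-450 branch; the names TarHeaderScan, hasValidChecksum and nextCandidateHeader are hypothetical. It works on an in-memory byte array to stay self-contained and assumes the start offset is aligned to a 512-byte block.

```java
// Hypothetical helper; a sketch of checksum-based resynchronisation only.
public class TarHeaderScan {
    static final int BLOCK = 512;
    static final int CHKSUM_OFFSET = 148;
    static final int CHKSUM_LEN = 8;

    /**
     * True if the 512-byte block at off stores an octal checksum equal to the
     * sum of its bytes, with the checksum field itself counted as spaces.
     */
    static boolean hasValidChecksum(byte[] buf, int off) {
        long stored = 0;
        int digits = 0;
        for (int i = 0; i < CHKSUM_LEN; i++) {
            byte b = buf[off + CHKSUM_OFFSET + i];
            if (b >= '0' && b <= '7') {
                stored = stored * 8 + (b - '0');
                digits++;
            } else if (digits > 0) {
                break;              // any non-octal byte ends the number
            } else if (b != ' ') {
                return false;       // only spaces may precede the digits
            }
        }
        if (digits == 0) {
            return false;           // e.g. an all-zero block
        }
        long sum = 0;
        for (int i = 0; i < BLOCK; i++) {
            boolean inField = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LEN;
            sum += inField ? ' ' : (buf[off + i] & 0xff);
        }
        return sum == stored;
    }

    /**
     * Returns the offset of the first block at or after from whose checksum
     * is valid, or -1 if no such block exists. from must be block-aligned.
     */
    static int nextCandidateHeader(byte[] archive, int from) {
        for (int off = from; off + BLOCK <= archive.length; off += BLOCK) {
            if (hasValidChecksum(archive, off)) {
                return off;
            }
        }
        return -1;
    }
}
```

A matching checksum only marks a candidate. As noted in the description, a data block belonging to a nested tar archive would pass the same check, so the caller still has to parse the block and decide whether to trust it.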



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
