[jira] [Work logged] (COMPRESS-540) Random access on Tar archive

ASF GitHub Bot (Jira) Wed, 18 Nov 2020 08:37:39 -0800


     [ 
https://issues.apache.org/jira/browse/COMPRESS-540?focusedWorklogId=513621&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-513621
 ]


ASF GitHub Bot logged work on COMPRESS-540:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/Nov/20 16:36
            Start Date: 18/Nov/20 16:36
    Worklog Time Spent: 10m 
      Work Description: theobisproject commented on pull request #113:
URL: https://github.com/apache/commons-compress/pull/113#issuecomment-729799712


   @kinow Thank you for your feedback!
   
   
   > > In the SparseFilesTest extracting of sparsefile-0.1 in 
pax_gnu_sparse.tar is skipped because of problems with tar on Ubuntu 16.04. I 
tested this on Ubuntu 18.04 and everything worked fine. Should we reenable 
extracting of the file?
   > 
   > +1, if that works with newer versions. Would be good to leave a 
NOTE/comment somewhere though, in case the test fails again. Even better if we 
were able to pinpoint what broke it in Ubuntu 16.04 (but can be done later 👍 )
   > 
   > To keep this PR simpler, we can leave as-is, and create a new JIRA to 
follow-up on this.
   
   > Ah. I missed this in my last reviews. It's weird. :)
   > I tested it on Ubuntu 18.04 and it does work now. That test failed on 
Ubuntu 16.04, but unfortunately I don't have an Ubuntu 16.04 machine now. I agree 
with @kinow that we can leave it as-is and handle it in a separate PR.
   
   
   I can create a new JIRA and fix this on master next weekend if nothing gets 
in my way.
   
   
   
   
   > @theobisproject , @PeterAlfredLee , my only other comment is about the 
name `TarFile`. Is there any risk in using this class name, with a feature that 
is not part of the tar format (I don't know the format well enough, but I think 
you mentioned it in your e-mail to the mailing list Lee?)?
   > 
   > I mean, could it be that in the future we may need to create a `TarFile`, 
that doesn't/cannot support the random read, and in which case we would have to 
find another name?
   
   I don't think there is any risk, even though random access is not supported 
by the format but emulated by the implementation. You can still use the 
`TarFile` class to read the content sequentially, since the order of the entries 
is preserved. This naming would also be consistent with the existing 
`ZipFile` implementation. 
   From a performance point of view, the overhead needed to allow random 
access is as small as possible: only the headers of the entries are read 
(excluding some special cases, e.g. long filenames, where the information is 
stored in a different entry), and all file data is skipped by setting the 
position in the file to the start of the next header. My testing showed that 
`TarFile` and `TarArchiveInputStream` perform equivalently for sequential 
access. The performance gain for accessing a single entry with `TarFile` is 
huge because `TarArchiveInputStream` needs to read and skip all data until it 
reaches the entry.  
   If you would like numbers for specific cases, feel free to name them. I can 
create some benchmarks and share them with you.
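
   To illustrate the "jump to the next header" step, here is a minimal sketch of 
the offset arithmetic. It is not the code from the pull request: the class 
`TarOffsets` and its methods are hypothetical names, and it assumes plain 
ustar-style headers (512-byte blocks, octal size field at bytes 124..135). 
Long names, pax headers, sparse files and base-256 sizes are ignored.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Commons Compress. */
final class TarOffsets {
    static final int BLOCK = 512; // tar headers and data padding use 512-byte blocks

    /** Parses the octal size field stored at bytes 124..135 of a header block. */
    static long entrySize(byte[] header) {
        String field = new String(header, 124, 12, StandardCharsets.US_ASCII);
        // The field is terminated by NUL and/or space; keep only the octal digits.
        return Long.parseLong(field.replaceAll("[^0-7]", ""), 8);
    }

    /** Position of the next header, given where the current header starts. */
    static long nextHeader(long headerPos, long size) {
        long paddedData = (size + BLOCK - 1) / BLOCK * BLOCK; // data is padded to full blocks
        return headerPos + BLOCK + paddedData;
    }

    public static void main(String[] args) {
        byte[] header = new byte[BLOCK];
        // Octal 2000 = 1024 bytes of entry data.
        byte[] size = "00000002000".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(size, 0, header, 124, size.length);
        // One header block plus two data blocks are skipped without reading the data.
        System.out.println(nextHeader(0, entrySize(header))); // prints 1536
    }
}
```

   With a seekable source, only the 512-byte header is read for each entry, and 
the position is then set to the value returned by `nextHeader`.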


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 513621)
    Time Spent: 4h  (was: 3h 50m)

> Random access on Tar archive
> ----------------------------
>
>                 Key: COMPRESS-540
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-540
>             Project: Commons Compress
>          Issue Type: Improvement
>            Reporter: Robin Schimpf
>            Priority: Major
>          Time Spent: 4h
>  Remaining Estimate: 0h
>
> The TarArchiveInputStream only provides sequential access. If only a small 
> number of files from the archive is needed, a large amount of data in the 
> input stream needs to be skipped.
> Therefore I was working on an implementation to provide random access to 
> tar files, similar to the ZipFile API. The basic idea behind the 
> implementation is the following:
>  * Random access is backed by a SeekableByteChannel
>  * Read all headers of the tar file and save the position of the data for 
> every header
>  * User can request an input stream for any entry in the archive multiple 
> times
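
The three points in the description above could be sketched roughly as follows. 
This is an illustration under stated assumptions, not the implementation from 
the pull request: `TarIndexSketch`, `readEntry` and `entry` are hypothetical 
names, only plain ustar-style headers are handled (no long names, pax, or 
sparse entries), checksums are not verified, and entries are returned as byte 
arrays rather than input streams to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical sketch; not the TarFile class from the pull request. */
final class TarIndexSketch {
    static final int BLOCK = 512;

    /** Entry name -> {data offset, data size}, kept in archive order. */
    final Map<String, long[]> index = new LinkedHashMap<>();
    private final SeekableByteChannel channel;

    /** Reads every header once and records where each entry's data starts. */
    TarIndexSketch(SeekableByteChannel channel) throws IOException {
        this.channel = channel;
        ByteBuffer header = ByteBuffer.allocate(BLOCK);
        long pos = 0;
        while (pos + BLOCK <= channel.size()) {
            header.clear();
            channel.position(pos);
            readFully(header);
            byte[] h = header.array();
            if (h[0] == 0) break; // empty name: treat as end-of-archive zero block
            long size = Long.parseLong(field(h, 124, 12).trim(), 8);
            index.put(field(h, 0, 100), new long[] {pos + BLOCK, size});
            pos += BLOCK + (size + BLOCK - 1) / BLOCK * BLOCK; // jump over padded data
        }
    }

    /** Random access: may be called for any entry, in any order, multiple times. */
    byte[] readEntry(String name) throws IOException {
        long[] e = index.get(name);
        if (e == null) throw new NoSuchElementException(name);
        ByteBuffer data = ByteBuffer.allocate((int) e[1]);
        channel.position(e[0]);
        readFully(data);
        return data.array();
    }

    private void readFully(ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            if (channel.read(buf) < 0) throw new EOFException("truncated archive");
        }
    }

    /** Returns the NUL-terminated ASCII text of a header field. */
    private static String field(byte[] h, int off, int len) {
        int end = off;
        while (end < off + len && h[end] != 0) end++;
        return new String(h, off, end - off, StandardCharsets.US_ASCII);
    }

    /** Demo helper: minimal header block plus padded data (other fields left zero). */
    static byte[] entry(String name, byte[] data) {
        byte[] out = new byte[BLOCK + (data.length + BLOCK - 1) / BLOCK * BLOCK];
        byte[] n = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(n, 0, out, 0, n.length);
        byte[] s = String.format("%011o", data.length).getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(s, 0, out, 124, s.length);
        System.arraycopy(data, 0, out, BLOCK, data.length);
        return out;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        bytes.write(entry("a.txt", "hello".getBytes(StandardCharsets.US_ASCII)));
        bytes.write(entry("b.txt", "random access".getBytes(StandardCharsets.US_ASCII)));
        bytes.write(new byte[2 * BLOCK]); // end-of-archive marker
        Path tar = Files.createTempFile("sketch", ".tar");
        Files.write(tar, bytes.toByteArray());
        try (SeekableByteChannel ch = Files.newByteChannel(tar)) {
            TarIndexSketch t = new TarIndexSketch(ch);
            // Entries can be read out of archive order.
            System.out.println(new String(t.readEntry("b.txt"), StandardCharsets.US_ASCII));
            System.out.println(new String(t.readEntry("a.txt"), StandardCharsets.US_ASCII));
        } finally {
            Files.delete(tar);
        }
    }
}
```

Because the index is a LinkedHashMap, iterating over it still yields the 
entries in archive order, so sequential reading remains possible.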



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
