On 2011-08-04, Lasse Collin wrote:

> On 2011-08-04 Stefan Bodewig wrote:

>> This is in large part due to the history of Commons Compress, which
>> combined several different codebases with separate APIs and provided a
>> first attempt to layer a unifying API on top of them.  We are aware of
>> quite a few problems and want to address them in Commons Compress 2.x,
>> and it would be really great if you would participate in the design of
>> the new APIs once that discussion kicks off.

> I'm not sure how much I can help, but I can try (depending on how much
> time I have).

Thanks.

>> On 2011-08-03, Lasse Collin wrote:

>>> (2) BZip2CompressorOutputStream.flush() calls out.flush() but it
>>>     doesn't flush data buffered by BZip2CompressorOutputStream.
>>>     Thus not all data written to the Bzip2 stream will be available
>>>     in the underlying output stream after flushing. This kind of
>>>     flush() implementation doesn't seem very useful.

>> Agreed, do you want to open a JIRA issue for this?

> There is already this:

>     https://issues.apache.org/jira/browse/COMPRESS-42

Ah, I knew I had once fiddled with flush there, but a quick grep through
the changes file didn't show anything - because it was before the 1.0
release.

> I tried to understand how flushing could be done properly. I'm not
> really familiar with bzip2 so the following might have errors.

As I already said, neither of us is terribly familiar with the format
right now.  I for one didn't even know you could have multiple streams
in a single file, so it took your mail for me to make sense of
COMPRESS-146.

> I checked libbzip2 and how its BZ_FLUSH works. It finishes the block,
> but it doesn't flush the last bits, and thus the complete block isn't
> available in the output stream. The blocks in the .bz2 format aren't
> aligned to full bytes, and there is no padding between blocks.

> The lack of alignment makes flushing tricky. One may need to write out
> up to seven bits of data from the future. The bright side is that those
> future bits can only come from the block header magic or from the end
> of stream magic. Both are constants so there are only two possibilities
> what those seven bits can be.

> Using bits from the end of stream magic doesn't make sense, because then
> one would be forced to finish the stream. Using the bits from the
> block header magic means that one must add at least one more block.
> This is fine if the application will want to encode at least one more
> byte. If the application calls close() right after flushing, then
> there's a problem unless the .bz2 format allows empty blocks. I get a
> feeling from the code that .bz2 would support empty blocks, but I'm not
> sure at all.
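
Just to make the bit-borrowing concrete for myself (untested sketch; the
two 48-bit magics are the pi and sqrt(pi) constants from the format):

    public class FlushBits {
        public static void main(String[] args) {
            // Block header and end-of-stream magics, 48 bits each
            // (BCD digits of pi and sqrt(pi)).
            long BLOCK_MAGIC = 0x314159265359L;
            long EOS_MAGIC = 0x177245385090L;

            // Say 5 bits are still pending after finishing a block.
            int bitsPending = 5;
            int bitsToBorrow = 8 - bitsPending;
            // Borrowing the top bits of BLOCK_MAGIC fills the byte but
            // commits the encoder to starting one more block; borrowing
            // from EOS_MAGIC instead would force finishing the stream.
            int borrowed = (int) (BLOCK_MAGIC >>> (48 - bitsToBorrow));
            System.out.println("borrow " + bitsToBorrow
                    + " bits, value " + borrowed);
        }
    }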

It should be possible to write some unit tests to see what works and to
create some test archives for interop testing with native tools.
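
For flush() the test could be as simple as this (hypothetical sketch;
with COMPRESS-42 unfixed it fails, today probably with an exception,
since little more than the stream header reaches the sink):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

    public class Bzip2FlushTest {
        public static void main(String[] args) throws Exception {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            BZip2CompressorOutputStream bz =
                    new BZip2CompressorOutputStream(sink);
            bz.write("hello".getBytes("US-ASCII"));
            bz.flush();
            // If flush() really pushed the current block out, the five
            // bytes written so far must be recoverable from 'sink' now.
            BZip2CompressorInputStream in = new BZip2CompressorInputStream(
                    new ByteArrayInputStream(sink.toByteArray()));
            byte[] buf = new byte[5];
            int n = in.read(buf, 0, 5);
            if (n != 5)
                throw new AssertionError("flush() flushed nothing: " + n);
        }
    }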

>>> (4) The decompressor streams don't support concatenated .gz and .bz2
>>>     files. This can be OK when compressed data is used inside
>>>     another file format or protocol, but with regular
>>>     (standalone) .gz and .bz2 files it is bad to stop after the
>>>     first compressed stream and silently ignore the remaining
>>>     compressed data.

>>>     Fixing this in BZip2CompressorInputStream should be relatively
>>>     easy because it stops right after the last byte of the
>>>     compressed stream.

>> Is this <https://issues.apache.org/jira/browse/COMPRESS-146>?

> Yes. I didn't check the suggested fix though.

Would be nice if you'd find the time to do so.
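
In case it helps, I'd expect the fix to boil down to a loop like this
(rough sketch, not the patch from the issue - since the stream stops
right after the last compressed byte, peeking a single byte is enough to
tell whether another stream follows):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.io.PushbackInputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    public final class ConcatBunzip2 {
        /** Decompresses every concatenated .bz2 stream in 'raw'. */
        public static void copyAll(InputStream raw, OutputStream out)
                throws IOException {
            PushbackInputStream in = new PushbackInputStream(raw, 1);
            int first;
            while ((first = in.read()) != -1) {
                in.unread(first);   // only peeked to detect more data
                // Throws if the next bytes aren't a "BZh" header, e.g.
                // trailing garbage - a policy decision for the real fix.
                InputStream bz = new BZip2CompressorInputStream(in);
                byte[] buf = new byte[8192];
                int n;
                while ((n = bz.read(buf)) != -1)
                    out.write(buf, 0, n);
                // don't close 'bz' here: it would close 'in' as well
            }
        }
    }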

>>>     Fixing GzipCompressorInputStream is harder because the problem
>>>     is inherited from java.util.zip.GZIPInputStream which reads
>>>     input past the end of the first stream. One might need to
>>>     reimplement .gz container support on top of
>>>     java.util.zip.InflaterInputStream or java.util.zip.Inflater.

>> Sounds doable but would need somebody to code it, I guess ;-)

> There is a somewhat hackish solution in the comments of the following
> bug report, but it lacks license information:

>     http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425

Yes.  I agree it is hacky.
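
For what it's worth, I tried to sketch what the core of such a
reimplementation could look like (untested, written from RFC 1952; it
skips CRC32/ISIZE verification, which a real stream class must of course
do, and feeds the inflater one byte at a time - slow, but it keeps the
bookkeeping trivial and guarantees nothing past the member is consumed):

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    public final class ConcatGunzip {
        /** Copies the payload of every .gz member in 'raw' to 'out'. */
        public static void copyAll(InputStream raw, OutputStream out)
                throws IOException, DataFormatException {
            DataInputStream in = new DataInputStream(raw);
            int first;
            while ((first = in.read()) != -1) { // EOF between members is OK
                // Fixed part of the member header.
                if (first != 0x1f || in.readUnsignedByte() != 0x8b)
                    throw new IOException("not a gzip member");
                if (in.readUnsignedByte() != 8) // CM must be deflate
                    throw new IOException("unknown compression method");
                int flg = in.readUnsignedByte();
                in.readFully(new byte[6]);      // MTIME, XFL, OS
                if ((flg & 4) != 0) {           // FEXTRA
                    int xlen = in.readUnsignedByte();
                    xlen |= in.readUnsignedByte() << 8;
                    in.readFully(new byte[xlen]);
                }
                if ((flg & 8) != 0)             // FNAME, NUL-terminated
                    while (in.readUnsignedByte() != 0) { }
                if ((flg & 16) != 0)            // FCOMMENT, NUL-terminated
                    while (in.readUnsignedByte() != 0) { }
                if ((flg & 2) != 0)             // FHCRC
                    in.readFully(new byte[2]);
                // Raw deflate payload, one input byte at a time so the
                // inflater never sees a byte belonging to the trailer.
                Inflater inf = new Inflater(true);
                byte[] one = new byte[1];
                byte[] buf = new byte[8192];
                while (!inf.finished()) {
                    if (inf.needsInput()) {
                        one[0] = in.readByte();
                        inf.setInput(one, 0, 1);
                    }
                    int n = inf.inflate(buf);
                    if (n > 0)
                        out.write(buf, 0, n);
                }
                inf.end();
                // CRC32 and ISIZE; a real stream class must verify both.
                in.readFully(new byte[8]);
            }
        }
    }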

>> In the past we have incorporated external codebases (ar and cpio) that
>> were available under compatible licenses to make things simpler for our
>> users, but if you prefer to develop your code base outside of Commons
>> Compress then I can fully understand that.

> I will develop it in my own tree, but it's possible to include a copy
> in Commons Compress with modified "package" and "import" lines in the
> source files. Changes in my tree would need to be copied to Commons
> Compress now and then. I don't know if this is better than having an
> external dependency.

Don't know either.  It depends on who'd do the work of syncing, I guess.

> org.tukaani.xz will include features that aren't necessarily interesting
> in Commons Compress, for example, advanced compression options and
> random access reading. Most developers probably won't care about these.

We'll need standalone compressors for other formats as well (and we do
need LZMA 8-).  Some of the options your code provides might be
interesting for the ZIP package as well when we want to implement some
of the other supported methods.
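
I'd guess usage will end up looking something like the following - class
and method names are only my reading of your current tree, so treat them
as assumptions:

    import java.io.FileOutputStream;
    import org.tukaani.xz.LZMA2Options;
    import org.tukaani.xz.XZOutputStream;

    public class XzHello {
        public static void main(String[] args) throws Exception {
            LZMA2Options options = new LZMA2Options();
            options.setPreset(6);           // roughly "xz -6"
            FileOutputStream file = new FileOutputStream("hello.xz");
            XZOutputStream out = new XZOutputStream(file, options);
            out.write("Hello!\n".getBytes("US-ASCII"));
            out.finish();                   // writes the stream footer
            file.close();
        }
    }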

>> From the dependency management POV I know many developers prefer
>> dependencies that are available from a Maven repository.  Is this the
>> case for the org.tukaani.xz package?  (I'm too lazy to check.)

> There is only a build.xml for Ant.

If you need help with publishing your package to a Maven repository -
some of your users will ask for it sooner or later - I know where to
find people who can help.

Stefan
