Re: [compress] How to move forward with XZ support

2011-08-12 Thread Lasse Collin
On 2011-08-12 Stefan Bodewig wrote:
> the GNU core utils come with an xz command

Minor correction: GNU coreutils doesn't include compression tools.
GNU gzip is its own package and so are bzip2 (bzip.org) and xz
(tukaani.org).

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



Re: [compress] XZ support and inconsistencies in the existing compressors

2011-08-04 Thread Lasse Collin
On 2011-08-04 Stefan Bodewig wrote:
> On 2011-08-04, Lasse Collin wrote:
> > Using bits from the end of stream magic doesn't make sense, because
> > then one would be forced to finish the stream. Using the bits from
> > the block header magic means that one must add at least one more
> > block. This is fine if the application is going to encode at least
> > one more byte. If the application calls close() right after
> > flushing, then there's a problem unless the .bz2 format allows empty
> > blocks. I get a feeling from the code that .bz2 would support empty
> > blocks, but I'm not sure at all.
> 
> It should be possible to write some unit tests to see what works and
> to create some test archives for interop testing with native tools.

Maybe, if it is even possible to create such files.

Making flush() equivalent to finish() (except that one can continue
writing after flush()) sounds much simpler and safer for bzip2, even
if it can create problems of its own.

> >>> (4) The decompressor streams don't support concatenated .gz
> >>> and .bz2 files. This can be OK when compressed data is used inside
> >>> another file format or protocol, but with regular
> >>> (standalone) .gz and .bz2 files it is bad to stop after the
> >>> first compressed stream and silently ignore the remaining
> >>> compressed data.
> 
> >>> Fixing this in BZip2CompressorInputStream should be relatively
> >>> easy because it stops right after the last byte of the
> >>> compressed stream.
> 
> >> Is this <https://issues.apache.org/jira/browse/COMPRESS-146>?
> 
> > Yes. I didn't check the suggested fix though.
> 
> Would be nice if you'd find the time to do so.

The suggested fix uses in.available() == 0, which isn't reliable:
available() only tells how many bytes happen to be buffered right now,
not whether the stream has ended. It also duplicates the test for the
"BZh" magic bytes and a little more from init() into complete(). I
think this bug can be fixed in a nicer way.
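
To illustrate, a cleaner check could look at the next bytes directly
instead of asking in.available(). A rough sketch (the helper and its
use of PushbackInputStream are my own, not the actual
BZip2CompressorInputStream internals; the stream must be constructed
with a pushback capacity of at least three bytes):

    import java.io.IOException;
    import java.io.PushbackInputStream;

    // Returns true if another "BZh" stream follows; otherwise pushes
    // any bytes it consumed back so the caller sees a clean end of
    // input.
    static boolean nextStreamFollows(PushbackInputStream in)
            throws IOException {
        byte[] magic = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = in.read(magic, n, 3 - n);
            if (r == -1)
                break;
            n += r;
        }
        boolean found = n == 3 && magic[0] == 'B'
                && magic[1] == 'Z' && magic[2] == 'h';
        if (!found && n > 0)
            in.unread(magic, 0, n);
        return found;
    }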

Is there a need to have a bzip2 decompressor that does stop after the
first stream (like the current code does)? Maybe .zip needs it?

> We'll need standalone compressors for other formats as well (and we do
> need LZMA 8-).  Some of the options your code provides might be
> interesting for the ZIP package as well when we want to implement some
> of the other supported methods.

The .lzma format is legacy. While it may have some uses, people should
usually move to .xz and LZMA2.

The .zip format has LZMA marked as "Early Feature Specification", and
some of the minor details are a little odd. For example, it requires
storing the version of the LZMA SDK that was used for compression
(what happens if you don't use an unmodified LZMA SDK?).

What else needs LZMA? Do you plan .7z support?

> If you need help with publishing your package to a Maven repository -
> some of your users will ask for it sooner or later - I know where to
> find people who can help.

Thanks.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



Re: [compress] ZIP64: API imposed limits vs limits of the format

2011-08-04 Thread Lasse Collin
On 2011-08-04 Stefan Bodewig wrote:
> There are a few places where our implementation doesn't allow for the
> full range the ZIP format would support.  Some are easy to fix, some
> hard and I'm asking for feedback whether you consider it worth the
> effort to fix them at all.

I guess that these are enough for the foreseeable future:

Max archive size: Long.MAX_VALUE
Max size of individual entry: Long.MAX_VALUE
Max number of file entries:   Integer.MAX_VALUE

Java APIs don't support bigger files, and I guess that such big files
won't be common even if file system sizes allowed them. Even if you
could write ten terabytes per second, it would still take well over a
week to create an archive of 2^63-1 bytes.

I don't know how much memory one file entry needs, but let's assume
it takes only 50 bytes, including the overhead of the linked list
etc. Keeping a list of 2^31-1 files will then need 100 GiB of RAM.
While that might be OK in some situations, I hope such archives won't
become common. ;-) Even with the number of files limited to
Integer.MAX_VALUE, it is worth thinking about the memory usage of the
data structures used for the file entries.
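
A quick back-of-the-envelope check of both estimates (the 50 bytes
per entry is the assumption from above):

    public class ZipLimitEstimates {
        public static void main(String[] args) {
            double maxBytes = Math.pow(2, 63) - 1;   // max archive size
            double bytesPerSec = 10e12;              // ten terabytes per second
            System.out.printf("days to write: %.1f%n",
                    maxBytes / bytesPerSec / 86400); // ~10.7 days

            double entries = Math.pow(2, 31) - 1;    // Integer.MAX_VALUE
            double bytesPerEntry = 50;               // assumed overhead
            System.out.printf("RAM for entries: %.1f GiB%n",
                    entries * bytesPerEntry / (1L << 30)); // ~100.0 GiB
        }
    }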

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



Re: [compress] XZ support and inconsistencies in the existing compressors

2011-08-04 Thread Lasse Collin
On 2011-08-04 Stefan Bodewig wrote:
> On 2011-08-03, Lasse Collin wrote:
> > I looked at the APIs and code in Commons Compress to see how XZ
> > support could be added. I was especially looking for details where
> > one would need to be careful to make different compressors behave
> > consistently compared to each other.
> 
> This is in a big part due to the history of Commons Compress which
> combined several different codebases with separate APIs and provided a
> first attempt to layer a unifying API on top of it.  We are aware of
> quite a few problems and want to address them in Commons Compress 2.x
> and it would be really great if you would participate in the design of
> the new APIs once that discussion kicks off.

I'm not sure how much I can help, but I can try (depending on how
much time I have).

> > (2) BZip2CompressorOutputStream.flush() calls out.flush() but it
> > doesn't flush data buffered by BZip2CompressorOutputStream.
> > Thus not all data written to the bzip2 stream will be available
> > in the underlying output stream after flushing. This kind of
> > flush() implementation doesn't seem very useful.
> 
> Agreed, do you want to open a JIRA issue for this?

There is already this:

https://issues.apache.org/jira/browse/COMPRESS-42

I tried to understand how flushing could be done properly. I'm not
really familiar with bzip2 so the following might have errors.

I checked libbzip2 and how its BZ_FLUSH works. It finishes the block,
but it doesn't flush the last bits, and thus the complete block isn't
available in the output stream. The blocks in the .bz2 format aren't
aligned to full bytes, and there is no padding between blocks.

The lack of alignment makes flushing tricky. One may need to write out
up to seven bits of data from the future. The bright side is that those
future bits can only come from the block header magic or from the end
of stream magic. Both are constants, so there are only two
possibilities for what those seven bits can be.
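
To make the two possibilities concrete, here are the two 48-bit
magics and their first seven bits (an illustration only; the
constants come from the .bz2 format):

    public class Bzip2Magics {
        public static void main(String[] args) {
            long blockMagic = 0x314159265359L; // block header (BCD of pi)
            long eosMagic   = 0x177245385090L; // end of stream (BCD of sqrt(pi))
            // The first 7 of the 48 bits of each magic:
            System.out.println(Long.toBinaryString(blockMagic >>> 41)); // 11000 (= 0011000)
            System.out.println(Long.toBinaryString(eosMagic >>> 41));   // 1011  (= 0001011)
        }
    }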

Using bits from the end of stream magic doesn't make sense, because then
one would be forced to finish the stream. Using the bits from the
block header magic means that one must add at least one more block.
This is fine if the application is going to encode at least one more
byte. If the application calls close() right after flushing, then
there's a problem unless the .bz2 format allows empty blocks. I get a
feeling from the code that .bz2 would support empty blocks, but I'm
not sure at all.

Since bzip2 compresses each block independently of the others, the
compression ratio doesn't take a big penalty if the stream is finished
and then a new stream is started. This would make it much simpler to
implement flushing. The downside is that implementations that don't
support decoding concatenated .bz2 files will stop after the first
stream.
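
For illustration, the finish-and-restart behavior can even be
prototyped outside the compressor with the existing API. A sketch
(the wrapper class is my own; it starts a new .bz2 stream on the
first write after each flush):

    import java.io.IOException;
    import java.io.OutputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

    public class FlushableBzip2OutputStream extends OutputStream {
        private final OutputStream out;
        private BZip2CompressorOutputStream bz2; // null between streams

        public FlushableBzip2OutputStream(OutputStream out) {
            this.out = out;
        }

        @Override
        public void write(int b) throws IOException {
            if (bz2 == null)
                bz2 = new BZip2CompressorOutputStream(out); // new "BZh" stream
            bz2.write(b);
        }

        @Override
        public void flush() throws IOException {
            if (bz2 != null) {
                bz2.finish(); // byte-aligned end of stream, all data written
                bz2 = null;
            }
            out.flush();
        }

        @Override
        public void close() throws IOException {
            flush(); // finishes the last stream, if any
            out.close();
        }
    }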

> > (4) The decompressor streams don't support concatenated .gz and .bz2
> > files. This can be OK when compressed data is used inside
> > another file format or protocol, but with regular
> > (standalone) .gz and .bz2 files it is bad to stop after the
> > first compressed stream and silently ignore the remaining
> > compressed data.
> 
> > Fixing this in BZip2CompressorInputStream should be relatively
> > easy because it stops right after the last byte of the
> > compressed stream.
> 
> Is this <https://issues.apache.org/jira/browse/COMPRESS-146>?

Yes. I didn't check the suggested fix though.

> > Fixing GzipCompressorInputStream is harder because the problem
> > is inherited from java.util.zip.GZIPInputStream which reads
> > input past the end of the first stream. One might need to
> > reimplement .gz container support on top of
> > java.util.zip.InflaterInputStream or java.util.zip.Inflater.
> 
> Sounds doable but would need somebody to code it, I guess ;-)

There is a somewhat hackish solution in the comments of the following
bug report, but it lacks license information:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425
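
To sketch what a reimplementation on top of java.util.zip.Inflater
could look like (illustrative only: it accepts just the fixed header
with FLG == 0 and skips trailer verification, both of which a real
implementation must handle properly):

    import java.io.ByteArrayOutputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PushbackInputStream;
    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    public class GunzipConcat {
        // Decodes all members of a concatenated .gz input.
        public static byte[] gunzip(InputStream rawIn) throws IOException {
            PushbackInputStream in = new PushbackInputStream(rawIn, 8192);
            ByteArrayOutputStream result = new ByteArrayOutputStream();
            byte[] inBuf = new byte[8192];
            byte[] outBuf = new byte[8192];
            int b;
            while ((b = in.read()) != -1) {
                // Fixed 10-byte header: ID1 ID2 CM FLG MTIME(4) XFL OS.
                if (b != 0x1F || in.read() != 0x8B || in.read() != 8)
                    throw new IOException("not a deflate-compressed .gz member");
                if (in.read() != 0)
                    throw new IOException("optional header fields not handled");
                for (int i = 0; i < 6; ++i)
                    if (in.read() == -1)
                        throw new EOFException("truncated header");

                Inflater inf = new Inflater(true); // raw deflate, no zlib wrapper
                int inLen = 0;
                while (!inf.finished()) {
                    if (inf.needsInput()) {
                        inLen = in.read(inBuf);
                        if (inLen == -1)
                            throw new EOFException("truncated member");
                        inf.setInput(inBuf, 0, inLen);
                    }
                    try {
                        result.write(outBuf, 0, inf.inflate(outBuf));
                    } catch (DataFormatException e) {
                        throw new IOException(e);
                    }
                }
                // The Inflater may have read past the end of the deflate
                // stream; those bytes belong to the trailer and possibly
                // to the next member, so push them back.
                int rem = inf.getRemaining();
                in.unread(inBuf, inLen - rem, rem);
                inf.end();
                for (int i = 0; i < 8; ++i) // skip CRC32 + ISIZE trailer
                    if (in.read() == -1)
                        throw new EOFException("truncated trailer");
            }
            return result.toByteArray();
        }
    }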

> In the past we have incorporated external codebases (ar and cpio) that
> used to be under compatible licenses to make things simpler for our
> users, but if you prefer to develop your code base outside of Commons
> Compress then I can fully understand that.

I will develop it in my own tree, but it's possible to include a copy
in Commons Compress with modified "package" and "import" lines in the
source files. Changes in my tree would need to be copied to Commons
Compress now and then. I don't know if this is better than having an
external dependency.

org.tukaani.xz will include features that aren't ...

[compress] XZ support and inconsistencies in the existing compressors

2011-08-03 Thread Lasse Collin
Hi!

I have been working on an XZ data compression implementation in Java
<http://tukaani.org/xz/java.html>. I was told that it could be nice
to get XZ support into Commons Compress.

I looked at the APIs and code in Commons Compress to see how XZ
support could be added. I was especially looking for details where
one would need to be careful to make different compressors behave
consistently compared to each other. I found a few possible problems
in the existing code:

(1) CompressorOutputStream should have finish(). Now
BZip2CompressorOutputStream has finish() but
GzipCompressorOutputStream doesn't. This should be easy to fix
because java.util.zip.GZIPOutputStream supports finish() (see the
sketch after this list).

(2) BZip2CompressorOutputStream.flush() calls out.flush() but it
doesn't flush data buffered by BZip2CompressorOutputStream.
Thus not all data written to the bzip2 stream will be available
in the underlying output stream after flushing. This kind of
flush() implementation doesn't seem very useful.

GzipCompressorOutputStream.flush() is the default version
from OutputStream and thus does nothing. Adding flush()
into GzipCompressorOutputStream is hard because
java.util.zip.GZIPOutputStream and java.util.zip.Deflater don't
support sync flushing before Java 7. To get gzip flushing in
older Java versions one might need a complete reimplementation
of the Deflate algorithm, which isn't necessarily practical.

(3) BZip2CompressorOutputStream has finalize() that finishes a stream
that hasn't been explicitly finished or closed. This doesn't seem
useful. GzipCompressorOutputStream doesn't have an equivalent
finalize().

(4) The decompressor streams don't support concatenated .gz and .bz2
files. This can be OK when compressed data is used inside another
file format or protocol, but with regular (standalone) .gz and
.bz2 files it is bad to stop after the first compressed stream
and silently ignore the remaining compressed data.

Fixing this in BZip2CompressorInputStream should be relatively
easy because it stops right after the last byte of the compressed
stream. Fixing GzipCompressorInputStream is harder because the
problem is inherited from java.util.zip.GZIPInputStream
which reads input past the end of the first stream. One
might need to reimplement .gz container support on top of
java.util.zip.InflaterInputStream or java.util.zip.Inflater.
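
Regarding item (1), the fix could be as small as this (a sketch; it
assumes GzipCompressorOutputStream keeps its
java.util.zip.GZIPOutputStream in a field, here called gzOut, which
is a made-up name):

    // Writes the final deflate block and the .gz trailer without
    // closing the underlying stream.
    public void finish() throws IOException {
        gzOut.finish();
    }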

The XZ compressor supports finish() and flush(). The XZ decompressor
supports concatenated .xz files, but there is also a single-stream
version that behaves similarly to the current version of
BZip2CompressorInputStream.
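
For example, a round trip through org.tukaani.xz looks roughly like
this (a sketch against the development snapshots):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import org.tukaani.xz.LZMA2Options;
    import org.tukaani.xz.XZInputStream;
    import org.tukaani.xz.XZOutputStream;

    public class XzRoundTrip {
        public static void main(String[] args) throws Exception {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            XZOutputStream xzOut = new XZOutputStream(buf, new LZMA2Options());
            xzOut.write("hello, xz".getBytes("UTF-8"));
            xzOut.flush();  // data written so far becomes decompressible
            xzOut.finish(); // ends the .xz stream without closing buf

            // XZInputStream also decodes concatenated .xz streams;
            // SingleXZInputStream would stop after the first one.
            XZInputStream xzIn =
                new XZInputStream(new ByteArrayInputStream(buf.toByteArray()));
            byte[] plain = new byte[32];
            int n = xzIn.read(plain);
            System.out.println(new String(plain, 0, n, "UTF-8"));
        }
    }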

Assuming that there will be some interest in adding XZ support into
Commons Compress, is it OK to make Commons Compress depend on the XZ
package org.tukaani.xz, or should the XZ code be modified so that
it could be included as an internal part in Commons Compress? I
would prefer depending on org.tukaani.xz because then there is
just one code base to keep up to date.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org