On 2011-08-04, Lasse Collin wrote:

> On 2011-08-04 Stefan Bodewig wrote:
>> This is in large part due to the history of Commons Compress, which
>> combined several different codebases with separate APIs and provided
>> a first attempt to layer a unifying API on top of it. We are aware
>> of quite a few problems and want to address them in Commons Compress
>> 2.x, and it would be really great if you would participate in the
>> design of the new APIs once that discussion kicks off.

> I'm not sure how much I can help, but I can try (depending on how
> much time I have).

Thanks.

>> On 2011-08-03, Lasse Collin wrote:

>>> (2) BZip2CompressorOutputStream.flush() calls out.flush() but it
>>>     doesn't flush data buffered by BZip2CompressorOutputStream.
>>>     Thus not all data written to the bzip2 stream will be available
>>>     in the underlying output stream after flushing. This kind of
>>>     flush() implementation doesn't seem very useful.

>> Agreed, do you want to open a JIRA issue for this?

> There is already this:
> https://issues.apache.org/jira/browse/COMPRESS-42

Ahh, I knew I had once fiddled with flush there, but a quick grep
through the changes file didn't show anything - because it was before
the 1.0 release. A small snippet demonstrating the current behaviour
is below.

> I tried to understand how flushing could be done properly. I'm not
> really familiar with bzip2, so the following might have errors.

As I already said, neither of us is terribly familiar with the format
right now. I for one didn't even know you could have multiple streams
in a single file, so it took your mail for me to make sense of
COMPRESS-146.

> I checked libbzip2 and how its BZ_FLUSH works. It finishes the block,
> but it doesn't flush the last bits, and thus the complete block isn't
> available in the output stream. The blocks in the .bz2 format aren't
> aligned to full bytes, and there is no padding between blocks.

> The lack of alignment makes flushing tricky. One may need to write
> out up to seven bits of data from the future. The bright side is that
> those future bits can only come from the block header magic or from
> the end of stream magic. Both are constants, so there are only two
> possibilities for what those seven bits can be.

> Using bits from the end of stream magic doesn't make sense, because
> then one would be forced to finish the stream. Using the bits from
> the block header magic means that one must add at least one more
> block. This is fine if the application will want to encode at least
> one more byte. If the application calls close() right after flushing,
> then there's a problem unless the .bz2 format allows empty blocks. I
> get a feeling from the code that .bz2 would support empty blocks, but
> I'm not sure at all.

It should be possible to write some unit tests to see what works and
to create some test archives for interop testing with native tools.
I've also sketched the bit arithmetic you describe below, to make the
idea concrete.

>>> (4) The decompressor streams don't support concatenated .gz and
>>>     .bz2 files. This can be OK when compressed data is used inside
>>>     another file format or protocol, but with regular (standalone)
>>>     .gz and .bz2 files it is bad to stop after the first compressed
>>>     stream and silently ignore the remaining compressed data.

>>>     Fixing this in BZip2CompressorInputStream should be relatively
>>>     easy because it stops right after the last byte of the
>>>     compressed stream.

>> Is this <https://issues.apache.org/jira/browse/COMPRESS-146>?

> Yes. I didn't check the suggested fix though.

Would be nice if you'd find the time to do so. Until it is fixed, a
read loop like the one sketched below might serve as a stop-gap on the
application side.
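Here is the snippet I mentioned for COMPRESS-42 - an untested sketch
that just shows the symptom. The stream classes are the real Commons
Compress ones; FlushDemo and the expectations in the comments are
mine:

    import java.io.ByteArrayOutputStream;

    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

    public class FlushDemo {
        public static void main(String[] args) throws Exception {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            BZip2CompressorOutputStream bz2 =
                new BZip2CompressorOutputStream(baos);
            bz2.write("hello, bzip2".getBytes("US-ASCII"));
            bz2.flush();
            // A useful flush() would make "hello, bzip2" recoverable
            // from baos at this point; with the current code baos
            // holds little more than the "BZh" stream header, since
            // the payload still sits in the unfinished block.
            System.out.println("bytes downstream: " + baos.size());
            bz2.close();
        }
    }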
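And here is the bit arithmetic of your flush idea as I understand it -
a hypothetical helper, names made up. The constants come from the
format: every block starts with the 48-bit magic 0x314159265359, the
end of stream marker is 0x177245385090.

    // Fill the last partial byte of a finished block with the leading
    // bits of the next block header magic; its first byte is always
    // 0x31.  pendingBits (1-7) leftover bits with value pendingValue
    // are assumed to go out MSB-first, as in the .bz2 bit stream.
    static int padWithBlockMagic(int pendingValue, int pendingBits) {
        int borrowed = 8 - pendingBits;  // bits taken "from the future"
        int magicFirstByte = 0x31;
        return ((pendingValue << borrowed) & 0xff)
                | (magicFirstByte >>> pendingBits);
        // The encoder must then start the next block with `borrowed`
        // bits of the magic already emitted - which is why at least
        // one more block (possibly empty) has to follow.
    }

Whether close() directly after such a flush is legal is exactly your
empty-block question, so the unit tests should cover flush-then-close
specifically.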
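Finally the stop-gap loop for concatenated .bz2 input. Untested,
ConcatenatedBzip2Reader is a made-up name, and it assumes the
decompressor really does stop right after the last byte of each
stream:

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    public class ConcatenatedBzip2Reader {
        public static void decompressAll(InputStream raw, OutputStream out)
                throws IOException {
            // mark/reset lets us peek whether another stream follows
            BufferedInputStream in = new BufferedInputStream(raw);
            byte[] buffer = new byte[8192];
            while (true) {
                BZip2CompressorInputStream bz2 =
                    new BZip2CompressorInputStream(in);
                int n;
                while ((n = bz2.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                // don't close bz2 here, that would close the shared in
                in.mark(1);
                if (in.read() == -1) {
                    break;           // clean EOF, no further stream
                }
                in.reset();          // more data: expect another "BZh"
            }
        }
    }

The same pattern won't help for .gz as long as GZIPInputStream
over-reads, which brings us to your next point.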
>>> Fixing GzipCompressorInputStream is harder because the problem is
>>> inherited from java.util.zip.GZIPInputStream, which reads input
>>> past the end of the first stream. One might need to reimplement
>>> .gz container support on top of java.util.zip.InflaterInputStream
>>> or java.util.zip.Inflater.

>> Sounds doable but would need somebody to code it, I guess ;-)

> There is a slightly hackish solution in the comments of the following
> bug report, but it lacks license information:
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425

Yes, I agree it is hacky. A rough sketch of the reimplementation route
is in the PS below.

>> In the past we have incorporated external codebases (ar and cpio)
>> that used to be under compatible licenses to make things simpler for
>> our users, but if you prefer to develop your code base outside of
>> Commons Compress then I can fully understand that.

> I will develop it in my own tree, but it's possible to include a copy
> in Commons Compress with modified "package" and "import" lines in the
> source files. Changes in my tree would need to be copied to Commons
> Compress now and then. I don't know if this is better than having an
> external dependency.

I don't know either. It depends on who'd do the work of syncing, I
guess.

> org.tukaani.xz will include features that aren't necessarily
> interesting in Commons Compress, for example, advanced compression
> options and random access reading. Most developers probably won't
> care about these.

We'll need standalone compressors for other formats as well (and we do
need LZMA 8-). Some of the options your code provides might be
interesting for the ZIP package as well, once we want to implement
some of the other supported methods.

>> From the dependency management POV I know many developers prefer
>> dependencies that are available from a Maven repository; is this the
>> case for the org.tukaani.xz package (I'm too lazy to check)?

> There is only build.xml for Ant.

If you need help with publishing your package to a Maven repository -
some of your users will ask for it sooner or later - I know where to
find people who can help.

Stefan
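PS: here is roughly what I'd expect reimplementing .gz member handling
on top of java.util.zip.Inflater to look like. Completely untested; it
works on an in-memory array to keep the sketch short (a streaming
version would also need pushback for bytes the Inflater over-reads),
and it skips the CRC32/ISIZE trailer without verifying it:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    public class MultiMemberGzip {

        public static byte[] decompress(byte[] gz) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int pos = 0;
            while (pos < gz.length) {
                pos = skipHeader(gz, pos);
                Inflater inf = new Inflater(true); // raw deflate data
                inf.setInput(gz, pos, gz.length - pos);
                byte[] buf = new byte[8192];
                try {
                    while (!inf.finished()) {
                        int n = inf.inflate(buf);
                        if (n == 0 && inf.needsInput()) {
                            throw new IOException("truncated member");
                        }
                        out.write(buf, 0, n);
                    }
                } catch (DataFormatException e) {
                    throw new IOException("corrupt deflate data");
                }
                // skip the CRC32 + ISIZE trailer (not verified here)
                pos += (int) inf.getBytesRead() + 8;
                inf.end();
            }
            return out.toByteArray();
        }

        private static int skipHeader(byte[] gz, int pos)
                throws IOException {
            if (gz[pos] != 0x1f || (gz[pos + 1] & 0xff) != 0x8b
                    || gz[pos + 2] != 8) {
                throw new IOException("no gzip member at " + pos);
            }
            int flags = gz[pos + 3] & 0xff;
            pos += 10;                  // fixed-size part of the header
            if ((flags & 0x04) != 0) {  // FEXTRA
                pos += 2 + ((gz[pos] & 0xff)
                        | ((gz[pos + 1] & 0xff) << 8));
            }
            if ((flags & 0x08) != 0) {  // FNAME, zero-terminated
                while (gz[pos++] != 0) { }
            }
            if ((flags & 0x10) != 0) {  // FCOMMENT, zero-terminated
                while (gz[pos++] != 0) { }
            }
            if ((flags & 0x02) != 0) {  // FHCRC
                pos += 2;
            }
            return pos;
        }
    }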