[ https://issues.apache.org/jira/browse/COMPRESS-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536631#comment-17536631 ]
Michael Osipov edited comment on COMPRESS-620 at 5/13/22 1:01 PM:
------------------------------------------------------------------

Though I am not a Commons Compress developer, this is a bug in Commons Compress as far as I am concerned. Let's analyze.

The offending entry:
{noformat}
10E92 00004 50 4B 01 02 CENTRAL HEADER #10    02014B50
10E96 00001 0B          Created Zip Spec      0B '1.1'
10E97 00001 00          Created OS            00 'MS-DOS'
10E98 00001 0A          Extract Zip Spec      0A '1.0'
10E99 00001 00          Extract OS            00 'MS-DOS'
10E9A 00002 00 00       General Purpose Flag  0000
                        [Bit 1]               0 '4k Sliding Dictionary'
                        [Bit 2]               0 '2 Shannon-Fano Trees'
10E9C 00002 06 00       Compression Method    0006 'Imploded'
10E9E 00004 EE 40 79 19 Last Mod Time         197940EE 'Wed Nov 25 08:07:28 1992'
10EA2 00004 47 B9 D7 53 CRC                   53D7B947
10EA6 00004 BE 08 00 00 Compressed Length     000008BE
10EAA 00004 4F 5E 00 00 Uncompressed Length   00005E4F
10EAE 00002 09 00       Filename Length       0009
10EB0 00002 00 00       Extra Length          0000
10EB2 00002 00 00       Comment Length        0000
10EB4 00002 00 00       Disk Start            0000
10EB6 00002 00 00       Int File Attributes   0000
                        [Bit 0]               0 'Binary Data'
10EB8 00004 20 00 00 00 Ext File Attributes   00000020
                        [Bit 5]               Archive
10EBC 00004 16 C6 00 00 Local Header Offset   0000C616
10EC0 00009 41 F9 43 F9 Filename              'A▒C▒E.ANS'
            45 2E 41 4E
            53
{noformat}

From the ZIP note:
{quote}
APPENDIX D - Language Encoding (EFS)
------------------------------------

D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437. This limits storing file name characters to only those within the original MS-DOS range of values and does not properly support file names in other character encodings, or languages. To address this limitation, this specification will support the following change.

D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment MUST support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification. The Unicode Standard is published by the The Unicode Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files is expected to not include a byte order mark (BOM).
{quote}

Bit 11 is not set, so we should assume CP437 here. As far as I can tell, the file is correct and not defective. BTW, there is NO ANSI encoding; ANSI is an American standards institute. Please be precise.

Now the faulty code [here|https://commons.apache.org/proper/commons-compress/xref/org/apache/commons/compress/archivers/zip/ZipArchiveInputStream.html#L306]:
{code:java}
306 final GeneralPurposeBit gpFlag = GeneralPurposeBit.parse(lfhBuf, off);
307 final boolean hasUTF8Flag = gpFlag.usesUTF8ForNames();
308 final ZipEncoding entryEncoding = hasUTF8Flag ? ZipEncodingHelper.UTF8_ZIP_ENCODING : zipEncoding;
309 current.hasDataDescriptor = gpFlag.usesDataDescriptor();
310 current.entry.setGeneralPurposeBit(gpFlag);
{code}

Unless you specify {{zipEncoding}}, it defaults to UTF-8 [here|https://commons.apache.org/proper/commons-compress/xref/org/apache/commons/compress/archivers/zip/ZipArchiveInputStream.html#L187]:
{code:java}
187 public ZipArchiveInputStream(final InputStream inputStream) {
188     this(inputStream, ZipEncodingHelper.UTF8);
189 }
{code}

Although the note says SHOULD, I would still expect CP437 here; for UTF-8 there is bit 11. Anything else is nonsense. This deviation is not documented, which is just bad.

was (Author: michael-o):

Though I am not a Commons Compress developer, this is a bug in Commons Compress as far as I am concerned.
Let's analyze.

The offending entry:
{noformat}
10E92 00004 50 4B 01 02 CENTRAL HEADER #10    02014B50
10E96 00001 0B          Created Zip Spec      0B '1.1'
10E97 00001 00          Created OS            00 'MS-DOS'
10E98 00001 0A          Extract Zip Spec      0A '1.0'
10E99 00001 00          Extract OS            00 'MS-DOS'
10E9A 00002 00 00       General Purpose Flag  0000
                        [Bit 1]               0 '4k Sliding Dictionary'
                        [Bit 2]               0 '2 Shannon-Fano Trees'
10E9C 00002 06 00       Compression Method    0006 'Imploded'
10E9E 00004 EE 40 79 19 Last Mod Time         197940EE 'Wed Nov 25 08:07:28 1992'
10EA2 00004 47 B9 D7 53 CRC                   53D7B947
10EA6 00004 BE 08 00 00 Compressed Length     000008BE
10EAA 00004 4F 5E 00 00 Uncompressed Length   00005E4F
10EAE 00002 09 00       Filename Length       0009
10EB0 00002 00 00       Extra Length          0000
10EB2 00002 00 00       Comment Length        0000
10EB4 00002 00 00       Disk Start            0000
10EB6 00002 00 00       Int File Attributes   0000
                        [Bit 0]               0 'Binary Data'
10EB8 00004 20 00 00 00 Ext File Attributes   00000020
                        [Bit 5]               Archive
10EBC 00004 16 C6 00 00 Local Header Offset   0000C616
10EC0 00009 41 F9 43 F9 Filename              'A▒C▒E.ANS'
            45 2E 41 4E
            53
{noformat}

From the ZIP note:
{quote}
APPENDIX D - Language Encoding (EFS)
------------------------------------

D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437. This limits storing file name characters to only those within the original MS-DOS range of values and does not properly support file names in other character encodings, or languages. To address this limitation, this specification will support the following change.

D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment MUST support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification. The Unicode Standard is published by the The Unicode Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files is expected to not include a byte order mark (BOM).
{quote}

Bit 11 is not set, so we must assume CP437 here. As far as I can tell, the file is correct and not defective. BTW, there is NO ANSI encoding; ANSI is an American standards institute. Please be precise.

Now the faulty code [here|https://commons.apache.org/proper/commons-compress/xref/org/apache/commons/compress/archivers/zip/ZipArchiveInputStream.html#L306]:
{code:java}
306 final GeneralPurposeBit gpFlag = GeneralPurposeBit.parse(lfhBuf, off);
307 final boolean hasUTF8Flag = gpFlag.usesUTF8ForNames();
308 final ZipEncoding entryEncoding = hasUTF8Flag ? ZipEncodingHelper.UTF8_ZIP_ENCODING : zipEncoding;
309 current.hasDataDescriptor = gpFlag.usesDataDescriptor();
310 current.entry.setGeneralPurposeBit(gpFlag);
{code}

Unless you specify {{zipEncoding}}, it defaults to UTF-8 [here|https://commons.apache.org/proper/commons-compress/xref/org/apache/commons/compress/archivers/zip/ZipArchiveInputStream.html#L187]:
{code:java}
187 public ZipArchiveInputStream(final InputStream inputStream) {
188     this(inputStream, ZipEncodingHelper.UTF8);
189 }
{code}

Although the note says SHOULD, I would still expect CP437 here; for UTF-8 there is bit 11. Anything else is nonsense. This deviation is not documented, which is just bad.

> ArchiveInputStream fails reading filenames with ANSI characters
> ---------------------------------------------------------------
>
>                 Key: COMPRESS-620
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-620
>             Project: Commons Compress
>          Issue Type: Bug
>          Components: Archivers
>    Affects Versions: 1.21
>            Reporter: Avi
>            Priority: Major
>
> I attempted to extract ANSI art packs from [SixteenColors ANSI archive|https://github.com/sixteencolors/sixteencolors-archive] but many of them fail.
>
> After some debugging, it appears that many of the file names contain ANSI characters, which the ArchiveInputStream parses as question marks. Saving the file to disk then fails because a question mark is not a valid character in a file name.
> Specific code:
> ArchiveInputStream archiveInputStream = archiveStreamFactory.createArchiveInputStream(ArchiveStreamFactory.ZIP, inputStream);
> ArchiveEntry archiveEntry = null;
> while((archiveEntry = archiveInputStream.getNextEntry()) != null) {
>     Path path = Paths.get(extractDirectory, archiveEntry.getName());
> Example of a non-parseable filename in an archive:
> https://github.com/sixteencolors/sixteencolors-archive/blob/master/1992/ace-r%232.zip
> A∙C∙E.ANS
> Bad ZIP file example:

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
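
The encoding point made in the comment can be checked with plain JDK classes, without Commons Compress. The following is only a minimal sketch: the class name {{Cp437NameDemo}} and the helper {{decode}} are hypothetical, the nine bytes are copied from the Filename field of the central header dump, and it assumes the running JDK provides the {{IBM437}} charset (the Java name for CP437). It decodes the same bytes once as CP437 and once as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Commons Compress.
public class Cp437NameDemo {

    // File name bytes copied from the central header dump (offset 10EC0).
    static final byte[] NAME_BYTES = {
        0x41, (byte) 0xF9, 0x43, (byte) 0xF9, 0x45, 0x2E, 0x41, 0x4E, 0x53
    };

    // Decodes raw ZIP file name bytes with the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Assumption: IBM437 is available in this JDK.
        Charset cp437 = Charset.forName("IBM437");

        // In CP437, byte 0xF9 maps to U+2219 (bullet operator).
        System.out.println("CP437: " + decode(NAME_BYTES, cp437));

        // 0xF9 can never occur in valid UTF-8, so the decoder
        // substitutes U+FFFD (replacement character) for each one.
        System.out.println("UTF-8: " + decode(NAME_BYTES, StandardCharsets.UTF_8));
    }
}
```

If this holds, a likely workaround on the reporter's side is to pass the encoding explicitly through the two-argument constructor, for example {{new ZipArchiveInputStream(inputStream, "IBM437")}}, rather than relying on the UTF-8 default of the single-argument constructor quoted above.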