[ https://issues.apache.org/jira/browse/COMPRESS-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543410#comment-13543410 ]
Woo Ju Shin commented on COMPRESS-212:
--------------------------------------

I have tried a workaround for this. The following is getNextTarEntry() from TarArchiveInputStream.java.

/**
 * Get the next entry in this tar archive. This will skip
 * over any remaining data in the current entry, if there
 * is one, and place the input stream at the header of the
 * next entry, and read the header and instantiate a new
 * TarEntry from the header bytes and return that entry.
 * If there are no more entries in the archive, null will
 * be returned to indicate that the end of the archive has
 * been reached.
 *
 * @return The next TarEntry in the archive, or null.
 * @throws IOException on error
 */
public TarArchiveEntry getNextTarEntry() throws IOException {
    if (hasHitEOF) {
        return null;
    }

    if (currEntry != null) {
        long numToSkip = entrySize - entryOffset;

        while (numToSkip > 0) {
            long skipped = skip(numToSkip);
            if (skipped <= 0) {
                throw new RuntimeException("failed to skip current tar entry");
            }
            numToSkip -= skipped;
        }

        readBuf = null;
    }

    byte[] headerBuf = getRecord();

    if (hasHitEOF) {
        currEntry = null;
        return null;
    }

    try {
        currEntry = new TarArchiveEntry(headerBuf, encoding);
    } catch (IllegalArgumentException e) {
        IOException ioe = new IOException("Error detected parsing the header");
        ioe.initCause(e);
        throw ioe;
    }
    entryOffset = 0;
    entrySize = currEntry.getSize();

    if (currEntry.isGNULongNameEntry()) {
        // read in the name
        StringBuffer longName = new StringBuffer();
        byte[] buf = new byte[SMALL_BUFFER_SIZE];
        int length = 0;
        while ((length = read(buf)) >= 0) {
            longName.append(new String(buf, 0, length)); // TODO default charset?
        }
        getNextEntry();
        if (currEntry == null) {
            // Bugzilla: 40334
            // Malformed tar file - long entry name not followed by entry
            return null;
        }
        // remove trailing null terminator
        if (longName.length() > 0
            && longName.charAt(longName.length() - 1) == 0) {
            longName.deleteCharAt(longName.length() - 1);
        }
        currEntry.setName(longName.toString());
    }

    if (currEntry.isPaxHeader()){ // Process Pax headers
        paxHeaders();
    }

    if (currEntry.isGNUSparse()){ // Process sparse files
        readGNUSparse();
    }

    // If the size of the next element in the archive has changed
    // due to a new size being reported in the posix header
    // information, we update entrySize here so that it contains
    // the correct value.
    entrySize = currEntry.getSize();
    return currEntry;
}

Note the comment '// TODO default charset?'. This part seems to ignore the encoding passed to the TarArchiveInputStream constructor: new String(buf, 0, length) decodes each chunk with the platform default charset.

As a workaround, I took the encoding that I originally passed to TarArchiveInputStream, collected the file name bytes into a single byte[] variable, and then decoded that byte[] into a String with that encoding so that it could be set on the entry via entry.setName().

This workaround works well so far, but it obviously needs more testing. I will keep testing until next week.

> TarArchiveEntry getName() returns wrongly encoded name even when you set encoding to TarArchiveInputStream
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: COMPRESS-212
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-212
>             Project: Commons Compress
>          Issue Type: Bug
>    Affects Versions: 1.4.1
>         Environment: Red Hat Enterprise Linux, MS Windows 7
>            Reporter: Woo Ju Shin
>            Priority: Minor
>
> I have two file systems. One is Red Hat Linux, the other is MS Windows.
> I created a *.tgz file in Red Hat Linux and tried to decompress it in MS Windows using Commons Compress.
> The default system encodings are different: UTF-8 in Red Hat Linux and CP949 in MS Windows.
> It seems that the file name encoding follows the default encoding even when I use the following to untar it.
> FileInputStream fis = new FileInputStream(new File(*.tgz));
> TarArchiveInputStream zis = new TarArchiveInputStream(new BufferedInputStream(fis),encodingOfRedHatLinux);
> while ((entry = (TarArchiveEntry)zis.getNextEntry()) != null)
> {
>     entry.getName(); // the filename is not decoded as UTF-8; it is decoded as CP949, so the filename isn't consistent
> }
> According to this
> /**
>  * Constructor for TarInputStream.
>  * @param is the input stream to use
>  * @param encoding name of the encoding to use for file names
>  * @since Commons Compress 1.4
>  */
> public TarArchiveInputStream(InputStream is, String encoding) {
>     this(is, TarBuffer.DEFAULT_BLKSIZE, TarBuffer.DEFAULT_RCDSIZE, encoding);
> }
> the encoding should be used for file names.
> But in practice this doesn't seem to work.
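
The workaround described in the comment (collect the long-name bytes into one byte[] first, then decode once with the encoding given to the stream) could look roughly like the sketch below. It is a standalone illustration, not the Commons Compress code or its eventual fix: the class LongNameDecoding, the method readLongName and the constant CHUNK_SIZE are hypothetical names, and a plain InputStream stands in for the tar stream positioned at a GNU long-name entry.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Commons Compress.
final class LongNameDecoding {

    // Deliberately tiny so that multi-byte characters straddle chunk
    // boundaries in the demo below (assumption for illustration only).
    private static final int CHUNK_SIZE = 4;

    /**
     * Reads all remaining bytes from the stream, strips one trailing NUL
     * terminator if present, and decodes the result with the given charset.
     * Decoding happens once, after all bytes are collected, so a character
     * split across two reads is still decoded correctly.
     */
    static String readLongName(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buf = new byte[CHUNK_SIZE];
        int length;
        while ((length = in.read(buf)) >= 0) {
            collected.write(buf, 0, length);
        }
        byte[] raw = collected.toByteArray();
        int end = raw.length;
        // remove trailing null terminator, mirroring the quoted method
        if (end > 0 && raw[end - 1] == 0) {
            end--;
        }
        return new String(raw, 0, end, charset);
    }

    public static void main(String[] args) throws IOException {
        // A Korean file name encoded as UTF-8, followed by a NUL terminator.
        String expected = "\uD55C\uAE00.txt";
        byte[] data = (expected + "\0").getBytes(StandardCharsets.UTF_8);

        String name = readLongName(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        System.out.println(name.equals(expected)); // prints true
    }
}
```

Decoding the collected bytes in one step also avoids a second hazard of the chunk-by-chunk approach: when a read ends in the middle of a multi-byte character, decoding each chunk separately corrupts that character even if the correct charset is used.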