[ 
https://issues.apache.org/jira/browse/COMPRESS-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543410#comment-13543410
 ] 

Woo Ju Shin commented on COMPRESS-212:
--------------------------------------

I have tried a workaround for this.
The code below is getNextTarEntry() from TarArchiveInputStream.java.

    /**
     * Get the next entry in this tar archive. This will skip
     * over any remaining data in the current entry, if there
     * is one, and place the input stream at the header of the
     * next entry, and read the header and instantiate a new
     * TarEntry from the header bytes and return that entry.
     * If there are no more entries in the archive, null will
     * be returned to indicate that the end of the archive has
     * been reached.
     *
     * @return The next TarEntry in the archive, or null.
     * @throws IOException on error
     */
    public TarArchiveEntry getNextTarEntry() throws IOException {
        if (hasHitEOF) {
            return null;
        }

        if (currEntry != null) {
            long numToSkip = entrySize - entryOffset;

            while (numToSkip > 0) {
                long skipped = skip(numToSkip);
                if (skipped <= 0) {
                    throw new RuntimeException("failed to skip current tar entry");
                }
                numToSkip -= skipped;
            }

            readBuf = null;
        }

        byte[] headerBuf = getRecord();

        if (hasHitEOF) {
            currEntry = null;
            return null;
        }

        try {
            currEntry = new TarArchiveEntry(headerBuf, encoding);
        } catch (IllegalArgumentException e) {
            IOException ioe = new IOException("Error detected parsing the header");
            ioe.initCause(e);
            throw ioe;
        }
        entryOffset = 0;
        entrySize = currEntry.getSize();

        if (currEntry.isGNULongNameEntry()) {
            // read in the name
            StringBuffer longName = new StringBuffer();
            byte[] buf = new byte[SMALL_BUFFER_SIZE];
            int length = 0;
            while ((length = read(buf)) >= 0) {
                longName.append(new String(buf, 0, length)); // TODO default charset?
            }
            getNextEntry();
            if (currEntry == null) {
                // Bugzilla: 40334
                // Malformed tar file - long entry name not followed by entry
                return null;
            }
            // remove trailing null terminator
            if (longName.length() > 0
                && longName.charAt(longName.length() - 1) == 0) {
                longName.deleteCharAt(longName.length() - 1);
            }
            currEntry.setName(longName.toString());
        }

        if (currEntry.isPaxHeader()){ // Process Pax headers
            paxHeaders();
        }

        if (currEntry.isGNUSparse()){ // Process sparse files
            readGNUSparse();
        }

        // If the size of the next element in the archive has changed
        // due to a new size being reported in the posix header
        // information, we update entrySize here so that it contains
        // the correct value.
        entrySize = currEntry.getSize();
        return currEntry;
    }

Note the comment '// TODO default charset?'.
This part ignores the encoding passed to TarArchiveInputStream(): 
new String(buf, 0, length) decodes the long name with the platform default 
charset.
In my workaround I take the encoding that I originally passed to 
TarArchiveInputStream(), collect the whole file name in a single byte[] 
variable, and then decode that byte[] into a String with that encoding so that 
it can be set on the entry by entry.setName().
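
A minimal sketch of this idea, not the actual patch: the helper below collects 
all bytes of the long name first and decodes them once with the given encoding. 
The class name LongNameDecoder, the method readLongName() and its InputStream 
parameter are hypothetical stand-ins for the GNU long-name branch of 
getNextTarEntry(), where the real code calls read(buf) on the tar stream itself. 
Decoding the complete byte[] in one step should also avoid cutting a multi-byte 
character in half at a buffer boundary, which chunk-by-chunk decoding can do.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only; not part of Commons Compress.
final class LongNameDecoder {

    // Assumed buffer size; the real SMALL_BUFFER_SIZE constant may differ.
    private static final int BUFFER_SIZE = 256;

    /**
     * Reads every remaining byte from in (standing in for the data of a
     * GNU long-name entry) and decodes the bytes in one step.
     *
     * @param in       stream positioned at the start of the long name
     * @param encoding charset name for file names; null means platform default
     */
    static String readLongName(InputStream in, String encoding) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] buf = new byte[BUFFER_SIZE];
        int length;
        while ((length = in.read(buf)) >= 0) {
            // Keep raw bytes; do not decode per chunk.
            collected.write(buf, 0, length);
        }

        Charset charset = (encoding == null)
                ? Charset.defaultCharset()
                : Charset.forName(encoding);
        String name = new String(collected.toByteArray(), charset);

        // Remove the trailing NUL terminator, as the original code does.
        if (name.length() > 0 && name.charAt(name.length() - 1) == 0) {
            name = name.substring(0, name.length() - 1);
        }
        return name;
    }
}
```

With something like this, the long-name branch would end in 
currEntry.setName(LongNameDecoder.readLongName(stream, encoding)) instead of 
appending to a StringBuffer chunk by chunk.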

This workaround works well so far, but it obviously needs more testing.
I will keep running tests until next week.
                
> TarArchiveEntry getName() returns wrongly encoded name even when you set 
> encoding to TarArchiveInputStream
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: COMPRESS-212
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-212
>             Project: Commons Compress
>          Issue Type: Bug
>    Affects Versions: 1.4.1
>         Environment: Red Hat Enterprise Linux, MS Windows 7
>            Reporter: Woo Ju Shin
>            Priority: Minor
>
> I have two systems. One runs Red Hat Linux, the other MS Windows.
> I created a *.tgz file on Red Hat Linux and tried to decompress it on MS 
> Windows using Commons Compress.
> The default system encodings differ: UTF-8 on Red Hat Linux and CP949 on 
> MS Windows.
> It seems that the file names are decoded with the default encoding even 
> when I use the following to untar the file:
> FileInputStream fis = new FileInputStream(new File(*.tgz));
> TarArchiveInputStream zis = new TarArchiveInputStream(new BufferedInputStream(fis), encodingOfRedHatLinux);
> while ((entry = (TarArchiveEntry)zis.getNextEntry()) != null)
> {
>     entry.getName(); // filename is not UTF-8; it is encoded in CP949, so the filename isn't consistent
> }
> According to the Javadoc of this constructor:
>     /**
>      * Constructor for TarInputStream.
>      * @param is the input stream to use
>      * @param encoding name of the encoding to use for file names
>      * @since Commons Compress 1.4
>      */
>     public TarArchiveInputStream(InputStream is, String encoding) {
>         this(is, TarBuffer.DEFAULT_BLKSIZE, TarBuffer.DEFAULT_RCDSIZE, encoding);
>     }
> the encoding should be used for file names.
> In practice, however, this does not seem to work.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
