[jira] [Comment Edited] (IO-354) Commons IO Tailer does not respect UTF-8 Charset

Liyu Yi (JIRA) Fri, 26 Oct 2012 17:11:14 -0700

    [ 
https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485300#comment-13485300
 ]


Liyu Yi edited comment on IO-354 at 10/27/12 12:10 AM:
-------------------------------------------------------

I used a "hacky" fix to reconstruct the String with the right encoding in the 
handler class. 

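        // Each char holds one raw byte (see the cast in Tailer.readLines),
        // so narrow it back to a byte and decode the bytes as UTF-8.
        private String rebuildUTF8String(String line) {
                int len = line.length();
                byte[] bytes = new byte[len];
                for (int i = 0; i < len; i++) {
                        bytes[i] = (byte) line.charAt(i);
                }
                // requires: import java.nio.charset.StandardCharsets; (JDK 7+)
                return new String(bytes, StandardCharsets.UTF_8);
        }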

However, the right approach is to pass in the encoding in the "create" method 
and handle it in the Tailer.
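
A minimal sketch of what charset-aware handling could look like, assuming the
raw bytes of a line are buffered and decoded only once the line is complete.
CharsetLineSplitter and feed are hypothetical names used for illustration; they
are not part of Commons IO. Splitting on the LF byte is safe for UTF-8 because
that byte value never occurs inside a multi-byte sequence; this would not hold
for UTF-16. Unlike readLines, the sketch does not treat a bare CR as a line end.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Commons IO.
final class CharsetLineSplitter {
    private final Charset charset;
    // Raw bytes of the line currently being assembled.
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();

    CharsetLineSplitter(Charset charset) {
        this.charset = charset;
    }

    // Consumes buf[0..len) and returns every line completed by this chunk.
    List<String> feed(byte[] buf, int len) {
        List<String> lines = new ArrayList<String>();
        for (int i = 0; i < len; i++) {
            if (buf[i] == '\n') {
                byte[] raw = pending.toByteArray();
                int n = raw.length;
                if (n > 0 && raw[n - 1] == '\r') {
                    n--; // drop the CR of a CRLF pair
                }
                // Decode the whole line at once so multi-byte sequences stay intact.
                lines.add(new String(raw, 0, n, charset));
                pending.reset();
            } else {
                pending.write(buf[i]);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "h\u00e9llo\r\nw\u00f6rld\n".getBytes(StandardCharsets.UTF_8);
        CharsetLineSplitter splitter = new CharsetLineSplitter(StandardCharsets.UTF_8);
        // Split the input in the middle of the two-byte sequence for the accented e.
        List<String> lines = new ArrayList<String>(splitter.feed(Arrays.copyOfRange(data, 0, 2), 2));
        byte[] rest = Arrays.copyOfRange(data, 2, data.length);
        lines.addAll(splitter.feed(rest, rest.length));
        System.out.println(lines);
    }
}
```

Because the pending bytes survive across calls, a character whose bytes straddle
two reads is still decoded correctly.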
                
> Commons IO Tailer does not respect UTF-8 Charset
> ------------------------------------------------
>
>                 Key: IO-354
>                 URL: https://issues.apache.org/jira/browse/IO-354
>             Project: Commons IO
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.3
>         Environment: JDK 7 
> RHEL Linux
> Apache Commons IO version 2.4
>            Reporter: Liyu Yi
>              Labels: Charset, Encoding, Tailer
>
> I just realized there is a defect in the source code of 
> "org.apache.commons.io.input.Tailer.java". Basically, the current 
> implementation does not work for files in a multi-byte encoding. See the 
> following snippet:
> 448    private long readLines(RandomAccessFile reader) throws IOException {
> 449        StringBuilder sb = new StringBuilder();
> 450
> 451        long pos = reader.getFilePointer();
> 452        long rePos = pos; // position to re-read
> 453
> 454        int num;
> 455        boolean seenCR = false;
> 456        while (run && ((num = reader.read(inbuf)) != -1)) {
> 457            for (int i = 0; i < num; i++) {
> 458                byte ch = inbuf[i];
> 459                switch (ch) {
> 460                case '\n':
> 461                    seenCR = false; // swallow CR before LF
> 462                    listener.handle(sb.toString());
> 463                    sb.setLength(0);
> 464                    rePos = pos + i + 1;
> 465                    break;
> 466                case '\r':
> 467                    if (seenCR) {
> 468                        sb.append('\r');
> 469                    }
> 470                    seenCR = true;
> 471                    break;
> 472                default:
> 473                    if (seenCR) {
> 474                        seenCR = false; // swallow final CR
> 475                        listener.handle(sb.toString());
> 476                        sb.setLength(0);
> 477                        rePos = pos + i + 1;
> 478                    }
> 479                    sb.append((char) ch); // add character, not its ascii value
> 480                }
> 481            }
> 482
> 483            pos = reader.getFilePointer();
> 484        }
> 485
> 486        reader.seek(rePos); // Ensure we can re-read if necessary
> 487        return rePos;
> 488    }
> At line 479, the cast from byte to char turns every byte into a separate 
> character, so multi-byte sequences are split and the encoding is broken.
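
The effect of that cast can be reproduced in isolation. The sketch below is
illustrative only; ByteCastDemo and castEachByte are hypothetical names, and the
loop merely mimics the per-byte append at line 479. It also shows why narrowing
each char back to a byte, as in the workaround from the comment, recovers the
original bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; not part of Commons IO.
final class ByteCastDemo {
    // Mimics the cast at line 479: every byte becomes one char.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) b); // sign-extends, so 0xC3 becomes the char 0xFFC3
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // e with acute accent: bytes 0xC3 0xA9 in UTF-8
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String viaCast = castEachByte(utf8);
        String decoded = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(viaCast.length());                        // 2
        System.out.println(Integer.toHexString(viaCast.charAt(0)));  // ffc3
        System.out.println(decoded.equals(original));                // true
        // Narrowing back to byte restores the original value.
        System.out.println((byte) viaCast.charAt(0) == utf8[0]);     // true
    }
}
```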

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
