[ https://issues.apache.org/jira/browse/IO-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485300#comment-13485300 ]
Liyu Yi edited comment on IO-354 at 10/27/12 12:10 AM:
-------------------------------------------------------

I used a "hacky" fix in the handler class to rebuild the String with the right encoding:

    private String rebuildUTF8String(String line) {
        // Tailer built each char from a single raw byte, so casting back recovers the bytes.
        int len = line.length();
        byte[] bytes = new byte[len];
        for (int i = 0; i < len; i++) {
            bytes[i] = (byte) line.charAt(i);
        }
        // UTF8 is assumed to be a Charset constant such as StandardCharsets.UTF_8.
        return new String(bytes, UTF8);
    }

However, the right approach is to pass the encoding to the "create" method and handle it in the Tailer.

> Commons IO Tailer does not respect UTF-8 Charset
> ------------------------------------------------
>
>                 Key: IO-354
>                 URL: https://issues.apache.org/jira/browse/IO-354
>             Project: Commons IO
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.3
>         Environment: JDK 7
>                      RHEL Linux
>                      Apache Commons IO version 2.4
>            Reporter: Liyu Yi
>              Labels: Charset, Encoding, Tailer
>
> I just realized there is a defect in the source code of
> "org.apache.commons.io.input.Tailer.java". Basically, the current
> implementation does not work for multi-byte encoded files.
> See the following snippet:
>
> 448     private long readLines(RandomAccessFile reader) throws IOException {
> 449         StringBuilder sb = new StringBuilder();
> 450
> 451         long pos = reader.getFilePointer();
> 452         long rePos = pos; // position to re-read
> 453
> 454         int num;
> 455         boolean seenCR = false;
> 456         while (run && ((num = reader.read(inbuf)) != -1)) {
> 457             for (int i = 0; i < num; i++) {
> 458                 byte ch = inbuf[i];
> 459                 switch (ch) {
> 460                 case '\n':
> 461                     seenCR = false; // swallow CR before LF
> 462                     listener.handle(sb.toString());
> 463                     sb.setLength(0);
> 464                     rePos = pos + i + 1;
> 465                     break;
> 466                 case '\r':
> 467                     if (seenCR) {
> 468                         sb.append('\r');
> 469                     }
> 470                     seenCR = true;
> 471                     break;
> 472                 default:
> 473                     if (seenCR) {
> 474                         seenCR = false; // swallow final CR
> 475                         listener.handle(sb.toString());
> 476                         sb.setLength(0);
> 477                         rePos = pos + i + 1;
> 478                     }
> 479                     sb.append((char) ch); // add character, not its ascii value
> 480                 }
> 481             }
> 482
> 483             pos = reader.getFilePointer();
> 484         }
> 485
> 486         reader.seek(rePos); // Ensure we can re-read if necessary
> 487         return rePos;
> 488     }
>
> At line 479, the conversion of byte to char types breaks the encoding.
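
To illustrate the point about line 479, here is a minimal sketch of how the line splitting could be made charset-aware. It is not the Tailer code and not a proposed patch: the class and method names (CharsetAwareLineSplitter, castBytesToChars, splitLines) are hypothetical, and it works on an in-memory byte array instead of a RandomAccessFile. The idea is to collect the raw bytes of each line and decode them only when the line is complete. It assumes an ASCII-compatible encoding such as UTF-8, where the bytes for CR and LF never occur inside a multi-byte sequence; it would not be correct for UTF-16.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Commons IO.
final class CharsetAwareLineSplitter {

    // Mirrors the cast at line 479: every byte becomes one char,
    // so the bytes of a multi-byte sequence end up as separate characters.
    static String castBytesToChars(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Same CR/LF handling as the quoted readLines, but raw bytes are buffered
    // per line and decoded with the given charset once the line is complete.
    // Bytes after the last line terminator are not returned.
    static List<String> splitLines(byte[] data, Charset charset) {
        List<String> lines = new ArrayList<>();
        ByteArrayOutputStream lineBytes = new ByteArrayOutputStream();
        boolean seenCR = false;
        for (byte b : data) {
            switch (b) {
            case '\n':
                seenCR = false; // swallow CR before LF
                lines.add(new String(lineBytes.toByteArray(), charset));
                lineBytes.reset();
                break;
            case '\r':
                if (seenCR) {
                    lineBytes.write('\r');
                }
                seenCR = true;
                break;
            default:
                if (seenCR) {
                    seenCR = false; // a lone CR also ends the line
                    lines.add(new String(lineBytes.toByteArray(), charset));
                    lineBytes.reset();
                }
                lineBytes.write(b); // keep the raw byte, decode later
            }
        }
        return lines;
    }
}
```

Calling splitLines with the UTF-8 bytes of a file and StandardCharsets.UTF_8 should return each terminated line with its non-ASCII characters intact, while castBytesToChars on the same bytes shows the garbling described above. In a real fix the charset would presumably be supplied by the caller, for example through the "create" method as suggested in the comment.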