Liyu Yi created IO-354:
--------------------------

             Summary: Commons IO Tailer does not respect UTF-8 Charset
                 Key: IO-354
                 URL: https://issues.apache.org/jira/browse/IO-354
             Project: Commons IO
          Issue Type: Bug
          Components: Utilities
    Affects Versions: 2.3
         Environment: JDK 7
                      RHEL Linux
                      Apache Commons IO version 2.4
            Reporter: Liyu Yi

I just realized there is a defect in the source code of "org.apache.commons.io.input.Tailer.java". Basically, the current implementation does not work for files in a multi-byte encoding such as UTF-8. See the following snippet:

    448     private long readLines(RandomAccessFile reader) throws IOException {
    449         StringBuilder sb = new StringBuilder();
    450
    451         long pos = reader.getFilePointer();
    452         long rePos = pos; // position to re-read
    453
    454         int num;
    455         boolean seenCR = false;
    456         while (run && ((num = reader.read(inbuf)) != -1)) {
    457             for (int i = 0; i < num; i++) {
    458                 byte ch = inbuf[i];
    459                 switch (ch) {
    460                 case '\n':
    461                     seenCR = false; // swallow CR before LF
    462                     listener.handle(sb.toString());
    463                     sb.setLength(0);
    464                     rePos = pos + i + 1;
    465                     break;
    466                 case '\r':
    467                     if (seenCR) {
    468                         sb.append('\r');
    469                     }
    470                     seenCR = true;
    471                     break;
    472                 default:
    473                     if (seenCR) {
    474                         seenCR = false; // swallow final CR
    475                         listener.handle(sb.toString());
    476                         sb.setLength(0);
    477                         rePos = pos + i + 1;
    478                     }
    479                     sb.append((char) ch); // add character, not its ascii value
    480                 }
    481             }
    482
    483             pos = reader.getFilePointer();
    484         }
    485
    486         reader.seek(rePos); // Ensure we can re-read if necessary
    487         return rePos;
    488     }

At line 479, each byte is cast to a char on its own. That breaks the encoding: every byte of a multi-byte sequence is appended as a separate char instead of the sequence being decoded into one character.
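
To illustrate the problem at line 479, here is a minimal, self-contained sketch. It is not the Tailer code and not a proposed patch; the class name TailerCharsetSketch and the methods castBytes and decodeBytes are hypothetical names made up for this example. It compares the per-byte cast with one possible alternative, which is to collect the raw bytes of a line and decode them in one step with an explicit Charset.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Commons IO.
public class TailerCharsetSketch {

    // Mirrors the approach at line 479: every byte is cast to a char on its own.
    static String castBytes(byte[] lineBytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : lineBytes) {
            sb.append((char) b); // bytes >= 0x80 are negative and sign-extend to 0xFFxx
        }
        return sb.toString();
    }

    // Assumed alternative: buffer the raw bytes of the line, then decode once.
    static String decodeBytes(byte[] lineBytes, Charset charset) {
        ByteArrayOutputStream lineBuf = new ByteArrayOutputStream();
        for (byte b : lineBytes) {
            lineBuf.write(b); // keep the byte untouched until the line is complete
        }
        return new String(lineBuf.toByteArray(), charset);
    }

    public static void main(String[] args) {
        String original = "h\u00e9llo"; // "e" with acute accent: 2 bytes in UTF-8
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Per-byte cast: 6 chars come back instead of 5, and the text is corrupted.
        System.out.println(original.equals(castBytes(utf8)));   // prints "false"

        // Decoding the whole line with the right charset restores the text.
        System.out.println(original.equals(decodeBytes(utf8, StandardCharsets.UTF_8))); // prints "true"
    }
}
```

Pure ASCII input survives the cast unchanged, which is probably why the defect only shows up with multi-byte content. Buffering per line works for UTF-8 because the bytes 0x0A and 0x0D never occur inside a multi-byte UTF-8 sequence, so splitting on them at the byte level is safe; that does not hold for every encoding (UTF-16, for example).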