[
https://issues.apache.org/jira/browse/FLUME-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13779266#comment-13779266
]
Sven Meys commented on FLUME-2182:
----------------------------------
I found the same error.
It happens when the byte buffer reaches its end, so only part of a character is
read.
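A minimal sketch of that situation, using only java.nio. The class and method names are hypothetical and this is not Flume's code. A UTF-8 decoder that is given two of the three bytes of a character produces no char and leaves those bytes unread; once the buffer is compacted and the missing byte appended, the character decodes normally.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class PartialCharDemo {
    /** Decodes at most one char; returns -1 when no complete character is available. */
    static int decodeOne(CharsetDecoder decoder, ByteBuffer buf, boolean endOfInput)
            throws CharacterCodingException {
        CharBuffer out = CharBuffer.allocate(1);
        CoderResult res = decoder.decode(buf, out, endOfInput);
        if (res.isError()) {
            res.throwException();
        }
        out.flip();
        return out.hasRemaining() ? out.get() : -1;
    }

    /** Returns {first decode result, bytes left unread, second decode result}. */
    static int[] run() throws CharacterCodingException {
        byte[] bytes = "\u4e2d".getBytes(StandardCharsets.UTF_8); // 3 bytes in UTF-8
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();

        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.put(bytes, 0, 2);   // the buffer ends in the middle of the character
        buf.flip();
        int first = decodeOne(decoder, buf, false);
        int leftover = buf.remaining();

        buf.compact();          // keep the unread bytes at the front
        buf.put(bytes[2]);      // "refill" with the missing byte
        buf.flip();
        int second = decodeOne(decoder, buf, true);

        return new int[] {first, leftover, second};
    }

    public static void main(String[] args) throws CharacterCodingException {
        int[] r = run();
        System.out.println("first=" + r[0] + " leftover=" + r[1]
                + " second=U+" + Integer.toHexString(r[2]));
    }
}
```

Returning -1 after the first call, as if the file had ended, is what drops the rest of the data; the leftover bytes need to stay in the buffer until more input arrives.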
I also tried to solve it using the method described in the bug report, by
ensuring there are at least 32 bytes in the buffer at all times. But I found
that somehow, among the millions of lines parsed, there is one extra line
written to the output that contains exactly as many characters as the extra
space you allow for the buffer refill.
To solve this, I commented out the line chan.position(position); in the
refillBuf() method.
However, this may cause a whole series of other problems, which I have no time
to research right now.
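One possible explanation for the extra output, shown as a toy model rather than Flume's actual code: assume a refill keeps the unread bytes in the buffer and then seeks back to the consumed position before reading. The kept bytes are then read a second time. Everything in this sketch (the in-memory "file", the sizes, the seekPastBuffered flag) is hypothetical.

```java
import java.nio.ByteBuffer;

class RefillDemo {
    /**
     * Reads an in-memory "file" through a small buffer, refilling whenever
     * fewer than `threshold` bytes remain. If seekPastBuffered is false, the
     * refill starts at the consumed position even when unread bytes are kept.
     */
    static String readAll(byte[] file, int bufSize, int threshold,
                          boolean seekPastBuffered) {
        ByteBuffer buf = ByteBuffer.allocate(bufSize);
        buf.flip();                 // start empty
        int position = 0;           // bytes consumed so far
        StringBuilder out = new StringBuilder();
        while (true) {
            if (buf.remaining() < threshold) {
                buf.compact();      // unread bytes move to the front
                int from = seekPastBuffered
                        ? position + buf.position()  // skip what is still buffered
                        : position;                  // re-reads buffered bytes
                int n = Math.min(buf.remaining(), file.length - from);
                if (n > 0) {
                    buf.put(file, from, n);
                }
                buf.flip();
            }
            if (!buf.hasRemaining()) {
                break;
            }
            out.append((char) buf.get());
            position++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] file = "123456".getBytes();
        System.out.println(readAll(file, 4, 1, false)); // refill only when empty
        System.out.println(readAll(file, 4, 2, false)); // early refill, plain seek
        System.out.println(readAll(file, 4, 2, true));  // early refill, adjusted seek
    }
}
```

In this model, refilling early with the plain seek duplicates a character, while refilling only when the buffer is empty, or seeking past the bytes still buffered, does not. Whether the same mechanism is at work in ResettableFileInputStream would need to be checked against its source.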
> Spooling Directory Source can't ingest data completely, when a file contain
> some wide character, such as chinese character.
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: FLUME-2182
> URL: https://issues.apache.org/jira/browse/FLUME-2182
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v1.4.0
> Reporter: syntony liu
> Priority: Critical
>
> The bug is in ResettableFileInputStream.java, in int readChar().
> If the last bytes of buf hold only part of a wide character, readChar()
> shouldn't return -1 (ResettableFileInputStream.java:186), because that
> loses the remaining data in the file.
> I fixed it as follows:
> public synchronized int readChar() throws IOException {
>   // Original condition: if (!buf.hasRemaining()) {
>   // Refill early so a multi-byte character is not split at the buffer end.
>   if (buf.limit() - buf.position() < 10) {
>     refillBuf();
>   }
>   int start = buf.position();
>   charBuf.clear();
>   boolean isEndOfInput = false;
>   if (position >= fileSize) {
>     isEndOfInput = true;
>   }
>   CoderResult res = decoder.decode(buf, charBuf, isEndOfInput);
>   if (res.isMalformed() || res.isUnmappable()) {
>     res.throwException();
>   }
>   int delta = buf.position() - start;
>   charBuf.flip();
>   if (charBuf.hasRemaining()) {
>     char c = charBuf.get();
>     // don't increment the persisted location if we are in between a
>     // surrogate pair, otherwise we may never recover if we seek() to this
>     // location!
>     incrPosition(delta, !Character.isHighSurrogate(c));
>     return c;
>   // there may be a partial character in the decoder buffer
>   } else {
>     incrPosition(delta, false);
>     return -1;
>   }
> }
> This avoids the partial character, but introduces a new issue: sometimes a
> line of the log file contains a repeated character, e.g.
> original file: 123456
> sink file: 1233456