[ https://issues.apache.org/jira/browse/FLUME-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541735#comment-14541735 ]
Jeffrey Theobald commented on FLUME-2215:
-----------------------------------------

Hi there,

This bug doesn't just affect UCS-4. It also affects UTF-8 characters that are four bytes long (and only those; two- and three-byte characters don't seem to be a problem). This may become a more significant issue, since a number of emoji are four bytes long in UTF-8 and are likely to become increasingly common in text data.

Specifically, ingesting a line like:

{code}
"two bytes: § three bytes: ⚂ four bytes: 😏. This text will never be read"
{code}

from a spooling directory causes a premature EOF at the emoticon, so the rest of the line and the rest of the file are lost.

Applying patch FLUME-2215-3.patch to commit 619e78fe68658db242808a18f41ee5137b127748 produced a build that fixed this issue in my case, but the patch appears to be well out of date relative to the current trunk.

> ResettableFileInputStream can't support ucs-4 character
> --------------------------------------------------------
>
>                 Key: FLUME-2215
>                 URL: https://issues.apache.org/jira/browse/FLUME-2215
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v1.5.0
>            Reporter: syntony liu
>            Assignee: Santiago M. Mola
>            Priority: Critical
>              Labels: patch
>         Attachments: 0001-FLUME-2215-Fixes-reading-surrogate-based-chars.patch, FLUME-2215-0-README.txt, FLUME-2215-0.patch, FLUME-2215-1-README.txt, FLUME-2215-1.patch
>
>
> ResettableFileInputStream.java:readChar() does not handle UCS-4 characters. Such a character needs 2 chars in charBuf, which causes an unexpected termination.
> A temporary solution:
>         if (res.isOverflow() && !charBuf.hasRemaining()) {
>           logger.warn("decoder ucs-4 at position: {}", buf.position());
>           tmpBuf.clear();
>           res = decoder.decode(buf, tmpBuf, isEndOfInput);
>           incrPosition(buf.position() - start, false);
>           return '?';
>         }

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
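
For anyone who wants to see the underlying decoder behaviour outside Flume, here is a minimal standalone sketch. It is not Flume code: the class name SurrogateDecodeDemo and the helper decodeInto are made up for illustration, and it assumes the JDK's built-in UTF-8 decoder. It decodes one character at a time into a CharBuffer of a chosen capacity and reports what the decoder did. A four-byte UTF-8 character decodes to a surrogate pair (two Java chars), so a one-char buffer cannot hold it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class SurrogateDecodeDemo {

    /**
     * Decodes the given UTF-8 bytes into a CharBuffer of the given capacity
     * and summarises the result (hypothetical helper, for illustration only).
     */
    static String decodeInto(byte[] utf8, int charCapacity) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(utf8);
        CharBuffer out = CharBuffer.allocate(charCapacity);

        // Single decode call with endOfInput = true, as if this were the last chunk.
        CoderResult res = decoder.decode(in, out, true);
        out.flip();

        String status = res.isOverflow() ? "OVERFLOW"
                : res.isUnderflow() ? "UNDERFLOW"
                : "ERROR";
        return status
                + " bytesConsumed=" + in.position()
                + " charsProduced=" + out.remaining();
    }

    public static void main(String[] args) {
        byte[] twoBytes = "\u00A7".getBytes(StandardCharsets.UTF_8);        // section sign
        byte[] threeBytes = "\u2682".getBytes(StandardCharsets.UTF_8);      // die face-3
        byte[] fourBytes = "\uD83D\uDE0F".getBytes(StandardCharsets.UTF_8); // U+1F60F

        System.out.println("2-byte char, 1-char buffer: " + decodeInto(twoBytes, 1));
        System.out.println("3-byte char, 1-char buffer: " + decodeInto(threeBytes, 1));
        System.out.println("4-byte char, 1-char buffer: " + decodeInto(fourBytes, 1));
        System.out.println("4-byte char, 2-char buffer: " + decodeInto(fourBytes, 2));
    }
}
```

With a one-char buffer, the two- and three-byte characters should decode normally, while the four-byte character should report OVERFLOW with no bytes consumed and no chars produced. A caller that treats "no char produced" as end of input would then stop at that point, which is consistent with the premature EOF described above. With a two-char buffer, the same input should decode to both halves of the surrogate pair.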