[
https://issues.apache.org/jira/browse/FLUME-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541735#comment-14541735
]
Jeffrey Theobald commented on FLUME-2215:
-----------------------------------------
Hi there,
This bug doesn't just affect UCS-4. It also affects UTF-8 characters that are
four bytes long (and only four bytes long; two- and three-byte characters don't
seem to be a problem). This might become a more significant issue, since a
number of emoji are four bytes long in UTF-8 and are probably going to be
increasingly common in text data.
Specifically, ingesting a line like:
{code}
"two bytes: § three bytes: ⚂ four bytes: 😏. This text will never be read"
{code}
from a spooling directory will cause a premature EOF at the emoji, so the
rest of the line and the rest of the file will be lost.
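A minimal standalone sketch of what seems to be going on, using only the JDK and not Flume itself: a four-byte UTF-8 character decodes to two Java chars (a surrogate pair), so a CharsetDecoder writing into a one-char buffer reports overflow without producing anything. The class and method names below (FourByteUtf8Demo, utf8Length, overflows) are made up for illustration, and the link to readChar() is my assumption based on the issue description.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class FourByteUtf8Demo {

    /** Number of bytes the string occupies when encoded as UTF-8. */
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /**
     * Decodes the UTF-8 encoding of s into a CharBuffer with the given
     * capacity and reports whether the decoder signalled overflow.
     */
    public static boolean overflows(String s, int charCapacity) {
        ByteBuffer in = ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
        CharBuffer out = CharBuffer.allocate(charCapacity);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CoderResult res = decoder.decode(in, out, true);
        return res.isOverflow();
    }

    public static void main(String[] args) {
        String section = "\u00A7";      // section sign: 2 bytes in UTF-8, 1 Java char
        String die = "\u2682";          // die face-3: 3 bytes in UTF-8, 1 Java char
        String smirk = "\uD83D\uDE0F";  // U+1F60F: 4 bytes in UTF-8, 2 Java chars

        System.out.println("bytes: " + utf8Length(section) + ", "
                + utf8Length(die) + ", " + utf8Length(smirk));
        System.out.println("chars in emoji: " + smirk.length());

        // A one-char buffer is enough for the 2- and 3-byte characters...
        System.out.println("3-byte char overflows 1-char buffer: " + overflows(die, 1));
        // ...but not for the surrogate pair produced by the 4-byte character.
        System.out.println("4-byte char overflows 1-char buffer: " + overflows(smirk, 1));
        System.out.println("4-byte char overflows 2-char buffer: " + overflows(smirk, 2));
    }
}
```

If the reader treats "overflow with nothing decoded" the same as end of input, that would explain why everything after the emoji is dropped.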
Applying patch FLUME-2215-3.patch to commit
619e78fe68658db242808a18f41ee5137b127748 created a build that fixed this issue
in my case, but the patch seems to be a long way out of date relative to the
current trunk.
> ResettableFileInputStream can't support ucs-4 character
> --------------------------------------------------------
>
> Key: FLUME-2215
> URL: https://issues.apache.org/jira/browse/FLUME-2215
> Project: Flume
> Issue Type: Bug
> Affects Versions: v1.5.0
> Reporter: syntony liu
> Assignee: Santiago M. Mola
> Priority: Critical
> Labels: patch
> Attachments:
> 0001-FLUME-2215-Fixes-reading-surrogate-based-chars.patch,
> FLUME-2215-0-README.txt, FLUME-2215-0.patch, FLUME-2215-1-README.txt,
> FLUME-2215-1.patch
>
>
> ResettableFileInputStream.java:readChar() does not handle UCS-4 characters.
> Such a character needs 2 chars of buffer space, so it causes an unexpected
> termination.
> A temporary solution:
> if (res.isOverflow() && !charBuf.hasRemaining()) {
>   logger.warn("decoder ucs-4 at position: {}", buf.position());
>   tmpBuf.clear();
>   res = decoder.decode(buf, tmpBuf, isEndOfInput);
>   incrPosition(buf.position() - start, false);
>   return '?';
> }