[ 
https://issues.apache.org/jira/browse/FLUME-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541735#comment-14541735
 ] 

Jeffrey Theobald commented on FLUME-2215:
-----------------------------------------

Hi there,

This bug doesn't just affect UCS-4. It also affects UTF-8 characters that are 
four bytes long (and only those; two- and three-byte characters don't seem to 
be a problem). This may become a more significant issue, since many emoji are 
four bytes long in UTF-8 and are likely to become increasingly common in text 
data.
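
As a rough illustration of why only the four-byte case differs, here is a minimal sketch (the class and method names are mine, not from Flume) that prints the UTF-8 byte length and the number of Java chars for one character of each size. The four-byte character is the only one that needs a surrogate pair, i.e. two Java chars:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Flume.
class Utf8Lengths {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the source file encoding doesn't matter.
        String[] samples = {
            "\u00A7",        // SECTION SIGN, U+00A7
            "\u2682",        // DIE FACE-3, U+2682
            "\uD83D\uDE0F",  // SMIRKING FACE, U+1F60F (surrogate pair)
        };
        for (String s : samples) {
            System.out.println(utf8Length(s) + " UTF-8 bytes, "
                    + s.length() + " Java chars");
        }
    }
}
```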

Specifically, ingesting a line like:
{code}
"two bytes: § three bytes: ⚂ four bytes: 😏. This text will never be read"
{code}

from a spooling directory will cause a premature EOF at the emoji, so the 
rest of the line and the rest of the file will be lost.
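
The underlying decoder behaviour can be reproduced outside Flume. The sketch below assumes, as the issue description suggests, that readChar() decodes into an output buffer with room for a single char; the class and method names are hypothetical. With one char of space the JDK's UTF-8 decoder reports overflow on a four-byte sequence, while two chars of space are enough for the surrogate pair:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Flume.
class SurrogateOverflowDemo {
    // Decode the bytes into an output buffer with room for `capacity` chars
    // and report whether the decoder ran out of output space.
    static boolean overflows(byte[] bytes, int capacity) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(capacity);
        CoderResult result = decoder.decode(in, out, true);
        return result.isOverflow();
    }

    public static void main(String[] args) {
        // U+1F60F as a surrogate pair; four bytes in UTF-8.
        byte[] emoji = "\uD83D\uDE0F".getBytes(StandardCharsets.UTF_8);
        System.out.println("1-char buffer overflows: " + overflows(emoji, 1));
        System.out.println("2-char buffer overflows: " + overflows(emoji, 2));
    }
}
```

A caller that reads an overflow with no char produced as end of input would stop at exactly this point, which matches the symptom above.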

Applying patch FLUME-2215-3.patch to commit 
619e78fe68658db242808a18f41ee5137b127748 produced a build that fixed this issue 
in my case. However, the patch seems to be a long way out of date relative to 
the current trunk.

> ResettableFileInputStream can't support  ucs-4 character
> --------------------------------------------------------
>
>                 Key: FLUME-2215
>                 URL: https://issues.apache.org/jira/browse/FLUME-2215
>             Project: Flume
>          Issue Type: Bug
>    Affects Versions: v1.5.0
>            Reporter: syntony liu
>            Assignee: Santiago M. Mola
>            Priority: Critical
>              Labels: patch
>         Attachments: 
> 0001-FLUME-2215-Fixes-reading-surrogate-based-chars.patch, 
> FLUME-2215-0-README.txt, FLUME-2215-0.patch, FLUME-2215-1-README.txt, 
> FLUME-2215-1.patch
>
>
> ResettableFileInputStream.java:readChar() does not handle UCS-4 characters. 
> Such a character needs 2 chars in charBuf, which causes an unexpected 
> termination.
> A temporary solution:
>     if (res.isOverflow() && !charBuf.hasRemaining()) {
>         logger.warn("decoder ucs-4 at position: {}", buf.position());
>         tmpBuf.clear();
>         res = decoder.decode(buf, tmpBuf, isEndOfInput);
>         incrPosition(buf.position() - start, false);
>         return '?';
>     }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
