I tried using mmutf8fix as shown below, but it didn't seem to fix the
problem. I am monitoring a log file with the imfile input, parsing it with
mmnormalize, and sending JSON to Elasticsearch with omelasticsearch.

I checked the encoding of the log file using "file -bi" and it says
"text/plain; charset=us-ascii". However, it contains some Hindi characters,
which I assume are encoded with us-ascii. If I understand correctly,
us-ascii is a subset of UTF-8. If this is the case, do I really need to use
mmutf8fix?
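
One way to double-check the "file -bi" result is to test the raw bytes
directly. US-ASCII only covers bytes 0x00-0x7F, so it cannot represent Hindi
characters; as far as I know, "file" only inspects a leading portion of the
file, so non-ASCII bytes further in could be missed. Below is a minimal Python
sketch of such a check. The function name classify_bytes is my own invention
for illustration and is not part of rsyslog or any library.

```python
def classify_bytes(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'invalid' for a byte string.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("ascii")      # succeeds only if every byte is <= 0x7F
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")      # succeeds only for well-formed UTF-8
        return "utf-8"
    except UnicodeDecodeError:
        return "invalid"


if __name__ == "__main__":
    import sys

    # Usage: python classify.py /path/to/logfile
    # Reports the classification of each line that is not pure ASCII.
    with open(sys.argv[1], "rb") as fh:
        for lineno, line in enumerate(fh, start=1):
            kind = classify_bytes(line)
            if kind != "ascii":
                print(f"line {lineno}: {kind}")
```

If this reports "utf-8" lines, the file is not pure us-ascii after all; if it
reports "invalid" lines, those are presumably the ones mmutf8fix would alter.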

To me it seems like the Hindi characters are UTF-8 encoded with 3-byte
sequences and when they are received by Elasticsearch the byte sequence is
incorrectly decoded to an invalid Unicode sequence, such as "\u00.4". Is this
plausible?
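
To illustrate why I think so: Devanagari code points (U+0900-U+097F) encode in
UTF-8 as three bytes whose first byte is 0xE0, which would match the "\u00E0"
escape I saw. Here is a small Python sketch of that mis-decoding. It assumes
the bytes get interpreted one at a time as Latin-1, which is only my guess at
what happens somewhere in the pipeline; the helper name misdecode_as_latin1 is
hypothetical.

```python
import json


def misdecode_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode each byte as Latin-1.

    Hypothetical helper: Latin-1 is an assumed stand-in for whatever
    single-byte interpretation is actually applied.
    """
    return text.encode("utf-8").decode("latin-1")


ka = "\u0915"                      # DEVANAGARI LETTER KA
raw = ka.encode("utf-8")
print(raw.hex())                   # e0a495 (three bytes, lead byte 0xE0)

# Each byte becomes its own code point, so JSON shows three \u00XX escapes.
print(json.dumps(misdecode_as_latin1(ka)))   # "\u00e0\u00a4\u0095"
```

The escapes produced this way are still valid JSON, so this alone would not
explain an escape with a non-hex character in it.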

module(load="imfile")
module(load="mmutf8fix")
module(load="mmnormalize")
module(load="omelasticsearch")

input(type="imfile" ruleset="X" ...)
ruleset(name="X") {
  action(type="mmutf8fix")
  action(type="mmnormalize" ...)
  action(type="omelasticsearch" ...)
}

Thanks,

Alec

On Tue, Jun 28, 2016 at 4:49 PM, Alec Swan <[email protected]> wrote:

> Thanks for the suggestion, Dave.  I noticed that on the client side the
> log contained Hindi characters that got translated to "\u00E0\u00.4�\"
> which eventually caused the error. I'll give the mmutf8fix plugin a try.
>
> Thanks,
>
> Alec
>
> On Tue, Jun 28, 2016 at 3:24 PM, Dave Caplinger <
> [email protected]> wrote:
>
>> On Jun 28, 2016, at 4:04 PM, Alec Swan <[email protected]> wrote:
>> >
>> > I think the root cause of the problem is that there is an invalid UTF-8
>> > sequence "\u00.4" in the value of the "message" field. In fact, I just
>> > confirmed that {"message":"\u00.4"} is not valid JSON on
>> > http://jsonlint.com/.
>>
>> I've run into something similar where the original message source was
>> sending Windows-1252 or another character set.  Rsyslog doesn't know the
>> incoming character set, so it doesn't know that it needs to be converted to
>> UTF-8. (That particular input would receive logs from various sources, so
>> the character set could vary per message.)
>>
>> The fix we used was to add action(type="mmutf8fix") to the affected
>> ruleset prior to any JSON template use.  This isn't strictly accurate
>> because you lose the 'invalid' character in the resulting string, but at
>> least that string is JSON-safe.  In the ideal case you'd know what the
>> original character set was and explicitly convert it to UTF-8, but that
>> wasn't practical in our use case.
>>
>> --
>> Dave Caplinger | Director, Technical Product Management
>> Solutionary — An NTT Group Security Company
>>
>> _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com/professional-services/
>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>> DON'T LIKE THAT.
>>
>
>
