I tried using mmutf8fix as shown below, but it didn't seem to fix the
problem. What I am doing is monitoring a log file with the imfile input,
parsing it with mmnormalize, and sending JSON to Elasticsearch with
omelasticsearch.

I checked the encoding of the log file using "file -bi" and it says
"text/plain; charset=us-ascii". However, it contains some Hindi characters,
which I assume are encoded with us-ascii. If I understand correctly,
us-ascii is a subset of UTF-8. If this is the case, do I really need to use
mmutf8fix?
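
One way to double-check what "file -bi" reports is to scan the raw bytes
directly. The sketch below is only an illustration: the helper name
classify_bytes and the sample strings are made up for this example, and it
assumes the log lines of interest are available as bytes. Pure us-ascii has
no byte above 0x7F, so a line that really contains Hindi text cannot be
us-ascii. Note also that "file" may only sample the beginning of a file, so
non-ASCII bytes further in can go unnoticed.

```python
def classify_bytes(data: bytes) -> str:
    """Roughly classify a byte string (hypothetical helper for illustration)."""
    # US-ASCII uses only the values 0x00-0x7F.
    if all(b < 0x80 for b in data):
        return "us-ascii"
    # Anything else is either well-formed UTF-8 or some other 8-bit encoding.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown-8bit"


print(classify_bytes(b"plain log line"))                    # us-ascii
print(classify_bytes("\u0928\u092e".encode("utf-8")))       # utf-8
print(classify_bytes(b"\xe0\xa4"))                          # unknown-8bit
```

Running a check like this per line, rather than per file, would show which
messages actually carry multi-byte sequences.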

To me it seems like the Hindi characters are UTF-8 encoded as 3-byte
sequences, and when they are received by Elasticsearch each byte sequence is
incorrectly decoded into an invalid Unicode escape sequence, such as
"\u00.4". Is this plausible?
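
The hypothesis above can be reproduced in a few lines. This is a sketch
under stated assumptions, not a description of what rsyslog or Elasticsearch
actually does internally: it picks one Devanagari letter arbitrarily and
assumes that some stage treats every byte as a separate Latin-1 character
before JSON-escaping it. Devanagari code points (U+0900 to U+097F) encode in
UTF-8 as three bytes starting with E0 A4 or E0 A5, which resembles the
"\u00E0\u00.4" fragment seen in the log.

```python
import json

# DEVANAGARI LETTER NA (U+0928), chosen arbitrarily for illustration.
hindi = "\u0928"

# Correct UTF-8 encoding: three bytes.
raw = hindi.encode("utf-8")
hex_bytes = " ".join(f"{b:02X}" for b in raw)
print(hex_bytes)  # E0 A4 A8

# Assumption: a stage decodes each byte as its own Latin-1 character.
mis_decoded = raw.decode("latin-1")
print(json.dumps(mis_decoded))  # "\u00e0\u00a4\u00a8"

# If only this one mis-decoding happened, it is reversible.
repaired = mis_decoded.encode("latin-1").decode("utf-8")
print(repaired == hindi)  # True
```

Note that "\u00e0\u00a4" is still valid JSON, whereas "\u00.4" is not, so
in this scenario something further would have to damage the second escape.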

module(load="imfile")
module(load="mmutf8fix")
module(load="mmnormalize")
module(load="omelasticsearch")

input(type="imfile" Ruleset="X" ...)

ruleset(name="X") {
    action(type="mmutf8fix")
    action(type="mmnormalize" ...)
    action(type="omelasticsearch" ...)
}

Thanks,
Alec
On Tue, Jun 28, 2016 at 4:49 PM, Alec Swan <[email protected]> wrote:
> Thanks for the suggestion, Dave. I noticed that on the client side the
> log contained Hindi characters that got translated to "\u00E0\u00.4�\"
> which eventually caused the error. I'll give mmutf8fix plugin a try.
>
> Thanks,
>
> Alec
>
> On Tue, Jun 28, 2016 at 3:24 PM, Dave Caplinger <
> [email protected]> wrote:
>
>> On Jun 28, 2016, at 4:04 PM, Alec Swan <[email protected]> wrote:
>> >
>> > I think the root cause of the problem is that there is an invalid UTF-8
>> > sequence "\u00.4" in the value of the "message" field. In fact, I just
>> > confirmed that {"message":"\u00.4"} is not valid JSON on
>> > http://jsonlint.com/.
>>
>> I've run into something similar where the original message source was
>> sending Windows-1252 or some other character set. Rsyslog doesn't know the
>> incoming character set, so it doesn't know that it needs to be converted to
>> UTF-8. (That particular input would receive logs from various sources, so
>> the character set could vary per message.)
>>
>> The fix we used was to add action(type="mmutf8fix") to the affected
>> ruleset prior to any JSON template use. This isn't strictly accurate
>> because you lose the 'invalid' character in the resulting string, but at
>> least that string is JSON-safe. In the ideal case you'd know what the
>> original character set was and explicitly convert it to UTF-8, but that
>> wasn't practical in our use case.
>>
>> --
>> Dave Caplinger | Director, Technical Product Management
>> Solutionary — An NTT Group Security Company
>>
>> _______________________________________________
>> rsyslog mailing list
>> http://lists.adiscon.net/mailman/listinfo/rsyslog
>> http://www.rsyslog.com/professional-services/
>> What's up with rsyslog? Follow https://twitter.com/rgerhards
>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
>> DON'T LIKE THAT.
>>
>
>