David, as you suggested, I extracted the log lines containing Hindi
characters into a separate file and ran "file -bi" on it, which returned
"text/plain; charset=utf-8". This confirms that the logs are written in UTF-8.
Any thoughts on what would cause rsyslog to send messages like "\u00E0\u00.4�
Description in Hindi", causing Elasticsearch to throw an exception?
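
For reference, here is a minimal Python sketch of why the escaped string
might begin with \u00E0. Devanagari characters take three bytes in UTF-8,
and the first byte is 0xE0. If something escaped each byte separately
instead of each character, the output would start the same way. The
function name escape_bytes_individually is made up for this illustration;
it is not rsyslog code, and it does not explain the "." in "\u00.4".

```python
def escape_bytes_individually(text: str) -> str:
    """Escape every UTF-8 byte >= 0x80 as \\u00XX.

    This is deliberately the WRONG way to build a JSON string; it only
    illustrates what per-byte escaping of multi-byte UTF-8 looks like.
    """
    out = []
    for b in text.encode("utf-8"):
        out.append(chr(b) if b < 0x80 else "\\u%04X" % b)
    return "".join(out)


hindi = "\u0939"  # DEVANAGARI LETTER HA, written as an escape to stay ASCII-only

print(hindi.encode("utf-8").hex())       # e0a4b9 (three bytes)
print(escape_bytes_individually(hindi))  # \u00E0\u00A4\u00B9
```

The first escape matches what I see in the bad messages, so the bytes may be
getting escaped one at a time somewhere between the file and Elasticsearch.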

Thanks,

Alec

On Wed, Jun 29, 2016 at 4:08 PM, alecswan <[email protected]> wrote:

> I looked at the code that produces this log file and it's writing the log
> with utf-8 encoding. What else could cause this problem? Could it be that
> Hindi characters may require 3 bytes for encoding? Just grasping at straws
> here ...
>
>
> Thanks,
>
> Alec
>
>
> -------- Original message --------
> From: David Lang
> Date:29/06/2016 2:00 PM (GMT-07:00)
> To: rsyslog-users
> Subject: Re: [rsyslog] Invalid JSON from
> mmnormalize/liblognorm/omelasticsearch
>
> On Wed, 29 Jun 2016, Alec Swan wrote:
>
> > I tried using mmutf8fix as shown below, but it didn't seem to fix the
> > problem. What I am doing is monitoring a log file with the imfile input,
> > parsing it with mmnormalize and sending JSON to Elasticsearch with
> > omelasticsearch.
> >
> > I checked the encoding of the log file using "file -bi" and it says
> > "text/plain; charset=us-ascii".
>
> > However, it contains some Hindi characters, which I assume are encoded
> > with us-ascii.
>
> There is no way to encode Hindi characters as us-ascii. us-ascii is the
> most basic character set: English upper case, lower case, and punctuation
> only.
>
> So whatever character set it is in, it's not us-ascii
>
> > If I understand correctly,
> > us-ascii is a subset of UTF-8. If this is the case, do I really need to
> > use mmutf8fix?
>
> It all depends on what character set it's actually in. Try making a copy
> of the file that has the Hindi characters near the beginning of it and run
> file -bi again to see if it gives a more accurate answer.
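
An alternative to copying the file is to scan all of it. The sketch below
assumes, as David's suggestion implies, that "file" guesses the charset from
the start of the file and so can miss non-ASCII bytes further in. The
function name find_non_ascii is hypothetical. It reports every line that
contains a byte outside us-ascii and whether that line decodes as UTF-8.

```python
def find_non_ascii(path):
    """Return [(line_number, is_valid_utf8), ...] for lines with bytes >= 0x80."""
    results = []
    with open(path, "rb") as f:  # read raw bytes, no implicit decoding
        for number, raw in enumerate(f, start=1):
            if any(b >= 0x80 for b in raw):
                try:
                    raw.decode("utf-8")
                    valid = True
                except UnicodeDecodeError:
                    valid = False
                results.append((number, valid))
    return results
```

Calling find_non_ascii on the log file returns an empty list if the file
really is pure us-ascii. Lines flagged False are in some other character set.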
>
> Otherwise, you will have to track down what's writing the messages and try
> to set the character set there (or at least find out what character set
> it's using).
>
> David Lang
>
> > To me it seems like the Hindi characters are UTF-8 encoded with 3-byte
> > sequences and when they are received by Elasticsearch the byte sequence
> > is incorrectly decoded to an invalid Unicode sequence, such as "\u00.4".
> > Is this plausible?
> >
> > module(load = "imfile")
> > module(load="mmutf8fix")
> > module(load = "mmnormalize")
> > module(load = "omelasticsearch")
> >
> > input(type = "imfile" Ruleset="X" ...)
> > ruleset(name = "X") {
> >  action(type="mmutf8fix")
> >  action(type = "mmnormalize" ...)
> >  action(type = "omelasticsearch" ...)
> > }
> >
> > Thanks,
> >
> > Alec
> >
> > On Tue, Jun 28, 2016 at 4:49 PM, Alec Swan <[email protected]> wrote:
> >
> >> Thanks for the suggestion, Dave.  I noticed that on the client side the
> >> log contained Hindi characters that got translated to "\u00E0\u00.4???\"
> >> which eventually caused the error. I'll give mmutf8fix plugin a try.
> >>
> >> Thanks,
> >>
> >> Alec
> >>
> >> On Tue, Jun 28, 2016 at 3:24 PM, Dave Caplinger <
> >> [email protected]> wrote:
> >>
> >>> On Jun 28, 2016, at 4:04 PM, Alec Swan <[email protected]> wrote:
> >>> >
> >>> > I think the root cause of the problem is that there is an invalid
> >>> > UTF-8 sequence "\u00.4" in the value of the "message" field. In fact,
> >>> > I just confirmed that {"message":"\u00.4"} is not valid JSON on
> >>> > http://jsonlint.com/.
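
The same check can be reproduced locally without jsonlint. This is a small
sketch using Python's standard json module; the helper name is_valid_json is
made up for this example. A JSON \u escape must be followed by exactly four
hex digits, and "." is not a hex digit, so the parser rejects the string.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Raw strings keep the backslashes exactly as they appear in the message.
print(is_valid_json(r'{"message":"\u00.4"}'))  # False
print(is_valid_json(r'{"message":"\u00E0"}'))  # True
```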
> >>>
> >>> I've run into something similar where the original message source was
> >>> sending Windows-1252 or another character set.  Rsyslog doesn't know the
> >>> incoming character set, so it doesn't know that it needs to be
> >>> converted to UTF-8. (That particular input would receive logs from
> >>> various sources, so the character set could vary per message.)
> >>>
> >>> The fix we used was to add action(type="mmutf8fix") to the affected
> >>> ruleset prior to any JSON template use.  This isn't strictly accurate
> >>> because you lose the 'invalid' character in the resulting string, but
> >>> at least that string is JSON-safe.  In the ideal case you'd know what
> >>> the original character set was and explicitly convert it to UTF-8, but
> >>> that wasn't practical in our use case.
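
To make the trade-off Dave describes concrete, here is a rough Python
sketch of the two options. It is not how mmutf8fix is implemented; the
sample bytes and variable names are assumptions chosen for illustration.
With a known source charset the conversion is lossless. Without one, the
invalid bytes can only be replaced, which loses the original character.

```python
# Windows-1252 bytes for "cafe" with an e-acute (0xE9).
# A lone 0xE9 is not a valid UTF-8 sequence.
raw = b"caf\xe9 ok"

# Option 1: the source charset is known, so convert explicitly (lossless).
converted = raw.decode("cp1252").encode("utf-8")
print(converted)  # b'caf\xc3\xa9 ok'

# Option 2: the charset is unknown, so replace whatever is not valid UTF-8.
# Python substitutes U+FFFD here; the original character is lost, but the
# result is a valid string that can be serialized to JSON safely.
replaced = raw.decode("utf-8", errors="replace")
print(replaced == "caf\ufffd ok")  # True
```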
> >>>
> >>> --
> >>> Dave Caplinger | Director, Technical Product Management
> >>> Solutionary — An NTT Group Security Company
> >>>
> >>> _______________________________________________
> >>> rsyslog mailing list
> >>> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >>> http://www.rsyslog.com/professional-services/
> >>> What's up with rsyslog? Follow https://twitter.com/rgerhards
> >>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a
> >>> myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST
> >>> if you DON'T LIKE THAT.
> >>>
> >>
> >>