David, as you suggested, I extracted the log lines containing Hindi characters into a separate file and ran "file -bi" on it, which returned "text/plain; charset=utf-8". This confirms that the logs are written in UTF-8. Any thoughts on what would cause rsyslog to send messages like "\u00E0\u00.4� Description in Hindi", causing Elasticsearch to throw an exception?
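One way a string like "\u00E0\u00.4" could arise is sketched below in Python. It assumes (a guess, not something confirmed in this thread) that some stage treats each UTF-8 byte as a separate Latin-1 character before JSON-escaping it. The sample word and the helper name is_valid_json are made up for illustration.

```python
import json

# Hypothetical sample text: the word "Hindi" in Devanagari, written with
# escapes so the exact code points are unambiguous.
text = "\u0939\u093f\u0928\u094d\u0926\u0940"

# Every code point in this sample encodes to three bytes in UTF-8,
# and the first one starts with the bytes 0xE0 0xA4.
raw = text.encode("utf-8")
print(len(text), len(raw))   # 6 18
print(raw[:3].hex())         # e0a4b9

# Assumption: a stage that reads the bytes as Latin-1 sees one character per
# byte, so the JSON escapes begin with \u00e0\u00a4 instead of \u0939.
mangled = raw.decode("latin-1")
print(json.dumps(text))
print(json.dumps(mangled))


def is_valid_json(doc: str) -> bool:
    """Return True if doc parses as JSON (illustrative helper)."""
    try:
        json.loads(doc)
        return True
    except ValueError:
        return False


# A \u escape must be followed by four hex digits, so "\u00.4" is rejected.
print(is_valid_json('{"message":"\\u00.4"}'))   # False
print(is_valid_json('{"message":"\\u00a4"}'))   # True
```

This reproduces the "\u00E0" prefix seen in the logs, but it does not explain the "." in the second escape, so it is at best part of the picture.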
Thanks,

Alec

On Wed, Jun 29, 2016 at 4:08 PM, alecswan <[email protected]> wrote:

> I looked at the code that produces this log file and it's writing the log
> with utf-8 encoding. What else could cause this problem? Could it be that
> Hindi characters may require 3 bytes for encoding? Just grasping at straws
> here ...
>
> Thanks,
>
> Alec
>
> -------- Original message --------
> From: David Lang
> Date: 29/06/2016 2:00 PM (GMT-07:00)
> To: rsyslog-users
> Subject: Re: [rsyslog] Invalid JSON from mmnormalize/liblognorm/omelasticsearch
>
> On Wed, 29 Jun 2016, Alec Swan wrote:
>
> > I tried using mmutf8fix as shown below, but it didn't seem to fix the
> > problem. What I am doing is monitoring a log file with imfile action,
> > parsing it with mmnormalize and sending JSON to Elasticsearch with
> > omelasticsearch.
> >
> > I check the encoding of the log file using "file -bi" and it says
> > "text/plain; charset=us-ascii".
> > However, it contains some Hindi characters, which I assume are encoded
> > with us-ascii.
>
> There is no way to encode Hindi characters as us-ascii. us-ascii is the most
> basic character set, English upper case, lower case and punctuation only.
>
> So whatever character set it is in, it's not us-ascii
>
> > If I understand correctly,
> > us-ascii is a subset of UTF-8. If this is the case, do I really need to use
> > mmutf8fix?
>
> It all depends on what character set it's actually in. try making a copy of the
> file that has the Hindi characters near the beginning of it and try the file -bi
> again, see if it gives a more accurate answer.
>
> otherwise, you will have to track down what's writing the messages and try to
> set the character set there (or at least find out what character set it's using)
>
> David Lang
>
> > To me it seems like the Hindi characters are UTF-8 encoded with 3-byte
> > sequences and when they are received by Elasticsearch the byte sequence is
> > incorrectly decoded to invalid Unicode sequence, such as "\u00.4". Is this
> > plausible?
> >
> > module(load = "imfile")
> > module(load="mmutf8fix")
> > module(load = "mmnormalize")
> > module(load = "omelasticsearch")
> >
> > input(type = "imfile" Ruleset="X" ...)
> > ruleset(name = "X") {
> >     action(type="mmutf8fix")
> >     action(type = "mmnormalize" ...)
> >     action(type = "omelasticsearch" ...)
> > }
> >
> > Thanks,
> >
> > Alec
> >
> > On Tue, Jun 28, 2016 at 4:49 PM, Alec Swan <[email protected]> wrote:
> >
> >> Thanks for the suggestion, Dave. I noticed that on the client side the
> >> log contained Hindi characters that got translated to "\u00E0\u00.4???\"
> >> which eventually caused the error. I'll give mmutf8fix plugin a try.
> >>
> >> Thanks,
> >>
> >> Alec
> >>
> >> On Tue, Jun 28, 2016 at 3:24 PM, Dave Caplinger <[email protected]> wrote:
> >>
> >>> On Jun 28, 2016, at 4:04 PM, Alec Swan <[email protected]> wrote:
> >>> >
> >>> > I think the root cause of the problem is that there is an invalid UTF-8
> >>> > sequence "\u00.4" in the value of the "message" field. In fact, I just
> >>> > confirmed that {"message":"\u00.4"} is not a valid JSON on
> >>> > http://jsonlint.com/.
> >>>
> >>> I've run into something similar where the original message source was
> >>> sending Windows-1252 or other character set. Rsyslog doesn't know the
> >>> incoming character set, so it doesn't know that it needs to be converted to
> >>> UTF-8. (That particular input would receive logs from various sources, so
> >>> the character set could vary per message).
> >>>
> >>> The fix we used was to add action(type="mmutf8fix") to the affected
> >>> ruleset prior to any JSON template use. This isn't strictly accurate
> >>> because you lose the 'invalid' character in the resulting string, but at
> >>> least that string is JSON-safe. In the ideal case you'd know what the
> >>> original character set was and explicitly convert it to UTF-8, but that
> >>> wasn't practical in our use case.
> >>>
> >>> --
> >>> Dave Caplinger | Director, Technical Product Management
> >>> Solutionary — An NTT Group Security Company
> >>>
> >>> _______________________________________________
> >>> rsyslog mailing list
> >>> http://lists.adiscon.net/mailman/listinfo/rsyslog
> >>> http://www.rsyslog.com/professional-services/
> >>> What's up with rsyslog? Follow https://twitter.com/rgerhards
> >>> NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad
> >>> of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you
> >>> DON'T LIKE THAT.
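Dave's two options (replace the invalid bytes, or convert from a known character set) can be sketched in Python. This is only an analogue of the idea, not how mmutf8fix is implemented; the function names, the sample bytes and the "?" replacement character are all assumptions made for illustration.

```python
import json


def make_json_safe(raw: bytes, replacement: str = "?") -> str:
    """Decode raw as UTF-8, substituting `replacement` for invalid bytes.

    Lossy: the original byte values cannot be recovered afterwards.
    """
    # errors="replace" yields U+FFFD for each undecodable sequence;
    # swap that marker for the chosen replacement character.
    return raw.decode("utf-8", errors="replace").replace("\ufffd", replacement)


def convert_known_charset(raw: bytes, charset: str) -> str:
    """Decode raw using the sender's actual character set (lossless)."""
    return raw.decode(charset)


# Hypothetical message from a Windows-1252 sender: 0xE9 is "é" there,
# but a lone 0xE9 is not valid UTF-8.
raw = b"caf\xe9 opened"

# Both results serialize to valid JSON; only the second keeps the "é".
print(json.dumps({"message": make_json_safe(raw)}))
print(json.dumps({"message": convert_known_charset(raw, "cp1252")}))
```

The first function matches the trade-off Dave describes: the output is always JSON-safe, but the offending character is lost. The second only works when the sender's character set is actually known.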
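David's advice to move the Hindi lines near the top suggests that "file -bi" may have judged the log from its beginning only. A rough way to check the whole file instead is sketched below in Python. classify_encoding, classify_file and the returned labels are invented here, and the check only distinguishes pure ASCII, valid UTF-8, and everything else.

```python
def classify_encoding(data: bytes) -> str:
    """Classify a byte string as 'us-ascii', 'utf-8' or 'unknown'."""
    if data.isascii():
        return "us-ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"


def classify_file(path: str) -> str:
    """Read the entire file, so non-ASCII bytes far from the start still count."""
    with open(path, "rb") as handle:
        return classify_encoding(handle.read())


# Hypothetical log content: ASCII lines first, one Devanagari word at the end.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940\n".encode("utf-8")
sample = b"INFO started\n" * 3 + hindi

print(classify_encoding(sample[:20]))   # us-ascii
print(classify_encoding(sample))        # utf-8
print(classify_encoding(b"caf\xe9"))    # unknown
```

Looking only at the first 20 bytes reports us-ascii, while the full content is UTF-8, which is consistent with the two different "file -bi" answers reported in this thread.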

