nicoloboschi opened a new pull request, #15431: URL: https://github.com/apache/pulsar/pull/15431
### Motivation

If the message value contains non-printable characters, you will get the following error:

```
2022-04-22T22:37:29.673384094Z 22:37:29.668 [tenant/ns/topic] ERROR org.apache.pulsar.io.elasticsearch.ElasticSearchSink - Malformed document messageId=73895:0:1
2022-04-22T22:37:29.673415795Z com.fasterxml.jackson.core.JsonParseException: Illegal character ((CTRL-CHAR, code 0)): only regular white space (\r, \n, \t) is allowed between tokens
2022-04-22T22:37:29.673424695Z  at [Source: (String)"\u0000\u0000\u0002�\u000C\u0002\u001Epick_start_time\u0000\u0002\u00081211\u0002\u000ESafeway\u0002�����_\u0000\u0000\u0002\u00021\u0002\u0018445184429011"; line: 1, column: 2]
```

Even if you set `malformedDocAction` to `IGNORE`, the message will be re-delivered. With KEY_SHARED subscriptions, this leads to a stuck subscription.

The issue is that the [JSON format does not accept these characters](https://datatracker.ietf.org/doc/html/rfc8259#section-7).

> All Unicode characters may be placed within the quotation marks, except for the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Since these characters are usually useless, it is better to drop them all rather than encode them. Encoding is not simple because it depends on how the JSON is malformed: inside a key or a value the characters can be escaped, but between tokens they cannot.

### Modifications

- New option `stripNonPrintableCharacters`, default `true` (so the default behaviour changes), which removes non-printable characters from the output JSON. This applies only to the document, not to the `_id`, because ElasticSearch does not care whether the `_id` is valid JSON. The stripping is done with a regular expression because, unfortunately, the Jackson mapper does not support this out of the box.

- [x] `doc`
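
To illustrate the regex-based stripping described under Modifications, here is a minimal sketch. It is not taken from the PR's diff: the class name `NonPrintableStripper`, the method `strip`, and the choice of the Unicode "Other" category (`\p{C}`) as the definition of "non-printable" are all assumptions made for this example.

```java
import java.util.regex.Pattern;

final class NonPrintableStripper {
    // Assumption: "non-printable" means the Unicode general category "Other" (C),
    // i.e. control, format, private-use and unassigned code points.
    private static final Pattern NON_PRINTABLE = Pattern.compile("\\p{C}");

    // Returns the input with every matching code point removed.
    static String strip(String json) {
        return NON_PRINTABLE.matcher(json).replaceAll("");
    }

    public static void main(String[] args) {
        // Two leading NUL characters and one U+0002 inside a value (hypothetical sample).
        String doc = "\u0000\u0000{\"store\":\"Safe\u0002way\"}";
        System.out.println(strip(doc)); // prints {"store":"Safeway"}
    }
}
```

Note that a pattern like this also removes raw tab, carriage return, and line feed characters, since they are control characters too. That is harmless for compact JSON, where whitespace between tokens carries no meaning and line breaks inside strings appear as escape sequences rather than raw characters.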
