[ https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15486130#comment-15486130 ]
ASF GitHub Bot commented on DRILL-4653:
---------------------------------------

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/518

    Upon reflection, it seems that newline is not an adequate marker to separate JSON records. Many of our samples have internal newlines. If a newline appears inside a JSON record, then we are subject to the same incorrect recovery as illustrated with the "a, x, bar, y" example in the earlier comment. Further, if the JSON tokenizer is like most, it probably discards whitespace and does not return EOL as a token.

    So, it seems that the best (or only) option is to scan for the "} {" pair. This requires two specific improvements:

    * A "token discarder" that uses a state machine to look for the "} {" pairs, and
    * An indirection around the get-token method so we can push the "{" token back onto the input.

    These changes, along with the pseudo-code shown earlier, may provide as good a solution as we can get. (Phrased that way because some errors will cause two records to be discarded, as explained earlier.)

    Combine that with the options and error reporting from the original pull request and we are probably pretty close.


> Malformed JSON should not stop the entire query from progressing
> ----------------------------------------------------------------
>
>                 Key: DRILL-4653
>                 URL: https://issues.apache.org/jira/browse/DRILL-4653
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - JSON
>    Affects Versions: 1.6.0
>            Reporter: subbu srinivasan
>             Fix For: Future
>
>
> Currently a Drill query terminates upon first encounter of an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something
> similar to a setting of (ignore.malformed.json) would help.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
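
For illustration only, the two improvements proposed in the comment (a state-machine "token discarder" that looks for the "} {" pair, and a get-token indirection that lets "{" be pushed back) might be sketched roughly as below. This is a minimal sketch under stated assumptions, not the code from the pull request: the names Tok, PushbackTokens and TokenDiscarder are hypothetical, and the three-value token enum stands in for whatever token type the real JSON parser returns.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

/** Hypothetical token kinds; a real JSON parser has a much richer set. */
enum Tok { START_OBJECT, END_OBJECT, OTHER }

/** Indirection around "get token" that allows a token to be pushed back. */
class PushbackTokens {
  private final Iterator<Tok> source;
  private final Deque<Tok> pushedBack = new ArrayDeque<>();

  PushbackTokens(Iterator<Tok> source) {
    this.source = source;
  }

  /** Returns the next token, or null at end of input. */
  Tok next() {
    if (!pushedBack.isEmpty()) {
      return pushedBack.pop();
    }
    return source.hasNext() ? source.next() : null;
  }

  /** Makes the given token the next one returned by next(). */
  void pushBack(Tok t) {
    pushedBack.push(t);
  }
}

/** Discards tokens after a parse error until a record boundary is found. */
class TokenDiscarder {
  /**
   * Discards tokens until END_OBJECT is immediately followed by START_OBJECT,
   * then pushes START_OBJECT back so the caller can parse the next record.
   * Returns false if the input ends before such a pair is seen.
   */
  static boolean skipToNextRecord(PushbackTokens in) {
    boolean sawEnd = false; // state: the previous token was "}"
    for (Tok t = in.next(); t != null; t = in.next()) {
      if (sawEnd && t == Tok.START_OBJECT) {
        in.pushBack(t);
        return true;
      }
      sawEnd = (t == Tok.END_OBJECT);
    }
    return false;
  }
}
```

After a parse error, a reader built along these lines would call TokenDiscarder.skipToNextRecord(tokens) and, if it returns true, resume normal record parsing at the pushed-back "{". As the comment notes, some errors would still cause two records to be discarded, since the scan only resynchronizes at the next "} {" pair.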