[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15486130#comment-15486130
 ] 

ASF GitHub Bot commented on DRILL-4653:
---------------------------------------

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/518
  
    Upon reflection, it seems that newline is not an adequate marker to 
separate JSON records. Many of our samples have internal newlines. If a newline 
appears inside the JSON record, then we are subject to the same incorrect 
recovery as illustrated with the "a, x, bar, y" example in the earlier comment.
    
    Further, if the JSON tokenizer is like most, it probably discards 
whitespace, not returning EOL as a token.
    
    So, it seems that the best (or only) option is to scan for the "} {" pair. 
This requires two specific improvements:
    
    * A "token discarder" that uses a state machine to look for the "} {" 
pairs, and
    * An indirection around the get-token method so we can push the "{" token 
back onto the input.
    
    These changes, along with the pseudo-code shown earlier, may provide as 
good a solution as we can get. (Phrased that way because some errors will cause 
two records to be discarded, as explained earlier.) Combine that with the 
options and error reporting from the original pull request and we are probably 
pretty close.
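    
    As a rough illustration of the two pieces above, here is a minimal sketch 
in Python. It is not Drill or Jackson code: tokens are modeled as plain strings, 
and the names PushbackTokens and skip_to_next_record are hypothetical. It only 
shows a two-state scan for a "}" followed directly by a "{", with the "{" 
pushed back so the parser can start the next record cleanly.

```python
from typing import Iterable, Iterator, List, Optional


class PushbackTokens:
    """Hypothetical indirection around get-token that allows push-back."""

    def __init__(self, tokens: Iterable[str]) -> None:
        self._it: Iterator[str] = iter(tokens)
        self._pushed: List[str] = []

    def next_token(self) -> Optional[str]:
        # Serve pushed-back tokens first, then the underlying stream.
        if self._pushed:
            return self._pushed.pop()
        return next(self._it, None)  # None marks end of input

    def push_back(self, token: str) -> None:
        self._pushed.append(token)


def skip_to_next_record(tokens: PushbackTokens) -> bool:
    """Discard tokens until a "}" is immediately followed by a "{".

    Returns True when positioned at the start of the next record,
    False when the input ends first.
    """
    seen_close = False  # state: was the previous token a "}"?
    while True:
        tok = tokens.next_token()
        if tok is None:
            return False
        if seen_close and tok == "{":
            tokens.push_back(tok)  # let the parser re-read the "{"
            return True
        seen_close = (tok == "}")


# Example: the parser fails on the stray "," in the first record.
stream = PushbackTokens(
    ["{", '"a"', ":", ",", '"x"', ":", "1", "}", "{", '"b"', ":", "2", "}"]
)
for _ in range(4):          # tokens consumed before the error was detected
    stream.next_token()
recovered = skip_to_next_record(stream)
```

    Note that if the parser has already consumed the closing "}" of the bad 
record when it reports the error, this scan skips the following record as well, 
which is the two-record loss mentioned above.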


> Malformed JSON should not stop the entire query from progressing
> ----------------------------------------------------------------
>
>                 Key: DRILL-4653
>                 URL: https://issues.apache.org/jira/browse/DRILL-4653
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - JSON
>    Affects Versions: 1.6.0
>            Reporter: subbu srinivasan
>             Fix For: Future
>
>
> Currently, a Drill query terminates upon first encountering an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting such as (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
