[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15488657#comment-15488657
 ] 

ASF GitHub Bot commented on DRILL-4653:
---------------------------------------

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/518
  
    Poking around in the code a bit more, it looks like we rely on the Jackson 
JsonParser which delivers a stream of tokens. Here's the code in JsonReader:
    
        public ReadState write(ComplexWriter writer) throws IOException {
          JsonToken t = parser.nextToken();
    
    As you pointed out, white space is consumed internally to the tokenizer. 
But we do get JSON tokens ({, }, identifier, number, etc.) Given this, you can 
scan ahead to find the close-open bracket pattern.
    
    The trick seems to solve two problems. First, how do we handle tokens: The 
answer seems to be a few lines down:
    
        ReadState readState = writeToVector(writer, t);
    
    Given a start token, we write to vector. writeToVector is a 
recursive-descent parser that consumes tokens and does the right thing. Here, 
it accepts only a top-level object, rejecting all other tokens. For an object, 
we call writeDataSwitch (funny name, that) which recursively writes fields, 
objects, lists and the rest.
    
    Since this is a recursive-descent parser, JSON structure is represented by 
the call stack. To bail out of an invalid parse, we have to unwind the stack, 
which is what an exception does. So, that part may be good.
    
    The next step is how to get the tokenizer (JsonParser) pointed at the right 
point so we can try to read the next object. That is where we need to consume 
tokens until we find the end-start bracket pair.
    
    But, note that we have now consumed the start bracket, so we can't read it 
again when we gain call writeToVector. Checking the code, the JsonParser has no 
unget( ) method, unfortunately. To work around that, we need to "push" the 
token back onto the input. (Either actually doing so, or by having internal 
state that says that we've already read the open bracket.) 
    
    We also have to ask if the JSON parser keeps track of parse state. Looking 
at the code, it seems that JsonParser is really just a tokenizer: it has not 
state about the JSON structure. (Would be good to run a small test to verify 
this observation.)
    
    By the way, the JSON parser class has lots of good stuff already there. For 
example, the parser itself will keep track of line numbers and file locations. 
Perhaps we can use that when reporting error positions.
    
    The last bit is that we've been building up a record as we parser JSON. If 
we fail part way thorugh, we've got a half-build record. Again, here I'm a bit 
hazy on what Drill can do. Can we "unwind" the current record? Can we mark the 
record as one to ignore (with a select vector?) Or, do we live with the 
half-build record? Throwing away the half-built record would be best, if we can 
do it.
    
    All that said, how much of the above does your proposed code change handle? 
What other parts might still need to be added?
    
    Thanks!


> Malformed JSON should not stop the entire query from progressing
> ----------------------------------------------------------------
>
>                 Key: DRILL-4653
>                 URL: https://issues.apache.org/jira/browse/DRILL-4653
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - JSON
>    Affects Versions: 1.6.0
>            Reporter: subbu srinivasan
>             Fix For: Future
>
>
> Currently Drill query terminates upon first encounter of a invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to