Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/518
Poking around in the code a bit more, it looks like we rely on the Jackson
JsonParser which delivers a stream of tokens. Here's the code in JsonReader:
public ReadState write(ComplexWriter writer) throws IOException {
JsonToken t = parser.nextToken();
As you pointed out, white space is consumed internally to the tokenizer.
But we do get JSON tokens ({, }, identifier, number, etc.) Given this, you can
scan ahead to find the close-open bracket pattern.
The trick seems to solve two problems. First, how do we handle tokens: The
answer seems to be a few lines down:
ReadState readState = writeToVector(writer, t);
Given a start token, we write to vector. writeToVector is a
recursive-descent parser that consumes tokens and does the right thing. Here,
it accepts only a top-level object, rejecting all other tokens. For an object,
we call writeDataSwitch (funny name, that) which recursively writes fields,
objects, lists and the rest.
Since this is a recursive-descent parser, JSON structure is represented by
the call stack. To bail out of an invalid parse, we have to unwind the stack,
which is what an exception does. So, that part may be good.
The next step is how to get the tokenizer (JsonParser) pointed at the right
point so we can try to read the next object. That is where we need to consume
tokens until we find the end-start bracket pair.
But, note that we have now consumed the start bracket, so we can't read it
again when we gain call writeToVector. Checking the code, the JsonParser has no
unget( ) method, unfortunately. To work around that, we need to "push" the
token back onto the input. (Either actually doing so, or by having internal
state that says that we've already read the open bracket.)
We also have to ask if the JSON parser keeps track of parse state. Looking
at the code, it seems that JsonParser is really just a tokenizer: it has not
state about the JSON structure. (Would be good to run a small test to verify
this observation.)
By the way, the JSON parser class has lots of good stuff already there. For
example, the parser itself will keep track of line numbers and file locations.
Perhaps we can use that when reporting error positions.
The last bit is that we've been building up a record as we parser JSON. If
we fail part way thorugh, we've got a half-build record. Again, here I'm a bit
hazy on what Drill can do. Can we "unwind" the current record? Can we mark the
record as one to ignore (with a select vector?) Or, do we live with the
half-build record? Throwing away the half-built record would be best, if we can
do it.
All that said, how much of the above does your proposed code change handle?
What other parts might still need to be added?
Thanks!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---