Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/518
Thanks much for your contribution! Sorry it is taking a while to review the
code.
I've read over the code changes a couple of times. The options part is
fine. However, the parser part seems a bit too complex and may not handle all
cases of interest. I worry that the code may only handle the error shown in the
test input file: a missing close quote:
{ "foo" : "bar }
{ "foo" : "mumble" }
To do a more general recovery, we need some known-good recovery point. Pure
JSON is not well suited to recovery. Fortunately, the JSON we read is not true
JSON; rather, it is a file of JSON objects. That file-of-JSON-objects structure
offers two possible recovery points.
If we require that each JSON record start on a new line, then we can use
newline as our recovery point:
{ "foo" :: 10 }
{ "foo" : 20 }
In the above, the first record is badly formed. But, we know where the next
record begins because of the newline separator.
It seems that the code may be using the above approach. But, since the only
test case is a missing close quote, the parser will "naturally" read all tokens
up to EOL. However, the implementation seems not to handle the double-colon
example: the parser will fail on the second colon. When resuming, the parser
will see another colon and fail again. Maybe the parser would eventually fail
enough times to reach the EOL. But, we can construct cases where things clearly
go wrong:
{ "foo" :: 10, "x": { "bar", 20, "mumble" } }
If we don't discard tokens up to the newline, the parser may decide that {
"bar" ... starts a new, valid record, causing an unneeded schema change.
The message here is that, on error, we must discard tokens up to EOL; we can't
just try to resume parsing at the point of failure.
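To make the EOL recovery concrete, here is a rough sketch. It is not Drill's
actual reader (which streams Jackson tokens straight into value vectors), and
the class and method names are made up; it treats each input line as one
candidate record and simply drops any line that fails to parse:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Toy illustration only: assumes each JSON record sits on a single line.
public class LineRecoveringJsonReader {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static List<JsonNode> readObjects(Reader input) throws IOException {
    List<JsonNode> records = new ArrayList<>();
    BufferedReader reader = new BufferedReader(input);
    String line;
    while ((line = reader.readLine()) != null) {
      if (line.trim().isEmpty()) {
        continue;                             // nothing on this line
      }
      try {
        records.add(MAPPER.readTree(line));   // parse one record
      } catch (JsonProcessingException e) {
        // Bad record: readLine() already consumed everything up to the EOL,
        // so recovery is just "drop this line and move to the next one".
      }
    }
    return records;
  }
}

Fed the two-line example above, this would return only the second record; the
malformed line is silently dropped (the real reader would presumably count or
log it).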
If, however, we can't count on the newline, then we need some other
syntactic trick. Perhaps we can look for "}\s*{" (that is, close brace,
optional whitespace, open brace):
{ "foo" ::
{ "foo" : 20 }
{ "foo" : 30 }
Here, we'd actually discard both the first and second records because we'd
only find the recovery pattern between the second and third records. (The "}
{" pattern can never occur inside a JSON object; it would have to be "}, {"
instead.)
Using EOL as delimiter is easiest: we don't need lookahead. But, we require
that a newline separate records. (Newlines within records can be handled, but
we'll omit that for now...)
Using "}/s*{" as the recovery is a bit harder: we need to read ahead by one
token, then push the token back on the input.
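As a sketch of that lookahead-and-push-back step (again only illustrative: it
scans raw characters rather than Drill's Jackson token stream, and it ignores
the corner case of "} {" appearing inside a quoted string), a
java.io.PushbackReader lets us peek at the character following a "}" and push
it back when it turns out to be the "{" that opens the next record:

import java.io.IOException;
import java.io.PushbackReader;

public class BoundaryScanner {

  // Discard characters until a "}" followed by optional whitespace and "{" is
  // found; leave the reader positioned at the "{" so a fresh parse starts there.
  public static boolean skipToNextRecord(PushbackReader in) throws IOException {
    int c;
    while ((c = in.read()) != -1) {
      if (c != '}') {
        continue;                  // still inside the bad record; keep discarding
      }
      int next = in.read();
      while (next != -1 && Character.isWhitespace(next)) {
        next = in.read();          // allow whitespace between "}" and "{"
      }
      if (next == '{') {
        in.unread(next);           // push the "{" back: the next record starts here
        return true;
      }
      if (next != -1) {
        in.unread(next);           // not a record boundary; rescan from this character
      }
    }
    return false;                  // hit EOF without finding another record
  }
}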
If we do the above, then the reader algorithm should look something like
this:
while ( more records to read ) {
  try {
    read the record
  } catch ( parse exception ) {
    read and discard tokens until we get to an EOL (or the "}\s*{" pattern)
    reset the value vector pointer back one step to discard any partially
      loaded values
  }
}
The above can be shown to be correct by working through a simple state
machine and set of examples.
Alternatively, it might help to include here (or in comments in code) an
explanation of the proposed recovery mechanism so it is easier for reviewers to
verify correctness.
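For instance, reusing the LineRecoveringJsonReader sketch from earlier in this
comment (again, invented names), a quick check over the examples above could
double as such a worked set of examples:

import java.io.StringReader;
import java.util.List;

import com.fasterxml.jackson.databind.JsonNode;

public class RecoveryDemo {
  public static void main(String[] args) throws Exception {
    String input =
        "{ \"foo\" : \"bar }\n" +        // missing close quote: should be dropped
        "{ \"foo\" : \"mumble\" }\n" +   // well formed: should be kept
        "{ \"foo\" :: 10 }\n" +          // double colon: should be dropped
        "{ \"foo\" : 20 }\n";            // well formed: should be kept
    List<JsonNode> records =
        LineRecoveringJsonReader.readObjects(new StringReader(input));
    System.out.println(records);         // expect only the two well-formed records
  }
}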