Paul Rogers created DRILL-5950:
----------------------------------

             Summary: Allow JSON files to be splittable - for sequence of 
objects format
                 Key: DRILL-5950
                 URL: https://issues.apache.org/jira/browse/DRILL-5950
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.12.0
            Reporter: Paul Rogers


The JSON plugin format is not currently splittable. This means that every JSON 
file must be read by only a single thread. By contrast, text files are 
splittable.

The key barrier to allowing JSON files to be splittable is the lack of a good 
mechanism to find the start of a record at some arbitrary point in the file. 
Text readers handle this by scanning forward looking for (say) the newline that 
separates records. (Though this process can be thrown off if a newline appears 
in a quoted value, and the start quote appears before the split point.)
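
As a rough illustration of that forward scan (a minimal sketch, not Drill's 
actual text reader; the function name and the in-memory bytes buffer are 
assumptions made for the example):

```python
def first_record_start(data: bytes, split_start: int) -> int:
    """Return the offset of the first record that begins at or after split_start.

    Hypothetical helper: assumes newline-delimited records held in memory.
    It is not quote-aware, so a newline inside a quoted value would be
    mistaken for a record separator.
    """
    if split_start == 0:
        # The first split always begins on a record boundary.
        return 0
    # A record begins exactly at split_start only when the previous byte is a
    # newline, so start the search one byte early.
    newline = data.find(b"\n", split_start - 1)
    # No newline left means no record starts inside this split.
    return len(data) if newline == -1 else newline + 1
```

A reader assigned a split would begin parsing at the returned offset and keep 
going past the end of its split until it finishes the record it is in.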

However, as was discovered in a previous JSON fix, Drill's form of JSON does 
provide the tools. In standard JSON, a list of records must be structured as an 
array:

{code}
[ { "text": "first record" },
  { "text": "second record" },
  ...
  { "text": "final record" }
]
{code}

In this form, it is impossible to find the start of a record without parsing 
from the first character onwards.

But, Drill uses a common, but non-standard, JSON structure that dispenses with 
the array and the commas between records:

{code}
{ "text": "first record" }
{ "text": "second record" }
...
{ "text": "last record" }
{code}
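
To show that this form can still be parsed record by record, here is a small 
sketch using Python's standard json module (illustrative only, unrelated to 
Drill's own reader; iter_records is a made-up name). It assumes quoted keys, 
since the json module requires them:

```python
import json


def iter_records(text: str):
    """Yield each top-level JSON value from a whitespace-separated sequence."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed value and the index just past it.
        record, pos = decoder.raw_decode(text, pos)
        yield record
```

Each call to raw_decode stops at the end of one object, so no array brackets or 
commas are needed to tell the records apart.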

This form does unambiguously allow finding the start of a record. Simply scan 
until we find the tokens }, { in that order, possibly separated by white space. 
Outside of a quoted string, that sequence is not valid JSON and only occurs 
between records in the sequence-of-records format.
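
The scan could look something like the following (a minimal sketch under stated 
assumptions, not a proposed implementation: the data is an in-memory bytes 
buffer, the names are hypothetical, and a split is assumed to own the records 
whose boundary is found at or after its start offset). It searches for a close 
brace followed by an open brace:

```python
import re

# Close brace, optional whitespace, open brace: the gap between two records.
RECORD_BOUNDARY = re.compile(rb"\}\s*\{")


def first_json_record_start(data: bytes, split_start: int) -> int:
    """Return the offset of the '{' that opens the first record in this split.

    Hypothetical helper. It is not string-aware, so a "} {" sequence inside a
    quoted value would be mistaken for a record boundary.
    """
    if split_start == 0:
        # The first split begins at the first record.
        return 0
    match = RECORD_BOUNDARY.search(data, split_start)
    # match.end() is one past the '{', so step back to point at it.
    return len(data) if match is None else match.end() - 1
```

As with the newline scan for text files, this can be fooled when the pattern 
appears inside a quoted value whose opening quote precedes the split point.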




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
