Hi,
Thanks for all the details.
Coming back to one use case: the context is the transformation into Parquet
of JSON files containing billions of records, where each record has globally
the same schema but can have some specificities.
Simplified example:
{"a":"horses","b":"28","c":{"c1":"black","c2":"blue"}}
{"a":"rabbit","b":"14","c":{"c1":"green"
,"c4":"vanilla"}}
{"a":"cow" ,"b":"28","c":{"c1":"blue" ,"c3":"black"
,"c5":{"d":"2","e":"3"}}}
...
We need to transform the JSON into Parquet.
Columns a and b (in this example) are fine, but for c we don't/can't know
all the possibilities, and the set of keys keeps growing. So the solution is
to read "c" as TEXT and defer the handling of its content to a later step.
So in this example, the destination Parquet file will have 3 columns:
a : VARCHAR (example: 'horses')
b : INT     (example: 14)
c : VARCHAR (example: '{"c1":"blue","c3":"black","c5":{"d":"2","e":"3"}}')
We can't do that with Drill because the discovery/alignment of the "c" part
of the JSON is too heavy in terms of resources, and the query crashes.
So we currently use a Spark solution, as Spark allows specifying a schema
when reading a file.
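For reference, here is roughly what our Spark job does (a simplified sketch
to run in spark-shell; the paths are made up and the cast of "b" is just
illustrative):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{StructType, StructField, StringType}

    // Declaring "c" as StringType makes Spark's JSON reader keep the raw
    // JSON text of the nested object instead of inferring a struct for it.
    val schema = StructType(Seq(
      StructField("a", StringType),
      StructField("b", StringType),  // read as text, cast to INT below
      StructField("c", StringType)   // nested object kept as raw JSON text
    ))

    spark.read
      .schema(schema)
      .json("/data/in/records.json")
      .withColumn("b", col("b").cast("int"))  // "28" -> 28
      .write
      .parquet("/data/out/records")

Because the schema is given up front, Spark never scans the data to infer
the shape of "c", which is what makes this workable at our scale.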
Hope that helps or gives some ideas,
Regards,
> Hi,
>
> Welcome to the Drill mailing list.
>
> You are right. Drill is a SQL engine. It works best when the JSON input files
> represent rows
> and columns.
>
> Of course, JSON itself can represent arbitrary data structures: you can use
> it to serialize
> any Java structure you want. Relational tables and columns represent a small
> subset of what
> JSON can do. Drill's goal is to read relational data encoded in JSON, not to
> somehow magically
> convert any arbitrary data structure into tables and columns.
>
> As described in our book, Learning Apache Drill, even seemingly trivial JSON
> can violate relational
> rules. For example:
>
> {a: 10} {a: 10.1}
>
> Since Drill infers types, and must guess the type on the first row, Drill
> will guess BIGINT.
> Then, the very next row shows that that was a poor guess and Drill will raise
> an error. I
> like to call this the "Drill cannot predict the future" problem.
>
>
> The team is working on a long-term project to allow you to specify a schema
> to resolve ambiguities.
> For example, you can tell Drill that column a above is a DOUBLE despite what
> the data might
> say. You can deal with schema evolution to say that column b (which does not
> appear above,
> but might appear in newer files) is an array of BIGINT. And so on.
>
> Drill also supports LATERAL JOIN and FLATTEN to handle nested tables:
>
> {name: "fred", orders: [ {date: "Jan 1", amount: 12.34}, {date: "Jan 12",
> amount: 23.45}]}
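>
> For instance (a sketch; the file path dfs.`/tmp/orders.json` and the alias
> name t are made up):
>
>   SELECT t.name, t.o.`date` AS order_date, t.o.amount
>   FROM (SELECT name, FLATTEN(orders) AS o
>         FROM dfs.`/tmp/orders.json`) t;
>
> FLATTEN produces one output row per element of the orders array; the outer
> query then pulls the map members out as ordinary columns.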
>
> The schema, however, will not transform arbitrary JSON into tables and
> columns. Some things are better done in an ETL step, where you can use the
> full power of a procedural language (Java or Scala in Spark, say) to convert
> wild & crazy JSON into relational form.
>
> We are actively designing the schema feature. May I ask about your use case?
> It would be super-helpful to understand the shape of your input data and how
> you want to map it into SQL.
>
> One final note: it is possible to write a custom format plugin. If you will
> query the same
> wild & crazy JSON shape multiple times, you can write a plugin to do the
> mapping as Drill
> reads the data. Not the simplest path, but possible.
>
>
> Thanks,
> - Paul
>
>
>
>> On Wednesday, February 5, 2020, 1:30:14 PM PST,
>> [email protected]
>> <[email protected]> wrote:
>>
>> Hi,
>>
>> Some JSON files are complex and contain different "tree structs".
>> If these files are big, it takes too much time for Drill to align the
>> structures (and, even worse, it sometimes fails).
>> In Spark it's possible to force a schema when reading a file, to avoid a
>> long or useless alignment pass, and possibly to dismiss fields or force a
>> type (like string, to avoid going down into the structure).
>>
>> Is there any way in Drill to specify an explicit schema at read time?
>>
>> Thanks for any information