Paul Rogers created DRILL-7598:
----------------------------------
Summary: PostgreSQL-like functions for working with JSON
Key: DRILL-7598
URL: https://issues.apache.org/jira/browse/DRILL-7598
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
>From a contributor on the Drill user mailing list:
{quote}PostgreSQL has a practical way to manipulate the json data. You can
read: [https://www.postgresql.org/docs/12/datatype-json.html].
{quote}
The user's use case is as follows:
{code:json}
{"a":"horses","b":"28","c":{"c1":"black","c2":"blue"}}
{"a":"rabbit","b":"14","c":{"c1":"green" ,"c4":"vanilla"}}
{"a":"cow" ,"b":"28","c":{"c1":"blue" ,"c3":"black" ,"c5":{"d":"2","e":"3"}}}
{code}
Notice that the {{`c`}} column changes types. This causes Drill to fail in
execution. Hence the suggestion to work with column {{c}} as JSON without
parsing that JSON into Drill's relational schema.
Drill should offer such support. We've recently discussed introducing a similar
feature in Drill which one could, with some humor, call "let JSON be JSON." The
idea would be, as in PostreSQL, to simply represent JSON as text and allow the
user to work with JSON using JSON-oriented functions. The PostreSQL link
suggest that this is, in fact, a workable approach (though, as you not, doing
so is slower than converting JSON to a relational structure.)
Today, however, Drill attempts to map JSON into a relational model so that the
user can use [SQL operations to work on the
data|https://drill.apache.org/docs/json-data-model/]. The Drill approach works
well when the JSON is the output of a relational model (a dump of a relational
table or query, say.) The approach does not work for "native" JSON in all its
complexity. JSON is a superset of the relational model and so not all JSON
files map to tables and columns.
To solve the user's use case, Drill would need to adopt a solution similar to
PostgreSQL. In fact, Drill already has some of the pieces (such as the
[CONVERT_TO/CONVERT_FROM
operations|https://drill.apache.org/docs/data-type-conversion/#convert_to-and-convert_from]),
but even these attempt to convert JSON to or from the relational model. What
we need, so solve the general use case, are the kind of native JSON functions
which PostgreSQL provides.
Fortunately, since Drill would store JSON as a VARCHAR, no work would be needed
in the Drill "core". All that is needed is someone to provide a set of Drill
functions (UDFs) to call out to some JSON library to perform the desired
operations.
This feature would work best when the user can parse some parts of a JSON input
file into relational structure, others as JSON. (This is the use case which the
user list user faced.) So, we need a way to do that. See DRILL-7597 for a
request for such a feature.
Combining the PostgreSQL-like JSON functions with the ability to read selected
columns as JSON, might provide an elegant solution to the "messy JSON" problem.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)