[jira] [Resolved] (DRILL-8204) Allow Provided Schema for HTTP Plugin in JSON Mode

Vitalii Diravka (Jira) Thu, 05 May 2022 05:07:05 -0700


     [ 
https://issues.apache.org/jira/browse/DRILL-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vitalii Diravka resolved DRILL-8204.
------------------------------------
    Resolution: Fixed

> Allow Provided Schema for HTTP Plugin in JSON Mode
> --------------------------------------------------
>
>                 Key: DRILL-8204
>                 URL: https://issues.apache.org/jira/browse/DRILL-8204
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Other
>    Affects Versions: 1.20.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>             Fix For: 2.0.0
>
>
> One of the challenges of querying APIs is inconsistent data. Drill allows you 
> to provide a schema for individual endpoints. You can do this in one of two 
> ways: either by listing the fields in a `providedSchema` option, or by 
> providing a serialized TupleMetadata of the desired schema in a `jsonSchema` 
> option. This is advanced functionality and should only be used by advanced 
> Drill users.
> The schema provisioning currently supports complex types of Arrays and Maps 
> at any nesting level.
> ### Example Schema Provisioning:
> ```json
> "jsonOptions": {
> "providedSchema": [
> {
> "fieldName": "int_field",
> "fieldType": "bigint"
> }, {
> "fieldName": "jsonField",
> "fieldType": "varchar",
> "properties": {
> "drill.json-mode":"json"
> }
> },{
> // Array field
> "fieldName": "stringField",
> "fieldType": "varchar",
> "isArray": true
> }, {
> // Map field
> "fieldName": "mapField",
> "fieldType": "map",
> "fields": [
> {
> "fieldName": "nestedField",
> "fieldType": "int"
> },{
> "fieldName": "nestedField2",
> "fieldType": "varchar"
> }
> ]
> }
> ]
> }
> ```
> ### Example Provisioning the Schema with a JSON String
> ```json
> "jsonOptions": {
> "jsonSchema": 
> "{\"type\":\"tuple_schema\",\"columns\":[{\"name\":\"outer_map\",\"type\":\"STRUCT<`int_field` BIGINT, `int_array` ARRAY<BIGINT>>\",\"mode\":\"REQUIRED\"}]}"
> }
> ```
> You can print out a JSON string of a schema with the Java code below. 
> ```java
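> import org.apache.drill.common.types.TypeProtos.MinorType;
> import org.apache.drill.exec.record.metadata.ColumnMetadata;
> import org.apache.drill.exec.record.metadata.SchemaBuilder;
> import org.apache.drill.exec.record.metadata.TupleMetadata;
> import org.apache.drill.exec.store.easy.json.loader.JsonLoader;
>
> TupleMetadata schema = new SchemaBuilder()
>     .addNullable("a", MinorType.BIGINT)
>     .addNullable("m", MinorType.VARCHAR)
>     .build();
> // Mark column "m" to be read as JSON text rather than parsed.
> ColumnMetadata m = schema.metadata("m");
> m.setProperty(JsonLoader.JSON_MODE, JsonLoader.JSON_LITERAL_MODE);
> System.out.println(schema.jsonString());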
> ```
> This will generate something like the JSON string below:
> ```json
> {
> "type":"tuple_schema",
> "columns":[
> {"name":"a","type":"BIGINT","mode":"OPTIONAL"},
> {"name":"m","type":"VARCHAR","mode":"OPTIONAL","properties":{"drill.json-mode":"json"}}
> ]
> }
> ```
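
The serialized schema has to be embedded as a JSON string, so its quotes must be escaped before it is pasted into `jsonSchema`. A minimal sketch of that escaping step, using only the JDK: the class `SchemaStringEscaper` and the method `toJsonStringLiteral` are hypothetical helpers for illustration and are not part of Drill.

```java
public final class SchemaStringEscaper {

    /**
     * Wraps raw text in double quotes and escapes backslashes, double quotes
     * and newlines. Other control characters are not handled; this covers
     * only what appears in schema strings like the ones shown above.
     */
    public static String toJsonStringLiteral(String raw) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n");  break;
                default:   out.append(c);
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        // Stand-in for the output of schema.jsonString().
        String schemaJson =
            "{\"type\":\"tuple_schema\",\"columns\":"
            + "[{\"name\":\"a\",\"type\":\"BIGINT\",\"mode\":\"OPTIONAL\"}]}";
        System.out.println("\"jsonSchema\": " + toJsonStringLiteral(schemaJson));
    }
}
```

Running `main` prints a `"jsonSchema": "..."` entry in the same escaped form as the JSON string example earlier in this description.
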
> ## Dealing With Inconsistent Schemas
> One of the major challenges of working with JSON data is an inconsistent 
> schema. Drill has a `UNION` data type, which is marked as experimental. At the 
> time of writing, the HTTP plugin does not support `UNION`; however, supplying 
> a schema can resolve many of those issues.
> ### JSON Mode
> Drill offers the option of reading all JSON values as a string. While this 
> can complicate downstream analytics, it can also be a more memory-efficient 
> way of reading data with an inconsistent schema. Unfortunately, at the time 
> of writing, JSON mode is only available with a provided schema. However, 
> future work will allow this mode to be enabled for any JSON data.
> #### Enabling JSON Mode:
> You can enable JSON mode simply by adding the `drill.json-mode` property with 
> a value of `json` to a field, as shown below:
> ```json
> {
> "fieldName": "jsonField",
> "fieldType": "varchar",
> "properties": {
> "drill.json-mode": "json"
> }
> }
> ```
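
To illustrate why reading a field as JSON text helps with inconsistent data, the sketch below returns the raw text of one top-level field whether it holds an object, an array, or a scalar. It is a naive, JDK-only illustration of the idea and not how Drill implements JSON mode; `JsonModeSketch` and `rawValue` are hypothetical names, and the exact text Drill emits (whitespace, quoting) may differ.

```java
public final class JsonModeSketch {

    /**
     * Returns the raw JSON text stored under the first occurrence of the
     * given key, or null if the key is absent. Naive: it does not check
     * that the match is a real top-level key rather than text in a value.
     */
    public static String rawValue(String json, String key) {
        String needle = "\"" + key + "\"";
        int k = json.indexOf(needle);
        if (k < 0) {
            return null;
        }
        int i = json.indexOf(':', k + needle.length());
        if (i < 0) {
            return null;
        }
        i++;
        while (i < json.length() && Character.isWhitespace(json.charAt(i))) {
            i++;
        }
        int start = i;
        int depth = 0;            // nesting level of {} and [] inside the value
        boolean inString = false; // true while scanning a quoted string
        for (; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;          // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                if (depth == 0) {
                    break;        // closing bracket of the enclosing object
                }
                depth--;
            } else if (c == ',' && depth == 0) {
                break;            // end of this field's value
            }
        }
        return json.substring(start, i).trim();
    }
}
```

With this approach, a field that is an object in one record and a plain string in another ends up as text in both cases, so a single `varchar` column can hold every record.
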



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
