Shawn Hooton created ORC-200:
--------------------------------

             Summary: json-schema and convert commands should support schema evolution of json documents
                 Key: ORC-200
                 URL: https://issues.apache.org/jira/browse/ORC-200
             Project: ORC
          Issue Type: Bug
          Components: Java
    Affects Versions: 1.5.0
            Reporter: Shawn Hooton
            Assignee: Shawn Hooton
         Attachments: example-v1.json, example-v2.json

Using the command (sample payloads attached):

    java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema -t ~/example-v1.json

Produces the following output:

    create table tbl (
      about string,
      address string,
      age tinyint,
      balance string,
      company string,
      email string,
      eyeColor string,
      favoriteFruit string,
      friends array <struct <
        id: tinyint,
        name: string>>,
      gender string,
      greeting string,
      guid string,
      id binary,
      index tinyint,
      isActive boolean,
      latitude decimal(8,6),
      longitude decimal(8,6),
      name string,
      phone string,
      picture string,
      registered timestamp,
      tags array <string>
    )

Notice that because org/apache/orc/tools/json/StructType.java uses a java.util.TreeMap for the fields instance variable, the generated DDL is sorted alphabetically by field name rather than in the order the fields appear in the document. This causes problems for the convert command as well:

    java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar meta -j -p output-v1.orc

    *** output omitted for brevity
      "schemaString": "struct<about:string,address:string,age:tinyint,balance:string,company:string,email:string,eyeColor:string,favoriteFruit:string,friends:array<struct<id:tinyint,name:string>>,gender:string,greeting:string,guid:string,id:binary,index:tinyint,isActive:boolean,latitude:decimal(8,6),longitude:decimal(8,6),name:string,phone:string,picture:string,registered:timestamp,tags:array<string>>",
      "schema": [
        {
          "columnId": 0,
          "columnType": "STRUCT",
          "childColumnNames": [
            "about", "address", "age", "balance", "company", "email",
            "eyeColor", "favoriteFruit", "friends", "gender", "greeting",
            "guid", "id", "index", "isActive", "latitude", "longitude",
            "name", "phone", "picture", "registered", "tags"
          ],
    *** output omitted for brevity

This causes *major* problems when a field is added to the JSON document later, e.g.:

    java -jar orc-tools-1.5.0-SNAPSHOT-uber.jar json-schema -t ~/example-v2.json

Examine where the newField field is added in the example-v2.json document and then examine the output below. This also affects the convert command.

    create table tbl (
      about string,
      address string,
      age tinyint,
      balance string,
      company string,
      email string,
      eyeColor string,
      favoriteFruit string,
      friends array <struct <
        id: tinyint,
        name: string>>,
      gender string,
      greeting string,
      guid string,
      id binary,
      index tinyint,
      isActive boolean,
      latitude decimal(8,6),
      longitude decimal(8,6),
      name string,
      ***** newField string,
      phone string,
      picture string,
      registered timestamp,
      tags array <string>
    )

The org/apache/orc/tools/json/StructType.java class should use java.util.LinkedHashMap for the fields instance variable so order is maintained across changes to the JSON schema.

Pull request *with* test cases incoming :)
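
To illustrate the difference between the two map types, here is a minimal standalone sketch. It is not the actual StructType code: the class FieldOrderDemo and the helper fieldOrder are made up for illustration, only a few of the field names are used, and it assumes that newField is appended at the end of the v2 document and that fields are put into the map in document order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FieldOrderDemo {

    // Hypothetical helper: puts the field names into the given map in the
    // order supplied (standing in for document order) and returns the
    // order in which the map iterates over them.
    static List<String> fieldOrder(Map<String, String> fields, String... names) {
        for (String name : names) {
            fields.put(name, "string");
        }
        return new ArrayList<>(fields.keySet());
    }

    public static void main(String[] args) {
        // Assumed v2 document order: newField appended after the v1 fields.
        String[] v2 = {"name", "phone", "tags", "newField"};

        // TreeMap iterates in sorted key order, so newField lands in the middle.
        System.out.println(fieldOrder(new TreeMap<>(), v2));
        // prints [name, newField, phone, tags]

        // LinkedHashMap iterates in insertion order, so newField stays last.
        System.out.println(fieldOrder(new LinkedHashMap<>(), v2));
        // prints [name, phone, tags, newField]
    }
}
```

With insertion order preserved, a field appended to the document becomes a trailing column rather than shifting the position of every column that sorts after it.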