[ https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630075#comment-15630075 ]
Khurram Faraaz commented on DRILL-4653:
---------------------------------------

I don't think this is fixed; there are still some cases that need to be taken care of. Please see below.

Also, and more importantly, I believe this check for malformed JSON should be ON (enabled) by default in Drill. Users would rather skip bad records than see an Exception/Error and then have our support team suggest that they enable store.json.reader.skip_invalid_records.

{noformat}
[test@cent01 drill_4653]# cat badjson_01.json
{"key":"test string"}
{"key":"foo"}
{"key":"foobar"
{"key":"blah"}
{"key":"temp"}
[test@cent01 drill_4653]# cat badjson_02.json
{
    "key":"foo",
    "badarray":[1,3,4,5,6,7,8,,
    "key":"test string",
    "key":"foobar"
}
[test@cent01 drill_4653]#
[test@cent01 drill_4653]# cat badjson_03.json
{
    "key":"foo",
    "key":"foobar",
    "key":"test string",
    "key":"string",
    "key":
}
[test@cent01 drill_4653]#
[test@cent01 drill_4653]# cat badjson_04.json
{"key":1}
{"key":2}
{"key":3}
{"key":
[test@cent01 drill_4653]
[test@cent01 drill_4653]# cat badjson_05.json
{
    "key1":"foobar",
    "key2":[1,3,4,5,6,7,8,9],
    "key3":{ "key4":},
    "key5":"foo"
}
[test@cent01 drill_4653]
[test@cent01 drill_4653]# cat badjson_06.json
{
    "name":"John Doe",
    "age":33,
    "dept":"IT",
    "address":{
        "street":"some street",
        "city":"some city",
        "zip":
    }
    "isManager":"yes"
}
[test@cent01 drill_4653]
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> alter session set `store.json.reader.skip_invalid_records`=true;
+-------+--------------------------------------------------+
|  ok   |                     summary                      |
+-------+--------------------------------------------------+
| true  | store.json.reader.skip_invalid_records updated.  |
+-------+--------------------------------------------------+
1 row selected (0.334 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_01.json`;
+--------------+
|     key      |
+--------------+
| test string  |
| foo          |
| temp         |
+--------------+
3 rows selected (0.466 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_01.json`;
+--------------+
|     key      |
+--------------+
| test string  |
| foo          |
| temp         |
+--------------+
3 rows selected (0.222 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_02.json`;
Error: DATA_READ ERROR: Unexpected character (',' (code 44)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@3e4c2712; line: 3, column: 32]

Line 3
Column 33
Field badarray
Fragment 0:0

[Error Id: 6da211b5-a287-4239-82b4-26a35e47ed10 on centos-01.qa.lab:31010] (state=,code=0)
{noformat}

Stack trace from drillbit.log for the above failure:

{noformat}
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: Unexpected character (',' (code 44)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@3e4c2712; line: 3, column: 32]

Line 3
Column 33
Field badarray

[Error Id: 6da211b5-a287-4239-82b4-26a35e47ed10 ]
    at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:543) ~[drill-common-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:586) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:372) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.writeDataSwitch(JsonReader.java:306) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.writeToVector(JsonReader.java:247) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.write(JsonReader.java:202) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.store.easy.json.JSONRecordReader.next(JSONRecordReader.java:206) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:178) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:135) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:81) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:94) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:232) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:226) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at java.security.AccessController.doPrivileged(Native Method) [na:1.8.0_91]
    at javax.security.auth.Subject.doAs(Subject.java:422) [na:1.8.0_91]
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595) [hadoop-common-2.7.0-mapr-1607.jar:na]
    at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:226) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_91]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character (',' (code 44)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@3e4c2712; line: 3, column: 32]
    at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1586) ~[jackson-core-2.7.1.jar:2.7.1]
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:521) ~[jackson-core-2.7.1.jar:2.7.1]
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:450) ~[jackson-core-2.7.1.jar:2.7.1]
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2628) ~[jackson-core-2.7.1.jar:2.7.1]
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:854) ~[jackson-core-2.7.1.jar:2.7.1]
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:748) ~[jackson-core-2.7.1.jar:2.7.1]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:537) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    ... 24 common frames omitted
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_02.json`;
+------+
| key  |
+------+
+------+
No rows selected (0.477 seconds)
{noformat}

The query below should return "foo", "foobar", "test string", "string" in 4 rows.

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_03.json`;
+------+
| key  |
+------+
+------+
No rows selected (0.208 seconds)
{noformat}

The query below should return "foobar".

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_03.json` where key ='foobar';
+------+
| key  |
+------+
+------+
No rows selected (0.253 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_04.json`;
+------+
| key  |
+------+
| 1    |
| 2    |
| 3    |
+------+
3 rows selected (0.232 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_04.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input within/between OBJECT entries

File /tmp/badjson_04.json
Record 4
Column 39
Fragment 0:0

[Error Id: a30668ff-8bdc-44bc-aeac-c566e2f731b6 on centos-01.qa.lab:31010] (state=,code=0)

Stack trace from drillbit.log

Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input within/between OBJECT entries
 at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@37039ebe; line: 5, column: 39]
    at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1586) ~[jackson-core-2.7.1.jar:2.7.1]
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipColon2(UTF8StreamJsonParser.java:3038) ~[jackson-core-2.7.1.jar:2.7.1]
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipColon(UTF8StreamJsonParser.java:2950) ~[jackson-core-2.7.1.jar:2.7.1]
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:756) ~[jackson-core-2.7.1.jar:2.7.1]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:350) ~[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.writeDataSwitch(JsonReader.java:306) ~[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.writeToVector(JsonReader.java:247) ~[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.vector.complex.fn.JsonReader.write(JsonReader.java:202) ~[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    at org.apache.drill.exec.store.easy.json.JSONRecordReader.next(JSONRecordReader.java:206) [drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
    ... 19 common frames omitted
{noformat}

The query below should return "foobar" in key1 and the array [1,3,4,5,6,7,8,9] in key2.

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_05.json`;
+-------+-------+
| key1  | key2  |
+-------+-------+
+-------+-------+
No rows selected (0.229 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key1 from `badjson_05.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('}' (code 125)): expected a value

File /tmp/badjson_05.json
Record 1
Column 22
Fragment 0:0

[Error Id: 01a8ce3b-b0c0-41c5-92cd-3467265b60a6 on centos-01.qa.lab:31010] (state=,code=0)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select key2 from `badjson_05.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('}' (code 125)): expected a value

File /tmp/badjson_05.json
Record 1
Column 22
Fragment 0:0

[Error Id: 40bb646b-18e7-4dff-812d-f409ea1fcf27 on centos-01.qa.lab:31010] (state=,code=0)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_06.json`;
+-------+------+-------+----------+
| name  | age  | dept  | address  |
+-------+------+-------+----------+
+-------+------+-------+----------+
No rows selected (0.205 seconds)
{noformat}

{noformat}
0: jdbc:drill:schema=dfs.tmp> select name from `badjson_06.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('}' (code 125)): expected a value

File /tmp/badjson_06.json
Record 1
Column 16
Fragment 0:0

[Error Id: b549023e-1f54-418c-adc5-9a21cf0ec3aa on centos-01.qa.lab:31010] (state=,code=0)
{noformat}

> Malformed JSON should not stop the entire query from progressing
> ----------------------------------------------------------------
>
>                 Key: DRILL-4653
>                 URL: https://issues.apache.org/jira/browse/DRILL-4653
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - JSON
>    Affects Versions: 1.6.0
>            Reporter: subbu srinivasan
>             Fix For: 1.9.0
>
>
> Currently a Drill query terminates upon first encounter of an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something
> similar to a setting of (ignore.malformed.json) would help.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
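
For illustration only, here is a minimal Python sketch of the "skip bad records and keep going" behaviour requested in this issue, applied to the contents of badjson_01.json. It is not Drill's implementation: the function name read_json_lines and its skip_invalid_records parameter are hypothetical, and the sketch assumes one JSON record per line, which does not cover the multi-line objects in badjson_02.json through badjson_06.json.

```python
import json


def read_json_lines(lines, skip_invalid_records=True):
    """Yield one parsed record per well-formed line.

    Hypothetical helper, not a Drill API. When skip_invalid_records is
    False, the first malformed line aborts the scan with a ValueError.
    """
    for line_no, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        try:
            yield json.loads(text)
        except json.JSONDecodeError as err:
            if not skip_invalid_records:
                raise ValueError(
                    f"malformed JSON on line {line_no}: {err}"
                ) from err
            # Otherwise drop the bad record and continue with the next line.


# Contents of badjson_01.json from the comment, one record per line.
sample = [
    '{"key":"test string"}',
    '{"key":"foo"}',
    '{"key":"foobar"',
    '{"key":"blah"}',
    '{"key":"temp"}',
]

keys = [record["key"] for record in read_json_lines(sample)]
print(keys)
```

Because this sketch resynchronises at every newline, it keeps "blah" as well, whereas the Drill output shown for badjson_01.json returns only "test string", "foo" and "temp".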