Maxim Gekk created SPARK-25396:
----------------------------------

             Summary: Read array of JSON objects via an Iterator
                 Key: SPARK-25396
                 URL: https://issues.apache.org/jira/browse/SPARK-25396
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Maxim Gekk

If a JSON file has a structure like the one below:
{code}
[
  {
    "time":"2018-08-13T18:00:44.0860000Z",
    "resourceId":"some-text",
    "category":"A",
    "level":2,
    "operationName":"Error",
    "properties":{...}
  },
  {
    "time":"2018-08-14T18:00:44.0860000Z",
    "resourceId":"some-text2",
    "category":"B",
    "level":3,
    "properties":{...}
  }
]
{code}
it should be read in the `multiLine` mode. In this mode, Spark reads the whole array into memory, whether the schema is `ArrayType` or `StructType`. This can lead to unnecessary memory consumption, and even to OOM for big JSON files. In general, there is no need to materialize all parsed JSON records in memory there: https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95 . So, the JSON objects of an array can be read via an Iterator.
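
To illustrate the idea outside of Spark, here is a minimal sketch in plain Python (standard library only). It is not the Spark or Jackson implementation, and the names `iter_json_objects` and `chunk_size` are made up for this example. It assumes the input is a top-level array whose elements are all JSON objects. It reads the input in chunks and yields each object as soon as it is complete, so only the current object and a small buffer are held in memory.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_json_objects(stream: TextIO, chunk_size: int = 65536) -> Iterator[Dict[str, Any]]:
    """Lazily yield the objects of a top-level JSON array (hypothetical helper)."""
    decoder = json.JSONDecoder()
    buf, pos, eof = "", 0, False

    def fill() -> bool:
        # Drop the consumed prefix and append the next chunk of input.
        nonlocal buf, pos, eof
        if eof:
            return False
        chunk = stream.read(chunk_size)
        if not chunk:
            eof = True
            return False
        buf = buf[pos:] + chunk
        pos = 0
        return True

    def skip_ws() -> None:
        # Skip whitespace, pulling in more input when the buffer runs out.
        nonlocal pos
        while True:
            while pos < len(buf) and buf[pos] in " \t\r\n":
                pos += 1
            if pos < len(buf) or not fill():
                return

    skip_ws()
    if pos >= len(buf) or buf[pos] != "[":
        raise ValueError("expected a top-level JSON array")
    pos += 1

    first = True
    while True:
        skip_ws()
        if pos >= len(buf):
            raise ValueError("unexpected end of input inside array")
        if buf[pos] == "]":
            return
        if not first:
            if buf[pos] != ",":
                raise ValueError("expected ',' between array elements")
            pos += 1
            skip_ws()
        first = False
        if pos >= len(buf) or buf[pos] != "{":
            raise ValueError("expected a JSON object")
        while True:
            try:
                # An object is self-delimiting, so a successful decode is final.
                obj, pos = decoder.raw_decode(buf, pos)
                break
            except json.JSONDecodeError:
                # Most likely the object is cut off at the buffer end: read more
                # and retry. Malformed input is only reported at end of input.
                if not fill():
                    raise
        yield obj


if __name__ == "__main__":
    sample = io.StringIO('[ {"category": "A", "level": 2}, {"category": "B", "level": 3} ]')
    for record in iter_json_objects(sample, chunk_size=16):
        print(record["category"], record["level"])
```

The caller decides how many records are alive at once: a `for` loop keeps one, while `list(...)` would materialize them all again. Retrying the decode after each refill is simple but re-parses the partial object, so a production parser would use a true token-level streaming API instead.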