[ https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609475#comment-16609475 ]
Hyukjin Kwon commented on SPARK-25396:
--------------------------------------

Oh haha, yeah, I tried this myself before and kind of failed because of having to deal with malformed records. If you see a good approach, please go ahead.

> Read array of JSON objects via an Iterator
> ------------------------------------------
>
>                 Key: SPARK-25396
>                 URL: https://issues.apache.org/jira/browse/SPARK-25396
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Maxim Gekk
>            Priority: Minor
>
> If a JSON file has a structure like the one below:
> {code}
> [
>   {
>     "time":"2018-08-13T18:00:44.0860000Z",
>     "resourceId":"some-text",
>     "category":"A",
>     "level":2,
>     "operationName":"Error",
>     "properties":{...}
>   },
>   {
>     "time":"2018-08-14T18:00:44.0860000Z",
>     "resourceId":"some-text2",
>     "category":"B",
>     "level":3,
>     "properties":{...}
>   },
>   ...
> ]
> {code}
> it has to be read in `multiLine` mode. In this mode, Spark reads the whole
> array into memory, both when the schema is `ArrayType` and when it is
> `StructType`. This can lead to unnecessary memory consumption and even to
> OOM for big JSON files.
> In general, there is no need to materialize all parsed JSON records in
> memory there:
> https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
> So, the JSON objects of an array can be read via an Iterator.
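
To make the idea in the description concrete, here is a minimal Python sketch of reading the elements of a top-level JSON array one at a time instead of materializing the whole array. It is not Spark's `JacksonParser` code and says nothing about how the issue was eventually resolved. The function name `iter_json_array` and the `chunk_size` parameter are made up for this illustration. It also sidesteps the hard part mentioned in the comment: on a malformed record it simply raises instead of applying any recovery policy.

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_json_array(stream: TextIO, chunk_size: int = 65536) -> Iterator[Any]:
    """Lazily yield the elements of a top-level JSON array read from `stream`.

    Hypothetical helper for illustration only. It buffers the element that is
    currently being parsed, not the whole array. It is lenient in places
    (for example, it accepts a trailing comma and ignores text after `]`).
    """
    decoder = json.JSONDecoder()
    buf = ""

    def read_more() -> bool:
        # Append the next chunk to the buffer; False means end of input.
        nonlocal buf
        chunk = stream.read(chunk_size)
        buf += chunk
        return bool(chunk)

    # Skip leading whitespace and consume the opening bracket.
    while not buf.lstrip():
        if not read_more():
            raise ValueError("empty input, expected a JSON array")
    buf = buf.lstrip()
    if buf[0] != "[":
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]

    while True:
        buf = buf.lstrip()
        if not buf:
            if not read_more():
                raise ValueError("unexpected end of input inside array")
            continue
        if buf[0] == "]":
            return
        try:
            # raw_decode parses one value from the start of the buffer and
            # returns it together with the index where parsing stopped.
            value, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            # The element may simply be cut off at a chunk boundary.
            if not read_more():
                raise
            continue
        rest = buf[end:].lstrip()
        if not rest or rest[0] not in ",]":
            # No delimiter seen yet: the value may be truncated (e.g. a
            # number split across chunks), so fetch more input and retry.
            if not read_more():
                raise ValueError("truncated or malformed JSON array")
            continue
        yield value
        # Drop the consumed element and its separating comma, if any.
        buf = rest[1:] if rest[0] == "," else rest
```

A caller would iterate over it like any other generator, for example `for record in iter_json_array(open(path, encoding="utf-8")): ...`, so only one record plus the unread part of the current chunk is held in memory at a time. One known weakness of this sketch: because a parse failure is treated as "maybe truncated", a genuinely malformed element causes it to keep reading to the end of the input before it raises.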