[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-11 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611537#comment-16611537
 ] 

Hyukjin Kwon commented on SPARK-25396:
--

Yeah, IIRC I postponed closing the parser in the attempt I made at that time; 
that is also related to handling malformed records. I hope it's not too 
complicated, although, as you already know, the code here is quite convoluted. 
One possibility is to add a separate method that parses only arrays and 
returns an iterator.
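
A minimal sketch of what an array-only parsing method that returns an iterator 
could look like. It is written in Python with the standard json module as a 
stand-in for the Scala/Jackson code, so it is not Spark's implementation. The 
name iter_json_array and the chunk_size parameter are hypothetical. Only the 
current chunk and the element being decoded are held in memory.

```python
import io
import json
from typing import Any, Iterator, TextIO

WHITESPACE = " \t\r\n"


def iter_json_array(stream: TextIO, chunk_size: int = 65536) -> Iterator[Any]:
    """Yield the elements of a top-level JSON array one at a time.

    Hypothetical helper: the stream is read in chunks and the whole array is
    never materialized. Malformed input is only reported at end of input.
    """
    decoder = json.JSONDecoder()
    buf, pos = "", 0

    def fill() -> bool:
        # Drop consumed text and append the next chunk; False at end of input.
        nonlocal buf, pos
        chunk = stream.read(chunk_size)
        if not chunk:
            return False
        buf = buf[pos:] + chunk
        pos = 0
        return True

    def peek() -> str:
        # Skip whitespace and return the next character, or "" at end of input.
        nonlocal pos
        while True:
            while pos < len(buf) and buf[pos] in WHITESPACE:
                pos += 1
            if pos < len(buf):
                return buf[pos]
            if not fill():
                return ""

    if peek() != "[":
        raise ValueError("expected a top-level JSON array")
    pos += 1  # consume '['
    if peek() == "]":
        return  # empty array
    while True:
        while True:
            # Accept a value only when ',' or ']' follows it in the buffer;
            # otherwise it may be cut off at a chunk boundary, so read more.
            complete = False
            try:
                value, end = decoder.raw_decode(buf, pos)
                j = end
                while j < len(buf) and buf[j] in WHITESPACE:
                    j += 1
                complete = j < len(buf) and buf[j] in ",]"
            except json.JSONDecodeError:
                pass
            if complete:
                break
            if not fill():
                raise ValueError("truncated or malformed JSON array")
        pos = end
        yield value
        if peek() == "]":
            return
        pos += 1  # consume ','
        peek()    # skip whitespace before the next element


sample = io.StringIO('[{"level": 2}, {"level": 3}]')
for record in iter_json_array(sample, chunk_size=8):
    print(record["level"])
```

The sketch simply raises on malformed input. Mapping that onto Spark's 
handling of malformed records is the hard part mentioned above and is not 
addressed here.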

> Read array of JSON objects via an Iterator
> --
>
> Key: SPARK-25396
> URL: https://issues.apache.org/jira/browse/SPARK-25396
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Maxim Gekk
> Priority: Minor
>
> If a JSON file has a structure like below:
> {code}
> [
>   {
>     "time":"2018-08-13T18:00:44.086Z",
>     "resourceId":"some-text",
>     "category":"A",
>     "level":2,
>     "operationName":"Error",
>     "properties":{...}
>   },
>   {
>     "time":"2018-08-14T18:00:44.086Z",
>     "resourceId":"some-text2",
>     "category":"B",
>     "level":3,
>     "properties":{...}
>   },
>   ...
> ]
> {code}
> it should be read in the `multiLine` mode. In this mode, Spark reads the whole 
> array into memory, both when the schema is `ArrayType` and when it is 
> `StructType`. This can lead to unnecessary memory consumption and even to OOM 
> for big JSON files.
> In general, there is no need to materialize all parsed JSON records in memory 
> there: 
> https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
>  . So, the JSON objects of an array can be read via an Iterator. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609605#comment-16609605
 ] 

Maxim Gekk commented on SPARK-25396:


I have a concern about when I should close the Jackson parser. For now, it is 
closed before the result is returned from the parse method there: 
[https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L394-L404]
 . If I return an *Iterator[InternalRow]* instead of *Seq[InternalRow]*, I 
have to postpone closing the Jackson parser at least until the end of the 
current task, right? But that is bad for per-line mode because it could leave 
a lot of JSON parsers open. It seems the implementations for multiLine mode 
and for per-line mode should be different.
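
A minimal sketch of the deferred close described above, in Python as a 
stand-in for the Scala/Jackson code and not Spark's implementation. The class 
name ClosingIterator and its on_close callback are hypothetical. The source is 
closed when the iterator is exhausted or fails, and close() can also be called 
explicitly, for example from an end-of-task hook.

```python
import io
import json
from typing import Any, Callable, Iterator


class ClosingIterator:
    """Wrap an iterator and run on_close exactly once.

    Hypothetical helper: on_close runs when the wrapped iterator is
    exhausted, when it raises, or when close() is called explicitly.
    """

    def __init__(self, inner: Iterator[Any], on_close: Callable[[], None]) -> None:
        self._inner = inner
        self._on_close = on_close
        self._closed = False

    def __iter__(self) -> "ClosingIterator":
        return self

    def __next__(self) -> Any:
        if self._closed:
            raise StopIteration
        try:
            return next(self._inner)
        except BaseException:
            # Covers normal exhaustion (StopIteration) and parse errors.
            self.close()
            raise

    def close(self) -> None:
        if not self._closed:
            self._closed = True
            self._on_close()


# One source, one deferred close: the source stays open while rows are pulled.
source = io.StringIO('{"level": 2}\n{"level": 3}\n')
rows = ClosingIterator((json.loads(line) for line in source), source.close)
first = next(rows)
still_open = not source.closed
rest = list(rows)
print(first, still_open, rest, source.closed)
```

With one wrapper per file, as in multiLine mode, only one source per file 
stays open. Creating one wrapper per line would keep one open source per 
unconsumed line, which is the per-line concern raised above.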




[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609479#comment-16609479
 ] 

Hyukjin Kwon commented on SPARK-25396:
--

At that time, there was no multiLine mode and there were no JSON functions, so 
I am not sure how it would look in the current code, but I still agree with 
this idea in general.




[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609475#comment-16609475
 ] 

Hyukjin Kwon commented on SPARK-25396:
--

Oh, haha, yeah, I tried this myself before and kind of failed because of the 
handling of malformed records. If you see a good approach, please go ahead.




[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609469#comment-16609469
 ] 

Maxim Gekk commented on SPARK-25396:


[~hyukjin.kwon] What do you think?
