[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-24 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091389#comment-17091389
 ] 

Maxim Gekk commented on SPARK-31463:


Parsing itself accounts for only 10-20% of the time. The JSON datasource spends
significant time converting values to the types required by the schema. Even if
parsing becomes several times faster, the overall impact will not be that
significant.
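
To put rough numbers on this, the argument is Amdahl's law applied to the
10-20% figure. The sketch below is illustrative only: the function name
overall_speedup and the 2x/4x parser speedups are assumptions chosen for the
example, not measurements from Spark or simdjson.

```python
def overall_speedup(parse_fraction: float, parser_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only the parsing share gets faster.

    parse_fraction -- share of total read time spent in parsing (0.0 to 1.0)
    parser_speedup -- how many times faster the new parser is (hypothetical)
    """
    if not 0.0 <= parse_fraction <= 1.0:
        raise ValueError("parse_fraction must be between 0 and 1")
    if parser_speedup <= 0:
        raise ValueError("parser_speedup must be positive")
    # Type conversion and everything else is unchanged; only parsing shrinks.
    remaining = (1.0 - parse_fraction) + parse_fraction / parser_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    for fraction in (0.10, 0.20):        # the 10-20% range quoted above
        for speedup in (2.0, 4.0):       # assumed parser improvements
            total = overall_speedup(fraction, speedup)
            print(f"parsing={fraction:.0%} parser={speedup:.0f}x "
                  f"-> overall {total:.2f}x")
```

Even with an arbitrarily fast parser the bound is 1 / (1 - p): at most about
1.11x overall when parsing is 10% of the time, and 1.25x when it is 20%.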

> Enhance JsonDataSource by replacing jackson with simdjson
> -
>
> Key: SPARK-31463
> URL: https://issues.apache.org/jira/browse/SPARK-31463
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Steven Moy
> Priority: Minor
>
> I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how
> to improve JSON reading speed. We use Spark to process terabytes of JSON, so
> we are looking for ways to improve JSON parsing speed.
>  
> [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/]
>  
> [https://github.com/simdjson/simdjson/issues/93]
>  
> Is anyone in the open source community interested in leading an effort to
> integrate simdjson into the Spark JSON data source API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-24 Thread Shashanka Balakuntala Srinivasa (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091232#comment-17091232
 ] 

Shashanka Balakuntala Srinivasa commented on SPARK-31463:
-

Hi [~hyukjin.kwon], I will start looking into this. Thanks.




[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091230#comment-17091230
 ] 

Hyukjin Kwon commented on SPARK-31463:
--

A separate source might be ideal. We can start it as a separate project and
gradually move it into Apache Spark later, once it has proven very useful.




[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-23 Thread Steven Moy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091205#comment-17091205
 ] 

Steven Moy commented on SPARK-31463:


Hi [~hyukjin.kwon], 

What is Spark's recommended path for introducing C code? I was looking at
SQLite and DuckDB; their approach is to inline the dependency (bringing the
code in when the license is compatible).

Or would it be better to support simdjson as a completely separate DataSourceV2
implementation?

simdjson is licensed under the Apache License as well:
[https://github.com/simdjson/simdjson/blob/master/LICENSE]




[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-23 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091199#comment-17091199
 ] 

Hyukjin Kwon commented on SPARK-31463:
--

So it's about vectorization, right? I think [~maxgekk] talked about
vectorization somewhere.
My biggest concern is whether it is right to bring a C library into Spark as a
dependency.




[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-20 Thread Shashanka Balakuntala Srinivasa (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087836#comment-17087836
 ] 

Shashanka Balakuntala Srinivasa commented on SPARK-31463:
-

Hi, is anyone working on this issue?
If not, can I have some details on the implementation if we are moving from
Jackson to simdjson?

> Enhance JsonDataSource by replacing jackson with simdjson
> -
>
> Key: SPARK-31463
> URL: https://issues.apache.org/jira/browse/SPARK-31463
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3
>Reporter: Steven Moy
>Priority: Minor


