[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson
[ https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091389#comment-17091389 ]

Maxim Gekk commented on SPARK-31463:
------------------------------------

Parsing itself takes 10-20% of the time. The JSON datasource spends significant time converting values to the desired types according to the schema. Even if you make parsing a few times faster, the total impact will not be that significant.

> Enhance JsonDataSource by replacing jackson with simdjson
> ---------------------------------------------------------
>
>                 Key: SPARK-31463
>                 URL: https://issues.apache.org/jira/browse/SPARK-31463
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Steven Moy
>            Priority: Minor
>
> I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how
> to improve JSON reading speed. We use Spark to process terabytes of JSON, so
> we are looking for ways to improve JSON parsing speed.
>
> [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/]
>
> [https://github.com/simdjson/simdjson/issues/93]
>
> Is anyone in the open-source community interested in leading this effort to
> integrate simdjson into the Spark JSON data source API?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
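The argument in the comment above can be illustrated with Amdahl's law. This is only a rough sketch: it assumes the 10-20% figure is the share of total read time spent in parsing and that the remaining time (type conversion and everything else) is unchanged by a faster parser. The function name `overall_speedup` and the 2x/4x parser speedups are illustrative assumptions, not measurements from Spark or simdjson.

```python
def overall_speedup(parse_fraction: float, parse_speedup: float) -> float:
    """Amdahl's law: whole-job speedup when only the parsing share gets faster.

    parse_fraction: share of total time spent in parsing (0.10-0.20 per the comment).
    parse_speedup:  hypothetical factor by which parsing alone becomes faster.
    """
    if not 0.0 <= parse_fraction <= 1.0:
        raise ValueError("parse_fraction must be within [0, 1]")
    if parse_speedup <= 0.0:
        raise ValueError("parse_speedup must be positive")
    # Time that is not parsing stays the same; parsing time shrinks by parse_speedup.
    return 1.0 / ((1.0 - parse_fraction) + parse_fraction / parse_speedup)


if __name__ == "__main__":
    for fraction in (0.10, 0.20):
        # float("inf") models an infinitely fast parser, i.e. the upper bound.
        for speedup in (2.0, 4.0, float("inf")):
            total = overall_speedup(fraction, speedup)
            print(f"parsing share {fraction:.0%}, parser {speedup}x faster "
                  f"-> whole read {total:.2f}x faster")
```

Under these assumptions, even an infinitely fast parser caps the end-to-end gain at 1 / (1 - parse_fraction), which is 1.25x when parsing is 20% of the total.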
[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson
[ https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091232#comment-17091232 ]

Shashanka Balakuntala Srinivasa commented on SPARK-31463:
---------------------------------------------------------

Hi [~hyukjin.kwon],
I will start looking into this. Thanks.
[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson
[ https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091230#comment-17091230 ]

Hyukjin Kwon commented on SPARK-31463:
--------------------------------------

A separate source might be ideal. We can start it as a separate project and gradually move it into Apache Spark later, once it has proven very useful.
[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson
[ https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091205#comment-17091205 ]

Steven Moy commented on SPARK-31463:
------------------------------------

Hi [~hyukjin.kwon],

What is Spark's recommended path for introducing C code? I have been following SQLite and DuckDB; their approach is to inline the dependency (bring the code in when the license is compatible). Or would it be better to support simdjson as a completely separate DataSourceV2 implementation?

simdjson is also under the Apache License: [https://github.com/simdjson/simdjson/blob/master/LICENSE]
[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson
[ https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091199#comment-17091199 ]

Hyukjin Kwon commented on SPARK-31463:
--------------------------------------

So it's about vectorization, right? I think [~maxgekk] talked about vectorization somewhere. My biggest concern is whether it's right to bring the C library into Spark as a dependency.
[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson
[ https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087836#comment-17087836 ]

Shashanka Balakuntala Srinivasa commented on SPARK-31463:
---------------------------------------------------------

Hi,
Is anyone working on this issue? If not, can I have some details on the implementation if we are moving from jackson to simdjson?

> Enhance JsonDataSource by replacing jackson with simdjson
> ---------------------------------------------------------
>
>                 Key: SPARK-31463
>                 URL: https://issues.apache.org/jira/browse/SPARK-31463
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3
>            Reporter: Steven Moy
>            Priority: Minor
>
> I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how
> to improve JSON reading speed. We use Spark to process terabytes of JSON, so
> we are looking for ways to improve JSON parsing speed.
>
> [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/]
>
> [https://github.com/simdjson/simdjson/issues/93]
>
> Is anyone in the open-source community interested in leading this effort to
> integrate simdjson into the Spark JSON data source API?