[ https://issues.apache.org/jira/browse/SPARK-26378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723095#comment-16723095 ]
ASF GitHub Bot commented on SPARK-26378:
----------------------------------------

bersprockets opened a new pull request #23336: [SPARK-26378][SQL] Restore performance of queries against wide CSV tables
URL: https://github.com/apache/spark/pull/23336

## What changes were proposed in this pull request?

After recent changes to CSV parsing to return partial results for bad CSV records, queries of wide CSV tables slowed considerably. That change resulted in every row being recreated, even when the associated input record had no parsing issues and the user specified no corrupt record field in the schema.

In this PR, I propose that a row be recreated only if there is a parsing error or if columns need to be shifted due to the presence of a corrupt record field in the user-supplied schema. Otherwise, the row should be used as-is. This restores performance for the non-error case only. (A simplified sketch of this fast path appears after the description.)

### Benchmarks

- baseline = commit before the partial-results change
- PR = this PR
- master = master branch

The wide table has 6000 columns and 165,000 records; the narrow table has 12 columns and 82,500,000 records. Tests are run with a single executor. In the tables below, positive percentages are bad (slower), negative percentages are good (faster).

#### Wide rows, all good records

baseline | PR | master | PR diff | master diff
---|---|---|---|---
2.036489 min | 1.990344 min | 2.952561 min | -2.265882% | 44.982923%

#### Wide rows, all bad records

baseline | PR | master | PR diff | master diff
---|---|---|---|---
1.660761 min | 3.016839 min | 3.011944 min | 81.653994% | 81.359283%

Both my PR and the master branch are ~81% slower than the baseline when all records are bad but the user specified no corrupt record field in the schema. In fact, the master branch is reliably, but slightly, faster here, since it does not call badRecord() in this case.

#### Wide rows, corrupt record field, all good records

baseline | PR | master | PR diff | master diff
---|---|---|---|---
2.912467 min | 2.893039 min | 2.905344 min | -0.667056% | -0.244543%

#### Wide rows, corrupt record field, all bad records

baseline | PR | master | PR diff | master diff
---|---|---|---|---
2.441417 min | 2.979544 min | 2.957439 min | 22.041620% | 21.136180%

Both my PR and the master branch are ~21-22% slower than the baseline when all records are bad and the user specified a corrupt record field in the schema.

#### Narrow rows, all good records

baseline | PR | master | PR diff | master diff
---|---|---|---|---
2.004539 min | 1.987183 min | 2.365122 min | -0.865813% | 17.988343%

#### Narrow rows, corrupt record field, all good records

baseline | PR | master | PR diff | master diff
---|---|---|---|---
2.390589 min | 2.382100 min | 2.379733 min | -0.355096% | -0.454095%

## How was this patch tested?

- All SQL unit tests
- Python core and SQL tests
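The proposed fast path can be illustrated with a simplified, standalone sketch. This is illustrative only, not the PR's actual diff: `Array[Any]` stands in for Spark's `InternalRow`, and `makeToResultRow`, `schemaLength`, and `corruptFieldIndex` are hypothetical stand-ins for `FailureSafeParser`'s real state.

```scala
// Simplified sketch of the proposed FailureSafeParser#toResultRow behavior.
object ToResultRowSketch {
  type Row = Array[Any] // hypothetical stand-in for Spark's InternalRow

  def makeToResultRow(
      schemaLength: Int,
      corruptFieldIndex: Option[Int]): (Option[Row], () => String) => Row = {
    // Shared all-null row returned for bad records when there is no
    // corrupt record field (its contents are never mutated).
    val nullResult: Row = new Array[Any](schemaLength)
    corruptFieldIndex match {
      case Some(corruptIdx) =>
        // Corrupt record field present: parsed values must be shifted
        // around it, so each row is copied into a fresh result row.
        (row, badRecord) => {
          val resultRow: Row = new Array[Any](schemaLength)
          var from = 0 // index into the parsed row
          var to = 0   // index into the result row
          while (to < schemaLength) {
            if (to == corruptIdx) {
              // Raw record text for a failed parse, null for a good one.
              resultRow(to) = badRecord()
            } else {
              resultRow(to) = row.map(_(from)).orNull
              from += 1
            }
            to += 1
          }
          resultRow
        }
      case None =>
        // No corrupt record field: a successfully parsed row is used
        // as-is (the proposed fast path); a failed parse yields nulls.
        (row, _) => row.getOrElse(nullResult)
    }
  }
}
```

Reusing the parsed row in the `None` branch avoids copying thousands of fields per record for wide schemas, which is where the slowdown came from; only the error and column-shift cases pay for a fresh row.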
> Queries of wide CSV data slowed after SPARK-26151
> -------------------------------------------------
>
>                 Key: SPARK-26378
>                 URL: https://issues.apache.org/jira/browse/SPARK-26378
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> A recent change significantly slowed queries of wide CSV tables. For example, queries against a 6000 column table slowed by 45-48% when run with a single executor.
>
> The [PR for SPARK-26151|https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2] changed FailureSafeParser#toResultRow such that the returned function recreates every row, even when the associated input record has no parsing issues and the user specified no corrupt record field in the schema. This extra processing is responsible for the slowdown.
>
> I propose that a row be recreated only if there is a parsing error or if columns need to be shifted due to the presence of a corrupt record field in the user-supplied schema. Otherwise, the row should be used as-is.
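For context, here is a hypothetical sketch of the kind of query affected, assuming Spark's `DataFrameReader` CSV API. The file path and the 6000-column all-string schema are illustrative assumptions, not the benchmark's actual inputs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object WideCsvRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wide-csv-repro").getOrCreate()

    // Wide user-supplied schema with no corrupt record field: with the
    // proposed change, a cleanly parsed row is used as-is rather than
    // being copied field by field.
    val wideSchema =
      StructType((0 until 6000).map(i => StructField(s"c$i", StringType)))
    val df = spark.read.schema(wideSchema).csv("/path/to/wide.csv")
    df.selectExpr("count(c0)").show()

    // Adding a corrupt record field forces the copy/shift path either way,
    // consistent with the near-zero diffs in the corresponding benchmarks.
    val withCorrupt = wideSchema.add("_corrupt_record", StringType)
    val df2 = spark.read
      .schema(withCorrupt)
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .csv("/path/to/wide.csv")
    df2.selectExpr("count(c0)").show()

    spark.stop()
  }
}
```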