[ https://issues.apache.org/jira/browse/SPARK-26378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723095#comment-16723095 ]
ASF GitHub Bot commented on SPARK-26378:
----------------------------------------

bersprockets opened a new pull request #23336: [SPARK-26378][SQL] Restore performance of queries against wide CSV tables
URL: https://github.com/apache/spark/pull/23336

## What changes were proposed in this pull request?

After recent changes to CSV parsing to return partial results for bad CSV records, queries of wide CSV tables slowed considerably. That change resulted in every row being recreated, even when the associated input record had no parsing issues and the user specified no corrupt record field in the schema.

In this PR, I propose that a row be recreated only if there is a parsing error or if columns need to be shifted due to the presence of a corrupt record field in the user-supplied schema. Otherwise, the row should be used as-is. This restores performance for the non-error case only. (A simplified sketch of this fast path appears after the description.)

### Benchmarks

- baseline = commit before the partial-results change
- PR = this PR
- master = master branch

The wide table has 6000 columns and 165,000 records; the narrow table has 12 columns and 82,500,000 records. Tests are run with a single executor. In the tables below, positive percentages are bad (slower), negative percentages are good (faster).

#### Wide rows, all good records

baseline | PR | master | PR diff | master diff
---|---|---|---|---
2.036489 min | 1.990344 min | 2.952561 min | -2.265882% | 44.982923%

#### Wide rows, all bad records

baseline | PR | master | PR diff | master diff
---|---|---|---|---
1.660761 min | 3.016839 min | 3.011944 min | 81.653994% | 81.359283%

Both my PR and the master branch are ~81% slower than the baseline when all records are bad but the user specified no corrupt record field in the schema. In fact, the master branch is reliably, but slightly, faster here, since it does not call badRecord() in this case.

#### Wide rows, corrupt record field, all good records

baseline | PR | master | PR diff | master diff
---|---|---|---|---
2.912467 min | 2.893039 min | 2.905344 min | -0.667056% | -0.244543%

#### Wide rows, corrupt record field, all bad records

baseline | PR | master | PR diff | master diff
---|---|---|---|---
2.441417 min | 2.979544 min | 2.957439 min | 22.041620% | 21.136180%

Both my PR and the master branch are ~21-22% slower than the baseline when all records are bad and the user specified a corrupt record field in the schema.

#### Narrow rows, all good records

baseline | PR | master | PR diff | master diff
---|---|---|---|---
2.004539 min | 1.987183 min | 2.365122 min | -0.865813% | 17.988343%

#### Narrow rows, corrupt record field, all good records

baseline | PR | master | PR diff | master diff
---|---|---|---|---
2.390589 min | 2.382100 min | 2.379733 min | -0.355096% | -0.454095%

## How was this patch tested?

- All SQL unit tests
- Python core and SQL tests
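The proposed fast path can be illustrated with a simplified, standalone sketch. This is illustrative only, not the PR's actual diff: `Array[Any]` stands in for Spark's `InternalRow`, and `makeToResultRow`, `schemaLength`, and `corruptFieldIndex` are hypothetical stand-ins for `FailureSafeParser`'s real state.

```scala
// Simplified sketch of the proposed FailureSafeParser#toResultRow behavior.
object ToResultRowSketch {
  type Row = Array[Any] // hypothetical stand-in for Spark's InternalRow

  def makeToResultRow(
      schemaLength: Int,
      corruptFieldIndex: Option[Int]): (Option[Row], () => String) => Row = {
    // Shared all-null row returned for bad records when there is no
    // corrupt record field (its contents are never mutated).
    val nullResult: Row = new Array[Any](schemaLength)
    corruptFieldIndex match {
      case Some(corruptIdx) =>
        // Corrupt record field present: parsed values must be shifted
        // around it, so each row is copied into a fresh result row.
        (row, badRecord) => {
          val resultRow: Row = new Array[Any](schemaLength)
          var from = 0 // index into the parsed row
          var to = 0   // index into the result row
          while (to < schemaLength) {
            if (to == corruptIdx) {
              // Raw record text for a failed parse, null for a good one.
              resultRow(to) = badRecord()
            } else {
              resultRow(to) = row.map(_(from)).orNull
              from += 1
            }
            to += 1
          }
          resultRow
        }
      case None =>
        // No corrupt record field: a successfully parsed row is used
        // as-is (the proposed fast path); a failed parse yields nulls.
        (row, _) => row.getOrElse(nullResult)
    }
  }
}
```

Reusing the parsed row in the `None` branch avoids copying thousands of fields per record for wide schemas, which is where the slowdown came from; only the error and column-shift cases pay for a fresh row.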
> Queries of wide CSV data slowed after SPARK-26151
> -------------------------------------------------
>
>                 Key: SPARK-26378
>                 URL: https://issues.apache.org/jira/browse/SPARK-26378
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> A recent change significantly slowed queries of wide CSV tables. For example, queries against a 6000 column table slowed by 45-48% when run with a single executor.
>
> The [PR for SPARK-26151|https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2] changed FailureSafeParser#toResultRow such that the returned function recreates every row, even when the associated input record has no parsing issues and the user specified no corrupt record field in the schema. This extra processing is responsible for the slowdown.
>
> I propose that a row be recreated only if there is a parsing error or if columns need to be shifted due to the presence of a corrupt record field in the user-supplied schema. Otherwise, the row should be used as-is.
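For context, here is a hypothetical sketch of the kind of query affected, assuming Spark's `DataFrameReader` CSV API. The file path and the 6000-column all-string schema are illustrative assumptions, not the benchmark's actual inputs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object WideCsvRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wide-csv-repro").getOrCreate()

    // Wide user-supplied schema with no corrupt record field: with the
    // proposed change, a cleanly parsed row is used as-is rather than
    // being copied field by field.
    val wideSchema =
      StructType((0 until 6000).map(i => StructField(s"c$i", StringType)))
    val df = spark.read.schema(wideSchema).csv("/path/to/wide.csv")
    df.selectExpr("count(c0)").show()

    // Adding a corrupt record field forces the copy/shift path either way,
    // consistent with the near-zero diffs in the corresponding benchmarks.
    val withCorrupt = wideSchema.add("_corrupt_record", StringType)
    val df2 = spark.read
      .schema(withCorrupt)
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .csv("/path/to/wide.csv")
    df2.selectExpr("count(c0)").show()

    spark.stop()
  }
}
```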