[ https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442783#comment-16442783 ]
Ruslan Dautkhanov commented on SPARK-23963:
-------------------------------------------

I thought it was just a matter of a Spark committer committing the same PR [https://github.com/apache/spark/pull/21043] to a different branch, Spark 2.2 in this case. As you discovered, this PR gives a 24x improvement on 6000 columns, so I think this one-line change should be accepted into Spark 2.2 fairly easily.

> Queries on text-based Hive tables grow disproportionately slower as the
> number of columns increases
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23963
>                 URL: https://issues.apache.org/jira/browse/SPARK-23963
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Bruce Robbins
>            Assignee: Bruce Robbins
>            Priority: Minor
>             Fix For: 2.4.0
>
> TableReader gets disproportionately slower as the number of columns in the
> query increases.
> For example, reading a table with 6000 columns is 4 times more expensive per
> record than reading a table with 3000 columns, rather than twice as expensive.
> The increase in processing time is due to several Lists (fieldRefs,
> fieldOrdinals, and unwrappers), each of which the reader accesses by column
> number for each column in a record. Because each List has O(n) time for
> lookup by column number, these lookups grow increasingly expensive as the
> column count increases.
> When I patched the code to change those 3 Lists to Arrays, the query times
> became proportional.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
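
The List-versus-Array lookup cost described in the issue can be sketched as follows. This is an illustrative example only, not Spark's actual TableReader code: indexed access on a Scala `List` walks the linked list (O(i) per access, so O(n^2) to touch every column of a record), while indexed access on an `Array` is constant time.

```scala
// Illustrative sketch of the cost difference behind the SPARK-23963 patch
// (hypothetical helper names; not Spark's actual TableReader code).
object ListVsArrayLookup {
  // Indexed access xs(i) on a List walks i cells: O(i) per lookup,
  // so iterating all n indices this way is O(n^2) overall.
  def sumByIndexList(xs: List[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < xs.length) {
      total += xs(i)
      i += 1
    }
    total
  }

  // Indexed access xs(i) on an Array is O(1), so the same loop is O(n).
  def sumByIndexArray(xs: Array[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < xs.length) {
      total += xs(i)
      i += 1
    }
    total
  }

  def main(args: Array[String]): Unit = {
    val cols = (1 to 6000).toList
    // Converting once up front (as the patch does with fieldRefs,
    // fieldOrdinals, and unwrappers) pays O(n) a single time; every
    // subsequent per-record lookup is then O(1).
    val arr = cols.toArray
    assert(sumByIndexList(cols) == sumByIndexArray(arr))
    println(sumByIndexArray(arr))
  }
}
```

Both loops compute the same result; only the asymptotic cost differs, which is why the per-record overhead grows quadratically with column count for Lists but linearly for Arrays.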