[ https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Li updated SPARK-23963:
----------------------------
    Fix Version/s: 2.3.1
                   2.2.2

> Queries on text-based Hive tables grow disproportionately slower as the number of columns increase
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23963
>                 URL: https://issues.apache.org/jira/browse/SPARK-23963
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Bruce Robbins
>            Assignee: Bruce Robbins
>            Priority: Minor
>             Fix For: 2.2.2, 2.3.1, 2.4.0
>
>
> TableReader gets disproportionately slower as the number of columns in the
> query increases.
> For example, reading a table with 6000 columns is 4 times more expensive per
> record than reading a table with 3000 columns, rather than twice as expensive.
> The increase in processing time is due to several Lists (fieldRefs,
> fieldOrdinals, and unwrappers), each of which the reader accesses by column
> number for each column in a record. Because each List has O(n) time for
> lookup by column number, these lookups grow increasingly expensive as the
> column count increases.
> When I patched the code to change those 3 Lists to Arrays, the query times
> became proportional.
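
The "4 times for twice the columns" figure in the description is what one would expect from positional lookups on a linked list. The following is a minimal sketch of that arithmetic, not Spark code: it assumes the Lists behave like singly linked lists, and the names `Node`, `build_linked_list`, and `hops_per_record` are hypothetical. It counts the node hops needed to read every column of one record by index. The column counts are scaled down from 3000/6000 to 300/600 so it runs quickly.

```python
# Hypothetical model (not Spark code): cost of reading every column of one
# record by position from a singly linked list.


class Node:
    """One cell of a singly linked list."""

    __slots__ = ("value", "next")

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def build_linked_list(n):
    """Build a linked list holding 0..n-1, standing in for per-column metadata."""
    head = None
    for i in reversed(range(n)):
        head = Node(i, head)
    return head


def hops_per_record(head, n):
    """Count node hops when each of n columns is looked up by its index."""
    hops = 0
    for col in range(n):
        node = head
        # Reaching index `col` means walking `col` links from the head.
        for _ in range(col):
            node = node.next
            hops += 1
        assert node.value == col
    return hops


if __name__ == "__main__":
    small, large = 300, 600  # scaled-down stand-ins for 3000 and 6000 columns
    small_hops = hops_per_record(build_linked_list(small), small)
    large_hops = hops_per_record(build_linked_list(large), large)
    print(small_hops, large_hops, large_hops / small_hops)
```

The hop count is n(n-1)/2, so for n = 3000 and n = 6000 it comes to 4,498,500 and 17,997,000, a ratio of roughly 4. With an array, each lookup is constant time, so the per-record cost is n lookups and doubling the column count only doubles the work, which matches the proportional query times reported after the patch.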