[ https://issues.apache.org/jira/browse/SPARK-35003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317763#comment-17317763 ]
Apache Spark commented on SPARK-35003:
--------------------------------------

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32104

> Improve performance for reading smallint in vectorized Parquet reader
> ---------------------------------------------------------------------
>
>                 Key: SPARK-35003
>                 URL: https://issues.apache.org/jira/browse/SPARK-35003
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Priority: Major
>
> Currently {{VectorizedRleValuesReader}} reads short values as follows:
> {code:java}
> for (int i = 0; i < n; i++) {
>   c.putShort(rowId + i, (short) data.readInteger());
> }
> {code}
> For PLAIN encoding, {{readInteger}} is implemented as:
> {code:java}
> public final int readInteger() {
>   return getBuffer(4).getInt();
> }
> {code}
> This means every value requires its own {{slice}} call on the buffer, which
> is more expensive than slicing once for the whole batch and then reading the
> ints out of that single chunk.
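The difference between the two access patterns can be illustrated with a minimal, standalone sketch. It is not the code from the pull request: the class {{BatchShortRead}} and its methods {{encode}}, {{readPerValue}} and {{readBatch}} are hypothetical names, and a plain {{java.nio.ByteBuffer}} stands in for the reader's input stream. It assumes 4-byte little-endian values, as used by Parquet PLAIN encoding for INT32.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from Spark.
final class BatchShortRead {

  // Encode ints as 4-byte little-endian values (the PLAIN layout for INT32).
  static ByteBuffer encode(int... values) {
    ByteBuffer buf = ByteBuffer.allocate(4 * values.length)
        .order(ByteOrder.LITTLE_ENDIAN);
    for (int v : values) {
      buf.putInt(v);
    }
    buf.flip();
    return buf;
  }

  // Per-value pattern: one slice per value, mirroring getBuffer(4).getInt().
  static void readPerValue(ByteBuffer in, short[] out, int n) {
    for (int i = 0; i < n; i++) {
      ByteBuffer one = in.slice().order(ByteOrder.LITTLE_ENDIAN);
      one.limit(4);                      // restrict the view to a single int
      out[i] = (short) one.getInt();     // keep only the low 16 bits
      in.position(in.position() + 4);    // advance past the consumed value
    }
  }

  // Batch pattern: slice once for all n values, then read from that slice.
  static void readBatch(ByteBuffer in, short[] out, int n) {
    ByteBuffer chunk = in.slice().order(ByteOrder.LITTLE_ENDIAN);
    chunk.limit(4 * n);
    for (int i = 0; i < n; i++) {
      out[i] = (short) chunk.getInt();
    }
    in.position(in.position() + 4 * n);  // advance past the whole batch
  }

  public static void main(String[] args) {
    short[] perValue = new short[3];
    short[] batch = new short[3];
    readPerValue(encode(1, -2, 70000), perValue, 3);
    readBatch(encode(1, -2, 70000), batch, 3);
    System.out.println(Arrays.equals(perValue, batch));
  }
}
```

Both methods produce the same values and leave the input at the same position; the batch variant simply creates one buffer view for n values instead of n views. Whether this yields a measurable gain in the real reader depends on how {{getBuffer}} is implemented there, which this sketch does not model.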