[ https://issues.apache.org/jira/browse/SPARK-35640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-35640:
------------------------------------

    Assignee: Apache Spark

> Refactor Parquet vectorized reader to remove duplicated code paths
> ------------------------------------------------------------------
>
>                 Key: SPARK-35640
>                 URL: https://issues.apache.org/jira/browse/SPARK-35640
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Assignee: Apache Spark
>            Priority: Major
>
> Currently, the Parquet vectorized code path contains a lot of duplicated
> code, such as the following:
> {code:java}
>   public void readIntegers(
>       int total,
>       WritableColumnVector c,
>       int rowId,
>       int level,
>       VectorizedValuesReader data) throws IOException {
>     int left = total;
>     while (left > 0) {
>       // Refill the current run of definition levels once it is exhausted.
>       if (this.currentCount == 0) this.readNextGroup();
>       int n = Math.min(left, this.currentCount);
>       switch (mode) {
>         case RLE:
>           // One repeated level covers all n rows: all values or all nulls.
>           if (currentValue == level) {
>             data.readIntegers(n, c, rowId);
>           } else {
>             c.putNulls(rowId, n);
>           }
>           break;
>         case PACKED:
>           // Levels differ row by row, so each row is checked individually.
>           for (int i = 0; i < n; ++i) {
>             if (currentBuffer[currentBufferIdx++] == level) {
>               c.putInt(rowId + i, data.readInteger());
>             } else {
>               c.putNull(rowId + i);
>             }
>           }
>           break;
>       }
>       rowId += n;
>       left -= n;
>       currentCount -= n;
>     }
>   }
> {code}
> This makes the code hard to maintain, since any change to this logic must be
> replicated in 20+ places. The issue will become more serious when we
> implement column index and complex type support for the vectorized path.
> The duplication was originally introduced for performance. However, nowadays
> JIT compilers tend to be smart about this and will inline virtual calls as
> much as possible.
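
To illustrate the kind of refactoring the description argues for, here is a minimal, self-contained sketch. It is not the actual patch and does not use Spark's classes: `Column`, `Values`, `Updater`, and `readBatch` are hypothetical stand-ins invented for this example, and the RLE/PACKED handling is simplified away by assuming the definition levels have already been decoded into a plain array. The point is only that the null-handling loop can be written once, with the type-specific part isolated behind a small interface. Whether the JIT actually inlines the interface call is the ticket's expectation and is not measured here.

```java
import java.util.Arrays;

final class SharedLoopSketch {

    /** Hypothetical stand-in for WritableColumnVector: long slots plus a null mask. */
    static final class Column {
        final long[] values;
        final boolean[] nulls;

        Column(int capacity) {
            values = new long[capacity];
            nulls = new boolean[capacity];
        }
    }

    /** Hypothetical stand-in for VectorizedValuesReader over plain int values. */
    static final class Values {
        private final int[] data;
        private int pos = 0;

        Values(int... data) {
            this.data = data;
        }

        int readInteger() {
            return data[pos++];
        }
    }

    /** The only type-specific piece: how one non-null value is decoded and stored. */
    interface Updater {
        void readValue(int rowId, Column c, Values data);
    }

    /** Stores the int as-is (sign-extended into the long slot). */
    static final Updater INT =
        (rowId, c, data) -> c.values[rowId] = data.readInteger();

    /** Stores the int reinterpreted as an unsigned 32-bit value. */
    static final Updater UNSIGNED_INT =
        (rowId, c, data) -> c.values[rowId] = Integer.toUnsignedLong(data.readInteger());

    /**
     * Shared loop, written once for every value type. levels[i] is the
     * (already decoded) definition level of row i; a row holds a value only
     * when its level equals maxLevel, otherwise it is null.
     */
    static void readBatch(int total, Column c, int rowId, int maxLevel,
                          int[] levels, Values data, Updater updater) {
        for (int i = 0; i < total; i++) {
            if (levels[i] == maxLevel) {
                updater.readValue(rowId + i, c, data);
            } else {
                c.nulls[rowId + i] = true;
            }
        }
    }

    public static void main(String[] args) {
        int[] levels = {1, 0, 1}; // row 1 is null

        Column signed = new Column(3);
        readBatch(3, signed, 0, 1, levels, new Values(7, -1), INT);

        Column unsigned = new Column(3);
        readBatch(3, unsigned, 0, 1, levels, new Values(7, -1), UNSIGNED_INT);

        System.out.println(Arrays.toString(signed.values));
        System.out.println(Arrays.toString(unsigned.values));
        System.out.println(Arrays.toString(unsigned.nulls));
    }
}
```

With this shape, supporting another value type means adding one more `Updater`, and a change to the null-handling logic touches only `readBatch`.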