Chao Sun created SPARK-35640:
--------------------------------

             Summary: Refactor Parquet vectorized reader to remove duplicated code paths
                 Key: SPARK-35640
                 URL: https://issues.apache.org/jira/browse/SPARK-35640
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Chao Sun

Currently the Parquet vectorized code path contains a lot of duplicated code, such as the following:

{code:java}
public void readIntegers(
    int total,
    WritableColumnVector c,
    int rowId,
    int level,
    VectorizedValuesReader data) throws IOException {
  int left = total;
  while (left > 0) {
    // Decode the next group of definition levels once the current one is used up.
    if (this.currentCount == 0) {
      this.readNextGroup();
    }
    int n = Math.min(left, this.currentCount);
    switch (mode) {
      case RLE:
        // The whole run shares one level: either all values are defined or all are null.
        if (currentValue == level) {
          data.readIntegers(n, c, rowId);
        } else {
          c.putNulls(rowId, n);
        }
        break;
      case PACKED:
        // Levels can differ per row, so each one is checked individually.
        for (int i = 0; i < n; ++i) {
          if (currentBuffer[currentBufferIdx++] == level) {
            c.putInt(rowId + i, data.readInteger());
          } else {
            c.putNull(rowId + i);
          }
        }
        break;
    }
    rowId += n;
    left -= n;
    currentCount -= n;
  }
}
{code}

This makes the code hard to maintain, as any change to this logic needs to be replicated in 20+ places. The issue becomes more serious once we implement column index and complex type support for the vectorized path.

The duplication was originally introduced for performance. However, nowadays JIT compilers tend to be smart about this and will inline virtual calls as much as possible.
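
One possible direction, shown only as a rough sketch: keep a single copy of the RLE/PACKED loop and pass in a small per-type strategy object that does the actual read. Everything below is a hypothetical, self-contained model and not Spark code: DefLevelSketch, Updater, Group and IntSource are made-up names, an Object[] stands in for WritableColumnVector, and the groups are pre-decoded instead of being produced by readNextGroup().

```java
import java.util.Arrays;

public class DefLevelSketch {

  /** Stand-in for VectorizedValuesReader: hands out the non-null values in order. */
  static final class IntSource {
    final int[] values;
    int pos = 0;
    IntSource(int... values) { this.values = values; }
    int readInteger() { return values[pos++]; }
  }

  /** Hypothetical per-type strategy: the only part that differs between value types. */
  interface Updater {
    void readValue(int rowId, Object[] column, IntSource data);

    // Batch form; a real implementation could override this with a bulk read.
    default void readValues(int n, int rowId, Object[] column, IntSource data) {
      for (int i = 0; i < n; i++) {
        readValue(rowId + i, column, data);
      }
    }
  }

  static final Updater INT_UPDATER =
      (rowId, column, data) -> column[rowId] = data.readInteger();
  static final Updater LONG_UPDATER =
      (rowId, column, data) -> column[rowId] = (long) data.readInteger();

  enum Mode { RLE, PACKED }

  /** One already-decoded group of definition levels. */
  static final class Group {
    final Mode mode;
    final int rleLevel;        // used when mode == RLE
    final int[] packedLevels;  // used when mode == PACKED
    final int count;

    Group(Mode mode, int rleLevel, int[] packedLevels, int count) {
      this.mode = mode;
      this.rleLevel = rleLevel;
      this.packedLevels = packedLevels;
      this.count = count;
    }
    static Group rle(int level, int count) {
      return new Group(Mode.RLE, level, null, count);
    }
    static Group packed(int... levels) {
      return new Group(Mode.PACKED, 0, levels, levels.length);
    }
  }

  /** The single shared loop; slots that are never written stay null. */
  static void readBatch(
      Group[] groups, int level, Object[] column, IntSource data, Updater updater) {
    int rowId = 0;
    for (Group g : groups) {
      switch (g.mode) {
        case RLE:
          if (g.rleLevel == level) {
            updater.readValues(g.count, rowId, column, data);
          }
          break;
        case PACKED:
          for (int i = 0; i < g.count; i++) {
            if (g.packedLevels[i] == level) {
              updater.readValue(rowId + i, column, data);
            }
          }
          break;
      }
      rowId += g.count;
    }
  }

  public static void main(String[] args) {
    Group[] groups = { Group.rle(1, 2), Group.packed(1, 0, 1), Group.rle(0, 2) };
    Object[] ints = new Object[7];
    readBatch(groups, 1, ints, new IntSource(10, 20, 30, 40), INT_UPDATER);
    System.out.println(Arrays.toString(ints)); // [10, 20, 30, null, 40, null, null]
  }
}
```

With this shape, supporting another value type means adding one more Updater rather than another copy of the loop. Whether the JIT actually inlines the interface call in the hot path is the assumption stated above and would need to be confirmed with a benchmark.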