[jira] [Created] (SPARK-35640) Refactor Parquet vectorized reader to remove duplicated code paths

Chao Sun (Jira) Thu, 03 Jun 2021 17:59:06 -0700

Chao Sun created SPARK-35640:
--------------------------------

             Summary: Refactor Parquet vectorized reader to remove duplicated code paths
                 Key: SPARK-35640
                 URL: https://issues.apache.org/jira/browse/SPARK-35640
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Chao Sun



The Parquet vectorized read path currently contains a lot of duplicated code. 
For example:
{code:java}
  public void readIntegers(
      int total,
      WritableColumnVector c,
      int rowId,
      int level,
      VectorizedValuesReader data) throws IOException {
    int left = total;
    while (left > 0) {
      if (this.currentCount == 0) this.readNextGroup();
      int n = Math.min(left, this.currentCount);
      switch (mode) {
        case RLE:
          if (currentValue == level) {
            data.readIntegers(n, c, rowId);
          } else {
            c.putNulls(rowId, n);
          }
          break;
        case PACKED:
          for (int i = 0; i < n; ++i) {
            if (currentBuffer[currentBufferIdx++] == level) {
              c.putInt(rowId + i, data.readInteger());
            } else {
              c.putNull(rowId + i);
            }
          }
          break;
      }
      rowId += n;
      left -= n;
      currentCount -= n;
    }
  }
{code}

This makes the code hard to maintain, because any change to this logic has to be 
replicated in 20+ places. The problem will get worse once we implement 
column index and complex type support for the vectorized path.

The duplication was originally introduced for performance. However, nowadays JIT 
compilers tend to handle this well and will inline virtual calls as much as possible.
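
One possible shape for the refactoring, as a minimal sketch only: keep a single 
copy of the RLE/PACKED loop and pass the type-specific part in through a small 
interface. The names Updater, Values and readRun below are hypothetical 
stand-ins and not existing Spark classes. A plain Object[] replaces 
WritableColumnVector, and each call handles one already-decoded run instead of 
calling readNextGroup().

```java
import java.util.Arrays;

public class UpdaterSketch {
  enum Mode { RLE, PACKED }

  /** Hypothetical stand-in for VectorizedValuesReader: yields decoded values in order. */
  static final class Values {
    private final int[] data;
    private int pos;
    Values(int... data) { this.data = data; }
    int readInteger() { return data[pos++]; }
    long readLong() { return data[pos++]; }
  }

  /** Hypothetical per-type callback: the only part that differs between readIntegers, readLongs, ... */
  interface Updater {
    void readValue(int rowId, Object[] column, Values data);

    // Batch variant; a real implementation could override this with a bulk copy.
    default void readValues(int n, int rowId, Object[] column, Values data) {
      for (int i = 0; i < n; i++) readValue(rowId + i, column, data);
    }
  }

  static final Updater INT = (rowId, column, data) -> column[rowId] = data.readInteger();
  static final Updater LONG = (rowId, column, data) -> column[rowId] = data.readLong();

  /** Single shared loop body for one run of n definition levels (assumed already decoded). */
  static void readRun(Mode mode, int[] levels, int n, int level,
                      Object[] column, int rowId, Values data, Updater updater) {
    switch (mode) {
      case RLE:
        // levels[0] is the repeated definition level for the whole run.
        if (levels[0] == level) {
          updater.readValues(n, rowId, column, data);
        } else {
          Arrays.fill(column, rowId, rowId + n, null);
        }
        break;
      case PACKED:
        // One definition level per row; null rows consume no value from data.
        for (int i = 0; i < n; i++) {
          if (levels[i] == level) {
            updater.readValue(rowId + i, column, data);
          } else {
            column[rowId + i] = null;
          }
        }
        break;
    }
  }

  public static void main(String[] args) {
    Object[] ints = new Object[5];
    Values v = new Values(7, 8, 9);
    readRun(Mode.RLE, new int[] {1}, 2, 1, ints, 0, v, INT);          // rows 0-1 are defined
    readRun(Mode.PACKED, new int[] {0, 1, 0}, 3, 1, ints, 2, v, INT); // only row 3 is defined
    System.out.println(Arrays.toString(ints)); // prints [7, 8, null, 9, null]
  }
}
```

With this shape, supporting another value type means adding one small Updater 
rather than another copy of the loop. Whether the JIT actually inlines the 
updater calls on the hot path would still need to be confirmed with benchmarks.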



