This is an automated email from the ASF dual-hosted git repository.

sunchao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 01f9d270ea1 [SPARK-40128][SQL] Make the VectorizedColumnReader recognize DELTA_LENGTH_BYTE_ARRAY as a standalone column encoding

01f9d270ea1 is described below

commit 01f9d270ea14d4a9b3a5f326fc3f721ddd23e3f4
Author: Dennis Huo <dennis....@snowflake.com>
AuthorDate: Wed Aug 17 16:21:43 2022 -0700

    [SPARK-40128][SQL] Make the VectorizedColumnReader recognize DELTA_LENGTH_BYTE_ARRAY as a standalone column encoding

    ### What changes were proposed in this pull request?
    Add DELTA_LENGTH_BYTE_ARRAY as a recognized encoding in VectorizedColumnReader so that vectorized reads succeed for columns that use DELTA_LENGTH_BYTE_ARRAY as a standalone encoding.

    ### Why are the changes needed?
    Spark currently throws an exception for DELTA_LENGTH_BYTE_ARRAY columns when vectorized reads are enabled, e.g. when reading `delta_length_byte_array.parquet` from https://github.com/apache/parquet-testing:

        java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_LENGTH_BYTE_ARRAY

    ### Does this PR introduce _any_ user-facing change?
    Yes - previously such reads threw an UnsupportedOperationException; now the encoding is read the same way as when vectorized reads are disabled.

    ### How was this patch tested?
    Added a test case to ParquetIOSuite; verified that it fails without the fix to VectorizedColumnReader and passes with it.

    Closes #37557 from sfc-gh-dhuo/support-parquet-delta-length-byte-array.
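For context on what the new reader handles: in Parquet's DELTA_LENGTH_BYTE_ARRAY encoding, all value lengths are stored up front (delta-binary-packed in the real format), followed by the values' bytes concatenated back to back. The sketch below is a simplified illustration of that idea only, not Spark's implementation: it uses plain int lengths instead of delta-binary-packed ones, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the DELTA_LENGTH_BYTE_ARRAY idea: lengths first,
// then all value bytes concatenated. The real encoding delta-binary-packs
// the lengths; plain ints are used here for clarity.
public class DeltaLengthByteArraySketch {

  // Hypothetical helper: slice the concatenated bytes using the lengths.
  static List<String> decode(int[] lengths, byte[] data) {
    List<String> values = new ArrayList<>();
    int offset = 0;
    for (int len : lengths) {
      values.add(new String(data, offset, len, StandardCharsets.UTF_8));
      offset += len;
    }
    return values;
  }

  public static void main(String[] args) {
    // "apple", "banana", "mango" stored as lengths {5, 6, 5}
    // followed by the concatenated bytes "applebananamango".
    byte[] data = "applebananamango".getBytes(StandardCharsets.UTF_8);
    int[] lengths = {5, 6, 5};
    System.out.println(decode(lengths, data));
  }
}
```

Because every length is known before any value bytes are touched, a vectorized reader can compute all offsets in one pass, which is why this encoding fits the VectorizedColumnReader model well.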
Authored-by: Dennis Huo <dennis....@snowflake.com>
Signed-off-by: Chao Sun <sunc...@apple.com>
---
 .../datasources/parquet/VectorizedColumnReader.java       |   2 ++
 .../resources/test-data/delta_length_byte_array.parquet   | Bin 0 -> 3072 bytes
 .../execution/datasources/parquet/ParquetIOSuite.scala    |  10 ++++++++++
 3 files changed, 12 insertions(+)

diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
index c2e85da3884..64178fdd72d 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
@@ -329,6 +329,8 @@ public class VectorizedColumnReader {
         return new VectorizedPlainValuesReader();
       case DELTA_BYTE_ARRAY:
         return new VectorizedDeltaByteArrayReader();
+      case DELTA_LENGTH_BYTE_ARRAY:
+        return new VectorizedDeltaLengthByteArrayReader();
       case DELTA_BINARY_PACKED:
         return new VectorizedDeltaBinaryPackedReader();
       case RLE:
diff --git a/sql/core/src/test/resources/test-data/delta_length_byte_array.parquet b/sql/core/src/test/resources/test-data/delta_length_byte_array.parquet
new file mode 100644
index 00000000000..ead505a1a1f
Binary files /dev/null and b/sql/core/src/test/resources/test-data/delta_length_byte_array.parquet differ
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
index 5a8f4563756..0458e5a1a14 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala
@@ -1307,6 +1307,16 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession
     }
   }
 
+  test("SPARK-40128 read DELTA_LENGTH_BYTE_ARRAY encoded strings") {
+    withAllParquetReaders {
+      checkAnswer(
+        // "fruit" column in this file is encoded using DELTA_LENGTH_BYTE_ARRAY.
+        // The file comes from https://github.com/apache/parquet-testing
+        readResourceParquetFile("test-data/delta_length_byte_array.parquet"),
+        (0 to 999).map(i => Row("apple_banana_mango" + Integer.toString(i * i))))
+    }
+  }
+
   test("SPARK-12589 copy() on rows returned from reader works for strings") {
     withTempPath { dir =>
       val data = (1, "abc") ::(2, "helloabcde") :: Nil