Repository: spark
Updated Branches:
  refs/heads/master 64529b186 -> d6a52176a


[SPARK-16668][TEST] Test parquet reader for row groups containing both 
dictionary and plain encoded pages

## What changes were proposed in this pull request?

This patch adds an explicit test for [SPARK-14217] by setting the parquet 
dictionary and page size such that the generated parquet file spans across 3 
pages (within a single row group) where the first page is dictionary encoded 
and the remaining two are plain encoded.

## How was this patch tested?

1. ParquetEncodingSuite
2. Also manually tested that this test fails without 
https://github.com/apache/spark/pull/12279

Author: Sameer Agarwal <samee...@cs.berkeley.edu>

Closes #14304 from sameeragarwal/hybrid-encoding-test.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d6a52176
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d6a52176
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d6a52176

Branch: refs/heads/master
Commit: d6a52176ade92853f37167ad27631977dc79bc76
Parents: 64529b1
Author: Sameer Agarwal <samee...@cs.berkeley.edu>
Authored: Mon Jul 25 22:31:01 2016 +0800
Committer: Cheng Lian <l...@databricks.com>
Committed: Mon Jul 25 22:31:01 2016 +0800

----------------------------------------------------------------------
 .../parquet/ParquetEncodingSuite.scala          | 29 ++++++++++++++++++++
 1 file changed, 29 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/d6a52176/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
----------------------------------------------------------------------
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
index 88fcfce..c754188 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
@@ -16,6 +16,10 @@
  */
 package org.apache.spark.sql.execution.datasources.parquet
 
+import scala.collection.JavaConverters._
+
+import org.apache.parquet.hadoop.ParquetOutputFormat
+
 import org.apache.spark.sql.test.SharedSQLContext
 
 // TODO: this needs a lot more testing but it's currently not easy to test 
with the parquet
@@ -78,4 +82,29 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest 
with SharedSQLContex
       }}
     }
   }
+
+  test("Read row group containing both dictionary and plain encoded pages") {
+    withSQLConf(ParquetOutputFormat.DICTIONARY_PAGE_SIZE -> "2048",
+      ParquetOutputFormat.PAGE_SIZE -> "4096") {
+      withTempPath { dir =>
+        // In order to explicitly test for SPARK-14217, we set the parquet 
dictionary and page size
+        // such that the following data spans across 3 pages (within a single 
row group) where the
+        // first page is dictionary encoded and the remaining two are plain 
encoded.
+        val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
+        data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
+        val file = 
SpecificParquetRecordReaderBase.listDirectory(dir).asScala.head
+
+        val reader = new VectorizedParquetRecordReader
+        reader.initialize(file, null /* set columns to null to project all 
columns */)
+        val column = reader.resultBatch().column(0)
+        assert(reader.nextBatch())
+
+        (0 until 512).foreach { i =>
+          assert(column.getUTF8String(3 * i).toString == i.toString)
+          assert(column.getUTF8String(3 * i + 1).toString == i.toString)
+          assert(column.getUTF8String(3 * i + 2).toString == i.toString)
+        }
+      }
+    }
+  }
 }


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

Reply via email to