[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22622 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r223158108

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
    @@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
         }
       }

    +  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
    +    val tableName = "orcTable"
    +
    +    withTempDir { dir =>
    +      withTable(tableName) {
    +        val sqlStatement = orcImp match {
    +          case "native" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
    +               |USING ORC
    +               |OPTIONS (
    +               |  path '${dir.toURI}',
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  orc.column.encoding.direct 'uniqColumn'
    --- End diff --

    https://issues.apache.org/jira/browse/SPARK-25656 has been created for that.
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r223077569

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
    @@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
         }
       }

    +  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
    +    val tableName = "orcTable"
    +
    +    withTempDir { dir =>
    +      withTable(tableName) {
    +        val sqlStatement = orcImp match {
    +          case "native" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
    +               |USING ORC
    +               |OPTIONS (
    +               |  path '${dir.toURI}',
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  orc.column.encoding.direct 'uniqColumn'
    --- End diff --

    Maybe dictionary encoding could be a good candidate: `parquet.enable.dictionary`, `orc.dictionary.key.threshold`, et al.
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r223058276

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
    @@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
         }
       }

    +  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
    +    val tableName = "orcTable"
    +
    +    withTempDir { dir =>
    +      withTable(tableName) {
    +        val sqlStatement = orcImp match {
    +          case "native" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
    +               |USING ORC
    +               |OPTIONS (
    +               |  path '${dir.toURI}',
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  orc.column.encoding.direct 'uniqColumn'
    --- End diff --

    That sounds like a different issue. This PR covers both the `TBLPROPERTIES` and `OPTIONS` syntaxes, which were historically designed for this kind of configuration. I mean, this is not a data-source-specific PR. Also, the scope of this PR is only write-side configurations.

    In any case, +1 for adding an introduction section there with both Parquet and ORC examples. We had better give both read-side and write-side configuration examples, too. Could you file a JIRA issue for that?
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222911645

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
    @@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
         }
       }

    +  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
    +    val tableName = "orcTable"
    +
    +    withTempDir { dir =>
    +      withTable(tableName) {
    +        val sqlStatement = orcImp match {
    +          case "native" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
    +               |USING ORC
    +               |OPTIONS (
    +               |  path '${dir.toURI}',
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  orc.column.encoding.direct 'uniqColumn'
    --- End diff --

    Also give an example?
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222911529

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
    @@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
         }
       }

    +  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
    +    val tableName = "orcTable"
    +
    +    withTempDir { dir =>
    +      withTable(tableName) {
    +        val sqlStatement = orcImp match {
    +          case "native" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
    +               |USING ORC
    +               |OPTIONS (
    +               |  path '${dir.toURI}',
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  orc.column.encoding.direct 'uniqColumn'
    --- End diff --

    I am fine either way. However, our current doc does not explain that we pass data-source-specific options through to the underlying data source: https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options

    Could you help improve it?
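A minimal sketch of what a doc example for this pass-through could look like, assuming the writer hands unrecognized `orc.*` options on to ORC as this comment describes. `OrcWriteOptions` and `directEncodingOptions` are hypothetical names for illustration and are not part of Spark; the two option keys are the ones used in this PR's test.

```scala
// Hypothetical helper (not part of Spark): builds the ORC writer options
// that the test in this PR passes through OPTIONS / TBLPROPERTIES.
object OrcWriteOptions {
  def directEncodingOptions(
      directColumns: Seq[String],
      dictionaryKeyThreshold: Double = 1.0): Map[String, String] = {
    require(directColumns.nonEmpty, "at least one column name is required")
    Map(
      // Applies to every column of the table.
      "orc.dictionary.key.threshold" -> dictionaryKeyThreshold.toString,
      // ORC documents this key as a comma-separated list of column names.
      "orc.column.encoding.direct" -> directColumns.mkString(",")
    )
  }
}
```

With a `DataFrame` named `df` and an output `path` (both assumed here), `df.write.options(OrcWriteOptions.directEncodingOptions(Seq("uniqColumn"))).orc(path)` would be the DataFrame-side counterpart of the `OPTIONS (...)` clause in the test above.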
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222880436

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
    @@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
         }
       }

    +  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
    +    val tableName = "orcTable"
    +
    +    withTempDir { dir =>
    +      withTable(tableName) {
    +        val sqlStatement = orcImp match {
    +          case "native" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
    +               |USING ORC
    +               |OPTIONS (
    +               |  path '${dir.toURI}',
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  orc.column.encoding.direct 'uniqColumn'
    --- End diff --

    Well, Apache ORC is an independent Apache project with its own website and documentation. We should respect that. If we introduce new ORC configurations one by one on the Apache Spark website, we will eventually duplicate the Apache ORC documentation in the Apache Spark documentation. We had better guide ORC fans to the Apache ORC website. If something is missing there, they can file an ORC JIRA, not a SPARK JIRA.
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222876905

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
    @@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
         }
       }

    +  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
    +    val tableName = "orcTable"
    +
    +    withTempDir { dir =>
    +      withTable(tableName) {
    +        val sqlStatement = orcImp match {
    +          case "native" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
    +               |USING ORC
    +               |OPTIONS (
    +               |  path '${dir.toURI}',
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  orc.column.encoding.direct 'uniqColumn'
    --- End diff --

    This new feature needs a doc update. We need to let our end users know how to use it.
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222871049

    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
    @@ -182,4 +182,12 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
           }
         }
       }
    +
    +  test("Enforce direct encoding column-wise selectively") {
    +    Seq(true, false).foreach { convertMetastore =>
    +      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
    +        testSelectiveDictionaryEncoding(isSelective = false)
    --- End diff --

    Ok. I see. Thanks.
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222868403

    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
    @@ -182,4 +182,12 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
           }
         }
       }
    +
    +  test("Enforce direct encoding column-wise selectively") {
    +    Seq(true, false).foreach { convertMetastore =>
    +      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
    +        testSelectiveDictionaryEncoding(isSelective = false)
    --- End diff --

    Yep. This is based on the current behavior, which is somewhat related to your CTAS PR. Only the read path works as expected.
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222868495

    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
    @@ -182,4 +182,12 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
           }
         }
       }
    +
    +  test("Enforce direct encoding column-wise selectively") {
    +    Seq(true, false).foreach { convertMetastore =>
    +      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
    +        testSelectiveDictionaryEncoding(isSelective = false)
    --- End diff --

    When we change the Spark behavior later, this test will be adapted accordingly.
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222868074

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
    @@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
         }
       }

    +  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
    +    val tableName = "orcTable"
    +
    +    withTempDir { dir =>
    +      withTable(tableName) {
    +        val sqlStatement = orcImp match {
    +          case "native" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
    +               |USING ORC
    +               |OPTIONS (
    +               |  path '${dir.toURI}',
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  orc.column.encoding.direct 'uniqColumn'
    +               |)
    +            """.stripMargin
    +          case "hive" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
    +               |STORED AS ORC
    +               |LOCATION '${dir.toURI}'
    +               |TBLPROPERTIES (
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  hive.exec.orc.dictionary.key.size.threshold '1.0',
    +               |  orc.column.encoding.direct 'uniqColumn'
    +               |)
    +            """.stripMargin
    +          case impl =>
    +            throw new UnsupportedOperationException(s"Unknown ORC implementation: $impl")
    +        }
    +
    +        sql(sqlStatement)
    +        sql(s"INSERT INTO $tableName VALUES ('94086', 'random-uuid-string', 0.0)")
    +
    +        val partFiles = dir.listFiles()
    +          .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
    +        assert(partFiles.length === 1)
    +
    +        val orcFilePath = new Path(partFiles.head.getAbsolutePath)
    +        val readerOptions = OrcFile.readerOptions(new Configuration())
    +        val reader = OrcFile.createReader(orcFilePath, readerOptions)
    +        var recordReader: RecordReaderImpl = null
    +        try {
    +          recordReader = reader.rows.asInstanceOf[RecordReaderImpl]
    +
    +          // Check the kind
    +          val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
    +          assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
    --- End diff --

    Sure!
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222865396

    --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
    @@ -182,4 +182,12 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
           }
         }
       }
    +
    +  test("Enforce direct encoding column-wise selectively") {
    +    Seq(true, false).foreach { convertMetastore =>
    +      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
    +        testSelectiveDictionaryEncoding(isSelective = false)
    --- End diff --

    So even with `CONVERT_METASTORE_ORC` set to true, we still can't use selective direct encoding?
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222855990

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
    @@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
         }
       }

    +  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
    +    val tableName = "orcTable"
    +
    +    withTempDir { dir =>
    +      withTable(tableName) {
    +        val sqlStatement = orcImp match {
    +          case "native" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
    +               |USING ORC
    +               |OPTIONS (
    +               |  path '${dir.toURI}',
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  orc.column.encoding.direct 'uniqColumn'
    +               |)
    +            """.stripMargin
    +          case "hive" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
    +               |STORED AS ORC
    +               |LOCATION '${dir.toURI}'
    +               |TBLPROPERTIES (
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  hive.exec.orc.dictionary.key.size.threshold '1.0',
    +               |  orc.column.encoding.direct 'uniqColumn'
    +               |)
    +            """.stripMargin
    +          case impl =>
    +            throw new UnsupportedOperationException(s"Unknown ORC implementation: $impl")
    +        }
    +
    +        sql(sqlStatement)
    +        sql(s"INSERT INTO $tableName VALUES ('94086', 'random-uuid-string', 0.0)")
    +
    +        val partFiles = dir.listFiles()
    +          .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
    +        assert(partFiles.length === 1)
    +
    +        val orcFilePath = new Path(partFiles.head.getAbsolutePath)
    +        val readerOptions = OrcFile.readerOptions(new Configuration())
    +        val reader = OrcFile.createReader(orcFilePath, readerOptions)
    +        var recordReader: RecordReaderImpl = null
    +        try {
    +          recordReader = reader.rows.asInstanceOf[RecordReaderImpl]
    +
    +          // Check the kind
    +          val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
    +          assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
    --- End diff --

    Could you write some comments to explain what `DICTIONARY_V2`, `DIRECT_V2` and `DIRECT` are?
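For readers with the same question, here is a small self-contained sketch of the encoding kinds the test checks. The `Kind` trait below is a stand-in written for this note, not ORC's own `ColumnEncoding.Kind` class, and `expectedKinds` is a hypothetical helper; the one-line descriptions follow my reading of the ORC specification, so treat them as a summary rather than the authoritative definition.

```scala
object OrcEncodingKinds {
  // Stand-in for ORC's column encoding kinds (names mirror the ORC spec).
  sealed trait Kind
  case object DIRECT extends Kind        // values stored directly (RLE v1 where RLE applies)
  case object DICTIONARY extends Kind    // dictionary of unique values, RLE v1
  case object DIRECT_V2 extends Kind     // values stored directly, RLE v2
  case object DICTIONARY_V2 extends Kind // dictionary of unique values, RLE v2

  // Expected kind per column id for the test table
  // (zipcode STRING, uniqColumn STRING, value DOUBLE).
  // Column 0 is the root struct, so the fields start at 1.
  def expectedKinds(isSelective: Boolean): Map[Int, Kind] = Map(
    1 -> DICTIONARY_V2,                                   // zipcode
    2 -> (if (isSelective) DIRECT_V2 else DICTIONARY_V2), // uniqColumn
    3 -> DIRECT                                           // value: doubles have no dictionary form
  )
}
```

Only column 2 differs between the two modes, which matches the shape of the assertions proposed elsewhere in this thread.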
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222750086

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
    @@ -115,6 +116,71 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
         }
       }

    +  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
    +    val tableName = "orcTable"
    +
    +    withTempDir { dir =>
    +      withTable(tableName) {
    +        val sqlStatement = orcImp match {
    +          case "native" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uuid STRING, value DOUBLE)
    +               |USING ORC
    +               |OPTIONS (
    +               |  path '${dir.toURI}',
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  orc.column.encoding.direct 'uuid'
    +               |)
    +            """.stripMargin
    +          case "hive" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uuid STRING, value DOUBLE)
    +               |STORED AS ORC
    +               |LOCATION '${dir.toURI}'
    +               |TBLPROPERTIES (
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  hive.exec.orc.dictionary.key.size.threshold '1.0',
    +               |  orc.column.encoding.direct 'uuid'
    +               |)
    +            """.stripMargin
    +          case impl =>
    +            throw new UnsupportedOperationException(s"Unknown ORC implementation: $impl")
    +        }
    +
    +        sql(sqlStatement)
    +        sql(s"INSERT INTO $tableName VALUES ('94086', 'random-uuid-string', 0.0)")
    +
    +        val partFiles = dir.listFiles()
    +          .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
    +        assert(partFiles.length === 1)
    +
    +        val orcFilePath = new Path(partFiles.head.getAbsolutePath)
    +        val readerOptions = OrcFile.readerOptions(new Configuration())
    +        val reader = OrcFile.createReader(orcFilePath, readerOptions)
    +        var recordReader: RecordReaderImpl = null
    +        try {
    +          recordReader = reader.rows.asInstanceOf[RecordReaderImpl]
    +
    +          // Check the kind
    +          val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
    +          if (isSelective) {
    +            assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
    --- End diff --

    For this, I will update it as follows.
    ```
    assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
    if (isSelective) {
      assert(stripe.getColumns(2).getKind === DIRECT_V2)
    } else {
      assert(stripe.getColumns(2).getKind === DICTIONARY_V2)
    }
    assert(stripe.getColumns(3).getKind === DIRECT)
    ```
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222535553

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
    @@ -115,6 +116,71 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
         }
       }

    +  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
    +    val tableName = "orcTable"
    +
    +    withTempDir { dir =>
    +      withTable(tableName) {
    +        val sqlStatement = orcImp match {
    +          case "native" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uuid STRING, value DOUBLE)
    +               |USING ORC
    +               |OPTIONS (
    +               |  path '${dir.toURI}',
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  orc.column.encoding.direct 'uuid'
    --- End diff --

    How about changing the column name? I thought `uuid` was some kind of enum value representing the encoding.
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222535182

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
    @@ -115,6 +116,71 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
         }
       }

    +  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
    +    val tableName = "orcTable"
    +
    +    withTempDir { dir =>
    +      withTable(tableName) {
    +        val sqlStatement = orcImp match {
    +          case "native" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uuid STRING, value DOUBLE)
    +               |USING ORC
    +               |OPTIONS (
    +               |  path '${dir.toURI}',
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  orc.column.encoding.direct 'uuid'
    +               |)
    +            """.stripMargin
    +          case "hive" =>
    +            s"""
    +               |CREATE TABLE $tableName (zipcode STRING, uuid STRING, value DOUBLE)
    +               |STORED AS ORC
    +               |LOCATION '${dir.toURI}'
    +               |TBLPROPERTIES (
    +               |  orc.dictionary.key.threshold '1.0',
    +               |  hive.exec.orc.dictionary.key.size.threshold '1.0',
    +               |  orc.column.encoding.direct 'uuid'
    +               |)
    +            """.stripMargin
    +          case impl =>
    +            throw new UnsupportedOperationException(s"Unknown ORC implementation: $impl")
    +        }
    +
    +        sql(sqlStatement)
    +        sql(s"INSERT INTO $tableName VALUES ('94086', 'random-uuid-string', 0.0)")
    +
    +        val partFiles = dir.listFiles()
    +          .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
    +        assert(partFiles.length === 1)
    +
    +        val orcFilePath = new Path(partFiles.head.getAbsolutePath)
    +        val readerOptions = OrcFile.readerOptions(new Configuration())
    +        val reader = OrcFile.createReader(orcFilePath, readerOptions)
    +        var recordReader: RecordReaderImpl = null
    +        try {
    +          recordReader = reader.rows.asInstanceOf[RecordReaderImpl]
    +
    +          // Check the kind
    +          val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
    +          if (isSelective) {
    +            assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
    --- End diff --

    @dongjoon-hyun, how about:

    ```
    assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
    assert(stripe.getColumns(3).getKind === DIRECT)
    if (isSelective) {
      assert(stripe.getColumns(2).getKind === DIRECT_V2)
    } else {
      assert(stripe.getColumns(2).getKind === DICTIONARY_V2)
    }
    ```
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22622#discussion_r222535254

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
    @@ -284,4 +350,8 @@ class OrcSourceSuite extends OrcSuite with SharedSQLContext {
       test("Check BloomFilter creation") {
         testBloomFilterCreation(Kind.BLOOM_FILTER_UTF8) // After ORC-101
       }
    +
    +  test("Enforce direct encoding column-wise selectively") {
    +    testSelectiveDictionaryEncoding(true)
    --- End diff --

    How about `testSelectiveDictionaryEncoding(isSelective = true)`?
[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...
GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/22622

    [SPARK-25635][SQL][BUILD] Support selective direct encoding in native ORC write

    ## What changes were proposed in this pull request?

    Before ORC 1.5.3, `orc.dictionary.key.threshold` and `hive.exec.orc.dictionary.key.size.threshold` were applied to all columns. This has been a big hurdle to enabling dictionary encoding. Since ORC 1.5.3, `orc.column.encoding.direct` is available to enforce direct encoding selectively, column by column. This PR adds that feature by upgrading ORC from 1.5.2 to 1.5.3.

    The following are the patches in ORC 1.5.3; this feature is the only one directly related to Spark.
    ```
    ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts multi-byte data (gopalv)
    ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
    ORC-405: Remove calcite as a dependency from the benchmarks.
    ORC-375: Fix libhdfs on gcc7 by adding #include two places.
    ORC-383: Parallel builds fails with ConcurrentModificationException
    ORC-382: Apache rat exclusions + add rat check to travis
    ORC-401: Fix incorrect quoting in specification.
    ORC-385: Change RecordReader to extend Closeable.
    ORC-384: [C++] fix memory leak when loading non-ORC files
    ORC-391: [c++] parseType does not accept underscore in the field name
    ORC-397: Allow selective disabling of dictionary encoding. Original patch was by Mithun Radhakrishnan.
    ORC-389: Add ability to not decode Acid metadata columns
    ```

    ## How was this patch tested?

    Pass the Jenkins with newly added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-25635

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22622.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22622

commit 39b7fd63c4ce5cbe6dc628ffb0170aef361461ef
Author: Dongjoon Hyun
Date:   2018-10-03T19:03:44Z

    [SPARK-25635][SQL][BUILD] Support selective direct encoding in native ORC write
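To make the interaction between the table-wide threshold and the per-column override concrete, here is a simplified model of the decision described above. It is an assumption-laden sketch, not ORC's writer code: `DictionaryDecision` and `usesDictionary` are hypothetical names, and the rule "dictionary when distinct/total is at most the threshold" is my understanding of how `orc.dictionary.key.threshold` is applied.

```scala
// Simplified model (an assumption, not ORC's actual writer logic) of how a
// string column ends up dictionary-encoded or direct-encoded.
object DictionaryDecision {
  def usesDictionary(
      column: String,
      distinctValues: Long,
      totalValues: Long,
      keyThreshold: Double,          // models orc.dictionary.key.threshold
      directColumns: Set[String]     // models orc.column.encoding.direct
  ): Boolean = {
    if (directColumns.contains(column)) {
      // The per-column override wins regardless of the ratio.
      false
    } else {
      val ratio =
        if (totalValues > 0) distinctValues.toDouble / totalValues else 0.0
      ratio <= keyThreshold
    }
  }
}
```

Under this model, a table with a single inserted row has a distinct-to-total ratio of 1.0, so a threshold of `1.0` keeps every string column on dictionary encoding unless it is listed in `orc.column.encoding.direct`. That is presumably why the new test sets the threshold to `'1.0'` before naming the one column that should be direct.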