[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22622


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r223158108
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
 }
   }
 
+  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
+val tableName = "orcTable"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.dictionary.key.threshold '1.0',
+   |  orc.column.encoding.direct 'uniqColumn'
--- End diff --

https://issues.apache.org/jira/browse/SPARK-25656 is created for that.


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r223077569
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
 }
   }
 
+  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
+val tableName = "orcTable"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.dictionary.key.threshold '1.0',
+   |  orc.column.encoding.direct 'uniqColumn'
--- End diff --

Maybe, dictionary encoding could be a good candidate; 
`parquet.enable.dictionary` and `orc.dictionary.key.threshold` et al.
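
As a rough sketch of the kind of example such a doc section might carry (the DataFrame `df`, the output paths, and a running SparkSession are assumed here, not taken from the PR):

```scala
// Hypothetical sketch: dictionary-encoding options are passed straight
// through to the underlying writers.
df.write.format("parquet")
  .option("parquet.enable.dictionary", "true")    // Parquet dictionary toggle
  .save("/tmp/parquet_dict_example")

df.write.format("orc")
  .option("orc.dictionary.key.threshold", "1.0")  // ORC dictionary threshold
  .save("/tmp/orc_dict_example")
```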


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r223058276
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
 }
   }
 
+  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
+val tableName = "orcTable"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.dictionary.key.threshold '1.0',
+   |  orc.column.encoding.direct 'uniqColumn'
--- End diff --

That sounds like a different issue. This PR covers both the `TBLPROPERTIES` and 
`OPTIONS` syntaxes, which have historically been designed for that 
configuration purpose. I mean, this is not a data-source-specific PR. Also, the 
scope of this PR is only write-side configurations.

In any case, +1 for adding an introduction section with both Parquet and ORC 
examples there. We had better give both read-side and write-side configuration 
examples, too. Could you file a JIRA issue for that?


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222911645
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
 }
   }
 
+  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
+val tableName = "orcTable"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.dictionary.key.threshold '1.0',
+   |  orc.column.encoding.direct 'uniqColumn'
--- End diff --

Also give an example?


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222911529
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
 }
   }
 
+  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
+val tableName = "orcTable"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.dictionary.key.threshold '1.0',
+   |  orc.column.encoding.direct 'uniqColumn'
--- End diff --

I am fine either way. However, our current doc does not explain that we pass 
data-source-specific options through to the underlying data source:


https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options

Could you help improve it? 
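
For reference, a minimal sketch of what that doc section could show (the path is a placeholder; the option key is the one introduced by this PR, and the DataFrame `df` plus a running SparkSession are assumed):

```scala
// Options Spark does not recognize itself are handed down to the
// underlying data source; here the native ORC writer receives
// orc.column.encoding.direct and forces direct encoding for that column.
df.write
  .format("orc")
  .option("orc.column.encoding.direct", "uniqColumn")
  .save("/tmp/orc_options_example")
```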



---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222880436
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
 }
   }
 
+  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
+val tableName = "orcTable"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.dictionary.key.threshold '1.0',
+   |  orc.column.encoding.direct 'uniqColumn'
--- End diff --

Ur, Apache ORC is an independent Apache project with its own website and 
documentation. We should respect that. If we introduce new ORC configurations 
one by one on the Apache Spark website, it will eventually duplicate the 
Apache ORC documentation inside the Apache Spark documentation.

We had better guide ORC fans to the Apache ORC website. If something is 
missing there, they can file an ORC JIRA, not a SPARK JIRA.



---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222876905
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
 }
   }
 
+  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
+val tableName = "orcTable"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.dictionary.key.threshold '1.0',
+   |  orc.column.encoding.direct 'uniqColumn'
--- End diff --

This new feature needs a doc update. We need to let our end users know how to 
use it.


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222871049
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
@@ -182,4 +182,12 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
   }
 }
   }
+
+  test("Enforce direct encoding column-wise selectively") {
+Seq(true, false).foreach { convertMetastore =>
+  withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
+testSelectiveDictionaryEncoding(isSelective = false)
--- End diff --

Ok. I see. Thanks.


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222868403
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
@@ -182,4 +182,12 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
   }
 }
   }
+
+  test("Enforce direct encoding column-wise selectively") {
+Seq(true, false).foreach { convertMetastore =>
+  withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
+testSelectiveDictionaryEncoding(isSelective = false)
--- End diff --

Yep. This is based on the current behavior, which is somewhat related to 
your CTAS PR. Only the read path works as expected.


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222868495
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
@@ -182,4 +182,12 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
   }
 }
   }
+
+  test("Enforce direct encoding column-wise selectively") {
+Seq(true, false).foreach { convertMetastore =>
+  withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
+testSelectiveDictionaryEncoding(isSelective = false)
--- End diff --

When we change Spark's behavior later, this test will be adapted accordingly.


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222868074
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
 }
   }
 
+  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
+val tableName = "orcTable"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.dictionary.key.threshold '1.0',
+   |  orc.column.encoding.direct 'uniqColumn'
+   |)
+""".stripMargin
+  case "hive" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
+   |STORED AS ORC
+   |LOCATION '${dir.toURI}'
+   |TBLPROPERTIES (
+   |  orc.dictionary.key.threshold '1.0',
+   |  hive.exec.orc.dictionary.key.size.threshold '1.0',
+   |  orc.column.encoding.direct 'uniqColumn'
+   |)
+""".stripMargin
+  case impl =>
+throw new UnsupportedOperationException(s"Unknown ORC implementation: $impl")
+}
+
+sql(sqlStatement)
+sql(s"INSERT INTO $tableName VALUES ('94086', 'random-uuid-string', 0.0)")
+
+val partFiles = dir.listFiles()
+  .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
+assert(partFiles.length === 1)
+
+val orcFilePath = new Path(partFiles.head.getAbsolutePath)
+val readerOptions = OrcFile.readerOptions(new Configuration())
+val reader = OrcFile.createReader(orcFilePath, readerOptions)
+var recordReader: RecordReaderImpl = null
+try {
+  recordReader = reader.rows.asInstanceOf[RecordReaderImpl]
+
+  // Check the kind
+  val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
+  assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
--- End diff --

Sure!


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222865396
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala ---
@@ -182,4 +182,12 @@ class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton {
   }
 }
   }
+
+  test("Enforce direct encoding column-wise selectively") {
+Seq(true, false).foreach { convertMetastore =>
+  withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> s"$convertMetastore") {
+testSelectiveDictionaryEncoding(isSelective = false)
--- End diff --

So even with `CONVERT_METASTORE_ORC` as true, we still can't use selective 
direct encoding?


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222855990
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -115,6 +116,69 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
 }
   }
 
+  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
+val tableName = "orcTable"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.dictionary.key.threshold '1.0',
+   |  orc.column.encoding.direct 'uniqColumn'
+   |)
+""".stripMargin
+  case "hive" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uniqColumn STRING, value DOUBLE)
+   |STORED AS ORC
+   |LOCATION '${dir.toURI}'
+   |TBLPROPERTIES (
+   |  orc.dictionary.key.threshold '1.0',
+   |  hive.exec.orc.dictionary.key.size.threshold '1.0',
+   |  orc.column.encoding.direct 'uniqColumn'
+   |)
+""".stripMargin
+  case impl =>
+throw new UnsupportedOperationException(s"Unknown ORC implementation: $impl")
+}
+
+sql(sqlStatement)
+sql(s"INSERT INTO $tableName VALUES ('94086', 'random-uuid-string', 0.0)")
+
+val partFiles = dir.listFiles()
+  .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
+assert(partFiles.length === 1)
+
+val orcFilePath = new Path(partFiles.head.getAbsolutePath)
+val readerOptions = OrcFile.readerOptions(new Configuration())
+val reader = OrcFile.createReader(orcFilePath, readerOptions)
+var recordReader: RecordReaderImpl = null
+try {
+  recordReader = reader.rows.asInstanceOf[RecordReaderImpl]
+
+  // Check the kind
+  val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
+  assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
--- End diff --

Could you write some comments to explain what `DICTIONARY_V2`, `DIRECT_V2`, 
and `DIRECT` are?


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222750086
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -115,6 +116,71 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
 }
   }
 
+  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
+val tableName = "orcTable"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uuid STRING, value DOUBLE)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.dictionary.key.threshold '1.0',
+   |  orc.column.encoding.direct 'uuid'
+   |)
+""".stripMargin
+  case "hive" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uuid STRING, value DOUBLE)
+   |STORED AS ORC
+   |LOCATION '${dir.toURI}'
+   |TBLPROPERTIES (
+   |  orc.dictionary.key.threshold '1.0',
+   |  hive.exec.orc.dictionary.key.size.threshold '1.0',
+   |  orc.column.encoding.direct 'uuid'
+   |)
+""".stripMargin
+  case impl =>
+throw new UnsupportedOperationException(s"Unknown ORC implementation: $impl")
+}
+
+sql(sqlStatement)
+sql(s"INSERT INTO $tableName VALUES ('94086', 'random-uuid-string', 0.0)")
+
+val partFiles = dir.listFiles()
+  .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
+assert(partFiles.length === 1)
+
+val orcFilePath = new Path(partFiles.head.getAbsolutePath)
+val readerOptions = OrcFile.readerOptions(new Configuration())
+val reader = OrcFile.createReader(orcFilePath, readerOptions)
+var recordReader: RecordReaderImpl = null
+try {
+  recordReader = reader.rows.asInstanceOf[RecordReaderImpl]
+
+  // Check the kind
+  val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
+  if (isSelective) {
+assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
--- End diff --

For this, I will update it like the following.
```
  assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
  if (isSelective) {
assert(stripe.getColumns(2).getKind === DIRECT_V2)
  } else {
assert(stripe.getColumns(2).getKind === DICTIONARY_V2)
  }
  assert(stripe.getColumns(3).getKind === DIRECT)
```
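
For context, my reading of the ORC `ColumnEncoding` kinds checked above (worth verifying against the ORC file-format specification):

```scala
// DIRECT        - values stored directly, RLE version 1
// DIRECT_V2     - values stored directly, RLE version 2 (ORC 0.12+)
// DICTIONARY    - dictionary-encoded, RLE version 1
// DICTIONARY_V2 - dictionary-encoded, RLE version 2 (ORC 0.12+)
assert(stripe.getColumns(1).getKind === DICTIONARY_V2) // zipcode: low cardinality
```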


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222535553
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -115,6 +116,71 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
 }
   }
 
+  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
+val tableName = "orcTable"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uuid STRING, value DOUBLE)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.dictionary.key.threshold '1.0',
+   |  orc.column.encoding.direct 'uuid'
--- End diff --

How about changing the column name? I thought it was some kind of enum 
representing the encoding.


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222535182
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -115,6 +116,71 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll {
 }
   }
 
+  protected def testSelectiveDictionaryEncoding(isSelective: Boolean) {
+val tableName = "orcTable"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uuid STRING, value DOUBLE)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.dictionary.key.threshold '1.0',
+   |  orc.column.encoding.direct 'uuid'
+   |)
+""".stripMargin
+  case "hive" =>
+s"""
+   |CREATE TABLE $tableName (zipcode STRING, uuid STRING, value DOUBLE)
+   |STORED AS ORC
+   |LOCATION '${dir.toURI}'
+   |TBLPROPERTIES (
+   |  orc.dictionary.key.threshold '1.0',
+   |  hive.exec.orc.dictionary.key.size.threshold '1.0',
+   |  orc.column.encoding.direct 'uuid'
+   |)
+""".stripMargin
+  case impl =>
+throw new UnsupportedOperationException(s"Unknown ORC implementation: $impl")
+}
+
+sql(sqlStatement)
+sql(s"INSERT INTO $tableName VALUES ('94086', 'random-uuid-string', 0.0)")
+
+val partFiles = dir.listFiles()
+  .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
+assert(partFiles.length === 1)
+
+val orcFilePath = new Path(partFiles.head.getAbsolutePath)
+val readerOptions = OrcFile.readerOptions(new Configuration())
+val reader = OrcFile.createReader(orcFilePath, readerOptions)
+var recordReader: RecordReaderImpl = null
+try {
+  recordReader = reader.rows.asInstanceOf[RecordReaderImpl]
+
+  // Check the kind
+  val stripe = recordReader.readStripeFooter(reader.getStripes.get(0))
+  if (isSelective) {
+assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
--- End diff --

@dongjoon-hyun, how about:

```
assert(stripe.getColumns(1).getKind === DICTIONARY_V2)
assert(stripe.getColumns(3).getKind === DIRECT)
if (isSelective) {
  assert(stripe.getColumns(2).getKind === DIRECT_V2)
} else {
  assert(stripe.getColumns(2).getKind === DICTIONARY_V2)
}
```


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22622#discussion_r222535254
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala ---
@@ -284,4 +350,8 @@ class OrcSourceSuite extends OrcSuite with SharedSQLContext {
   test("Check BloomFilter creation") {
 testBloomFilterCreation(Kind.BLOOM_FILTER_UTF8) // After ORC-101
   }
+
+  test("Enforce direct encoding column-wise selectively") {
+testSelectiveDictionaryEncoding(true)
--- End diff --

how about `testSelectiveDictionaryEncoding(isSelective = true)`


---




[GitHub] spark pull request #22622: [SPARK-25635][SQL][BUILD] Support selective direc...

2018-10-03 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/22622

[SPARK-25635][SQL][BUILD] Support selective direct encoding in native ORC write

## What changes were proposed in this pull request?


Before ORC 1.5.3, `orc.dictionary.key.threshold` and 
`hive.exec.orc.dictionary.key.size.threshold` were applied to all columns. This 
has been a big hurdle to enabling dictionary encoding. From ORC 1.5.3, 
`orc.column.encoding.direct` is added to enforce direct encoding selectively in 
a column-wise manner. This PR aims to add that feature by upgrading ORC from 
1.5.2 to 1.5.3.

The following are the patches in ORC 1.5.3; this feature is the only one 
directly related to Spark.
```
ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts multi-byte data (gopalv)
ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
ORC-405: Remove calcite as a dependency from the benchmarks.
ORC-375: Fix libhdfs on gcc7 by adding #include  two places.
ORC-383: Parallel builds fails with ConcurrentModificationException
ORC-382: Apache rat exclusions + add rat check to travis
ORC-401: Fix incorrect quoting in specification.
ORC-385: Change RecordReader to extend Closeable.
ORC-384: [C++] fix memory leak when loading non-ORC files
ORC-391: [c++] parseType does not accept underscore in the field name
ORC-397: Allow selective disabling of dictionary encoding. Original patch was by Mithun Radhakrishnan.
ORC-389: Add ability to not decode Acid metadata columns
```

## How was this patch tested?

Pass the Jenkins with newly added test cases.
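
As a usage sketch mirroring the newly added test (table name, location, and column names are illustrative, and a running SparkSession is assumed):

```scala
// Keep dictionary encoding for zipcode (threshold 1.0), but force direct
// encoding for the high-cardinality uuid column only.
spark.sql(
  """
    |CREATE TABLE orcTable (zipcode STRING, uuid STRING, value DOUBLE)
    |USING ORC
    |OPTIONS (
    |  path '/tmp/orcTable',
    |  orc.dictionary.key.threshold '1.0',
    |  orc.column.encoding.direct 'uuid'
    |)
  """.stripMargin)
```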

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-25635

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22622.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22622


commit 39b7fd63c4ce5cbe6dc628ffb0170aef361461ef
Author: Dongjoon Hyun 
Date:   2018-10-03T19:03:44Z

[SPARK-25635][SQL][BUILD] Support selective direct encoding in native ORC write




---
