[GitHub] spark pull request #21677: [SPARK-24692][TESTS] Improvement FilterPushdownBe...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21677#discussion_r200260115 --- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt --- @@ -0,0 +1,556 @@ +[ Pushdown for many distinct value case ] --- End diff -- How about this?

```
...
Select all int rows (value != -1):            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 1140 / 1165          0.9        1087.4       1.0X
Parquet Vectorized (Pushdown)                      1140 / 1172          0.9        1086.8       1.0X
Native ORC Vectorized                              1158 / 1206          0.9        1104.7       1.0X
Native ORC Vectorized (Pushdown)                   1151 / 1220          0.9        1098.1       1.0X

Pushdown for few distinct value case (use dictionary encoding)

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Select 0 distinct string row (value IS NULL): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  512 /  565          2.0         488.6       1.0X
Parquet Vectorized (Pushdown)                        27 /   33         39.3          25.5      19.2X
Native ORC Vectorized                               509 /  546          2.1         485.0       1.0X
Native ORC Vectorized (Pushdown)                     79 /   91         13.2          75.5       6.5X
...
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21677: [SPARK-24692][TESTS] Improvement FilterPushdownBe...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21677#discussion_r200246934 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -427,16 +245,122 @@ object FilterPushdownBenchmark { } } } + } -// Pushdown for few distinct value case (use dictionary encoding) + ignore("Pushdown for few distinct value case (use dictionary encoding)") { withTempPath { dir => val numDistinctValues = 200 - val mid = numDistinctValues / 2 withTempTable("orcTable", "patquetTable") { prepareStringDictTable(dir, numRows, numDistinctValues, width) -runStringBenchmark(numRows, width, mid, "distinct string") +runStringBenchmark(numRows, width, numDistinctValues / 2, "distinct string") + } +} + } + + ignore("Pushdown benchmark for StringStartsWith") { --- End diff -- Yes --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21556#discussion_r200246162 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -82,6 +120,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: (n: String, v: Any) => FilterApi.eq( intColumn(n), Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull) + +case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal => --- End diff -- It seems invalid values are already filtered out by: https://github.com/apache/spark/blob/e76b0124fbe463def00b1dffcfd8fd47e04772fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L439 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21677: [SPARK-24692][TESTS] Improvement FilterPushdownBenchmark
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21677 @HyukjinKwon Could you merge this to master first? I would like to update the [Benchmark results](https://github.com/apache/spark/pull/21677/files#diff-c5c0bfc86983d5779269cf75da8ed645) in several other pushdown-related PRs accordingly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21556#discussion_r200170975 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -82,6 +120,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: (n: String, v: Any) => FilterApi.eq( intColumn(n), Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull) + +case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal => --- End diff -- DecimalType carries a per-column `decimalMetadata` (precision and scale), so it seems difficult to define it as a constant like the other types. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21696: [SPARK-24716][SQL] Refactor ParquetFilters
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21696 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21696: [SPARK-24716][SQL] Refactor ParquetFilters
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21696#discussion_r200011264 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -19,187 +19,200 @@ package org.apache.spark.sql.execution.datasources.parquet import java.sql.Date +import scala.collection.JavaConverters.asScalaBufferConverter + import org.apache.parquet.filter2.predicate._ import org.apache.parquet.filter2.predicate.FilterApi._ import org.apache.parquet.io.api.Binary -import org.apache.parquet.schema.PrimitiveComparator +import org.apache.parquet.schema.{DecimalMetadata, MessageType, OriginalType, PrimitiveComparator, PrimitiveType} +import org.apache.parquet.schema.OriginalType._ +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._ import org.apache.spark.sql.catalyst.util.DateTimeUtils import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLDate import org.apache.spark.sql.sources -import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String /** * Some utility function to convert Spark data source filters to Parquet filters. */ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: Boolean) { + private case class ParquetSchemaType( + originalType: OriginalType, + primitiveTypeName: PrimitiveTypeName, + decimalMetadata: DecimalMetadata) + + private val ParquetBooleanType = ParquetSchemaType(null, BOOLEAN, null) + private val ParquetIntegerType = ParquetSchemaType(null, INT32, null) + private val ParquetLongType = ParquetSchemaType(null, INT64, null) + private val ParquetFloatType = ParquetSchemaType(null, FLOAT, null) + private val ParquetDoubleType = ParquetSchemaType(null, DOUBLE, null) + private val ParquetStringType = ParquetSchemaType(UTF8, BINARY, null) + private val ParquetBinaryType = ParquetSchemaType(null, BINARY, null) + private val ParquetDateType = ParquetSchemaType(DATE, INT32, null) + private def dateToDays(date: Date): SQLDate = { DateTimeUtils.fromJavaDate(date) } - private val makeEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { -case BooleanType => + private val makeEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = { +case ParquetBooleanType => (n: String, v: Any) => FilterApi.eq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) -case IntegerType => +case ParquetIntegerType => (n: String, v: Any) => FilterApi.eq(intColumn(n), v.asInstanceOf[Integer]) -case LongType => +case ParquetLongType => (n: String, v: Any) => FilterApi.eq(longColumn(n), v.asInstanceOf[java.lang.Long]) -case FloatType => +case ParquetFloatType => (n: String, v: Any) => FilterApi.eq(floatColumn(n), v.asInstanceOf[java.lang.Float]) -case DoubleType => +case ParquetDoubleType => (n: String, v: Any) => FilterApi.eq(doubleColumn(n), v.asInstanceOf[java.lang.Double]) // Binary.fromString and Binary.fromByteArray don't accept null values -case StringType => +case ParquetStringType => (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(s => Binary.fromString(s.asInstanceOf[String])).orNull) -case BinaryType => +case ParquetBinaryType => (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull) -case DateType if pushDownDate => +case ParquetDateType if pushDownDate => (n: String, v: Any) => FilterApi.eq( intColumn(n), Option(v).map(date => 
dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull) } - private val makeNotEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { -case BooleanType => + private val makeNotEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = { +case ParquetBooleanType => (n: String, v: Any) => FilterApi.notEq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) -case IntegerType => +case ParquetIntegerType => (n: String, v: Any) => FilterApi.notEq(intColumn(n), v.asInstanceOf[Integer]) -case LongType => +case ParquetLongType => (n: String, v: Any) => FilterApi.notEq(longColumn(n), v.asInstanceOf[java.lang.Long])
[GitHub] spark pull request #21696: [SPARK-24716][SQL] Refactor ParquetFilters
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21696#discussion_r19002 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -19,166 +19,186 @@ package org.apache.spark.sql.execution.datasources.parquet import java.sql.Date +import scala.collection.JavaConverters._ + import org.apache.parquet.filter2.predicate._ import org.apache.parquet.filter2.predicate.FilterApi._ import org.apache.parquet.io.api.Binary -import org.apache.parquet.schema.PrimitiveComparator +import org.apache.parquet.schema._ +import org.apache.parquet.schema.OriginalType._ +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._ import org.apache.spark.sql.catalyst.util.DateTimeUtils import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLDate import org.apache.spark.sql.sources -import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String /** * Some utility function to convert Spark data source filters to Parquet filters. */ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: Boolean) { + case class ParquetSchemaType( + originalType: OriginalType, + primitiveTypeName: PrimitiveType.PrimitiveTypeName, + decimalMetadata: DecimalMetadata) + private def dateToDays(date: Date): SQLDate = { DateTimeUtils.fromJavaDate(date) } - private val makeEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { -case BooleanType => + private val makeEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = { +// BooleanType +case ParquetSchemaType(null, BOOLEAN, null) => (n: String, v: Any) => FilterApi.eq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) -case IntegerType => +// IntegerType +case ParquetSchemaType(null, INT32, null) => (n: String, v: Any) => FilterApi.eq(intColumn(n), v.asInstanceOf[Integer]) -case LongType => +// LongType +case ParquetSchemaType(null, INT64, null) => (n: String, v: Any) => FilterApi.eq(longColumn(n), v.asInstanceOf[java.lang.Long]) -case FloatType => +// FloatType +case ParquetSchemaType(null, FLOAT, null) => (n: String, v: Any) => FilterApi.eq(floatColumn(n), v.asInstanceOf[java.lang.Float]) -case DoubleType => +// DoubleType +case ParquetSchemaType(null, DOUBLE, null) => (n: String, v: Any) => FilterApi.eq(doubleColumn(n), v.asInstanceOf[java.lang.Double]) +// StringType // Binary.fromString and Binary.fromByteArray don't accept null values -case StringType => +case ParquetSchemaType(UTF8, BINARY, null) => (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(s => Binary.fromString(s.asInstanceOf[String])).orNull) -case BinaryType => +// BinaryType +case ParquetSchemaType(null, BINARY, null) => (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull) -case DateType if pushDownDate => +// DateType +case ParquetSchemaType(DATE, INT32, null) if pushDownDate => (n: String, v: Any) => FilterApi.eq( intColumn(n), Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull) } - private val makeNotEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { -case BooleanType => + private val makeNotEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = { +case ParquetSchemaType(null, BOOLEAN, null) => (n: String, v: Any) => FilterApi.notEq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) -case IntegerType => +case ParquetSchemaType(null, INT32, null) => 
(n: String, v: Any) => FilterApi.notEq(intColumn(n), v.asInstanceOf[Integer]) -case LongType => +case ParquetSchemaType(null, INT64, null) => (n: String, v: Any) => FilterApi.notEq(longColumn(n), v.asInstanceOf[java.lang.Long]) -case FloatType => +case ParquetSchemaType(null, FLOAT, null) => (n: String, v: Any) => FilterApi.notEq(floatColumn(n), v.asInstanceOf[java.lang.Float]) -case DoubleType => +case ParquetSchemaType(null, DOUBLE, null) => (n: String, v: Any) => FilterApi.notEq(doubleColumn(n), v.asInstanceOf[java.lang.Double])
[GitHub] spark pull request #21682: [SPARK-24706][SQL] ByteType and ShortType support...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21682#discussion_r12294 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -42,6 +42,10 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: private val makeEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { case BooleanType => (n: String, v: Any) => FilterApi.eq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) +case ByteType | ShortType => + (n: String, v: Any) => FilterApi.eq( +intColumn(n), + Option(v).map(_.asInstanceOf[Number].intValue.asInstanceOf[Integer]).orNull) --- End diff -- value may be `null`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21682: [SPARK-24706][SQL] ByteType and ShortType support...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21682#discussion_r12316 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -93,6 +101,10 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: } private val makeLt: PartialFunction[DataType, (String, Any) => FilterPredicate] = { +case ByteType | ShortType => + (n: String, v: Any) => FilterApi.lt( +intColumn(n), +v.asInstanceOf[Number].intValue.asInstanceOf[Integer]) --- End diff -- value cannot be `null`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
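For context on the two null-handling comments above: in `ParquetFilters`, IS NULL / IS NOT NULL are pushed down through the eq/notEq builders with a `null` value, while the comparison builders only ever receive non-null literals. Below is a minimal sketch of that dispatch (helper names and the int-only case are illustrative assumptions, not the PR's code):

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}
import org.apache.spark.sql.sources

// Simplified sketch: why eq/notEq must tolerate null but lt/gt never see it.
def createFilterSketch(filter: sources.Filter): Option[FilterPredicate] = filter match {
  case sources.IsNull(name) =>
    // pushed down through the same builder as EqualTo, but with a null value
    Some(FilterApi.eq(FilterApi.intColumn(name), null.asInstanceOf[Integer]))
  case sources.LessThan(name, value: Integer) =>
    // the value of a comparison filter is always a non-null literal
    Some(FilterApi.lt(FilterApi.intColumn(name), value))
  case _ => None
}
```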
[GitHub] spark pull request #21682: [SPARK-24706][SQL] ByteType and ShortType support...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21682#discussion_r11024 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -42,6 +42,14 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: private val makeEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { case BooleanType => (n: String, v: Any) => FilterApi.eq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) +case ByteType => + (n: String, v: Any) => FilterApi.eq( +intColumn(n), +Option(v).map(b => b.asInstanceOf[java.lang.Byte].toInt.asInstanceOf[Integer]).orNull) +case ShortType => --- End diff -- How about something like this?

- `makeEq` and `makeNotEq`:

```scala
case ByteType | ShortType =>
  (n: String, v: Any) => FilterApi.notEq(
    intColumn(n),
    Option(v).map(_.asInstanceOf[Number].intValue.asInstanceOf[Integer]).orNull)
case IntegerType =>
  (n: String, v: Any) => FilterApi.notEq(intColumn(n), v.asInstanceOf[Integer])
```

- `makeLt`, `makeLtEq`, `makeGt` and `makeGtEq`:

```scala
case ByteType | ShortType =>
  (n: String, v: Any) => FilterApi.gtEq(
    intColumn(n),
    v.asInstanceOf[Number].intValue.asInstanceOf[Integer])
case IntegerType =>
  (n: String, v: Any) => FilterApi.gtEq(intColumn(n), v.asInstanceOf[java.lang.Integer])
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21682: [SPARK-24706][SQL] ByteType and ShortType support...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21682#discussion_r199986187 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -42,6 +42,14 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: private val makeEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { case BooleanType => (n: String, v: Any) => FilterApi.eq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) +case ByteType => + (n: String, v: Any) => FilterApi.eq( +intColumn(n), +Option(v).map(b => b.asInstanceOf[java.lang.Byte].toInt.asInstanceOf[Integer]).orNull) --- End diff --

```scala
scala> null.asInstanceOf[Short].toInt.asInstanceOf[Integer]
res49: Integer = 0

scala> null.asInstanceOf[java.lang.Short].toInt.asInstanceOf[Integer]
java.lang.NullPointerException
  at scala.Predef$.Short2short(Predef.scala:360)
  ... 51 elided
```

That's why I use `Option.map` here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
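A tiny null-safe variant of that conversion, shown only as an illustration (the helper name is made up; this is not code from the PR):

```scala
// Convert a boxed numeric filter value to java.lang.Integer while keeping null intact,
// avoiding the unboxing NullPointerException shown in the REPL session above.
def toIntegerOrNull(v: Any): Integer =
  Option(v).map(_.asInstanceOf[Number].intValue.asInstanceOf[Integer]).orNull

toIntegerOrNull(java.lang.Short.valueOf("3"))  // Integer = 3
toIntegerOrNull(null)                          // Integer = null
```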
[GitHub] spark pull request #21696: [SPARK-24716][SQL] Refactor ParquetFilters
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21696#discussion_r199672805 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -19,166 +19,186 @@ package org.apache.spark.sql.execution.datasources.parquet import java.sql.Date +import scala.collection.JavaConverters._ + import org.apache.parquet.filter2.predicate._ import org.apache.parquet.filter2.predicate.FilterApi._ import org.apache.parquet.io.api.Binary -import org.apache.parquet.schema.PrimitiveComparator +import org.apache.parquet.schema._ +import org.apache.parquet.schema.OriginalType._ +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._ import org.apache.spark.sql.catalyst.util.DateTimeUtils import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLDate import org.apache.spark.sql.sources -import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String /** * Some utility function to convert Spark data source filters to Parquet filters. */ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: Boolean) { + case class ParquetSchemaType( + originalType: OriginalType, + primitiveTypeName: PrimitiveType.PrimitiveTypeName, + decimalMetadata: DecimalMetadata) + private def dateToDays(date: Date): SQLDate = { DateTimeUtils.fromJavaDate(date) } - private val makeEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { -case BooleanType => + private val makeEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = { +// BooleanType +case ParquetSchemaType(null, BOOLEAN, null) => --- End diff -- Mapping type reference: https://github.com/apache/spark/blob/21a7bfd5c324e6c82152229f1394f26afeae771c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L338-L560 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21696: [SPARK-24716][SQL] Refactor ParquetFilters
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21696 cc @gatorsmile @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21556#discussion_r199442189 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -359,6 +369,70 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - decimal") { +Seq(true, false).foreach { legacyFormat => + withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> legacyFormat.toString) { +Seq(s"_1 decimal(${Decimal.MAX_INT_DIGITS}, 2)", // 32BitDecimalType + s"_1 decimal(${Decimal.MAX_LONG_DIGITS}, 2)", // 64BitDecimalType + "_1 decimal(38, 18)" // ByteArrayDecimalType +).foreach { schemaDDL => + val schema = StructType.fromDDL(schemaDDL) + val rdd = +spark.sparkContext.parallelize((1 to 4).map(i => Row(new java.math.BigDecimal(i + val dataFrame = spark.createDataFrame(rdd, schema) + testDecimalPushDown(dataFrame) { implicit df => +assert(df.schema === schema) +checkFilterPredicate('_1.isNull, classOf[Eq[_]], Seq.empty[Row]) +checkFilterPredicate('_1.isNotNull, classOf[NotEq[_]], (1 to 4).map(Row.apply(_))) + +checkFilterPredicate('_1 === 1, classOf[Eq[_]], 1) +checkFilterPredicate('_1 <=> 1, classOf[Eq[_]], 1) +checkFilterPredicate('_1 =!= 1, classOf[NotEq[_]], (2 to 4).map(Row.apply(_))) + +checkFilterPredicate('_1 < 2, classOf[Lt[_]], 1) +checkFilterPredicate('_1 > 3, classOf[Gt[_]], 4) +checkFilterPredicate('_1 <= 1, classOf[LtEq[_]], 1) +checkFilterPredicate('_1 >= 4, classOf[GtEq[_]], 4) + +checkFilterPredicate(Literal(1) === '_1, classOf[Eq[_]], 1) +checkFilterPredicate(Literal(1) <=> '_1, classOf[Eq[_]], 1) +checkFilterPredicate(Literal(2) > '_1, classOf[Lt[_]], 1) +checkFilterPredicate(Literal(3) < '_1, classOf[Gt[_]], 4) +checkFilterPredicate(Literal(1) >= '_1, classOf[LtEq[_]], 1) +checkFilterPredicate(Literal(4) <= '_1, classOf[GtEq[_]], 4) + +checkFilterPredicate(!('_1 < 4), classOf[GtEq[_]], 4) +checkFilterPredicate('_1 < 2 || '_1 > 3, classOf[Operators.Or], Seq(Row(1), Row(4))) + } +} + } +} + } + + test("incompatible parquet file format will throw exeception") { --- End diff -- I have created a PR: https://github.com/apache/spark/pull/21696. After that PR, decimal support should look like this: https://github.com/wangyum/spark/blob/refactor-decimal-pushdown/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L118-L146 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21696: [SPARK-24716][SQL] Refactor ParquetFilters
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21696 [SPARK-24716][SQL] Refactor ParquetFilters ## What changes were proposed in this pull request? Replace the DataFrame schema with the Parquet file schema when creating `ParquetFilters`. More details will be added later. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24716 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21696.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21696 commit 10d408fd3fe429d5529755b5abebac86b22b6d55 Author: Yuming Wang Date: 2018-07-02T09:36:46Z Refactor ParquetFilters --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
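A minimal, self-contained sketch of the idea in this PR description (helper names are assumptions, not the PR's exact code): describe each column by what the Parquet footer says rather than by the DataFrame schema, and build the pushdown lookup from the file's `MessageType`:

```scala
import scala.collection.JavaConverters._
import org.apache.parquet.schema.{DecimalMetadata, MessageType, OriginalType}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName

case class ParquetSchemaType(
    originalType: OriginalType,
    primitiveTypeName: PrimitiveTypeName,
    decimalMetadata: DecimalMetadata)

// Build a column-name -> Parquet physical/logical type map from the file schema, so
// filter construction matches what is actually stored in the file.
def primitiveFields(fileSchema: MessageType): Map[String, ParquetSchemaType] =
  fileSchema.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    f.getName -> ParquetSchemaType(f.getOriginalType, f.getPrimitiveTypeName, f.getDecimalMetadata)
  }.toMap
```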
[GitHub] spark pull request #21682: [SPARK-24706][SQL] ByteType and ShortType support...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21682 [SPARK-24706][SQL] ByteType and ShortType support pushdown to parquet ## What changes were proposed in this pull request? `ByteType` and `ShortType` support push down to the Parquet data source. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24706 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21682.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21682 commit e9d56252e6c65f5afa207bc98c8c5e008de57e0c Author: Yuming Wang Date: 2018-06-30T19:13:13Z ByteType and ShortType pushdown to parquet --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
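A quick way to exercise the change (the path and values are made up for illustration): byte and short columns are stored as INT32 in Parquet, so their filters have to be built against an int column:

```scala
// Write a table with byte/short columns, then filter it; with this PR the predicates
// below can be pushed down to the Parquet reader instead of only being evaluated in Spark.
spark.range(100)
  .selectExpr("cast(id as byte) as b", "cast(id as short) as s")
  .write.mode("overwrite").parquet("/tmp/spark/parquet/byte_short")

spark.read.parquet("/tmp/spark/parquet/byte_short")
  .filter("b > 10 and s <= 50")
  .count()
```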
[GitHub] spark issue #21681: Pin tag 210
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21681 @zhangchj1990 This looks like it was opened by mistake. Would you mind closing it, please? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21603: [SPARK-17091][SQL] Add rule to convert IN predicate to e...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21603 Benchmark result:

```
##[ Pushdown benchmark for InSet -> InFilters ]##

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

InSet -> InFilters (threshold: 10, values count: 5, distribution: 10):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7649 / 7678          2.1         486.3       1.0X
Parquet Vectorized (Pushdown)                       316 /  325         49.8          20.1      24.2X
Native ORC Vectorized                              6787 / 7353          2.3         431.5       1.1X
Native ORC Vectorized (Pushdown)                   1010 / 1020         15.6          64.2       7.6X

InSet -> InFilters (threshold: 10, values count: 5, distribution: 50):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7537 / 7944          2.1         479.2       1.0X
Parquet Vectorized (Pushdown)                       297 /  306         52.9          18.9      25.3X
Native ORC Vectorized                              6768 / 6779          2.3         430.3       1.1X
Native ORC Vectorized (Pushdown)                    998 / 1017         15.8          63.4       7.6X

InSet -> InFilters (threshold: 10, values count: 5, distribution: 90):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7500 / 7592          2.1         476.8       1.0X
Parquet Vectorized (Pushdown)                       299 /  306         52.5          19.0      25.1X
Native ORC Vectorized                              6758 / 6797          2.3         429.7       1.1X
Native ORC Vectorized (Pushdown)                    982 /  993         16.0          62.4       7.6X

InSet -> InFilters (threshold: 10, values count: 10, distribution: 10): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7566 / 8153          2.1         481.1       1.0X
Parquet Vectorized (Pushdown)                       319 /  328         49.3          20.3      23.7X
Native ORC Vectorized                              6761 / 6812          2.3         429.8       1.1X
Native ORC Vectorized (Pushdown)                    995 / 1013         15.8          63.3       7.6X

InSet -> InFilters (threshold: 10, values count: 10, distribution: 50): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7512 / 7581          2.1         477.6       1.0X
Parquet Vectorized (Pushdown)                       315 /  322         50.0          20.0      23.9X
Native ORC Vectorized                              6712 / 6774          2.3         426.8       1.1X
Native ORC Vectorized (Pushdown)                   1001 / 1032         15.7          63.6       7.5X

InSet -> InFilters (threshold: 10, values count: 10, distribution: 90): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7603 / 7689          2.1         483.4       1.0X
Parquet Vectorized (Pushdown)                       308 /  317         51.0          19.6      24.7X
Native ORC Vectorized                              7011 / 7605          2.2         445.7       1.1X
Native ORC Vectorized (Pushdown)                   1038 / 1067         15.2          66.0       7.3X

InSet -> InFilters (threshold: 10, values count: 50, distribution: 10): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7750 / 7796          2.0         492.7       1.0X
Parquet Vectorized (Pushdown)                      7855 / 7961          2.0         499.4       1.0X
Native ORC Vectorized                              7120 / 7820          2.2         452.7       1.1X
Native ORC Vectorized (Pushdown)                   1085 / 1122         14.5          69.0       7.1X

InSet -> InFilters (threshold: 10, values count: 50, distribution: 50): Best/Avg Time(ms)    Rate(M/
[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21623 Benchmark result:

```
###[ Pushdown benchmark for StringStartsWith ]###

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

StringStartsWith filter: (value like '10%'):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------
Parquet Vectorized                                10104 / 11125          1.6         642.4       1.0X
Parquet Vectorized (Pushdown)                      3002 /  3608          5.2         190.8       3.4X
Native ORC Vectorized                              9589 / 10454          1.6         609.7       1.1X
Native ORC Vectorized (Pushdown)                   9798 / 10509          1.6         622.9       1.0X

StringStartsWith filter: (value like '1000%'):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 8437 / 8563           1.9         536.4       1.0X
Parquet Vectorized (Pushdown)                       279 /  289          56.3          17.8      30.2X
Native ORC Vectorized                              7354 / 7568           2.1         467.5       1.1X
Native ORC Vectorized (Pushdown)                   7730 / 7972           2.0         491.4       1.1X

StringStartsWith filter: (value like '786432%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 8290 / 8510           1.9         527.0       1.0X
Parquet Vectorized (Pushdown)                       260 /  272          60.5          16.5      31.9X
Native ORC Vectorized                              7361 / 7395           2.1         468.0       1.1X
Native ORC Vectorized (Pushdown)                   7694 / 7811           2.0         489.2       1.1X
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21677: [SPARK-24692][TESTS] Improvement FilterPushdownBenchmark
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21677 cc @maropu --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21556 Benchmark results:

```
###[ Pushdown benchmark for Decimal ]

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Select 1 decimal(9, 2) row (value = 7864320):      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  4004 / 5309           3.9         254.5       1.0X
Parquet Vectorized (Pushdown)                       1401 / 1431          11.2          89.1       2.9X
Native ORC Vectorized                               4499 / 4567           3.5         286.0       0.9X
Native ORC Vectorized (Pushdown)                     899 /  961          17.5          57.2       4.5X

Select 10% decimal(9, 2) rows (value < 1572864):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  5376 / 6437           2.9         341.8       1.0X
Parquet Vectorized (Pushdown)                       2696 / 2754           5.8         171.4       2.0X
Native ORC Vectorized                               5458 / 5623           2.9         347.0       1.0X
Native ORC Vectorized (Pushdown)                    2230 / 2255           7.1         141.8       2.4X

Select 50% decimal(9, 2) rows (value < 7864320):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  8280 / 8487           1.9         526.4       1.0X
Parquet Vectorized (Pushdown)                       7716 / 7757           2.0         490.6       1.1X
Native ORC Vectorized                               9144 / 9495           1.7         581.4       0.9X
Native ORC Vectorized (Pushdown)                    7918 / 8118           2.0         503.4       1.0X

Select 90% decimal(9, 2) rows (value < 14155776):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  9648 / 9676           1.6         613.4       1.0X
Parquet Vectorized (Pushdown)                       9647 / 9778           1.6         613.3       1.0X
Native ORC Vectorized                              10782 / 10867          1.5         685.5       0.9X
Native ORC Vectorized (Pushdown)                   10108 / 10269          1.6         642.6       1.0X

Select 1 decimal(18, 2) row (value = 7864320):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  4066 / 4147           3.9         258.5       1.0X
Parquet Vectorized (Pushdown)                         84 /   89         188.0           5.3      48.6X
Native ORC Vectorized                               5430 / 5512           2.9         345.3       0.7X
Native ORC Vectorized (Pushdown)                    1054 / 1076          14.9          67.0       3.9X

Select 10% decimal(18, 2) rows (value < 1572864):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  5028 / 5154           3.1         319.7       1.0X
Parquet Vectorized (Pushdown)                       1360 / 1421          11.6          86.5       3.7X
Native ORC Vectorized                               6266 / 6360           2.5         398.4       0.8X
Native ORC Vectorized (Pushdown)                    2513 / 2550           6.3         159.8       2.0X

Select 50% decimal(18, 2) rows (value < 7864320):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  8571 / 8600           1.8         544.9       1.0X
Parquet Vectorized (Pushdown)                       6455 / 6713           2.4         410.4       1.3X
Native ORC Vectorized                              10138 / 10353          1.6         644.5       0.8X
Native ORC Vectorized (Pushdown)                    8166 / 8418           1.9         519.2       1.0X

Select 90% decimal(18, 2) rows (value < 14155776): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 12184 /
[GitHub] spark pull request #21677: [SPARK-24692][TESTS] Improvement FilterPushdownBe...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21677 [SPARK-24692][TESTS] Improvement FilterPushdownBenchmark ## What changes were proposed in this pull request? 1. Write the results to `benchmarks/FilterPushdownBenchmark-results.txt` for easier maintenance. 2. Add more benchmark cases: `StringStartsWith`, `Decimal` and `InSet -> InFilters`. ## How was this patch tested? manual tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24692 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21677.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21677 commit ccdd21cfa75f8577b5f8093c8e0b1eba6aa2e055 Author: Yuming Wang Date: 2018-06-30T00:22:16Z Improvement FilterPushdownBenchmark --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
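A rough sketch of point 1 above (not the PR's exact mechanism; the suite-runner call is a placeholder): capture the benchmark's console output and save it under `benchmarks/` so the results can be reviewed as a plain file diff:

```scala
import java.io.{ByteArrayOutputStream, File, PrintWriter}

// Collect everything the benchmark prints, then persist it as the results file.
val buffer = new ByteArrayOutputStream()
Console.withOut(buffer) {
  // runFilterPushdownBenchmarks()  // placeholder: run all benchmark cases here
}
val writer = new PrintWriter(new File("benchmarks/FilterPushdownBenchmark-results.txt"))
try writer.write(buffer.toString) finally writer.close()
```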
[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21623#discussion_r199116993 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -660,6 +688,62 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex assert(df.where("col > 0").count() === 2) } } + + test("filter pushdown - StringStartsWith") { +withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df => + // Test canDrop() --- End diff -- Both methods are executed, but it cannot be confirmed which one actually took effect. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21623 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21623#discussion_r199043411 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -660,6 +661,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex assert(df.where("col > 0").count() === 2) } } + + test("filter pushdown - StringStartsWith") { +withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df => --- End diff -- Added `testStringStartsWith` so the tests go exactly through `canDrop` and `inverseCanDrop`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21623#discussion_r199043210 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -22,16 +22,23 @@ import java.sql.Date import org.apache.parquet.filter2.predicate._ import org.apache.parquet.filter2.predicate.FilterApi._ import org.apache.parquet.io.api.Binary +import org.apache.parquet.schema.PrimitiveComparator import org.apache.spark.sql.catalyst.util.DateTimeUtils import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLDate +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.sources import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String /** * Some utility function to convert Spark data source filters to Parquet filters. */ -private[parquet] class ParquetFilters(pushDownDate: Boolean) { +private[parquet] class ParquetFilters() { + + val sqlConf: SQLConf = SQLConf.get --- End diff -- You are right. I hit a bug here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21547: [SPARK-24538][SQL] ByteArrayDecimalType support p...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/21547 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21547: [SPARK-24538][SQL] ByteArrayDecimalType support push dow...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21547 Closing this because I have implemented it in [SPARK-24549](https://issues.apache.org/jira/browse/SPARK-24549). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21556 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21603#discussion_r198146352 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) { case sources.Not(pred) => createFilter(schema, pred).map(FilterApi.not) + case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 => --- End diff -- I have prepared a test case so that you can verify it:

```scala
test("Benchmark") {
  def benchmark(func: () => Unit): Long = {
    val start = System.currentTimeMillis()
    func()
    val end = System.currentTimeMillis()
    end - start
  }
  // scalastyle:off
  withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    withTempPath { path =>
      Seq(100, 1000).foreach { count =>
        Seq(1048576, 10485760, 104857600).foreach { blockSize =>
          spark.range(count).toDF().selectExpr("id", "cast(id as string) as d1",
            "cast(id as double) as d2", "cast(id as float) as d3", "cast(id as int) as d4",
            "cast(id as decimal(38)) as d5")
            .coalesce(1).write.mode("overwrite")
            .option("parquet.block.size", blockSize).parquet(path.getAbsolutePath)
          val df = spark.read.parquet(path.getAbsolutePath)
          println(s"path: ${path.getAbsolutePath}")
          Seq(1000, 100, 10, 1).foreach { ratio =>
            println(s"##[ count: $count, blockSize: $blockSize, ratio: $ratio ]#")
            var i = 1
            while (i < 300) {
              val filter = Range(0, i).map(r => scala.util.Random.nextInt(count / ratio))
              i += 4
              sql("set spark.sql.parquet.pushdown.inFilterThreshold=1")
              val vanillaTime = benchmark(() => df.where(s"id in(${filter.mkString(",")})").count())
              sql("set spark.sql.parquet.pushdown.inFilterThreshold=1000")
              val pushDownTime = benchmark(() => df.where(s"id in(${filter.mkString(",")})").count())
              if (pushDownTime > vanillaTime) {
                println(s"vanilla is better, threshold: ${filter.size}, $pushDownTime, $vanillaTime")
              } else {
                println(s"push down is better, threshold: ${filter.size}, $pushDownTime, $vanillaTime")
              }
            }
          }
        }
      }
    }
  }
}
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21603: [SPARK-17091][SQL] Add rule to convert IN predicate to e...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21603 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21603#discussion_r198124578 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) { case sources.Not(pred) => createFilter(schema, pred).map(FilterApi.not) + case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 => --- End diff -- It mainly depends on how many row groups can be skipped. For a small table (assuming only one row group), there is no obvious difference. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
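An illustration of the point above (the path and sizes are made up): with a small `parquet.block.size` the file has many row groups, so an IN filter whose values hit only a few of them lets Parquet skip most of the data, while with a single row group nothing can be skipped and pushdown shows no obvious difference:

```scala
// Write ~1 MB row groups so the file contains many of them; the IN filter below can
// then discard row groups whose min/max statistics exclude all listed values.
spark.range(10000000L).toDF("id")
  .coalesce(1)
  .write.mode("overwrite")
  .option("parquet.block.size", 1024 * 1024)
  .parquet("/tmp/spark/parquet/in_pushdown")

spark.read.parquet("/tmp/spark/parquet/in_pushdown")
  .where("id in (1, 100, 10000)")
  .count()
```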
[GitHub] spark pull request #21641: [SPARK-24658][SQL] Remove workaround for ANTLR bu...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21641 [SPARK-24658][SQL] Remove workaround for ANTLR bug ## What changes were proposed in this pull request? Issue antlr/antlr4#781 has already been fixed, so the workaround of extracting the pattern into a separate rule is no longer needed. Presto has already removed it: https://github.com/prestodb/presto/pull/10744. ## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark ANTLR-780 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21641.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21641 commit e9a1b2cadbbd8dc34671c2051ee6ddfd5f637709 Author: Yuming Wang Date: 2018-06-26T05:18:17Z Remove workaround for ANTLR bug --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21623#discussion_r197992151 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -660,6 +660,30 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex assert(df.where("col > 0").count() === 2) } } + + test("filter pushdown - StringStartsWith") { +withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df => + Seq("2", "2s", "2st", "2str", "2str2").foreach { prefix => +checkFilterPredicate( + '_1.startsWith(prefix).asInstanceOf[Predicate], + classOf[UserDefinedByInstance[_, _]], + "2str2") + } + + Seq("2S", "null", "2str22").foreach { prefix => +checkFilterPredicate( + '_1.startsWith(prefix).asInstanceOf[Predicate], + classOf[UserDefinedByInstance[_, _]], + Seq.empty[Row]) + } + + assertResult(None) { +parquetFilters.createFilter( + df.schema, + sources.StringStartsWith("_1", null)) --- End diff -- Thanks @attilapiros, `sources.StringStartsWith("_1", null)` will not match them, same as before. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21623 cc @gszadovszky @nandorKollar --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21623 [SPARK-24638][SQL] StringStartsWith support push down ## What changes were proposed in this pull request? `StringStartsWith` support push down. About 50% savings in compute time. ## How was this patch tested? unit tests and manual tests. Performance test: ```scala cat < SPARK-24638.scala spark.range(1000).selectExpr("concat(id, 'str', id) as id").coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/string") val df = spark.read.parquet("/tmp/spark/parquet/string/") spark.sql("set spark.sql.parquet.filterPushdown=true") val pushdownEnableStart = System.currentTimeMillis() for(i <- 0 until 100) { df.where("id like '98%'").count() } val pushdownEnable = System.currentTimeMillis() - pushdownEnableStart spark.sql("set spark.sql.parquet.filterPushdown=false") val pushdownDisableStart = System.currentTimeMillis() for(i <- 0 until 100) { df.where("id like '98%'").count() } val pushdownDisable = System.currentTimeMillis() - pushdownDisableStart val improvements = pushdownDisable.toDouble - pushdownEnable.toDouble println(s"improvements: ${improvements}") EOF bin/spark-shell -i SPARK-24638.scala ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24638 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21623.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21623 commit 5b52ace44c8a41631535c883b7a5c8545959e5e5 Author: Yuming Wang Date: 2018-06-23T13:27:30Z StringStartsWith support push down --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
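A simplified sketch of the mechanism behind this PR (close to, but not literally, its code): the prefix predicate is pushed down as a Parquet `UserDefinedPredicate`, and `canDrop` / `inverseCanDrop` use a row group's min/max statistics to skip it without reading any values:

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, Statistics, UserDefinedPredicate}
import org.apache.parquet.io.api.Binary
import org.apache.parquet.schema.PrimitiveComparator
import org.apache.spark.unsafe.types.UTF8String

def startsWithFilter(column: String, prefix: String) = {
  val strToBinary = Binary.fromReusedByteArray(prefix.getBytes)
  val size = strToBinary.length
  FilterApi.userDefined(FilterApi.binaryColumn(column),
    new UserDefinedPredicate[Binary] with Serializable {
      private val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR

      // evaluated per value when the statistics alone cannot decide
      override def keep(value: Binary): Boolean =
        value != null && UTF8String.fromBytes(value.getBytes)
          .startsWith(UTF8String.fromBytes(strToBinary.getBytes))

      // drop the row group if its whole [min, max] range lies before or after the prefix
      override def canDrop(statistics: Statistics[Binary]): Boolean = {
        val max = statistics.getMax
        val min = statistics.getMin
        comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
          comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
      }

      // for a negated startsWith: drop the row group only if every value shares the prefix
      override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = {
        val max = statistics.getMax
        val min = statistics.getMin
        comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) == 0 &&
          comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) == 0
      }
    })
}
```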
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21603#discussion_r197603396 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -378,6 +378,17 @@ object SQLConf { .booleanConf .createWithDefault(true) + val PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD = +buildConf("spark.sql.parquet.pushdown.inFilterThreshold") + .doc("The maximum number of values to filter push-down optimization for IN predicate. " + +"Large threshold won't necessarily provide much better performance. " + +"The experiment argued that 300 is the limit threshold. " + --- End diff -- You are right.

Type | limit threshold
-- | --
string | 370
int | 210
long | 285
double | 270
float | 220
decimal | Will not provide better performance

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21603#discussion_r197338867 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) { case sources.Not(pred) => createFilter(schema, pred).map(FilterApi.not) + case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 => --- End diff -- It seems that the push-down performance is better when the threshold is less than `300`:

https://user-images.githubusercontent.com/5399861/41757743-7e411532-7616-11e8-8844-45132c50c535.png

The code:

```scala
withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
  import testImplicits._
  withTempPath { path =>
    val total = 1000
    (0 to total).toDF().coalesce(1)
      .write.option("parquet.block.size", 512)
      .parquet(path.getAbsolutePath)
    val df = spark.read.parquet(path.getAbsolutePath)
    // scalastyle:off println
    var lastSize = -1
    var i = 16000
    while (i < total) {
      val filter = Range(0, total).filter(_ % i == 0)
      i += 100
      if (lastSize != filter.size) {
        if (lastSize == -1) println(s"start size: ${filter.size}")
        lastSize = filter.size
        sql("set spark.sql.parquet.pushdown.inFilterThreshold=100")
        val begin1 = System.currentTimeMillis()
        df.where(s"id in(${filter.mkString(",")})").count()
        val end1 = System.currentTimeMillis()
        val time1 = end1 - begin1
        sql("set spark.sql.parquet.pushdown.inFilterThreshold=10")
        val begin2 = System.currentTimeMillis()
        df.where(s"id in(${filter.mkString(",")})").count()
        val end2 = System.currentTimeMillis()
        val time2 = end2 - begin2
        if (time1 <= time2) println(s"Max threshold: $lastSize")
      }
    }
  }
}
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21603: [SPARK-17091][SQL] Add rule to convert IN predicate to e...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21603 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21603#discussion_r197011649 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) { case sources.Not(pred) => createFilter(schema, pred).map(FilterApi.not) + case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 => --- End diff -- The threshold is **20**. Too many `values` may cause an OOM, for example:

```scala
spark.range(1000).coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/SPARK-17091")
val df = spark.read.parquet("/tmp/spark/parquet/SPARK-17091/")
df.where(s"id in(${Range(1, 1).mkString(",")})").count
```

```
Exception in thread "SIGINT handler"
18/06/21 13:00:54 WARN TaskSetManager: Lost task 7.0 in stage 1.0 (TID 8, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOfRange(Arrays.java:3664)
	at java.lang.String.<init>(String.java:207)
	at java.lang.StringBuilder.toString(StringBuilder.java:407)
	at org.apache.parquet.filter2.predicate.Operators$BinaryLogicalFilterPredicate.<init>(Operators.java:263)
	at org.apache.parquet.filter2.predicate.Operators$Or.<init>(Operators.java:316)
	at org.apache.parquet.filter2.predicate.FilterApi.or(FilterApi.java:261)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$15.apply(ParquetFilters.scala:276)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$15.apply(ParquetFilters.scala:276)
	...
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
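A minimal, self-contained sketch of the conversion being discussed (the helper name is made up): an IN predicate becomes a chain of OR-ed EQ predicates, so the predicate tree, and the string building visible in the stack trace above, grows with the number of values, which is why a threshold is needed:

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}

// IN (v1, v2, ...) -> eq(col, v1) OR eq(col, v2) OR ...; only the int column case is shown.
def inToOrChain(column: String, values: Seq[Integer]): Option[FilterPredicate] =
  values
    .map(v => FilterApi.eq(FilterApi.intColumn(column), v): FilterPredicate)
    .reduceLeftOption(FilterApi.or)
```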
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21603 [SPARK-17091][SQL] Add rule to convert IN predicate to equivalent Parquet filter ## What changes were proposed in this pull request? Add a new optimizer rule to convert an IN predicate to an equivalent Parquet filter. The original pr is: https://github.com/apache/spark/pull/18424 ## How was this patch tested? unit tests and manual tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-17091 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21603.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21603 commit 264eed81e33d3af7d7ea50a3a49866dde18f163b Author: Yuming Wang Date: 2018-06-21T04:35:20Z Convert IN predicate to Parquet filter push-down commit 4f96881af4af6f613c049f3756ee3aba518ceab8 Author: Yuming Wang Date: 2018-06-21T04:49:12Z Change threshold to 20. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21556 cc @gatorsmile @rdblue --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18424: [SPARK-17091] Add rule to convert IN predicate to equiva...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/18424 @ptkool Are you still working on this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21556 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21556 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21556 Another performance test: https://user-images.githubusercontent.com/5399861/41448622-437d029a-708e-11e8-9c18-5d9f17cd1edf.png --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21547: [SPARK-24538][SQL] ByteArrayDecimalType support push dow...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21547 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21547: [SPARK-24538][SQL] ByteArrayDecimalType support push dow...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21547 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21547: [SPARK-24538][SQL] ByteArrayDecimalType support p...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21547#discussion_r195289284 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -37,6 +39,23 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) { DateTimeUtils.fromJavaDate(date) } + private def decimalToBinary(precision: Int, decimal: JBigDecimal): Binary = { --- End diff -- REF: https://github.com/apache/spark/blob/21a7bfd5c324e6c82152229f1394f26afeae771c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L247-L266 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
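A simplified sketch of what the referenced `ParquetWriteSupport` logic does (not copied from it, and the byte-width handling is reduced to the essentials): the decimal's unscaled value is sign-extended to the fixed byte width implied by the precision and wrapped in a Parquet `Binary`:

```scala
import java.math.{BigDecimal => JBigDecimal}
import org.apache.parquet.io.api.Binary

// Assumes the unscaled value fits into numBytes (the fixed length derived from precision).
def decimalToFixedLengthBinary(numBytes: Int, decimal: JBigDecimal): Binary = {
  val unscaled = decimal.unscaledValue().toByteArray        // big-endian two's complement
  val fixed = new Array[Byte](numBytes)
  val signByte: Byte = if (unscaled.head < 0) -1 else 0     // 0xFF for negative, 0x00 otherwise
  java.util.Arrays.fill(fixed, 0, numBytes - unscaled.length, signByte)
  System.arraycopy(unscaled, 0, fixed, numBytes - unscaled.length, unscaled.length)
  Binary.fromReusedByteArray(fixed)
}
```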
[GitHub] spark pull request #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDeci...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21556#discussion_r195283330 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -62,6 +62,16 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) { (n: String, v: Any) => FilterApi.eq( intColumn(n), Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull) +case decimal: DecimalType if DecimalType.is32BitDecimalType(decimal) => + (n: String, v: Any) => FilterApi.eq( +intColumn(n), + Option(v).map(_.asInstanceOf[java.math.BigDecimal].unscaledValue().intValue() --- End diff -- REF: https://github.com/apache/spark/blob/21a7bfd5c324e6c82152229f1394f26afeae771c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L219 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDeci...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21556 [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType support push down ## What changes were proposed in this pull request? [32BitDecimalType](https://github.com/apache/spark/blob/e28eb431146bcdcaf02a6f6c406ca30920592a6a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L208) and [64BitDecimalType](https://github.com/apache/spark/blob/e28eb431146bcdcaf02a6f6c406ca30920592a6a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L219) support push down to the data sources. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24549 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21556.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21556 commit 9832661e735fbbbfc4da4cc96f4ff7a537c3eca2 Author: Yuming Wang Date: 2018-06-13T12:28:33Z 32BitDecimalType and 64BitDecimalType support push down to the data sources --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
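For illustration of the kind of predicate this change targets: decimals with small enough precision are stored as plain INT32/INT64 values in Parquet, so filters on such columns can be handed to the reader. The table and column below are hypothetical.
```sql
-- Hypothetical example: price is DECIMAL(9, 2), which fits in 32 bits,
-- so after this change the equality filter can be pushed to the data source.
SELECT * FROM sales WHERE price = CAST(9.99 AS DECIMAL(9, 2));
```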
[GitHub] spark issue #21547: [SPARK-24538][SQL] ByteArrayDecimalType support push dow...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21547 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21547: [SPARK-24538][SQL] ByteArrayDecimalType support p...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21547 [SPARK-24538][SQL] ByteArrayDecimalType support push down to the data sources ## What changes were proposed in this pull request? [ByteArrayDecimalType](https://github.com/apache/spark/blob/e28eb431146bcdcaf02a6f6c406ca30920592a6a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L230) support push down to the data sources. ## How was this patch tested? unit tests and manual tests. **manual tests**:
```scala
spark.range(1000).selectExpr(
  "id",
  "cast(id as decimal(9)) as d1",
  "cast(id as decimal(9, 2)) as d2",
  "cast(id as decimal(18)) as d3",
  "cast(id as decimal(18, 4)) as d4",
  "cast(id as decimal(38)) as d5",
  "cast(id as decimal(38, 18)) as d6"
).coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/decimal")

val df = spark.read.parquet("/tmp/spark/parquet/decimal/")
// Only read about 1 MB data
df.filter("d6 = 1").show
// Read 174.3 MB data
df.filter("d3 = 1").show
```
You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24538 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21547.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21547 commit 96066701ec75d3caa27994c47eab8ff64150b6a5 Author: Yuming Wang Date: 2018-06-13T01:35:55Z ByteArrayDecimalType support push down to the data sources --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21479: [SPARK-23903][SQL] Add support for date extract
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21479 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21479: [SPARK-23903][SQL] Add support for date extract
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21479 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21479: [SPARK-23903][SQL] Add support for date extract
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21479 [SPARK-23903][SQL] Add support for date extract ## What changes were proposed in this pull request? Add support for date `extract`; the supported fields are the same as in [Hive](https://github.com/apache/hive/blob/rel/release-2.3.3/ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g#L308-L316): `YEAR`, `QUARTER`, `MONTH`, `WEEK`, `DAY`, `HOUR`, `MINUTE`, `SECOND`. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-23903 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21479.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21479 commit a1a4db3774e7e0911e710ed1a99694add29df545 Author: Yuming Wang Date: 2018-06-01T16:06:55Z Add support for date extract --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
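A few illustrative calls of the proposed `extract` syntax; the timestamp values are chosen only as examples.
```sql
SELECT extract(year FROM timestamp '2018-06-01 10:20:30');   -- 2018
SELECT extract(week FROM timestamp '2018-06-01 10:20:30');   -- week of the year
SELECT extract(minute FROM timestamp '2018-06-01 10:20:30'); -- 20
```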
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21010 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21010 > Basically LGTM, but I'm wondering what if the expr2 is not like a format string? The same as Hive: ```sql spark-sql> SELECT format_number(12332.123456, 'abc'); abc12332 ``` ```sql hive> SELECT format_number(12332.123456, 'abc'); OK abc12332 Time taken: 0.218 seconds, Fetched: 1 row(s) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21460: [SPARK-23442][SQL] Improvement reading from parti...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21460 [SPARK-23442][SQL] Improvement reading from partitioned and bucketed table. ## What changes were proposed in this pull request? For a partitioned and bucketed table, the amount of data grows as the number of partitions increases, yet reading the table always uses `bucket number` tasks. This PR changes the logic to use `bucket number` * `partition number` tasks when reading a partitioned and bucketed table. ## How was this patch tested? manual tests.
```scala
spark.range(1).selectExpr(
  "id as key",
  "id % 5 as t1",
  "id % 10 as p"
).repartition(5, col("p")).write.partitionBy("p").bucketBy(5, "key").sortBy("t1").saveAsTable("spark_23442")
```
```scala
// All partitions: partition number = 5 * 10 = 50
spark.sql("select count(distinct t1) from spark_23442 ").show
```
```scala
// Filtered 1/2 of the partitions: partition number = 5 * (10 / 2) = 25
spark.sql("select count(distinct t1) from spark_23442 where p >= 5 ").show
```
You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-23442 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21460.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21460 commit 58e4e098016051f41103464040ba24bbee28b2cf Author: Yuming Wang Date: 2018-05-30T06:53:52Z Improvement reading from partitioned and bucketed table. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21010 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21010 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21431: [SPARK-19112][CORE][FOLLOW-UP] Add missing shortCompress...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21431 Yes. I have tested with `--conf spark.io.compression.codec=zstd`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21431: [SPARK-19112][CORE][FOLLOW-UP] Add missing shortC...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21431 [SPARK-19112][CORE][FOLLOW-UP] Add missing shortCompressionCodecNames to configuration. ## What changes were proposed in this pull request? Spark provides four codecs: `lz4`, `lzf`, `snappy`, and `zstd`. This pr adds the missing shortCompressionCodecNames to the configuration. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-19112 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21431.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21431 commit 877d27055698a872347a04977ae050fef0ba42e7 Author: Yuming Wang <yumwang@...> Date: 2018-05-25T13:41:46Z Add missing shortCompressionCodecNames to configuration. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
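For reference, a small example of how the short codec names are used in practice; the key is the standard `spark.io.compression.codec` setting, and the sketch is only illustrative:
```scala
import org.apache.spark.SparkConf

// The short names ("lz4", "lzf", "snappy", "zstd") are resolved to full codec
// class names through the shortCompressionCodecNames mapping.
val conf = new SparkConf().set("spark.io.compression.codec", "zstd")
```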
[GitHub] spark pull request #21423: [SPARK-24378][SQL] Fix date_trunc function incorr...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21423 [SPARK-24378][SQL] Fix date_trunc function incorrect examples ## What changes were proposed in this pull request? Fix the incorrect examples for the `date_trunc` function. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24378 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21423.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21423 commit b8b0c9dd21bbb4a5d29174d778165a2bd72403e5 Author: Yuming Wang <yumwang@...> Date: 2018-05-24T11:46:28Z Fix incorrect examples --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21404: [SPARK-24360][SQL] Support Hive 3.0 metastore
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21404 Can we remove the old Hive support, such as 0.12, 0.13 and 0.14? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20274: [SPARK-20120][SQL][FOLLOW-UP] Better way to support spar...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20274 @srowen I have updated. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21375: [HOT-FIX][SQL] Fix: SQLConf.scala:1757: not found...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/21375 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21375: [HOT-FIX][SQL] Fix: SQLConf.scala:1757: not found...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21375 [HOT-FIX][SQL] Fix: SQLConf.scala:1757: not found: value Utils ## What changes were proposed in this pull request? Fix: `SQLConf.scala:1757: not found: value Utils` ## How was this patch tested? manual tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark hot-fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21375.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21375 commit 5375965ca1de10e97ca6f2b0b984dbe0e9306c68 Author: Yuming Wang <yumwang@...> Date: 2018-05-20T00:30:58Z Fix: SQLConf.scala:1757: not found: value Utils --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21343: [SPARK-24292][SQL] Proxy user cannot connect to HiveMeta...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21343 This problem seems to have been fixed. Can you try [v2.3.1-rc1](https://github.com/apache/spark/releases/tag/v2.3.1-rc1)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/18853 **Spark vs Teradata**: https://user-images.githubusercontent.com/5399861/40102134-43a138e2-591c-11e8-8bf1-00fb9b72e026.png https://user-images.githubusercontent.com/5399861/40102133-436658a8-591c-11e8-950c-9ec95a4c7ed0.png --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21328: Ci
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/21328 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21328: Ci
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21328 Ci ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/spark-mler/spark ci Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21328.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21328 commit 082c78494d2fbc904379d32cec4408b24620b25d Author: Yuming Wang <wgyumg@...> Date: 2018-05-15T04:28:48Z Remove -T 4 -p commit 4c0945143f344d6a3fe2288416b5721f8cb75edb Author: Yuming Wang <wgyumg@...> Date: 2018-05-15T05:58:34Z travis_wait 60 commit d645136386c5fd245cb7a358ac06095e3e9bf989 Author: Yuming Wang <wgyumg@...> Date: 2018-05-15T06:28:37Z Update .travis.yml --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21010 retest please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21189: [SPARK-24117][SQL] Unified the getSizePerRow
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21189 [SPARK-24117][SQL] Unified the getSizePerRow ## What changes were proposed in this pull request? This pr unifies `getSizePerRow` because it is used in many places. For example: 1. [LocalRelation.scala#L80](https://github.com/wangyum/spark/blob/f70f46d1e5bc503e9071707d837df618b7696d32/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala#L80) 2. [SizeInBytesOnlyStatsPlanVisitor.scala#L36](https://github.com/apache/spark/blob/76b8b840ddc951ee6203f9cccd2c2b9671c1b5e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L36) ## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24117 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21189.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21189 commit cd415381386f0ac5c29cd6dab57ceafc86e96adf Author: Yuming Wang <yumwang@...> Date: 2018-04-28T11:10:33Z Unified the getSizePerRow --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
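As a rough sketch of what a shared helper could look like (the signature and the per-row overhead constant are assumptions, not the actual Spark implementation): estimate the size of one row from the output attributes' default data-type sizes plus a small fixed overhead.
```scala
import org.apache.spark.sql.catalyst.expressions.Attribute

// Assumed sketch of a single shared estimate used by LocalRelation statistics
// and the size-only stats visitor; the 8-byte row overhead is a guess.
def getSizePerRow(attributes: Seq[Attribute]): Long =
  8L + attributes.map(_.dataType.defaultSize).sum
```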
[GitHub] spark issue #21170: [SPARK-22732][SS][FOLLOW-UP] Fix memoryV2.scala toString...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21170 cc @zsxwing --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21170: [SPARK-22732][SS][FOLLOW-UP] Fix memoryV2.scala t...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21170 [SPARK-22732][SS][FOLLOW-UP] Fix memoryV2.scala toString error ## What changes were proposed in this pull request? Fix `memoryV2.scala` toString error ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-22732 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21170.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21170 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20659: [DO-NOT-MERGE] Try to update Hive to 2.3.2
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/20659 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21077: [SPARK-21033][CORE][FOLLOW-UP] Update Spillable
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21077 [SPARK-21033][CORE][FOLLOW-UP] Update Spillable ## What changes were proposed in this pull request? Update
```scala
SparkEnv.get.conf.getLong("spark.shuffle.spill.numElementsForceSpillThreshold", Long.MaxValue)
```
to
```scala
SparkEnv.get.conf.get(SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD)
```
because `SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD`'s default value is `Integer.MAX_VALUE`: https://github.com/apache/spark/blob/c99fc9ad9b600095baba003053dbf84304ca392b/core/src/main/scala/org/apache/spark/internal/config/package.scala#L503-L511 ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-21033 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21077.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21077 commit a7de7c6d29508cfdcc9c7cb66fa1648c6b4b Author: Yuming Wang <yumwang@...> Date: 2018-04-16T08:12:53Z SparkEnv.get.conf.getLong("spark.shuffle.spill.numElementsForceSpillThreshold", Long.MaxValue) -> SparkEnv.get.conf.get(SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
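For context, a hedged sketch of how such a typed config entry is usually declared with Spark's internal config builder; the real entry also carries docs and validation, so treat this as illustrative only. It shows why the typed accessor's default is `Integer.MAX_VALUE` rather than `Long.MaxValue`.
```scala
import org.apache.spark.internal.config.ConfigBuilder

// Illustrative declaration; the actual entry lives in
// org.apache.spark.internal.config.package and includes docs/checks.
private[spark] val SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD =
  ConfigBuilder("spark.shuffle.spill.numElementsForceSpillThreshold")
    .intConf
    .createWithDefault(Integer.MAX_VALUE)
```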
[GitHub] spark pull request #21010: [SPARK-23900][SQL] format_number support user spe...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21010 [SPARK-23900][SQL] format_number support user specifed format as argument ## What changes were proposed in this pull request? `format_number` supports a user-specified format as argument. For example:
```sql
spark-sql> SELECT format_number(12332.123456, '##.###');
12332.123
```
## How was this patch tested? unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-23900 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21010.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21010 commit 202fa3d505ed4395858a277c9b871006b2f64483 Author: Yuming Wang <yumwang@...> Date: 2018-04-09T14:28:44Z format_number udf should take user specifed format as argument --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20944 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20944 cc @jerryshao --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20944: [SPARK-23831][SQL] Add org.apache.derby to Isolat...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20944#discussion_r179158019 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -188,6 +188,9 @@ private[hive] class IsolatedClientLoader( (name.startsWith("com.google") && !name.startsWith("com.google.cloud")) || name.startsWith("java.lang.") || name.startsWith("java.net") || +name.startsWith("com.sun.") || +name.startsWith("sun.reflect.") || --- End diff -- Yes, it doesn't matter if we add these two lines, but I think it's best to add them. What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/18853 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20944: [SPARK-23831][SQL] Add org.apache.derby to Isolat...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/20944 [SPARK-23831][SQL] Add org.apache.derby to IsolatedClientLoader ## What changes were proposed in this pull request? Add `org.apache.derby` to `IsolatedClientLoader`, otherwise it may throw an exception:
```
[info] Cause: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2439ab23, see the next exception for details.
[info] at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
[info] at org.apache.derby.impl.jdbc.EmbedConnection.(Unknown Source)
[info] at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
```
How to reproduce:
```bash
sed 's/HiveExternalCatalogSuite/HiveExternalCatalog2Suite/g' sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala > sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalog2Suite.scala
build/sbt -Phive "hive/test-only *.HiveExternalCatalogSuite *.HiveExternalCatalog2Suite"
```
## How was this patch tested? manual tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-23831 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20944.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20944 commit 7d5cc71e4753f26fed4563a0eef27aa9de173d57 Author: Yuming Wang <yumwang@...> Date: 2018-03-30T10:41:42Z Add org.apache.derby to IsolatedClientLoader --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20785: [SPARK-23640][CORE] Fix hadoop config may override spark...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20785 Ping @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20898: [SPARK-23789][SQL] Shouldn't set hive.metastore.u...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/20898 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20898: [SPARK-23789][SQL] Shouldn't set hive.metastore.uris bef...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20898 It looks like --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20866#discussion_r176906868 --- Diff: core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala --- @@ -92,8 +93,8 @@ private[security] class HiveDelegationTokenProvider s"$principal at $metastoreUri") doAsRealUser { -val hive = Hive.get(conf, classOf[HiveConf]) --- End diff -- 1. This [`Hive.get()`](https://github.com/apache/spark/blob/v2.3.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L239) is different from the others; it is loaded by `IsolatedClientLoader`. 2. I cannot start a `HiveThriftServer2` in a kerberized cluster, so I'm not sure whether `CLIService.java` should be updated. How about updating it later? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20866#discussion_r176906522 --- Diff: core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala --- @@ -92,8 +93,8 @@ private[security] class HiveDelegationTokenProvider s"$principal at $metastoreUri") doAsRealUser { -val hive = Hive.get(conf, classOf[HiveConf]) -val tokenStr = hive.getDelegationToken(currentUser.getUserName(), principal) +metaStoreClient = new HiveMetaStoreClient(conf.asInstanceOf[HiveConf]) +val tokenStr = metaStoreClient.getDelegationToken(currentUser.getUserName, principal) --- End diff -- Yes, both HMS 1.x and 2.x --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20866#discussion_r176906474 --- Diff: core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala --- @@ -92,8 +94,9 @@ private[security] class HiveDelegationTokenProvider s"$principal at $metastoreUri") doAsRealUser { -val hive = Hive.get(conf, classOf[HiveConf]) -val tokenStr = hive.getDelegationToken(currentUser.getUserName(), principal) +metastoreClient = RetryingMetaStoreClient.getProxy(conf.asInstanceOf[HiveConf], null, --- End diff -- HiveMetaStoreClient -> RetryingMetaStoreClient. In fact, `Hive.get` also uses `RetryingMetaStoreClient`:
```
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20898: [SPARK-23789][SQL] Shouldn't set hive.metastore.uris bef...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20898 Yes, it's a proxy user:
```
export HADOOP_PROXY_USER=user
spark-sql --master yarn
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20898: [SPARK-23789][SQL] Shouldn't set hive.metastore.uris bef...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20898 cc @yaooqinn @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20898: [SPARK-23789][SQL] Shouldn't set hive.metastore.u...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/20898 [SPARK-23789][SQL] Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider ## What changes were proposed in this pull request? `spark-sql` can't connect to the metastore on a secured Hadoop cluster after [SPARK-21428](https://issues.apache.org/jira/browse/SPARK-21428). `hive.metastore.uris` was `HiveConf.ConfVars.METASTOREURIS.defaultStrVal` here before SPARK-21428. This pr reverts `hive.metastore.uris` to `HiveConf.ConfVars.METASTOREURIS.defaultStrVal`. ## How was this patch tested? manual tests with a secured Hadoop cluster You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-23789 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20898.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20898 commit a382aab4f3a9cabda10ab2aedbbb8d663737348f Author: Yuming Wang <yumwang@...> Date: 2018-03-24T07:19:25Z Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20866#discussion_r176902474 --- Diff: core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala --- @@ -92,8 +93,8 @@ private[security] class HiveDelegationTokenProvider s"$principal at $metastoreUri") doAsRealUser { -val hive = Hive.get(conf, classOf[HiveConf]) --- End diff -- Thanks @dongjoon-hyun, seems `RetryingMetaStoreClient` is a better choice and I will try. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20867: Spark 23759
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20867 Please update the title. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20866#discussion_r175859248 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -908,11 +912,39 @@ private[hive] object HiveClientImpl { Utils.classForName(name) .asInstanceOf[Class[_ <: org.apache.hadoop.hive.ql.io.HiveOutputFormat[_, _]]] + private def toHiveMetaApiTable(table: CatalogTable): HiveMetaApiTable = { --- End diff -- Copy from Hive: https://github.com/apache/hive/blob/rel/release-2.3.2/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java#L149 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20866#discussion_r175858111 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -908,11 +912,39 @@ private[hive] object HiveClientImpl { Utils.classForName(name) .asInstanceOf[Class[_ <: org.apache.hadoop.hive.ql.io.HiveOutputFormat[_, _]]] + private def toHiveMetaApiTable(table: CatalogTable): HiveMetaApiTable = { +val sd = new StorageDescriptor +sd.setSerdeInfo(new SerDeInfo) +sd.setNumBuckets(-1) +sd.setBucketCols(new JArrayList[String]) +sd.setCols(new JArrayList[FieldSchema]) +sd.setParameters(new JHashMap[String, String]) +sd.setSortCols(new JArrayList[Order]) +sd.getSerdeInfo.setParameters(new JHashMap[String, String]) +sd.getSerdeInfo.getParameters.put(serdeConstants.SERIALIZATION_FORMAT, "1") +sd.setInputFormat(classOf[SequenceFileInputFormat[_, _]].getName) +sd.setOutputFormat(classOf[HiveSequenceFileOutputFormat[_, _]].getName) +val skewInfo: SkewedInfo = new SkewedInfo +skewInfo.setSkewedColNames(new JArrayList[String]) +skewInfo.setSkewedColValues(new JArrayList[JList[String]]) +skewInfo.setSkewedColValueLocationMaps(new JHashMap[JList[String], String]) +sd.setSkewedInfo(skewInfo) + +val apiTable = new HiveMetaApiTable() +apiTable.setSd(sd) +apiTable.setPartitionKeys(new JArrayList[FieldSchema]) +apiTable.setParameters(new JHashMap[String, String]) +apiTable.setTableType(HiveTableType.MANAGED_TABLE.toString) +apiTable.setDbName(table.database) +apiTable.setTableName(table.identifier.table) +apiTable + } + /** * Converts the native table metadata representation format CatalogTable to Hive's Table. */ def toHiveTable(table: CatalogTable, userName: Option[String] = None): HiveTable = { -val hiveTable = new HiveTable(table.database, table.identifier.table) +val hiveTable = new HiveTable(toHiveMetaApiTable(table)) --- End diff -- Avoid [`t.setOwner(SessionState.getUserFromAuthenticator())`](https://github.com/apache/hive/blob/rel/release-2.3.2/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java#L180), because it will connect to [Metastore](https://github.com/apache/hive/blob/rel/release-2.3.2/ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java#L913), and we will set owner later: https://github.com/apache/spark/blob/v2.3.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L914 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/20866 [SPARK-23749][SQL] Avoid Hive.get() to compatible with different Hive metastore ## What changes were proposed in this pull request? Avoid `Hive.get()` to be compatible with different Hive metastore versions. ## How was this patch tested? Existing unit tests and manual tests with a secured Hadoop cluster You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-23749 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20866.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20866 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20659: [DO-NOT-MERGE] Try to update Hive to 2.3.2
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20659 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org