[GitHub] spark pull request #21677: [SPARK-24692][TESTS] Improvement FilterPushdownBe...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21677#discussion_r200260115 --- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt --- @@ -0,0 +1,556 @@ +[ Pushdown for many distinct value case ] --- End diff -- How about this?

```
...
Select all int rows (value != -1):            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 1140 / 1165          0.9        1087.4       1.0X
Parquet Vectorized (Pushdown)                      1140 / 1172          0.9        1086.8       1.0X
Native ORC Vectorized                              1158 / 1206          0.9        1104.7       1.0X
Native ORC Vectorized (Pushdown)                   1151 / 1220          0.9        1098.1       1.0X

Pushdown for few distinct value case (use dictionary encoding)

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Select 0 distinct string row (value IS NULL): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  512 /  565          2.0         488.6       1.0X
Parquet Vectorized (Pushdown)                        27 /   33         39.3          25.5      19.2X
Native ORC Vectorized                               509 /  546          2.1         485.0       1.0X
Native ORC Vectorized (Pushdown)                     79 /   91         13.2          75.5       6.5X
...
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21677: [SPARK-24692][TESTS] Improvement FilterPushdownBe...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21677#discussion_r200246934 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -427,16 +245,122 @@ object FilterPushdownBenchmark { } } } + } -// Pushdown for few distinct value case (use dictionary encoding) + ignore("Pushdown for few distinct value case (use dictionary encoding)") { withTempPath { dir => val numDistinctValues = 200 - val mid = numDistinctValues / 2 withTempTable("orcTable", "patquetTable") { prepareStringDictTable(dir, numRows, numDistinctValues, width) -runStringBenchmark(numRows, width, mid, "distinct string") +runStringBenchmark(numRows, width, numDistinctValues / 2, "distinct string") + } +} + } + + ignore("Pushdown benchmark for StringStartsWith") { --- End diff -- Yes --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21556#discussion_r200246162 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -82,6 +120,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: (n: String, v: Any) => FilterApi.eq( intColumn(n), Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull) + +case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal => --- End diff -- It seems invalid values are already filtered out by: https://github.com/apache/spark/blob/e76b0124fbe463def00b1dffcfd8fd47e04772fe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L439 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21677: [SPARK-24692][TESTS] Improvement FilterPushdownBenchmark
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21677 @HyukjinKwon Could you merge this to master first? I would like to update the [Benchmark results](https://github.com/apache/spark/pull/21677/files#diff-c5c0bfc86983d5779269cf75da8ed645) in several other pushdown-related PRs accordingly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21556#discussion_r200170975 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -82,6 +120,30 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: (n: String, v: Any) => FilterApi.eq( intColumn(n), Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull) + +case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal => --- End diff -- DecimalType carries a per-column `decimalMetadata` (precision and scale), so it seems difficult to define it as a constant like the other types. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21696: [SPARK-24716][SQL] Refactor ParquetFilters
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21696 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21696: [SPARK-24716][SQL] Refactor ParquetFilters
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21696#discussion_r200011264 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -19,187 +19,200 @@ package org.apache.spark.sql.execution.datasources.parquet import java.sql.Date +import scala.collection.JavaConverters.asScalaBufferConverter + import org.apache.parquet.filter2.predicate._ import org.apache.parquet.filter2.predicate.FilterApi._ import org.apache.parquet.io.api.Binary -import org.apache.parquet.schema.PrimitiveComparator +import org.apache.parquet.schema.{DecimalMetadata, MessageType, OriginalType, PrimitiveComparator, PrimitiveType} +import org.apache.parquet.schema.OriginalType._ +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._ import org.apache.spark.sql.catalyst.util.DateTimeUtils import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLDate import org.apache.spark.sql.sources -import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String /** * Some utility function to convert Spark data source filters to Parquet filters. */ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: Boolean) { + private case class ParquetSchemaType( + originalType: OriginalType, + primitiveTypeName: PrimitiveTypeName, + decimalMetadata: DecimalMetadata) + + private val ParquetBooleanType = ParquetSchemaType(null, BOOLEAN, null) + private val ParquetIntegerType = ParquetSchemaType(null, INT32, null) + private val ParquetLongType = ParquetSchemaType(null, INT64, null) + private val ParquetFloatType = ParquetSchemaType(null, FLOAT, null) + private val ParquetDoubleType = ParquetSchemaType(null, DOUBLE, null) + private val ParquetStringType = ParquetSchemaType(UTF8, BINARY, null) + private val ParquetBinaryType = ParquetSchemaType(null, BINARY, null) + private val ParquetDateType = ParquetSchemaType(DATE, INT32, null) + private def dateToDays(date: Date): SQLDate = { DateTimeUtils.fromJavaDate(date) } - private val makeEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { -case BooleanType => + private val makeEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = { +case ParquetBooleanType => (n: String, v: Any) => FilterApi.eq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) -case IntegerType => +case ParquetIntegerType => (n: String, v: Any) => FilterApi.eq(intColumn(n), v.asInstanceOf[Integer]) -case LongType => +case ParquetLongType => (n: String, v: Any) => FilterApi.eq(longColumn(n), v.asInstanceOf[java.lang.Long]) -case FloatType => +case ParquetFloatType => (n: String, v: Any) => FilterApi.eq(floatColumn(n), v.asInstanceOf[java.lang.Float]) -case DoubleType => +case ParquetDoubleType => (n: String, v: Any) => FilterApi.eq(doubleColumn(n), v.asInstanceOf[java.lang.Double]) // Binary.fromString and Binary.fromByteArray don't accept null values -case StringType => +case ParquetStringType => (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(s => Binary.fromString(s.asInstanceOf[String])).orNull) -case BinaryType => +case ParquetBinaryType => (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull) -case DateType if pushDownDate => +case ParquetDateType if pushDownDate => (n: String, v: Any) => FilterApi.eq( intColumn(n), Option(v).map(date => 
dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull) } - private val makeNotEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { -case BooleanType => + private val makeNotEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = { +case ParquetBooleanType => (n: String, v: Any) => FilterApi.notEq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) -case IntegerType => +case ParquetIntegerType => (n: String, v: Any) => FilterApi.notEq(intColumn(n), v.asInstanceOf[Integer]) -case LongType => +case ParquetLongType => (n: String, v: Any) => FilterApi.notEq(longColumn(n), v.asInstanceOf[java.lang.Long])
[GitHub] spark pull request #21696: [SPARK-24716][SQL] Refactor ParquetFilters
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21696#discussion_r19002 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -19,166 +19,186 @@ package org.apache.spark.sql.execution.datasources.parquet import java.sql.Date +import scala.collection.JavaConverters._ + import org.apache.parquet.filter2.predicate._ import org.apache.parquet.filter2.predicate.FilterApi._ import org.apache.parquet.io.api.Binary -import org.apache.parquet.schema.PrimitiveComparator +import org.apache.parquet.schema._ +import org.apache.parquet.schema.OriginalType._ +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._ import org.apache.spark.sql.catalyst.util.DateTimeUtils import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLDate import org.apache.spark.sql.sources -import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String /** * Some utility function to convert Spark data source filters to Parquet filters. */ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: Boolean) { + case class ParquetSchemaType( + originalType: OriginalType, + primitiveTypeName: PrimitiveType.PrimitiveTypeName, + decimalMetadata: DecimalMetadata) + private def dateToDays(date: Date): SQLDate = { DateTimeUtils.fromJavaDate(date) } - private val makeEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { -case BooleanType => + private val makeEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = { +// BooleanType +case ParquetSchemaType(null, BOOLEAN, null) => (n: String, v: Any) => FilterApi.eq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) -case IntegerType => +// IntegerType +case ParquetSchemaType(null, INT32, null) => (n: String, v: Any) => FilterApi.eq(intColumn(n), v.asInstanceOf[Integer]) -case LongType => +// LongType +case ParquetSchemaType(null, INT64, null) => (n: String, v: Any) => FilterApi.eq(longColumn(n), v.asInstanceOf[java.lang.Long]) -case FloatType => +// FloatType +case ParquetSchemaType(null, FLOAT, null) => (n: String, v: Any) => FilterApi.eq(floatColumn(n), v.asInstanceOf[java.lang.Float]) -case DoubleType => +// DoubleType +case ParquetSchemaType(null, DOUBLE, null) => (n: String, v: Any) => FilterApi.eq(doubleColumn(n), v.asInstanceOf[java.lang.Double]) +// StringType // Binary.fromString and Binary.fromByteArray don't accept null values -case StringType => +case ParquetSchemaType(UTF8, BINARY, null) => (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(s => Binary.fromString(s.asInstanceOf[String])).orNull) -case BinaryType => +// BinaryType +case ParquetSchemaType(null, BINARY, null) => (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull) -case DateType if pushDownDate => +// DateType +case ParquetSchemaType(DATE, INT32, null) if pushDownDate => (n: String, v: Any) => FilterApi.eq( intColumn(n), Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull) } - private val makeNotEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { -case BooleanType => + private val makeNotEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = { +case ParquetSchemaType(null, BOOLEAN, null) => (n: String, v: Any) => FilterApi.notEq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) -case IntegerType => +case ParquetSchemaType(null, INT32, null) => 
(n: String, v: Any) => FilterApi.notEq(intColumn(n), v.asInstanceOf[Integer]) -case LongType => +case ParquetSchemaType(null, INT64, null) => (n: String, v: Any) => FilterApi.notEq(longColumn(n), v.asInstanceOf[java.lang.Long]) -case FloatType => +case ParquetSchemaType(null, FLOAT, null) => (n: String, v: Any) => FilterApi.notEq(floatColumn(n), v.asInstanceOf[java.lang.Float]) -case DoubleType => +case ParquetSchemaType(null, DOUBLE, null) => (n: String, v: Any) => FilterApi.notEq(doubleColumn(n), v.asInstanceOf[java.lang.Double])
[GitHub] spark pull request #21682: [SPARK-24706][SQL] ByteType and ShortType support...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21682#discussion_r12294 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -42,6 +42,10 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: private val makeEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { case BooleanType => (n: String, v: Any) => FilterApi.eq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) +case ByteType | ShortType => + (n: String, v: Any) => FilterApi.eq( +intColumn(n), + Option(v).map(_.asInstanceOf[Number].intValue.asInstanceOf[Integer]).orNull) --- End diff -- value may be `null`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21682: [SPARK-24706][SQL] ByteType and ShortType support...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21682#discussion_r12316 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -93,6 +101,10 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: } private val makeLt: PartialFunction[DataType, (String, Any) => FilterPredicate] = { +case ByteType | ShortType => + (n: String, v: Any) => FilterApi.lt( +intColumn(n), +v.asInstanceOf[Number].intValue.asInstanceOf[Integer]) --- End diff -- value cannot be `null`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
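For context on the two null-handling comments above: in `ParquetFilters`, IS NULL / IS NOT NULL are pushed down through the eq/notEq builders with a `null` value, while the comparison builders only ever receive non-null literals. Below is a minimal sketch of that dispatch (helper names and the int-only case are illustrative assumptions, not the PR's code):

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}
import org.apache.spark.sql.sources

// Simplified sketch: why eq/notEq must tolerate null but lt/gt never see it.
def createFilterSketch(filter: sources.Filter): Option[FilterPredicate] = filter match {
  case sources.IsNull(name) =>
    // pushed down through the same builder as EqualTo, but with a null value
    Some(FilterApi.eq(FilterApi.intColumn(name), null.asInstanceOf[Integer]))
  case sources.LessThan(name, value: Integer) =>
    // the value of a comparison filter is always a non-null literal
    Some(FilterApi.lt(FilterApi.intColumn(name), value))
  case _ => None
}
```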
[GitHub] spark pull request #21682: [SPARK-24706][SQL] ByteType and ShortType support...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21682#discussion_r11024 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -42,6 +42,14 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: private val makeEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { case BooleanType => (n: String, v: Any) => FilterApi.eq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) +case ByteType => + (n: String, v: Any) => FilterApi.eq( +intColumn(n), +Option(v).map(b => b.asInstanceOf[java.lang.Byte].toInt.asInstanceOf[Integer]).orNull) +case ShortType => --- End diff -- How about something like this?

- `makeEq` and `makeNotEq`:

```scala
case ByteType | ShortType =>
  (n: String, v: Any) => FilterApi.notEq(
    intColumn(n),
    Option(v).map(_.asInstanceOf[Number].intValue.asInstanceOf[Integer]).orNull)
case IntegerType =>
  (n: String, v: Any) => FilterApi.notEq(intColumn(n), v.asInstanceOf[Integer])
```

- `makeLt`, `makeLtEq`, `makeGt` and `makeGtEq`:

```scala
case ByteType | ShortType =>
  (n: String, v: Any) => FilterApi.gtEq(
    intColumn(n),
    v.asInstanceOf[Number].intValue.asInstanceOf[Integer])
case IntegerType =>
  (n: String, v: Any) => FilterApi.gtEq(intColumn(n), v.asInstanceOf[java.lang.Integer])
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21682: [SPARK-24706][SQL] ByteType and ShortType support...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21682#discussion_r199986187 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -42,6 +42,14 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: private val makeEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { case BooleanType => (n: String, v: Any) => FilterApi.eq(booleanColumn(n), v.asInstanceOf[java.lang.Boolean]) +case ByteType => + (n: String, v: Any) => FilterApi.eq( +intColumn(n), +Option(v).map(b => b.asInstanceOf[java.lang.Byte].toInt.asInstanceOf[Integer]).orNull) --- End diff --

```scala
scala> null.asInstanceOf[Short].toInt.asInstanceOf[Integer]
res49: Integer = 0

scala> null.asInstanceOf[java.lang.Short].toInt.asInstanceOf[Integer]
java.lang.NullPointerException
  at scala.Predef$.Short2short(Predef.scala:360)
  ... 51 elided
```

That's why I use `Option.map` here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
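A tiny null-safe variant of that conversion, shown only as an illustration (the helper name is made up; this is not code from the PR):

```scala
// Convert a boxed numeric filter value to java.lang.Integer while keeping null intact,
// avoiding the unboxing NullPointerException shown in the REPL session above.
def toIntegerOrNull(v: Any): Integer =
  Option(v).map(_.asInstanceOf[Number].intValue.asInstanceOf[Integer]).orNull

toIntegerOrNull(java.lang.Short.valueOf("3"))  // Integer = 3
toIntegerOrNull(null)                          // Integer = null
```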
[GitHub] spark pull request #21696: [SPARK-24716][SQL] Refactor ParquetFilters
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21696#discussion_r199672805 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -19,166 +19,186 @@ package org.apache.spark.sql.execution.datasources.parquet import java.sql.Date +import scala.collection.JavaConverters._ + import org.apache.parquet.filter2.predicate._ import org.apache.parquet.filter2.predicate.FilterApi._ import org.apache.parquet.io.api.Binary -import org.apache.parquet.schema.PrimitiveComparator +import org.apache.parquet.schema._ +import org.apache.parquet.schema.OriginalType._ +import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName._ import org.apache.spark.sql.catalyst.util.DateTimeUtils import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLDate import org.apache.spark.sql.sources -import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String /** * Some utility function to convert Spark data source filters to Parquet filters. */ private[parquet] class ParquetFilters(pushDownDate: Boolean, pushDownStartWith: Boolean) { + case class ParquetSchemaType( + originalType: OriginalType, + primitiveTypeName: PrimitiveType.PrimitiveTypeName, + decimalMetadata: DecimalMetadata) + private def dateToDays(date: Date): SQLDate = { DateTimeUtils.fromJavaDate(date) } - private val makeEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = { -case BooleanType => + private val makeEq: PartialFunction[ParquetSchemaType, (String, Any) => FilterPredicate] = { +// BooleanType +case ParquetSchemaType(null, BOOLEAN, null) => --- End diff -- Mapping type reference: https://github.com/apache/spark/blob/21a7bfd5c324e6c82152229f1394f26afeae771c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L338-L560 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21696: [SPARK-24716][SQL] Refactor ParquetFilters
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21696 cc @gatorsmile @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21556: [SPARK-24549][SQL] Support Decimal type push down...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21556#discussion_r199442189 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -359,6 +369,70 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - decimal") { +Seq(true, false).foreach { legacyFormat => + withSQLConf(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key -> legacyFormat.toString) { +Seq(s"_1 decimal(${Decimal.MAX_INT_DIGITS}, 2)", // 32BitDecimalType + s"_1 decimal(${Decimal.MAX_LONG_DIGITS}, 2)", // 64BitDecimalType + "_1 decimal(38, 18)" // ByteArrayDecimalType +).foreach { schemaDDL => + val schema = StructType.fromDDL(schemaDDL) + val rdd = +spark.sparkContext.parallelize((1 to 4).map(i => Row(new java.math.BigDecimal(i + val dataFrame = spark.createDataFrame(rdd, schema) + testDecimalPushDown(dataFrame) { implicit df => +assert(df.schema === schema) +checkFilterPredicate('_1.isNull, classOf[Eq[_]], Seq.empty[Row]) +checkFilterPredicate('_1.isNotNull, classOf[NotEq[_]], (1 to 4).map(Row.apply(_))) + +checkFilterPredicate('_1 === 1, classOf[Eq[_]], 1) +checkFilterPredicate('_1 <=> 1, classOf[Eq[_]], 1) +checkFilterPredicate('_1 =!= 1, classOf[NotEq[_]], (2 to 4).map(Row.apply(_))) + +checkFilterPredicate('_1 < 2, classOf[Lt[_]], 1) +checkFilterPredicate('_1 > 3, classOf[Gt[_]], 4) +checkFilterPredicate('_1 <= 1, classOf[LtEq[_]], 1) +checkFilterPredicate('_1 >= 4, classOf[GtEq[_]], 4) + +checkFilterPredicate(Literal(1) === '_1, classOf[Eq[_]], 1) +checkFilterPredicate(Literal(1) <=> '_1, classOf[Eq[_]], 1) +checkFilterPredicate(Literal(2) > '_1, classOf[Lt[_]], 1) +checkFilterPredicate(Literal(3) < '_1, classOf[Gt[_]], 4) +checkFilterPredicate(Literal(1) >= '_1, classOf[LtEq[_]], 1) +checkFilterPredicate(Literal(4) <= '_1, classOf[GtEq[_]], 4) + +checkFilterPredicate(!('_1 < 4), classOf[GtEq[_]], 4) +checkFilterPredicate('_1 < 2 || '_1 > 3, classOf[Operators.Or], Seq(Row(1), Row(4))) + } +} + } +} + } + + test("incompatible parquet file format will throw exeception") { --- End diff -- I have created a PR: https://github.com/apache/spark/pull/21696. After that PR, decimal support should look like this: https://github.com/wangyum/spark/blob/refactor-decimal-pushdown/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L118-L146 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21696: [SPARK-24716][SQL] Refactor ParquetFilters
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21696 [SPARK-24716][SQL] Refactor ParquetFilters ## What changes were proposed in this pull request? Replace the DataFrame schema with the Parquet file schema when creating `ParquetFilters`. More details will be added later. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24716 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21696.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21696 commit 10d408fd3fe429d5529755b5abebac86b22b6d55 Author: Yuming Wang Date: 2018-07-02T09:36:46Z Refactor ParquetFilters --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
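A minimal, self-contained sketch of the idea in this PR description (helper names are assumptions, not the PR's exact code): describe each column by what the Parquet footer says rather than by the DataFrame schema, and build the pushdown lookup from the file's `MessageType`:

```scala
import scala.collection.JavaConverters._
import org.apache.parquet.schema.{DecimalMetadata, MessageType, OriginalType}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName

case class ParquetSchemaType(
    originalType: OriginalType,
    primitiveTypeName: PrimitiveTypeName,
    decimalMetadata: DecimalMetadata)

// Build a column-name -> Parquet physical/logical type map from the file schema, so
// filter construction matches what is actually stored in the file.
def primitiveFields(fileSchema: MessageType): Map[String, ParquetSchemaType] =
  fileSchema.getFields.asScala.filter(_.isPrimitive).map(_.asPrimitiveType()).map { f =>
    f.getName -> ParquetSchemaType(f.getOriginalType, f.getPrimitiveTypeName, f.getDecimalMetadata)
  }.toMap
```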
[GitHub] spark pull request #21682: [SPARK-24706][SQL] ByteType and ShortType support...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21682 [SPARK-24706][SQL] ByteType and ShortType support pushdown to parquet ## What changes were proposed in this pull request? `ByteType` and `ShortType` support push down to the Parquet data source. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24706 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21682.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21682 commit e9d56252e6c65f5afa207bc98c8c5e008de57e0c Author: Yuming Wang Date: 2018-06-30T19:13:13Z ByteType and ShortType pushdown to parquet --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
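A quick way to exercise the change (the path and values are made up for illustration): byte and short columns are stored as INT32 in Parquet, so their filters have to be built against an int column:

```scala
// Write a table with byte/short columns, then filter it; with this PR the predicates
// below can be pushed down to the Parquet reader instead of only being evaluated in Spark.
spark.range(100)
  .selectExpr("cast(id as byte) as b", "cast(id as short) as s")
  .write.mode("overwrite").parquet("/tmp/spark/parquet/byte_short")

spark.read.parquet("/tmp/spark/parquet/byte_short")
  .filter("b > 10 and s <= 50")
  .count()
```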
[GitHub] spark issue #21681: Pin tag 210
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21681 @zhangchj1990 This looks like it was opened by mistake. Would you mind closing it, please? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21603: [SPARK-17091][SQL] Add rule to convert IN predicate to e...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21603 Benchmark result:

```
##[ Pushdown benchmark for InSet -> InFilters ]##

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

InSet -> InFilters (threshold: 10, values count: 5, distribution: 10):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7649 / 7678          2.1         486.3       1.0X
Parquet Vectorized (Pushdown)                       316 /  325         49.8          20.1      24.2X
Native ORC Vectorized                              6787 / 7353          2.3         431.5       1.1X
Native ORC Vectorized (Pushdown)                   1010 / 1020         15.6          64.2       7.6X

InSet -> InFilters (threshold: 10, values count: 5, distribution: 50):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7537 / 7944          2.1         479.2       1.0X
Parquet Vectorized (Pushdown)                       297 /  306         52.9          18.9      25.3X
Native ORC Vectorized                              6768 / 6779          2.3         430.3       1.1X
Native ORC Vectorized (Pushdown)                    998 / 1017         15.8          63.4       7.6X

InSet -> InFilters (threshold: 10, values count: 5, distribution: 90):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7500 / 7592          2.1         476.8       1.0X
Parquet Vectorized (Pushdown)                       299 /  306         52.5          19.0      25.1X
Native ORC Vectorized                              6758 / 6797          2.3         429.7       1.1X
Native ORC Vectorized (Pushdown)                    982 /  993         16.0          62.4       7.6X

InSet -> InFilters (threshold: 10, values count: 10, distribution: 10): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7566 / 8153          2.1         481.1       1.0X
Parquet Vectorized (Pushdown)                       319 /  328         49.3          20.3      23.7X
Native ORC Vectorized                              6761 / 6812          2.3         429.8       1.1X
Native ORC Vectorized (Pushdown)                    995 / 1013         15.8          63.3       7.6X

InSet -> InFilters (threshold: 10, values count: 10, distribution: 50): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7512 / 7581          2.1         477.6       1.0X
Parquet Vectorized (Pushdown)                       315 /  322         50.0          20.0      23.9X
Native ORC Vectorized                              6712 / 6774          2.3         426.8       1.1X
Native ORC Vectorized (Pushdown)                   1001 / 1032         15.7          63.6       7.5X

InSet -> InFilters (threshold: 10, values count: 10, distribution: 90): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7603 / 7689          2.1         483.4       1.0X
Parquet Vectorized (Pushdown)                       308 /  317         51.0          19.6      24.7X
Native ORC Vectorized                              7011 / 7605          2.2         445.7       1.1X
Native ORC Vectorized (Pushdown)                   1038 / 1067         15.2          66.0       7.3X

InSet -> InFilters (threshold: 10, values count: 50, distribution: 10): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 7750 / 7796          2.0         492.7       1.0X
Parquet Vectorized (Pushdown)                      7855 / 7961          2.0         499.4       1.0X
Native ORC Vectorized                              7120 / 7820          2.2         452.7       1.1X
Native ORC Vectorized (Pushdown)                   1085 / 1122         14.5          69.0       7.1X

InSet -> InFilters (threshold: 10, values count: 50, distribution: 50): Best/Avg Time(ms)    Rate(M/
[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21623 Benchmark result:

```
###[ Pushdown benchmark for StringStartsWith ]###

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

StringStartsWith filter: (value like '10%'):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------
Parquet Vectorized                                10104 / 11125          1.6         642.4       1.0X
Parquet Vectorized (Pushdown)                      3002 /  3608          5.2         190.8       3.4X
Native ORC Vectorized                              9589 / 10454          1.6         609.7       1.1X
Native ORC Vectorized (Pushdown)                   9798 / 10509          1.6         622.9       1.0X

StringStartsWith filter: (value like '1000%'):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 8437 / 8563           1.9         536.4       1.0X
Parquet Vectorized (Pushdown)                       279 /  289          56.3          17.8      30.2X
Native ORC Vectorized                              7354 / 7568           2.1         467.5       1.1X
Native ORC Vectorized (Pushdown)                   7730 / 7972           2.0         491.4       1.1X

StringStartsWith filter: (value like '786432%'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 8290 / 8510           1.9         527.0       1.0X
Parquet Vectorized (Pushdown)                       260 /  272          60.5          16.5      31.9X
Native ORC Vectorized                              7361 / 7395           2.1         468.0       1.1X
Native ORC Vectorized (Pushdown)                   7694 / 7811           2.0         489.2       1.1X
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21677: [SPARK-24692][TESTS] Improvement FilterPushdownBenchmark
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21677 cc @maropu --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21556 Benchmark results:

```
###[ Pushdown benchmark for Decimal ]

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Select 1 decimal(9, 2) row (value = 7864320):      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  4004 / 5309           3.9         254.5       1.0X
Parquet Vectorized (Pushdown)                       1401 / 1431          11.2          89.1       2.9X
Native ORC Vectorized                               4499 / 4567           3.5         286.0       0.9X
Native ORC Vectorized (Pushdown)                     899 /  961          17.5          57.2       4.5X

Select 10% decimal(9, 2) rows (value < 1572864):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  5376 / 6437           2.9         341.8       1.0X
Parquet Vectorized (Pushdown)                       2696 / 2754           5.8         171.4       2.0X
Native ORC Vectorized                               5458 / 5623           2.9         347.0       1.0X
Native ORC Vectorized (Pushdown)                    2230 / 2255           7.1         141.8       2.4X

Select 50% decimal(9, 2) rows (value < 7864320):   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  8280 / 8487           1.9         526.4       1.0X
Parquet Vectorized (Pushdown)                       7716 / 7757           2.0         490.6       1.1X
Native ORC Vectorized                               9144 / 9495           1.7         581.4       0.9X
Native ORC Vectorized (Pushdown)                    7918 / 8118           2.0         503.4       1.0X

Select 90% decimal(9, 2) rows (value < 14155776):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  9648 / 9676           1.6         613.4       1.0X
Parquet Vectorized (Pushdown)                       9647 / 9778           1.6         613.3       1.0X
Native ORC Vectorized                              10782 / 10867          1.5         685.5       0.9X
Native ORC Vectorized (Pushdown)                   10108 / 10269          1.6         642.6       1.0X

Select 1 decimal(18, 2) row (value = 7864320):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  4066 / 4147           3.9         258.5       1.0X
Parquet Vectorized (Pushdown)                         84 /   89         188.0           5.3      48.6X
Native ORC Vectorized                               5430 / 5512           2.9         345.3       0.7X
Native ORC Vectorized (Pushdown)                    1054 / 1076          14.9          67.0       3.9X

Select 10% decimal(18, 2) rows (value < 1572864):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  5028 / 5154           3.1         319.7       1.0X
Parquet Vectorized (Pushdown)                       1360 / 1421          11.6          86.5       3.7X
Native ORC Vectorized                               6266 / 6360           2.5         398.4       0.8X
Native ORC Vectorized (Pushdown)                    2513 / 2550           6.3         159.8       2.0X

Select 50% decimal(18, 2) rows (value < 7864320):  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                  8571 / 8600           1.8         544.9       1.0X
Parquet Vectorized (Pushdown)                       6455 / 6713           2.4         410.4       1.3X
Native ORC Vectorized                              10138 / 10353          1.6         644.5       0.8X
Native ORC Vectorized (Pushdown)                    8166 / 8418           1.9         519.2       1.0X

Select 90% decimal(18, 2) rows (value < 14155776): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                 12184 /
[GitHub] spark pull request #21677: [SPARK-24692][TESTS] Improvement FilterPushdownBe...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21677 [SPARK-24692][TESTS] Improvement FilterPushdownBenchmark ## What changes were proposed in this pull request? 1. Write the results to `benchmarks/FilterPushdownBenchmark-results.txt` for easier maintenance. 2. Add more benchmark cases: `StringStartsWith`, `Decimal` and `InSet -> InFilters`. ## How was this patch tested? manual tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24692 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21677.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21677 commit ccdd21cfa75f8577b5f8093c8e0b1eba6aa2e055 Author: Yuming Wang Date: 2018-06-30T00:22:16Z Improvement FilterPushdownBenchmark --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
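A rough sketch of point 1 above (not the PR's exact mechanism; the suite-runner call is a placeholder): capture the benchmark's console output and save it under `benchmarks/` so the results can be reviewed as a plain file diff:

```scala
import java.io.{ByteArrayOutputStream, File, PrintWriter}

// Collect everything the benchmark prints, then persist it as the results file.
val buffer = new ByteArrayOutputStream()
Console.withOut(buffer) {
  // runFilterPushdownBenchmarks()  // placeholder: run all benchmark cases here
}
val writer = new PrintWriter(new File("benchmarks/FilterPushdownBenchmark-results.txt"))
try writer.write(buffer.toString) finally writer.close()
```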
[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21623#discussion_r199116993 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -660,6 +688,62 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex assert(df.where("col > 0").count() === 2) } } + + test("filter pushdown - StringStartsWith") { +withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df => + // Test canDrop() --- End diff -- Both methods are executed, but it cannot be confirmed which one actually took effect. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21623 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21623#discussion_r199043411 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -660,6 +661,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex assert(df.where("col > 0").count() === 2) } } + + test("filter pushdown - StringStartsWith") { +withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df => --- End diff -- Added `testStringStartsWith` so the tests go exactly through `canDrop` and `inverseCanDrop`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21623#discussion_r199043210 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -22,16 +22,23 @@ import java.sql.Date import org.apache.parquet.filter2.predicate._ import org.apache.parquet.filter2.predicate.FilterApi._ import org.apache.parquet.io.api.Binary +import org.apache.parquet.schema.PrimitiveComparator import org.apache.spark.sql.catalyst.util.DateTimeUtils import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLDate +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.sources import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String /** * Some utility function to convert Spark data source filters to Parquet filters. */ -private[parquet] class ParquetFilters(pushDownDate: Boolean) { +private[parquet] class ParquetFilters() { + + val sqlConf: SQLConf = SQLConf.get --- End diff -- You are right. I hit a bug here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21547: [SPARK-24538][SQL] ByteArrayDecimalType support p...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/21547 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21547: [SPARK-24538][SQL] ByteArrayDecimalType support push dow...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21547 Closing this because I have implemented it in [SPARK-24549](https://issues.apache.org/jira/browse/SPARK-24549). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21556: [SPARK-24549][SQL] Support Decimal type push down to the...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21556 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21603#discussion_r198146352 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) { case sources.Not(pred) => createFilter(schema, pred).map(FilterApi.not) + case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 => --- End diff -- I have prepared a test case so that you can verify it:

```scala
test("Benchmark") {
  def benchmark(func: () => Unit): Long = {
    val start = System.currentTimeMillis()
    func()
    val end = System.currentTimeMillis()
    end - start
  }
  // scalastyle:off
  withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    withTempPath { path =>
      Seq(100, 1000).foreach { count =>
        Seq(1048576, 10485760, 104857600).foreach { blockSize =>
          spark.range(count).toDF().selectExpr("id", "cast(id as string) as d1",
            "cast(id as double) as d2", "cast(id as float) as d3", "cast(id as int) as d4",
            "cast(id as decimal(38)) as d5")
            .coalesce(1).write.mode("overwrite")
            .option("parquet.block.size", blockSize).parquet(path.getAbsolutePath)
          val df = spark.read.parquet(path.getAbsolutePath)
          println(s"path: ${path.getAbsolutePath}")
          Seq(1000, 100, 10, 1).foreach { ratio =>
            println(s"##[ count: $count, blockSize: $blockSize, ratio: $ratio ]#")
            var i = 1
            while (i < 300) {
              val filter = Range(0, i).map(r => scala.util.Random.nextInt(count / ratio))
              i += 4
              sql("set spark.sql.parquet.pushdown.inFilterThreshold=1")
              val vanillaTime = benchmark(() => df.where(s"id in(${filter.mkString(",")})").count())
              sql("set spark.sql.parquet.pushdown.inFilterThreshold=1000")
              val pushDownTime = benchmark(() => df.where(s"id in(${filter.mkString(",")})").count())
              if (pushDownTime > vanillaTime) {
                println(s"vanilla is better, threshold: ${filter.size}, $pushDownTime, $vanillaTime")
              } else {
                println(s"push down is better, threshold: ${filter.size}, $pushDownTime, $vanillaTime")
              }
            }
          }
        }
      }
    }
  }
}
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21603: [SPARK-17091][SQL] Add rule to convert IN predicate to e...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21603 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21603#discussion_r198124578 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) { case sources.Not(pred) => createFilter(schema, pred).map(FilterApi.not) + case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 => --- End diff -- It mainly depends on how many row groups can be skipped. For a small table (assuming only one row group), there is no obvious difference. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
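An illustration of the point above (the path and sizes are made up): with a small `parquet.block.size` the file has many row groups, so an IN filter whose values hit only a few of them lets Parquet skip most of the data, while with a single row group nothing can be skipped and pushdown shows no obvious difference:

```scala
// Write ~1 MB row groups so the file contains many of them; the IN filter below can
// then discard row groups whose min/max statistics exclude all listed values.
spark.range(10000000L).toDF("id")
  .coalesce(1)
  .write.mode("overwrite")
  .option("parquet.block.size", 1024 * 1024)
  .parquet("/tmp/spark/parquet/in_pushdown")

spark.read.parquet("/tmp/spark/parquet/in_pushdown")
  .where("id in (1, 100, 10000)")
  .count()
```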
[GitHub] spark pull request #21641: [SPARK-24658][SQL] Remove workaround for ANTLR bu...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21641 [SPARK-24658][SQL] Remove workaround for ANTLR bug ## What changes were proposed in this pull request? Issue antlr/antlr4#781 has already been fixed, so the workaround of extracting the pattern into a separate rule is no longer needed. Presto has already removed it: https://github.com/prestodb/presto/pull/10744. ## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark ANTLR-780 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21641.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21641 commit e9a1b2cadbbd8dc34671c2051ee6ddfd5f637709 Author: Yuming Wang Date: 2018-06-26T05:18:17Z Remove workaround for ANTLR bug --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21623#discussion_r197992151 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -660,6 +660,30 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex assert(df.where("col > 0").count() === 2) } } + + test("filter pushdown - StringStartsWith") { +withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df => + Seq("2", "2s", "2st", "2str", "2str2").foreach { prefix => +checkFilterPredicate( + '_1.startsWith(prefix).asInstanceOf[Predicate], + classOf[UserDefinedByInstance[_, _]], + "2str2") + } + + Seq("2S", "null", "2str22").foreach { prefix => +checkFilterPredicate( + '_1.startsWith(prefix).asInstanceOf[Predicate], + classOf[UserDefinedByInstance[_, _]], + Seq.empty[Row]) + } + + assertResult(None) { +parquetFilters.createFilter( + df.schema, + sources.StringStartsWith("_1", null)) --- End diff -- Thanks @attilapiros, `sources.StringStartsWith("_1", null)` will not match them, same as before. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21623: [SPARK-24638][SQL] StringStartsWith support push down
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21623 cc @gszadovszky @nandorKollar --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21623 [SPARK-24638][SQL] StringStartsWith support push down ## What changes were proposed in this pull request? `StringStartsWith` support push down. About 50% savings in compute time. ## How was this patch tested? unit tests and manual tests. Performance test: ```scala cat < SPARK-24638.scala spark.range(1000).selectExpr("concat(id, 'str', id) as id").coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/string") val df = spark.read.parquet("/tmp/spark/parquet/string/") spark.sql("set spark.sql.parquet.filterPushdown=true") val pushdownEnableStart = System.currentTimeMillis() for(i <- 0 until 100) { df.where("id like '98%'").count() } val pushdownEnable = System.currentTimeMillis() - pushdownEnableStart spark.sql("set spark.sql.parquet.filterPushdown=false") val pushdownDisableStart = System.currentTimeMillis() for(i <- 0 until 100) { df.where("id like '98%'").count() } val pushdownDisable = System.currentTimeMillis() - pushdownDisableStart val improvements = pushdownDisable.toDouble - pushdownEnable.toDouble println(s"improvements: ${improvements}") EOF bin/spark-shell -i SPARK-24638.scala ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24638 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21623.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21623 commit 5b52ace44c8a41631535c883b7a5c8545959e5e5 Author: Yuming Wang Date: 2018-06-23T13:27:30Z StringStartsWith support push down --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
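A simplified sketch of the mechanism behind this PR (close to, but not literally, its code): the prefix predicate is pushed down as a Parquet `UserDefinedPredicate`, and `canDrop` / `inverseCanDrop` use a row group's min/max statistics to skip it without reading any values:

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, Statistics, UserDefinedPredicate}
import org.apache.parquet.io.api.Binary
import org.apache.parquet.schema.PrimitiveComparator
import org.apache.spark.unsafe.types.UTF8String

def startsWithFilter(column: String, prefix: String) = {
  val strToBinary = Binary.fromReusedByteArray(prefix.getBytes)
  val size = strToBinary.length
  FilterApi.userDefined(FilterApi.binaryColumn(column),
    new UserDefinedPredicate[Binary] with Serializable {
      private val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR

      // evaluated per value when the statistics alone cannot decide
      override def keep(value: Binary): Boolean =
        value != null && UTF8String.fromBytes(value.getBytes)
          .startsWith(UTF8String.fromBytes(strToBinary.getBytes))

      // drop the row group if its whole [min, max] range lies before or after the prefix
      override def canDrop(statistics: Statistics[Binary]): Boolean = {
        val max = statistics.getMax
        val min = statistics.getMin
        comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
          comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
      }

      // for a negated startsWith: drop the row group only if every value shares the prefix
      override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = {
        val max = statistics.getMax
        val min = statistics.getMin
        comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) == 0 &&
          comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) == 0
      }
    })
}
```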
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21603#discussion_r197603396 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -378,6 +378,17 @@ object SQLConf { .booleanConf .createWithDefault(true) + val PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD = +buildConf("spark.sql.parquet.pushdown.inFilterThreshold") + .doc("The maximum number of values to filter push-down optimization for IN predicate. " + +"Large threshold won't necessarily provide much better performance. " + +"The experiment argued that 300 is the limit threshold. " + --- End diff -- You are right.

Type | limit threshold
-- | --
string | 370
int | 210
long | 285
double | 270
float | 220
decimal | Will not provide better performance

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21603#discussion_r197338867 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) { case sources.Not(pred) => createFilter(schema, pred).map(FilterApi.not) + case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 => --- End diff -- It seems that the push-down performance is better when the threshold is less than `300`:

https://user-images.githubusercontent.com/5399861/41757743-7e411532-7616-11e8-8844-45132c50c535.png

The code:

```scala
withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
  import testImplicits._
  withTempPath { path =>
    val total = 1000
    (0 to total).toDF().coalesce(1)
      .write.option("parquet.block.size", 512)
      .parquet(path.getAbsolutePath)
    val df = spark.read.parquet(path.getAbsolutePath)
    // scalastyle:off println
    var lastSize = -1
    var i = 16000
    while (i < total) {
      val filter = Range(0, total).filter(_ % i == 0)
      i += 100
      if (lastSize != filter.size) {
        if (lastSize == -1) println(s"start size: ${filter.size}")
        lastSize = filter.size
        sql("set spark.sql.parquet.pushdown.inFilterThreshold=100")
        val begin1 = System.currentTimeMillis()
        df.where(s"id in(${filter.mkString(",")})").count()
        val end1 = System.currentTimeMillis()
        val time1 = end1 - begin1
        sql("set spark.sql.parquet.pushdown.inFilterThreshold=10")
        val begin2 = System.currentTimeMillis()
        df.where(s"id in(${filter.mkString(",")})").count()
        val end2 = System.currentTimeMillis()
        val time2 = end2 - begin2
        if (time1 <= time2) println(s"Max threshold: $lastSize")
      }
    }
  }
}
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21603: [SPARK-17091][SQL] Add rule to convert IN predicate to e...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21603 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21603#discussion_r197011649 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) { case sources.Not(pred) => createFilter(schema, pred).map(FilterApi.not) + case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 => --- End diff -- The threshold is **20**. Too many `values` may cause an OOM, for example:

```scala
spark.range(1000).coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/SPARK-17091")
val df = spark.read.parquet("/tmp/spark/parquet/SPARK-17091/")
df.where(s"id in(${Range(1, 1).mkString(",")})").count
```

```
Exception in thread "SIGINT handler"
18/06/21 13:00:54 WARN TaskSetManager: Lost task 7.0 in stage 1.0 (TID 8, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOfRange(Arrays.java:3664)
	at java.lang.String.<init>(String.java:207)
	at java.lang.StringBuilder.toString(StringBuilder.java:407)
	at org.apache.parquet.filter2.predicate.Operators$BinaryLogicalFilterPredicate.<init>(Operators.java:263)
	at org.apache.parquet.filter2.predicate.Operators$Or.<init>(Operators.java:316)
	at org.apache.parquet.filter2.predicate.FilterApi.or(FilterApi.java:261)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$15.apply(ParquetFilters.scala:276)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$15.apply(ParquetFilters.scala:276)
	...
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
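A minimal, self-contained sketch of the conversion being discussed (the helper name is made up): an IN predicate becomes a chain of OR-ed EQ predicates, so the predicate tree, and the string building visible in the stack trace above, grows with the number of values, which is why a threshold is needed:

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}

// IN (v1, v2, ...) -> eq(col, v1) OR eq(col, v2) OR ...; only the int column case is shown.
def inToOrChain(column: String, values: Seq[Integer]): Option[FilterPredicate] =
  values
    .map(v => FilterApi.eq(FilterApi.intColumn(column), v): FilterPredicate)
    .reduceLeftOption(FilterApi.or)
```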
[GitHub] spark pull request #21603: [SPARK-17091][SQL] Add rule to convert IN predica...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21603 [SPARK-17091][SQL] Add rule to convert IN predicate to equivalent Parquet filter ## What changes were proposed in this pull request? Add a new optimizer rule to convert an IN predicate to an equivalent Parquet filter. The original pr is: https://github.com/apache/spark/pull/18424 ## How was this patch tested? unit tests and manual tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-17091 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21603.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21603 commit 264eed81e33d3af7d7ea50a3a49866dde18f163b Author: Yuming Wang Date: 2018-06-21T04:35:20Z Convert IN predicate to Parquet filter push-down commit 4f96881af4af6f613c049f3756ee3aba518ceab8 Author: Yuming Wang Date: 2018-06-21T04:49:12Z Change threshold to 20. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21556 cc @gatorsmile @rdblue --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18424: [SPARK-17091] Add rule to convert IN predicate to equiva...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/18424 @ptkool Are you still working on this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21556 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21556 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21556 Another performance test: https://user-images.githubusercontent.com/5399861/41448622-437d029a-708e-11e8-9c18-5d9f17cd1edf.png --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21547: [SPARK-24538][SQL] ByteArrayDecimalType support push dow...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21547 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21547: [SPARK-24538][SQL] ByteArrayDecimalType support push dow...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21547 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21547: [SPARK-24538][SQL] ByteArrayDecimalType support p...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21547#discussion_r195289284 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -37,6 +39,23 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) { DateTimeUtils.fromJavaDate(date) } + private def decimalToBinary(precision: Int, decimal: JBigDecimal): Binary = { --- End diff -- REF: https://github.com/apache/spark/blob/21a7bfd5c324e6c82152229f1394f26afeae771c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L247-L266 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
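A simplified sketch of what the referenced `ParquetWriteSupport` logic does (not copied from it, and the byte-width handling is reduced to the essentials): the decimal's unscaled value is sign-extended to the fixed byte width implied by the precision and wrapped in a Parquet `Binary`:

```scala
import java.math.{BigDecimal => JBigDecimal}
import org.apache.parquet.io.api.Binary

// Assumes the unscaled value fits into numBytes (the fixed length derived from precision).
def decimalToFixedLengthBinary(numBytes: Int, decimal: JBigDecimal): Binary = {
  val unscaled = decimal.unscaledValue().toByteArray        // big-endian two's complement
  val fixed = new Array[Byte](numBytes)
  val signByte: Byte = if (unscaled.head < 0) -1 else 0     // 0xFF for negative, 0x00 otherwise
  java.util.Arrays.fill(fixed, 0, numBytes - unscaled.length, signByte)
  System.arraycopy(unscaled, 0, fixed, numBytes - unscaled.length, unscaled.length)
  Binary.fromReusedByteArray(fixed)
}
```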
[GitHub] spark pull request #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDeci...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21556#discussion_r195283330 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -62,6 +62,16 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) { (n: String, v: Any) => FilterApi.eq( intColumn(n), Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull) +case decimal: DecimalType if DecimalType.is32BitDecimalType(decimal) => + (n: String, v: Any) => FilterApi.eq( +intColumn(n), + Option(v).map(_.asInstanceOf[java.math.BigDecimal].unscaledValue().intValue() --- End diff -- REF: https://github.com/apache/spark/blob/21a7bfd5c324e6c82152229f1394f26afeae771c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L219 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21556: [SPARK-24549][SQL] 32BitDecimalType and 64BitDeci...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21556 [SPARK-24549][SQL] 32BitDecimalType and 64BitDecimalType support push down ## What changes were proposed in this pull request? [32BitDecimalType](https://github.com/apache/spark/blob/e28eb431146bcdcaf02a6f6c406ca30920592a6a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L208) and [64BitDecimalType](https://github.com/apache/spark/blob/e28eb431146bcdcaf02a6f6c406ca30920592a6a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L219) support push down to the data sources. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24549 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21556.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21556 commit 9832661e735fbbbfc4da4cc96f4ff7a537c3eca2 Author: Yuming Wang Date: 2018-06-13T12:28:33Z 32BitDecimalType and 64BitDecimalType support push down to the data sources --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
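For illustration of the kind of predicate this change targets: decimals with small enough precision are stored as plain INT32/INT64 values in Parquet, so filters on such columns can be handed to the reader. The table and column below are hypothetical.
```sql
-- Hypothetical example: price is DECIMAL(9, 2), which fits in 32 bits,
-- so after this change the equality filter can be pushed to the data source.
SELECT * FROM sales WHERE price = CAST(9.99 AS DECIMAL(9, 2));
```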
[GitHub] spark issue #21547: [SPARK-24538][SQL] ByteArrayDecimalType support push dow...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21547 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21547: [SPARK-24538][SQL] ByteArrayDecimalType support p...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21547 [SPARK-24538][SQL] ByteArrayDecimalType support push down to the data sources ## What changes were proposed in this pull request? [ByteArrayDecimalType](https://github.com/apache/spark/blob/e28eb431146bcdcaf02a6f6c406ca30920592a6a/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L230) support push down to the data sources. ## How was this patch tested? unit tests and manual tests. **manual tests**:
```scala
spark.range(1000).selectExpr(
  "id",
  "cast(id as decimal(9)) as d1",
  "cast(id as decimal(9, 2)) as d2",
  "cast(id as decimal(18)) as d3",
  "cast(id as decimal(18, 4)) as d4",
  "cast(id as decimal(38)) as d5",
  "cast(id as decimal(38, 18)) as d6"
).coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/decimal")

val df = spark.read.parquet("/tmp/spark/parquet/decimal/")
// Only read about 1 MB data
df.filter("d6 = 1").show
// Read 174.3 MB data
df.filter("d3 = 1").show
```
You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24538 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21547.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21547 commit 96066701ec75d3caa27994c47eab8ff64150b6a5 Author: Yuming Wang Date: 2018-06-13T01:35:55Z ByteArrayDecimalType support push down to the data sources --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21479: [SPARK-23903][SQL] Add support for date extract
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21479 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21479: [SPARK-23903][SQL] Add support for date extract
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21479 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21479: [SPARK-23903][SQL] Add support for date extract
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21479 [SPARK-23903][SQL] Add support for date extract ## What changes were proposed in this pull request? Add support for date `extract`; the supported fields are the same as in [Hive](https://github.com/apache/hive/blob/rel/release-2.3.3/ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g#L308-L316): `YEAR`, `QUARTER`, `MONTH`, `WEEK`, `DAY`, `HOUR`, `MINUTE`, `SECOND`. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-23903 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21479.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21479 commit a1a4db3774e7e0911e710ed1a99694add29df545 Author: Yuming Wang Date: 2018-06-01T16:06:55Z Add support for date extract --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
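A few illustrative calls of the proposed `extract` syntax; the timestamp values are chosen only as examples.
```sql
SELECT extract(year FROM timestamp '2018-06-01 10:20:30');   -- 2018
SELECT extract(week FROM timestamp '2018-06-01 10:20:30');   -- week of the year
SELECT extract(minute FROM timestamp '2018-06-01 10:20:30'); -- 20
```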
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21010 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21010 > Basically LGTM, but I'm wondering what if the expr2 is not like a format string? The same as Hive: ```sql spark-sql> SELECT format_number(12332.123456, 'abc'); abc12332 ``` ```sql hive> SELECT format_number(12332.123456, 'abc'); OK abc12332 Time taken: 0.218 seconds, Fetched: 1 row(s) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21460: [SPARK-23442][SQL] Improvement reading from parti...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21460 [SPARK-23442][SQL] Improvement reading from partitioned and bucketed table. ## What changes were proposed in this pull request? For a partitioned and bucketed table, the amount of data grows as the number of partitions increases, yet reading the table always uses `bucket number` tasks. This PR changes the logic to use `bucket number` * `partition number` tasks when reading a partitioned and bucketed table. ## How was this patch tested? manual tests.
```scala
spark.range(1).selectExpr(
  "id as key",
  "id % 5 as t1",
  "id % 10 as p"
).repartition(5, col("p")).write.partitionBy("p").bucketBy(5, "key").sortBy("t1").saveAsTable("spark_23442")
```
```scala
// All partitions: partition number = 5 * 10 = 50
spark.sql("select count(distinct t1) from spark_23442 ").show
```
```scala
// Filtered 1/2 of the partitions: partition number = 5 * (10 / 2) = 25
spark.sql("select count(distinct t1) from spark_23442 where p >= 5 ").show
```
You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-23442 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21460.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21460 commit 58e4e098016051f41103464040ba24bbee28b2cf Author: Yuming Wang Date: 2018-05-30T06:53:52Z Improvement reading from partitioned and bucketed table. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21010 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21010 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21431: [SPARK-19112][CORE][FOLLOW-UP] Add missing shortCompress...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21431 Yes. I have tested with `--conf spark.io.compression.codec=zstd`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21431: [SPARK-19112][CORE][FOLLOW-UP] Add missing shortC...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21431 [SPARK-19112][CORE][FOLLOW-UP] Add missing shortCompressionCodecNames to configuration. ## What changes were proposed in this pull request? Spark provides four codecs: `lz4`, `lzf`, `snappy`, and `zstd`. This pr adds the missing shortCompressionCodecNames to the configuration. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-19112 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21431.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21431 commit 877d27055698a872347a04977ae050fef0ba42e7 Author: Yuming Wang <yumwang@...> Date: 2018-05-25T13:41:46Z Add missing shortCompressionCodecNames to configuration. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
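For reference, a small example of how the short codec names are used in practice; the key is the standard `spark.io.compression.codec` setting, and the sketch is only illustrative:
```scala
import org.apache.spark.SparkConf

// The short names ("lz4", "lzf", "snappy", "zstd") are resolved to full codec
// class names through the shortCompressionCodecNames mapping.
val conf = new SparkConf().set("spark.io.compression.codec", "zstd")
```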
[GitHub] spark pull request #21423: [SPARK-24378][SQL] Fix date_trunc function incorr...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21423 [SPARK-24378][SQL] Fix date_trunc function incorrect examples ## What changes were proposed in this pull request? Fix the incorrect examples for the `date_trunc` function. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24378 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21423.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21423 commit b8b0c9dd21bbb4a5d29174d778165a2bd72403e5 Author: Yuming Wang <yumwang@...> Date: 2018-05-24T11:46:28Z Fix incorrect examples --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21404: [SPARK-24360][SQL] Support Hive 3.0 metastore
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21404 Can we remove the old Hive support, such as 0.12, 0.13 and 0.14? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20274: [SPARK-20120][SQL][FOLLOW-UP] Better way to support spar...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20274 @srowen I have updated. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21375: [HOT-FIX][SQL] Fix: SQLConf.scala:1757: not found...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/21375 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21375: [HOT-FIX][SQL] Fix: SQLConf.scala:1757: not found...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21375 [HOT-FIX][SQL] Fix: SQLConf.scala:1757: not found: value Utils ## What changes were proposed in this pull request? Fix: `SQLConf.scala:1757: not found: value Utils` ## How was this patch tested? manual tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark hot-fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21375.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21375 commit 5375965ca1de10e97ca6f2b0b984dbe0e9306c68 Author: Yuming Wang <yumwang@...> Date: 2018-05-20T00:30:58Z Fix: SQLConf.scala:1757: not found: value Utils --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21343: [SPARK-24292][SQL] Proxy user cannot connect to HiveMeta...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21343 This problem seems to have been fixed. Can you try [v2.3.1-rc1](https://github.com/apache/spark/releases/tag/v2.3.1-rc1)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/18853 **Spark vs Teradata**: https://user-images.githubusercontent.com/5399861/40102134-43a138e2-591c-11e8-8bf1-00fb9b72e026.png https://user-images.githubusercontent.com/5399861/40102133-436658a8-591c-11e8-950c-9ec95a4c7ed0.png --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21328: Ci
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/21328 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21328: Ci
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21328 Ci ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/spark-mler/spark ci Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21328.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21328 commit 082c78494d2fbc904379d32cec4408b24620b25d Author: Yuming Wang <wgyumg@...> Date: 2018-05-15T04:28:48Z Remove -T 4 -p commit 4c0945143f344d6a3fe2288416b5721f8cb75edb Author: Yuming Wang <wgyumg@...> Date: 2018-05-15T05:58:34Z travis_wait 60 commit d645136386c5fd245cb7a358ac06095e3e9bf989 Author: Yuming Wang <wgyumg@...> Date: 2018-05-15T06:28:37Z Update .travis.yml --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specifed f...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21010 retest please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21189: [SPARK-24117][SQL] Unified the getSizePerRow
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21189 [SPARK-24117][SQL] Unified the getSizePerRow ## What changes were proposed in this pull request? This pr unifies `getSizePerRow` because it is used in many places. For example: 1. [LocalRelation.scala#L80](https://github.com/wangyum/spark/blob/f70f46d1e5bc503e9071707d837df618b7696d32/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala#L80) 2. [SizeInBytesOnlyStatsPlanVisitor.scala#L36](https://github.com/apache/spark/blob/76b8b840ddc951ee6203f9cccd2c2b9671c1b5e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala#L36) ## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-24117 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21189.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21189 commit cd415381386f0ac5c29cd6dab57ceafc86e96adf Author: Yuming Wang <yumwang@...> Date: 2018-04-28T11:10:33Z Unified the getSizePerRow --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
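As a rough sketch of what a shared helper could look like (the signature and the per-row overhead constant are assumptions, not the actual Spark implementation): estimate the size of one row from the output attributes' default data-type sizes plus a small fixed overhead.
```scala
import org.apache.spark.sql.catalyst.expressions.Attribute

// Assumed sketch of a single shared estimate used by LocalRelation statistics
// and the size-only stats visitor; the 8-byte row overhead is a guess.
def getSizePerRow(attributes: Seq[Attribute]): Long =
  8L + attributes.map(_.dataType.defaultSize).sum
```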
[GitHub] spark issue #21170: [SPARK-22732][SS][FOLLOW-UP] Fix memoryV2.scala toString...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21170 cc @zsxwing --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21170: [SPARK-22732][SS][FOLLOW-UP] Fix memoryV2.scala t...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21170 [SPARK-22732][SS][FOLLOW-UP] Fix memoryV2.scala toString error ## What changes were proposed in this pull request? Fix `memoryV2.scala` toString error ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-22732 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21170.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21170 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20659: [DO-NOT-MERGE] Try to update Hive to 2.3.2
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/20659 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21077: [SPARK-21033][CORE][FOLLOW-UP] Update Spillable
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21077 [SPARK-21033][CORE][FOLLOW-UP] Update Spillable ## What changes were proposed in this pull request? Update
```scala
SparkEnv.get.conf.getLong("spark.shuffle.spill.numElementsForceSpillThreshold", Long.MaxValue)
```
to
```scala
SparkEnv.get.conf.get(SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD)
```
because `SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD`'s default value is `Integer.MAX_VALUE`: https://github.com/apache/spark/blob/c99fc9ad9b600095baba003053dbf84304ca392b/core/src/main/scala/org/apache/spark/internal/config/package.scala#L503-L511 ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-21033 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21077.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21077 commit a7de7c6d29508cfdcc9c7cb66fa1648c6b4b Author: Yuming Wang <yumwang@...> Date: 2018-04-16T08:12:53Z SparkEnv.get.conf.getLong("spark.shuffle.spill.numElementsForceSpillThreshold", Long.MaxValue) -> SparkEnv.get.conf.get(SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
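For context, a hedged sketch of how such a typed config entry is usually declared with Spark's internal config builder; the real entry also carries docs and validation, so treat this as illustrative only. It shows why the typed accessor's default is `Integer.MAX_VALUE` rather than `Long.MaxValue`.
```scala
import org.apache.spark.internal.config.ConfigBuilder

// Illustrative declaration; the actual entry lives in
// org.apache.spark.internal.config.package and includes docs/checks.
private[spark] val SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD =
  ConfigBuilder("spark.shuffle.spill.numElementsForceSpillThreshold")
    .intConf
    .createWithDefault(Integer.MAX_VALUE)
```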
[GitHub] spark pull request #21010: [SPARK-23900][SQL] format_number support user spe...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/21010 [SPARK-23900][SQL] format_number support user specifed format as argument ## What changes were proposed in this pull request? `format_number` supports a user-specified format as argument. For example:
```sql
spark-sql> SELECT format_number(12332.123456, '##.###');
12332.123
```
## How was this patch tested? unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-23900 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21010.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21010 commit 202fa3d505ed4395858a277c9b871006b2f64483 Author: Yuming Wang <yumwang@...> Date: 2018-04-09T14:28:44Z format_number udf should take user specifed format as argument --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20944 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20944 cc @jerryshao --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20944: [SPARK-23831][SQL] Add org.apache.derby to Isolat...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20944#discussion_r179158019 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala --- @@ -188,6 +188,9 @@ private[hive] class IsolatedClientLoader( (name.startsWith("com.google") && !name.startsWith("com.google.cloud")) || name.startsWith("java.lang.") || name.startsWith("java.net") || +name.startsWith("com.sun.") || +name.startsWith("sun.reflect.") || --- End diff -- Yes, it doesn't matter if we add these two lines, but I think it's best to add them. What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18853: [SPARK-21646][SQL] Add new type coercion to compatible w...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/18853 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20944: [SPARK-23831][SQL] Add org.apache.derby to Isolat...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/20944 [SPARK-23831][SQL] Add org.apache.derby to IsolatedClientLoader ## What changes were proposed in this pull request? Add `org.apache.derby` to `IsolatedClientLoader`, otherwise it may throw an exception:
```
[info] Cause: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2439ab23, see the next exception for details.
[info] at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
[info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
[info] at org.apache.derby.impl.jdbc.EmbedConnection.(Unknown Source)
[info] at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
```
How to reproduce:
```bash
sed 's/HiveExternalCatalogSuite/HiveExternalCatalog2Suite/g' sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala > sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalog2Suite.scala
build/sbt -Phive "hive/test-only *.HiveExternalCatalogSuite *.HiveExternalCatalog2Suite"
```
## How was this patch tested? manual tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-23831 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20944.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20944 commit 7d5cc71e4753f26fed4563a0eef27aa9de173d57 Author: Yuming Wang <yumwang@...> Date: 2018-03-30T10:41:42Z Add org.apache.derby to IsolatedClientLoader --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20785: [SPARK-23640][CORE] Fix hadoop config may override spark...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20785 Ping @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20898: [SPARK-23789][SQL] Shouldn't set hive.metastore.u...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/20898 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20898: [SPARK-23789][SQL] Shouldn't set hive.metastore.uris bef...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20898 It looks like --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20866#discussion_r176906868 --- Diff: core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala --- @@ -92,8 +93,8 @@ private[security] class HiveDelegationTokenProvider s"$principal at $metastoreUri") doAsRealUser { -val hive = Hive.get(conf, classOf[HiveConf]) --- End diff -- 1. This [`Hive.get()`](https://github.com/apache/spark/blob/v2.3.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L239) is different from the others; it is loaded by `IsolatedClientLoader`. 2. I cannot start a `HiveThriftServer2` in a kerberized cluster, so I'm not sure whether `CLIService.java` should be updated. How about updating it later? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20866#discussion_r176906522 --- Diff: core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala --- @@ -92,8 +93,8 @@ private[security] class HiveDelegationTokenProvider s"$principal at $metastoreUri") doAsRealUser { -val hive = Hive.get(conf, classOf[HiveConf]) -val tokenStr = hive.getDelegationToken(currentUser.getUserName(), principal) +metaStoreClient = new HiveMetaStoreClient(conf.asInstanceOf[HiveConf]) +val tokenStr = metaStoreClient.getDelegationToken(currentUser.getUserName, principal) --- End diff -- Yes, both HMS 1.x and 2.x --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20866#discussion_r176906474 --- Diff: core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala --- @@ -92,8 +94,9 @@ private[security] class HiveDelegationTokenProvider s"$principal at $metastoreUri") doAsRealUser { -val hive = Hive.get(conf, classOf[HiveConf]) -val tokenStr = hive.getDelegationToken(currentUser.getUserName(), principal) +metastoreClient = RetryingMetaStoreClient.getProxy(conf.asInstanceOf[HiveConf], null, --- End diff -- HiveMetaStoreClient -> RetryingMetaStoreClient. In fact, `Hive.get` also uses `RetryingMetaStoreClient`:
```
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20898: [SPARK-23789][SQL] Shouldn't set hive.metastore.uris bef...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20898 Yes, it's a proxy user:
```
export HADOOP_PROXY_USER=user
spark-sql --master yarn
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20898: [SPARK-23789][SQL] Shouldn't set hive.metastore.uris bef...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20898 cc @yaooqinn @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20898: [SPARK-23789][SQL] Shouldn't set hive.metastore.u...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/20898 [SPARK-23789][SQL] Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider ## What changes were proposed in this pull request? `spark-sql` can't connect to the metastore on a secured Hadoop cluster after [SPARK-21428](https://issues.apache.org/jira/browse/SPARK-21428). `hive.metastore.uris` was `HiveConf.ConfVars.METASTOREURIS.defaultStrVal` here before SPARK-21428. This pr reverts `hive.metastore.uris` to `HiveConf.ConfVars.METASTOREURIS.defaultStrVal`. ## How was this patch tested? manual tests with a secured Hadoop cluster You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-23789 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20898.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20898 commit a382aab4f3a9cabda10ab2aedbbb8d663737348f Author: Yuming Wang <yumwang@...> Date: 2018-03-24T07:19:25Z Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20866#discussion_r176902474 --- Diff: core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala --- @@ -92,8 +93,8 @@ private[security] class HiveDelegationTokenProvider s"$principal at $metastoreUri") doAsRealUser { -val hive = Hive.get(conf, classOf[HiveConf]) --- End diff -- Thanks @dongjoon-hyun, seems `RetryingMetaStoreClient` is a better choice and I will try. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20867: Spark 23759
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20867 Please update the title. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20866#discussion_r175859248 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -908,11 +912,39 @@ private[hive] object HiveClientImpl { Utils.classForName(name) .asInstanceOf[Class[_ <: org.apache.hadoop.hive.ql.io.HiveOutputFormat[_, _]]] + private def toHiveMetaApiTable(table: CatalogTable): HiveMetaApiTable = { --- End diff -- Copy from Hive: https://github.com/apache/hive/blob/rel/release-2.3.2/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java#L149 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20866#discussion_r175858111 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -908,11 +912,39 @@ private[hive] object HiveClientImpl { Utils.classForName(name) .asInstanceOf[Class[_ <: org.apache.hadoop.hive.ql.io.HiveOutputFormat[_, _]]] + private def toHiveMetaApiTable(table: CatalogTable): HiveMetaApiTable = { +val sd = new StorageDescriptor +sd.setSerdeInfo(new SerDeInfo) +sd.setNumBuckets(-1) +sd.setBucketCols(new JArrayList[String]) +sd.setCols(new JArrayList[FieldSchema]) +sd.setParameters(new JHashMap[String, String]) +sd.setSortCols(new JArrayList[Order]) +sd.getSerdeInfo.setParameters(new JHashMap[String, String]) +sd.getSerdeInfo.getParameters.put(serdeConstants.SERIALIZATION_FORMAT, "1") +sd.setInputFormat(classOf[SequenceFileInputFormat[_, _]].getName) +sd.setOutputFormat(classOf[HiveSequenceFileOutputFormat[_, _]].getName) +val skewInfo: SkewedInfo = new SkewedInfo +skewInfo.setSkewedColNames(new JArrayList[String]) +skewInfo.setSkewedColValues(new JArrayList[JList[String]]) +skewInfo.setSkewedColValueLocationMaps(new JHashMap[JList[String], String]) +sd.setSkewedInfo(skewInfo) + +val apiTable = new HiveMetaApiTable() +apiTable.setSd(sd) +apiTable.setPartitionKeys(new JArrayList[FieldSchema]) +apiTable.setParameters(new JHashMap[String, String]) +apiTable.setTableType(HiveTableType.MANAGED_TABLE.toString) +apiTable.setDbName(table.database) +apiTable.setTableName(table.identifier.table) +apiTable + } + /** * Converts the native table metadata representation format CatalogTable to Hive's Table. */ def toHiveTable(table: CatalogTable, userName: Option[String] = None): HiveTable = { -val hiveTable = new HiveTable(table.database, table.identifier.table) +val hiveTable = new HiveTable(toHiveMetaApiTable(table)) --- End diff -- Avoid [`t.setOwner(SessionState.getUserFromAuthenticator())`](https://github.com/apache/hive/blob/rel/release-2.3.2/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java#L180), because it will connect to [Metastore](https://github.com/apache/hive/blob/rel/release-2.3.2/ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java#L913), and we will set owner later: https://github.com/apache/spark/blob/v2.3.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L914 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20866: [SPARK-23749][SQL] Avoid Hive.get() to compatible...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/20866 [SPARK-23749][SQL] Avoid Hive.get() to compatible with different Hive metastore ## What changes were proposed in this pull request? Avoid `Hive.get()` to be compatible with different Hive metastore versions. ## How was this patch tested? Existing unit tests and manual tests with a secured Hadoop cluster You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangyum/spark SPARK-23749 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20866.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20866 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20659: [DO-NOT-MERGE] Try to update Hive to 2.3.2
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20659 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org