[GitHub] spark pull request #16351: [SPARK-18943][SQL] Avoid per-record type dispatch...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16351 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16351#discussion_r93814412 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVTypeCastSuite.scala --- @@ -66,144 +66,138 @@ class CSVTypeCastSuite extends SparkFunSuite { } test("Nullable types are handled") { -assertNull( - CSVTypeCast.castTo("-", "_1", ByteType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", ShortType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", IntegerType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", LongType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", FloatType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", DoubleType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", BooleanType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", DecimalType.DoubleDecimal, true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", TimestampType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", DateType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", StringType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo(null, "_1", IntegerType, nullable = true, CSVOptions("nullValue", "-"))) - -// casting a null to not nullable field should throw an exception. 
-var message = intercept[RuntimeException] { - CSVTypeCast.castTo(null, "_1", IntegerType, nullable = false, CSVOptions("nullValue", "-")) -}.getMessage -assert(message.contains("null value found but field _1 is not nullable.")) - -message = intercept[RuntimeException] { - CSVTypeCast.castTo("-", "_1", StringType, nullable = false, CSVOptions("nullValue", "-")) -}.getMessage -assert(message.contains("null value found but field _1 is not nullable.")) - } - - test("String type should also respect `nullValue`") { --- End diff -- Some tests in ``String type should also respect `nullValue` `` were duplicated and others were folded into `Nullable types are handled`.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16351#discussion_r93814350 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVTypeCastSuite.scala --- @@ -66,144 +66,141 @@ class CSVTypeCastSuite extends SparkFunSuite { } test("Nullable types are handled") { -assertNull( - CSVTypeCast.castTo("-", "_1", ByteType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", ShortType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", IntegerType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", LongType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", FloatType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", DoubleType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", BooleanType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", DecimalType.DoubleDecimal, true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", TimestampType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", DateType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", StringType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo(null, "_1", IntegerType, nullable = true, CSVOptions("nullValue", "-"))) - -// casting a null to not nullable field should throw an exception. 
-var message = intercept[RuntimeException] { - CSVTypeCast.castTo(null, "_1", IntegerType, nullable = false, CSVOptions("nullValue", "-")) -}.getMessage -assert(message.contains("null value found but field _1 is not nullable.")) - -message = intercept[RuntimeException] { - CSVTypeCast.castTo("-", "_1", StringType, nullable = false, CSVOptions("nullValue", "-")) -}.getMessage -assert(message.contains("null value found but field _1 is not nullable.")) +val types = Seq(ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, + BooleanType, DecimalType.DoubleDecimal, TimestampType, DateType, StringType) + +types.foreach { t => + val converterOne = +CSVTypeCast.makeConverter("_1", t, nullable = true, CSVOptions("nullValue", "-")) + assertNull(converterOne.apply("-")) + assertNull(converterOne.apply(null)) + + // casting a null to not nullable field should throw an exception. + val converterTwo = +CSVTypeCast.makeConverter("_1", t, nullable = false, CSVOptions("nullValue", "-")) + + var message = intercept[RuntimeException] { +converterTwo.apply("-") + }.getMessage + assert(message.contains("null value found but field _1 is not nullable.")) + + message = intercept[RuntimeException] { +converterTwo.apply(null) + }.getMessage + assert(message.contains("null value found but field _1 is not nullable.")) +} } test("String type should also respect `nullValue`") { --- End diff -- I see, some of the tests overlap. Let me clean this up and add some comments.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16351#discussion_r93791725 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVTypeCastSuite.scala --- @@ -66,144 +66,141 @@ class CSVTypeCastSuite extends SparkFunSuite { } test("Nullable types are handled") { -assertNull( - CSVTypeCast.castTo("-", "_1", ByteType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", ShortType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", IntegerType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", LongType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", FloatType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", DoubleType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", BooleanType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", DecimalType.DoubleDecimal, true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", TimestampType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", DateType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", StringType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo(null, "_1", IntegerType, nullable = true, CSVOptions("nullValue", "-"))) - -// casting a null to not nullable field should throw an exception. 
-var message = intercept[RuntimeException] { - CSVTypeCast.castTo(null, "_1", IntegerType, nullable = false, CSVOptions("nullValue", "-")) -}.getMessage -assert(message.contains("null value found but field _1 is not nullable.")) - -message = intercept[RuntimeException] { - CSVTypeCast.castTo("-", "_1", StringType, nullable = false, CSVOptions("nullValue", "-")) -}.getMessage -assert(message.contains("null value found but field _1 is not nullable.")) +val types = Seq(ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, + BooleanType, DecimalType.DoubleDecimal, TimestampType, DateType, StringType) + +types.foreach { t => + val converterOne = +CSVTypeCast.makeConverter("_1", t, nullable = true, CSVOptions("nullValue", "-")) + assertNull(converterOne.apply("-")) + assertNull(converterOne.apply(null)) + + // casting a null to not nullable field should throw an exception. + val converterTwo = +CSVTypeCast.makeConverter("_1", t, nullable = false, CSVOptions("nullValue", "-")) + + var message = intercept[RuntimeException] { +converterTwo.apply("-") + }.getMessage + assert(message.contains("null value found but field _1 is not nullable.")) + + message = intercept[RuntimeException] { +converterTwo.apply(null) + }.getMessage + assert(message.contains("null value found but field _1 is not nullable.")) +} } test("String type should also respect `nullValue`") { --- End diff -- This test is not needed, right? You already included the string type in https://github.com/apache/spark/pull/16351/files#diff-b6ea04714c153af78c2acbab6265cda8R70
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16351#discussion_r93791600 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVTypeCastSuite.scala --- @@ -66,144 +66,141 @@ class CSVTypeCastSuite extends SparkFunSuite { } test("Nullable types are handled") { -assertNull( - CSVTypeCast.castTo("-", "_1", ByteType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", ShortType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", IntegerType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", LongType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", FloatType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", DoubleType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", BooleanType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", DecimalType.DoubleDecimal, true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", TimestampType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", DateType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo("-", "_1", StringType, nullable = true, CSVOptions("nullValue", "-"))) -assertNull( - CSVTypeCast.castTo(null, "_1", IntegerType, nullable = true, CSVOptions("nullValue", "-"))) - -// casting a null to not nullable field should throw an exception. 
-var message = intercept[RuntimeException] { - CSVTypeCast.castTo(null, "_1", IntegerType, nullable = false, CSVOptions("nullValue", "-")) -}.getMessage -assert(message.contains("null value found but field _1 is not nullable.")) - -message = intercept[RuntimeException] { - CSVTypeCast.castTo("-", "_1", StringType, nullable = false, CSVOptions("nullValue", "-")) -}.getMessage -assert(message.contains("null value found but field _1 is not nullable.")) +val types = Seq(ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, + BooleanType, DecimalType.DoubleDecimal, TimestampType, DateType, StringType) + +types.foreach { t => + val converterOne = --- End diff -- Isn't it more natural to write `converter1` and `converter2`?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16351#discussion_r93381128 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala --- @@ -215,84 +215,133 @@ private[csv] object CSVInferSchema { } private[csv] object CSVTypeCast { + // A `ValueConverter` is responsible for converting the given value to a desired type. + private type ValueConverter = String => Any /** - * Casts given string datum to specified type. - * Currently we do not support complex types (ArrayType, MapType, StructType). + * Create converters which cast each given string datum to each specified type in given schema. + * Currently, we do not support complex types (`ArrayType`, `MapType`, `StructType`). * - * For string types, this is simply the datum. For other types. + * For string types, this is simply the datum. + * For other types, this is converted into the value according to the type. * For other nullable types, returns null if it is null or equals to the value specified * in `nullValue` option. * - * @param datum string value - * @param name field name in schema. - * @param castType data type to cast `datum` into. - * @param nullable nullability for the field. + * @param schema schema that contains data types to cast the given value into. * @param options CSV options. */ - def castTo( + def makeConverters( + schema: StructType, + options: CSVOptions = CSVOptions()): Array[ValueConverter] = { +schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray + } + + /** + * Create a converter which converts the string value to a value according to a desired type. 
+ */ + def makeConverter( + name: String, + dataType: DataType, + nullable: Boolean = true, + options: CSVOptions = CSVOptions()): ValueConverter = dataType match { +case _: ByteType => (d: String) => + nullSafeDatum(d, name, nullable, options) { case datum => --- End diff -- nit: `nullSafeDatum(d, name, nullable, options)(_.toByte)`
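The shorthand in the nit works because of the helper's curried shape: the second parameter list takes the conversion function, so `_.toByte` can replace the full pattern-matching block. A standalone sketch of the idea (the `nullSafeDatum` body below is a guess modeled on the error message in the tests, with the `nullValue` hardcoded to `"-"`; the real helper also takes `CSVOptions`):

```scala
object NullSafeSketch {
  // Hypothetical helper modeled on the diff: the second parameter list takes
  // the actual conversion, which runs only when the datum is non-null.
  // The null marker ("-") is hardcoded here; the real code reads it from CSVOptions.
  def nullSafeDatum(datum: String, name: String, nullable: Boolean)
                   (convert: String => Any): Any = {
    if (datum == null || datum == "-") {
      if (!nullable) {
        throw new RuntimeException(s"null value found but field $name is not nullable.")
      }
      null
    } else {
      convert(datum)
    }
  }

  def main(args: Array[String]): Unit = {
    // Verbose form from the diff:
    val a = nullSafeDatum("1", "_1", nullable = true) { case datum => datum.toByte }
    // The reviewer's suggested shorthand -- same behavior:
    val b = nullSafeDatum("1", "_1", nullable = true)(_.toByte)
    println(a == b)  // true
  }
}
```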
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16351#discussion_r93381049 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala --- @@ -215,84 +215,133 @@ private[csv] object CSVInferSchema { } private[csv] object CSVTypeCast { + // A `ValueConverter` is responsible for converting the given value to a desired type. + private type ValueConverter = String => Any /** - * Casts given string datum to specified type. - * Currently we do not support complex types (ArrayType, MapType, StructType). + * Create converters which cast each given string datum to each specified type in given schema. + * Currently, we do not support complex types (`ArrayType`, `MapType`, `StructType`). * - * For string types, this is simply the datum. For other types. + * For string types, this is simply the datum. + * For other types, this is converted into the value according to the type. * For other nullable types, returns null if it is null or equals to the value specified * in `nullValue` option. * - * @param datum string value - * @param name field name in schema. - * @param castType data type to cast `datum` into. - * @param nullable nullability for the field. + * @param schema schema that contains data types to cast the given value into. * @param options CSV options. */ - def castTo( + def makeConverters( + schema: StructType, + options: CSVOptions = CSVOptions()): Array[ValueConverter] = { +schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray + } + + /** + * Create a converter which converts the string value to a value according to a desired type. 
+ */ + def makeConverter( + name: String, + dataType: DataType, + nullable: Boolean = true, + options: CSVOptions = CSVOptions()): ValueConverter = dataType match { +case _: ByteType => (d: String) => + nullSafeDatum(d, name, nullable, options) { case datum => +datum.toByte + } + +case _: ShortType => (d: String) => + nullSafeDatum(d, name, nullable, options) { case datum => +datum.toShort + } + +case _: IntegerType => (d: String) => + nullSafeDatum(d, name, nullable, options) { case datum => +datum.toInt + } + +case _: LongType => (d: String) => + nullSafeDatum(d, name, nullable, options) { case datum => +datum.toLong + } + +case _: FloatType => (d: String) => + nullSafeDatum(d, name, nullable, options) { +case options.nanValue => Float.NaN +case options.negativeInf => Float.NegativeInfinity +case options.positiveInf => Float.PositiveInfinity +case datum => + Try(datum.toFloat) + .getOrElse(NumberFormat.getInstance(Locale.US).parse(datum).floatValue()) + } + +case _: DoubleType => (d: String) => + nullSafeDatum(d, name, nullable, options) { +case options.nanValue => Double.NaN +case options.negativeInf => Double.NegativeInfinity +case options.positiveInf => Double.PositiveInfinity +case datum => + Try(datum.toDouble) + .getOrElse(NumberFormat.getInstance(Locale.US).parse(datum).doubleValue()) + } + +case _: BooleanType => (d: String) => + nullSafeDatum(d, name, nullable, options) { case datum => +datum.toBoolean + } + +case dt: DecimalType => (d: String) => + nullSafeDatum(d, name, nullable, options) { case datum => +val value = new BigDecimal(datum.replaceAll(",", "")) +Decimal(value, dt.precision, dt.scale) + } + +case _: TimestampType => (d: String) => + nullSafeDatum(d, name, nullable, options) { case datum => +// This one will lose microseconds parts. +// See https://issues.apache.org/jira/browse/SPARK-10681. 
+Try(options.timestampFormat.parse(datum).getTime * 1000L) + .getOrElse { + // If it fails to parse, then tries the way used in 2.0 and 1.x for backwards + // compatibility. + DateTimeUtils.stringToTime(datum).getTime * 1000L +} + } + +case _: DateType => (d: String) => + nullSafeDatum(d, name, nullable, options) { case datum => +// This one will lose microseconds parts. +// See https://issues.apache.org/jira/browse/SPARK-10681.x + Try(DateTimeUtils.millisToDays(options.dateFormat.parse(datum).getTime)) +
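The timestamp branch above uses a Try-then-fallback idiom: parse with the configured format first, and only on failure fall back to the legacy parser. The same idiom can be illustrated standalone (`java.text.SimpleDateFormat` stands in for the option-provided format; the real code uses `options.timestampFormat` and `DateTimeUtils.stringToTime`):

```scala
import java.text.SimpleDateFormat
import scala.util.Try

object TimestampFallbackSketch {
  // Stand-ins for the configured format and the legacy fallback parser.
  private val primary  = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
  private val fallback = new SimpleDateFormat("yyyy/MM/dd")

  // Microseconds since the epoch, mirroring the `getTime * 1000L` in the diff:
  // try the configured format first; fall back only when parsing fails.
  def toMicros(datum: String): Long =
    Try(primary.parse(datum).getTime * 1000L)
      .getOrElse(fallback.parse(datum).getTime * 1000L)

  def main(args: Array[String]): Unit = {
    println(toMicros("2016-12-20T11:54:05"))  // parsed by the primary format
    println(toMicros("2016/12/20"))           // primary fails, fallback succeeds
  }
}
```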
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16351#discussion_r93230247 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala --- @@ -215,84 +215,133 @@ private[csv] object CSVInferSchema { } private[csv] object CSVTypeCast { + // A `ValueConverter` is responsible for converting the given value to a desired type. + private type ValueConverter = String => Any /** - * Casts given string datum to specified type. - * Currently we do not support complex types (ArrayType, MapType, StructType). + * Create converters which cast each given string datum to each specified type in given schema. + * Currently, we do not support complex types (`ArrayType`, `MapType`, `StructType`). * - * For string types, this is simply the datum. For other types. + * For string types, this is simply the datum. + * For other types, this is converted into the value according to the type. * For other nullable types, returns null if it is null or equals to the value specified * in `nullValue` option. * - * @param datum string value - * @param name field name in schema. - * @param castType data type to cast `datum` into. - * @param nullable nullability for the field. + * @param schema schema that contains data types to cast the given value into. * @param options CSV options. */ - def castTo( + private[sql] def makeConverters( --- End diff -- Oops, I can remove this access modifier. I'll remove it and the one below soon.
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/16351 [SPARK-18943][SQL] Avoid per-record type dispatch in CSV when reading ## What changes were proposed in this pull request? `CSVRelation.csvParser` does type dispatch for each value in each row. We can avoid this because the schema is already kept in `CSVRelation`. So, this PR proposes that converters are created first according to the schema and then applied to each record. ## How was this patch tested? Tests in `CSVTypeCastSuite` and `CSVRelation`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/HyukjinKwon/spark type-dispatch Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16351.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16351 commit e72d1bc419dfd6da7f6e298d5b5412dba69eb5ad Author: hyukjinkwon Date: 2016-12-20T11:54:05Z Avoid per-record type dispatch in CSV when reading
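The refactoring the PR describes can be sketched in miniature: pattern-match on the data type once per field to build a converter closure, then apply only the closures per record. The `DataType` objects and `makeConverter` below are simplified stand-ins for illustration, not the actual Spark code:

```scala
object ConverterSketch {
  type ValueConverter = String => Any

  // Hypothetical, simplified stand-ins for Spark's DataType hierarchy;
  // the real code matches on org.apache.spark.sql.types.DataType.
  sealed trait DataType
  case object IntegerType extends DataType
  case object DoubleType  extends DataType
  case object StringType  extends DataType

  // Type dispatch happens here, once per field -- not once per value.
  def makeConverter(dataType: DataType): ValueConverter = dataType match {
    case IntegerType => (d: String) => d.toInt
    case DoubleType  => (d: String) => d.toDouble
    case StringType  => (d: String) => d
  }

  // Per record, only the pre-built closures run; no pattern match is evaluated.
  def convertRow(converters: Seq[ValueConverter], row: Array[String]): Seq[Any] =
    row.indices.map(i => converters(i)(row(i)))

  def main(args: Array[String]): Unit = {
    val schema = Seq(IntegerType, DoubleType, StringType)
    val converters = schema.map(makeConverter)  // built once from the schema
    println(convertRow(converters, Array("1", "2.5", "a")))
  }
}
```

The saving is that the `dataType match` runs a fixed number of times per schema rather than once for every value in every row.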