[GitHub] [spark] MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r356247291

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala

## @@ -207,7 +208,16 @@ class ParquetWriteSupport extends WriteSupport[InternalRow] with Logging {

       case t: UserDefinedType[_] => makeWriter(t.sqlType)

-      // TODO Adds IntervalType support
+      case CalendarIntervalType =>
+        (row: SpecializedGetters, ordinal: Int) =>
+          val interval = row.getInterval(ordinal)
+          val buf = ByteBuffer.wrap(reusableBuffer)
+          buf.order(ByteOrder.LITTLE_ENDIAN)
+            .putInt((interval.milliseconds()).toInt)

Review comment: fixed

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
With regards, Apache Git Services
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
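The diff above packs a `CalendarInterval` into a reusable 12-byte buffer. As a standalone sketch of that layout, per the Parquet format spec the `INTERVAL` logical type is a `fixed_len_byte_array(12)` holding three little-endian 32-bit integers in the order months, days, milliseconds. The helper names below (`packInterval`, `unpackInterval`) are hypothetical, not Spark APIs:

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Encode (months, days, milliseconds) as the 12-byte little-endian
// fixed-length array used by the Parquet INTERVAL logical type.
def packInterval(months: Int, days: Int, millis: Int): Array[Byte] = {
  val buf = ByteBuffer.wrap(new Array[Byte](12)).order(ByteOrder.LITTLE_ENDIAN)
  buf.putInt(months).putInt(days).putInt(millis)
  buf.array()
}

// Decode the 12 bytes back into the three fields.
def unpackInterval(bytes: Array[Byte]): (Int, Int, Int) = {
  require(bytes.length == 12, s"expected 12 bytes, got ${bytes.length}")
  val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
  (buf.getInt, buf.getInt, buf.getInt)
}
```

Spark's actual writer reuses a single preallocated buffer per task (`reusableBuffer` in the diff) rather than allocating a fresh array per value.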
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r346700958

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

## @@ -498,10 +498,8 @@ case class DataSource(

       outputColumnNames: Seq[String],
       physicalPlan: SparkPlan): BaseRelation = {
     val outputColumns = DataWritingCommand.logicalPlanOutputWithNames(data, outputColumnNames)
-    if (outputColumns.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) {

Review comment: How are Python and R involved in reading/writing Parquet?
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r346700532

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala

## @@ -207,7 +208,16 @@ class ParquetWriteSupport extends WriteSupport[InternalRow] with Logging {

       case t: UserDefinedType[_] => makeWriter(t.sqlType)

-      // TODO Adds IntervalType support
+      case CalendarIntervalType =>
+        (row: SpecializedGetters, ordinal: Int) =>
+          val interval = row.getInterval(ordinal)
+          val buf = ByteBuffer.wrap(reusableBuffer)
+          buf.order(ByteOrder.LITTLE_ENDIAN)
+            .putInt((interval.milliseconds()).toInt)

Review comment: Spark will read them back as negative values: https://github.com/apache/spark/pull/26102/files#diff-35a70bb270f17ea3a1d964c4bec0e0a2R912. I don't know about other systems.
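The concern in this thread is the `interval.milliseconds().toInt` truncation in the quoted diff: the millisecond count is a `Long`, and a value outside the `Int` range wraps around when narrowed, so a sufficiently large interval round-trips as a negative one. A minimal illustration, plain Scala with no Spark involved:

```scala
// A millisecond count just past Int.MaxValue (about 24.9 days).
val bigMillis: Long = Int.MaxValue.toLong + 1L

// Narrowing to Int wraps silently instead of failing.
val truncated: Int = bigMillis.toInt  // wraps to Int.MinValue
```

Using `Math.toIntExact(bigMillis)` instead would throw an `ArithmeticException` on overflow rather than wrapping.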
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r344307502

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

## @@ -498,10 +498,8 @@ case class DataSource(

       outputColumnNames: Seq[String],
       physicalPlan: SparkPlan): BaseRelation = {
     val outputColumns = DataWritingCommand.logicalPlanOutputWithNames(data, outputColumnNames)
-    if (outputColumns.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) {

Review comment: Just wondering, what's the relation between this PR and opening up `CalendarIntervalType`? An `INTERVAL` column could appear as the result of subtracting two datetime columns, and a user may want to store it to a file system.
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r335287678

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

## @@ -325,6 +325,26 @@ private[parquet] class ParquetRowConverter(

         override def set(value: Any): Unit = updater.set(value.asInstanceOf[InternalRow].copy())
       })

+    case CalendarIntervalType
+        if parquetType.asPrimitiveType().getPrimitiveTypeName == FIXED_LEN_BYTE_ARRAY =>
+      new ParquetPrimitiveConverter(updater) {
+        override def addBinary(value: Binary): Unit = {
+          assert(
+            value.length() == 12,
+            "Intervals are expected to be stored in 12-byte fixed len byte array, " +
+              s"but got a ${value.length()}-byte array.")
+
+          val buf = value.toByteBuffer.order(ByteOrder.LITTLE_ENDIAN)
+          val milliseconds = buf.getInt
+          var microseconds = milliseconds * DateTimeUtils.MICROS_PER_MILLIS
+          val days = buf.getInt
+          val daysInUs = Math.multiplyExact(days, DateTimeUtils.MICROS_PER_DAY)

Review comment: I don't want to defend the other side :-) but a consequence of storing days separately is that hours are unbounded. This way, `interval 1 day 25 hours` and `interval 2 days 1 hour` are represented differently in Parquet - (0, 1, 9000) and (0, 2, 360). As @cloud-fan wrote above, this can lead to different results when adding those intervals to 2 November 2019: `2019-11-02` + `interval 1 day 25 hours` = `2019-11-04 00:00:00` but `2019-11-02` + `interval 2 days 1 hour` = `2019-11-04 01:00:00`.
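The point above can be sketched outside Spark: in a (months, days, milliseconds) layout, the same total duration has distinct encodings depending on whether hours beyond 23 stay in the milliseconds field or are carried into the days field. The tuples below are illustrative, not the exact bytes Spark writes:

```scala
val millisPerHour = 3600L * 1000L
val millisPerDay = 24 * millisPerHour

// `interval 1 day 25 hours`: the 25 hours kept entirely in milliseconds.
val a = (0, 1, 25 * millisPerHour)
// `interval 2 days 1 hour`: 24 of those hours carried into the days field.
val b = (0, 2, 1 * millisPerHour)

// Ignoring calendar effects (e.g. DST), both denote the same length of time.
def totalMillis(i: (Int, Int, Long)): Long = i._2 * millisPerDay + i._3
```

The two encodings only diverge observably when days and hours stop being interchangeable, e.g. when adding the interval across a daylight-saving transition, which is exactly the 2 November 2019 example in the comment.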
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r335184353

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

## @@ -325,6 +325,26 @@ private[parquet] class ParquetRowConverter(

         override def set(value: Any): Unit = updater.set(value.asInstanceOf[InternalRow].copy())
       })

+    case CalendarIntervalType
+        if parquetType.asPrimitiveType().getPrimitiveTypeName == FIXED_LEN_BYTE_ARRAY =>
+      new ParquetPrimitiveConverter(updater) {
+        override def addBinary(value: Binary): Unit = {
+          assert(
+            value.length() == 12,
+            "Intervals are expected to be stored in 12-byte fixed len byte array, " +
+              s"but got a ${value.length()}-byte array.")
+
+          val buf = value.toByteBuffer.order(ByteOrder.LITTLE_ENDIAN)
+          val milliseconds = buf.getInt
+          var microseconds = milliseconds * DateTimeUtils.MICROS_PER_MILLIS
+          val days = buf.getInt
+          val daysInUs = Math.multiplyExact(days, DateTimeUtils.MICROS_PER_DAY)

Review comment:
1. According to the SQL standard, `hours` must be in the range 0-23.
2. We already lose the information while converting an interval string to a `CalendarInterval` value:
```sql
spark-sql> select interval 1 day 25 hours;
interval 2 days 1 hours
```
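The read path quoted in this thread can be sketched standalone: decode the 12-byte little-endian array and collapse the time part to microseconds with overflow checks, as the converter does via `Math.multiplyExact`. This sketch assumes the months/days/milliseconds field order from the Parquet format spec and spells out the constants instead of using Spark's `DateTimeUtils`:

```scala
import java.nio.{ByteBuffer, ByteOrder}

val MicrosPerMillis = 1000L
val MicrosPerDay = 24L * 60 * 60 * 1000 * 1000

// Returns (months, microseconds), roughly the shape of CalendarInterval
// at the time of this PR.
def decodeInterval(bytes: Array[Byte]): (Int, Long) = {
  assert(bytes.length == 12, s"expected a 12-byte array, got ${bytes.length}")
  val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
  val months = buf.getInt
  val days = buf.getInt
  val milliseconds = buf.getInt
  // multiplyExact/addExact throw ArithmeticException instead of silently
  // overflowing, matching the converter's defensive arithmetic.
  val micros = Math.addExact(
    Math.multiplyExact(days.toLong, MicrosPerDay),
    Math.multiplyExact(milliseconds.toLong, MicrosPerMillis))
  (months, micros)
}
```

Folding days into microseconds here is exactly the lossy step discussed above: once days become microseconds, `1 day 25 hours` and `2 days 1 hour` are indistinguishable.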
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r335142560

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala

## @@ -325,6 +325,26 @@ private[parquet] class ParquetRowConverter(

         override def set(value: Any): Unit = updater.set(value.asInstanceOf[InternalRow].copy())
       })

+    case CalendarIntervalType
+        if parquetType.asPrimitiveType().getPrimitiveTypeName == FIXED_LEN_BYTE_ARRAY =>
+      new ParquetPrimitiveConverter(updater) {
+        override def addBinary(value: Binary): Unit = {
+          assert(
+            value.length() == 12,
+            "Intervals are expected to be stored in 12-byte fixed len byte array, " +
+              s"but got a ${value.length()}-byte array.")
+
+          val buf = value.toByteBuffer.order(ByteOrder.LITTLE_ENDIAN)
+          val milliseconds = buf.getInt
+          var microseconds = milliseconds * DateTimeUtils.MICROS_PER_MILLIS
+          val days = buf.getInt
+          val daysInUs = Math.multiplyExact(days, DateTimeUtils.MICROS_PER_DAY)

Review comment: This can be fixed only if we change the structure of `CalendarInterval`, but such modifications are almost orthogonal to this PR.
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r334298187

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala

## @@ -371,6 +371,8 @@ class ParquetFileFormat

     case udt: UserDefinedType[_] => supportDataType(udt.sqlType)

+    case _: CalendarIntervalType => true

Review comment: I see. I will combine them.
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r334298172

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala

## @@ -73,6 +73,9 @@ class ParquetWriteSupport extends WriteSupport[InternalRow] with Logging {

   // Reusable byte array used to write timestamps as Parquet INT96 values
   private val timestampBuffer = new Array[Byte](12)

+  // Reusable byte array used to write intervals as Parquet FIXED_LEN_BYTE_ARRAY values
+  private val intervalBuffer = new Array[Byte](12)

Review comment: I will combine the buffers.
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r334296652

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala

## @@ -371,6 +371,8 @@ class ParquetFileFormat

     case udt: UserDefinedType[_] => supportDataType(udt.sqlType)

+    case _: CalendarIntervalType => true

Review comment: Unfortunately, I cannot extend `AtomicType` because it requires an ordering.