[GitHub] [spark] MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource

2019-12-10 Thread GitBox
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r356247291
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala
 ##
 @@ -207,7 +208,16 @@ class ParquetWriteSupport extends WriteSupport[InternalRow] with Logging {
 
       case t: UserDefinedType[_] => makeWriter(t.sqlType)
 
 -      // TODO Adds IntervalType support
 +      case CalendarIntervalType =>
 +        (row: SpecializedGetters, ordinal: Int) =>
 +          val interval = row.getInterval(ordinal)
 +          val buf = ByteBuffer.wrap(reusableBuffer)
 +          buf.order(ByteOrder.LITTLE_ENDIAN)
 +            .putInt((interval.milliseconds()).toInt)
 
 Review comment:
   fixed
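
For context on the hunk above: the Parquet `INTERVAL` logical type is a 12-byte `FIXED_LEN_BYTE_ARRAY` holding three little-endian 32-bit counts (months, days, milliseconds). A minimal standalone sketch of that packing; the `packInterval` helper is hypothetical and not part of the PR:

```scala
import java.nio.{ByteBuffer, ByteOrder}

object PackIntervalSketch {
  // Pack (months, days, milliseconds) into the 12-byte little-endian
  // layout of the Parquet INTERVAL logical type.
  def packInterval(months: Int, days: Int, millis: Int): Array[Byte] = {
    ByteBuffer.allocate(12)
      .order(ByteOrder.LITTLE_ENDIAN)
      .putInt(months)
      .putInt(days)
      .putInt(millis)
      .array()
  }

  def main(args: Array[String]): Unit = {
    // 0 months, 2 days, 1 hour (3600000 ms)
    val bytes = packInterval(0, 2, 3600000)
    println(bytes.map(b => f"${b & 0xff}%02x").mkString(" "))
    // 00 00 00 00 02 00 00 00 80 ee 36 00
  }
}
```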





[GitHub] [spark] MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource

2019-11-15 Thread GitBox
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r346700958
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ##
 @@ -498,10 +498,8 @@ case class DataSource(
       outputColumnNames: Seq[String],
       physicalPlan: SparkPlan): BaseRelation = {
     val outputColumns = DataWritingCommand.logicalPlanOutputWithNames(data, outputColumnNames)
 -    if (outputColumns.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) {
 
 Review comment:
   How are Python and R involved in reading/writing Parquet?





[GitHub] [spark] MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource

2019-11-15 Thread GitBox
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r346700532
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala
 ##
 @@ -207,7 +208,16 @@ class ParquetWriteSupport extends WriteSupport[InternalRow] with Logging {
 
       case t: UserDefinedType[_] => makeWriter(t.sqlType)
 
 -      // TODO Adds IntervalType support
 +      case CalendarIntervalType =>
 +        (row: SpecializedGetters, ordinal: Int) =>
 +          val interval = row.getInterval(ordinal)
 +          val buf = ByteBuffer.wrap(reusableBuffer)
 +          buf.order(ByteOrder.LITTLE_ENDIAN)
 +            .putInt((interval.milliseconds()).toInt)
 
 Review comment:
   Spark will read them back as negative values: https://github.com/apache/spark/pull/26102/files#diff-35a70bb270f17ea3a1d964c4bec0e0a2R912. I don't know about other systems.
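
To make the failure mode concrete: the writer narrows the millisecond count with `.toInt`, so any interval longer than `Int.MaxValue` milliseconds (roughly 24.8 days) wraps around to a negative 32-bit value on read. A tiny sketch of that wrap-around, assuming the narrowing cast shown in the hunk above:

```scala
object MillisNarrowingSketch {
  def main(args: Array[String]): Unit = {
    // 25 days in milliseconds exceeds Int.MaxValue (2147483647):
    val millis: Long = 25L * 24 * 60 * 60 * 1000
    println(millis)        // 2160000000
    println(millis.toInt)  // -2134967296 -- the narrowing flips the sign
  }
}
```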





[GitHub] [spark] MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource

2019-11-08 Thread GitBox
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r344307502
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ##
 @@ -498,10 +498,8 @@ case class DataSource(
       outputColumnNames: Seq[String],
       physicalPlan: SparkPlan): BaseRelation = {
     val outputColumns = DataWritingCommand.logicalPlanOutputWithNames(data, outputColumnNames)
 -    if (outputColumns.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) {
 
 Review comment:
   Just wondering, what's the relation between this PR and opening up `CalendarIntervalType`? An `INTERVAL` column could appear as the result of subtracting two datetime columns, and a user may want to store it to the file system.





[GitHub] [spark] MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource

2019-10-15 Thread GitBox
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r335287678
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
 ##
 @@ -325,6 +325,26 @@ private[parquet] class ParquetRowConverter(
         override def set(value: Any): Unit = updater.set(value.asInstanceOf[InternalRow].copy())
       })
 
 +    case CalendarIntervalType
 +        if parquetType.asPrimitiveType().getPrimitiveTypeName == FIXED_LEN_BYTE_ARRAY =>
 +      new ParquetPrimitiveConverter(updater) {
 +        override def addBinary(value: Binary): Unit = {
 +          assert(
 +            value.length() == 12,
 +            "Intervals are expected to be stored in 12-byte fixed len byte array, " +
 +              s"but got a ${value.length()}-byte array.")
 +
 +          val buf = value.toByteBuffer.order(ByteOrder.LITTLE_ENDIAN)
 +          val milliseconds = buf.getInt
 +          var microseconds = milliseconds * DateTimeUtils.MICROS_PER_MILLIS
 +          val days = buf.getInt
 +          val daysInUs = Math.multiplyExact(days, DateTimeUtils.MICROS_PER_DAY)
 
 Review comment:
   I don't want to defend the other side :-) but the consequence of storing days separately is that hours are unbounded. This way, `interval 1 day 25 hours` and `interval 2 days 1 hour` are represented differently in Parquet: (0, 1, 90000000) and (0, 2, 3600000). As @cloud-fan wrote above, this can lead to different results when adding those intervals to 2 November 2019: `2019-11-02` + `interval 1 day 25 hours` = `2019-11-04 00:00:00`, but `2019-11-02` + `interval 2 days 1 hour` = `2019-11-04 01:00:00`.
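
A worked version of that arithmetic, spelling out the two (months, days, milliseconds) triples; both describe 49 hours overall, but they behave differently across the US DST fall-back on 2019-11-03:

```scala
object DistinctTriplesSketch {
  def main(args: Array[String]): Unit = {
    val millisPerHour = 60 * 60 * 1000
    // `interval 1 day 25 hours` -> days = 1, millis = 25 hours
    println((0, 1, 25 * millisPerHour))  // (0,1,90000000)
    // `interval 2 days 1 hour`  -> days = 2, millis = 1 hour
    println((0, 2, 1 * millisPerHour))   // (0,2,3600000)
  }
}
```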





[GitHub] [spark] MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource

2019-10-15 Thread GitBox
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r335184353
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
 ##
 @@ -325,6 +325,26 @@ private[parquet] class ParquetRowConverter(
         override def set(value: Any): Unit = updater.set(value.asInstanceOf[InternalRow].copy())
       })
 
 +    case CalendarIntervalType
 +        if parquetType.asPrimitiveType().getPrimitiveTypeName == FIXED_LEN_BYTE_ARRAY =>
 +      new ParquetPrimitiveConverter(updater) {
 +        override def addBinary(value: Binary): Unit = {
 +          assert(
 +            value.length() == 12,
 +            "Intervals are expected to be stored in 12-byte fixed len byte array, " +
 +              s"but got a ${value.length()}-byte array.")
 +
 +          val buf = value.toByteBuffer.order(ByteOrder.LITTLE_ENDIAN)
 +          val milliseconds = buf.getInt
 +          var microseconds = milliseconds * DateTimeUtils.MICROS_PER_MILLIS
 +          val days = buf.getInt
 +          val daysInUs = Math.multiplyExact(days, DateTimeUtils.MICROS_PER_DAY)
 
 Review comment:
   1. According to the SQL standard, `hours` must be in the range 0-23.
   2. We already lose the information while converting an interval string to a `CalendarInterval` value:
   ```sql
   spark-sql> select interval 1 day 25 hours;
   interval 2 days 1 hours
   ```
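
A sketch of why that split is unrecoverable, assuming the two-field (months, microseconds) layout `CalendarInterval` had at the time: once hours are folded into the microsecond count, both spellings map to the same value.

```scala
object NormalizationSketch {
  def main(args: Array[String]): Unit = {
    val microsPerHour = 60L * 60 * 1000 * 1000
    val microsPerDay  = 24 * microsPerHour
    val a = 1 * microsPerDay + 25 * microsPerHour  // `1 day 25 hours`
    val b = 2 * microsPerDay +  1 * microsPerHour  // `2 days 1 hour`
    println(a == b)  // true: both are 176400000000 microseconds
  }
}
```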





[GitHub] [spark] MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource

2019-10-15 Thread GitBox
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r335142560
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
 ##
 @@ -325,6 +325,26 @@ private[parquet] class ParquetRowConverter(
         override def set(value: Any): Unit = updater.set(value.asInstanceOf[InternalRow].copy())
       })
 
 +    case CalendarIntervalType
 +        if parquetType.asPrimitiveType().getPrimitiveTypeName == FIXED_LEN_BYTE_ARRAY =>
 +      new ParquetPrimitiveConverter(updater) {
 +        override def addBinary(value: Binary): Unit = {
 +          assert(
 +            value.length() == 12,
 +            "Intervals are expected to be stored in 12-byte fixed len byte array, " +
 +              s"but got a ${value.length()}-byte array.")
 +
 +          val buf = value.toByteBuffer.order(ByteOrder.LITTLE_ENDIAN)
 +          val milliseconds = buf.getInt
 +          var microseconds = milliseconds * DateTimeUtils.MICROS_PER_MILLIS
 +          val days = buf.getInt
 +          val daysInUs = Math.multiplyExact(days, DateTimeUtils.MICROS_PER_DAY)
 
 Review comment:
   This can be fixed only if we change the structure of `CalendarInterval`, but such modifications are almost orthogonal to this PR.
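
To illustrate what changing the structure could mean, a hypothetical case class (not Spark's code at the time) that keeps days in their own field, so the Parquet (months, days, millis) triple would round-trip losslessly:

```scala
// Hypothetical layout with days kept separate from microseconds.
final case class IntervalWithDays(months: Int, days: Int, microseconds: Long)

object IntervalWithDaysSketch {
  def main(args: Array[String]): Unit = {
    val microsPerHour = 60L * 60 * 1000 * 1000
    val a = IntervalWithDays(0, 1, 25 * microsPerHour)  // 1 day 25 hours
    val b = IntervalWithDays(0, 2,  1 * microsPerHour)  // 2 days 1 hour
    println(a == b)  // false: the day/hour distinction survives
  }
}
```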





[GitHub] [spark] MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource

2019-10-13 Thread GitBox
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r334298187
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ##
 @@ -371,6 +371,8 @@ class ParquetFileFormat
 
     case udt: UserDefinedType[_] => supportDataType(udt.sqlType)
 
 +    case _: CalendarIntervalType => true
 
 Review comment:
   I see. I will combine them.





[GitHub] [spark] MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource

2019-10-13 Thread GitBox
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r334298172
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala
 ##
 @@ -73,6 +73,9 @@ class ParquetWriteSupport extends WriteSupport[InternalRow] with Logging {
   // Reusable byte array used to write timestamps as Parquet INT96 values
   private val timestampBuffer = new Array[Byte](12)
 
 +  // Reusable byte array used to write intervals as Parquet FIXED_LEN_BYTE_ARRAY values
 +  private val intervalBuffer = new Array[Byte](12)
 
 Review comment:
   I will combine the buffers.





[GitHub] [spark] MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource

2019-10-13 Thread GitBox
MaxGekk commented on a change in pull request #26102: [SPARK-29448][SQL] Support the `INTERVAL` type by Parquet datasource
URL: https://github.com/apache/spark/pull/26102#discussion_r334296652
 
 

 ##
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ##
 @@ -371,6 +371,8 @@ class ParquetFileFormat
 
     case udt: UserDefinedType[_] => supportDataType(udt.sqlType)
 
 +    case _: CalendarIntervalType => true
 
 Review comment:
   Unfortunately, I cannot extend `AtomicType` because it presupposes an ordering.
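
A sketch of the obstacle; the trait below only mimics the `AtomicType` contract and is not Spark's actual hierarchy: an atomic type must supply an `Ordering` over its internal representation, and there is no sound one for calendar intervals, because a comparison like `1 month` vs `30 days` depends on the date the interval is applied to.

```scala
// Illustrative stand-in for the AtomicType contract.
trait OrderedDataType {
  type InternalType
  def ordering: Ordering[InternalType]
}

// A calendar-interval type cannot honestly implement `ordering`:
// `1 month` spans 28, 29, 30, or 31 days depending on the anchor date,
// so `1 month` vs `30 days` has no fixed answer.
```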

