Re: [discuss][SQL] Partitioned column type inference proposal

Wenchen Fan Tue, 14 Nov 2017 08:34:59 -0800

My 2 cents:

1. when merging NullType with another type, the result should always be
that type.
2. when merging StringType with another type, the result should always be
StringType.
3. when merging integral types, the priority from high to low: DecimalType,
LongType, IntegerType. This is because DecimalType is used as big integer
when paring partition column values.
4. DoubleType can't be merged with other types, except DoubleType itself.
5. when merging TimestampType with DateType, return TimestampType.



On Tue, Nov 14, 2017 at 3:54 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Hi dev,
>
> I would like to post a proposal about partitioned column type inference
> (related with 'spark.sql.sources.partitionColumnTypeInference.enabled'
> configuration).
>
> This thread focuses on the type coercion (finding the common type) in
> partitioned columns, in particular, when the different form of data is
> inserted for the partition column and then it is read back with the type
> inference.
>
>
> *Problem:*
>
>
> val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts")
> df.write.format("parquet").partitionBy("ts").save("/tmp/foo")
> spark.read.load("/tmp/foo").printSchema()
> val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal")
> df.write.format("parquet").partitionBy("decimal").save("/tmp/bar")
> spark.read.load("/tmp/bar").printSchema()
>
>
>
> It currently returns:
>
>
> root
>  |-- i: integer (nullable = true)
>  |-- ts: date (nullable = true)
>
> root
>  |-- i: integer (nullable = true)
>  |-- decimal: integer (nullable = true)
>
>
> The type coercion looks less well designed yet and currently there are few
> holes which is not quite ideal:
>
>
> private val upCastingOrder: Seq[DataType] =
>   Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)
> ...
> literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
>
>
>
> The current way does not deal with when the types are outside of the
> upCastingOrder. It just returns the first type, as the type coerced one.
>
> This has been being discussed in https://github.com/apache/
> spark/pull/19389#discussion_r150426911, but I would like to have more
> feedback from community as it possibly is a breaking change.
>
> For the current releases of Spark (2.2.0 <=), we support the types below
> for partitioned column schema inference, given my investigation -
> https://github.com/apache/spark/pull/19389#discussion_r150528207:
>
>   NullType
>   IntegerType
>   LongType
>   DoubleType,
>   *DecimalType(...)
>   DateType
>   TimestampType
>   StringType
>
>   *DecimalType only when it's bigger than LongType:
>
> I believe this is something we should definitely fix.
>
>
>
> *Proposal:*
>
> I propose the change - https://github.com/apache/spark/pull/19389
>
> Simply, it reuses the case 2 specified in https://github.com/apache/
> spark/blob/6412ea1759d39a2380c572ec24cfd8ae4f2d81f7/sql/catalyst/src/
> main/scala/org/apache/spark/sql/catalyst/analysis/
> TypeCoercion.scala#L40-L43
>
> Please refer the chart I produced here - https://github.com/apache/
> spark/pull/19389/files#r150528361. The current proposal will brings the
> type coercion behaviour change in those cases below:
>
>
> Input typesOld output typeNew output type
> [NullType, DecimalType(38,0)] StringType DecimalType(38,0)
> [NullType, DateType] StringType DateType
> [NullType, TimestampType] StringType TimestampType
> [IntegerType, DecimalType(38,0)] IntegerType DecimalType(38,0)
> [IntegerType, DateType] IntegerType StringType
> [IntegerType, TimestampType] IntegerType StringType
> [LongType, DecimalType(38,0)] LongType DecimalType(38,0)
> [LongType, DateType] LongType StringType
> [LongType, TimestampType] LongType StringType
> [DoubleType, DateType] DoubleType StringType
> [DoubleType, TimestampType] DoubleType StringType
> [DecimalType(38,0), NullType] StringType DecimalType(38,0)
> [DecimalType(38,0), IntegerType] IntegerType DecimalType(38,0)
> [DecimalType(38,0), LongType] LongType DecimalType(38,0)
> [DecimalType(38,0), DateType] DecimalType(38,0) StringType
> [DecimalType(38,0), TimestampType] DecimalType(38,0) StringType
> [DateType, NullType] StringType DateType
> [DateType, IntegerType] IntegerType StringType
> [DateType, LongType] LongType StringType
> [DateType, DoubleType] DoubleType StringType
> [DateType, DecimalType(38,0)] DateType StringType
> [DateType, TimestampType] DateType TimestampType
> [TimestampType, NullType] StringType TimestampType
> [TimestampType, IntegerType] IntegerType StringType
> [TimestampType, LongType] LongType StringType
> [TimestampType, DoubleType] DoubleType StringType
> [TimestampType, DecimalType(38,0)] TimestampType StringType
>
>
>
> *Other possible suggestions**:*
>
> Probably, we could also consider simply making the merged type to string
> types when there are any type conflicts.
> Otherwise, there could be more stricter rules. I want more opinion from
> the community.
>
>
>
> *Questions:*
>
> - Does the *Proposal:* above looks good?
>
> - If not, what would be the alternative?
>
>
>
>
>

Re: [discuss][SQL] Partitioned column type inference proposal

Reply via email to