The relevant earlier discussion is here:
https://github.com/apache/spark/pull/25678#issuecomment-531585556.

(FWIW, a recent PR tried adding this again:
https://github.com/apache/spark/pull/28858.)

On Wed, Jun 24, 2020 at 10:01 PM Rylan Dmello <rdme...@mathworks.com> wrote:

> Hello,
>
>
> Tahsin and I are trying to use the Apache Parquet file format with Spark
> SQL, but are running into errors when reading Parquet files that contain
> TimeType columns. We're wondering whether this is unsupported in Spark SQL
> due to an architectural limitation, or simply due to a lack of resources.
>
>
> Context: When reading some Parquet files with Spark, we get an error
> message like the following:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 186.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 186.0 (TID 1970, 10.155.249.249, executor 1): java.io.IOException: Could
> not read or convert schema for file:
> dbfs:/test/randomdata/sample001.parquet
> ...
> Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type:
> INT64 (TIME_MICROS);
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:106)
>
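> For anyone trying to reproduce this: Spark itself cannot write a TimeType
> column, so the file has to come from another writer. One way is parquet-mr's
> example API; a minimal sketch (the path and field name here are made up):
>
>   import org.apache.hadoop.fs.Path
>   import org.apache.parquet.example.data.simple.SimpleGroupFactory
>   import org.apache.parquet.hadoop.example.ExampleParquetWriter
>   import org.apache.parquet.schema.MessageTypeParser
>
>   // One INT64 column annotated as TIME_MICROS (microseconds since midnight).
>   val schema = MessageTypeParser.parseMessageType(
>     "message m { required int64 t (TIME_MICROS); }")
>
>   val writer = ExampleParquetWriter.builder(new Path("/tmp/time.parquet"))
>     .withType(schema)
>     .build()
>   writer.write(new SimpleGroupFactory(schema).newGroup().append("t", 123456789L))
>   writer.close()
>
>   // Reading the file back in Spark then fails during schema conversion:
>   spark.read.parquet("/tmp/time.parquet")
>   // => AnalysisException: Illegal Parquet type: INT64 (TIME_MICROS)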
>
> This only seems to occur with Parquet files that contain a column with the
> "TimeType" (or the deprecated "TIME_MILLIS"/"TIME_MICROS") type. After
> digging into this a bit, we think that the error message is coming from
> "ParquetSchemaConverter.scala", around these lines:
> https://github.com/apache/spark/blob/11d3a744e20fe403dd76e18d57963b6090a7c581/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L140
> https://github.com/apache/spark/blob/11d3a744e20fe403dd76e18d57963b6090a7c581/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L151
>
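> For reference, the INT64 branch there looks roughly like this (paraphrased
> from the linked file; note there is no case for the TIME types, so they fall
> through to illegalType()):
>
>   case INT64 =>
>     originalType match {
>       case INT_64 | null    => LongType
>       case DECIMAL          => makeDecimalType(Decimal.MAX_LONG_DIGITS)
>       case TIMESTAMP_MICROS => TimestampType
>       case TIMESTAMP_MILLIS => TimestampType
>       // (other cases elided)
>       case _                => illegalType()  // TIME_MICROS / TIME_MILLIS land here
>     }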
>
> This seems to imply that the Spark SQL engine does not support reading
> Parquet files with TimeType columns.
>
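> (A possible interim workaround, sketched below with the same hypothetical
> field name "t": bypass Spark's schema converter entirely, read the annotated
> INT64 values with parquet-mr, and hand Spark plain longs.)
>
>   import org.apache.hadoop.fs.Path
>   import org.apache.parquet.hadoop.ParquetReader
>   import org.apache.parquet.hadoop.example.GroupReadSupport
>
>   val reader = ParquetReader
>     .builder(new GroupReadSupport(), new Path("/tmp/time.parquet"))
>     .build()
>   // TIME_MICROS is physically an INT64: microseconds since midnight.
>   val micros = Iterator.continually(reader.read()).takeWhile(_ != null)
>     .map(_.getLong("t", 0)).toVector
>   reader.close()
>   val df = spark.createDataFrame(micros.map(Tuple1.apply)).toDF("t_micros")
>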
> We are wondering if anyone on the mailing list could shed some more light
> on this: are there architectural/datatype limitations in Spark that result
> in this error, or is TimeType support for Parquet files something that
> hasn't been implemented yet due to a lack of resources/interest?
>
>
> Thanks,
> Rylan
>


-- 
Bart Samwel
bart.sam...@databricks.com
