[
https://issues.apache.org/jira/browse/SPARK-54586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Raz Luvaton updated SPARK-54586:
--------------------------------
    Affects Version/s: 3.5.3
                       3.5.1
> Spark creates invalid Parquet files - puts non-UTF-8 data in string columns
> ------------------------------------------------------------------------------
>
> Key: SPARK-54586
> URL: https://issues.apache.org/jira/browse/SPARK-54586
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.1, 3.5.3, 3.5.7, 4.0.1
> Reporter: Raz Luvaton
> Priority: Critical
>
> According to the Parquet logical types specification, a STRING column must be
> encoded as UTF-8.
>
> {noformat}
> ### STRING
> `STRING` may only be used to annotate the `BYTE_ARRAY` primitive type and
> indicates
> that the byte array should be interpreted as a UTF-8 encoded character string.
> The sort order used for `STRING` strings is unsigned byte-wise comparison.
> *Compatibility*
> `STRING` corresponds to `UTF8` ConvertedType.{noformat}
> From the [Parquet Specification|https://github.com/apache/parquet-format/blob/905c89706004ee13a76c3df0f9fa4b1d583ddf9a/LogicalTypes.md?plain=1#L61-L70].
> Spark, however, can write invalid Parquet files that other tools then cannot
> read.
> For example, running:
> {noformat}
> datafusion-cli --command "select * from '/tmp/binary-data-with-cast-string/*.parquet' limit 10"{noformat}
> fails with:
> {noformat}
> Error: Parquet error: Arrow: Parquet argument error: Parquet error:
> encountered non UTF-8 data: invalid utf-8 sequence of 1 bytes from index
> 4{noformat}
> ----
> How to reproduce (create an invalid Parquet file):
>
> The example uses Spark 4, since its `try_validate_utf8` expression makes it
> very easy to see the invalid data.
>
> {noformat}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql._
>
> // 20 rows of 4-byte arrays, each starting with 0x80 (a UTF-8 continuation
> // byte, which can never begin a valid UTF-8 sequence)
> val binaryData = (0x00 to 0x13).map { i =>
>   Row(Array[Byte](0x80.toByte, 0x00, 0x00, i.toByte))
> }
>
> // Create a file with a binary column
> spark
>   .createDataFrame(
>     spark.sparkContext.parallelize(binaryData, 1),
>     StructType(Seq(StructField("a", DataTypes.BinaryType, nullable = true)))
>   )
>   .write
>   .format("parquet")
>   .mode("overwrite")
>   .save("/tmp/binary-data")
>
> // Create a new file with the binary data cast to string
> spark
>   .read
>   .parquet("/tmp/binary-data")
>   .withColumn("b", col("a").cast(DataTypes.StringType))
>   .write
>   .format("parquet")
>   .mode("overwrite")
>   .save("/tmp/binary-data-with-cast-string")
> {noformat}
>
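> The arrays above are not valid UTF-8 because 0x80 is a continuation byte and
> can never start a sequence. As a sanity check independent of Spark and
> Parquet, here is a minimal plain-Scala sketch using the JDK's strict
> CharsetDecoder (all names here are local to the sketch):
> {noformat}
> import java.nio.ByteBuffer
> import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}
>
> // A strict decoder that reports malformed input instead of
> // silently replacing it with U+FFFD
> val strictUtf8 = StandardCharsets.UTF_8.newDecoder()
>   .onMalformedInput(CodingErrorAction.REPORT)
>
> val bytes = Array[Byte](0x80.toByte, 0x00, 0x00, 0x00)
> try {
>   strictUtf8.decode(ByteBuffer.wrap(bytes))
>   println("valid UTF-8")
> } catch {
>   case _: CharacterCodingException => println("invalid UTF-8") // this branch is taken
> }
> {noformat}
>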
> To verify with Spark itself that the written data is not valid UTF-8:
> {noformat}
> spark
>   .read
>   .parquet("/tmp/binary-data-with-cast-string")
>   .withColumn("b_is_valid", expr("try_validate_utf8(b)"))
>   .show(){noformat}
> which will output:
> {noformat}
> +-------------+-----+----------+
> |            a|    b|b_is_valid|
> +-------------+-----+----------+
> |[80 00 00 00]|    �|      NULL|
> |[80 00 00 01]|    �|      NULL|
> |[80 00 00 02]|    �|      NULL|
> |[80 00 00 03]|    �|      NULL|
> |[80 00 00 04]|    �|      NULL|
> |[80 00 00 05]|    �|      NULL|
> |[80 00 00 06]|    �|      NULL|
> |[80 00 00 07]|  �\a|      NULL|
> |[80 00 00 08]|  �\b|      NULL|
> |[80 00 00 09]|  �\t|      NULL|
> |[80 00 00 0A]|  �\n|      NULL|
> |[80 00 00 0B]|  �\v|      NULL|
> |[80 00 00 0C]|  �\f|      NULL|
> |[80 00 00 0D]|  �\r|      NULL|
> |[80 00 00 0E]|    �|      NULL|
> |[80 00 00 0F]|    �|      NULL|
> |[80 00 00 10]|    �|      NULL|
> |[80 00 00 11]|    �|      NULL|
> |[80 00 00 12]|    �|      NULL|
> |[80 00 00 13]|    �|      NULL|
> +-------------+-----+----------+{noformat}
>
> Note that the `b_is_valid` column is always NULL because none of the values
> is valid UTF-8.
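>
> Until this is addressed, one possible mitigation sketch (assuming it is
> acceptable to turn non-UTF-8 values into NULL, and Spark 4, since it relies
> on `try_validate_utf8`; the output path is just an example):
> {noformat}
> // Null out values that are not valid UTF-8 before writing,
> // so the resulting STRING column complies with the Parquet spec
> spark
>   .read
>   .parquet("/tmp/binary-data")
>   .withColumn("b", expr("try_validate_utf8(cast(a as string))"))
>   .write
>   .format("parquet")
>   .mode("overwrite")
>   .save("/tmp/binary-data-utf8-safe")
> {noformat}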