[ https://issues.apache.org/jira/browse/SPARK-54586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raz Luvaton updated SPARK-54586:
--------------------------------
    Affects Version/s: 3.5.3
                       3.5.1

> Spark is creating invalid Parquet files - putting non-UTF-8 data in a STRING column 
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-54586
>                 URL: https://issues.apache.org/jira/browse/SPARK-54586
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.1, 3.5.3, 4.0.1, 3.5.7
>            Reporter: Raz Luvaton
>            Priority: Critical
>
> According to the Parquet logical types specification, a STRING column must 
> be encoded as UTF-8.
>  
> {noformat}
> ### STRING
> `STRING` may only be used to annotate the `BYTE_ARRAY` primitive type and 
> indicates
> that the byte array should be interpreted as a UTF-8 encoded character string.
> The sort order used for `STRING` strings is unsigned byte-wise comparison.
> *Compatibility*
> `STRING` corresponds to `UTF8` ConvertedType.{noformat}
> From [Parquet 
> Specification|https://github.com/apache/parquet-format/blob/905c89706004ee13a76c3df0f9fa4b1d583ddf9a/LogicalTypes.md?plain=1#L61-L70]
> However, Spark can write invalid Parquet files that then can't be read by 
> other tools. For example, querying such a file with DataFusion:
> {noformat}
> datafusion-cli --command "select * from 
> '/tmp/binary-data-with-cast-string/*.parquet' limit 10"{noformat}
> fails with the following error:
> {noformat}
> Error: Parquet error: Arrow: Parquet argument error: Parquet error: 
> encountered non UTF-8 data: invalid utf-8 sequence of 1 bytes from index 
> 4{noformat}
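> To see why other readers reject the file, you can also inspect the Parquet 
> footer directly. A minimal sketch using the parquet-java API that ships with 
> Spark (the part file name below is hypothetical; use any part file from the 
> output directory). It shows column `b` annotated with the STRING (UTF8) 
> logical type even though its bytes are not valid UTF-8:
> {noformat}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
> import org.apache.parquet.hadoop.ParquetFileReader
> import org.apache.parquet.hadoop.util.HadoopInputFile
> 
> // Hypothetical part file name -- substitute a real one from the directory
> val file = HadoopInputFile.fromPath(
>   new Path("/tmp/binary-data-with-cast-string/part-00000.parquet"),
>   new Configuration())
> val reader = ParquetFileReader.open(file)
> // Prints the footer schema; `b` appears as: optional binary b (STRING)
> println(reader.getFooter.getFileMetaData.getSchema)
> reader.close()
> {noformat}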
> ----
> Reproduction of how to create an invalid Parquet file:
>  
> The example uses Spark 4, since its `try_validate_utf8` expression makes the 
> problem easy to demonstrate.
>  
> {noformat}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql._
> // 20 rows whose first byte (0x80) can never begin a valid UTF-8 sequence
> val binaryData = (0 until 20).map { i =>
>   Row(Array[Byte](0x80.toByte, 0x00, 0x00, i.toByte))
> }
> // Create some file with binary data
> spark
>   .createDataFrame(
>     spark.sparkContext.parallelize(binaryData, 1),
>     StructType(Seq(StructField("a", DataTypes.BinaryType, nullable = true)))
>   )
>   .write
>   .format("parquet")
>   .mode("overwrite")
>   .save("/tmp/binary-data")
> // Create a new file with binary data and cast to string
> spark
>   .read
>   .parquet("/tmp/binary-data")
>   .withColumn("b", col("a").cast(DataTypes.StringType))
>   .write
>   .format("parquet")
>   .mode("overwrite")
>   .save("/tmp/binary-data-with-cast-string")
> {noformat}
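>  
> The root cause appears to be that the binary-to-string cast wraps the raw 
> bytes in Spark's internal string type without any UTF-8 validation. A 
> minimal sketch illustrating this (the exact cast code path may differ 
> between versions):
> {noformat}
> import org.apache.spark.unsafe.types.UTF8String
> 
> // 0x80 is a continuation byte and can never begin a valid UTF-8 sequence
> val bytes = Array[Byte](0x80.toByte, 0x00, 0x00, 0x00)
> 
> // fromBytes wraps the array as-is: no exception, no validation, and the
> // invalid bytes flow through to the Parquet writer unchanged
> val s = UTF8String.fromBytes(bytes)
> println(s.numBytes)  // 4
> {noformat}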
>  
> To verify that Spark wrote invalid UTF-8, you can run:
> {noformat}
> spark
>   .read
>   .parquet("/tmp/binary-data-with-cast-string")
>   .withColumn("b_is_valid", expr("try_validate_utf8(b)"))
>   .show(){noformat}
> which will output:
> {noformat}
> +-------------+-----+----------+
> |            a|    b|b_is_valid|
> +-------------+-----+----------+
> |[80 00 00 00]| �|      NULL|
> |[80 00 00 01]| �|      NULL|
> |[80 00 00 02]| �|      NULL|
> |[80 00 00 03]| �|      NULL|
> |[80 00 00 04]| �|      NULL|
> |[80 00 00 05]| �|      NULL|
> |[80 00 00 06]| �|      NULL|
> |[80 00 00 07]|�\a|      NULL|
> |[80 00 00 08]|�\b|      NULL|
> |[80 00 00 09]|�\t|      NULL|
> |[80 00 00 0A]|�\n|      NULL|
> |[80 00 00 0B]|�\v|      NULL|
> |[80 00 00 0C]|�\f|      NULL|
> |[80 00 00 0D]|�\r|      NULL|
> |[80 00 00 0E]| �|      NULL|
> |[80 00 00 0F]| �|      NULL|
> |[80 00 00 10]| �|      NULL|
> |[80 00 00 11]| �|      NULL|
> |[80 00 00 12]| �|      NULL|
> |[80 00 00 13]| �|      NULL|
> +-------------+-----+----------+{noformat}
>  
> The `b_is_valid` column is NULL for every row because none of the values 
> are valid UTF-8.
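>  
> Until the cast itself validates or sanitizes, a possible workaround (a 
> sketch assuming Spark 4's `make_valid_utf8` expression, which replaces 
> invalid byte sequences with the U+FFFD replacement character) is to 
> sanitize the column before writing:
> {noformat}
> // Sanitize the casted column so the written file contains only valid UTF-8.
> // make_valid_utf8 is assumed to be available (Spark 4, alongside the
> // try_validate_utf8 used above); the output path is just an example.
> spark
>   .read
>   .parquet("/tmp/binary-data")
>   .withColumn("b", expr("make_valid_utf8(cast(a as string))"))
>   .write
>   .format("parquet")
>   .mode("overwrite")
>   .save("/tmp/binary-data-with-valid-string")
> {noformat}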


