[jira] [Updated] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

Rong Ma (Jira) Mon, 21 Jun 2021 23:53:07 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rong Ma updated SPARK-31703:
----------------------------
    Description: 
Trying to upgrade to Apache Spark 2.4.5 on our IBM systems (AIX and PowerPC) 
so as to be able to read data stored in Parquet format, we notice that values 
of DOUBLE and DECIMAL types are parsed incorrectly.

According to the Parquet documentation, plain-encoded values are always stored 
in little-endian representation:
 [https://github.com/apache/parquet-format/blob/master/Encodings.md]
{noformat}
The plain encoding is used whenever a more efficient encoding can not be used.
It stores the data in the following format:

BOOLEAN: Bit Packed, LSB first
INT32: 4 bytes little endian
INT64: 8 bytes little endian
INT96: 12 bytes little endian (deprecated)
FLOAT: 4 bytes IEEE little endian
DOUBLE: 8 bytes IEEE little endian
BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in 
the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array

For native types, this outputs the data as little endian. Floating
point types are encoded in IEEE.
For the byte array type, it encodes the length as a 4 byte little
endian, followed by the bytes.{noformat}
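
To illustrate the byte-order rule quoted above, here is a minimal Python sketch. It is not Spark's reader code; the helper names `read_plain_double` and `read_plain_byte_array` are hypothetical, and the sketch only assumes the layout stated in the Parquet excerpt (8-byte IEEE little-endian DOUBLE, 4-byte little-endian length prefix for BYTE_ARRAY). It shows why a reader that uses the host's native byte order on a big-endian machine would return wrong values.

```python
import struct


def read_plain_double(buf: bytes, offset: int = 0) -> float:
    """Decode one PLAIN-encoded DOUBLE: 8 bytes, IEEE 754, little endian."""
    # "<d" forces little-endian decoding regardless of the host CPU.
    return struct.unpack_from("<d", buf, offset)[0]


def read_plain_byte_array(buf: bytes, offset: int = 0) -> bytes:
    """Decode one PLAIN-encoded BYTE_ARRAY: 4-byte little-endian length, then the bytes."""
    (length,) = struct.unpack_from("<i", buf, offset)
    start = offset + 4
    return buf[start:start + length]


# Bytes as they would appear in a Parquet page for the value 1.5.
encoded = struct.pack("<d", 1.5)

# Correct: always decode as little endian.
correct = read_plain_double(encoded)

# Incorrect: what a reader sees if it assumes big-endian order,
# which is the native order on AIX and big-endian LinuxPPC64.
misread = struct.unpack(">d", encoded)[0]

print(correct, misread)
```

On any host, `correct` is 1.5 while `misread` is a different, tiny denormal number, which matches the kind of corruption described for DOUBLE columns in this report.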

  was:
Trying to upgrade to Apache Spark 2.4.5 on our IBM systems (AIX and PowerPC) so 
as to be able to read data stored in Parquet format, we notice that values 
of DOUBLE and DECIMAL types are parsed incorrectly.

According to the Parquet documentation, plain-encoded values are always stored 
in little-endian representation:
 [https://github.com/apache/parquet-format/blob/master/Encodings.md]
{noformat}
The plain encoding is used whenever a more efficient encoding can not be used.
It stores the data in the following format:

BOOLEAN: Bit Packed, LSB first
INT32: 4 bytes little endian
INT64: 8 bytes little endian
INT96: 12 bytes little endian (deprecated)
FLOAT: 4 bytes IEEE little endian
DOUBLE: 8 bytes IEEE little endian
BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in 
the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array

For native types, this outputs the data as little endian. Floating
point types are encoded in IEEE.
For the byte array type, it encodes the length as a 4 byte little
endian, followed by the bytes.{noformat}


> Changes made by SPARK-26985 break reading parquet files correctly in 
> BigEndian architectures (AIX + LinuxPPC64)
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31703
>                 URL: https://issues.apache.org/jira/browse/SPARK-31703
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.5, 3.0.0
>         Environment: AIX 7.2
> LinuxPPC64 with RedHat.
>            Reporter: Michail Giannakopoulos
>            Assignee: Tin Hang To
>            Priority: Blocker
>              Labels: BigEndian, correctness
>             Fix For: 2.4.7, 3.0.1, 3.1.0
>
>         Attachments: Data_problem_Spark.gif
>
>
> Trying to upgrade to Apache Spark 2.4.5 on our IBM systems (AIX and PowerPC) 
> so as to be able to read data stored in Parquet format, we notice that values 
> of DOUBLE and DECIMAL types are parsed incorrectly.
> According to the Parquet documentation, plain-encoded values are always 
> stored in little-endian representation:
>  [https://github.com/apache/parquet-format/blob/master/Encodings.md]
> {noformat}
> The plain encoding is used whenever a more efficient encoding can not be 
> used. It stores the data in the following format:
> BOOLEAN: Bit Packed, LSB first
> INT32: 4 bytes little endian
> INT64: 8 bytes little endian
> INT96: 12 bytes little endian (deprecated)
> FLOAT: 4 bytes IEEE little endian
> DOUBLE: 8 bytes IEEE little endian
> BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained 
> in the array
> FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
> For native types, this outputs the data as little endian. Floating
> point types are encoded in IEEE.
> For the byte array type, it encodes the length as a 4 byte little
> endian, followed by the bytes.{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

