Daniel Darabos created SPARK-40873:
--------------------------------------

             Summary: Spark doesn't see some Parquet columns written from r-arrow
                 Key: SPARK-40873
                 URL: https://issues.apache.org/jira/browse/SPARK-40873
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Daniel Darabos
         Attachments: part-0.parquet
I have a Parquet file that was created in R with the r-arrow package version 9.0.0 from Conda Forge, using the write_dataset() function. It has four columns, but Spark 3.3.0 sees only two of them.

{code}
>>> df = spark.read.parquet('part-0.parquet')
>>> df.head()
Row(name='Adam', age=20.0)
>>> df.columns
['name', 'age']
>>> import pandas as pd
>>> pd.read_parquet('part-0.parquet')
           name   age   age_2      age_4
0          Adam  20.0   400.0   160000.0
1           Eve  18.0   324.0   104976.0
2           Bob  50.0  2500.0  6250000.0
3  Isolated Joe   2.0     4.0       16.0
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> t = pq.read_table('part-0.parquet')
>>> t
pyarrow.Table
name: string
age: double
age_2: double
age_4: double
----
name: [["Adam","Eve","Bob","Isolated Joe"]]
age: [[20,18,50,2]]
age_2: [[400,324,2500,4]]
age_4: [[160000,104976,6250000,16]]
>>> pq.read_metadata('part-0.parquet')
<pyarrow._parquet.FileMetaData object at 0x7f13e9dee5e0>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 4
  num_rows: 4
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 1510
>>> pq.read_metadata('part-0.parquet').schema
<pyarrow._parquet.ParquetSchema object at 0x7f13e9dc46c0>
required group field_id=-1 schema {
  optional binary field_id=-1 name (String);
  optional double field_id=-1 age;
  optional double field_id=-1 age_2;
  optional double field_id=-1 age_4;
}
{code}

Based on the schema, "age_2" and "age_4" look no different from "age". I tried changing the names (to plain letters), but I still get the same behavior. Is something wrong with my file? Is something wrong with Spark? (I'll attach the file in a minute, I just need to figure out how.)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org