Guillermo Duran created ARROW-17432:
---------------------------------------
Summary: duplicated rows when importing large csv into parquet
Key: ARROW-17432
URL: https://issues.apache.org/jira/browse/ARROW-17432
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 9.0.0, 8.0.0
Reporter: Guillermo Duran
Environment: R version 4.2.1, running on Arch Linux - EndeavourOS

arrow_info()
Arrow package version: 9.0.0

Capabilities:
dataset    TRUE
substrait  FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo        FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:
Allocator  jemalloc
Current    49.31 Kb
Max        1.63 Mb

Runtime:
SIMD Level           avx2
Detected SIMD Level  avx2

Build:
C++ Library Version   9.0.0
C++ Compiler          GNU
C++ Compiler Version  7.5.0

print(pa.__version__)
9.0.0

This is a weird issue that creates new rows when importing a large csv (56 GB) into parquet in R. It occurred with both R Arrow 8.0.0 and 9.0.0, BUT it did not occur using the Python Arrow library 9.0.0. Due to the large size of the original csv it's difficult to create a reproducible example, but I share the code and outputs below.

The code I use in R to import the csv:

{code:java}
library(arrow)
library(dplyr)

csv_file <- "/ebird_erd2021/full/obs.csv"
dest <- "/ebird_erd2021/full/obs_parquet/"

sch <- arrow::schema(
  checklist_id = float32(),
  species_code = string(),
  exotic_category = float32(),
  obs_count = float32(),
  only_presence_reported = float32(),
  only_slash_reported = float32(),
  valid = float32(),
  reviewed = float32(),
  has_media = float32()
)

csv_stream <- open_dataset(csv_file, format = "csv",
                           schema = sch, skip_rows = 1)

write_dataset(csv_stream, dest, format = "parquet",
              max_rows_per_file = 1000000L,
              hive_style = TRUE,
              existing_data_behavior = "overwrite")
{code}

When I load the resulting dataset and check one checklist_id (18543372), I get duplicates that are not part of the obs.csv file. There shouldn't be duplicated species in a checklist (amerob, for example), and note also that the duplicated species have different obs_count values (I show how this looks in the csv file below):

{code:java}
parquet_arrow <- open_dataset(dest, format = "parquet")

parquet_arrow |>
  filter(checklist_id == 18543372) |>
  arrange(species_code) |>
  collect()

# A tibble: 50 × 3
   checklist_id species_code obs_count
          <dbl> <chr>            <dbl>
 1     18543372 altori               3
 2     18543372 amekes               1
 3     18543372 amered              40
 4     18543372 amerob              30
 5     18543372 amerob               9
 6     18543372 balori               9
 7     18543372 blkter               9
 8     18543372 blkvul              20
 9     18543372 buggna               1
10     18543372 buwwar               1
# … with 40 more rows
# ℹ Use `print(n = ...)` to see more rows
{code}

If I use awk to check that same checklist id and the amerob species in the csv file, I get something different:

{code:java}
$ awk -F "," '{ if (($1 == 18543372) && ($2 == "amerob")) { print } }' obs.csv
18543372.0,amerob,,30.0,0.0,0.0,1.0,0.0,0.0
{code}

Just one amerob row in checklist_id 18543372, with an obs_count of 30...
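To see the scale of the problem beyond this one checklist, a count of repeated (checklist_id, species_code) pairs in the R-written output would show how many rows the import invented. A minimal sketch (reusing `dest` from the code above; I haven't run this over the full dataset, so take it as an illustration rather than output I can vouch for):

{code:java}
# Sketch: count (checklist_id, species_code) pairs that appear more than
# once in the R-written parquet dataset. In the source csv each pair is
# unique, so any pair returned here was duplicated by the import.
library(arrow)
library(dplyr)

open_dataset(dest, format = "parquet") |>
  group_by(checklist_id, species_code) |>
  summarise(n = n()) |>
  collect() |>        # aggregate in Arrow, then filter the counts in R
  filter(n > 1)

# Note: this materializes every pair's count, which may be heavy for the
# full 56 GB dataset; filtering to a subset of checklist_ids first would
# keep it manageable.
{code}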
If I import the csv into parquet using the Python Arrow library instead:

{code:java}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc
import pandas as pd

# infer the schema from the first 1000 rows
test_rows_csv = pd.read_csv("/ebird_erd2021/full/obs.csv", nrows = 1000)
sch = pa.Schema.from_pandas(test_rows_csv)

csv_file = ds.dataset("/ebird_erd2021/full/obs.csv", schema = sch, format = "csv")

ds.write_dataset(csv_file, "/ebird_erd2021/full/obs_parquet_py/",
                 format = "parquet",
                 schema = sch,
                 use_threads = True,
                 max_rows_per_file = 1000000,
                 max_rows_per_group = 1000000,
                 existing_data_behavior = "error")
{code}

And then load it in R, doing the same search on that checklist:

{code:java}
parquet_py <- "/ebird_erd2021/full/obs_parquet_py/"
parquet_arrow <- open_dataset(parquet_py, format = "parquet")

parquet_arrow |>
  filter(checklist_id == 18543372) |>
  arrange(species_code) |>
  select(checklist_id, species_code, obs_count) |>
  collect()

# A tibble: 17 × 3
   checklist_id species_code obs_count
          <dbl> <chr>            <dbl>
 1     18543372 amered              40
 2     18543372 amerob              30
 3     18543372 balori               9
 4     18543372 buggna               1
 5     18543372 buwwar               1
 6     18543372 cangoo               6
 7     18543372 eastow               1
 8     18543372 gowwar               1
 9     18543372 grycat               1
10     18543372 houwre               1
11     18543372 norwat               2
12     18543372 ovenbi1              1
13     18543372 reshaw               1
14     18543372 rewbla              60
15     18543372 robgro               2
16     18543372 sedwre1              2
17     18543372 turvul               1
{code}

I get exactly what I should get: no species_code repeated, just as in the original csv.
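In case it helps triage: since every extra row in the R-written copy also inflates its total, comparing the overall row counts of the two parquet copies against the csv line count should point at the import path rather than the source file. A sketch under the same paths as above (I'm assuming nrow() on a Dataset stays cheap because it can read the parquet metadata; again untested on the full data):

{code:java}
# Sketch: compare total row counts of the two parquet copies. If the
# R-written copy has more rows than the Python-written one, the extra
# rows come from the R import, not from obs.csv.
library(arrow)

r_ds  <- open_dataset("/ebird_erd2021/full/obs_parquet/",    format = "parquet")
py_ds <- open_dataset("/ebird_erd2021/full/obs_parquet_py/", format = "parquet")

nrow(r_ds)   # rows written by the R import
nrow(py_ds)  # rows written by the Python import

# ground truth from the csv itself (minus the header line):
# $ wc -l obs.csv
{code}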