Guillermo Duran created ARROW-17432:
---------------------------------------
Summary: duplicated rows when importing large csv into parquet
Key: ARROW-17432
URL: https://issues.apache.org/jira/browse/ARROW-17432
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 9.0.0, 8.0.0
Reporter: Guillermo Duran
Environment: R version 4.2.1, running on Arch Linux - EndeavourOS

arrow_info()
Arrow package version: 9.0.0

Capabilities:
dataset    TRUE
substrait  FALSE
parquet    TRUE
json       TRUE
s3         TRUE
gcs        TRUE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip       TRUE
brotli     TRUE
zstd       TRUE
lz4        TRUE
lz4_frame  TRUE
lzo        FALSE
bz2        TRUE
jemalloc   TRUE
mimalloc   TRUE

Memory:
Allocator  jemalloc
Current    49.31 Kb
Max        1.63 Mb

Runtime:
SIMD Level           avx2
Detected SIMD Level  avx2

Build:
C++ Library Version   9.0.0
C++ Compiler          GNU
C++ Compiler Version  7.5.0

print(pa.__version__)
9.0.0

This is a weird issue that creates new rows when importing a large csv (56 GB) into parquet in R. It occurred with both R Arrow 8.0.0 and 9.0.0, BUT it did not occur using the Python Arrow library 9.0.0. Due to the large size of the original csv it's difficult to create a reproducible example, but I share the code and outputs below.

The code I use in R to import the csv:

{code:java}
library(arrow)
library(dplyr)

csv_file <- "/ebird_erd2021/full/obs.csv"
dest <- "/ebird_erd2021/full/obs_parquet/"

sch <- arrow::schema(
  checklist_id = float32(),
  species_code = string(),
  exotic_category = float32(),
  obs_count = float32(),
  only_presence_reported = float32(),
  only_slash_reported = float32(),
  valid = float32(),
  reviewed = float32(),
  has_media = float32()
)

csv_stream <- open_dataset(csv_file, format = "csv",
                           schema = sch, skip_rows = 1)

write_dataset(csv_stream, dest, format = "parquet",
              max_rows_per_file = 1000000L,
              hive_style = TRUE,
              existing_data_behavior = "overwrite")
{code}

When I load the resulting dataset and check one checklist_id (18543372), I get duplicates that are not part of the obs.csv file. There shouldn't be duplicated species in a checklist (amerob, for example), and note also that the duplicated species have different obs_count values (I show how this looks in the csv file below):

{code:java}
parquet_arrow <- open_dataset(dest, format = "parquet")

parquet_arrow |>
  filter(checklist_id == 18543372) |>
  arrange(species_code) |>
  collect()

# A tibble: 50 × 3
   checklist_id species_code obs_count
          <dbl> <chr>            <dbl>
 1     18543372 altori               3
 2     18543372 amekes               1
 3     18543372 amered              40
 4     18543372 amerob              30
 5     18543372 amerob               9
 6     18543372 balori               9
 7     18543372 blkter               9
 8     18543372 blkvul              20
 9     18543372 buggna               1
10     18543372 buwwar               1
# … with 40 more rows
# ℹ Use `print(n = ...)` to see more rows
{code}

If I use awk to check that same checklist id and the amerob species in the csv file, I get something different:

{code:java}
$ awk -F "," '{ if (($1 == 18543372) && ($2 == "amerob")) { print } }' obs.csv
18543372.0,amerob,,30.0,0.0,0.0,1.0,0.0,0.0
{code}

Just one amerob row in checklist_id 18543372, with an obs_count of 30...
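To see the scale of the problem beyond this one checklist, a count of repeated (checklist_id, species_code) pairs in the R-written output would show how many rows the import invented. A minimal sketch (reusing `dest` from the code above; I haven't run this over the full dataset, so take it as an illustration rather than output I can vouch for):

{code:java}
# Sketch: count (checklist_id, species_code) pairs that appear more than
# once in the R-written parquet dataset. In the source csv each pair is
# unique, so any pair returned here was duplicated by the import.
library(arrow)
library(dplyr)

open_dataset(dest, format = "parquet") |>
  group_by(checklist_id, species_code) |>
  summarise(n = n()) |>
  collect() |>        # aggregate in Arrow, then filter the counts in R
  filter(n > 1)

# Note: this materializes every pair's count, which may be heavy for the
# full 56 GB dataset; filtering to a subset of checklist_ids first would
# keep it manageable.
{code}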
If I import the csv into parquet using the Python Arrow library instead:

{code:java}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc
import pandas as pd

# infer the schema from the first 1000 rows
test_rows_csv = pd.read_csv("/ebird_erd2021/full/obs.csv", nrows = 1000)
sch = pa.Schema.from_pandas(test_rows_csv)

csv_file = ds.dataset("/ebird_erd2021/full/obs.csv", schema = sch, format = "csv")

ds.write_dataset(csv_file, "/ebird_erd2021/full/obs_parquet_py/",
                 format = "parquet",
                 schema = sch,
                 use_threads = True,
                 max_rows_per_file = 1000000,
                 max_rows_per_group = 1000000,
                 existing_data_behavior = "error")
{code}

And then load it in R, doing the same search on that checklist:

{code:java}
parquet_py <- "/ebird_erd2021/full/obs_parquet_py/"
parquet_arrow <- open_dataset(parquet_py, format = "parquet")

parquet_arrow |>
  filter(checklist_id == 18543372) |>
  arrange(species_code) |>
  select(checklist_id, species_code, obs_count) |>
  collect()

# A tibble: 17 × 3
   checklist_id species_code obs_count
          <dbl> <chr>            <dbl>
 1     18543372 amered              40
 2     18543372 amerob              30
 3     18543372 balori               9
 4     18543372 buggna               1
 5     18543372 buwwar               1
 6     18543372 cangoo               6
 7     18543372 eastow               1
 8     18543372 gowwar               1
 9     18543372 grycat               1
10     18543372 houwre               1
11     18543372 norwat               2
12     18543372 ovenbi1              1
13     18543372 reshaw               1
14     18543372 rewbla              60
15     18543372 robgro               2
16     18543372 sedwre1              2
17     18543372 turvul               1
{code}

I get exactly what I should get: no species_code repeated, just as in the original csv.
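In case it helps triage: since every extra row in the R-written copy also inflates its total, comparing the overall row counts of the two parquet copies against the csv line count should point at the import path rather than the source file. A sketch under the same paths as above (I'm assuming nrow() on a Dataset stays cheap because it can read the parquet metadata; again untested on the full data):

{code:java}
# Sketch: compare total row counts of the two parquet copies. If the
# R-written copy has more rows than the Python-written one, the extra
# rows come from the R import, not from obs.csv.
library(arrow)

r_ds  <- open_dataset("/ebird_erd2021/full/obs_parquet/",    format = "parquet")
py_ds <- open_dataset("/ebird_erd2021/full/obs_parquet_py/", format = "parquet")

nrow(r_ds)   # rows written by the R import
nrow(py_ds)  # rows written by the Python import

# ground truth from the csv itself (minus the header line):
# $ wc -l obs.csv
{code}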