[ https://issues.apache.org/jira/browse/ARROW-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weston Pace reassigned ARROW-12321:
-----------------------------------

    Assignee: Weston Pace

> [R][C++] Arrow opens too many files at once when writing a dataset
> ------------------------------------------------------------------
>
>                 Key: ARROW-12321
>                 URL: https://issues.apache.org/jira/browse/ARROW-12321
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 3.0.0
>            Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>            Assignee: Weston Pace
>            Priority: Major
>             Fix For: 5.0.0
>
>
> _Related to:_ https://issues.apache.org/jira/browse/ARROW-12315
> Please see
> https://drive.google.com/drive/folders/1e7WB36FPYzvdtm46dgAEEFAKAWDQs-e1?usp=sharing
> where I added the raw data and the output.
> This works:
> {code:java}
> library(data.table)
> library(dplyr)
> library(arrow)
>
> # Read the raw CSV, forcing the commodity code to character and the
> # value/quantity columns to numeric
> d <- fread(
>   input =
>     "01-raw-data/sitc-rev2/parquet/type-C_r-ALL_ps-2019_freq-A_px-S2_pub-20210216_fmt-csv_ex-20210227.csv",
>   colClasses = list(
>     character = "Commodity Code",
>     numeric = c("Trade Value (US$)", "Qty", "Netweight (kg)")
>   ))
>
> # Replace missing or blank ISO codes with an explicit placeholder
> d <- d %>%
>   mutate(
>     `Reporter ISO` = case_when(
>       `Reporter ISO` %in% c(NA, "", " ") ~ "0-unspecified",
>       TRUE ~ `Reporter ISO`
>     ),
>     `Partner ISO` = case_when(
>       `Partner ISO` %in% c(NA, "", " ") ~ "0-unspecified",
>       TRUE ~ `Partner ISO`
>     )
>   )
>
> # d %>%
> #   select(Year, `Reporter ISO`, `Partner ISO`) %>%
> #   distinct() %>%
> #   dim()
>
> # Partition by year and reporter only
> d %>%
>   group_by(Year, `Reporter ISO`) %>%
>   write_dataset("parquet", hive_style = FALSE, max_partitions = 1024L)
> {code}
> But if I add an additional column for partitioning and increase the max
> partitions to 12808 (exactly the number of partitions that it needs),
> I get the error:
> {code:java}
> # Partition by year, reporter and partner
> d %>%
>   group_by(Year, `Reporter ISO`, `Partner ISO`) %>%
>   write_dataset("parquet", hive_style = FALSE, max_partitions = 12808L)
>
> Error: IOError: Failed to open local file
> '/media/pacha/pacha_backup/tradestatistics/yearly-datasets-arrow/01-raw-data/sitc-rev2/parquet/2019/SEN/MOZ/part-5353.parquet'.
> Detail: [errno 24] Too many open files
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)