[ https://issues.apache.org/jira/browse/ARROW-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicola Crane updated ARROW-18200:
---------------------------------
    Parent: ARROW-18215
    Issue Type: Sub-task  (was: Bug)

> [R] Misleading error message if opening CSV dataset with invalid file in directory
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-18200
>                 URL: https://issues.apache.org/jira/browse/ARROW-18200
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: R
>            Reporter: Nicola Crane
>            Assignee: Nicola Crane
>            Priority: Major
>
> I made a mistake before where I thought a dataset contained CSVs which were,
> in fact, Parquet files, but the error message I got was super unhelpful:
> {code:r}
> library(arrow)
> download.file(
>   url = "https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip",
>   destfile = here::here("data/nyc-taxi-tiny.zip")
> )
> # (unzip the zip file into the data directory but don't delete it after)
> open_dataset("data", format = "csv")
> {code}
> {code:r}
> Error in nchar(x) : invalid multibyte string, element 1
> In addition: Warning message:
> In grepl("No match for FieldRef.Name(__filename)", msg, fixed = TRUE) :
>   input string 1 is invalid in this locale
> {code}
> Note that this only occurs with {{format="csv"}}; omitting this argument
> (i.e. using the default of {{format="parquet"}}) leaves us with the much
> better error:
> {code:r}
> Error in `open_dataset()`:
> ! Invalid: Error creating dataset. Could not read schema from '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Could not open Parquet input source '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
> /home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338  GetReader(source, scan_options). Is this a 'parquet' file?
> /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:44  InspectSchemas(std::move(options))
> /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:265  Inspect(options.inspect_options)
> ℹ Did you mean to specify a 'format' other than the default (parquet)?
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)