[ https://issues.apache.org/jira/browse/ARROW-12162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson closed ARROW-12162.
-----------------------------------
    Resolution: Information Provided

> [R] read_parquet returns Invalid UTF8 payload
> ---------------------------------------------
>
>                 Key: ARROW-12162
>                 URL: https://issues.apache.org/jira/browse/ARROW-12162
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 3.0.0
>         Environment: Windows 10
> R 4.0.3
> arrow 3.0.0
> dbplyr 2.0.0
> dplyr 1.0.2
>            Reporter: David Wales
>            Priority: Major
>         Attachments: bad_char.rds
>
>
> h2. EDIT:
> I've found a solution for my specific use case. If I add the argument
> `encoding="latin1"` to the `DBI::dbConnect` function, then everything works!
> This issue might still be valid for other cases where invalid data gets
> written to a Parquet file, though. It would be nice to get an error on
> write, rather than on read!
> h2. Background
> I am using the R arrow library.
> I am reading from a SQL Server database that uses the `latin1` encoding,
> via `dbplyr`, and saving the output as a Parquet file:
> {code:java}
> library(arrow)
> library(dplyr)
> library(dbplyr)
> # Assume `con` is a previously established connection to the database,
> # created with DBI::dbConnect
> tbl(con, in_schema("dbo", "latin1_table")) %>%
>   collect() %>%
>   write_parquet("output.parquet")
> {code}
>
> However, when I try to read the file back, I get the error "Invalid UTF8
> payload":
> {code:java}
> > read_parquet("output.parquet")
> Error: Invalid: Invalid UTF8 payload
> {code}
>
> What I would really like is a way to tell arrow "This data is latin1 encoded.
> Please convert it to UTF-8 before you save it as a Parquet file".
> Or alternatively "This Parquet file contains latin1 encoded data".
> h2. Minimal Reproducible Example
> I have isolated this issue to a minimal reproducible example.
> If the database table contains the latin1 single quote character, then it
> will trigger the error.
> I have attached a `.rds` file which contains an example tibble.
> To reproduce, run the following:
> {code:java}
> # `data_dir` is the directory that holds the attached bad_char.rds
> readRDS(file.path(data_dir, "bad_char.rds")) %>%
>   write_parquet(file.path(data_dir, "bad_char.parquet"))
> read_parquet(file.path(data_dir, "bad_char.parquet"))
> {code}
> h2. Possibly related issues
> https://issues.apache.org/jira/browse/ARROW-12007



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
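
For cases where the connection-level `encoding` argument is not available, the request quoted above ("convert it to UTF-8 before you save it") could be approximated in user code before calling `write_parquet`. The following is only a sketch using base R, not something arrow provides: the helper names `to_utf8` and `all_valid_utf8` are hypothetical, the sample data frame is made up for illustration, and `from = "latin1"` is an assumption about the source encoding (data that is really Windows-1252 may need a different `from` value).

```r
# Hypothetical helper: re-encode every character column of a data frame
# from a declared source encoding to UTF-8, using base R's iconv().
to_utf8 <- function(df, from = "latin1") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], iconv, from = from, to = "UTF-8")
  df
}

# Hypothetical helper: TRUE if all character columns hold valid UTF-8.
# Can serve as an "error on write" style check before saving.
all_valid_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  all(vapply(
    df[is_chr],
    function(col) all(validUTF8(col), na.rm = TRUE),
    logical(1)
  ))
}

# Made-up example data: the bytes 63 61 66 E9 spell "café" in latin1,
# but the lone E9 byte is not valid UTF-8.
bad <- data.frame(
  id = 1L,
  txt = rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9))),
  stringsAsFactors = FALSE
)

bad_is_valid <- all_valid_utf8(bad)      # checked before conversion
fixed <- to_utf8(bad)                    # re-encoded copy
fixed_is_valid <- all_valid_utf8(fixed)  # checked after conversion
```

With something like this in place, the pipeline from the report would become `collect() %>% to_utf8() %>% write_parquet("output.parquet")`, and `stopifnot(all_valid_utf8(df))` ahead of the write would surface bad bytes at write time instead of at read time.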