[jira] [Updated] (ARROW-12162) [R] read_parquet returns Invalid UTF8 payload

David Wales (Jira) Tue, 30 Mar 2021 20:27:07 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-12162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


David Wales updated ARROW-12162:
--------------------------------
    Description: 
h2. Background

I am using the R arrow library.

I am using `dbplyr` to read from a SQL Server database that uses the `latin1` 
encoding, and saving the output as a Parquet file: 
{code:java}
library(dplyr)
library(dbplyr)
library(arrow)

# Assume `con` is a previously established connection to the database,
# created with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%
  collect() %>%
  write_parquet("output.parquet")
{code}
 

However, when I try to read the file back, I get the error "Invalid UTF8 
payload": 
{code:java}
> read_parquet("output.parquet")

Error: Invalid: Invalid UTF8 payload
{code}
 

What I would really like is a way to tell arrow "This data is latin1 encoded. 
Please convert it to UTF-8 before you save it as a Parquet file".

Or alternatively "This Parquet file contains latin1 encoded data".
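
Until something like that exists, one possible stopgap is to re-encode the strings in R before they reach `write_parquet`. The following is only a sketch under stated assumptions: `to_utf8` is a hypothetical helper (not part of arrow), the sample data frame is made up for illustration, and `"latin1"` is assumed to be the right source encoding (data coming from Windows may actually be CP1252, in which case the `from` argument would need to change).

```r
# Hypothetical helper (not part of arrow): re-encode every character column
# of a data frame from an assumed source encoding to UTF-8.
to_utf8 <- function(df, from = "latin1") {
  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) {
    df[is_chr] <- lapply(df[is_chr], iconv, from = from, to = "UTF-8")
  }
  df
}

# Illustrative data: "caf" followed by byte 0xE9, which is "é" in latin1
# but is not a valid UTF-8 sequence on its own.
raw_latin1 <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))
df <- data.frame(id = 1L, name = raw_latin1, stringsAsFactors = FALSE)

fixed <- to_utf8(df)

validUTF8(df$name)     # checks the original bytes
validUTF8(fixed$name)  # checks the re-encoded bytes

# The converted data frame could then be written as usual, e.g.
# arrow::write_parquet(fixed, "output.parquet")
```

This only helps if the source encoding is known in advance; it does not address the request for arrow itself to accept or declare a non-UTF-8 encoding.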
h2. Minimal Reproducible Example

I have isolated this issue to a minimal reproducible example.

If the database table contains the latin1 single quote character, then it will 
trigger the error.

I have attached a `.rds` file which contains an example tibble.

To reproduce, run the following: 
{code:java}
# Assumes `data_dir` is the directory containing the attached bad_char.rds
readRDS(file.path(data_dir, "bad_char.rds")) %>%
  write_parquet(file.path(data_dir, "bad_char.parquet"))

read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}
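
To narrow down which columns carry the offending bytes before writing, something along these lines might help. It is a rough sketch using only base R: `invalid_utf8_columns` is a hypothetical helper name, and the sample data frame (with a lone 0x92 byte standing in for the problematic quote character) is an assumption for illustration, not the contents of the attached file.

```r
# Hypothetical helper: return the names of character columns that contain
# at least one string that is not valid UTF-8.
invalid_utf8_columns <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  bad <- vapply(df[is_chr], function(col) any(!validUTF8(col)), logical(1))
  names(df)[is_chr][bad]
}

# Illustrative data: byte 0x92 on its own is not a valid UTF-8 sequence.
bad_string <- rawToChar(as.raw(0x92))
df <- data.frame(
  id  = 1:2,
  ok  = c("a", "b"),
  bad = c("x", bad_string),
  stringsAsFactors = FALSE
)

invalid_utf8_columns(df)
```

Running the same helper on the tibble returned by `readRDS()` above would show whether the invalid bytes are already present before `write_parquet` is called.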
h2. Possibly related issues

https://issues.apache.org/jira/browse/ARROW-12007

  was:
h2. Background

I am using the R arrow library.

I am using `dbplyr` to read from a SQL Server database that uses the `latin1` 
encoding, and saving the output as a Parquet file: 
{code:java}
library(dplyr)
library(dbplyr)
library(arrow)

# Assume `con` is a previously established connection to the database,
# created with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%
  collect() %>%
  write_parquet("output.parquet")
{code}
 

However, when I try to read the file back, I get the error "Invalid UTF8 
payload": 
{code:java}
> read_parquet("output.parquet")

Error: Invalid: Invalid UTF8 payload
{code}
h2. Minimal Reproducible Example

I have isolated this issue to a minimal reproducible example.

If the database table contains the latin1 single quote character, then it will 
trigger the error.

I have attached a `.rds` file which contains an example tibble.

To reproduce, run the following: 
{code:java}
# Assumes `data_dir` is the directory containing the attached bad_char.rds
readRDS(file.path(data_dir, "bad_char.rds")) %>%
  write_parquet(file.path(data_dir, "bad_char.parquet"))

read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}
h2. Possibly related issues

https://issues.apache.org/jira/browse/ARROW-12007


> [R] read_parquet returns Invalid UTF8 payload
> ---------------------------------------------
>
>                 Key: ARROW-12162
>                 URL: https://issues.apache.org/jira/browse/ARROW-12162
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 3.0.0
>         Environment: Windows 10
> R 4.0.3
> arrow 3.0.0
> dbplyr 2.0.0
> dplyr 1.0.2
>            Reporter: David Wales
>            Priority: Major
>         Attachments: bad_char.rds



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
