[ https://issues.apache.org/jira/browse/ARROW-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557714#comment-17557714 ]

Joris Van den Bossche commented on ARROW-16768:
-----------------------------------------------

For pandas specifically this is not an issue, as pandas (currently) forbids 
missing values in the categories (the factor labels).
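
A minimal illustration (the exact error message may vary across pandas versions):

{code:python}
>>> import pandas as pd
>>> pd.Categorical(["a", "b", None], categories=["a", "b", None])
...
ValueError: Categorical categories cannot be null
{code}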

In pyarrow, nulls are allowed both in the values and in the dictionary of a 
dictionary type (when dictionary encoding, you can choose to keep the nulls in 
the data or encode them in the dictionary), and there you can also run into this:

{code:python}
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import pyarrow.parquet as pq

>>> arr = pa.array([1, 2, 2, 3, None])

# works fine: the null stays in the indices
>>> pq.write_table(
...     pa.table({"col": pc.dictionary_encode(arr, null_encoding="mask")}),
...     "test_dictionary_mask.parquet")

# raises: the null is encoded as a dictionary value
>>> pq.write_table(
...     pa.table({"col": pc.dictionary_encode(arr, null_encoding="encode")}),
...     "test_dictionary_encode.parquet")
...
ArrowNotImplementedError: Writing DictionaryArray with null encoded in dictionary type not yet supported
{code}
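
For context, a rough sketch of what the two null_encoding modes produce 
(illustrative; the actual repr output looks different):

{code:python}
>>> pc.dictionary_encode(arr, null_encoding="mask")
# dictionary: [1, 2, 3]           (no null among the dictionary values)
# indices:    [0, 1, 1, 2, null]  (the null stays in the data)

>>> pc.dictionary_encode(arr, null_encoding="encode")
# dictionary: [1, 2, 3, null]     (the null is itself a dictionary value)
# indices:    [0, 1, 1, 2, 3]     (the last index points at the null entry)
{code}

Only writing the second shape to Parquet is unsupported, hence the error above.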

> [R] Factor levels cannot contain NA
> -----------------------------------
>
>                 Key: ARROW-16768
>                 URL: https://issues.apache.org/jira/browse/ARROW-16768
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 7.0.0
>            Reporter: Kieran Martin
>            Priority: Minor
>             Fix For: 9.0.0
>
>
> If you try to write a data frame containing a factor whose levels include a 
> missing value to parquet, you get the error: "Error: Invalid: Cannot insert 
> dictionary values containing nulls".
> This is likely due to how the metadata for factors is currently captured in 
> parquet files. Reprex follows:
>
> {code:r}
> library(arrow)
> # factor(1, 2, NA) yields a factor whose single level is NA
> bad_data <- data.frame(A = factor(1, 2, NA))
> write_parquet(bad_data, tempfile())
> {code}
>
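
For reference, a minimal Python sketch (the construction is illustrative only) of 
the shape such a factor takes on the Arrow side: the NA level ends up inside the 
dictionary rather than the indices, i.e. the same shape as the 
null_encoding="encode" case above.

{code:python}
>>> import pyarrow as pa

>>> levels_with_na = pa.array(["a", None])       # factor levels, including NA
>>> codes = pa.array([0, 1, 1], type=pa.int8())  # factor codes
>>> dict_arr = pa.DictionaryArray.from_arrays(codes, levels_with_na)
>>> dict_arr.dictionary.null_count
1
{code}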



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
