[ https://issues.apache.org/jira/browse/ARROW-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008799#comment-17008799 ]
Hiroaki Yutani edited comment on ARROW-7045 at 1/6/20 12:29 PM:
----------------------------------------------------------------

I think I found the cause: {{ParquetArrowWriterProperties$store_schema()}} is not called by default, because the default path takes the first {{if}} branch in {{ParquetArrowWriterProperties$create}} below:

{code:r}
if (!use_deprecated_int96_timestamps && is.null(coerce_timestamps) && !allow_truncated_timestamps) {
  shared_ptr(ParquetArrowWriterProperties, parquet___default_arrow_writer_properties())
} else {
  builder <- shared_ptr(ParquetArrowWriterPropertiesBuilder, parquet___ArrowWriterProperties___Builder__create())
  builder$store_schema()
  builder$set_int96_support(use_deprecated_int96_timestamps)
  builder$set_coerce_timestamps(coerce_timestamps)
  builder$set_allow_truncated_timestamps(allow_truncated_timestamps)
  shared_ptr(ParquetArrowWriterProperties, parquet___ArrowWriterProperties___Builder__build(builder))
}
{code}

(https://github.com/apache/arrow/blob/dd6b17d0cc1a77aaff84c5a4472ac73bc79486af/r/R/parquet.R#L208-L217)

In fact, specifying any of these arguments makes the roundtrip work, because it forces the builder path. For example:

{code:r}
test_that("Factors are preserved when writing/reading from Parquet", {
  tf <- tempfile()
  on.exit(unlink(tf))
  df <- data.frame(a = factor(c("a", "b")))

  write_parquet(df, tf, allow_truncated_timestamps = TRUE)
  expect_equivalent(read_parquet(tf), df)
})
{code}

According to https://github.com/apache/arrow/pull/5077:

bq. Add ArrowWriterProperties::store_schema() option which stores the Arrow schema used to create a Parquet file in a special ARROW:schema key in the metadata, so that we can detect that a column was originally DictionaryArray. This option is off by default, but enabled in the Python bindings. We can always make it the default in the future

I think the R bindings can follow the Python ones and always use {{ParquetArrowWriterPropertiesBuilder}} instead of {{default_arrow_writer_properties()}}, as sketched below. If this sounds good, I'm happy to send a PR for this.
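To make the proposal concrete, here is a minimal sketch of what {{ParquetArrowWriterProperties$create}} could look like if it always went through the builder. This is only an illustration, not the actual patch: it assumes the current internal bindings ({{parquet___ArrowWriterProperties___Builder__create()}} and friends) and the current {{create()}} arguments stay as they are, and it simply keeps the {{else}} branch shown above.

{code:r}
# Sketch only: always use the builder so store_schema() is called,
# matching what the Python bindings do. Assumes the existing internal
# C++ bindings and the current create() arguments are unchanged.
create <- function(use_deprecated_int96_timestamps = FALSE,
                   coerce_timestamps = NULL,
                   allow_truncated_timestamps = FALSE) {
  builder <- shared_ptr(ParquetArrowWriterPropertiesBuilder, parquet___ArrowWriterProperties___Builder__create())
  # Store the Arrow schema in the ARROW:schema metadata key so that
  # dictionary (factor) columns can be restored on read.
  builder$store_schema()
  builder$set_int96_support(use_deprecated_int96_timestamps)
  builder$set_coerce_timestamps(coerce_timestamps)
  builder$set_allow_truncated_timestamps(allow_truncated_timestamps)
  shared_ptr(ParquetArrowWriterProperties, parquet___ArrowWriterProperties___Builder__build(builder))
}
{code}

With a change along these lines, a plain {{write_parquet(df, tf)}} would take the same code path as the workaround above, so factor columns should roundtrip without passing any extra arguments.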
> [R] Factor type not preserved in Parquet roundtrip
> --------------------------------------------------
>
>                 Key: ARROW-7045
>                 URL: https://issues.apache.org/jira/browse/ARROW-7045
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Neal Richardson
>            Priority: Major
>
> {code:r}
> test_that("Factors are preserved when writing/reading from Parquet", {
>   tf <- tempfile()
>   on.exit(unlink(tf))
>   df <- data.frame(a = factor(c("a", "b")))
>   write_parquet(df, tf)
>   expect_equivalent(read_parquet(tf), df)
> })
> {code}
>
> Fails:
>
> {code}
> `object` not equivalent to `expected`.
> Component “a”: target is character, current is factor
> {code}
>
> This has to do with the translation with Parquet and not the R <--> Arrow type mapping (unlike ARROW-7028). If you write_feather and read_feather, the test passes.
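For comparison, a sketch of the Feather roundtrip mentioned in the last sentence of the report; per that note, this version passes because the factor (dictionary) type survives the Feather roundtrip. It mirrors the failing Parquet test with only the write/read functions swapped.

{code:r}
# Sketch of the equivalent Feather roundtrip, which the report says passes.
test_that("Factors are preserved when writing/reading from Feather", {
  tf <- tempfile()
  on.exit(unlink(tf))
  df <- data.frame(a = factor(c("a", "b")))

  write_feather(df, tf)
  expect_equivalent(read_feather(tf), df)
})
{code}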