[ https://issues.apache.org/jira/browse/ARROW-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008799#comment-17008799 ]
Hiroaki Yutani edited comment on ARROW-7045 at 1/6/20 12:29 PM:
----------------------------------------------------------------

I think I found the cause: {{ParquetArrowWriterProperties$store_schema()}} is not called by default, because the default path takes the first {{if}} branch in {{ParquetArrowWriterProperties$create}} below:

{code:r}
if (!use_deprecated_int96_timestamps && is.null(coerce_timestamps) && !allow_truncated_timestamps) {
  shared_ptr(ParquetArrowWriterProperties, parquet___default_arrow_writer_properties())
} else {
  builder <- shared_ptr(ParquetArrowWriterPropertiesBuilder, parquet___ArrowWriterProperties___Builder__create())
  builder$store_schema()
  builder$set_int96_support(use_deprecated_int96_timestamps)
  builder$set_coerce_timestamps(coerce_timestamps)
  builder$set_allow_truncated_timestamps(allow_truncated_timestamps)
  shared_ptr(ParquetArrowWriterProperties, parquet___ArrowWriterProperties___Builder__build(builder))
}
{code}

(https://github.com/apache/arrow/blob/dd6b17d0cc1a77aaff84c5a4472ac73bc79486af/r/R/parquet.R#L208-L217)

In fact, specifying any of these arguments makes the roundtrip work, because it forces the builder path. For example:

{code:r}
test_that("Factors are preserved when writing/reading from Parquet", {
  tf <- tempfile()
  on.exit(unlink(tf))
  df <- data.frame(a = factor(c("a", "b")))

  write_parquet(df, tf, allow_truncated_timestamps = TRUE)
  expect_equivalent(read_parquet(tf), df)
})
{code}

According to https://github.com/apache/arrow/pull/5077:

bq. Add ArrowWriterProperties::store_schema() option which stores the Arrow schema used to create a Parquet file in a special ARROW:schema key in the metadata, so that we can detect that a column was originally DictionaryArray. This option is off by default, but enabled in the Python bindings. We can always make it the default in the future

I think the R bindings can follow the Python ones and always use {{ParquetArrowWriterPropertiesBuilder}} instead of {{default_arrow_writer_properties()}}, as sketched below. If this sounds good, I'm happy to send a PR for this.
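To make the proposal concrete, here is a minimal sketch of what {{ParquetArrowWriterProperties$create}} could look like if it always went through the builder. This is only an illustration, not the actual patch: it assumes the current internal bindings ({{parquet___ArrowWriterProperties___Builder__create()}} and friends) and the current {{create()}} arguments stay as they are, and it simply keeps the {{else}} branch shown above.

{code:r}
# Sketch only: always use the builder so store_schema() is called,
# matching what the Python bindings do. Assumes the existing internal
# C++ bindings and the current create() arguments are unchanged.
create <- function(use_deprecated_int96_timestamps = FALSE,
                   coerce_timestamps = NULL,
                   allow_truncated_timestamps = FALSE) {
  builder <- shared_ptr(ParquetArrowWriterPropertiesBuilder, parquet___ArrowWriterProperties___Builder__create())
  # Store the Arrow schema in the ARROW:schema metadata key so that
  # dictionary (factor) columns can be restored on read.
  builder$store_schema()
  builder$set_int96_support(use_deprecated_int96_timestamps)
  builder$set_coerce_timestamps(coerce_timestamps)
  builder$set_allow_truncated_timestamps(allow_truncated_timestamps)
  shared_ptr(ParquetArrowWriterProperties, parquet___ArrowWriterProperties___Builder__build(builder))
}
{code}

With a change along these lines, a plain {{write_parquet(df, tf)}} would take the same code path as the workaround above, so factor columns should roundtrip without passing any extra arguments.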
> [R] Factor type not preserved in Parquet roundtrip
> --------------------------------------------------
>
>                 Key: ARROW-7045
>                 URL: https://issues.apache.org/jira/browse/ARROW-7045
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Neal Richardson
>            Priority: Major
>
> {code:r}
> test_that("Factors are preserved when writing/reading from Parquet", {
>   tf <- tempfile()
>   on.exit(unlink(tf))
>   df <- data.frame(a = factor(c("a", "b")))
>   write_parquet(df, tf)
>   expect_equivalent(read_parquet(tf), df)
> })
> {code}
>
> Fails:
>
> {code}
> `object` not equivalent to `expected`.
> Component “a”: target is character, current is factor
> {code}
>
> This has to do with the translation with Parquet and not the R <--> Arrow type mapping (unlike ARROW-7028). If you write_feather and read_feather, the test passes.
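For comparison, a sketch of the Feather roundtrip mentioned in the last sentence of the report; per that note, this version passes because the factor (dictionary) type survives the Feather roundtrip. It mirrors the failing Parquet test with only the write/read functions swapped.

{code:r}
# Sketch of the equivalent Feather roundtrip, which the report says passes.
test_that("Factors are preserved when writing/reading from Feather", {
  tf <- tempfile()
  on.exit(unlink(tf))
  df <- data.frame(a = factor(c("a", "b")))

  write_feather(df, tf)
  expect_equivalent(read_feather(tf), df)
})
{code}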