Hi, using the R arrow package version 14.0.2.1, I'm stumped by
something seemingly simple. For date columns, I like to use R's Date
class, which is stored internally as a number but prints as a
YYYY-MM-DD string.

In most cases arrow handles these Date columns nicely. The exception
is when I partition on a Date column, as in column "d1" in my example
below. When I read my data back in with open_dataset(), the d1 column
is now a string instead of Date. In contrast, the types of all the
other columns are preserved, including my "d2" Date column, because I
did not partition on that one.

It sort of makes sense that d1 is now a string, because the directory
names on disk really are strings like "2024-01-01". But I'd really
like to convert it back to the Date class format! In plain R that's
easy, but with the Dataset mmap-ed on disk, I don't know how to do it.

What should I do to get arrow to convert the partitioned d1 column to
Arrow's date32[day] type, and thus back to R's Date class? Can I
somehow do this directly on the Dataset object itself, WITHOUT first
converting it to ArrowTabular or data.frame?

Thanks for your help!

Example follows:
--------------------------------------------------
require("arrow")
my.dir <- "/tmp/arrow"
# Example data with some Date-class columns:
aa <- do.call("rbind" ,lapply(split(iris ,iris$Species) ,function(xx){
   cbind(head(xx ,5)
        ,d1=(as.Date('2024-01-01') + 0:4)
        ,d2=(as.Date('1980-01-01') + 0:4))
})); rownames(aa) <- NULL
arrow::write_dataset(aa ,my.dir ,partitioning=c('d1') ,hive_style=FALSE
                    ,format="feather" ,codec=Codec$create("LZ4_FRAME"))
bb <- arrow::open_dataset(my.dir ,format="feather" ,unify_schemas=TRUE
                         ,partitioning=c('d1'))
# Unfortunately the "d1" column is now a string.
> dim(aa)
[1] 15 7
> class(aa)
[1] "data.frame"
> sapply(aa ,class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d1           d2
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"       "Date"       "Date"
> sapply(aa ,storage.mode)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d1           d2
    "double"     "double"     "double"     "double"    "integer"     "double"     "double"
> dim(bb)
[1] 15 7
> class(bb)
[1] "FileSystemDataset" "Dataset" "ArrowObject" "R6"
> bb$schema$d1
Field
d1: string
> bb$schema$d2
Field
d2: date32[day]
> bb
FileSystemDataset with 5 Feather files
Sepal.Length: double
Sepal.Width: double
Petal.Length: double
Petal.Width: double
Species: dictionary<values=string, indices=int8>
d2: date32[day]
d1: string
See $metadata for additional Schema metadata
> sapply(arrow:::as.data.frame.ArrowTabular(bb$NewScan()$Finish()$ToTable()) ,class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d2           d1
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"       "Date"  "character"
--------------------------------------------------
--
Andrew Piskorski <[email protected]>