Carl Boettiger created ARROW-15151:
--------------------------------------

             Summary: write_dataset() never increments {i} in partitions part-{i}
                 Key: ARROW-15151
                 URL: https://issues.apache.org/jira/browse/ARROW-15151
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 6.0.1
         Environment: Ubuntu 21.04
            Reporter: Carl Boettiger


Introducing partitioning in write_dataset() creates subfolders just fine, but 
each lowest-level subfolder only ever contains a single part-0.parquet.  I don't 
see how to get write_dataset() to generate output with multiple part files in a 
single directory, such as part-0.parquet, part-1.parquet, etc.  For example, the 
documentation for open_dataset() implies we should get three `Z`-level parts:


{code:java}
# You can also partition by the values in multiple columns
# (here: "cyl" and "gear").
# This creates a structure of the form cyl=X/gear=Y/part-Z.parquet.
two_levels_tree <- tempfile()
write_dataset(mtcars, two_levels_tree, partitioning = c("cyl", "gear"))
list.files(two_levels_tree, recursive = TRUE)

# In the two previous examples we would have:
# X = {4,6,8}, the number of cylinders.
# Y = {3,4,5}, the number of forward gears.
# Z = {0,1,2}, the number of saved parts, starting from 0.
{code}
But I only get the expected directory structure, with a single part-0.parquet 
file in each lowest-level directory.

Context: I frequently need to partition large files that lack any natural 
grouping variable; I merely want a bunch of small parts of equal size.  It 
would be great if there were an automatic way of doing this; currently I can 
hack this by creating a partition column with integers 1..n, where n is my 
desired number of partitions, and partitioning on that.  I'd then like to write 
these to a flat structure with part-0.parquet, part-1.parquet, etc., not a 
nested folder structure, if possible. 
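
For illustration, the workaround described above might look roughly like the 
following.  This is only a sketch: the column name `chunk`, the value of 
`n_parts`, and the use of mtcars are placeholders chosen for the example, not 
anything write_dataset() requires.

```r
library(arrow)

df <- mtcars
n_parts <- 4L  # hypothetical number of roughly equal parts

# Assign consecutive rows to chunks 1..n_parts of near-equal size.
df$chunk <- as.integer(ceiling(seq_len(nrow(df)) * n_parts / nrow(df)))

out_dir <- tempfile()
write_dataset(df, out_dir, partitioning = "chunk")

# The result is one sub-folder per chunk value, not a flat set of part files.
list.files(out_dir, recursive = TRUE)
```

This gives equally sized pieces, but each piece lands in its own chunk=K 
folder, which is the nested layout I would like to avoid.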

(Or better yet, it would be amazing if write_dataset() just let us set a 
maximum partition file size and could automate the sharding into parts while 
preserving the existing behavior for actually semantically meaningful groups.  
Maybe that is already the intent but I cannot see how to activate it!)
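
To make the request concrete, a manual version of that sharding could be 
sketched as below.  It caps the number of rows per file as a stand-in for file 
size and bypasses write_dataset() entirely by calling write_parquet() once per 
block; `max_rows` and the part-%d.parquet naming are assumptions made for the 
example.

```r
library(arrow)

df <- mtcars
max_rows <- 10L  # hypothetical cap on rows per file
out_dir <- tempfile()
dir.create(out_dir)

# Group row indices into consecutive blocks of at most max_rows rows.
idx <- seq_len(nrow(df))
blocks <- split(idx, ceiling(idx / max_rows))

# Write each block to its own file in a single flat directory.
for (i in seq_along(blocks)) {
  path <- file.path(out_dir, sprintf("part-%d.parquet", i - 1L))
  write_parquet(df[blocks[[i]], , drop = FALSE], path)
}

list.files(out_dir)
```

This produces the flat layout, but it does not handle semantically meaningful 
partition columns, which is why built-in support would be preferable.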



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
