[GitHub] [arrow-cookbook] thisisnic edited a comment on issue #83: [R] Recipe for random sampling

GitBox Thu, 07 Oct 2021 08:57:40 -0700


thisisnic edited a comment on issue #83:
URL: https://github.com/apache/arrow-cookbook/issues/83#issuecomment-937927421



   Thanks for opening this issue @GShotwell!  I agree that it would be really 
useful both a) to be able to do this in arrow and b) to have a recipe for it in 
this cookbook.
   
   There's currently no optimised way of doing this in arrow, but I've opened 
a request for it to be implemented at the C++ level; once that is done, we 
can look at implementing it in R.
   
   https://issues.apache.org/jira/browse/ARROW-14254
   
   In the short term, the snippet below shows how you can achieve this with 
existing files, without having to write any new data to them. I don't think 
we want to add a recipe for this, though, as apparently it's a bit slow. I'd 
like to wait until we have a nice way of doing it properly, utilising the 
underlying C++ functionality, before adding it to the cookbook.
   
   ```
   library(arrow)
   library(dplyr)
   
   # create a temporary directory to hold the dataset
   tf <- tempfile()
   dir.create(tf)
   
   # I've added the grouping just so it's written as a partitioned dataset
   iris %>% group_by(Species) %>%
     write_dataset(tf)
   
   iris_dataset <- open_dataset(tf)
   
   # sample as many values as needed
   rows <- sample(seq_len(nrow(iris_dataset)), 10)
   
   # return the sample
   collect(iris_dataset[rows,])
   ```
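   
   If the data comfortably fits in memory, a simpler route might be to 
collect the dataset first and sample with dplyr. This is only a minimal 
sketch under that assumption, not an optimised approach: it reads every row 
before sampling, and `set.seed()` is added here purely so the result is 
reproducible.
   
   ```r
   library(arrow)
   library(dplyr)
   
   # fix the seed so the same rows are drawn on each run (illustrative only)
   set.seed(42)
   
   tf <- tempfile()
   dir.create(tf)
   
   # write iris as a dataset partitioned by Species
   iris %>%
     group_by(Species) %>%
     write_dataset(tf)
   
   # assumption: the whole dataset fits in memory, so collect() is affordable
   sampled <- open_dataset(tf) %>%
     collect() %>%
     slice_sample(n = 10)
   
   nrow(sampled)
   ```
   
   For data larger than memory this would not be suitable, since 
`collect()` materialises the full dataset before any rows are dropped.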


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
