neopaf opened a new issue, #15138: URL: https://github.com/apache/arrow/issues/15138
### Describe the usage question you have. Please include as many useful details as possible.

As I understand from https://cloudxlab.com/blog/bucketing-clustered-by-and-cluster-by/, Hive supports a `CLUSTERED BY (cols)` approach in which data are split into `0000<bucket>` bucket files, in addition to `PARTITIONED BY (cols)`. In Arrow, `arrow::dataset` can be configured to write to different partitions as described in [the C++ dataset docs](https://arrow.apache.org/docs/cpp/dataset.html). However, I can't seem to find a way to write data split into such bucket files. I guess I need to define an additional custom partitioning, but I can't find a relevant hash function. Has anybody already walked that path and can share some insights (if not code)?

### Component(s)

C++
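For readers landing on this issue: Hive-style bucketing boils down to `bucket = hash(key) % num_buckets`, with the writer routing each row group to its bucket's file. Below is a minimal, Arrow-free sketch of that idea in Python; `bucket_of` and `split_into_buckets` are hypothetical helper names, not Arrow APIs, and CRC32 stands in for whatever stable hash one chooses (Hive itself uses its own hash, so this will not be byte-compatible with Hive buckets).

```python
import zlib
from collections import defaultdict

def bucket_of(value: str, num_buckets: int) -> int:
    # Use a deterministic hash (CRC32) so the same key always lands in the
    # same bucket across runs; Python's built-in hash() is salted per process
    # and is NOT suitable for on-disk bucketing.
    return zlib.crc32(value.encode("utf-8")) % num_buckets

def split_into_buckets(rows, key, num_buckets):
    # Group rows by bucket id; each group would then be written out as its
    # own bucket file (e.g. bucket_00000, bucket_00001, ...). With a dataset
    # writer, the bucket id could instead be materialized as a column and
    # used as an ordinary partition key.
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets
```

With `arrow::dataset`, the analogous workaround would be to compute such a bucket-id column before writing and partition on it, which yields one directory per bucket rather than Hive's flat `0000<bucket>` file naming.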