[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9130: ARROW-10247: [C++][Dataset] Support writing datasets partitioned on dictionary columns

GitBox Mon, 11 Jan 2021 11:26:56 -0800


jorisvandenbossche commented on a change in pull request #9130:
URL: https://github.com/apache/arrow/pull/9130#discussion_r555285509




##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -562,6 +569,8 @@ inline Result<std::shared_ptr<Array>> CountsToOffsets(
 // since no Writers accept a selection vector.
 class StructDictionary {
  public:
+  static constexpr int32_t kMaxGroups = std::numeric_limits<int16_t>::max();

Review comment:
   As long as it is configurable, that is fine with me.
   But I think a limit of something like 100 is too small. For example, the NYC
taxi dataset partitioned by year + month for 8 years of data already has
8*12 = 96 groups. And I think partitioning by day is not that uncommon in
practice for big data (although in those cases you will probably not write it
all at once).
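
To make the arithmetic concrete, here is a minimal sketch of the group counts involved. All names below (`GroupsByYearMonth`, `GroupsByDay`, `FitsLimit`, `kProposedSmallLimit`, `kInt16Limit`) are hypothetical helpers for illustration and are not part of Arrow; only the `int16_t` maximum comes from the diff above. A year is approximated as 365 days.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical constants for illustration; not Arrow API.
// The "something like 100" limit discussed in this thread.
constexpr std::int64_t kProposedSmallLimit = 100;
// The value kMaxGroups takes in the diff above (32767).
constexpr std::int64_t kInt16Limit = std::numeric_limits<std::int16_t>::max();

// Number of distinct (year, month) partition keys for `years` years of data.
constexpr std::int64_t GroupsByYearMonth(std::int64_t years) { return years * 12; }

// Number of distinct daily partition keys, approximating a year as 365 days.
constexpr std::int64_t GroupsByDay(std::int64_t years) { return years * 365; }

// True when `num_groups` distinct partition keys fit under `max_groups`.
constexpr bool FitsLimit(std::int64_t num_groups, std::int64_t max_groups) {
  return num_groups <= max_groups;
}

// 8 years by year + month: 96 groups, which only just fits under 100.
static_assert(GroupsByYearMonth(8) == 96, "8 * 12 == 96");
static_assert(FitsLimit(GroupsByYearMonth(8), kProposedSmallLimit), "96 <= 100");
// One more year already exceeds a limit of 100.
static_assert(!FitsLimit(GroupsByYearMonth(9), kProposedSmallLimit), "108 > 100");
// A single year partitioned by day exceeds 100 as well.
static_assert(!FitsLimit(GroupsByDay(1), kProposedSmallLimit), "365 > 100");
// 8 years partitioned by day still fits under the int16_t maximum.
static_assert(FitsLimit(GroupsByDay(8), kInt16Limit), "2920 <= 32767");
```

Under these assumptions, a default near 100 is hit by fairly ordinary year + month layouts, while the `int16_t` maximum leaves room even for daily partitioning over several years.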




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
