westonpace commented on a change in pull request #11970:
URL: https://github.com/apache/arrow/pull/11970#discussion_r770166418
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,25 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets can improve performance when reading them, but it can
+have several potential costs when reading and writing:
+
+#. It can significantly increase the number of files to write. The number of
+   partitions is a floor for the number of files in a dataset. If you
+   partition a dataset by date with a year of data, you will have at least
+   365 files. If you further partition by another dimension with 1,000
+   unique values, you will have 365,000 files. This can make writing slower
+   and increase the overall size of the dataset, because each file has some
+   fixed overhead. For example, each file in a Parquet dataset contains the
+   schema.
+#. Multiple partitioning columns can produce deeply nested folder
+   structures, which are slow to navigate because they require many
+   recursive "list directory" calls to discover files. These operations can
+   be particularly expensive if you are using an object store filesystem
+   such as S3. One workaround is to combine multiple columns into one for
+   partitioning. For example, instead of a layout like /year/month/day/,
+   use /YYYY-MM-DD/.
+
+
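The second point above suggests collapsing several partition columns into a single one. A minimal sketch of what computing such a combined key could look like (the `partition_key` helper is hypothetical, not part of the Arrow API):

```python
# Hypothetical sketch: instead of partitioning on three columns
# (year, month, day), precompute one combined "date" column so the
# resulting layout is /YYYY-MM-DD/ rather than /year/month/day/.
import datetime

def partition_key(year: int, month: int, day: int) -> str:
    # One directory level instead of three nested ones; fewer
    # recursive "list directory" calls at read time.
    return datetime.date(year, month, day).isoformat()

print(partition_key(2021, 12, 16))  # -> 2021-12-16
```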
Review comment:
Probably a reasonable "rule of thumb" might be to structure files so
that each column of data is at least 4 MB in size. The threshold is somewhat
arbitrary when it comes to the data/metadata ratio, but 4 MB is also around
the point where an HDD's sequential-vs-random read tradeoff starts to fall
off. Although for boolean bitmaps the implied requirement of roughly 32
million rows per file can be a bit extreme / difficult to satisfy.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]