westonpace commented on a change in pull request #12112: URL: https://github.com/apache/arrow/pull/12112#discussion_r836829845
##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 skipping based on statistics, but very small groups can cause metadata to be a significant
 portion of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be
+important to optimize the writes, i.e., number of rows per file and
+number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If ``max_open_files`` is set greater than 0 then this will limit the maximum
+number of files that can be left open. If an attempt is made to open too many
+files then the least recently used file will be closed. If this setting is set
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open
+by the scanner before hitting the default Linux limit of 1024. Modify this value
+depending on the nature of write operations associated with the usage.
+

Review comment:
Sorry I missed this. This should help. Multithreading does make the write_dataset call "jittery", but not completely random, so this would help with the small-files problem, though you might still get one or two small files here and there.
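To make the fragmentation point concrete, here is a toy model of the least-recently-used closing that the docs text describes. It is only a sketch and not Arrow's writer: `count_files_written` and the round-robin batch order are hypothetical, and the model just counts how many files each partition would end up with when a closed file has to be replaced by a new one.

```python
from collections import OrderedDict


def count_files_written(partition_keys, max_open_files):
    """Toy model: count files created per partition under an open-file limit.

    Each entry in partition_keys stands for one incoming batch. A batch reuses
    its partition's file if that file is still open; otherwise a new file is
    created, closing the least recently used open file first when the limit
    has been reached. A limit of 0 or less means "no limit" in this sketch.
    """
    open_files = OrderedDict()  # partition key -> True, oldest use first
    files_created = {}
    for key in partition_keys:
        if key in open_files:
            open_files.move_to_end(key)  # mark as most recently used
            continue
        if max_open_files > 0 and len(open_files) >= max_open_files:
            open_files.popitem(last=False)  # close the least recently used file
        open_files[key] = True
        files_created[key] = files_created.get(key, 0) + 1
    return files_created


# Batches arrive round-robin across 4 partitions, for 3 rounds.
batches = ["a", "b", "c", "d"] * 3

roomy = count_files_written(batches, max_open_files=4)
tight = count_files_written(batches, max_open_files=2)

print(sum(roomy.values()))  # 4: one file per partition
print(sum(tight.values()))  # 12: every batch ends up in a new file
```

With the limit at or above the number of partitions being written, each partition keeps a single file. With a limit of 2 and this worst-case arrival order, every batch finds its file already closed, which is the "many small files" outcome the docs warn about. Real arrival order under multithreading is less regular than this, so actual results would fall somewhere in between.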