[GitHub] [arrow] wjones127 commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

GitBox Wed, 19 Jan 2022 14:07:13 -0800


wjones127 commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r788004898




##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when 
reading and allow data
 based on statistics, but very small groups can cause metadata to be a 
significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing 
in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is 
set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this 
value 
+depending on the nature of write operations associated with the usage. 
+
+Another important configuration used in `write_dataset` is 
``max_rows_per_file``. 
+
+Set the maximum number of files opened with the ``max_rows_per_files`` 
parameter of
+:meth:`write_dataset`.

Review comment:
       ```suggestion
   Set the maximum number of rows written in each file with the ``max_rows_per_file``
   parameter of :meth:`write_dataset`.
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when 
reading and allow data
 based on statistics, but very small groups can cause metadata to be a 
significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing 
in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is 
set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this 
value 
+depending on the nature of write operations associated with the usage. 
+

Review comment:
       @westonpace does my understanding below sound correct? I know it's a little complicated with multi-threading.
   
   ```suggestion
   
   To mitigate the many-small-files problem caused by this limit, you can 
   also sort your data by the partition columns (assuming it is not already
   sorted). This ensures that files are usually closed after all data for
   their respective partition has been written.
   
   ```
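
To illustrate why sorting helps, here is a toy simulation of a writer that keeps at most `max_open_files` files open and closes the least recently used one when it hits the limit. This is only a sketch of the idea, not Arrow's implementation; `count_files_written` is a hypothetical helper, and it assumes a single thread and one open file per partition.

```python
from collections import OrderedDict


def count_files_written(partition_keys, max_open_files):
    """Return how many files a toy LRU-limited writer would create.

    Each row is routed to the open file for its partition key. When the
    limit is reached, the least recently used file is closed, and a later
    row for that partition has to start a new file.
    """
    open_files = OrderedDict()  # partition key -> placeholder for a file
    files_created = 0
    for key in partition_keys:
        if key in open_files:
            # The file is already open: mark it as most recently used.
            open_files.move_to_end(key)
            continue
        if len(open_files) >= max_open_files:
            # Close the least recently used file to make room.
            open_files.popitem(last=False)
        open_files[key] = True
        files_created += 1
    return files_created


# Four partitions, rows arriving interleaved: 0, 1, 2, 3, 0, 1, 2, 3, ...
keys = [i % 4 for i in range(12)]

unsorted_files = count_files_written(keys, max_open_files=2)
sorted_files = count_files_written(sorted(keys), max_open_files=2)

print(f"unsorted input: {unsorted_files} files")
print(f"sorted input:   {sorted_files} files")
```

In this model the interleaved input reopens a partition on every row, while the sorted input creates one file per partition.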

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when 
reading and allow data
 based on statistics, but very small groups can cause metadata to be a 
significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing 
in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is 
set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this 
value 
+depending on the nature of write operations associated with the usage. 
+
+Another important configuration used in `write_dataset` is 
``max_rows_per_file``. 
+
+Set the maximum number of files opened with the ``max_rows_per_files`` 
parameter of
+:meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one 
+file will be created in each output directory unless files need to be closed 
to respect 
+``max_open_files``. 

Review comment:
       ```suggestion
   ``max_open_files``. The ``max_rows_per_file`` setting is the primary way to
   control file size. For workloads writing a lot of data, files can get very
   large without a row count cap, leading to out-of-memory errors in downstream
   readers. The relationship between row count and file size depends on the
   dataset schema and how well compressed (if at all) the data is. For most
   applications, it's best to keep file sizes below 1 GB.
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when 
reading and allow data
 based on statistics, but very small groups can cause metadata to be a 
significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing 
in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is 
set 

Review comment:
       We should probably mention that this setting applies to partitioned datasets:
   
   ```suggestion
   If ``max_open_files`` is set greater than 0 then this will limit the maximum
   number of files that can be left open. This only applies to writing partitioned
   datasets, where rows are dispatched to the appropriate file depending on their
   partition values. If an attempt is made to open too many files then the least
   recently used file will be closed.  If this setting is set 
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when 
reading and allow data
 based on statistics, but very small groups can cause metadata to be a 
significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing 
in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is 
set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this 
value 
+depending on the nature of write operations associated with the usage. 
+
+Another important configuration used in `write_dataset` is 
``max_rows_per_file``. 

Review comment:
       ```suggestion
   Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``.
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when 
reading and allow data
 based on statistics, but very small groups can cause metadata to be a 
significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing 
in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is 
set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this 
value 
+depending on the nature of write operations associated with the usage. 

Review comment:
       If we can, let's eliminate "Modify this value depending on the nature of 
write operations associated with the usage" and replace with more specific 
advice.
   
   ```suggestion
   If your process is concurrently using other file handles, either with a 
   dataset scanner or otherwise, you may hit a system file handle limit. For 
   example, if you are scanning a dataset with 300 files and writing out to
   900 files, the total of 1200 files may be over a system limit. (On Linux,
   this might be a "Too many open files" error.) You can either reduce the
   ``max_open_files`` setting or increase the file handle limit on your
   system. The default value is 900, which also allows some number of files
   to be open by the scanner before hitting the default Linux limit of 1024. 
   ```
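
For readers who want to check the limit on their own machine, here is a rough sketch using Python's standard `resource` module (Unix only). `writer_file_budget` and its `reserve` of 64 handles are hypothetical choices for illustration, not anything Arrow does.

```python
import resource


def writer_file_budget(soft_limit, reader_files, reserve=64):
    """Estimate how many files a writer could keep open.

    Subtracts the files held by a concurrent reader and a small reserve
    for other handles (stdio, sockets, libraries). Returns None when the
    process has no file handle limit.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return max(1, soft_limit - reader_files - reserve)


# The soft limit is the one enforced for this process; it can be raised
# up to the hard limit without extra privileges.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
budget = writer_file_budget(soft_limit, reader_files=300)

print("soft limit:", soft_limit, "hard limit:", hard_limit)
print("suggested max_open_files while scanning 300 files:", budget)
```

With a soft limit of 1024 and 300 files open for scanning, this gives a budget of 660, which is below the default of 900.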

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when 
reading and allow data
 based on statistics, but very small groups can cause metadata to be a 
significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing 
in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is 
set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this 
value 
+depending on the nature of write operations associated with the usage. 
+
+Another important configuration used in `write_dataset` is 
``max_rows_per_file``. 
+
+Set the maximum number of files opened with the ``max_rows_per_files`` 
parameter of
+:meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one 
+file will be created in each output directory unless files need to be closed 
to respect 
+``max_open_files``. 
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+

Review comment:
       A few points worth discussing:
   
    * Row groups matter for Parquet and Feather/IPC; they affect how data is seen by readers and, because of row group statistics, can affect file size.
    * Row groups are just the batch size for CSV / JSON; the readers aren't affected.
   
   My impression is that we have reasonable defaults for these values, and users generally won't want to set them. Can you think of examples where we would recommend users adjust these values?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
