jorisvandenbossche commented on a change in pull request #87:
URL: https://github.com/apache/arrow-cookbook/pull/87#discussion_r733598805



##########
File path: python/source/io.rst
##########
@@ -577,4 +577,121 @@ The content of the file can be read back to a 
:class:`pyarrow.Table` using
 
 .. testoutput::
 
-    {'a': [1, 3, 5, 7], 'b': [2.0, 3.0, 4.0, 5.0], 'c': [1, 2, 3, 4]}
\ No newline at end of file
+    {'a': [1, 3, 5, 7], 'b': [2.0, 3.0, 4.0, 5.0], 'c': [1, 2, 3, 4]}
+
+Writing Compressed Data
+=======================
+
+Arrow provides support for writing files in compressed format,
+both for formats that provide it natively like Parquet or Feather,
+and for formats that don't support it out of the box like CSV.
+
+Given a table:
+
+.. testcode::
+
+    table = pa.table([
+        pa.array([1, 2, 3, 4, 5])
+    ], names=["numbers"])
+
+Writing it compressed to Parquet or Feather requires passing the
+``compression`` argument to :func:`pyarrow.feather.write_feather` and
+:func:`pyarrow.parquet.write_table`:
+
+.. testcode::
+
+    pa.feather.write_feather(table, "compressed.feather",
+                             compression="lz4")
+    pa.parquet.write_table(table, "compressed.parquet",
+                           compression="lz4")
+
+You can refer to each function's documentation for the complete
+list of supported compression formats.
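+
+For example, ``zstd`` is one codec accepted by both writers (the file
+names below are just for illustration):
+
+.. testcode::
+
+    # zstd is supported by both the Feather and Parquet writers;
+    # the file names here are illustrative.
+    pa.feather.write_feather(table, "compressed_zstd.feather",
+                             compression="zstd")
+    pa.parquet.write_table(table, "compressed_zstd.parquet",
+                           compression="zstd")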
+
+.. note::
+
+    Arrow actually uses compression by default when writing
+    Parquet or Feather files: Feather files are compressed with
+    ``lz4`` by default, and Parquet files with ``snappy``.
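+
+If you need the files uncompressed instead, both functions accept an
+explicit value that opts out of the default (a minimal sketch; the
+file names are just for illustration):
+
+.. testcode::
+
+    # Explicitly disable the default compression.
+    pa.feather.write_feather(table, "plain.feather",
+                             compression="uncompressed")
+    pa.parquet.write_table(table, "plain.parquet",
+                           compression="none")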
+
+For formats that don't support compression natively, like CSV,
+it's possible to save compressed data using
+:class:`pyarrow.CompressedOutputStream`:
+
+.. testcode::
+
+    with pa.CompressedOutputStream("compressed.csv.gz", "gzip") as out:
+        pa.csv.write_csv(table, out)
+
+This requires decompressing the file when reading it back,
+which can be done using :class:`pyarrow.CompressedInputStream`
+as explained in the next recipe.
+
+Reading Compressed Data
+=======================
+
+Arrow provides support for reading compressed files,
+both for formats that provide it natively like Parquet or Feather,
+and for formats that don't support it out of the box like CSV.
+
+Reading formats that have native support for compression
+doesn't require any special handling. We can, for example, read back
+the Parquet and Feather files we wrote in the previous recipe
+simply by invoking :func:`pyarrow.feather.read_table` and
+:func:`pyarrow.parquet.read_table`:
+
+.. testcode::
+
+    table_feather = pa.feather.read_table("compressed.feather")
+    print(table_feather)
+
+.. testoutput::
+
+    pyarrow.Table
+    numbers: int64
+
+.. testcode::
+
+    table_parquet = pa.parquet.read_table("compressed.parquet")
+    print(table_parquet)
+
+.. testoutput::
+
+    pyarrow.Table
+    numbers: int64
+
+Reading data from formats that don't have native support for
+compression instead requires decompressing the data before it
+can be decoded. This can be done using the
+:class:`pyarrow.CompressedInputStream` class:
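+
+For example, the ``compressed.csv.gz`` file written in the previous
+recipe could be read back like this (a sketch mirroring the
+:class:`pyarrow.CompressedOutputStream` example above):
+
+.. testcode::
+
+    # Wrap the compressed file in a decompressing input stream
+    # and let the CSV reader decode the decompressed bytes.
+    with pa.CompressedInputStream("compressed.csv.gz", "gzip") as stream:
+        table_csv = pa.csv.read_csv(stream)
+    print(table_csv)
+
+.. testoutput::
+
+    pyarrow.Table
+    numbers: int64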

Review comment:
   Ah, I see you again have a note about that further below :)
   (but I would personally, again, not keep that in a note but put it in
   the actual main text: first showing the easy way, and then the advanced
   way as a second example in case you don't have the standard extension
   or so)
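
   Something like this, perhaps (a sketch; it relies on `pa.csv.read_csv`
   automatically decompressing paths with a recognized extension such as
   `.gz`):

   ```python
   # Easy way: read_csv decompresses transparently based on the
   # recognized ".gz" file extension.
   table_csv = pa.csv.read_csv("compressed.csv.gz")

   # Advanced way: wrap the file in a CompressedInputStream, e.g.
   # when the file doesn't have a standard extension.
   with pa.CompressedInputStream("compressed.csv.gz", "gzip") as stream:
       table_csv = pa.csv.read_csv(stream)
   ```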



