Repository: arrow
Updated Branches:
  refs/heads/master 4c3481ea5 -> e97fbe640
ARROW-531: Python: Document jemalloc, extend Pandas section, add Getting Involved

Author: Uwe L. Korn <uw...@xhochy.com>

Closes #321 from xhochy/ARROW-531 and squashes the following commits:

55da9dc [Uwe L. Korn] ARROW-531: Python: Document jemalloc, extend Pandas section, add Getting Involved


Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/e97fbe64
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/e97fbe64
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/e97fbe64

Branch: refs/heads/master
Commit: e97fbe6407e8b15c6d3ef745f7a728e01d499a23
Parents: 4c3481e
Author: Uwe L. Korn <uw...@xhochy.com>
Authored: Tue Feb 7 11:17:28 2017 -0500
Committer: Wes McKinney <wes.mckin...@twosigma.com>
Committed: Tue Feb 7 11:17:28 2017 -0500

----------------------------------------------------------------------
 python/doc/getting_involved.rst | 37 +++++++++++++++++++++++++
 python/doc/index.rst            |  2 ++
 python/doc/install.rst          |  5 ++--
 python/doc/jemalloc.rst         | 52 ++++++++++++++++++++++++++++++++++++
 python/doc/pandas.rst           |  8 +++++-
 python/doc/parquet.rst          | 47 ++++++++++++++++++++++++--------
 6 files changed, 137 insertions(+), 14 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/getting_involved.rst
----------------------------------------------------------------------
diff --git a/python/doc/getting_involved.rst b/python/doc/getting_involved.rst
new file mode 100644
index 0000000..90fa3e4
--- /dev/null
+++ b/python/doc/getting_involved.rst
@@ -0,0 +1,37 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Getting Involved
+================
+
+Right now the primary audience for Apache Arrow is developers of data
+systems; most people will use Apache Arrow indirectly through systems that use
+it for internal data handling and interoperating with other Arrow-enabled
+systems.
+
+Even if you do not plan to contribute to Apache Arrow itself or Arrow
+integrations in other projects, we'd be happy to have you involved:
+
+ * Join the mailing list: send an email to
+   `dev-subscr...@arrow.apache.org <mailto:dev-subscr...@arrow.apache.org>`_.
+   Share your ideas and use cases for the project or read through the
+   `Archive <http://mail-archives.apache.org/mod_mbox/arrow-dev/>`_.
+ * Follow our activity on `JIRA <https://issues.apache.org/jira/browse/ARROW>`_
+ * Learn the `Format / Specification
+   <https://github.com/apache/arrow/tree/master/format>`_
+ * Chat with us on `Slack <https://apachearrowslackin.herokuapp.com/>`_
+

http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/index.rst
----------------------------------------------------------------------
diff --git a/python/doc/index.rst b/python/doc/index.rst
index 6725ae7..d64354b 100644
--- a/python/doc/index.rst
+++ b/python/doc/index.rst
@@ -37,10 +37,12 @@ structures.
    Installing pyarrow <install.rst>
    Pandas <pandas.rst>
    Module Reference <modules.rst>
+   Getting Involved <getting_involved.rst>
 
 .. toctree::
    :maxdepth: 2
    :caption: Additional Features
 
    Parquet format <parquet.rst>
+   jemalloc MemoryPool <jemalloc.rst>
 

http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/install.rst
----------------------------------------------------------------------
diff --git a/python/doc/install.rst b/python/doc/install.rst
index 1bab017..4d99fa0 100644
--- a/python/doc/install.rst
+++ b/python/doc/install.rst
@@ -120,10 +120,11 @@ Install `pyarrow`
 
     cd arrow/python
 
-    # --with-parquet enable the Apache Parquet support in PyArrow
+    # --with-parquet enables the Apache Parquet support in PyArrow
+    # --with-jemalloc enables the jemalloc allocator support in PyArrow
     # --build-type=release disables debugging information and turns on
     # compiler optimizations for native code
-    python setup.py build_ext --with-parquet --build-type=release install
+    python setup.py build_ext --with-parquet --with-jemalloc --build-type=release install
     python setup.py install
 
 .. warning::

http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/jemalloc.rst
----------------------------------------------------------------------
diff --git a/python/doc/jemalloc.rst b/python/doc/jemalloc.rst
new file mode 100644
index 0000000..33fe617
--- /dev/null
+++ b/python/doc/jemalloc.rst
@@ -0,0 +1,52 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+jemalloc MemoryPool
+===================
+
+Arrow's default :class:`~pyarrow.memory.MemoryPool` uses the system's allocator
+through the POSIX APIs. Although this already provides aligned allocation, the
+POSIX interface doesn't support aligned reallocation. The default reallocation
+strategy is to allocate a new region, copy over the old data and free the
+previous region. Using `jemalloc <http://jemalloc.net/>`_ we can simply extend
+the existing memory allocation to the requested size. While this may still be
+linear in the size of allocated memory, it is orders of magnitude faster as
+only the page mapping in the kernel is touched, not the actual data.
+
+The :mod:`~pyarrow.jemalloc` allocator is not enabled by default to allow the
+use of the system allocator and/or other allocators like ``tcmalloc``. You can
+either explicitly make it the default allocator or pass it only to individual
+operations.
+
+.. code:: python
+
+    import pyarrow as pa
+    import pyarrow.jemalloc
+    import pyarrow.memory
+
+    jemalloc_pool = pyarrow.jemalloc.default_pool()
+
+    # Explicitly use jemalloc for allocating memory for an Arrow Array object
+    array = pa.Array.from_pylist([1, 2, 3], memory_pool=jemalloc_pool)
+
+    # Set the global pool
+    pyarrow.memory.set_default_pool(jemalloc_pool)
+    # This operation has no explicit MemoryPool specified and will thus also
+    # use jemalloc for its allocations.
+    array = pa.Array.from_pylist([1, 2, 3])
+

http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/pandas.rst
----------------------------------------------------------------------
diff --git a/python/doc/pandas.rst b/python/doc/pandas.rst
index c225d13..34445ae 100644
--- a/python/doc/pandas.rst
+++ b/python/doc/pandas.rst
@@ -84,9 +84,11 @@ Pandas -> Arrow Conversion
 +------------------------+--------------------------+
 | ``str`` / ``unicode``  | ``STRING``               |
 +------------------------+--------------------------+
+| ``pd.Categorical``     | ``DICTIONARY``           |
++------------------------+--------------------------+
 | ``pd.Timestamp``       | ``TIMESTAMP(unit=ns)``   |
 +------------------------+--------------------------+
-| ``pd.Categorical``     | *not supported*          |
+| ``datetime.date``      | ``DATE``                 |
 +------------------------+--------------------------+
 
 Arrow -> Pandas Conversion
@@ -109,5 +111,9 @@ Arrow -> Pandas Conversion
 +-------------------------------------+--------------------------------------------------------+
 | ``STRING``                          | ``str``                                                |
 +-------------------------------------+--------------------------------------------------------+
+| ``DICTIONARY``                      | ``pd.Categorical``                                     |
++-------------------------------------+--------------------------------------------------------+
 | ``TIMESTAMP(unit=*)``               | ``pd.Timestamp`` (``np.datetime64[ns]``)               |
 +-------------------------------------+--------------------------------------------------------+
+| ``DATE``                            | ``pd.Timestamp`` (``np.datetime64[ns]``)               |
++-------------------------------------+--------------------------------------------------------+

http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/parquet.rst
----------------------------------------------------------------------
diff --git a/python/doc/parquet.rst b/python/doc/parquet.rst
index 674ed80..8e011e4 100644
--- a/python/doc/parquet.rst
+++ b/python/doc/parquet.rst
@@ -29,16 +29,30 @@ Reading Parquet
 
 To read a Parquet file into Arrow memory, you can use the following code
 snippet. It will read the whole Parquet file into memory as an
-:class:`pyarrow.table.Table`.
+:class:`~pyarrow.table.Table`.
 
 .. code-block:: python
 
-    import pyarrow
-    import pyarrow.parquet
+    import pyarrow.parquet as pq
 
-    A = pyarrow
+    table = pq.read_table('<filename>')
 
-    table = A.parquet.read_table('<filename>')
+As DataFrames stored as Parquet are often split across multiple files, a
+convenience method :meth:`~pyarrow.parquet.read_multiple_files` is provided.
+
+If you already have the Parquet data available in memory or get it via a
+non-file source, you can utilize :class:`pyarrow.io.BufferReader` to read it
+from memory. As input to the :class:`~pyarrow.io.BufferReader` you can either
+supply a Python ``bytes`` object or a :class:`pyarrow.io.Buffer`.
+
+.. code:: python
+
+    import pyarrow.io as paio
+    import pyarrow.parquet as pq
+
+    buf = ... # either bytes or paio.Buffer
+    reader = paio.BufferReader(buf)
+    table = pq.read_table(reader)
 
 Writing Parquet
 ---------------
@@ -49,13 +63,11 @@ method.
 
 .. code-block:: python
 
-    import pyarrow
-    import pyarrow.parquet
-
-    A = pyarrow
+    import pyarrow as pa
+    import pyarrow.parquet as pq
 
-    table = A.Table(..)
-    A.parquet.write_table(table, '<filename>')
+    table = pa.Table(..)
+    pq.write_table(table, '<filename>')
 
 By default this will write the Table as a single RowGroup using ``DICTIONARY``
 encoding. To increase the potential of parallelism a query engine can process
@@ -64,3 +76,16 @@ a Parquet file, set the ``chunk_size`` to a fraction of the total number of rows
 If you also want to compress the columns, you can select a compression method
 using the ``compression`` argument. Typically, ``GZIP`` is the choice if you
 want to minimize size and ``SNAPPY`` for performance.
+
+Instead of writing to a file, you can also write to Python ``bytes`` by
+utilizing an :class:`pyarrow.io.InMemoryOutputStream`:
+
+.. code:: python
+
+    import pyarrow.io as paio
+    import pyarrow.parquet as pq
+
+    table = ...
+    output = paio.InMemoryOutputStream()
+    pq.write_table(table, output)
+    pybytes = output.get_result().to_pybytes()
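
The two ``parquet.rst`` snippets above compose into a file-free round trip: write a Table
to Parquet bytes with ``InMemoryOutputStream``, then read it back through a
``BufferReader``. The following is a minimal sketch under the ``pyarrow.io`` /
``pyarrow.parquet`` API shown in this change; the pandas ``DataFrame`` and the
``Table.from_pandas`` call are illustrative assumptions, not part of the diff.

.. code-block:: python

    import pandas as pd

    import pyarrow as pa
    import pyarrow.io as paio
    import pyarrow.parquet as pq

    # Build a small Table; the DataFrame here is only an example input.
    df = pd.DataFrame({'ints': [1, 2, 3], 'strs': ['a', 'b', 'c']})
    table = pa.Table.from_pandas(df)

    # Write the Table as Parquet into an in-memory output stream instead of a file.
    output = paio.InMemoryOutputStream()
    pq.write_table(table, output)
    pybytes = output.get_result().to_pybytes()

    # Read the Parquet bytes back without touching the filesystem.
    reader = paio.BufferReader(pybytes)
    roundtripped = pq.read_table(reader)
    assert roundtripped.to_pandas().equals(df)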