http://git-wip-us.apache.org/repos/asf/arrow-site/blob/3cd84682/build/blog/index.html ---------------------------------------------------------------------- diff --git a/build/blog/index.html b/build/blog/index.html new file mode 100644 index 0000000..cfddf8f --- /dev/null +++ b/build/blog/index.html @@ -0,0 +1,2172 @@ +<!DOCTYPE html> +<html lang="en-US"> + <head> + <meta charset="UTF-8"> + <title>Apache Arrow Homepage</title> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <meta name="generator" content="Jekyll v3.4.3"> + <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags --> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + + <link href="/css/main.css" rel="stylesheet"> + <link href="/css/syntax.css" rel="stylesheet"> + <script src="https://code.jquery.com/jquery-3.2.1.min.js" + integrity="sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4=" + crossorigin="anonymous"></script> + <script src="/assets/javascripts/bootstrap.min.js"></script> + + <!-- Global Site Tag (gtag.js) - Google Analytics --> +<script async src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script> +<script> + window.dataLayer = window.dataLayer || []; + function gtag(){dataLayer.push(arguments)}; + gtag('js', new Date()); + + gtag('config', 'UA-107500873-1'); +</script> + + + </head> + + +<body class="wrap"> + <div class="container"> + <nav class="navbar navbar-default"> + <div class="container-fluid"> + <div class="navbar-header"> + <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#arrow-navbar"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a 
class="navbar-brand" href="/">Apache Arrow™ </a> + </div> + + <!-- Collect the nav links, forms, and other content for toggling --> + <div class="collapse navbar-collapse" id="arrow-navbar"> + <ul class="nav navbar-nav"> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Project Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/install/">Install</a></li> + <li><a href="/blog/">Blog</a></li> + <li><a href="/release/">Releases</a></li> + <li><a href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a></li> + <li><a href="https://github.com/apache/arrow">Source Code</a></li> + <li><a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Mailing List</a></li> + <li><a href="https://apachearrowslackin.herokuapp.com">Slack Channel</a></li> + <li><a href="/committers/">Committers</a></li> + <li><a href="/powered_by/">Powered By</a></li> + </ul> + </li> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Specification<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/memory_layout.html">Memory Layout</a></li> + <li><a href="/docs/metadata.html">Metadata</a></li> + <li><a href="/docs/ipc.html">Messaging / IPC</a></li> + </ul> + </li> + + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Documentation<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/python">Python</a></li> + <li><a href="/docs/cpp">C++ API</a></li> + <li><a href="/docs/java">Java API</a></li> + <li><a href="/docs/c_glib">C GLib API</a></li> + <li><a href="/docs/js">Javascript API</a></li> + </ul> + </li> + <!-- <li><a href="/blog">Blog</a></li> --> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" 
data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">ASF Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="http://www.apache.org/">ASF Website</a></li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Donate</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </li> + </ul> + <a href="http://www.apache.org/"> + <img style="float:right;" src="/img/asf_logo.svg" width="120px"/> + </a> + </div><!-- /.navbar-collapse --> + </div> + </nav> + + + + +<h2>Project News and Blog</h2> +<hr> + + + + + <div class="container"> + <h2> + A Native Go Library for Apache Arrow + <a href="/blog/2018/03/22/go-code-donation/" class="permalink" title="Permalink">&para;</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 22 Mar 2018 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://github.com/pmc"><i class="fa fa-user"></i> The Apache Arrow PMC (pmc)</a> + </div> + </div> + </div> + <!-- + +--> + +<p>Since launching in early 2016, Apache Arrow has been growing fast. We have made +nine major releases through the efforts of over 120 distinct contributors. The +project’s scope has also expanded. We began by focusing on the development of +the standardized in-memory columnar data format, which now serves as a pillar +of the project. Since then, we have been growing into a more general +cross-language platform for in-memory data analysis through new additions to +the project like the <a href="http://arrow.apache.org/blog/2017/08/16/0.6.0-release/">Plasma shared memory object store</a>. 
A primary goal of +the project is to enable data system developers to process and move data fast.</p> + +<p>So far, we have officially developed native Arrow implementations in C++, Java, +and JavaScript. We have created binding layers for the C++ libraries in C +(using the GLib libraries) and Python. We have also seen efforts to develop +interfaces to the Arrow C++ libraries in Go, Lua, Ruby, and Rust. While binding +layers serve many purposes, there can be benefits to native implementations, +and so we’ve been keen to see future work on native implementations in growing +systems languages like Go and Rust.</p> + +<p>This past October, engineers <a href="https://github.com/stuartcarnie">Stuart Carnie</a>, <a href="https://github.com/nathanielc">Nathaniel Cook</a>, and +<a href="https://github.com/goller">Chris Goller</a>, employees of <a href="https://influxdata.com">InfluxData</a>, began developing a native Go +language implementation of the <a href="https://github.com/influxdata/arrow">Apache Arrow</a> in-memory columnar format for +use in Go-based database systems like InfluxDB. We are excited to announce that +InfluxData has <a href="https://www.businesswire.com/news/home/20180322005393/en/InfluxData-Announces-Language-Implementation-Contribution-Apache-Arrow">donated this native Go implementation to the Apache Arrow +project</a>, where it will continue to be developed. This work features +low-level integration with the Go runtime and native support for SIMD +instruction sets. We are looking forward to working more closely with the Go +community on solving in-memory analytics and data interoperability problems.</p> + +<div align="center"> +<img src="/img/native_go_implementation.png" alt="Apache Arrow implementations and bindings" width="60%" class="img-responsive" /> +</div> + +<p>One of the mantras in <a href="https://www.apache.org">The Apache Software Foundation</a> is “Community over +Code”. 
By building an open and collaborative development community across many +programming language ecosystems, we will be able to develop better and +longer-lived solutions to the systems problems faced by data developers.</p> + +<p>We are excited for what the future holds for the Apache Arrow project. Adding +first-class support for a popular systems programming language like Go is an +important step along the way. We welcome others from the Go community to get +involved in the project. We also welcome others who wish to explore building +Arrow support for other programming languages not yet represented. Learn more +at <a href="https://arrow.apache.org">https://arrow.apache.org</a> and join the mailing list +<a href="https://lists.apache.org/[email protected]">[email protected]</a>.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Apache Arrow 0.9.0 Release + <a href="/blog/2018/03/22/0.9.0-release/" class="permalink" title="Permalink">&para;</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 22 Mar 2018 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (wesm)</a> + </div> + </div> + </div> + <!-- + +--> + +<p>The Apache Arrow team is pleased to announce the 0.9.0 release. It is the +product of over 3 months of development and includes <a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.9.0"><strong>260 resolved +JIRAs</strong></a>.</p> + +<p>While we made some backwards-incompatible columnar binary format changes in +last December’s 0.8.0 release, the 0.9.0 release is backwards-compatible with +0.8.0. 
We will be working toward a 1.0.0 release this year, which will mark +longer-term binary stability for the Arrow columnar format and metadata.</p> + +<p>See the <a href="https://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your +platform. The <a href="https://arrow.apache.org/release/0.9.0.html">complete changelog</a> is also available.</p> + +<p>We discuss some highlights from the release and other project news in this +post. Compared with previous releases, which pushed more on new and expanded +features, this release focused more on bug fixes, compatibility, and +stability.</p> + +<h2 id="new-arrow-committers-and-pmc-members">New Arrow committers and PMC members</h2> + +<p>Since the last release, we have added 2 new Arrow committers: <a href="https://github.com/theneuralbit">Brian +Hulette</a> and <a href="https://github.com/robertnishihara">Robert Nishihara</a>. Additionally, <a href="https://github.com/cpcloud">Phillip Cloud</a> and +<a href="https://github.com/pcmoritz">Philipp Moritz</a> have been promoted from committer to PMC +members. Congratulations and thank you for your contributions!</p> + +<h2 id="plasma-object-store-improvements">Plasma Object Store Improvements</h2> + +<p>The Plasma Object Store now supports managing interprocess shared memory on +CUDA-enabled GPUs. 
We are excited to see more GPU-related functionality develop +in Apache Arrow, as this has become a key computing environment for scalable +machine learning.</p> + +<h2 id="python-improvements">Python Improvements</h2> + +<p><a href="https://github.com/pitrou">Antoine Pitrou</a> has joined the Python development efforts and helped +significantly this release with interoperability with built-in CPython data +structures and NumPy structured data types.</p> + +<ul> + <li>New experimental support for reading Apache ORC files</li> + <li><code class="highlighter-rouge">pyarrow.array</code> now accepts lists of tuples or Python dicts for creating +Arrow struct type arrays.</li> + <li>NumPy structured dtypes (which are row/record-oriented) can be directly +converted to Arrow struct (column-oriented) arrays</li> + <li>Python 3.6 <code class="highlighter-rouge">pathlib</code> objects for file paths are now accepted in many file +APIs, including for Parquet files</li> + <li>Arrow integer arrays with nulls can now be converted to NumPy object arrays +with <code class="highlighter-rouge">None</code> values</li> + <li>New <code class="highlighter-rouge">pyarrow.foreign_buffer</code> API for interacting with memory blocks located +at particular memory addresses</li> +</ul> + +<h2 id="java-improvements">Java Improvements</h2> + +<p>Java now fully supports the <code class="highlighter-rouge">FixedSizeBinary</code> data type.</p> + +<h2 id="javascript-improvements">JavaScript Improvements</h2> + +<p>The JavaScript library has been significantly refactored and expanded. We are +making separate Apache releases (most recently <code class="highlighter-rouge">JS-0.3.1</code>) for JavaScript, +which are being <a href="https://www.npmjs.com/package/apache-arrow">published to NPM</a>.</p> + +<h2 id="upcoming-roadmap">Upcoming Roadmap</h2> + +<p>In the coming months, we will be working to move Apache Arrow closer to a 1.0.0 +release. 
We will also be discussing plans to develop native Arrow-based +computational libraries within the project.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Apache Arrow 0.8.0 Release + <a href="/blog/2017/12/18/0.8.0-release/" class="permalink" title="Permalink">&para;</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 18 Dec 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (wesm)</a> + </div> + </div> + </div> + <!-- + +--> + +<p>The Apache Arrow team is pleased to announce the 0.8.0 release. It is the +product of 10 weeks of development and includes <a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.8.0"><strong>286 resolved JIRAs</strong></a> with +many new features and bug fixes to the various language implementations. This +is the largest release since 0.3.0 earlier this year.</p> + +<p>As part of work towards stabilizing the Arrow format and making a 1.0.0 +release sometime in 2018, we made a series of backwards-incompatible changes to +the serialized Arrow metadata that require Arrow readers and writers (0.7.1 +and earlier) to upgrade in order to be compatible with 0.8.0 and higher. We +expect future backwards-incompatible changes to be rare going forward.</p> + +<p>See the <a href="https://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your +platform. 
The <a href="https://arrow.apache.org/release/0.8.0.html">complete changelog</a> is also available.</p> + +<p>We discuss some highlights from the release and other project news in this +post.</p> + +<h2 id="projects-powered-by-apache-arrow">Projects “Powered By” Apache Arrow</h2> + +<p>A growing ecosystem of projects is using Arrow to solve in-memory analytics +and data interchange problems. We have added a new <a href="http://arrow.apache.org/powered_by/">Powered By</a> page to the +Arrow website where we can acknowledge open source projects and companies which +are using Arrow. If you would like to add your project to the list as an Arrow +user, please let us know.</p> + +<h2 id="new-arrow-committers">New Arrow committers</h2> + +<p>Since the last release, we have added 5 new Apache committers:</p> + +<ul> + <li><a href="https://github.com/cpcloud">Phillip Cloud</a>, who has mainly contributed to C++ and Python</li> + <li><a href="https://github.com/BryanCutler">Bryan Cutler</a>, who has mainly contributed to Java and Spark integration</li> + <li><a href="https://github.com/icexelloss">Li Jin</a>, who has mainly contributed to Java and Spark integration</li> + <li><a href="https://github.com/trxcllnt">Paul Taylor</a>, who has mainly contributed to JavaScript</li> + <li><a href="https://github.com/siddharthteotia">Siddharth Teotia</a>, who has mainly contributed to Java</li> +</ul> + +<p>Welcome to the Arrow team, and thank you for your contributions!</p> + +<h2 id="improved-java-vector-api-performance-improvements">Improved Java vector API, performance improvements</h2> + +<p>Siddharth Teotia led efforts to revamp the Java vector API to make things +simpler and faster. 
As part of this, we removed the dichotomy between nullable +and non-nullable vectors.</p> + +<p>See <a href="https://arrow.apache.org/blog/2017/12/19/java-vector-improvements/">Sidd’s blog post</a> for more about these changes.</p> + +<h2 id="decimal-support-in-c-python-consistency-with-java">Decimal support in C++, Python, consistency with Java</h2> + +<p><a href="https://github.com/cpcloud">Phillip Cloud</a> led efforts this release to harden details about exact +decimal values in the Arrow specification and ensure a consistent +implementation across Java, C++, and Python.</p> + +<p>Arrow now supports decimals represented internally as a 128-bit little-endian +integer, with a set precision and scale (as defined in many SQL-based +systems). As part of this work, we needed to change Java’s internal +representation from big- to little-endian.</p> + +<p>We are now integration testing decimals between Java, C++, and Python, which +will facilitate Arrow adoption in Apache Spark and other systems that use both +Java and Python.</p> + +<p>Decimal data can now be read and written by the <a href="https://github.com/apache/parquet-cpp">Apache Parquet C++ +library</a>, including via pyarrow.</p> + +<p>In the future, we may implement support for smaller-precision decimals +represented by 32- or 64-bit integers.</p> + +<h2 id="c-improvements-expanded-kernels-library-and-more">C++ improvements: expanded kernels library and more</h2> + +<p>In C++, we have continued developing the new <code class="highlighter-rouge">arrow::compute</code> submodule +consisting of native computation functions for Arrow data. New contributor +<a href="https://github.com/licht-t">Licht Takeuchi</a> helped expand the supported types for type casting in +<code class="highlighter-rouge">compute::Cast</code>. 
We have also implemented new kernels <code class="highlighter-rouge">Unique</code> and +<code class="highlighter-rouge">DictionaryEncode</code> for computing the distinct elements of an array and +dictionary encoding (conversion to categorical), respectively.</p> + +<p>We expect the C++ computation “kernel” library to be a major expansion area for +the project over the next year and beyond. Here, we can also implement SIMD- +and GPU-accelerated versions of basic in-memory analytics functionality.</p> + +<p>As a minor breaking API change in C++, we have made the <code class="highlighter-rouge">RecordBatch</code> and <code class="highlighter-rouge">Table</code> +APIs “virtual” or abstract interfaces, to enable different implementations of a +record batch or table which conform to the standard interface. This will help +enable features like lazy IO or column loading.</p> + +<p>There was significant work improving the C++ library generally and supporting +work happening in Python and C. See the change log for full details.</p> + +<h2 id="glib-c-improvements-meson-build-gpu-support">GLib C improvements: Meson build, GPU support</h2> + +<p>Development of the GLib-based C bindings has generally tracked work happening in +the C++ library. 
These bindings are being used to develop <a href="https://github.com/red-data-tools">data science tools +for Ruby users</a> and elsewhere.</p> + +<p>The C bindings now support the <a href="https://mesonbuild.com">Meson build system</a> in addition to +autotools, which enables them to be built on Windows.</p> + +<p>The Arrow GPU extension library is now also supported in the C bindings.</p> + +<h2 id="javascript-first-independent-release-on-npm">JavaScript: first independent release on NPM</h2> + +<p><a href="https://github.com/TheNeuralBit">Brian Hulette</a> and <a href="https://github.com/trxcllnt">Paul Taylor</a> have been continuing to drive efforts +on the TypeScript-based JavaScript implementation.</p> + +<p>Since the last release, we made a first JavaScript-only Apache release, version +0.2.0, which is <a href="http://npmjs.org/package/apache-arrow">now available on NPM</a>. We decided to make separate +JavaScript releases to enable the JS library to release more frequently than +the rest of the project.</p> + +<h2 id="python-improvements">Python improvements</h2> + +<p>In addition to some of the new features mentioned above, we have made a variety +of usability and performance improvements for integrations with pandas, NumPy, +Dask, and other Python projects which may make use of pyarrow, the Arrow Python +library.</p> + +<p>Some of these improvements include:</p> + +<ul> + <li><a href="http://arrow.apache.org/docs/python/ipc.html">Component-based serialization</a> for more flexible and memory-efficient +transport of large or complex Python objects</li> + <li>Substantially improved serialization performance for pandas objects when +using <code class="highlighter-rouge">pyarrow.serialize</code> and <code class="highlighter-rouge">pyarrow.deserialize</code>. 
This includes a special +<code class="highlighter-rouge">pyarrow.pandas_serialization_context</code> which further accelerates certain +internal details of pandas serialization</li> + <li>Support for zero-copy reads of <code class="highlighter-rouge">pandas.DataFrame</code> using <code class="highlighter-rouge">pyarrow.deserialize</code> for objects without Python +objects</li> + <li>Multithreaded conversions from <code class="highlighter-rouge">pandas.DataFrame</code> to <code class="highlighter-rouge">pyarrow.Table</code> (we +already supported multithreaded conversions from Arrow back to pandas)</li> + <li>More efficient conversion from 1-dimensional NumPy arrays to Arrow format</li> + <li>New generic buffer compression and decompression APIs <code class="highlighter-rouge">pyarrow.compress</code> and +<code class="highlighter-rouge">pyarrow.decompress</code></li> + <li>Enhanced Parquet cross-compatibility with <a href="https://github.com/dask/fastparquet">fastparquet</a> and improved Dask +support</li> + <li>Python support for accessing Parquet row group column statistics</li> +</ul> + +<h2 id="upcoming-roadmap">Upcoming Roadmap</h2> + +<p>The 0.8.0 release includes some API and format changes, but upcoming releases +will focus on completing and stabilizing critical functionality to move the +project closer to a 1.0.0 release.</p> + +<p>With the ecosystem of projects using Arrow expanding rapidly, we will be +working to improve and expand the libraries in support of downstream use cases.</p> + +<p>We continue to look for more JavaScript, Julia, R, Rust, and other programming +language developers to join the project and expand the available +implementations and bindings to more languages.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Improvements to Java Vector API in Apache Arrow 0.8.0 + <a href="/blog/2017/12/19/java-vector-improvements/" class="permalink" title="Permalink">&para;</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + 
<div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 18 Dec 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (Siddharth Teotia)</a> + </div> + </div> + </div> + <!-- + +--> + +<p>This post gives insight into the major improvements in the Java implementation +of vectors. We undertook this work over the last 10 weeks since the last Arrow +release.</p> + +<h2 id="design-goals">Design Goals</h2> + +<ol> + <li>Improved maintainability and extensibility</li> + <li>Improved heap memory usage</li> + <li>No performance overhead on hot code paths</li> +</ol> + +<h2 id="background">Background</h2> + +<h3 id="improved-maintainability-and-extensibility">Improved maintainability and extensibility</h3> + +<p>We use templates in several places for compile-time Java code generation for +different vector classes, readers, writers, etc. Templates are helpful as +developers don’t have to write a lot of duplicate code.</p> + +<p>However, we realized that over time some specific Java +templates became extremely complex, with giant if-else blocks, poor code indentation, +and poor documentation. All this impacted the ability to easily extend these templates +to add new functionality or improve the existing infrastructure.</p> + +<p>So we re-evaluated the use of templates for compile-time code generation and +decided not to use complex templates in some places, instead writing a small amount of +duplicate code that is elegant, well documented, and extensible.</p> + +<h3 id="improved-heap-usage">Improved heap usage</h3> + +<p>We did extensive memory analysis downstream in <a href="https://www.dremio.com/">Dremio</a>, where Arrow is used +heavily for in-memory query execution on columnar data. 
The general conclusion +was that Arrow’s Java vector classes have non-negligible heap overhead and +the volume of objects was too high. There were places in the code where we were +creating objects unnecessarily and using structures that could be substituted +with better alternatives.</p> + +<h3 id="no-performance-overhead-on-hot-code-paths">No performance overhead on hot code paths</h3> + +<p>Java vectors used delegation and abstraction heavily throughout the object +hierarchy. The performance-critical get/set methods of vectors went through a +chain of function calls back and forth between different objects before doing +meaningful work. We also evaluated the usage of branches in vector APIs and +reimplemented some of them by avoiding branches completely.</p> + +<p>We took inspiration from how the Java memory code in <code class="highlighter-rouge">ArrowBuf</code> works. For all +the performance-critical methods, <code class="highlighter-rouge">ArrowBuf</code> bypasses all the netty object +hierarchy, grabs the target virtual address and directly interacts with the +memory.</p> + +<p>There were cases where branches could be avoided altogether.</p> + +<p>In the case of nullable vectors, we were doing multiple checks to confirm if +the value at a given position in the vector is null or not.</p> + +<h2 id="our-implementation-approach">Our implementation approach</h2> + +<ul> + <li>For scalars, the inheritance tree was simplified by writing different +abstract base classes for fixed and variable width scalars.</li> + <li>The base classes contained all the common functionality across different +types.</li> + <li>The individual subclasses implemented type specific APIs for fixed and +variable width scalar vectors.</li> + <li>For the performance critical methods, all the work is done either in +the vector class or corresponding ArrowBuf. There is no delegation to any +internal object.</li> + <li>The mutator and accessor based access to vector APIs is removed. 
These +objects led to unnecessary heap overhead and complicated the use of APIs.</li> + <li>Both scalar and complex vectors directly interact with underlying buffers +that manage the offsets, data and validity. Earlier we were creating different +inner vectors for each vector and delegating all the functionality to inner +vectors. This introduced a lot of bugs in memory management, excessive heap +overhead, and a performance penalty due to the chain of delegations.</li> + <li>We reduced the number of vector classes by removing non-nullable vectors. +In the new implementation, all vectors in Java are nullable in nature.</li> +</ul> + + + </div> + + + + + + <div class="container"> + <h2> + Fast Python Serialization with Ray and Apache Arrow + <a href="/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/" class="permalink" title="Permalink">&para;</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 15 Oct 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (Philipp Moritz, Robert Nishihara)</a> + </div> + </div> + </div> + <!-- + +--> + +<p><em>This was originally posted on the <a href="https://ray-project.github.io/">Ray blog</a>. <a href="https://people.eecs.berkeley.edu/~pcmoritz/">Philipp Moritz</a> and <a href="http://www.robertnishihara.com">Robert Nishihara</a> are graduate students at UC Berkeley.</em></p> + +<p>This post elaborates on the integration between <a href="http://ray.readthedocs.io/en/latest/index.html">Ray</a> and <a href="https://arrow.apache.org/">Apache Arrow</a>. 
+The main problem this addresses is <a href="https://en.wikipedia.org/wiki/Serialization">data serialization</a>.</p> + +<p>From <a href="https://en.wikipedia.org/wiki/Serialization">Wikipedia</a>, <strong>serialization</strong> is</p> + +<blockquote> + <p>… the process of translating data structures or object state into a format +that can be stored … or transmitted … and reconstructed later (possibly +in a different computer environment).</p> +</blockquote> + +<p>Why is any translation necessary? Well, when you create a Python object, it may +have pointers to other Python objects, and these objects are all allocated in +different regions of memory, and all of this has to make sense when unpacked by +another process on another machine.</p> + +<p>Serialization and deserialization are <strong>bottlenecks in parallel and distributed +computing</strong>, especially in machine learning applications with large objects and +large quantities of data.</p> + +<h2 id="design-goals">Design Goals</h2> + +<p>As Ray is optimized for machine learning and AI applications, we have focused a +lot on serialization and data handling, with the following design goals:</p> + +<ol> + <li>It should be very efficient with <strong>large numerical data</strong> (this includes +NumPy arrays and Pandas DataFrames, as well as objects that recursively contain +NumPy arrays and Pandas DataFrames).</li> + <li>It should be about as fast as Pickle for <strong>general Python types</strong>.</li> + <li>It should be compatible with <strong>shared memory</strong>, allowing multiple processes +to use the same data without copying it.</li> + <li><strong>Deserialization</strong> should be extremely fast (when possible, it should not +require reading the entire serialized object).</li> + <li>It should be <strong>language independent</strong> (eventually we’d like to enable Python +workers to use objects created by workers in Java or other languages and vice +versa).</li> +</ol> + +<h2 
id="our-approach-and-alternatives">Our Approach and Alternatives</h2> + +<p>The go-to serialization approach in Python is the <strong>pickle</strong> module. Pickle is +very general, especially if you use variants like <a href="https://github.com/cloudpipe/cloudpickle/">cloudpickle</a>. However, it +does not satisfy requirements 1, 3, 4, or 5. Alternatives like <strong>json</strong> satisfy +5, but not 1-4.</p> + +<p><strong>Our Approach:</strong> To satisfy requirements 1-5, we chose to use the +<a href="https://arrow.apache.org/">Apache Arrow</a> format as our underlying data representation. In collaboration +with the Apache Arrow team, we built <a href="https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization">libraries</a> for mapping general Python +objects to and from the Arrow format. Some properties of this approach:</p> + +<ul> + <li>The data layout is language independent (requirement 5).</li> + <li>Offsets into a serialized data blob can be computed in constant time without +reading the full object (requirements 1 and 4).</li> + <li>Arrow supports <strong>zero-copy reads</strong>, so objects can naturally be stored in +shared memory and used by multiple processes (requirements 1 and 3).</li> + <li>We can naturally fall back to pickle for anything we can’t handle well +(requirement 2).</li> +</ul> + +<p><strong>Alternatives to Arrow:</strong> We could have built on top of +<a href="https://developers.google.com/protocol-buffers/"><strong>Protocol Buffers</strong></a>, but protocol buffers really isn’t designed for +numerical data, and that approach wouldn’t satisfy 1, 3, or 4. 
Building on top +of <a href="https://google.github.io/flatbuffers/"><strong>Flatbuffers</strong></a> could actually be made to work, but it would have +required implementing a lot of the facilities that Arrow already has, and we +preferred a columnar data layout more optimized for big data.</p> + +<h2 id="speedups">Speedups</h2> + +<p>Here we show some performance improvements over Python's pickle module. The +experiments were done using <code class="highlighter-rouge">pickle.HIGHEST_PROTOCOL</code>. Code for generating these +plots is included at the end of the post.</p> + +<p><strong>With NumPy arrays:</strong> In machine learning and AI applications, data (e.g., +images, neural network weights, text documents) are typically represented as +data structures containing NumPy arrays. When using NumPy arrays, the speedups +are impressive.</p> + +<p>The fact that the Ray bars for deserialization are barely visible is not a +mistake. This is a consequence of the support for zero-copy reads (the savings +largely come from the lack of memory movement).</p> + +<div align="center"> +<img src="/assets/fast_python_serialization_with_ray_and_arrow/speedups0.png" width="365" height="255" /> +<img src="/assets/fast_python_serialization_with_ray_and_arrow/speedups1.png" width="365" height="255" /> +</div> + +<p>Note that the biggest wins are with deserialization. The speedups here are +multiple orders of magnitude and get better as the NumPy arrays get larger +(thanks to design goals 1, 3, and 4). Making <strong>deserialization</strong> fast is +important for two reasons. First, an object may be serialized once and then +deserialized many times (e.g., an object that is broadcast to all workers). 
+Second, a common pattern is for many objects to be serialized in parallel and +then aggregated and deserialized one at a time on a single worker, making +deserialization the bottleneck.</p> + +<p><strong>Without NumPy arrays:</strong> When using regular Python objects, for which we +cannot take advantage of shared memory, the results are comparable to pickle.</p> + +<div align="center"> +<img src="/assets/fast_python_serialization_with_ray_and_arrow/speedups2.png" width="365" height="255" /> +<img src="/assets/fast_python_serialization_with_ray_and_arrow/speedups3.png" width="365" height="255" /> +</div> + +<p>These are just a few examples of interesting Python objects. The most important +case is the one in which NumPy arrays are nested within other objects. Note that +our serialization library works with very general Python types, including custom +Python classes and deeply nested objects.</p> + +<h2 id="the-api">The API</h2> + +<p>The serialization library can be used directly through pyarrow as follows. 
More +documentation is available <a href="https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization">here</a>.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="s">'hello'</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">])]</span> +<span class="n">serialized_x</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">to_buffer</span><span class="p">()</span> +<span class="n">deserialized_x</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">deserialize</span><span class="p">(</span><span class="n">serialized_x</span><span class="p">)</span> +</code></pre> +</div> + +<p>It can be used directly through the Ray API as follows.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="s">'hello'</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">])]</span> 
+<span class="n">x_id</span> <span class="o">=</span> <span class="n">ray</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> +<span class="n">deserialized_x</span> <span class="o">=</span> <span class="n">ray</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">x_id</span><span class="p">)</span> +</code></pre> +</div> + +<h2 id="data-representation">Data Representation</h2> + +<p>We use Apache Arrow as the underlying language-independent data layout. Objects +are stored in two parts: a <strong>schema</strong> and a <strong>data blob</strong>. At a high level, the +data blob is roughly a flattened concatenation of all of the data values +recursively contained in the object, and the schema defines the types and +nesting structure of the data blob.</p> + +<p><strong>Technical Details:</strong> Python sequences (e.g., dictionaries, lists, tuples, +sets) are encoded as Arrow <a href="http://arrow.apache.org/docs/memory_layout.html#dense-union-type">UnionArrays</a> of other types (e.g., bools, ints, +strings, bytes, floats, doubles, date64s, tensors (i.e., NumPy arrays), lists, +tuples, dicts and sets). Nested sequences are encoded using Arrow +<a href="http://arrow.apache.org/docs/memory_layout.html#list-type">ListArrays</a>. 
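</p>

<p>As a rough, pure-Python sketch of the offsets-plus-flat-values idea behind ListArrays (an illustration of the concept only, not the actual Arrow memory layout):</p>

```python
# Illustration only: encode a list of lists as one flat child array plus
# an offsets array, the way Arrow ListArrays do, then read an element
# back using only the offsets (no scan over the values).

def encode_list_array(lists):
    offsets = [0]
    values = []
    for sub in lists:
        values.extend(sub)
        offsets.append(len(values))
    return offsets, values

def get_element(offsets, values, i):
    # Constant-time access: only two offset lookups are needed.
    return values[offsets[i]:offsets[i + 1]]

offsets, values = encode_list_array([[1, 2], [], [3, 4, 5]])
print(offsets)                          # [0, 2, 2, 5]
print(get_element(offsets, values, 2))  # [3, 4, 5]
```

<p>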
All tensors are collected and appended to the end of the +serialized object, and the UnionArray contains references to these tensors.</p> + +<p>To give a concrete example, consider the following object.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="s">'hello'</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">])]</span> +</code></pre> +</div> + +<p>It would be represented in Arrow with the following structure.</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>UnionArray(type_ids=[tuple, string, int, int, ndarray], + tuples=ListArray(offsets=[0, 2], + UnionArray(type_ids=[int, int], + ints=[1, 2])), + strings=['hello'], + ints=[3, 4], + ndarrays=[<offset of numpy array>]) +</code></pre> +</div> + +<p>Arrow uses Flatbuffers to encode serialized schemas. <strong>Using only the schema, we +can compute the offsets of each value in the data blob without scanning through +the data blob</strong> (unlike Pickle, this is what enables fast deserialization). This +means that we can avoid copying or otherwise converting large arrays and other +values during deserialization. Tensors are appended at the end of the UnionArray +and can be efficiently shared and accessed using shared memory.</p> + +<p>Note that the actual object would be laid out in memory as shown below.</p> + +<div align="center"> +<img src="/assets/fast_python_serialization_with_ray_and_arrow/python_object.png" width="600" /> +</div> +<div><i>The layout of a Python object in the heap. 
Each box is allocated in a +different memory region, and arrows between boxes represent pointers.</i></div> +<p><br /></p> + +<p>The Arrow serialized representation would be as follows.</p> + +<div align="center"> +<img src="/assets/fast_python_serialization_with_ray_and_arrow/arrow_object.png" width="400" /> +</div> +<div><i>The memory layout of the Arrow-serialized object.</i></div> +<p><br /></p> + +<h2 id="getting-involved">Getting Involved</h2> + +<p>We welcome contributions, especially in the following areas.</p> + +<ul> + <li>Use the C++ and Java implementations of Arrow to implement versions of this +for C++ and Java.</li> + <li>Implement support for more Python types and better test coverage.</li> +</ul> + +<h2 id="reproducing-the-figures-above">Reproducing the Figures Above</h2> + +<p>For reference, the figures can be reproduced with the following code. +Benchmarking <code class="highlighter-rouge">ray.put</code> and <code class="highlighter-rouge">ray.get</code> instead of <code class="highlighter-rouge">pyarrow.serialize</code> and +<code class="highlighter-rouge">pyarrow.deserialize</code> gives similar figures. 
The plots were generated at this +<a href="https://github.com/apache/arrow/tree/894f7400977693b4e0e8f4b9845fd89481f6bf29">commit</a>.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pickle</span> +<span class="kn">import</span> <span class="nn">pyarrow</span> +<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span> +<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span> +<span class="kn">import</span> <span class="nn">timeit</span> + + +<span class="k">def</span> <span class="nf">benchmark_object</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">number</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span> + <span class="c"># Time serialization and deserialization for pickle.</span> + <span class="n">pickle_serialize</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span> + <span class="k">lambda</span><span class="p">:</span> <span class="n">pickle</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">protocol</span><span class="o">=</span><span class="n">pickle</span><span class="o">.</span><span class="n">HIGHEST_PROTOCOL</span><span class="p">),</span> + <span class="n">number</span><span class="o">=</span><span class="n">number</span><span class="p">)</span> + <span class="n">serialized_obj</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">pickle</span><span class="o">.</span><span class="n">HIGHEST_PROTOCOL</span><span class="p">)</span> + <span 
class="n">pickle_deserialize</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span><span class="k">lambda</span><span class="p">:</span> <span class="n">pickle</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">serialized_obj</span><span class="p">),</span> + <span class="n">number</span><span class="o">=</span><span class="n">number</span><span class="p">)</span> + + <span class="c"># Time serialization and deserialization for Ray.</span> + <span class="n">ray_serialize</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span> + <span class="k">lambda</span><span class="p">:</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="n">obj</span><span class="p">)</span><span class="o">.</span><span class="n">to_buffer</span><span class="p">(),</span> <span class="n">number</span><span class="o">=</span><span class="n">number</span><span class="p">)</span> + <span class="n">serialized_obj</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="n">obj</span><span class="p">)</span><span class="o">.</span><span class="n">to_buffer</span><span class="p">()</span> + <span class="n">ray_deserialize</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span> + <span class="k">lambda</span><span class="p">:</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">deserialize</span><span class="p">(</span><span class="n">serialized_obj</span><span class="p">),</span> <span class="n">number</span><span class="o">=</span><span class="n">number</span><span class="p">)</span> + + <span 
class="k">return</span> <span class="p">[[</span><span class="n">pickle_serialize</span><span class="p">,</span> <span class="n">pickle_deserialize</span><span class="p">],</span> + <span class="p">[</span><span class="n">ray_serialize</span><span class="p">,</span> <span class="n">ray_deserialize</span><span class="p">]]</span> + + +<span class="k">def</span> <span class="nf">plot</span><span class="p">(</span><span class="n">pickle_times</span><span class="p">,</span> <span class="n">ray_times</span><span class="p">,</span> <span class="n">title</span><span class="p">,</span> <span class="n">i</span><span class="p">):</span> + <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">()</span> + <span class="n">fig</span><span class="o">.</span><span class="n">set_size_inches</span><span class="p">(</span><span class="mf">3.8</span><span class="p">,</span> <span class="mf">2.7</span><span class="p">)</span> + + <span class="n">bar_width</span> <span class="o">=</span> <span class="mf">0.35</span> + <span class="n">index</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> + <span class="n">opacity</span> <span class="o">=</span> <span class="mf">0.6</span> + + <span class="n">plt</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">index</span><span class="p">,</span> <span class="n">pickle_times</span><span class="p">,</span> <span class="n">bar_width</span><span class="p">,</span> + <span class="n">alpha</span><span class="o">=</span><span class="n">opacity</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span 
class="s">'Pickle'</span><span class="p">)</span> + + <span class="n">plt</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">index</span> <span class="o">+</span> <span class="n">bar_width</span><span class="p">,</span> <span class="n">ray_times</span><span class="p">,</span> <span class="n">bar_width</span><span class="p">,</span> + <span class="n">alpha</span><span class="o">=</span><span class="n">opacity</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Ray'</span><span class="p">)</span> + + <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="n">title</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span> + <span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Time (seconds)'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> + <span class="n">labels</span> <span class="o">=</span> <span class="p">[</span><span class="s">'serialization'</span><span class="p">,</span> <span class="s">'deserialization'</span><span class="p">]</span> + <span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">index</span> <span class="o">+</span> <span class="n">bar_width</span> <span class="o">/</span> <span class="mi">2</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> + <span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span 
class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> + <span class="n">plt</span><span class="o">.</span><span class="n">tight_layout</span><span class="p">()</span> + <span class="n">plt</span><span class="o">.</span><span class="n">yticks</span><span class="p">(</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> + <span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'plot-'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">+</span> <span class="s">'.png'</span><span class="p">,</span> <span class="n">format</span><span class="o">=</span><span class="s">'png'</span><span class="p">)</span> + + +<span class="n">test_objects</span> <span class="o">=</span> <span class="p">[</span> + <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">50000</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">)],</span> + <span class="p">{</span><span class="s">'weight-'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">):</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">50000</span><span class="p">)</span> <span class="k">for</span> <span 
class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">)},</span> + <span class="p">{</span><span class="n">i</span><span class="p">:</span> <span class="nb">set</span><span class="p">([</span><span class="s">'string1'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">),</span> <span class="s">'string2'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100000</span><span class="p">)},</span> + <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">200000</span><span class="p">)]</span> +<span class="p">]</span> + +<span class="n">titles</span> <span class="o">=</span> <span class="p">[</span> + <span class="s">'List of large numpy arrays'</span><span class="p">,</span> + <span class="s">'Dictionary of large numpy arrays'</span><span class="p">,</span> + <span class="s">'Large dictionary of small sets'</span><span class="p">,</span> + <span class="s">'Large list of strings'</span> +<span class="p">]</span> + +<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">test_objects</span><span class="p">)):</span> + <span class="n">plot</span><span class="p">(</span><span class="o">*</span><span class="n">benchmark_object</span><span class="p">(</span><span class="n">test_objects</span><span class="p">[</span><span 
class="n">i</span><span class="p">]),</span> <span class="n">titles</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">i</span><span class="p">)</span> +</code></pre> +</div> + + + </div> + + + + + + <div class="container"> + <h2> + Apache Arrow 0.7.0 Release + <a href="/blog/2017/09/19/0.7.0-release/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 19 Sep 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (wesm)</a> + </div> + </div> + </div> + <!-- + +--> + +<p>The Apache Arrow team is pleased to announce the 0.7.0 release. It includes +<a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.7.0"><strong>133 resolved JIRAs</strong></a> with many new features and bug fixes to the various +language implementations. The Arrow memory format remains stable since the +0.3.x release.</p> + +<p>See the <a href="http://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your +platform. The <a href="http://arrow.apache.org/release/0.7.0.html">complete changelog</a> is also available.</p> + +<p>We include some highlights from the release in this post.</p> + +<h2 id="new-pmc-member-kouhei-sutou">New PMC Member: Kouhei Sutou</h2> + +<p>Since the last release we have added <a href="https://github.com/kou">Kou</a> to the Arrow Project Management +Committee. 
He is also a PMC member of Apache Subversion, and a major contributor to +many other open source projects.</p> + +<p>As an active member of the Ruby community in Japan, Kou has been developing the +GLib-based C bindings for Arrow with associated Ruby wrappers, to enable Ruby +users to benefit from the work that's happening in Apache Arrow.</p> + +<p>We are excited to be collaborating with the Ruby community on shared +infrastructure for in-memory analytics and data science.</p> + +<h2 id="expanded-javascript-typescript-implementation">Expanded JavaScript (TypeScript) Implementation</h2> + +<p><a href="https://github.com/trxcllnt">Paul Taylor</a> from the <a href="https://github.com/netflix/falcor">Falcor</a> and <a href="http://reactivex.io">ReactiveX</a> projects has worked to +expand the JavaScript implementation (which is written in TypeScript), using +the latest in modern JavaScript build and packaging technology. We are looking +forward to building out the JS implementation and bringing it up to full +functionality with the C++ and Java implementations.</p> + +<p>We are looking for more JavaScript developers to join the project and work +together to make Arrow for JS work well with many kinds of front-end use cases, +like real-time data visualization.</p> + +<h2 id="type-casting-for-c-and-python">Type casting for C++ and Python</h2> + +<p>As part of longer-term efforts to build an Arrow-native in-memory analytics +library, we implemented a variety of type conversion functions. These functions +are essential in ETL tasks when conforming one table schema to another. 
These +are similar to the <code class="highlighter-rouge">astype</code> function in NumPy.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">In</span> <span class="p">[</span><span class="mi">17</span><span class="p">]:</span> <span class="kn">import</span> <span class="nn">pyarrow</span> <span class="kn">as</span> <span class="nn">pa</span> + +<span class="n">In</span> <span class="p">[</span><span class="mi">18</span><span class="p">]:</span> <span class="n">arr</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="bp">True</span><span class="p">,</span> <span class="bp">False</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">True</span><span class="p">])</span> + +<span class="n">In</span> <span class="p">[</span><span class="mi">19</span><span class="p">]:</span> <span class="n">arr</span> +<span class="n">Out</span><span class="p">[</span><span class="mi">19</span><span class="p">]:</span> +<span class="o"><</span><span class="n">pyarrow</span><span class="o">.</span><span class="n">lib</span><span class="o">.</span><span class="n">BooleanArray</span> <span class="nb">object</span> <span class="n">at</span> <span class="mh">0x7ff6fb069b88</span><span class="o">></span> +<span class="p">[</span> + <span class="bp">True</span><span class="p">,</span> + <span class="bp">False</span><span class="p">,</span> + <span class="n">NA</span><span class="p">,</span> + <span class="bp">True</span> +<span class="p">]</span> + +<span class="n">In</span> <span class="p">[</span><span class="mi">20</span><span class="p">]:</span> <span class="n">arr</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">pa</span><span class="o">.</span><span class="n">int32</span><span class="p">())</span> +<span class="n">Out</span><span class="p">[</span><span 
class="mi">20</span><span class="p">]:</span> +<span class="o"><</span><span class="n">pyarrow</span><span class="o">.</span><span class="n">lib</span><span class="o">.</span><span class="n">Int32Array</span> <span class="nb">object</span> <span class="n">at</span> <span class="mh">0x7ff6fb0383b8</span><span class="o">></span> +<span class="p">[</span> + <span class="mi">1</span><span class="p">,</span> + <span class="mi">0</span><span class="p">,</span> + <span class="n">NA</span><span class="p">,</span> + <span class="mi">1</span> +<span class="p">]</span> +</code></pre> +</div> + +<p>Over time these will expand to support more input and output type +combinations, with optimized conversions.</p> + +<h2 id="new-arrow-gpu-cuda-extension-library-for-c">New Arrow GPU (CUDA) Extension Library for C++</h2> + +<p>To help with GPU-related projects using Arrow, like the <a href="http://gpuopenanalytics.com/">GPU Open Analytics +Initiative</a>, we have started a C++ add-on library to simplify Arrow memory +management on CUDA-enabled graphics cards. 
We would like to expand this to +include a library of reusable CUDA kernel functions for GPU analytics on Arrow +columnar memory.</p> + +<p>For example, we could write a record batch from CPU memory to GPU device memory +like so (some error checking omitted):</p> + +<div class="language-c++ highlighter-rouge"><pre class="highlight"><code><span class="cp">#include <arrow/api.h> +#include <arrow/gpu/cuda_api.h> +</span> +<span class="k">using</span> <span class="k">namespace</span> <span class="n">arrow</span><span class="p">;</span> + +<span class="n">gpu</span><span class="o">::</span><span class="n">CudaDeviceManager</span><span class="o">*</span> <span class="n">manager</span><span class="p">;</span> +<span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">gpu</span><span class="o">::</span><span class="n">CudaContext</span><span class="o">></span> <span class="n">context</span><span class="p">;</span> + +<span class="n">gpu</span><span class="o">::</span><span class="n">CudaDeviceManager</span><span class="o">::</span><span class="n">GetInstance</span><span class="p">(</span><span class="o">&</span><span class="n">manager</span><span class="p">);</span> +<span class="n">manager</span><span class="o">-></span><span class="n">GetContext</span><span class="p">(</span><span class="n">kGpuNumber</span><span class="p">,</span> <span class="o">&</span><span class="n">context</span><span class="p">);</span> + +<span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">RecordBatch</span><span class="o">></span> <span class="n">batch</span> <span class="o">=</span> <span class="n">GetCpuData</span><span class="p">();</span> + +<span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">gpu</span><span class="o">::</span><span class="n">CudaBuffer</span><span class="o">></span> 
<span class="n">device_serialized</span><span class="p">;</span> +<span class="n">gpu</span><span class="o">::</span><span class="n">SerializeRecordBatch</span><span class="p">(</span><span class="o">*</span><span class="n">batch</span><span class="p">,</span> <span class="n">context</span><span class="p">.</span><span class="n">get</span><span class="p">(),</span> <span class="o">&</span><span class="n">device_serialized</span><span class="p">);</span> +</code></pre> +</div> + +<p>We can then “read” the GPU record batch, but the returned <code class="highlighter-rouge">arrow::RecordBatch</code> +internally will contain GPU device pointers that you can use for CUDA kernel +calls:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>std::shared_ptr<RecordBatch> device_batch; +gpu::ReadRecordBatch(batch->schema(), device_serialized, + default_memory_pool(), &device_batch); + +// Now run some CUDA kernels on device_batch +</code></pre> +</div> + +<h2 id="decimal-integration-tests">Decimal Integration Tests</h2> + +<p><a href="http://github.com/cpcloud">Phillip Cloud</a> has been working on decimal support in C++ to enable Parquet +read/write support in C++ and Python, and also end-to-end testing against the +Arrow Java libraries.</p> + +<p>In the upcoming releases, we hope to complete the remaining data types that +need end-to-end testing between Java and C++:</p> + +<ul> + <li>Fixed-size lists (variable-size lists already implemented)</li> + <li>Fixed-size binary</li> + <li>Unions</li> + <li>Maps</li> + <li>Time intervals</li> +</ul> + +<h2 id="other-notable-python-changes">Other Notable Python Changes</h2> + +<p>Some highlights of Python development outside of bug fixes and general API +improvements include:</p> + +<ul> + <li>Simplified <code class="highlighter-rouge">put</code> and <code class="highlighter-rouge">get</code> of arbitrary Python objects in Plasma</li> + <li><a href="http://arrow.apache.org/docs/python/ipc.html">High-speed, 
memory-efficient object serialization</a>. This is important +enough that we will likely write a dedicated blog post about it.</li> + <li>New <code class="highlighter-rouge">flavor='spark'</code> option to <code class="highlighter-rouge">pyarrow.parquet.write_table</code> to enable easy +writing of Parquet files maximized for Spark compatibility</li> + <li><code class="highlighter-rouge">parquet.write_to_dataset</code> function with support for partitioned writes</li> + <li>Improved support for Dask filesystems</li> + <li>Improved Python usability for IPC: read and write schemas and record batches +more easily. See the <a href="http://arrow.apache.org/docs/python/api.html">API docs</a> for more about these.</li> +</ul> + +<h2 id="the-road-ahead">The Road Ahead</h2> + +<p>Upcoming Arrow releases will continue to expand the project to cover more use +cases. In addition to completing end-to-end testing for all the major data +types, some of us will be shifting attention to building Arrow-native in-memory +analytics libraries.</p> + +<p>We are looking for more JavaScript, R, and other programming language +developers to join the project and expand the available implementations and +bindings to more languages.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Apache Arrow 0.6.0 Release + <a href="/blog/2017/08/16/0.6.0-release/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 16 Aug 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (wesm)</a> + </div> + </div> + </div> + <!-- + +--> + +<p>The Apache Arrow team is pleased to announce the 0.6.0 release. 
It includes +<a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.6.0"><strong>90 resolved JIRAs</strong></a>, including the new Plasma shared memory object store, along with +improvements and bug fixes to the various language implementations. The Arrow +memory format has remained stable since the 0.3.x release.</p> + +<p>See the <a href="http://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your +platform. The <a href="http://arrow.apache.org/release/0.6.0.html">complete changelog</a> is also available.</p> + +<h2 id="plasma-shared-memory-object-store">Plasma Shared Memory Object Store</h2> + +<p>This release includes the <a href="http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/">Plasma Store</a>, which you can read more about in +the linked blog post. This system was originally developed as part of the <a href="https://ray-project.github.io/ray/">Ray +Project</a> at the <a href="https://rise.cs.berkeley.edu/">UC Berkeley RISELab</a>. We recognized that Plasma would be +highly valuable to the Arrow community as a tool for shared memory management +and zero-copy deserialization. Additionally, we believe we will be able to +develop a stronger software stack through sharing of IO and buffer management +code.</p> + +<p>The Plasma store is a server application which runs as a separate process. A +reference C++ client, with Python bindings, is made available in this +release. Clients can be developed in Java or other languages in the future to +enable simple sharing of complex datasets through shared memory.</p> + +<h2 id="arrow-format-addition-map-type">Arrow Format Addition: Map type</h2> + +<p>We added a Map logical type to represent ordered and unordered maps +in-memory.
This corresponds to the <code class="highlighter-rouge">MAP</code> logical type annotation in the Parquet +format (where maps are represented as repeated structs).</p> + +<p>Map is represented as a list of structs. It is the first example of a logical +type whose physical representation is a nested type. We have not yet created +Map container implementations in any of the languages, but this can +be done in a future release.</p> + +<p>As an example, the Python data:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>data = [{'a': 1, 'bb': 2, 'cc': 3}, {'dddd': 4}] +</code></pre> +</div> + +<p>Could be represented in an Arrow <code class="highlighter-rouge">Map<String, Int32></code> as:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>Map<String, Int32> = List<Struct<keys: String, values: Int32>> + is_valid: [true, true] + offsets: [0, 3, 4] + values: Struct<keys: String, values: Int32> + children: + - keys: String + is_valid: [true, true, true, true] + offsets: [0, 1, 3, 5, 9] + data: abbccdddd + - values: Int32 + is_valid: [true, true, true, true] + data: [1, 2, 3, 4] +</code></pre> +</div> +<h2 id="python-changes">Python Changes</h2> + +<p>Some highlights of Python development outside of bug fixes and general API +improvements include:</p> + +<ul> + <li>New <code class="highlighter-rouge">strings_to_categorical=True</code> option when calling <code class="highlighter-rouge">Table.to_pandas</code> will +yield pandas <code class="highlighter-rouge">Categorical</code> types from Arrow binary and string columns</li> + <li>Expanded Hadoop Filesystem (HDFS) functionality to improve compatibility with +Dask and other HDFS-aware Python libraries.</li> + <li>s3fs and other Dask-oriented filesystems can now be used with +<code class="highlighter-rouge">pyarrow.parquet.ParquetDataset</code></li> + <li>More graceful handling of pandas’s nanosecond timestamps when writing to +Parquet format.
You can now pass <code class="highlighter-rouge">coerce_timestamps='ms'</code> to cast to +milliseconds, or <code class="highlighter-rouge">'us'</code> for microseconds.</li> +</ul> + +<h2 id="toward-arrow-100-and-beyond">Toward Arrow 1.0.0 and Beyond</h2> + +<p>We are still discussing the roadmap to the 1.0.0 release on the <a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">developer mailing +list</a>. The focus of the 1.0.0 release will likely be memory format stability +and hardening integration tests across the remaining data types implemented in +Java and C++. Please join the discussion there.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Plasma In-Memory Object Store + <a href="/blog/2017/08/08/plasma-in-memory-object-store/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 08 Aug 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (Philipp Moritz and Robert Nishihara)</a> + </div> + </div> + </div> + <!-- + +--> + +<p><em><a href="https://people.eecs.berkeley.edu/~pcmoritz/">Philipp Moritz</a> and <a href="http://www.robertnishihara.com">Robert Nishihara</a> are graduate students at UC + Berkeley.</em></p> + +<h2 id="plasma-a-high-performance-shared-memory-object-store">Plasma: A High-Performance Shared-Memory Object Store</h2> + +<h3 id="motivating-plasma">Motivating Plasma</h3> + +<p>This blog post presents Plasma, an in-memory object store that is being +developed as part of Apache Arrow.
<strong>Plasma holds immutable +objects in shared +memory so that they can be accessed efficiently by many clients across process +boundaries.</strong> In light of the trend toward larger and larger multicore machines, +Plasma enables critical performance optimizations in the big data regime.</p> + +<p>Plasma was initially developed as part of <a href="https://github.com/ray-project/ray">Ray</a>, and has recently been moved +to Apache Arrow in the hopes that it will be broadly useful.</p> + +<p>One of the goals of Apache Arrow is to serve as a common data layer enabling +zero-copy data exchange between multiple frameworks. A key component of this +vision is the use of off-heap memory management (via Plasma) for storing and +sharing Arrow-serialized objects between applications.</p> + +<p><strong>Expensive serialization and deserialization as well as data copying are a +common performance bottleneck in distributed computing.</strong> For example, a +Python-based execution framework that wishes to distribute computation across +multiple Python “worker” processes and then aggregate the results in a single +“driver” process may choose to serialize data using the built-in <code class="highlighter-rouge">pickle</code> +library. Assuming one Python process per core, each worker process would have to +copy and deserialize the data, resulting in excessive memory usage. The driver +process would then have to deserialize results from each of the workers, +resulting in a bottleneck.</p> + +<p>Using Plasma plus Arrow, the data being operated on would be placed in the +Plasma store once, and all of the workers would read the data without copying or +deserializing it (the workers would map the relevant region of memory into their +own address spaces).
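</p>

<p>As a rough sketch of this shared-memory pattern, here is an illustration using Python’s standard-library <code class="highlighter-rouge">multiprocessing.shared_memory</code> as a stand-in for the Plasma client API (an assumption made for illustration only; this is not Plasma itself):</p>

```python
from multiprocessing import shared_memory

# "Driver" process: write the dataset into shared memory exactly once.
payload = bytes(range(10))
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# "Worker" process: attach to the same segment by name and read it as a
# zero-copy view -- no pickling, no per-worker copy of the data.
worker = shared_memory.SharedMemory(name=shm.name)
view = worker.buf[:len(payload)]
result = sum(view)  # operate directly on the shared bytes

# Release views before closing, then free the segment.
view.release()
worker.close()
shm.close()
shm.unlink()
```

<p>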
The workers would then put the results of their computation +back into the Plasma store, which the driver could then read and aggregate +without copying or deserializing the data.</p> + +<h3 id="the-plasma-api">The Plasma API:</h3> + +<p>Below we illustrate a subset of the API. The C++ API is documented more fully +<a href="https://github.com/apache/arrow/blob/master/cpp/apidoc/tutorials/plasma.md">here</a>, and the Python API is documented <a href="https://github.com/apache/arrow/blob/master/python/doc/source/plasma.rst">here</a>.</p> + +<p><strong>Object IDs:</strong> Each object is associated with a string of bytes.</p> + +<p><strong>Creating an object:</strong> Objects are stored in Plasma in two stages. First, the +object store <em>creates</em> the object by allocating a buffer for it. At this point, +the client can write to the buffer and construct the object within the allocated +buffer. When the client is done, the client <em>seals</em> the buffer making the object +immutable and making it available to other Plasma clients.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Create an object.</span> +<span class="n">object_id</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">plasma</span><span class="o">.</span><span class="n">ObjectID</span><span class="p">(</span><span class="mi">20</span> <span class="o">*</span> <span class="n">b</span><span class="s">'a'</span><span class="p">)</span> +<span class="n">object_size</span> <span class="o">=</span> <span class="mi">1000</span> +<span class="nb">buffer</span> <span class="o">=</span> <span class="n">memoryview</span><span class="p">(</span><span class="n">client</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">object_id</span><span class="p">,</span> <span class="n">object_size</span><span class="p">))</span> + +<span class="c"># Write to the buffer.</span> 
+<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span> + <span class="nb">buffer</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span> + +<span class="c"># Seal the object making it immutable and available to other clients.</span> +<span class="n">client</span><span class="o">.</span><span class="n">seal</span><span class="p">(</span><span class="n">object_id</span><span class="p">)</span> +</code></pre> +</div> + +<p><strong>Getting an object:</strong> After an object has been sealed, any client who knows the +object ID can get the object.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Get the object from the store. This blocks until the object has been sealed.</span> +<span class="n">object_id</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">plasma</span><span class="o">.</span><span class="n">ObjectID</span><span class="p">(</span><span class="mi">20</span> <span class="o">*</span> <span class="n">b</span><span class="s">'a'</span><span class="p">)</span> +<span class="p">[</span><span class="n">buff</span><span class="p">]</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">get</span><span class="p">([</span><span class="n">object_id</span><span class="p">])</span> +<span class="nb">buffer</span> <span class="o">=</span> <span class="n">memoryview</span><span class="p">(</span><span class="n">buff</span><span class="p">)</span> +</code></pre> +</div> + +<p>If the object has not been sealed yet, then the call to <code class="highlighter-rouge">client.get</code> will block +until the object has been sealed.</p> + +<h3 id="a-sorting-application">A sorting application</h3> + +<p>To illustrate the benefits of 
Plasma, we demonstrate an <strong>11x speedup</strong> (on a +machine with 20 physical cores) for sorting a large pandas DataFrame (one +billion entries). The baseline is the built-in pandas sort function, which sorts +the DataFrame in 477 seconds. To leverage multiple cores, we implement the +following standard distributed sorting scheme.</p> + +<ul> + <li>We assume that the data is partitioned across K pandas DataFrames and that +each one already lives in the Plasma store.</li> + <li>We subsample the data, sort the subsampled data, and use the result to define +L non-overlapping buckets.</li> + <li>For each of the K data partitions and each of the L buckets, we find the +subset of the data partition that falls in the bucket, and we sort that +subset.</li> + <li>For each of the L buckets, we gather all of the K sorted subsets that fall in +that bucket.</li> + <li>For each of the L buckets, we merge the corresponding K sorted subsets.</li> + <li>We turn each bucket into a pandas DataFrame and place it in the Plasma store.</li> +</ul> + +<p>Using this scheme, we can sort the DataFrame (the data starts and ends in the +Plasma store) in 44 seconds, giving an 11x speedup over the baseline.</p> + +<h3 id="design">Design</h3> + +<p>The Plasma store runs as a separate process. It is written in C++ and is +designed as a single-threaded event loop based on the <a href="https://redis.io/">Redis</a> event loop library. +The Plasma client library can be linked into applications. Clients communicate +with the Plasma store via messages serialized using <a href="https://google.github.io/flatbuffers/">Google FlatBuffers</a>.</p> + +<h3 id="call-for-contributions">Call for contributions</h3> + +<p>Plasma is a work in progress, and the API is currently unstable. Today Plasma is +primarily used in <a href="https://github.com/ray-project/ray">Ray</a> as an in-memory cache for Arrow-serialized objects. +We are looking for a broader set of use cases to help refine Plasma’s API.
In +addition, we are looking for contributions in a variety of areas including +improving performance and building other language bindings. Please let us know +if you are interested in getting involved with the project.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Speeding up PySpark with Apache Arrow + <a href="/blog/2017/07/26/spark-arrow/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 26 Jul 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (BryanCutler)</a> + </div> + </div> + </div> + <!-- + +--> + +<p><em><a href="https://github.com/BryanCutler">Bryan Cutler</a> is a software engineer at IBM’s Spark Technology Center <a href="http://www.spark.tc/">STC</a></em></p> + +<p>Beginning with <a href="https://spark.apache.org/">Apache Spark</a> version 2.3, <a href="https://arrow.apache.org/">Apache Arrow</a> will be a supported +dependency and begin to offer increased performance with columnar data transfer. +If you are a Spark user that prefers to work in Python and Pandas, this is a cause +to be excited over!
The initial work is limited to collecting a Spark DataFrame +with <code class="highlighter-rouge">toPandas()</code>, which I will discuss below; however, there are many additional +improvements that are currently <a href="https://issues.apache.org/jira/issues/?filter=12335725&jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22arrow%22%20ORDER%20BY%20createdDate%20DESC">underway</a>.</p> + +<h1 id="optimizing-spark-conversion-to-pandas">Optimizing Spark Conversion to Pandas</h1> + +<p>The previous way of converting a Spark DataFrame to Pandas with <code class="highlighter-rouge">DataFrame.toPandas()</code> +in PySpark was painfully inefficient. Basically, it worked by first collecting all +rows to the Spark driver. Next, each row would get serialized into Python’s pickle +format and sent to a Python worker process. This child process would unpickle each row into +a huge list of tuples. Finally, a Pandas DataFrame would be created from the list using +<code class="highlighter-rouge">pandas.DataFrame.from_records()</code>.</p> + +<p>This all might seem like standard procedure, but suffers from 2 glaring issues: 1) +even using cPickle, Python serialization is a slow process and 2) creating +a <code class="highlighter-rouge">pandas.DataFrame</code> using <code class="highlighter-rouge">from_records</code> must slowly iterate over the list of pure +Python data and convert each value to Pandas format.
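</p>

<p>Concretely, that older path amounts to something like the following sketch (simplified for illustration; the real PySpark internals differ):</p>

```python
import pickle

import pandas as pd

# Rows collected at the driver, pickled for transfer (a simplified
# stand-in for the real PySpark wire protocol).
rows = [(i, i / 100.0) for i in range(1000)]
payload = pickle.dumps(rows)

# The Python worker unpickles everything into a big list of tuples...
unpickled = pickle.loads(payload)

# ...and from_records must then iterate over every pure-Python value.
pdf = pd.DataFrame.from_records(unpickled, columns=["id", "x"])
```

<p>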
See <a href="https://gist.github.com/wesm/0cb5531b1c2e346a0007">here</a> for a detailed +analysis.</p> + +<p>Here is where Arrow really shines to help optimize these steps: 1) Once the data is +in Arrow memory format, there is no need to serialize/pickle anymore as Arrow data can +be sent directly to the Python process, 2) When the Arrow data is received in Python, +then pyarrow can utilize zero-copy methods to create a <code class="highlighter-rouge">pandas.DataFrame</code> from entire +chunks of data at once instead of processing individual scalar values. Additionally, +the conversion to Arrow data can be done on the JVM and pushed back for the Spark +executors to perform in parallel, drastically reducing the load on the driver.</p> + +<p>As of the merging of <a href="https://issues.apache.org/jira/browse/SPARK-13534">SPARK-13534</a>, the use of Arrow when calling <code class="highlighter-rouge">toPandas()</code> +needs to be enabled by setting the SQLConf “spark.sql.execution.arrow.enabled” to +“true”. Let’s look at a simple usage example.</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>Welcome to + ____ __ + / __/__ ___ _____/ /__ + _\ \/ _ \/ _ `/ __/ '_/ + /__ / .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT + /_/ + +Using Python version 2.7.13 (default, Dec 20 2016 23:09:15) +SparkSession available as 'spark'.
+ +In [1]: from pyspark.sql.functions import rand + ...: df = spark.range(1 << 22).toDF("id").withColumn("x", rand()) + ...: df.printSchema() + ...: +root + |-- id: long (nullable = false) + |-- x: double (nullable = false) + + +In [2]: %time pdf = df.toPandas() +CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s +Wall time: 20.7 s + +In [3]: spark.conf.set("spark.sql.execution.arrow.enabled", "true") + +In [4]: %time pdf = df.toPandas() +CPU times: user 40 ms, sys: 32 ms, total: 72 ms +Wall time: 737 ms + +In [5]: pdf.describe() +Out[5]: + id x +count 4.194304e+06 4.194304e+06 +mean 2.097152e+06 4.998996e-01 +std 1.210791e+06 2.887247e-01 +min 0.000000e+00 8.291929e-07 +25% 1.048576e+06 2.498116e-01 +50% 2.097152e+06 4.999210e-01 +75% 3.145727e+06 7.498380e-01 +max 4.194303e+06 9.999996e-01 +</code></pre> +</div> + +<p>This example was run locally on my laptop using Spark defaults so the times +shown should not be taken as precise. Even so, it is clear there is a huge +performance boost: Arrow took something that was excruciatingly slow +and sped it up to be barely noticeable.</p> + +<h1 id="notes-on-usage">Notes on Usage</h1> + +<p>Here are some things to keep in mind before making use of this new feature. At +the time of writing this, pyarrow will not be installed automatically with +pyspark and needs to be manually installed; see the installation <a href="https://github.com/apache/arrow/blob/master/site/install.md">instructions</a>. +It is planned to add pyarrow as a pyspark dependency so that +<code class="highlighter-rouge">> pip install pyspark</code> will also install pyarrow.</p> + +<p>Currently, the controlling SQLConf is disabled by default. This can be enabled +programmatically as in the example above or by adding the line +“spark.sql.execution.arrow.enabled=true” to <code class="highlighter-rouge">SPARK_HOME/conf/spark-defaults.conf</code>.</p> + +<p>Also, not all Spark data types are currently supported; support is limited to primitive +types.
Expanded type support is in the works and expected to also be in the Spark +2.3 release.</p> + +<h1 id="future-improvements">Future Improvements</h1> + +<p>As mentioned, this was just a first step in using Arrow to make life easier for +Spark Python users. A few exciting initiatives in the works are to allow for +vectorized UDF evaluation (<a href="https://issues.apache.org/jira/browse/SPARK-21190">SPARK-21190</a>, <a href="https://issues.apache.org/jira/browse/SPARK-21404">SPARK-21404</a>), and the ability +to apply a function on grouped data using a Pandas DataFrame (<a href="https://issues.apache.org/jira/browse/SPARK-20396">SPARK-20396</a>). +Just as Arrow helped in converting a Spark DataFrame to Pandas, it can also work in the +other direction when creating a Spark DataFrame from an existing Pandas +DataFrame (<a href="https://issues.apache.org/jira/browse/SPARK-20791">SPARK-20791</a>). Stay tuned for more!</p> + +<h1 id="collaborators">Collaborators</h1> + +<p>Reaching this first milestone was a group effort from both the Apache Arrow and +Spark communities. Thanks to the hard work of <a href="https://github.com/wesm">Wes McKinney</a>, <a href="https://github.com/icexelloss">Li Jin</a>, +<a href="https://github.com/holdenk">Holden Karau</a>, Reynold Xin, Wenchen Fan, Shane Knapp and many others who +helped push this effort forward.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Apache Arrow 0.5.0 Release + <a href="/blog/2017/07/25/0.5.0-release/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div cla
<TRUNCATED>
