http://git-wip-us.apache.org/repos/asf/arrow-site/blob/3cd84682/build/blog/index.html ---------------------------------------------------------------------- diff --git a/build/blog/index.html b/build/blog/index.html new file mode 100644 index 0000000..cfddf8f --- /dev/null +++ b/build/blog/index.html @@ -0,0 +1,2172 @@ +<!DOCTYPE html> +<html lang="en-US"> + <head> + <meta charset="UTF-8"> + <title>Apache Arrow Homepage</title> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <meta name="generator" content="Jekyll v3.4.3"> + <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags --> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + + <link href="/css/main.css" rel="stylesheet"> + <link href="/css/syntax.css" rel="stylesheet"> + <script src="https://code.jquery.com/jquery-3.2.1.min.js" + integrity="sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4=" + crossorigin="anonymous"></script> + <script src="/assets/javascripts/bootstrap.min.js"></script> + + <!-- Global Site Tag (gtag.js) - Google Analytics --> +<script async src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script> +<script> + window.dataLayer = window.dataLayer || []; + function gtag(){dataLayer.push(arguments)}; + gtag('js', new Date()); + + gtag('config', 'UA-107500873-1'); +</script> + + + </head> + + +<body class="wrap"> + <div class="container"> + <nav class="navbar navbar-default"> + <div class="container-fluid"> + <div class="navbar-header"> + <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#arrow-navbar"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a 
class="navbar-brand" href="/">Apache Arrow™ </a> + </div> + + <!-- Collect the nav links, forms, and other content for toggling --> + <div class="collapse navbar-collapse" id="arrow-navbar"> + <ul class="nav navbar-nav"> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Project Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/install/">Install</a></li> + <li><a href="/blog/">Blog</a></li> + <li><a href="/release/">Releases</a></li> + <li><a href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a></li> + <li><a href="https://github.com/apache/arrow">Source Code</a></li> + <li><a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Mailing List</a></li> + <li><a href="https://apachearrowslackin.herokuapp.com">Slack Channel</a></li> + <li><a href="/committers/">Committers</a></li> + <li><a href="/powered_by/">Powered By</a></li> + </ul> + </li> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Specification<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/memory_layout.html">Memory Layout</a></li> + <li><a href="/docs/metadata.html">Metadata</a></li> + <li><a href="/docs/ipc.html">Messaging / IPC</a></li> + </ul> + </li> + + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Documentation<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/python">Python</a></li> + <li><a href="/docs/cpp">C++ API</a></li> + <li><a href="/docs/java">Java API</a></li> + <li><a href="/docs/c_glib">C GLib API</a></li> + <li><a href="/docs/js">Javascript API</a></li> + </ul> + </li> + <!-- <li><a href="/blog">Blog</a></li> --> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" 
data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">ASF Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="http://www.apache.org/">ASF Website</a></li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Donate</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </li> + </ul> + <a href="http://www.apache.org/"> + <img style="float:right;" src="/img/asf_logo.svg" width="120px"/> + </a> + </div><!-- /.navbar-collapse --> + </div> + </nav> + + + + +<h2>Project News and Blog</h2> +<hr> + + + + + <div class="container"> + <h2> + A Native Go Library for Apache Arrow + <a href="/blog/2018/03/22/go-code-donation/" class="permalink" title="Permalink">&para;</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 22 Mar 2018 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://github.com/pmc"><i class="fa fa-user"></i> The Apache Arrow PMC (pmc)</a> + </div> + </div> + </div> + <!-- + +--> + +<p>Since launching in early 2016, Apache Arrow has been growing fast. We have made +nine major releases through the efforts of over 120 distinct contributors. The +project’s scope has also expanded. We began by focusing on the development of +the standardized in-memory columnar data format, which now serves as a pillar +of the project. Since then, we have been growing into a more general +cross-language platform for in-memory data analysis through new additions to +the project like the <a href="http://arrow.apache.org/blog/2017/08/16/0.6.0-release/">Plasma shared memory object store</a>. 
A primary goal of +the project is to enable data system developers to process and move data fast.</p> + +<p>So far, we have officially developed native Arrow implementations in C++, Java, +and JavaScript. We have created binding layers for the C++ libraries in C +(using the GLib libraries) and Python. We have also seen efforts to develop +interfaces to the Arrow C++ libraries in Go, Lua, Ruby, and Rust. While binding +layers serve many purposes, there can be benefits to native implementations, +and so we’ve been keen to see future work on native implementations in growing +systems languages like Go and Rust.</p> + +<p>This past October, engineers <a href="https://github.com/stuartcarnie">Stuart Carnie</a>, <a href="https://github.com/nathanielc">Nathaniel Cook</a>, and +<a href="https://github.com/goller">Chris Goller</a>, employees of <a href="https://influxdata.com">InfluxData</a>, began developing a native Go +language implementation of the <a href="https://github.com/influxdata/arrow">Apache Arrow</a> in-memory columnar format for +use in Go-based database systems like InfluxDB. We are excited to announce that +InfluxData has <a href="https://www.businesswire.com/news/home/20180322005393/en/InfluxData-Announces-Language-Implementation-Contribution-Apache-Arrow">donated this native Go implementation to the Apache Arrow +project</a>, where it will continue to be developed. This work features +low-level integration with the Go runtime and native support for SIMD +instruction sets. We are looking forward to working more closely with the Go +community on solving in-memory analytics and data interoperability problems.</p> + +<div align="center"> +<img src="/img/native_go_implementation.png" alt="Apache Arrow implementations and bindings" width="60%" class="img-responsive" /> +</div> + +<p>One of the mantras in <a href="https://www.apache.org">The Apache Software Foundation</a> is “Community over +Code”. 
By building an open and collaborative development community across many +programming language ecosystems, we will be able to develop better and +longer-lived solutions to the systems problems faced by data developers.</p> + +<p>We are excited for what the future holds for the Apache Arrow project. Adding +first-class support for a popular systems programming language like Go is an +important step along the way. We welcome others from the Go community to get +involved in the project. We also welcome others who wish to explore building +Arrow support for other programming languages not yet represented. Learn more +at <a href="https://arrow.apache.org">https://arrow.apache.org</a> and join the mailing list +<a href="https://lists.apache.org/[email protected]">[email protected]</a>.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Apache Arrow 0.9.0 Release + <a href="/blog/2018/03/22/0.9.0-release/" class="permalink" title="Permalink">&para;</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 22 Mar 2018 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (wesm)</a> + </div> + </div> + </div> + <!-- + +--> + +<p>The Apache Arrow team is pleased to announce the 0.9.0 release. It is the +product of over 3 months of development and includes <a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.9.0"><strong>260 resolved +JIRAs</strong></a>.</p> + +<p>While we made some backwards-incompatible columnar binary format changes in +last December’s 0.8.0 release, the 0.9.0 release is backwards-compatible with +0.8.0. 
We will be working toward a 1.0.0 release this year, which will mark +longer-term binary stability for the Arrow columnar format and metadata.</p> + +<p>See the <a href="https://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your +platform. The <a href="https://arrow.apache.org/release/0.9.0.html">complete changelog</a> is also available.</p> + +<p>We discuss some highlights from the release and other project news in this +post. Compared with previous releases, which pushed more on new and expanded +features, this release focused more on bug fixes, compatibility, and +stability.</p> + +<h2 id="new-arrow-committers-and-pmc-members">New Arrow committers and PMC members</h2> + +<p>Since the last release, we have added 2 new Arrow committers: <a href="https://github.com/theneuralbit">Brian +Hulette</a> and <a href="https://github.com/robertnishihara">Robert Nishihara</a>. Additionally, <a href="https://github.com/cpcloud">Phillip Cloud</a> and +<a href="https://github.com/pcmoritz">Philipp Moritz</a> have been promoted from committer to PMC +members. Congratulations and thank you for your contributions!</p> + +<h2 id="plasma-object-store-improvements">Plasma Object Store Improvements</h2> + +<p>The Plasma Object Store now supports managing interprocess shared memory on +CUDA-enabled GPUs. 
We are excited to see more GPU-related functionality develop +in Apache Arrow, as this has become a key computing environment for scalable +machine learning.</p> + +<h2 id="python-improvements">Python Improvements</h2> + +<p><a href="https://github.com/pitrou">Antoine Pitrou</a> has joined the Python development efforts and helped +significantly this release with interoperability with built-in CPython data +structures and NumPy structured data types.</p> + +<ul> + <li>New experimental support for reading Apache ORC files</li> + <li><code class="highlighter-rouge">pyarrow.array</code> now accepts lists of tuples or Python dicts for creating +Arrow struct type arrays.</li> + <li>NumPy structured dtypes (which are row/record-oriented) can be directly +converted to Arrow struct (column-oriented) arrays</li> + <li>Python 3.6 <code class="highlighter-rouge">pathlib</code> objects for file paths are now accepted in many file +APIs, including for Parquet files</li> + <li>Arrow integer arrays with nulls can now be converted to NumPy object arrays +with <code class="highlighter-rouge">None</code> values</li> + <li>New <code class="highlighter-rouge">pyarrow.foreign_buffer</code> API for interacting with memory blocks located +at particular memory addresses</li> +</ul> + +<h2 id="java-improvements">Java Improvements</h2> + +<p>Java now fully supports the <code class="highlighter-rouge">FixedSizeBinary</code> data type.</p> + +<h2 id="javascript-improvements">JavaScript Improvements</h2> + +<p>The JavaScript library has been significantly refactored and expanded. We are +making separate Apache releases (most recently <code class="highlighter-rouge">JS-0.3.1</code>) for JavaScript, +which are being <a href="https://www.npmjs.com/package/apache-arrow">published to NPM</a>.</p> + +<h2 id="upcoming-roadmap">Upcoming Roadmap</h2> + +<p>In the coming months, we will be working to move Apache Arrow closer to a 1.0.0 +release. 
We will also be discussing plans to develop native Arrow-based +computational libraries within the project.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Apache Arrow 0.8.0 Release + <a href="/blog/2017/12/18/0.8.0-release/" class="permalink" title="Permalink">&para;</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 18 Dec 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (wesm)</a> + </div> + </div> + </div> + <!-- + +--> + +<p>The Apache Arrow team is pleased to announce the 0.8.0 release. It is the +product of 10 weeks of development and includes <a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.8.0"><strong>286 resolved JIRAs</strong></a> with +many new features and bug fixes to the various language implementations. This +is the largest release since 0.3.0 earlier this year.</p> + +<p>As part of work towards stabilizing the Arrow format and making a 1.0.0 +release sometime in 2018, we made a series of backwards-incompatible changes to +the serialized Arrow metadata that require Arrow readers and writers (0.7.1 +and earlier) to upgrade in order to be compatible with 0.8.0 and higher. We +expect future backwards-incompatible changes to be rare going forward.</p> + +<p>See the <a href="https://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your +platform. 
The <a href="https://arrow.apache.org/release/0.8.0.html">complete changelog</a> is also available.</p> + +<p>We discuss some highlights from the release and other project news in this +post.</p> + +<h2 id="projects-powered-by-apache-arrow">Projects “Powered By” Apache Arrow</h2> + +<p>A growing ecosystem of projects is using Arrow to solve in-memory analytics +and data interchange problems. We have added a new <a href="http://arrow.apache.org/powered_by/">Powered By</a> page to the +Arrow website where we can acknowledge open source projects and companies which +are using Arrow. If you would like to add your project to the list as an Arrow +user, please let us know.</p> + +<h2 id="new-arrow-committers">New Arrow committers</h2> + +<p>Since the last release, we have added 5 new Apache committers:</p> + +<ul> + <li><a href="https://github.com/cpcloud">Phillip Cloud</a>, who has mainly contributed to C++ and Python</li> + <li><a href="https://github.com/BryanCutler">Bryan Cutler</a>, who has mainly contributed to Java and Spark integration</li> + <li><a href="https://github.com/icexelloss">Li Jin</a>, who has mainly contributed to Java and Spark integration</li> + <li><a href="https://github.com/trxcllnt">Paul Taylor</a>, who has mainly contributed to JavaScript</li> + <li><a href="https://github.com/siddharthteotia">Siddharth Teotia</a>, who has mainly contributed to Java</li> +</ul> + +<p>Welcome to the Arrow team, and thank you for your contributions!</p> + +<h2 id="improved-java-vector-api-performance-improvements">Improved Java vector API, performance improvements</h2> + +<p>Siddharth Teotia led efforts to revamp the Java vector API to make things +simpler and faster. 
As part of this, we removed the dichotomy between nullable +and non-nullable vectors.</p> + +<p>See <a href="https://arrow.apache.org/blog/2017/12/19/java-vector-improvements/">Sidd’s blog post</a> for more about these changes.</p> + +<h2 id="decimal-support-in-c-python-consistency-with-java">Decimal support in C++, Python, consistency with Java</h2> + +<p><a href="https://github.com/cpcloud">Phillip Cloud</a> led efforts this release to harden details about exact +decimal values in the Arrow specification and ensure a consistent +implementation across Java, C++, and Python.</p> + +<p>Arrow now supports decimals represented internally as a 128-bit little-endian +integer, with a set precision and scale (as defined in many SQL-based +systems). As part of this work, we needed to change Java’s internal +representation from big- to little-endian.</p> + +<p>We are now integration testing decimals between Java, C++, and Python, which +will facilitate Arrow adoption in Apache Spark and other systems that use both +Java and Python.</p> + +<p>Decimal data can now be read and written by the <a href="https://github.com/apache/parquet-cpp">Apache Parquet C++ +library</a>, including via pyarrow.</p> + +<p>In the future, we may implement support for smaller-precision decimals +represented by 32- or 64-bit integers.</p> + +<h2 id="c-improvements-expanded-kernels-library-and-more">C++ improvements: expanded kernels library and more</h2> + +<p>In C++, we have continued developing the new <code class="highlighter-rouge">arrow::compute</code> submodule +consisting of native computation functions for Arrow data. New contributor +<a href="https://github.com/licht-t">Licht Takeuchi</a> helped expand the supported types for type casting in +<code class="highlighter-rouge">compute::Cast</code>. 
We have also implemented new kernels <code class="highlighter-rouge">Unique</code> and +<code class="highlighter-rouge">DictionaryEncode</code> for computing the distinct elements of an array and +dictionary encoding (conversion to categorical), respectively.</p> + +<p>We expect the C++ computation “kernel” library to be a major expansion area for +the project over the next year and beyond. Here, we can also implement SIMD- +and GPU-accelerated versions of basic in-memory analytics functionality.</p> + +<p>As a minor breaking API change in C++, we have made the <code class="highlighter-rouge">RecordBatch</code> and <code class="highlighter-rouge">Table</code> +APIs “virtual” or abstract interfaces, to enable different implementations of a +record batch or table which conform to the standard interface. This will help +enable features like lazy IO or column loading.</p> + +<p>There was significant work improving the C++ library generally and supporting +work happening in Python and C. See the change log for full details.</p> + +<h2 id="glib-c-improvements-meson-build-gpu-support">GLib C improvements: Meson build, GPU support</h2> + +<p>Development of the GLib-based C bindings has generally tracked work happening in +the C++ library. 
These bindings are being used to develop <a href="https://github.com/red-data-tools">data science tools +for Ruby users</a> and elsewhere.</p> + +<p>The C bindings now support the <a href="https://mesonbuild.com">Meson build system</a> in addition to +autotools, which enables them to be built on Windows.</p> + +<p>The Arrow GPU extension library is now also supported in the C bindings.</p> + +<h2 id="javascript-first-independent-release-on-npm">JavaScript: first independent release on NPM</h2> + +<p><a href="https://github.com/TheNeuralBit">Brian Hulette</a> and <a href="https://github.com/trxcllnt">Paul Taylor</a> have been continuing to drive efforts +on the TypeScript-based JavaScript implementation.</p> + +<p>Since the last release, we made a first JavaScript-only Apache release, version +0.2.0, which is <a href="http://npmjs.org/package/apache-arrow">now available on NPM</a>. We decided to make separate +JavaScript releases to enable the JS library to release more frequently than +the rest of the project.</p> + +<h2 id="python-improvements">Python improvements</h2> + +<p>In addition to some of the new features mentioned above, we have made a variety +of usability and performance improvements for integrations with pandas, NumPy, +Dask, and other Python projects which may make use of pyarrow, the Arrow Python +library.</p> + +<p>Some of these improvements include:</p> + +<ul> + <li><a href="http://arrow.apache.org/docs/python/ipc.html">Component-based serialization</a> for more flexible and memory-efficient +transport of large or complex Python objects</li> + <li>Substantially improved serialization performance for pandas objects when +using <code class="highlighter-rouge">pyarrow.serialize</code> and <code class="highlighter-rouge">pyarrow.deserialize</code>. 
This includes a special +<code class="highlighter-rouge">pyarrow.pandas_serialization_context</code> which further accelerates certain +internal details of pandas serialization</li> + <li>Support for zero-copy reads of <code class="highlighter-rouge">pandas.DataFrame</code> using <code class="highlighter-rouge">pyarrow.deserialize</code> for objects without Python +objects</li> + <li>Multithreaded conversions from <code class="highlighter-rouge">pandas.DataFrame</code> to <code class="highlighter-rouge">pyarrow.Table</code> (we +already supported multithreaded conversions from Arrow back to pandas)</li> + <li>More efficient conversion from 1-dimensional NumPy arrays to Arrow format</li> + <li>New generic buffer compression and decompression APIs <code class="highlighter-rouge">pyarrow.compress</code> and +<code class="highlighter-rouge">pyarrow.decompress</code></li> + <li>Enhanced Parquet cross-compatibility with <a href="https://github.com/dask/fastparquet">fastparquet</a> and improved Dask +support</li> + <li>Python support for accessing Parquet row group column statistics</li> +</ul> + +<h2 id="upcoming-roadmap">Upcoming Roadmap</h2> + +<p>The 0.8.0 release includes some API and format changes, but upcoming releases +will focus on completing and stabilizing critical functionality to move the +project closer to a 1.0.0 release.</p> + +<p>With the ecosystem of projects using Arrow expanding rapidly, we will be +working to improve and expand the libraries in support of downstream use cases.</p> + +<p>We continue to look for more JavaScript, Julia, R, Rust, and other programming +language developers to join the project and expand the available +implementations and bindings to more languages.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Improvements to Java Vector API in Apache Arrow 0.8.0 + <a href="/blog/2017/12/19/java-vector-improvements/" class="permalink" title="Permalink">&para;</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + 
<div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 18 Dec 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (Siddharth Teotia)</a> + </div> + </div> + </div> + <!-- + +--> + +<p>This post gives insight into the major improvements in the Java implementation +of vectors. We undertook this work over the last 10 weeks since the last Arrow +release.</p> + +<h2 id="design-goals">Design Goals</h2> + +<ol> + <li>Improved maintainability and extensibility</li> + <li>Improved heap memory usage</li> + <li>No performance overhead on hot code paths</li> +</ol> + +<h2 id="background">Background</h2> + +<h3 id="improved-maintainability-and-extensibility">Improved maintainability and extensibility</h3> + +<p>We use templates in several places for compile-time Java code generation for +different vector classes, readers, writers, etc. Templates are helpful as +developers don’t have to write a lot of duplicate code.</p> + +<p>However, we realized that over time some specific Java +templates became extremely complex, with giant if-else blocks, poor code indentation, +and poor documentation. All this impacted the ability to easily extend these templates +to add new functionality or improve the existing infrastructure.</p> + +<p>So we re-evaluated the use of templates for compile-time code generation and +decided not to use complex templates in some places, instead writing a small amount of +duplicate code that is elegant, well documented, and extensible.</p> + +<h3 id="improved-heap-usage">Improved heap usage</h3> + +<p>We did extensive memory analysis downstream in <a href="https://www.dremio.com/">Dremio</a>, where Arrow is used +heavily for in-memory query execution on columnar data. 
The general conclusion +was that Arrow’s Java vector classes have non-negligible heap overhead and +the volume of objects was too high. There were places in the code where we were +creating objects unnecessarily and using structures that could be substituted +with better alternatives.</p> + +<h3 id="no-performance-overhead-on-hot-code-paths">No performance overhead on hot code paths</h3> + +<p>Java vectors used delegation and abstraction heavily throughout the object +hierarchy. The performance-critical get/set methods of vectors went through a +chain of function calls back and forth between different objects before doing +meaningful work. We also evaluated the usage of branches in vector APIs and +reimplemented some of them by avoiding branches completely.</p> + +<p>We took inspiration from how the Java memory code in <code class="highlighter-rouge">ArrowBuf</code> works. For all +the performance-critical methods, <code class="highlighter-rouge">ArrowBuf</code> bypasses all the netty object +hierarchy, grabs the target virtual address and directly interacts with the +memory.</p> + +<p>There were cases where branches could be avoided altogether.</p> + +<p>In the case of nullable vectors, we were doing multiple checks to confirm if +the value at a given position in the vector is null or not.</p> + +<h2 id="our-implementation-approach">Our implementation approach</h2> + +<ul> + <li>For scalars, the inheritance tree was simplified by writing different +abstract base classes for fixed and variable width scalars.</li> + <li>The base classes contained all the common functionality across different +types.</li> + <li>The individual subclasses implemented type specific APIs for fixed and +variable width scalar vectors.</li> + <li>For the performance critical methods, all the work is done either in +the vector class or corresponding ArrowBuf. There is no delegation to any +internal object.</li> + <li>The mutator and accessor based access to vector APIs is removed. 
These +objects led to unnecessary heap overhead and complicated the use of APIs.</li> + <li>Both scalar and complex vectors directly interact with underlying buffers +that manage the offsets, data and validity. Earlier we were creating different +inner vectors for each vector and delegating all the functionality to inner +vectors. This introduced a lot of bugs in memory management, excessive heap +overhead, and a performance penalty due to the chain of delegations.</li> + <li>We reduced the number of vector classes by removing non-nullable vectors. +In the new implementation, all vectors in Java are nullable in nature.</li> +</ul> + + + </div> + + + + + + <div class="container"> + <h2> + Fast Python Serialization with Ray and Apache Arrow + <a href="/blog/2017/10/15/fast-python-serialization-with-ray-and-arrow/" class="permalink" title="Permalink">&para;</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 15 Oct 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (Philipp Moritz, Robert Nishihara)</a> + </div> + </div> + </div> + <!-- + +--> + +<p><em>This was originally posted on the <a href="https://ray-project.github.io/">Ray blog</a>. <a href="https://people.eecs.berkeley.edu/~pcmoritz/">Philipp Moritz</a> and <a href="http://www.robertnishihara.com">Robert Nishihara</a> are graduate students at UC Berkeley.</em></p> + +<p>This post elaborates on the integration between <a href="http://ray.readthedocs.io/en/latest/index.html">Ray</a> and <a href="https://arrow.apache.org/">Apache Arrow</a>. 
+The main problem this addresses is <a href="https://en.wikipedia.org/wiki/Serialization">data serialization</a>.</p> + +<p>From <a href="https://en.wikipedia.org/wiki/Serialization">Wikipedia</a>, <strong>serialization</strong> is</p> + +<blockquote> + <p>… the process of translating data structures or object state into a format +that can be stored … or transmitted … and reconstructed later (possibly +in a different computer environment).</p> +</blockquote> + +<p>Why is any translation necessary? Well, when you create a Python object, it may +have pointers to other Python objects, and these objects are all allocated in +different regions of memory, and all of this has to make sense when unpacked by +another process on another machine.</p> + +<p>Serialization and deserialization are <strong>bottlenecks in parallel and distributed +computing</strong>, especially in machine learning applications with large objects and +large quantities of data.</p> + +<h2 id="design-goals">Design Goals</h2> + +<p>As Ray is optimized for machine learning and AI applications, we have focused a +lot on serialization and data handling, with the following design goals:</p> + +<ol> + <li>It should be very efficient with <strong>large numerical data</strong> (this includes +NumPy arrays and Pandas DataFrames, as well as objects that recursively contain +NumPy arrays and Pandas DataFrames).</li> + <li>It should be about as fast as Pickle for <strong>general Python types</strong>.</li> + <li>It should be compatible with <strong>shared memory</strong>, allowing multiple processes +to use the same data without copying it.</li> + <li><strong>Deserialization</strong> should be extremely fast (when possible, it should not +require reading the entire serialized object).</li> + <li>It should be <strong>language independent</strong> (eventually we’d like to enable Python +workers to use objects created by workers in Java or other languages and vice +versa).</li> +</ol> + +<h2 
id="our-approach-and-alternatives">Our Approach and Alternatives</h2> + +<p>The go-to serialization approach in Python is the <strong>pickle</strong> module. Pickle is +very general, especially if you use variants like <a href="https://github.com/cloudpipe/cloudpickle/">cloudpickle</a>. However, it +does not satisfy requirements 1, 3, 4, or 5. Alternatives like <strong>json</strong> satisfy +5, but not 1-4.</p> + +<p><strong>Our Approach:</strong> To satisfy requirements 1-5, we chose to use the +<a href="https://arrow.apache.org/">Apache Arrow</a> format as our underlying data representation. In collaboration +with the Apache Arrow team, we built <a href="https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization">libraries</a> for mapping general Python +objects to and from the Arrow format. Some properties of this approach:</p> + +<ul> + <li>The data layout is language independent (requirement 5).</li> + <li>Offsets into a serialized data blob can be computed in constant time without +reading the full object (requirements 1 and 4).</li> + <li>Arrow supports <strong>zero-copy reads</strong>, so objects can naturally be stored in +shared memory and used by multiple processes (requirements 1 and 3).</li> + <li>We can naturally fall back to pickle for anything we can’t handle well +(requirement 2).</li> +</ul> + +<p><strong>Alternatives to Arrow:</strong> We could have built on top of +<a href="https://developers.google.com/protocol-buffers/"><strong>Protocol Buffers</strong></a>, but protocol buffers really isn’t designed for +numerical data, and that approach wouldn’t satisfy 1, 3, or 4. 
Building on top +of <a href="https://google.github.io/flatbuffers/"><strong>Flatbuffers</strong></a> could actually be made to work, but it would have +required implementing a lot of the facilities that Arrow already has, and we +preferred a columnar data layout more optimized for big data.</p> + +<h2 id="speedups">Speedups</h2> + +<p>Here we show some performance improvements over Python's pickle module. The +experiments were done using <code class="highlighter-rouge">pickle.HIGHEST_PROTOCOL</code>. Code for generating these +plots is included at the end of the post.</p> + +<p><strong>With NumPy arrays:</strong> In machine learning and AI applications, data (e.g., +images, neural network weights, text documents) are typically represented as +data structures containing NumPy arrays. When using NumPy arrays, the speedups +are impressive.</p> + +<p>The fact that the Ray bars for deserialization are barely visible is not a +mistake. This is a consequence of the support for zero-copy reads (the savings +largely come from the lack of memory movement).</p> + +<div align="center"> +<img src="/assets/fast_python_serialization_with_ray_and_arrow/speedups0.png" width="365" height="255" /> +<img src="/assets/fast_python_serialization_with_ray_and_arrow/speedups1.png" width="365" height="255" /> +</div> + +<p>Note that the biggest wins are with deserialization. The speedups here are +multiple orders of magnitude and get better as the NumPy arrays get larger +(thanks to design goals 1, 3, and 4). Making <strong>deserialization</strong> fast is +important for two reasons. First, an object may be serialized once and then +deserialized many times (e.g., an object that is broadcast to all workers). 
+Second, a common pattern is for many objects to be serialized in parallel and +then aggregated and deserialized one at a time on a single worker, making +deserialization the bottleneck.</p> + +<p><strong>Without NumPy arrays:</strong> When using regular Python objects, for which we +cannot take advantage of shared memory, the results are comparable to pickle.</p> + +<div align="center"> +<img src="/assets/fast_python_serialization_with_ray_and_arrow/speedups2.png" width="365" height="255" /> +<img src="/assets/fast_python_serialization_with_ray_and_arrow/speedups3.png" width="365" height="255" /> +</div> + +<p>These are just a few examples of interesting Python objects. The most important +case is the one in which NumPy arrays are nested within other objects. Note that +our serialization library works with very general Python types, including custom +Python classes and deeply nested objects.</p> + +<h2 id="the-api">The API</h2> + +<p>The serialization library can be used directly through pyarrow as follows. 
More +documentation is available <a href="https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization">here</a>.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="s">'hello'</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">])]</span> +<span class="n">serialized_x</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">to_buffer</span><span class="p">()</span> +<span class="n">deserialized_x</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">deserialize</span><span class="p">(</span><span class="n">serialized_x</span><span class="p">)</span> +</code></pre> +</div> + +<p>It can be used directly through the Ray API as follows.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="s">'hello'</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">])]</span> 
+<span class="n">x_id</span> <span class="o">=</span> <span class="n">ray</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> +<span class="n">deserialized_x</span> <span class="o">=</span> <span class="n">ray</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">x_id</span><span class="p">)</span> +</code></pre> +</div> + +<h2 id="data-representation">Data Representation</h2> + +<p>We use Apache Arrow as the underlying language-independent data layout. Objects +are stored in two parts: a <strong>schema</strong> and a <strong>data blob</strong>. At a high level, the +data blob is roughly a flattened concatenation of all of the data values +recursively contained in the object, and the schema defines the types and +nesting structure of the data blob.</p> + +<p><strong>Technical Details:</strong> Python sequences (e.g., dictionaries, lists, tuples, +sets) are encoded as Arrow <a href="http://arrow.apache.org/docs/memory_layout.html#dense-union-type">UnionArrays</a> of other types (e.g., bools, ints, +strings, bytes, floats, doubles, date64s, tensors (i.e., NumPy arrays), lists, +tuples, dicts and sets). Nested sequences are encoded using Arrow +<a href="http://arrow.apache.org/docs/memory_layout.html#list-type">ListArrays</a>. 
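</p>

<p>As a rough, pure-Python sketch of the offsets-plus-flat-values idea behind ListArrays (an illustration of the concept only, not the actual Arrow memory layout):</p>

```python
# Illustration only: encode a list of lists as one flat child array plus
# an offsets array, the way Arrow ListArrays do, then read an element
# back using only the offsets (no scan over the values).

def encode_list_array(lists):
    offsets = [0]
    values = []
    for sub in lists:
        values.extend(sub)
        offsets.append(len(values))
    return offsets, values

def get_element(offsets, values, i):
    # Constant-time access: only two offset lookups are needed.
    return values[offsets[i]:offsets[i + 1]]

offsets, values = encode_list_array([[1, 2], [], [3, 4, 5]])
print(offsets)                          # [0, 2, 2, 5]
print(get_element(offsets, values, 2))  # [3, 4, 5]
```

<p>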
All tensors are collected and appended to the end of the +serialized object, and the UnionArray contains references to these tensors.</p> + +<p>To give a concrete example, consider the following object.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="p">[(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="s">'hello'</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">])]</span> +</code></pre> +</div> + +<p>It would be represented in Arrow with the following structure.</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>UnionArray(type_ids=[tuple, string, int, int, ndarray], + tuples=ListArray(offsets=[0, 2], + UnionArray(type_ids=[int, int], + ints=[1, 2])), + strings=['hello'], + ints=[3, 4], + ndarrays=[<offset of numpy array>]) +</code></pre> +</div> + +<p>Arrow uses Flatbuffers to encode serialized schemas. <strong>Using only the schema, we +can compute the offsets of each value in the data blob without scanning through +the data blob</strong> (unlike Pickle, this is what enables fast deserialization). This +means that we can avoid copying or otherwise converting large arrays and other +values during deserialization. Tensors are appended at the end of the UnionArray +and can be efficiently shared and accessed using shared memory.</p> + +<p>Note that the actual object would be laid out in memory as shown below.</p> + +<div align="center"> +<img src="/assets/fast_python_serialization_with_ray_and_arrow/python_object.png" width="600" /> +</div> +<div><i>The layout of a Python object in the heap. 
Each box is allocated in a +different memory region, and arrows between boxes represent pointers.</i></div> +<p><br /></p> + +<p>The Arrow serialized representation would be as follows.</p> + +<div align="center"> +<img src="/assets/fast_python_serialization_with_ray_and_arrow/arrow_object.png" width="400" /> +</div> +<div><i>The memory layout of the Arrow-serialized object.</i></div> +<p><br /></p> + +<h2 id="getting-involved">Getting Involved</h2> + +<p>We welcome contributions, especially in the following areas.</p> + +<ul> + <li>Use the C++ and Java implementations of Arrow to implement versions of this +for C++ and Java.</li> + <li>Implement support for more Python types and better test coverage.</li> +</ul> + +<h2 id="reproducing-the-figures-above">Reproducing the Figures Above</h2> + +<p>For reference, the figures can be reproduced with the following code. +Benchmarking <code class="highlighter-rouge">ray.put</code> and <code class="highlighter-rouge">ray.get</code> instead of <code class="highlighter-rouge">pyarrow.serialize</code> and +<code class="highlighter-rouge">pyarrow.deserialize</code> gives similar figures. 
The plots were generated at this +<a href="https://github.com/apache/arrow/tree/894f7400977693b4e0e8f4b9845fd89481f6bf29">commit</a>.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pickle</span> +<span class="kn">import</span> <span class="nn">pyarrow</span> +<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span> +<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span> +<span class="kn">import</span> <span class="nn">timeit</span> + + +<span class="k">def</span> <span class="nf">benchmark_object</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">number</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span> + <span class="c"># Time serialization and deserialization for pickle.</span> + <span class="n">pickle_serialize</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span> + <span class="k">lambda</span><span class="p">:</span> <span class="n">pickle</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">protocol</span><span class="o">=</span><span class="n">pickle</span><span class="o">.</span><span class="n">HIGHEST_PROTOCOL</span><span class="p">),</span> + <span class="n">number</span><span class="o">=</span><span class="n">number</span><span class="p">)</span> + <span class="n">serialized_obj</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">pickle</span><span class="o">.</span><span class="n">HIGHEST_PROTOCOL</span><span class="p">)</span> + <span 
class="n">pickle_deserialize</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span><span class="k">lambda</span><span class="p">:</span> <span class="n">pickle</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">serialized_obj</span><span class="p">),</span> + <span class="n">number</span><span class="o">=</span><span class="n">number</span><span class="p">)</span> + + <span class="c"># Time serialization and deserialization for Ray.</span> + <span class="n">ray_serialize</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span> + <span class="k">lambda</span><span class="p">:</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="n">obj</span><span class="p">)</span><span class="o">.</span><span class="n">to_buffer</span><span class="p">(),</span> <span class="n">number</span><span class="o">=</span><span class="n">number</span><span class="p">)</span> + <span class="n">serialized_obj</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="n">obj</span><span class="p">)</span><span class="o">.</span><span class="n">to_buffer</span><span class="p">()</span> + <span class="n">ray_deserialize</span> <span class="o">=</span> <span class="n">timeit</span><span class="o">.</span><span class="n">timeit</span><span class="p">(</span> + <span class="k">lambda</span><span class="p">:</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">deserialize</span><span class="p">(</span><span class="n">serialized_obj</span><span class="p">),</span> <span class="n">number</span><span class="o">=</span><span class="n">number</span><span class="p">)</span> + + <span 
class="k">return</span> <span class="p">[[</span><span class="n">pickle_serialize</span><span class="p">,</span> <span class="n">pickle_deserialize</span><span class="p">],</span> + <span class="p">[</span><span class="n">ray_serialize</span><span class="p">,</span> <span class="n">ray_deserialize</span><span class="p">]]</span> + + +<span class="k">def</span> <span class="nf">plot</span><span class="p">(</span><span class="n">pickle_times</span><span class="p">,</span> <span class="n">ray_times</span><span class="p">,</span> <span class="n">title</span><span class="p">,</span> <span class="n">i</span><span class="p">):</span> + <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">()</span> + <span class="n">fig</span><span class="o">.</span><span class="n">set_size_inches</span><span class="p">(</span><span class="mf">3.8</span><span class="p">,</span> <span class="mf">2.7</span><span class="p">)</span> + + <span class="n">bar_width</span> <span class="o">=</span> <span class="mf">0.35</span> + <span class="n">index</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> + <span class="n">opacity</span> <span class="o">=</span> <span class="mf">0.6</span> + + <span class="n">plt</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">index</span><span class="p">,</span> <span class="n">pickle_times</span><span class="p">,</span> <span class="n">bar_width</span><span class="p">,</span> + <span class="n">alpha</span><span class="o">=</span><span class="n">opacity</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span 
class="s">'Pickle'</span><span class="p">)</span> + + <span class="n">plt</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">index</span> <span class="o">+</span> <span class="n">bar_width</span><span class="p">,</span> <span class="n">ray_times</span><span class="p">,</span> <span class="n">bar_width</span><span class="p">,</span> + <span class="n">alpha</span><span class="o">=</span><span class="n">opacity</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'c'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Ray'</span><span class="p">)</span> + + <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="n">title</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span> + <span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Time (seconds)'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> + <span class="n">labels</span> <span class="o">=</span> <span class="p">[</span><span class="s">'serialization'</span><span class="p">,</span> <span class="s">'deserialization'</span><span class="p">]</span> + <span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">index</span> <span class="o">+</span> <span class="n">bar_width</span> <span class="o">/</span> <span class="mi">2</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> + <span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">(</span><span 
class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">bbox_to_anchor</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span> + <span class="n">plt</span><span class="o">.</span><span class="n">tight_layout</span><span class="p">()</span> + <span class="n">plt</span><span class="o">.</span><span class="n">yticks</span><span class="p">(</span><span class="n">fontsize</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span> + <span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">'plot-'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">+</span> <span class="s">'.png'</span><span class="p">,</span> <span class="n">format</span><span class="o">=</span><span class="s">'png'</span><span class="p">)</span> + + +<span class="n">test_objects</span> <span class="o">=</span> <span class="p">[</span> + <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">50000</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">)],</span> + <span class="p">{</span><span class="s">'weight-'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">):</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">50000</span><span class="p">)</span> <span class="k">for</span> <span 
class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">)},</span> + <span class="p">{</span><span class="n">i</span><span class="p">:</span> <span class="nb">set</span><span class="p">([</span><span class="s">'string1'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">),</span> <span class="s">'string2'</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100000</span><span class="p">)},</span> + <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">200000</span><span class="p">)]</span> +<span class="p">]</span> + +<span class="n">titles</span> <span class="o">=</span> <span class="p">[</span> + <span class="s">'List of large numpy arrays'</span><span class="p">,</span> + <span class="s">'Dictionary of large numpy arrays'</span><span class="p">,</span> + <span class="s">'Large dictionary of small sets'</span><span class="p">,</span> + <span class="s">'Large list of strings'</span> +<span class="p">]</span> + +<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">test_objects</span><span class="p">)):</span> + <span class="n">plot</span><span class="p">(</span><span class="o">*</span><span class="n">benchmark_object</span><span class="p">(</span><span class="n">test_objects</span><span class="p">[</span><span 
class="n">i</span><span class="p">]),</span> <span class="n">titles</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">i</span><span class="p">)</span> +</code></pre> +</div> + + + </div> + + + + + + <div class="container"> + <h2> + Apache Arrow 0.7.0 Release + <a href="/blog/2017/09/19/0.7.0-release/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 19 Sep 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (wesm)</a> + </div> + </div> + </div> + <!-- + +--> + +<p>The Apache Arrow team is pleased to announce the 0.7.0 release. It includes +<a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.7.0"><strong>133 resolved JIRAs</strong></a> with many new features and bug fixes to the various +language implementations. The Arrow memory format remains stable since the +0.3.x release.</p> + +<p>See the <a href="http://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your +platform. The <a href="http://arrow.apache.org/release/0.7.0.html">complete changelog</a> is also available.</p> + +<p>We include some highlights from the release in this post.</p> + +<h2 id="new-pmc-member-kouhei-sutou">New PMC Member: Kouhei Sutou</h2> + +<p>Since the last release we have added <a href="https://github.com/kou">Kou</a> to the Arrow Project Management +Committee. 
He is also a PMC member of Apache Subversion, and a major contributor to +many other open source projects.</p> + +<p>As an active member of the Ruby community in Japan, Kou has been developing the +GLib-based C bindings for Arrow with associated Ruby wrappers, to enable Ruby +users to benefit from the work that's happening in Apache Arrow.</p> + +<p>We are excited to be collaborating with the Ruby community on shared +infrastructure for in-memory analytics and data science.</p> + +<h2 id="expanded-javascript-typescript-implementation">Expanded JavaScript (TypeScript) Implementation</h2> + +<p><a href="https://github.com/trxcllnt">Paul Taylor</a> from the <a href="https://github.com/netflix/falcor">Falcor</a> and <a href="http://reactivex.io">ReactiveX</a> projects has worked to +expand the JavaScript implementation (which is written in TypeScript), using +the latest in modern JavaScript build and packaging technology. We are looking +forward to building out the JS implementation and bringing it up to full +functionality with the C++ and Java implementations.</p> + +<p>We are looking for more JavaScript developers to join the project and work +together to make Arrow for JS work well with many kinds of front-end use cases, +like real-time data visualization.</p> + +<h2 id="type-casting-for-c-and-python">Type casting for C++ and Python</h2> + +<p>As part of longer-term efforts to build an Arrow-native in-memory analytics +library, we implemented a variety of type conversion functions. These functions +are essential in ETL tasks when conforming one table schema to another. 
These +are similar to the <code class="highlighter-rouge">astype</code> function in NumPy.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">In</span> <span class="p">[</span><span class="mi">17</span><span class="p">]:</span> <span class="kn">import</span> <span class="nn">pyarrow</span> <span class="kn">as</span> <span class="nn">pa</span> + +<span class="n">In</span> <span class="p">[</span><span class="mi">18</span><span class="p">]:</span> <span class="n">arr</span> <span class="o">=</span> <span class="n">pa</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="bp">True</span><span class="p">,</span> <span class="bp">False</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">True</span><span class="p">])</span> + +<span class="n">In</span> <span class="p">[</span><span class="mi">19</span><span class="p">]:</span> <span class="n">arr</span> +<span class="n">Out</span><span class="p">[</span><span class="mi">19</span><span class="p">]:</span> +<span class="o"><</span><span class="n">pyarrow</span><span class="o">.</span><span class="n">lib</span><span class="o">.</span><span class="n">BooleanArray</span> <span class="nb">object</span> <span class="n">at</span> <span class="mh">0x7ff6fb069b88</span><span class="o">></span> +<span class="p">[</span> + <span class="bp">True</span><span class="p">,</span> + <span class="bp">False</span><span class="p">,</span> + <span class="n">NA</span><span class="p">,</span> + <span class="bp">True</span> +<span class="p">]</span> + +<span class="n">In</span> <span class="p">[</span><span class="mi">20</span><span class="p">]:</span> <span class="n">arr</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">pa</span><span class="o">.</span><span class="n">int32</span><span class="p">())</span> +<span class="n">Out</span><span class="p">[</span><span 
class="mi">20</span><span class="p">]:</span> +<span class="o"><</span><span class="n">pyarrow</span><span class="o">.</span><span class="n">lib</span><span class="o">.</span><span class="n">Int32Array</span> <span class="nb">object</span> <span class="n">at</span> <span class="mh">0x7ff6fb0383b8</span><span class="o">></span> +<span class="p">[</span> + <span class="mi">1</span><span class="p">,</span> + <span class="mi">0</span><span class="p">,</span> + <span class="n">NA</span><span class="p">,</span> + <span class="mi">1</span> +<span class="p">]</span> +</code></pre> +</div> + +<p>Over time these will expand to support more input and output type +combinations, with optimized conversions.</p> + +<h2 id="new-arrow-gpu-cuda-extension-library-for-c">New Arrow GPU (CUDA) Extension Library for C++</h2> + +<p>To help with GPU-related projects using Arrow, like the <a href="http://gpuopenanalytics.com/">GPU Open Analytics +Initiative</a>, we have started a C++ add-on library to simplify Arrow memory +management on CUDA-enabled graphics cards. 
We would like to expand this to +include a library of reusable CUDA kernel functions for GPU analytics on Arrow +columnar memory.</p> + +<p>For example, we could write a record batch from CPU memory to GPU device memory +like so (some error checking omitted):</p> + +<div class="language-c++ highlighter-rouge"><pre class="highlight"><code><span class="cp">#include <arrow/api.h> +#include <arrow/gpu/cuda_api.h> +</span> +<span class="k">using</span> <span class="k">namespace</span> <span class="n">arrow</span><span class="p">;</span> + +<span class="n">gpu</span><span class="o">::</span><span class="n">CudaDeviceManager</span><span class="o">*</span> <span class="n">manager</span><span class="p">;</span> +<span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">gpu</span><span class="o">::</span><span class="n">CudaContext</span><span class="o">></span> <span class="n">context</span><span class="p">;</span> + +<span class="n">gpu</span><span class="o">::</span><span class="n">CudaDeviceManager</span><span class="o">::</span><span class="n">GetInstance</span><span class="p">(</span><span class="o">&</span><span class="n">manager</span><span class="p">);</span> +<span class="n">manager</span><span class="o">-></span><span class="n">GetContext</span><span class="p">(</span><span class="n">kGpuNumber</span><span class="p">,</span> <span class="o">&</span><span class="n">context</span><span class="p">);</span> + +<span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">RecordBatch</span><span class="o">></span> <span class="n">batch</span> <span class="o">=</span> <span class="n">GetCpuData</span><span class="p">();</span> + +<span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o"><</span><span class="n">gpu</span><span class="o">::</span><span class="n">CudaBuffer</span><span class="o">></span> 
<span class="n">device_serialized</span><span class="p">;</span> +<span class="n">gpu</span><span class="o">::</span><span class="n">SerializeRecordBatch</span><span class="p">(</span><span class="o">*</span><span class="n">batch</span><span class="p">,</span> <span class="n">context</span><span class="p">.</span><span class="n">get</span><span class="p">(),</span> <span class="o">&</span><span class="n">device_serialized</span><span class="p">);</span> +</code></pre> +</div> + +<p>We can then “read” the GPU record batch, but the returned <code class="highlighter-rouge">arrow::RecordBatch</code> +internally will contain GPU device pointers that you can use for CUDA kernel +calls:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>std::shared_ptr<RecordBatch> device_batch; +gpu::ReadRecordBatch(batch->schema(), device_serialized, + default_memory_pool(), &device_batch); + +// Now run some CUDA kernels on device_batch +</code></pre> +</div> + +<h2 id="decimal-integration-tests">Decimal Integration Tests</h2> + +<p><a href="http://github.com/cpcloud">Phillip Cloud</a> has been working on decimal support in C++ to enable Parquet +read/write support in C++ and Python, and also end-to-end testing against the +Arrow Java libraries.</p> + +<p>In the upcoming releases, we hope to complete the remaining data types that +need end-to-end testing between Java and C++:</p> + +<ul> + <li>Fixed-size lists (variable-size lists already implemented)</li> + <li>Fixed-size binary</li> + <li>Unions</li> + <li>Maps</li> + <li>Time intervals</li> +</ul> + +<h2 id="other-notable-python-changes">Other Notable Python Changes</h2> + +<p>Some highlights of Python development outside of bug fixes and general API +improvements include:</p> + +<ul> + <li>Simplified <code class="highlighter-rouge">put</code> and <code class="highlighter-rouge">get</code> of arbitrary Python objects in Plasma</li> + <li><a href="http://arrow.apache.org/docs/python/ipc.html">High-speed, 
memory-efficient object serialization</a>. This is important +enough that we will likely write a dedicated blog post about it.</li> + <li>New <code class="highlighter-rouge">flavor='spark'</code> option to <code class="highlighter-rouge">pyarrow.parquet.write_table</code> to enable easy +writing of Parquet files maximized for Spark compatibility</li> + <li><code class="highlighter-rouge">parquet.write_to_dataset</code> function with support for partitioned writes</li> + <li>Improved support for Dask filesystems</li> + <li>Improved Python usability for IPC: read and write schemas and record batches +more easily. See the <a href="http://arrow.apache.org/docs/python/api.html">API docs</a> for more about these.</li> +</ul> + +<h2 id="the-road-ahead">The Road Ahead</h2> + +<p>Upcoming Arrow releases will continue to expand the project to cover more use +cases. In addition to completing end-to-end testing for all the major data +types, some of us will be shifting attention to building Arrow-native in-memory +analytics libraries.</p> + +<p>We are looking for more JavaScript, R, and other programming language +developers to join the project and expand the available implementations and +bindings to more languages.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Apache Arrow 0.6.0 Release + <a href="/blog/2017/08/16/0.6.0-release/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 16 Aug 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (wesm)</a> + </div> + </div> + </div> + <!-- + +--> + +<p>The Apache Arrow team is pleased to announce the 0.6.0 release. 
It includes +<a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.6.0"><strong>90 resolved JIRAs</strong></a>, including the new Plasma shared memory object store, along with +improvements and bug fixes to the various language implementations. The Arrow +memory format has remained stable since the 0.3.x release.</p> + +<p>See the <a href="http://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your +platform. The <a href="http://arrow.apache.org/release/0.6.0.html">complete changelog</a> is also available.</p> + +<h2 id="plasma-shared-memory-object-store">Plasma Shared Memory Object Store</h2> + +<p>This release includes the <a href="http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/">Plasma Store</a>, which you can read more about in +the linked blog post. This system was originally developed as part of the <a href="https://ray-project.github.io/ray/">Ray +Project</a> at the <a href="https://rise.cs.berkeley.edu/">UC Berkeley RISELab</a>. We recognized that Plasma would be +highly valuable to the Arrow community as a tool for shared memory management +and zero-copy deserialization. Additionally, we believe we will be able to +develop a stronger software stack through sharing of IO and buffer management +code.</p> + +<p>The Plasma store is a server application which runs as a separate process. A +reference C++ client, with Python bindings, is made available in this +release. Clients can be developed in Java or other languages in the future to +enable simple sharing of complex datasets through shared memory.</p> + +<h2 id="arrow-format-addition-map-type">Arrow Format Addition: Map type</h2> + +<p>We added a Map logical type to represent ordered and unordered maps +in-memory.
This corresponds to the <code class="highlighter-rouge">MAP</code> logical type annotation in the Parquet +format (where maps are represented as repeated structs).</p> + +<p>Map is represented as a list of structs. It is the first example of a logical +type whose physical representation is a nested type. We have not yet created +Map container implementations in any of the languages, but this can +be done in a future release.</p> + +<p>As an example, the Python data:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>data = [{'a': 1, 'bb': 2, 'cc': 3}, {'dddd': 4}] +</code></pre> +</div> + +<p>Could be represented in an Arrow <code class="highlighter-rouge">Map<String, Int32></code> as:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>Map<String, Int32> = List<Struct<keys: String, values: Int32>> + is_valid: [true, true] + offsets: [0, 3, 4] + values: Struct<keys: String, values: Int32> + children: + - keys: String + is_valid: [true, true, true, true] + offsets: [0, 1, 3, 5, 9] + data: abbccdddd + - values: Int32 + is_valid: [true, true, true, true] + data: [1, 2, 3, 4] +</code></pre> +</div> +<h2 id="python-changes">Python Changes</h2> + +<p>Some highlights of Python development outside of bug fixes and general API +improvements include:</p> + +<ul> + <li>New <code class="highlighter-rouge">strings_to_categorical=True</code> option when calling <code class="highlighter-rouge">Table.to_pandas</code> will +yield pandas <code class="highlighter-rouge">Categorical</code> types from Arrow binary and string columns</li> + <li>Expanded Hadoop Filesystem (HDFS) functionality to improve compatibility with +Dask and other HDFS-aware Python libraries.</li> + <li>s3fs and other Dask-oriented filesystems can now be used with +<code class="highlighter-rouge">pyarrow.parquet.ParquetDataset</code></li> + <li>More graceful handling of pandas’s nanosecond timestamps when writing to +Parquet format.
You can now pass <code class="highlighter-rouge">coerce_timestamps='ms'</code> to cast to +milliseconds, or <code class="highlighter-rouge">'us'</code> for microseconds.</li> +</ul> + +<h2 id="toward-arrow-100-and-beyond">Toward Arrow 1.0.0 and Beyond</h2> + +<p>We are still discussing the roadmap to the 1.0.0 release on the <a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">developer mailing +list</a>. The focus of the 1.0.0 release will likely be memory format stability +and hardening integration tests across the remaining data types implemented in +Java and C++. Please join the discussion there.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Plasma In-Memory Object Store + <a href="/blog/2017/08/08/plasma-in-memory-object-store/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 08 Aug 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (Philipp Moritz and Robert Nishihara)</a> + </div> + </div> + </div> + <!-- + +--> + +<p><em><a href="https://people.eecs.berkeley.edu/~pcmoritz/">Philipp Moritz</a> and <a href="http://www.robertnishihara.com">Robert Nishihara</a> are graduate students at UC + Berkeley.</em></p> + +<h2 id="plasma-a-high-performance-shared-memory-object-store">Plasma: A High-Performance Shared-Memory Object Store</h2> + +<h3 id="motivating-plasma">Motivating Plasma</h3> + +<p>This blog post presents Plasma, an in-memory object store that is being +developed as part of Apache Arrow.
<strong>Plasma holds immutable +objects in shared +memory so that they can be accessed efficiently by many clients across process +boundaries.</strong> In light of the trend toward larger and larger multicore machines, +Plasma enables critical performance optimizations in the big data regime.</p> + +<p>Plasma was initially developed as part of <a href="https://github.com/ray-project/ray">Ray</a>, and has recently been moved +to Apache Arrow in the hopes that it will be broadly useful.</p> + +<p>One of the goals of Apache Arrow is to serve as a common data layer enabling +zero-copy data exchange between multiple frameworks. A key component of this +vision is the use of off-heap memory management (via Plasma) for storing and +sharing Arrow-serialized objects between applications.</p> + +<p><strong>Expensive serialization and deserialization as well as data copying are a +common performance bottleneck in distributed computing.</strong> For example, a +Python-based execution framework that wishes to distribute computation across +multiple Python “worker” processes and then aggregate the results in a single +“driver” process may choose to serialize data using the built-in <code class="highlighter-rouge">pickle</code> +library. Assuming one Python process per core, each worker process would have to +copy and deserialize the data, resulting in excessive memory usage. The driver +process would then have to deserialize results from each of the workers, +resulting in a bottleneck.</p> + +<p>Using Plasma plus Arrow, the data being operated on would be placed in the +Plasma store once, and all of the workers would read the data without copying or +deserializing it (the workers would map the relevant region of memory into their +own address spaces).
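</p>

<p>As a rough sketch of this shared-memory pattern, here is an illustration using Python’s standard-library <code class="highlighter-rouge">multiprocessing.shared_memory</code> as a stand-in for the Plasma client API (an assumption made for illustration only; this is not Plasma itself):</p>

```python
from multiprocessing import shared_memory

# "Driver" process: write the dataset into shared memory exactly once.
payload = bytes(range(10))
shm = shared_memory.SharedMemory(create=True, size=len(payload))
shm.buf[:len(payload)] = payload

# "Worker" process: attach to the same segment by name and read it as a
# zero-copy view -- no pickling, no per-worker copy of the data.
worker = shared_memory.SharedMemory(name=shm.name)
view = worker.buf[:len(payload)]
result = sum(view)  # operate directly on the shared bytes

# Release views before closing, then free the segment.
view.release()
worker.close()
shm.close()
shm.unlink()
```

<p>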
The workers would then put the results of their computation +back into the Plasma store, which the driver could then read and aggregate +without copying or deserializing the data.</p> + +<h3 id="the-plasma-api">The Plasma API:</h3> + +<p>Below we illustrate a subset of the API. The C++ API is documented more fully +<a href="https://github.com/apache/arrow/blob/master/cpp/apidoc/tutorials/plasma.md">here</a>, and the Python API is documented <a href="https://github.com/apache/arrow/blob/master/python/doc/source/plasma.rst">here</a>.</p> + +<p><strong>Object IDs:</strong> Each object is associated with a string of bytes.</p> + +<p><strong>Creating an object:</strong> Objects are stored in Plasma in two stages. First, the +object store <em>creates</em> the object by allocating a buffer for it. At this point, +the client can write to the buffer and construct the object within the allocated +buffer. When the client is done, the client <em>seals</em> the buffer making the object +immutable and making it available to other Plasma clients.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Create an object.</span> +<span class="n">object_id</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">plasma</span><span class="o">.</span><span class="n">ObjectID</span><span class="p">(</span><span class="mi">20</span> <span class="o">*</span> <span class="n">b</span><span class="s">'a'</span><span class="p">)</span> +<span class="n">object_size</span> <span class="o">=</span> <span class="mi">1000</span> +<span class="nb">buffer</span> <span class="o">=</span> <span class="n">memoryview</span><span class="p">(</span><span class="n">client</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">object_id</span><span class="p">,</span> <span class="n">object_size</span><span class="p">))</span> + +<span class="c"># Write to the buffer.</span> 
+<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span> + <span class="nb">buffer</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span> + +<span class="c"># Seal the object making it immutable and available to other clients.</span> +<span class="n">client</span><span class="o">.</span><span class="n">seal</span><span class="p">(</span><span class="n">object_id</span><span class="p">)</span> +</code></pre> +</div> + +<p><strong>Getting an object:</strong> After an object has been sealed, any client who knows the +object ID can get the object.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Get the object from the store. This blocks until the object has been sealed.</span> +<span class="n">object_id</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">plasma</span><span class="o">.</span><span class="n">ObjectID</span><span class="p">(</span><span class="mi">20</span> <span class="o">*</span> <span class="n">b</span><span class="s">'a'</span><span class="p">)</span> +<span class="p">[</span><span class="n">buff</span><span class="p">]</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">get</span><span class="p">([</span><span class="n">object_id</span><span class="p">])</span> +<span class="nb">buffer</span> <span class="o">=</span> <span class="n">memoryview</span><span class="p">(</span><span class="n">buff</span><span class="p">)</span> +</code></pre> +</div> + +<p>If the object has not been sealed yet, then the call to <code class="highlighter-rouge">client.get</code> will block +until the object has been sealed.</p> + +<h3 id="a-sorting-application">A sorting application</h3> + +<p>To illustrate the benefits of 
Plasma, we demonstrate an <strong>11x speedup</strong> (on a +machine with 20 physical cores) for sorting a large pandas DataFrame (one +billion entries). The baseline is the built-in pandas sort function, which sorts +the DataFrame in 477 seconds. To leverage multiple cores, we implement the +following standard distributed sorting scheme.</p> + +<ul> + <li>We assume that the data is partitioned across K pandas DataFrames and that +each one already lives in the Plasma store.</li> + <li>We subsample the data, sort the subsampled data, and use the result to define +L non-overlapping buckets.</li> + <li>For each of the K data partitions and each of the L buckets, we find the +subset of the data partition that falls in the bucket, and we sort that +subset.</li> + <li>For each of the L buckets, we gather all of the K sorted subsets that fall in +that bucket.</li> + <li>For each of the L buckets, we merge the corresponding K sorted subsets.</li> + <li>We turn each bucket into a pandas DataFrame and place it in the Plasma store.</li> +</ul> + +<p>Using this scheme, we can sort the DataFrame (the data starts and ends in the +Plasma store) in 44 seconds, giving an 11x speedup over the baseline.</p> + +<h3 id="design">Design</h3> + +<p>The Plasma store runs as a separate process. It is written in C++ and is +designed as a single-threaded event loop based on the <a href="https://redis.io/">Redis</a> event loop library. +The Plasma client library can be linked into applications. Clients communicate +with the Plasma store via messages serialized using <a href="https://google.github.io/flatbuffers/">Google FlatBuffers</a>.</p> + +<h3 id="call-for-contributions">Call for contributions</h3> + +<p>Plasma is a work in progress, and the API is currently unstable. Today Plasma is +primarily used in <a href="https://github.com/ray-project/ray">Ray</a> as an in-memory cache for Arrow-serialized objects. +We are looking for a broader set of use cases to help refine Plasma’s API.
In +addition, we are looking for contributions in a variety of areas including +improving performance and building other language bindings. Please let us know +if you are interested in getting involved with the project.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Speeding up PySpark with Apache Arrow + <a href="/blog/2017/07/26/spark-arrow/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 26 Jul 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (BryanCutler)</a> + </div> + </div> + </div> + <!-- + +--> + +<p><em><a href="https://github.com/BryanCutler">Bryan Cutler</a> is a software engineer at IBM’s Spark Technology Center <a href="http://www.spark.tc/">STC</a></em></p> + +<p>Beginning with <a href="https://spark.apache.org/">Apache Spark</a> version 2.3, <a href="https://arrow.apache.org/">Apache Arrow</a> will be a supported +dependency and begin to offer increased performance with columnar data transfer. +If you are a Spark user that prefers to work in Python and Pandas, this is a cause +to be excited over!
The initial work is limited to collecting a Spark DataFrame +with <code class="highlighter-rouge">toPandas()</code>, which I will discuss below; however, there are many additional +improvements that are currently <a href="https://issues.apache.org/jira/issues/?filter=12335725&jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22arrow%22%20ORDER%20BY%20createdDate%20DESC">underway</a>.</p> + +<h1 id="optimizing-spark-conversion-to-pandas">Optimizing Spark Conversion to Pandas</h1> + +<p>The previous way of converting a Spark DataFrame to Pandas with <code class="highlighter-rouge">DataFrame.toPandas()</code> +in PySpark was painfully inefficient. Basically, it worked by first collecting all +rows to the Spark driver. Next, each row would get serialized into Python’s pickle +format and sent to a Python worker process. This child process would unpickle each row into +a huge list of tuples. Finally, a Pandas DataFrame would be created from the list using +<code class="highlighter-rouge">pandas.DataFrame.from_records()</code>.</p> + +<p>This all might seem like standard procedure, but suffers from 2 glaring issues: 1) +even using cPickle, Python serialization is a slow process and 2) creating +a <code class="highlighter-rouge">pandas.DataFrame</code> using <code class="highlighter-rouge">from_records</code> must slowly iterate over the list of pure +Python data and convert each value to Pandas format.
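</p>

<p>Concretely, that older path amounts to something like the following sketch (simplified for illustration; the real PySpark internals differ):</p>

```python
import pickle

import pandas as pd

# Rows collected at the driver, pickled for transfer (a simplified
# stand-in for the real PySpark wire protocol).
rows = [(i, i / 100.0) for i in range(1000)]
payload = pickle.dumps(rows)

# The Python worker unpickles everything into a big list of tuples...
unpickled = pickle.loads(payload)

# ...and from_records must then iterate over every pure-Python value.
pdf = pd.DataFrame.from_records(unpickled, columns=["id", "x"])
```

<p>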
See <a href="https://gist.github.com/wesm/0cb5531b1c2e346a0007">here</a> for a detailed +analysis.</p> + +<p>Here is where Arrow really shines to help optimize these steps: 1) Once the data is +in Arrow memory format, there is no need to serialize/pickle anymore as Arrow data can +be sent directly to the Python process, 2) When the Arrow data is received in Python, +then pyarrow can utilize zero-copy methods to create a <code class="highlighter-rouge">pandas.DataFrame</code> from entire +chunks of data at once instead of processing individual scalar values. Additionally, +the conversion to Arrow data can be done on the JVM and pushed back for the Spark +executors to perform in parallel, drastically reducing the load on the driver.</p> + +<p>As of the merging of <a href="https://issues.apache.org/jira/browse/SPARK-13534">SPARK-13534</a>, the use of Arrow when calling <code class="highlighter-rouge">toPandas()</code> +needs to be enabled by setting the SQLConf “spark.sql.execution.arrow.enabled” to +“true”. Let’s look at a simple usage example.</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>Welcome to + ____ __ + / __/__ ___ _____/ /__ + _\ \/ _ \/ _ `/ __/ '_/ + /__ / .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT + /_/ + +Using Python version 2.7.13 (default, Dec 20 2016 23:09:15) +SparkSession available as 'spark'.
+ +In [1]: from pyspark.sql.functions import rand + ...: df = spark.range(1 << 22).toDF("id").withColumn("x", rand()) + ...: df.printSchema() + ...: +root + |-- id: long (nullable = false) + |-- x: double (nullable = false) + + +In [2]: %time pdf = df.toPandas() +CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s +Wall time: 20.7 s + +In [3]: spark.conf.set("spark.sql.execution.arrow.enabled", "true") + +In [4]: %time pdf = df.toPandas() +CPU times: user 40 ms, sys: 32 ms, total: 72 ms +Wall time: 737 ms + +In [5]: pdf.describe() +Out[5]: + id x +count 4.194304e+06 4.194304e+06 +mean 2.097152e+06 4.998996e-01 +std 1.210791e+06 2.887247e-01 +min 0.000000e+00 8.291929e-07 +25% 1.048576e+06 2.498116e-01 +50% 2.097152e+06 4.999210e-01 +75% 3.145727e+06 7.498380e-01 +max 4.194303e+06 9.999996e-01 +</code></pre> +</div> + +<p>This example was run locally on my laptop using Spark defaults so the times +shown should not be taken as precise. Even so, it is clear there is a huge +performance boost: Arrow took something that was excruciatingly slow +and sped it up to be barely noticeable.</p> + +<h1 id="notes-on-usage">Notes on Usage</h1> + +<p>Here are some things to keep in mind before making use of this new feature. At +the time of writing this, pyarrow will not be installed automatically with +pyspark and needs to be manually installed; see the installation <a href="https://github.com/apache/arrow/blob/master/site/install.md">instructions</a>. +It is planned to add pyarrow as a pyspark dependency so that +<code class="highlighter-rouge">> pip install pyspark</code> will also install pyarrow.</p> + +<p>Currently, the controlling SQLConf is disabled by default. This can be enabled +programmatically as in the example above or by adding the line +“spark.sql.execution.arrow.enabled=true” to <code class="highlighter-rouge">SPARK_HOME/conf/spark-defaults.conf</code>.</p> + +<p>Also, not all Spark data types are currently supported; support is limited to primitive +types.
Expanded type support is in the works and expected to also be in the Spark +2.3 release.</p> + +<h1 id="future-improvements">Future Improvements</h1> + +<p>As mentioned, this was just a first step in using Arrow to make life easier for +Spark Python users. A few exciting initiatives in the works are to allow for +vectorized UDF evaluation (<a href="https://issues.apache.org/jira/browse/SPARK-21190">SPARK-21190</a>, <a href="https://issues.apache.org/jira/browse/SPARK-21404">SPARK-21404</a>), and the ability +to apply a function on grouped data using a Pandas DataFrame (<a href="https://issues.apache.org/jira/browse/SPARK-20396">SPARK-20396</a>). +Just as Arrow helped in converting a Spark DataFrame to Pandas, it can also work in the +other direction when creating a Spark DataFrame from an existing Pandas +DataFrame (<a href="https://issues.apache.org/jira/browse/SPARK-20791">SPARK-20791</a>). Stay tuned for more!</p> + +<h1 id="collaborators">Collaborators</h1> + +<p>Reaching this first milestone was a group effort from both the Apache Arrow and +Spark communities. Thanks to the hard work of <a href="https://github.com/wesm">Wes McKinney</a>, <a href="https://github.com/icexelloss">Li Jin</a>, +<a href="https://github.com/holdenk">Holden Karau</a>, Reynold Xin, Wenchen Fan, Shane Knapp and many others who +helped push this effort forward.</p> + + + </div> + + + + + + <div class="container"> + <h2> + Apache Arrow 0.5.0 Release + <a href="/blog/2017/07/25/0.5.0-release/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div cla
<TRUNCATED>
