arrow-site git commit: Add Spark+Arrow blog post [Forced Update!]

wesm Thu, 27 Jul 2017 08:48:25 -0700

Repository: arrow-site
Updated Branches:
  refs/heads/asf-site b3256101e -> beb16696a (forced update)



Add Spark+Arrow blog post


Project: http://git-wip-us.apache.org/repos/asf/arrow-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow-site/commit/beb16696
Tree: http://git-wip-us.apache.org/repos/asf/arrow-site/tree/beb16696
Diff: http://git-wip-us.apache.org/repos/asf/arrow-site/diff/beb16696

Branch: refs/heads/asf-site
Commit: beb16696a2eb7801f0a2773a27fbb1156e79410a
Parents: 8936624
Author: Wes McKinney <[email protected]>
Authored: Thu Jul 27 11:28:56 2017 -0400
Committer: Wes McKinney <[email protected]>
Committed: Thu Jul 27 11:47:28 2017 -0400

----------------------------------------------------------------------
 blog/2017/07/26/spark-arrow/index.html | 259 ++++++++++++++++++++++++++++
 blog/index.html                        | 151 ++++++++++++++++
 feed.xml                               | 121 ++++++++++++-
 3 files changed, 530 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow-site/blob/beb16696/blog/2017/07/26/spark-arrow/index.html
----------------------------------------------------------------------
diff --git a/blog/2017/07/26/spark-arrow/index.html b/blog/2017/07/26/spark-arrow/index.html
new file mode 100644
index 0000000..2b483f9
--- /dev/null
+++ b/blog/2017/07/26/spark-arrow/index.html
@@ -0,0 +1,259 @@
+<!DOCTYPE html>
+<html lang="en-US">
+  <head>
+    <meta charset="UTF-8">
+    <title>Speeding up PySpark with Apache Arrow</title>
+    <meta http-equiv="X-UA-Compatible" content="IE=edge">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <meta name="generator" content="Jekyll v3.4.3">
+    <!-- The above 3 meta tags *must* come first in the head; any other head 
content must come *after* these tags -->
+    <link rel="icon" type="image/x-icon" href="/favicon.ico">
+
+    <title>Apache Arrow Homepage</title>
+    <link rel="stylesheet" 
href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900">
+
+    <link href="/css/main.css" rel="stylesheet">
+    <link href="/css/syntax.css" rel="stylesheet">
+    <script src="https://code.jquery.com/jquery-3.2.1.min.js"
+            integrity="sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4="
+            crossorigin="anonymous"></script>
+    <script src="/assets/javascripts/bootstrap.min.js"></script>
+  </head>
+
+
+
+<body class="wrap">
+  <div class="container">
+    <nav class="navbar navbar-default">
+  <div class="container-fluid">
+    <div class="navbar-header">
+      <button type="button" class="navbar-toggle" data-toggle="collapse" 
data-target="#arrow-navbar">
+        <span class="sr-only">Toggle navigation</span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+      </button>
+      <a class="navbar-brand" href="/">Apache 
Arrow&#8482;&nbsp;&nbsp;&nbsp;</a>
+    </div>
+
+    <!-- Collect the nav links, forms, and other content for toggling -->
+    <div class="collapse navbar-collapse" id="arrow-navbar">
+      <ul class="nav navbar-nav">
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">Project Links<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/install/">Install</a></li>
+            <li><a href="/blog/">Blog</a></li>
+            <li><a href="/release/">Releases</a></li>
+            <li><a href="https://issues.apache.org/jira/browse/ARROW">Issue 
Tracker</a></li>
+            <li><a href="https://github.com/apache/arrow">Source Code</a></li>
+            <li><a 
href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Mailing List</a></li>
+            <li><a href="https://apachearrowslackin.herokuapp.com">Slack 
Channel</a></li>
+            <li><a href="/committers/">Committers</a></li>
+          </ul>
+        </li>
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">Specification<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/docs/memory_layout.html">Memory Layout</a></li>
+            <li><a href="/docs/metadata.html">Metadata</a></li>
+            <li><a href="/docs/ipc.html">Messaging / IPC</a></li>
+          </ul>
+        </li>
+
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">Documentation<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="/docs/python">Python</a></li>
+            <li><a href="/docs/cpp">C++ API</a></li>
+            <li><a href="/docs/java">Java API</a></li>
+            <li><a href="/docs/c_glib">C GLib API</a></li>
+          </ul>
+        </li>
+        <!-- <li><a href="/blog">Blog</a></li> -->
+        <li class="dropdown">
+          <a href="#" class="dropdown-toggle" data-toggle="dropdown"
+             role="button" aria-haspopup="true"
+             aria-expanded="false">ASF Links<span class="caret"></span>
+          </a>
+          <ul class="dropdown-menu">
+            <li><a href="http://www.apache.org/">ASF Website</a></li>
+            <li><a href="http://www.apache.org/licenses/">License</a></li>
+            <li><a 
href="http://www.apache.org/foundation/sponsorship.html">Donate</a></li>
+            <li><a 
href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+            <li><a href="http://www.apache.org/security/">Security</a></li>
+          </ul>
+        </li>
+      </ul>
+      <a href="http://www.apache.org/">
+        <img style="float:right;" src="/img/asf_logo.svg" width="120px"/>
+      </a>
+      </div><!-- /.navbar-collapse -->
+    </div>
+  </nav>
+
+
+    <h2>
+      Speeding up PySpark with Apache Arrow
+      <a href="/blog/2017/07/26/spark-arrow/" class="permalink" 
title="Permalink">∞</a>
+    </h2>
+
+    
+
+    <div class="panel">
+      <div class="panel-body">
+        <div>
+          <span class="label label-default">Published</span>
+          <span class="published">
+            <i class="fa fa-calendar"></i>
+            26 Jul 2017
+          </span>
+        </div>
+        <div>
+          <span class="label label-default">By</span>
+          <a href="http://people.apache.org/~BryanCutler"><i class="fa fa-user"></i>  (BryanCutler)</a>
+        </div>
+      </div>
+    </div>
+
+    <!--
+
+-->
+
+<p><em><a href="https://github.com/BryanCutler">Bryan Cutler</a> is a software 
engineer at IBM’s Spark Technology Center <a 
href="http://www.spark.tc/">STC</a></em></p>
+
+<p>Beginning with <a href="https://spark.apache.org/">Apache Spark</a> version 
2.3, <a href="https://arrow.apache.org/">Apache Arrow</a> will be a supported
+dependency and begin to offer increased performance with columnar data 
transfer.
+If you are a Spark user who prefers to work in Python and Pandas, this is a 
cause
+for excitement! The initial work is limited to collecting a Spark DataFrame
+with <code class="highlighter-rouge">toPandas()</code>, which I will discuss 
below; however, many additional
+improvements are currently <a 
href="https://issues.apache.org/jira/issues/?filter=12335725&amp;jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22arrow%22%20ORDER%20BY%20createdDate%20DESC">underway</a>.</p>
+
+<h1 id="optimizing-spark-conversion-to-pandas">Optimizing Spark Conversion to 
Pandas</h1>
+
+<p>The previous way of converting a Spark DataFrame to Pandas with <code 
class="highlighter-rouge">DataFrame.toPandas()</code>
+in PySpark was painfully inefficient. It worked by first collecting 
all
+rows to the Spark driver. Next, each row was serialized into Python’s 
pickle
+format and sent to a Python worker process. This child process unpickled each 
row into
+a huge list of tuples. Finally, a Pandas DataFrame was created from the list 
using
+<code class="highlighter-rouge">pandas.DataFrame.from_records()</code>.</p>
+
+<p>This might all seem like standard procedure, but it suffers from two glaring 
issues: 1)
+even using cPickle, Python serialization is a slow process and 2) creating
+a <code class="highlighter-rouge">pandas.DataFrame</code> using <code 
class="highlighter-rouge">from_records</code> must slowly iterate over the list 
of pure
+Python data and convert each value to Pandas format. See <a 
href="https://gist.github.com/wesm/0cb5531b1c2e346a0007">here</a> for a detailed
+analysis.</p>
+
+<p>Here is where Arrow really shines to help optimize these steps: 1) once the 
data is
+in Arrow memory format, there is no need to serialize/pickle anymore, as Arrow 
data can
+be sent directly to the Python process; 2) when the Arrow data is received in 
Python,
+pyarrow can use zero-copy methods to create a <code 
class="highlighter-rouge">pandas.DataFrame</code> from entire
+chunks of data at once instead of processing individual scalar values. 
Additionally,
+the conversion to Arrow data can be done on the JVM and pushed back for the 
Spark
+executors to perform in parallel, drastically reducing the load on the 
driver.</p>
+
+<p>As of the merging of <a 
href="https://issues.apache.org/jira/browse/SPARK-13534">SPARK-13534</a>, the 
use of Arrow when calling <code class="highlighter-rouge">toPandas()</code>
+needs to be enabled by setting the SQLConf 
“spark.sql.execution.arrow.enable” to
+“true”.  Let’s look at a simple usage example.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
+      /_/
+
+Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
+SparkSession available as 'spark'.
+
+In [1]: from pyspark.sql.functions import rand
+   ...: df = spark.range(1 &lt;&lt; 22).toDF("id").withColumn("x", rand())
+   ...: df.printSchema()
+   ...: 
+root
+ |-- id: long (nullable = false)
+ |-- x: double (nullable = false)
+
+
+In [2]: %time pdf = df.toPandas()
+CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s
+Wall time: 20.7 s
+
+In [3]: spark.conf.set("spark.sql.execution.arrow.enable", "true")
+
+In [4]: %time pdf = df.toPandas()
+CPU times: user 40 ms, sys: 32 ms, total: 72 ms
+Wall time: 737 ms
+
+In [5]: pdf.describe()
+Out[5]: 
+                 id             x
+count  4.194304e+06  4.194304e+06
+mean   2.097152e+06  4.998996e-01
+std    1.210791e+06  2.887247e-01
+min    0.000000e+00  8.291929e-07
+25%    1.048576e+06  2.498116e-01
+50%    2.097152e+06  4.999210e-01
+75%    3.145727e+06  7.498380e-01
+max    4.194303e+06  9.999996e-01
+</code></pre>
+</div>
+
+<p>This example was run locally on my laptop using Spark defaults, so the times
+shown should not be taken as precise. Even so, it is clear there is a huge
+performance boost: using Arrow took something that was excruciatingly slow
+and sped it up to be barely noticeable.</p>
+
+<h1 id="notes-on-usage">Notes on Usage</h1>
+
+<p>Here are some things to keep in mind before making use of this new feature. 
At
+the time of writing, pyarrow is not installed automatically with
+pyspark and needs to be installed manually; see the installation <a 
href="https://github.com/apache/arrow/blob/master/site/install.md">instructions</a>.
+The plan is to add pyarrow as a pyspark dependency so that 
+<code class="highlighter-rouge">&gt; pip install pyspark</code> will also 
install pyarrow.</p>
+
+<p>Currently, the controlling SQLConf is disabled by default. It can be 
enabled
+programmatically as in the example above or by adding the line
+“spark.sql.execution.arrow.enable=true” to <code 
class="highlighter-rouge">SPARK_HOME/conf/spark-defaults.conf</code>.</p>
+
+<p>Also, not all Spark data types are currently supported; support is limited to 
primitive
+types. Expanded type support is in the works and is also expected to be in the 
Spark
+2.3 release.</p>
+
+<h1 id="future-improvements">Future Improvements</h1>
+
+<p>As mentioned, this was just a first step in using Arrow to make life easier 
for
+Spark Python users. A few exciting initiatives in the works are to allow for
+vectorized UDF evaluation (<a 
href="https://issues.apache.org/jira/browse/SPARK-21190">SPARK-21190</a>, <a 
href="https://issues.apache.org/jira/browse/SPARK-21404">SPARK-21404</a>), and 
the ability
+to apply a function on grouped data using a Pandas DataFrame (<a 
href="https://issues.apache.org/jira/browse/SPARK-20396">SPARK-20396</a>).
+Just as Arrow helped in converting a Spark DataFrame to Pandas, it can also work in the
+other direction when creating a Spark DataFrame from an existing Pandas
+DataFrame (<a 
href="https://issues.apache.org/jira/browse/SPARK-20791">SPARK-20791</a>). Stay 
tuned for more!</p>
+
+<h1 id="collaborators">Collaborators</h1>
+
+<p>Reaching this first milestone was a group effort from both the Apache Arrow 
and
+Spark communities. Thanks to the hard work of <a 
href="https://github.com/wesm">Wes McKinney</a>, <a 
href="https://github.com/icexelloss">Li Jin</a>,
+<a href="https://github.com/holdenk">Holden Karau</a>, Reynold Xin, Wenchen 
Fan, Shane Knapp and many others who
+helped push this effort forward.</p>
+
+
+
+    <hr/>
+<footer class="footer">
+  <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache 
Arrow project logo are either registered trademarks or trademarks of The Apache 
Software Foundation in the United States and other countries.</p>
+  <p>&copy; 2017 Apache Software Foundation</p>
+</footer>
+
+  </div>
+</body>
+</html>

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/beb16696/blog/index.html
----------------------------------------------------------------------
diff --git a/blog/index.html b/blog/index.html
index c8a810a..1350659 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -111,6 +111,157 @@
     
   <div class="container">
     <h2>
+      Speeding up PySpark with Apache Arrow
+      <a href="/blog/2017/07/26/spark-arrow/" class="permalink" 
title="Permalink">∞</a>
+    </h2>
+
+    
+
+    <div class="panel">
+      <div class="panel-body">
+        <div>
+          <span class="label label-default">Published</span>
+          <span class="published">
+            <i class="fa fa-calendar"></i>
+            26 Jul 2017
+          </span>
+        </div>
+        <div>
+          <span class="label label-default">By</span>
+          <a href=""><i class="fa fa-user"></i>  (BryanCutler)</a>
+        </div>
+      </div>
+    </div>
+    <!--
+
+-->
+
+<p><em><a href="https://github.com/BryanCutler">Bryan Cutler</a> is a software 
engineer at IBM’s Spark Technology Center <a 
href="http://www.spark.tc/">STC</a></em></p>
+
+<p>Beginning with <a href="https://spark.apache.org/">Apache Spark</a> version 
2.3, <a href="https://arrow.apache.org/">Apache Arrow</a> will be a supported
+dependency and begin to offer increased performance with columnar data 
transfer.
+If you are a Spark user who prefers to work in Python and Pandas, this is a 
cause
+for excitement! The initial work is limited to collecting a Spark DataFrame
+with <code class="highlighter-rouge">toPandas()</code>, which I will discuss 
below; however, many additional
+improvements are currently <a 
href="https://issues.apache.org/jira/issues/?filter=12335725&amp;jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22arrow%22%20ORDER%20BY%20createdDate%20DESC">underway</a>.</p>
+
+<h1 id="optimizing-spark-conversion-to-pandas">Optimizing Spark Conversion to 
Pandas</h1>
+
+<p>The previous way of converting a Spark DataFrame to Pandas with <code 
class="highlighter-rouge">DataFrame.toPandas()</code>
+in PySpark was painfully inefficient. It worked by first collecting 
all
+rows to the Spark driver. Next, each row was serialized into Python’s 
pickle
+format and sent to a Python worker process. This child process unpickled each 
row into
+a huge list of tuples. Finally, a Pandas DataFrame was created from the list 
using
+<code class="highlighter-rouge">pandas.DataFrame.from_records()</code>.</p>
+
+<p>This might all seem like standard procedure, but it suffers from two glaring 
issues: 1)
+even using cPickle, Python serialization is a slow process and 2) creating
+a <code class="highlighter-rouge">pandas.DataFrame</code> using <code 
class="highlighter-rouge">from_records</code> must slowly iterate over the list 
of pure
+Python data and convert each value to Pandas format. See <a 
href="https://gist.github.com/wesm/0cb5531b1c2e346a0007">here</a> for a detailed
+analysis.</p>
+
+<p>Here is where Arrow really shines to help optimize these steps: 1) once the 
data is
+in Arrow memory format, there is no need to serialize/pickle anymore, as Arrow 
data can
+be sent directly to the Python process; 2) when the Arrow data is received in 
Python,
+pyarrow can use zero-copy methods to create a <code 
class="highlighter-rouge">pandas.DataFrame</code> from entire
+chunks of data at once instead of processing individual scalar values. 
Additionally,
+the conversion to Arrow data can be done on the JVM and pushed back for the 
Spark
+executors to perform in parallel, drastically reducing the load on the 
driver.</p>
+
+<p>As of the merging of <a 
href="https://issues.apache.org/jira/browse/SPARK-13534">SPARK-13534</a>, the 
use of Arrow when calling <code class="highlighter-rouge">toPandas()</code>
+needs to be enabled by setting the SQLConf 
“spark.sql.execution.arrow.enable” to
+“true”.  Let’s look at a simple usage example.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
+      /_/
+
+Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
+SparkSession available as 'spark'.
+
+In [1]: from pyspark.sql.functions import rand
+   ...: df = spark.range(1 &lt;&lt; 22).toDF("id").withColumn("x", rand())
+   ...: df.printSchema()
+   ...: 
+root
+ |-- id: long (nullable = false)
+ |-- x: double (nullable = false)
+
+
+In [2]: %time pdf = df.toPandas()
+CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s
+Wall time: 20.7 s
+
+In [3]: spark.conf.set("spark.sql.execution.arrow.enable", "true")
+
+In [4]: %time pdf = df.toPandas()
+CPU times: user 40 ms, sys: 32 ms, total: 72 ms
+Wall time: 737 ms
+
+In [5]: pdf.describe()
+Out[5]: 
+                 id             x
+count  4.194304e+06  4.194304e+06
+mean   2.097152e+06  4.998996e-01
+std    1.210791e+06  2.887247e-01
+min    0.000000e+00  8.291929e-07
+25%    1.048576e+06  2.498116e-01
+50%    2.097152e+06  4.999210e-01
+75%    3.145727e+06  7.498380e-01
+max    4.194303e+06  9.999996e-01
+</code></pre>
+</div>
+
+<p>This example was run locally on my laptop using Spark defaults, so the times
+shown should not be taken as precise. Even so, it is clear there is a huge
+performance boost: using Arrow took something that was excruciatingly slow
+and sped it up to be barely noticeable.</p>
+
+<h1 id="notes-on-usage">Notes on Usage</h1>
+
+<p>Here are some things to keep in mind before making use of this new feature. 
At
+the time of writing, pyarrow is not installed automatically with
+pyspark and needs to be installed manually; see the installation <a 
href="https://github.com/apache/arrow/blob/master/site/install.md">instructions</a>.
+The plan is to add pyarrow as a pyspark dependency so that 
+<code class="highlighter-rouge">&gt; pip install pyspark</code> will also 
install pyarrow.</p>
+
+<p>Currently, the controlling SQLConf is disabled by default. It can be 
enabled
+programmatically as in the example above or by adding the line
+“spark.sql.execution.arrow.enable=true” to <code 
class="highlighter-rouge">SPARK_HOME/conf/spark-defaults.conf</code>.</p>
+
+<p>Also, not all Spark data types are currently supported; support is limited to 
primitive
+types. Expanded type support is in the works and is also expected to be in the 
Spark
+2.3 release.</p>
+
+<h1 id="future-improvements">Future Improvements</h1>
+
+<p>As mentioned, this was just a first step in using Arrow to make life easier 
for
+Spark Python users. A few exciting initiatives in the works are to allow for
+vectorized UDF evaluation (<a 
href="https://issues.apache.org/jira/browse/SPARK-21190">SPARK-21190</a>, <a 
href="https://issues.apache.org/jira/browse/SPARK-21404">SPARK-21404</a>), and 
the ability
+to apply a function on grouped data using a Pandas DataFrame (<a 
href="https://issues.apache.org/jira/browse/SPARK-20396">SPARK-20396</a>).
+Just as Arrow helped in converting a Spark DataFrame to Pandas, it can also work in the
+other direction when creating a Spark DataFrame from an existing Pandas
+DataFrame (<a 
href="https://issues.apache.org/jira/browse/SPARK-20791">SPARK-20791</a>). Stay 
tuned for more!</p>
+
+<h1 id="collaborators">Collaborators</h1>
+
+<p>Reaching this first milestone was a group effort from both the Apache Arrow 
and
+Spark communities. Thanks to the hard work of <a 
href="https://github.com/wesm">Wes McKinney</a>, <a 
href="https://github.com/icexelloss">Li Jin</a>,
+<a href="https://github.com/holdenk">Holden Karau</a>, Reynold Xin, Wenchen 
Fan, Shane Knapp and many others who
+helped push this effort forward.</p>
+
+
+  </div>
+
+  
+
+  
+    
+  <div class="container">
+    <h2>
       Apache Arrow 0.5.0 Release
       <a href="/blog/2017/07/25/0.5.0-release/" class="permalink" 
title="Permalink">∞</a>
     </h2>

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/beb16696/feed.xml
----------------------------------------------------------------------
diff --git a/feed.xml b/feed.xml
index c999bac..f01301e 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,123 @@
-<?xml version="1.0" encoding="utf-8"?><feed 
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" 
version="3.4.3">Jekyll</generator><link href="/feed.xml" rel="self" 
type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" 
/><updated>2017-07-25T13:38:25-04:00</updated><id>/</id><entry><title 
type="html">Apache Arrow 0.5.0 Release</title><link 
href="/blog/2017/07/25/0.5.0-release/" rel="alternate" type="text/html" 
title="Apache Arrow 0.5.0 Release" 
/><published>2017-07-25T00:00:00-04:00</published><updated>2017-07-25T00:00:00-04:00</updated><id>/blog/2017/07/25/0.5.0-release</id><content
 type="html" xml:base="/blog/2017/07/25/0.5.0-release/">&lt;!--
+<?xml version="1.0" encoding="utf-8"?><feed 
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" 
version="3.4.3">Jekyll</generator><link href="/feed.xml" rel="self" 
type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" 
/><updated>2017-07-27T11:28:36-04:00</updated><id>/</id><entry><title 
type="html">Speeding up PySpark with Apache Arrow</title><link 
href="/blog/2017/07/26/spark-arrow/" rel="alternate" type="text/html" 
title="Speeding up PySpark with Apache Arrow" 
/><published>2017-07-26T12:00:00-04:00</published><updated>2017-07-26T12:00:00-04:00</updated><id>/blog/2017/07/26/spark-arrow</id><content
 type="html" xml:base="/blog/2017/07/26/spark-arrow/">&lt;!--
+
+--&gt;
+
+&lt;p&gt;&lt;em&gt;&lt;a 
href=&quot;https://github.com/BryanCutler&quot;&gt;Bryan Cutler&lt;/a&gt; is a 
software engineer at IBM’s Spark Technology Center &lt;a 
href=&quot;http://www.spark.tc/&quot;&gt;STC&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
+
+&lt;p&gt;Beginning with &lt;a 
href=&quot;https://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt; version 
2.3, &lt;a href=&quot;https://arrow.apache.org/&quot;&gt;Apache Arrow&lt;/a&gt; 
will be a supported
+dependency and begin to offer increased performance with columnar data 
transfer.
+If you are a Spark user who prefers to work in Python and Pandas, this is a 
cause
+for excitement! The initial work is limited to collecting a Spark DataFrame
+with &lt;code class=&quot;highlighter-rouge&quot;&gt;toPandas()&lt;/code&gt;, 
which I will discuss below; however, many additional
+improvements are currently &lt;a 
href=&quot;https://issues.apache.org/jira/issues/?filter=12335725&amp;amp;jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22arrow%22%20ORDER%20BY%20createdDate%20DESC&quot;&gt;underway&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h1 id=&quot;optimizing-spark-conversion-to-pandas&quot;&gt;Optimizing 
Spark Conversion to Pandas&lt;/h1&gt;
+
+&lt;p&gt;The previous way of converting a Spark DataFrame to Pandas with 
&lt;code 
class=&quot;highlighter-rouge&quot;&gt;DataFrame.toPandas()&lt;/code&gt;
+in PySpark was painfully inefficient. It worked by first collecting 
all
+rows to the Spark driver. Next, each row was serialized into Python’s 
pickle
+format and sent to a Python worker process. This child process unpickled each 
row into
+a huge list of tuples. Finally, a Pandas DataFrame was created from the list 
using
+&lt;code 
class=&quot;highlighter-rouge&quot;&gt;pandas.DataFrame.from_records()&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;This might all seem like standard procedure, but it suffers from two 
glaring issues: 1)
+even using cPickle, Python serialization is a slow process and 2) creating
+a &lt;code 
class=&quot;highlighter-rouge&quot;&gt;pandas.DataFrame&lt;/code&gt; using 
&lt;code class=&quot;highlighter-rouge&quot;&gt;from_records&lt;/code&gt; must 
slowly iterate over the list of pure
+Python data and convert each value to Pandas format. See &lt;a 
href=&quot;https://gist.github.com/wesm/0cb5531b1c2e346a0007&quot;&gt;here&lt;/a&gt;
 for a detailed
+analysis.&lt;/p&gt;
+
+&lt;p&gt;Here is where Arrow really shines to help optimize these steps: 1) 
once the data is
+in Arrow memory format, there is no need to serialize/pickle anymore, as Arrow 
data can
+be sent directly to the Python process; 2) when the Arrow data is received in 
Python,
+pyarrow can use zero-copy methods to create a &lt;code 
class=&quot;highlighter-rouge&quot;&gt;pandas.DataFrame&lt;/code&gt; from entire
+chunks of data at once instead of processing individual scalar values. 
Additionally,
+the conversion to Arrow data can be done on the JVM and pushed back for the 
Spark
+executors to perform in parallel, drastically reducing the load on the 
driver.&lt;/p&gt;
+
+&lt;p&gt;As of the merging of &lt;a 
href=&quot;https://issues.apache.org/jira/browse/SPARK-13534&quot;&gt;SPARK-13534&lt;/a&gt;,
 the use of Arrow when calling &lt;code 
class=&quot;highlighter-rouge&quot;&gt;toPandas()&lt;/code&gt;
+needs to be enabled by setting the SQLConf 
“spark.sql.execution.arrow.enable” to
+“true”.  Let’s look at a simple usage example.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
+      /_/
+
+Using Python version 2.7.13 (default, Dec 20 2016 23:09:15)
+SparkSession available as 'spark'.
+
+In [1]: from pyspark.sql.functions import rand
+   ...: df = spark.range(1 &amp;lt;&amp;lt; 
22).toDF(&quot;id&quot;).withColumn(&quot;x&quot;, rand())
+   ...: df.printSchema()
+   ...: 
+root
+ |-- id: long (nullable = false)
+ |-- x: double (nullable = false)
+
+
+In [2]: %time pdf = df.toPandas()
+CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s
+Wall time: 20.7 s
+
+In [3]: spark.conf.set(&quot;spark.sql.execution.arrow.enable&quot;, 
&quot;true&quot;)
+
+In [4]: %time pdf = df.toPandas()
+CPU times: user 40 ms, sys: 32 ms, total: 72 ms
+Wall time: 737 ms
+
+In [5]: pdf.describe()
+Out[5]: 
+                 id             x
+count  4.194304e+06  4.194304e+06
+mean   2.097152e+06  4.998996e-01
+std    1.210791e+06  2.887247e-01
+min    0.000000e+00  8.291929e-07
+25%    1.048576e+06  2.498116e-01
+50%    2.097152e+06  4.999210e-01
+75%    3.145727e+06  7.498380e-01
+max    4.194303e+06  9.999996e-01
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;This example was run locally on my laptop using Spark defaults, so the 
times
+shown should not be taken as precise. Even so, it is clear there is a huge
+performance boost: using Arrow took something that was excruciatingly slow
+and sped it up to be barely noticeable.&lt;/p&gt;
+
+&lt;h1 id=&quot;notes-on-usage&quot;&gt;Notes on Usage&lt;/h1&gt;
+
+&lt;p&gt;Here are some things to keep in mind before making use of this new 
feature. At
+the time of writing, pyarrow is not installed automatically with
+pyspark and needs to be installed manually; see the installation &lt;a 
href=&quot;https://github.com/apache/arrow/blob/master/site/install.md&quot;&gt;instructions&lt;/a&gt;.
+The plan is to add pyarrow as a pyspark dependency so that 
+&lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;gt; pip install 
pyspark&lt;/code&gt; will also install pyarrow.&lt;/p&gt;
+
+&lt;p&gt;Currently, the controlling SQLConf is disabled by default. It can 
be enabled
+programmatically as in the example above or by adding the line
+“spark.sql.execution.arrow.enable=true” to &lt;code 
class=&quot;highlighter-rouge&quot;&gt;SPARK_HOME/conf/spark-defaults.conf&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;Also, not all Spark data types are currently supported; support is limited to 
primitive
+types. Expanded type support is in the works and is also expected to be in the 
Spark
+2.3 release.&lt;/p&gt;
+
+&lt;h1 id=&quot;future-improvements&quot;&gt;Future Improvements&lt;/h1&gt;
+
+&lt;p&gt;As mentioned, this was just a first step in using Arrow to make life 
easier for
+Spark Python users. A few exciting initiatives in the works are to allow for
+vectorized UDF evaluation (&lt;a 
href=&quot;https://issues.apache.org/jira/browse/SPARK-21190&quot;&gt;SPARK-21190&lt;/a&gt;,
 &lt;a 
href=&quot;https://issues.apache.org/jira/browse/SPARK-21404&quot;&gt;SPARK-21404&lt;/a&gt;),
 and the ability
+to apply a function on grouped data using a Pandas DataFrame (&lt;a 
href=&quot;https://issues.apache.org/jira/browse/SPARK-20396&quot;&gt;SPARK-20396&lt;/a&gt;).
+Just as Arrow helped in converting a Spark DataFrame to Pandas, it can also work in the
+other direction when creating a Spark DataFrame from an existing Pandas
+DataFrame (&lt;a 
href=&quot;https://issues.apache.org/jira/browse/SPARK-20791&quot;&gt;SPARK-20791&lt;/a&gt;).
 Stay tuned for more!&lt;/p&gt;
+
+&lt;h1 id=&quot;collaborators&quot;&gt;Collaborators&lt;/h1&gt;
+
+&lt;p&gt;Reaching this first milestone was a group effort from both the Apache 
Arrow and
+Spark communities. Thanks to the hard work of &lt;a 
href=&quot;https://github.com/wesm&quot;&gt;Wes McKinney&lt;/a&gt;, &lt;a 
href=&quot;https://github.com/icexelloss&quot;&gt;Li Jin&lt;/a&gt;,
+&lt;a href=&quot;https://github.com/holdenk&quot;&gt;Holden Karau&lt;/a&gt;, 
Reynold Xin, Wenchen Fan, Shane Knapp and many others who
+helped push this effort 
forward.&lt;/p&gt;</content><author><name>BryanCutler</name></author></entry><entry><title
 type="html">Apache Arrow 0.5.0 Release</title><link 
href="/blog/2017/07/25/0.5.0-release/" rel="alternate" type="text/html" 
title="Apache Arrow 0.5.0 Release" 
/><published>2017-07-25T00:00:00-04:00</published><updated>2017-07-25T00:00:00-04:00</updated><id>/blog/2017/07/25/0.5.0-release</id><content
 type="html" xml:base="/blog/2017/07/25/0.5.0-release/">&lt;!--
 
 --&gt;
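
The blog post added in this commit contrasts the old row-at-a-time toPandas() path (pickle every row, unpickle into a list of tuples, then build columns value by value) with Arrow's columnar transfer. The sketch below illustrates that contrast using only the Python standard library. It is not Spark's or pyarrow's implementation: the `rows` data and both function names are hypothetical, and `array` plus `memoryview` merely stand in for Arrow's typed column buffers.

```python
import pickle
from array import array

# Hypothetical stand-in for the (id, x) rows collected on the Spark driver.
rows = [(i, i * 0.5) for i in range(5)]


def row_wise_transfer(rows):
    """Old-style path: one pickle payload per row, then per-value column building."""
    payloads = [pickle.dumps(row) for row in rows]   # serialize each row separately
    received = [pickle.loads(p) for p in payloads]   # unpickle into a list of tuples
    # Every value is visited as a Python object to build each column.
    ids = [r[0] for r in received]
    xs = [r[1] for r in received]
    return ids, xs


def columnar_transfer(rows):
    """Columnar path: one contiguous typed buffer per column."""
    id_buf = array("q", (r[0] for r in rows)).tobytes()  # 64-bit signed integers
    x_buf = array("d", (r[1] for r in rows)).tobytes()   # 64-bit floats
    # memoryview.cast reinterprets the received bytes as typed values
    # without copying them or decoding them one object at a time.
    ids = memoryview(id_buf).cast("q")
    xs = memoryview(x_buf).cast("d")
    return ids, xs


ids_rows, xs_rows = row_wise_transfer(rows)
ids_cols, xs_cols = columnar_transfer(rows)
print(ids_rows == ids_cols.tolist(), xs_rows == xs_cols.tolist())
```

Both paths recover the same values; the difference is that the columnar path moves two byte buffers instead of one pickled object per row, which is the property the post attributes to Arrow.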
