http://git-wip-us.apache.org/repos/asf/arrow-site/blob/3cd84682/build/blog/2017/06/16/turbodbc-arrow/index.html ---------------------------------------------------------------------- diff --git a/build/blog/2017/06/16/turbodbc-arrow/index.html b/build/blog/2017/06/16/turbodbc-arrow/index.html new file mode 100644 index 0000000..8bf1f4e --- /dev/null +++ b/build/blog/2017/06/16/turbodbc-arrow/index.html @@ -0,0 +1,232 @@ +<!DOCTYPE html> +<html lang="en-US"> + <head> + <meta charset="UTF-8"> + <title>Apache Arrow Homepage</title> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <meta name="generator" content="Jekyll v3.4.3"> + <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags --> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + + <link href="/css/main.css" rel="stylesheet"> + <link href="/css/syntax.css" rel="stylesheet"> + <script src="https://code.jquery.com/jquery-3.2.1.min.js" + integrity="sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4=" + crossorigin="anonymous"></script> + <script src="/assets/javascripts/bootstrap.min.js"></script> + + <!-- Global Site Tag (gtag.js) - Google Analytics --> +<script async src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script> +<script> + window.dataLayer = window.dataLayer || []; + function gtag(){dataLayer.push(arguments)}; + gtag('js', new Date()); + + gtag('config', 'UA-107500873-1'); +</script> + + + </head> + + + +<body class="wrap"> + <div class="container"> + <nav class="navbar navbar-default"> + <div class="container-fluid"> + <div class="navbar-header"> + <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#arrow-navbar"> + <span class="sr-only">Toggle navigation</span> + <span 
class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="navbar-brand" href="/">Apache Arrow™ </a> + </div> + + <!-- Collect the nav links, forms, and other content for toggling --> + <div class="collapse navbar-collapse" id="arrow-navbar"> + <ul class="nav navbar-nav"> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Project Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/install/">Install</a></li> + <li><a href="/blog/">Blog</a></li> + <li><a href="/release/">Releases</a></li> + <li><a href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a></li> + <li><a href="https://github.com/apache/arrow">Source Code</a></li> + <li><a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Mailing List</a></li> + <li><a href="https://apachearrowslackin.herokuapp.com">Slack Channel</a></li> + <li><a href="/committers/">Committers</a></li> + <li><a href="/powered_by/">Powered By</a></li> + </ul> + </li> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Specification<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/memory_layout.html">Memory Layout</a></li> + <li><a href="/docs/metadata.html">Metadata</a></li> + <li><a href="/docs/ipc.html">Messaging / IPC</a></li> + </ul> + </li> + + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Documentation<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/python">Python</a></li> + <li><a href="/docs/cpp">C++ API</a></li> + <li><a href="/docs/java">Java API</a></li> + <li><a href="/docs/c_glib">C GLib API</a></li> + <li><a href="/docs/js">Javascript API</a></li> + </ul> + </li> + 
<!-- <li><a href="/blog">Blog</a></li> --> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">ASF Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="http://www.apache.org/">ASF Website</a></li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Donate</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </li> + </ul> + <a href="http://www.apache.org/"> + <img style="float:right;" src="/img/asf_logo.svg" width="120px"/> + </a> + </div><!-- /.navbar-collapse --> + </div> + </nav> + + + <h2> + Connecting Relational Databases to the Apache Arrow World with turbodbc + <a href="/blog/2017/06/16/turbodbc-arrow/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 16 Jun 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://github.com/MathMagique"><i class="fa fa-user"></i> Michael König (MathMagique)</a> + </div> + </div> + </div> + + <!-- + +--> + +<p><em><a href="https://github.com/mathmagique">Michael König</a> is the lead developer of the <a href="https://github.com/blue-yonder/turbodbc">turbodbc project</a></em></p> + +<p>The <a href="https://arrow.apache.org/">Apache Arrow</a> project set out to become the universal data layer for +column-oriented data processing systems without incurring serialization costs +or compromising on performance on a more general level. 
While relational +databases still lag behind in Apache Arrow adoption, the Python database module +<a href="https://github.com/blue-yonder/turbodbc">turbodbc</a> brings Apache Arrow support to these databases using a much +older, more specialized data exchange layer: <a href="https://en.wikipedia.org/wiki/Open_Database_Connectivity">ODBC</a>.</p> + +<p>ODBC is a database interface that offers developers the option to transfer data +either in row-wise or column-wise fashion. Previous Python ODBC modules typically +use the row-wise approach, and often trade repeated database roundtrips for simplified +buffer handling. This makes them less suited for data-intensive applications, +particularly when interfacing with modern columnar analytical databases.</p> + +<p>In contrast, turbodbc was designed to leverage columnar data processing from day +one. Naturally, this implies using the columnar portion of the ODBC API. Equally +important, however, is to find new ways of providing columnar data to Python users +that exceed the capabilities of the row-wise API mandated by Python’s <a href="https://www.python.org/dev/peps/pep-0249/">PEP 249</a>. 
+Turbodbc has adopted Apache Arrow for this very task with the recently released +version 2.0.0:</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">turbodbc</span> <span class="kn">import</span> <span class="n">connect</span> +<span class="o">>>></span> <span class="n">connection</span> <span class="o">=</span> <span class="n">connect</span><span class="p">(</span><span class="n">dsn</span><span class="o">=</span><span class="s">"My columnar database"</span><span class="p">)</span> +<span class="o">>>></span> <span class="n">cursor</span> <span class="o">=</span> <span class="n">connection</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span> +<span class="o">>>></span> <span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT some_integers, some_strings FROM my_table"</span><span class="p">)</span> +<span class="o">>>></span> <span class="n">cursor</span><span class="o">.</span><span class="n">fetchallarrow</span><span class="p">()</span> +<span class="n">pyarrow</span><span class="o">.</span><span class="n">Table</span> +<span class="n">some_integers</span><span class="p">:</span> <span class="n">int64</span> +<span class="n">some_strings</span><span class="p">:</span> <span class="n">string</span> +</code></pre> +</div> + +<p>With this new addition, the data flow for a result set of a typical SELECT query +is like this:</p> +<ul> + <li>The database prepares the result set and exposes it to the ODBC driver using +either row-wise or column-wise storage.</li> + <li>Turbodbc has the ODBC driver write chunks of the result set into columnar buffers.</li> + <li>These buffers are exposed to turbodbc’s Apache Arrow frontend. 
This frontend +will create an Arrow table and fill in the buffered values.</li> + <li>The previous steps are repeated until the entire result set is retrieved.</li> +</ul> + +<p><img src="/img/turbodbc_arrow.png" alt="Data flow from relational databases to Python with turbodbc and the Apache Arrow frontend" class="img-responsive" width="75%" /></p> + +<p>In practice, it is possible to achieve the following ideal situation: A 64-bit integer +column is stored as one contiguous block of memory in a columnar database. A huge chunk +of 64-bit integers is transferred over the network and the ODBC driver directly writes +it to a turbodbc buffer of 64-bit integers. The Arrow frontend accumulates these values +by copying the entire 64-bit buffer into a free portion of an Arrow table’s 64-bit +integer column.</p> + +<p>Moving data from the database to an Arrow table and, thus, providing it to the Python +user can be as simple as copying memory blocks around, megabytes (equivalent to hundreds of +thousands of rows) at a time. The absence of serialization and conversion logic renders +the process extremely efficient.</p> + +<p>Once the data is stored in an Arrow table, Python users can continue to do some +actual work. They can convert it into a <a href="https://arrow.apache.org/docs/python/pandas.html">Pandas DataFrame</a> for data analysis +(using a quick <code class="highlighter-rouge">table.to_pandas()</code>), pass it on to other data processing +systems such as <a href="http://spark.apache.org/">Apache Spark</a> or <a href="http://impala.apache.org/">Apache Impala (incubating)</a>, or store +it in the <a href="http://parquet.apache.org/">Apache Parquet</a> file format. This way, non-Python systems are +efficiently connected with relational databases.</p> + +<p>In the future, turbodbc’s Arrow support will be extended to use more +sophisticated features such as <a href="https://arrow.apache.org/docs/memory_layout.html#dictionary-encoding">dictionary-encoded</a> string fields. 
We also +plan to pick <a href="https://arrow.apache.org/docs/metadata.html#integers">data types</a> smaller than 64 bits where possible. Last but not +least, Arrow support will be extended to cover the reverse direction of data +flow, so that Python users can quickly insert Arrow tables into relational +databases.</p> + +<p>If you would like to learn more about turbodbc, check out the <a href="https://github.com/blue-yonder/turbodbc">GitHub project</a> and the +<a href="http://turbodbc.readthedocs.io/">project documentation</a>. If you want to learn more about how turbodbc implements the +nitty-gritty details, check out parts <a href="https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/">one</a> and <a href="https://tech.blue-yonder.com/making-of-turbodbc-part-2-c-to-python/">two</a> of the +<a href="https://tech.blue-yonder.com/making-of-turbodbc-part-1-wrestling-with-the-side-effects-of-a-c-api/">“Making of turbodbc”</a> series at <a href="https://tech.blue-yonder.com/">Blue Yonder’s technology blog</a>.</p> + + + + <hr/> +<footer class="footer"> + <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p> + <p>© 2017 Apache Software Foundation</p> +</footer> + + </div> +</body> +</html>
http://git-wip-us.apache.org/repos/asf/arrow-site/blob/3cd84682/build/blog/2017/07/24/0.5.0-release/index.html ---------------------------------------------------------------------- diff --git a/build/blog/2017/07/24/0.5.0-release/index.html b/build/blog/2017/07/24/0.5.0-release/index.html new file mode 100644 index 0000000..8e99201 --- /dev/null +++ b/build/blog/2017/07/24/0.5.0-release/index.html @@ -0,0 +1,235 @@ +<!DOCTYPE html> +<html lang="en-US"> + <head> + <meta charset="UTF-8"> + <title>Apache Arrow Homepage</title> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <meta name="generator" content="Jekyll v3.4.3"> + <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags --> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + + <link href="/css/main.css" rel="stylesheet"> + <link href="/css/syntax.css" rel="stylesheet"> + <script src="https://code.jquery.com/jquery-3.2.1.min.js" + integrity="sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4=" + crossorigin="anonymous"></script> + <script src="/assets/javascripts/bootstrap.min.js"></script> + + <!-- Global Site Tag (gtag.js) - Google Analytics --> +<script async src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script> +<script> + window.dataLayer = window.dataLayer || []; + function gtag(){dataLayer.push(arguments)}; + gtag('js', new Date()); + + gtag('config', 'UA-107500873-1'); +</script> + + + </head> + + + +<body class="wrap"> + <div class="container"> + <nav class="navbar navbar-default"> + <div class="container-fluid"> + <div class="navbar-header"> + <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#arrow-navbar"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + 
<span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="navbar-brand" href="/">Apache Arrow™ </a> + </div> + + <!-- Collect the nav links, forms, and other content for toggling --> + <div class="collapse navbar-collapse" id="arrow-navbar"> + <ul class="nav navbar-nav"> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Project Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/install/">Install</a></li> + <li><a href="/blog/">Blog</a></li> + <li><a href="/release/">Releases</a></li> + <li><a href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a></li> + <li><a href="https://github.com/apache/arrow">Source Code</a></li> + <li><a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Mailing List</a></li> + <li><a href="https://apachearrowslackin.herokuapp.com">Slack Channel</a></li> + <li><a href="/committers/">Committers</a></li> + <li><a href="/powered_by/">Powered By</a></li> + </ul> + </li> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Specification<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/memory_layout.html">Memory Layout</a></li> + <li><a href="/docs/metadata.html">Metadata</a></li> + <li><a href="/docs/ipc.html">Messaging / IPC</a></li> + </ul> + </li> + + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Documentation<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/python">Python</a></li> + <li><a href="/docs/cpp">C++ API</a></li> + <li><a href="/docs/java">Java API</a></li> + <li><a href="/docs/c_glib">C GLib API</a></li> + </ul> + </li> + <!-- <li><a href="/blog">Blog</a></li> --> + <li class="dropdown"> + <a 
href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">ASF Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="http://www.apache.org/">ASF Website</a></li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Donate</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </li> + </ul> + <a href="http://www.apache.org/"> + <img style="float:right;" src="/img/asf_logo.svg" width="120px"/> + </a> + </div><!-- /.navbar-collapse --> + </div> + </nav> + + + <h2> + Apache Arrow 0.5.0 Release + <a href="/blog/2017/07/24/0.5.0-release/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 24 Jul 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (wesm)</a> + </div> + </div> + </div> + + <!-- + +--> + +<p>The Apache Arrow team is pleased to announce the 0.5.0 release. It includes +<a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.5.0"><strong>130 resolved JIRAs</strong></a> with some new features, expanded integration +testing between implementations, and bug fixes. The Arrow memory format has remained +stable since the 0.3.x and 0.4.x releases.</p> + +<p>See the <a href="http://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your +platform. 
The <a href="http://arrow.apache.org/release/0.5.0.html">complete changelog</a> is also available.</p> + +<h2 id="expanded-integration-testing">Expanded Integration Testing</h2> + +<p>In this release, we added compatibility tests for dictionary-encoded data +between Java and C++. This enables the distinct values (the <em>dictionary</em>) in a +vector to be transmitted as part of an Arrow schema while the record batches +contain integers which correspond to the dictionary.</p> + +<p>So we might have:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>data (string): ['foo', 'bar', 'foo', 'bar'] +</code></pre> +</div> + +<p>In dictionary-encoded form, this could be represented as:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>indices (int8): [0, 1, 0, 1] +dictionary (string): ['foo', 'bar'] +</code></pre> +</div> + +<p>In upcoming releases, we plan to complete integration testing for the remaining +data types (including some more complicated types like unions and decimals) on +the road to a 1.0.0 release in the future.</p> + +<h2 id="c-activity">C++ Activity</h2> + +<p>We completed a number of significant pieces of work in the C++ part of Apache +Arrow.</p> + +<h3 id="using-jemalloc-as-default-memory-allocator">Using jemalloc as default memory allocator</h3> + +<p>We decided to use <a href="https://github.com/jemalloc/jemalloc">jemalloc</a> as the default memory allocator unless it is +explicitly disabled. This memory allocator has significant performance +advantages in Arrow workloads over the default <code class="highlighter-rouge">malloc</code> implementation. We will +publish a blog post going into more detail about this and why you might care.</p> + +<h3 id="sharing-more-c-code-with-apache-parquet">Sharing more C++ code with Apache Parquet</h3> + +<p>We imported the compression library interfaces and dictionary encoding +algorithms from the <a href="http://github.com/apache/parquet-cpp">Apache Parquet C++ library</a>. 
The Parquet library now +depends on this code in Arrow, and we will be able to use it more easily for +data compression in Arrow use cases.</p> + +<p>As part of incorporating Parquet’s dictionary encoding utilities, we have +developed an <code class="highlighter-rouge">arrow::DictionaryBuilder</code> class to enable building +dictionary-encoded arrays iteratively. This can help save memory and yield +better performance when interacting with databases, Parquet files, or other +sources whose columns may contain many duplicates.</p> + +<h3 id="support-for-lz4-and-zstd-compressors">Support for LZ4 and ZSTD compressors</h3> + +<p>We added LZ4 and ZSTD compression library support. In ARROW-300 and other +planned work, we intend to add some compression features for data sent via RPC.</p> + +<h2 id="python-activity">Python Activity</h2> + +<p>We fixed many bugs which were affecting Parquet and Feather users and fixed +several other rough edges with normal Arrow use. We also added some additional +Arrow type conversions: structs, lists embedded in pandas objects, and Arrow +time types (which deserialize to the <code class="highlighter-rouge">datetime.time</code> type).</p> + +<p>In upcoming releases we plan to continue to improve <a href="http://github.com/dask/dask">Dask</a> support and +performance for distributed processing of Apache Parquet files with pyarrow.</p> + +<h2 id="the-road-ahead">The Road Ahead</h2> + +<p>We have much work ahead of us to build out Arrow integrations in other data +systems to improve their processing performance and interoperability with other +systems.</p> + +<p>We are discussing the roadmap to a future 1.0.0 release on the <a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">developer +mailing list</a>. 
Please join the discussion there.</p> + + + + <hr/> +<footer class="footer"> + <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p> + <p>© 2017 Apache Software Foundation</p> +</footer> + + </div> +</body> +</html> http://git-wip-us.apache.org/repos/asf/arrow-site/blob/3cd84682/build/blog/2017/07/25/0.5.0-release/index.html ---------------------------------------------------------------------- diff --git a/build/blog/2017/07/25/0.5.0-release/index.html b/build/blog/2017/07/25/0.5.0-release/index.html new file mode 100644 index 0000000..4b7dd39 --- /dev/null +++ b/build/blog/2017/07/25/0.5.0-release/index.html @@ -0,0 +1,236 @@ +<!DOCTYPE html> +<html lang="en-US"> + <head> + <meta charset="UTF-8"> + <title>Apache Arrow Homepage</title> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <meta name="generator" content="Jekyll v3.4.3"> + <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags --> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + + <link href="/css/main.css" rel="stylesheet"> + <link href="/css/syntax.css" rel="stylesheet"> + <script src="https://code.jquery.com/jquery-3.2.1.min.js" + integrity="sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4=" + crossorigin="anonymous"></script> + <script src="/assets/javascripts/bootstrap.min.js"></script> + + <!-- Global Site Tag (gtag.js) - Google Analytics --> +<script async src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script> +<script> + window.dataLayer = window.dataLayer || []; + function gtag(){dataLayer.push(arguments)}; + gtag('js', new Date()); + + 
gtag('config', 'UA-107500873-1'); +</script> + + + </head> + + + +<body class="wrap"> + <div class="container"> + <nav class="navbar navbar-default"> + <div class="container-fluid"> + <div class="navbar-header"> + <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#arrow-navbar"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="navbar-brand" href="/">Apache Arrow™ </a> + </div> + + <!-- Collect the nav links, forms, and other content for toggling --> + <div class="collapse navbar-collapse" id="arrow-navbar"> + <ul class="nav navbar-nav"> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Project Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/install/">Install</a></li> + <li><a href="/blog/">Blog</a></li> + <li><a href="/release/">Releases</a></li> + <li><a href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a></li> + <li><a href="https://github.com/apache/arrow">Source Code</a></li> + <li><a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Mailing List</a></li> + <li><a href="https://apachearrowslackin.herokuapp.com">Slack Channel</a></li> + <li><a href="/committers/">Committers</a></li> + <li><a href="/powered_by/">Powered By</a></li> + </ul> + </li> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Specification<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/memory_layout.html">Memory Layout</a></li> + <li><a href="/docs/metadata.html">Metadata</a></li> + <li><a href="/docs/ipc.html">Messaging / IPC</a></li> + </ul> + </li> + + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" 
aria-haspopup="true" + aria-expanded="false">Documentation<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/python">Python</a></li> + <li><a href="/docs/cpp">C++ API</a></li> + <li><a href="/docs/java">Java API</a></li> + <li><a href="/docs/c_glib">C GLib API</a></li> + <li><a href="/docs/js">Javascript API</a></li> + </ul> + </li> + <!-- <li><a href="/blog">Blog</a></li> --> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">ASF Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="http://www.apache.org/">ASF Website</a></li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Donate</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </li> + </ul> + <a href="http://www.apache.org/"> + <img style="float:right;" src="/img/asf_logo.svg" width="120px"/> + </a> + </div><!-- /.navbar-collapse --> + </div> + </nav> + + + <h2> + Apache Arrow 0.5.0 Release + <a href="/blog/2017/07/25/0.5.0-release/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 25 Jul 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (wesm)</a> + </div> + </div> + </div> + + <!-- + +--> + +<p>The Apache Arrow team is pleased to announce the 0.5.0 release. 
It includes +<a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.5.0"><strong>130 resolved JIRAs</strong></a> with some new features, expanded integration +testing between implementations, and bug fixes. The Arrow memory format has remained +stable since the 0.3.x and 0.4.x releases.</p> + +<p>See the <a href="http://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your +platform. The <a href="http://arrow.apache.org/release/0.5.0.html">complete changelog</a> is also available.</p> + +<h2 id="expanded-integration-testing">Expanded Integration Testing</h2> + +<p>In this release, we added compatibility tests for dictionary-encoded data +between Java and C++. This enables the distinct values (the <em>dictionary</em>) in a +vector to be transmitted as part of an Arrow schema while the record batches +contain integers which correspond to the dictionary.</p> + +<p>So we might have:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>data (string): ['foo', 'bar', 'foo', 'bar'] +</code></pre> +</div> + +<p>In dictionary-encoded form, this could be represented as:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>indices (int8): [0, 1, 0, 1] +dictionary (string): ['foo', 'bar'] +</code></pre> +</div> + +<p>In upcoming releases, we plan to complete integration testing for the remaining +data types (including some more complicated types like unions and decimals) on +the road to a 1.0.0 release in the future.</p> + +<h2 id="c-activity">C++ Activity</h2> + +<p>We completed a number of significant pieces of work in the C++ part of Apache +Arrow.</p> + +<h3 id="using-jemalloc-as-default-memory-allocator">Using jemalloc as default memory allocator</h3> + +<p>We decided to use <a href="https://github.com/jemalloc/jemalloc">jemalloc</a> as the default memory allocator unless it is +explicitly disabled. 
This memory allocator has significant performance +advantages in Arrow workloads over the default <code class="highlighter-rouge">malloc</code> implementation. We will +publish a blog post going into more detail about this and why you might care.</p> + +<h3 id="sharing-more-c-code-with-apache-parquet">Sharing more C++ code with Apache Parquet</h3> + +<p>We imported the compression library interfaces and dictionary encoding +algorithms from the <a href="http://github.com/apache/parquet-cpp">Apache Parquet C++ library</a>. The Parquet library now +depends on this code in Arrow, and we will be able to use it more easily for +data compression in Arrow use cases.</p> + +<p>As part of incorporating Parquet’s dictionary encoding utilities, we have +developed an <code class="highlighter-rouge">arrow::DictionaryBuilder</code> class to enable building +dictionary-encoded arrays iteratively. This can help save memory and yield +better performance when interacting with databases, Parquet files, or other +sources whose columns may contain many duplicates.</p> + +<h3 id="support-for-lz4-and-zstd-compressors">Support for LZ4 and ZSTD compressors</h3> + +<p>We added LZ4 and ZSTD compression library support. In ARROW-300 and other +planned work, we intend to add some compression features for data sent via RPC.</p> + +<h2 id="python-activity">Python Activity</h2> + +<p>We fixed many bugs which were affecting Parquet and Feather users and fixed +several other rough edges with normal Arrow use. 
We also added some additional +Arrow type conversions: structs, lists embedded in pandas objects, and Arrow +time types (which deserialize to the <code class="highlighter-rouge">datetime.time</code> type).</p> + +<p>In upcoming releases we plan to continue to improve <a href="http://github.com/dask/dask">Dask</a> support and +performance for distributed processing of Apache Parquet files with pyarrow.</p> + +<h2 id="the-road-ahead">The Road Ahead</h2> + +<p>We have much work ahead of us to build out Arrow integrations in other data +systems to improve their processing performance and interoperability with other +systems.</p> + +<p>We are discussing the roadmap to a future 1.0.0 release on the <a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">developer +mailing list</a>. Please join the discussion there.</p> + + + + <hr/> +<footer class="footer"> + <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p> + <p>© 2017 Apache Software Foundation</p> +</footer> + + </div> +</body> +</html> http://git-wip-us.apache.org/repos/asf/arrow-site/blob/3cd84682/build/blog/2017/07/26/spark-arrow/index.html ---------------------------------------------------------------------- diff --git a/build/blog/2017/07/26/spark-arrow/index.html b/build/blog/2017/07/26/spark-arrow/index.html new file mode 100644 index 0000000..1cbd2ea --- /dev/null +++ b/build/blog/2017/07/26/spark-arrow/index.html @@ -0,0 +1,272 @@ +<!DOCTYPE html> +<html lang="en-US"> + <head> + <meta charset="UTF-8"> + <title>Apache Arrow Homepage</title> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <meta name="generator" content="Jekyll v3.4.3"> + <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags --> + <link 
rel="icon" type="image/x-icon" href="/favicon.ico"> + + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + + <link href="/css/main.css" rel="stylesheet"> + <link href="/css/syntax.css" rel="stylesheet"> + <script src="https://code.jquery.com/jquery-3.2.1.min.js" + integrity="sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4=" + crossorigin="anonymous"></script> + <script src="/assets/javascripts/bootstrap.min.js"></script> + + <!-- Global Site Tag (gtag.js) - Google Analytics --> +<script async src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script> +<script> + window.dataLayer = window.dataLayer || []; + function gtag(){dataLayer.push(arguments)}; + gtag('js', new Date()); + + gtag('config', 'UA-107500873-1'); +</script> + + + </head> + + + +<body class="wrap"> + <div class="container"> + <nav class="navbar navbar-default"> + <div class="container-fluid"> + <div class="navbar-header"> + <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#arrow-navbar"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="navbar-brand" href="/">Apache Arrow™ </a> + </div> + + <!-- Collect the nav links, forms, and other content for toggling --> + <div class="collapse navbar-collapse" id="arrow-navbar"> + <ul class="nav navbar-nav"> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Project Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/install/">Install</a></li> + <li><a href="/blog/">Blog</a></li> + <li><a href="/release/">Releases</a></li> + <li><a href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a></li> + <li><a href="https://github.com/apache/arrow">Source Code</a></li> + <li><a 
href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Mailing List</a></li> + <li><a href="https://apachearrowslackin.herokuapp.com">Slack Channel</a></li> + <li><a href="/committers/">Committers</a></li> + <li><a href="/powered_by/">Powered By</a></li> + </ul> + </li> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Specification<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/memory_layout.html">Memory Layout</a></li> + <li><a href="/docs/metadata.html">Metadata</a></li> + <li><a href="/docs/ipc.html">Messaging / IPC</a></li> + </ul> + </li> + + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Documentation<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/python">Python</a></li> + <li><a href="/docs/cpp">C++ API</a></li> + <li><a href="/docs/java">Java API</a></li> + <li><a href="/docs/c_glib">C GLib API</a></li> + <li><a href="/docs/js">Javascript API</a></li> + </ul> + </li> + <!-- <li><a href="/blog">Blog</a></li> --> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">ASF Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="http://www.apache.org/">ASF Website</a></li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Donate</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </li> + </ul> + <a href="http://www.apache.org/"> + <img style="float:right;" src="/img/asf_logo.svg" width="120px"/> + </a> + </div><!-- /.navbar-collapse --> + </div> + </nav> + + + <h2> + Speeding up PySpark with 
Apache Arrow + <a href="/blog/2017/07/26/spark-arrow/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 26 Jul 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://people.apache.org/~BryanCutler"><i class="fa fa-user"></i> (BryanCutler)</a> + </div> + </div> + </div> + + <!-- + +--> + +<p><em><a href="https://github.com/BryanCutler">Bryan Cutler</a> is a software engineer at IBM’s Spark Technology Center <a href="http://www.spark.tc/">STC</a></em></p> + +<p>Beginning with <a href="https://spark.apache.org/">Apache Spark</a> version 2.3, <a href="https://arrow.apache.org/">Apache Arrow</a> will be a supported +dependency and begin to offer increased performance with columnar data transfer. +If you are a Spark user who prefers to work in Python and Pandas, this is cause +for excitement! The initial work is limited to collecting a Spark DataFrame +with <code class="highlighter-rouge">toPandas()</code>, which I will discuss below; however, there are many additional +improvements that are currently <a href="https://issues.apache.org/jira/issues/?filter=12335725&jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22arrow%22%20ORDER%20BY%20createdDate%20DESC">underway</a>.</p> + +<h1 id="optimizing-spark-conversion-to-pandas">Optimizing Spark Conversion to Pandas</h1> + +<p>The previous way of converting a Spark DataFrame to Pandas with <code class="highlighter-rouge">DataFrame.toPandas()</code> +in PySpark was painfully inefficient. Basically, it works by first collecting all +rows to the Spark driver. Next, each row is serialized into Python’s pickle +format and sent to a Python worker process. This child process unpickles each row into +a huge list of tuples. 
Finally, a Pandas DataFrame is created from the list using +<code class="highlighter-rouge">pandas.DataFrame.from_records()</code>.</p> + +<p>This might all seem like standard procedure, but it suffers from two glaring issues: 1) +even using cPickle, Python serialization is a slow process, and 2) creating +a <code class="highlighter-rouge">pandas.DataFrame</code> using <code class="highlighter-rouge">from_records</code> must slowly iterate over the list of pure +Python data and convert each value to Pandas format. See <a href="https://gist.github.com/wesm/0cb5531b1c2e346a0007">here</a> for a detailed +analysis.</p> + +<p>Here is where Arrow really shines to help optimize these steps: 1) once the data is +in Arrow memory format, there is no need to serialize/pickle it anymore, as Arrow data can +be sent directly to the Python process, and 2) when the Arrow data is received in Python, +pyarrow can use zero-copy methods to create a <code class="highlighter-rouge">pandas.DataFrame</code> from entire +chunks of data at once instead of processing individual scalar values. Additionally, +the conversion to Arrow data can be done on the JVM and pushed back for the Spark +executors to perform in parallel, drastically reducing the load on the driver.</p> + +<p>As of the merging of <a href="https://issues.apache.org/jira/browse/SPARK-13534">SPARK-13534</a>, the use of Arrow when calling <code class="highlighter-rouge">toPandas()</code> +needs to be enabled by setting the SQLConf “spark.sql.execution.arrow.enabled” to +“true”. Let’s look at a simple usage example.</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>Welcome to + ____ __ + / __/__ ___ _____/ /__ + _\ \/ _ \/ _ `/ __/ '_/ + /__ / .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT + /_/ + +Using Python version 2.7.13 (default, Dec 20 2016 23:09:15) +SparkSession available as 'spark'. 
+ +In [1]: from pyspark.sql.functions import rand + ...: df = spark.range(1 << 22).toDF("id").withColumn("x", rand()) + ...: df.printSchema() + ...: +root + |-- id: long (nullable = false) + |-- x: double (nullable = false) + + +In [2]: %time pdf = df.toPandas() +CPU times: user 17.4 s, sys: 792 ms, total: 18.1 s +Wall time: 20.7 s + +In [3]: spark.conf.set("spark.sql.execution.arrow.enabled", "true") + +In [4]: %time pdf = df.toPandas() +CPU times: user 40 ms, sys: 32 ms, total: 72 ms +Wall time: 737 ms + +In [5]: pdf.describe() +Out[5]: + id x +count 4.194304e+06 4.194304e+06 +mean 2.097152e+06 4.998996e-01 +std 1.210791e+06 2.887247e-01 +min 0.000000e+00 8.291929e-07 +25% 1.048576e+06 2.498116e-01 +50% 2.097152e+06 4.999210e-01 +75% 3.145727e+06 7.498380e-01 +max 4.194303e+06 9.999996e-01 +</code></pre> +</div> + +<p>This example was run locally on my laptop using Spark defaults, so the times +shown should not be taken as precise. Even so, it is clear there is a huge +performance boost: using Arrow took something that was excruciatingly slow +and sped it up to be barely noticeable.</p> + +<h1 id="notes-on-usage">Notes on Usage</h1> + +<p>Here are some things to keep in mind before making use of this new feature. At +the time of writing, pyarrow is not installed automatically with +pyspark and needs to be installed manually; see the installation <a href="https://github.com/apache/arrow/blob/master/site/install.md">instructions</a>. +It is planned to add pyarrow as a pyspark dependency so that +<code class="highlighter-rouge">> pip install pyspark</code> will also install pyarrow.</p> + +<p>Currently, the controlling SQLConf is disabled by default. It can be enabled +programmatically as in the example above or by adding the line +“spark.sql.execution.arrow.enabled=true” to <code class="highlighter-rouge">SPARK_HOME/conf/spark-defaults.conf</code>.</p> + +<p>Also, not all Spark data types are currently supported; support is limited to primitive +types. 
Expanded type support is in the works and is also expected to be in the Spark +2.3 release.</p> + +<h1 id="future-improvements">Future Improvements</h1> + +<p>As mentioned, this was just a first step in using Arrow to make life easier for +Spark Python users. A few exciting initiatives in the works are +vectorized UDF evaluation (<a href="https://issues.apache.org/jira/browse/SPARK-21190">SPARK-21190</a>, <a href="https://issues.apache.org/jira/browse/SPARK-21404">SPARK-21404</a>) and the ability +to apply a function on grouped data using a Pandas DataFrame (<a href="https://issues.apache.org/jira/browse/SPARK-20396">SPARK-20396</a>). +Just as Arrow helped in converting a Spark DataFrame to Pandas, it can also work in the +other direction when creating a Spark DataFrame from an existing Pandas +DataFrame (<a href="https://issues.apache.org/jira/browse/SPARK-20791">SPARK-20791</a>). Stay tuned for more!</p> + +<h1 id="collaborators">Collaborators</h1> + +<p>Reaching this first milestone was a group effort from both the Apache Arrow and +Spark communities. 
Thanks to the hard work of <a href="https://github.com/wesm">Wes McKinney</a>, <a href="https://github.com/icexelloss">Li Jin</a>, +<a href="https://github.com/holdenk">Holden Karau</a>, Reynold Xin, Wenchen Fan, Shane Knapp and many others who +helped push this effort forward.</p> + + + + <hr/> +<footer class="footer"> + <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p> + <p>© 2017 Apache Software Foundation</p> +</footer> + + </div> +</body> +</html> http://git-wip-us.apache.org/repos/asf/arrow-site/blob/3cd84682/build/blog/2017/08/07/plasma-in-memory-object-store/index.html ---------------------------------------------------------------------- diff --git a/build/blog/2017/08/07/plasma-in-memory-object-store/index.html b/build/blog/2017/08/07/plasma-in-memory-object-store/index.html new file mode 100644 index 0000000..d2f25da --- /dev/null +++ b/build/blog/2017/08/07/plasma-in-memory-object-store/index.html @@ -0,0 +1,273 @@ +<!DOCTYPE html> +<html lang="en-US"> + <head> + <meta charset="UTF-8"> + <title>Apache Arrow Homepage</title> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <meta name="generator" content="Jekyll v3.4.3"> + <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags --> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + + <link href="/css/main.css" rel="stylesheet"> + <link href="/css/syntax.css" rel="stylesheet"> + <script src="https://code.jquery.com/jquery-3.2.1.min.js" + integrity="sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4=" + crossorigin="anonymous"></script> + <script 
src="/assets/javascripts/bootstrap.min.js"></script> + + <!-- Global Site Tag (gtag.js) - Google Analytics --> +<script async src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script> +<script> + window.dataLayer = window.dataLayer || []; + function gtag(){dataLayer.push(arguments)}; + gtag('js', new Date()); + + gtag('config', 'UA-107500873-1'); +</script> + + + </head> + + + +<body class="wrap"> + <div class="container"> + <nav class="navbar navbar-default"> + <div class="container-fluid"> + <div class="navbar-header"> + <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#arrow-navbar"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="navbar-brand" href="/">Apache Arrow™ </a> + </div> + + <!-- Collect the nav links, forms, and other content for toggling --> + <div class="collapse navbar-collapse" id="arrow-navbar"> + <ul class="nav navbar-nav"> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Project Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/install/">Install</a></li> + <li><a href="/blog/">Blog</a></li> + <li><a href="/release/">Releases</a></li> + <li><a href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a></li> + <li><a href="https://github.com/apache/arrow">Source Code</a></li> + <li><a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Mailing List</a></li> + <li><a href="https://apachearrowslackin.herokuapp.com">Slack Channel</a></li> + <li><a href="/committers/">Committers</a></li> + <li><a href="/powered_by/">Powered By</a></li> + </ul> + </li> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Specification<span class="caret"></span> + 
</a> + <ul class="dropdown-menu"> + <li><a href="/docs/memory_layout.html">Memory Layout</a></li> + <li><a href="/docs/metadata.html">Metadata</a></li> + <li><a href="/docs/ipc.html">Messaging / IPC</a></li> + </ul> + </li> + + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Documentation<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/python">Python</a></li> + <li><a href="/docs/cpp">C++ API</a></li> + <li><a href="/docs/java">Java API</a></li> + <li><a href="/docs/c_glib">C GLib API</a></li> + </ul> + </li> + <!-- <li><a href="/blog">Blog</a></li> --> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">ASF Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="http://www.apache.org/">ASF Website</a></li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Donate</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </li> + </ul> + <a href="http://www.apache.org/"> + <img style="float:right;" src="/img/asf_logo.svg" width="120px"/> + </a> + </div><!-- /.navbar-collapse --> + </div> + </nav> + + + <h2> + Plasma In-Memory Object Store + <a href="/blog/2017/08/07/plasma-in-memory-object-store/" class="permalink" title="Permalink">â</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 07 Aug 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://people.apache.org/~Philipp Moritz and Robert Nishihara"><i class="fa fa-user"></i> (Philipp Moritz and Robert 
Nishihara)</a> + </div> + </div> + </div> + + <!-- + +--> + +<p><em><a href="https://people.eecs.berkeley.edu/~pcmoritz/">Philipp Moritz</a> and <a href="http://www.robertnishihara.com">Robert Nishihara</a> are graduate students at UC + Berkeley.</em></p> + +<h2 id="plasma-a-high-performance-shared-memory-object-store">Plasma: A High-Performance Shared-Memory Object Store</h2> + +<h3 id="motivating-plasma">Motivating Plasma</h3> + +<p>This blog post presents Plasma, an in-memory object store that is being +developed as part of Apache Arrow. <strong>Plasma holds immutable objects in shared +memory so that they can be accessed efficiently by many clients across process +boundaries.</strong> In light of the trend toward larger and larger multicore machines, +Plasma enables critical performance optimizations in the big data regime.</p> + +<p>Plasma was initially developed as part of <a href="https://github.com/ray-project/ray">Ray</a>, and has recently been moved +to Apache Arrow in the hopes that it will be broadly useful.</p> + +<p>One of the goals of Apache Arrow is to serve as a common data layer enabling +zero-copy data exchange between multiple frameworks. A key component of this +vision is the use of off-heap memory management (via Plasma) for storing and +sharing Arrow-serialized objects between applications.</p> + +<p><strong>Expensive serialization and deserialization as well as data copying are a +common performance bottleneck in distributed computing.</strong> For example, a +Python-based execution framework that wishes to distribute computation across +multiple Python “worker” processes and then aggregate the results in a single +“driver” process may choose to serialize data using the built-in <code class="highlighter-rouge">pickle</code> +library. Assuming one Python process per core, each worker process would have to +copy and deserialize the data, resulting in excessive memory usage. 
The driver +process would then have to deserialize results from each of the workers, +resulting in a bottleneck.</p> + +<p>Using Plasma plus Arrow, the data being operated on would be placed in the +Plasma store once, and all of the workers would read the data without copying or +deserializing it (the workers would map the relevant region of memory into their +own address spaces). The workers would then put the results of their computation +back into the Plasma store, which the driver could then read and aggregate +without copying or deserializing the data.</p> + +<h3 id="the-plasma-api">The Plasma API:</h3> + +<p>Below we illustrate a subset of the API. The C++ API is documented more fully +<a href="https://github.com/apache/arrow/blob/master/cpp/apidoc/tutorials/plasma.md">here</a>, and the Python API is documented <a href="https://github.com/apache/arrow/blob/master/python/doc/source/plasma.rst">here</a>.</p> + +<p><strong>Object IDs:</strong> Each object is associated with a string of bytes.</p> + +<p><strong>Creating an object:</strong> Objects are stored in Plasma in two stages. First, the +object store <em>creates</em> the object by allocating a buffer for it. At this point, +the client can write to the buffer and construct the object within the allocated +buffer. 
When the client is done, the client <em>seals</em> the buffer making the object +immutable and making it available to other Plasma clients.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Create an object.</span> +<span class="n">object_id</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">plasma</span><span class="o">.</span><span class="n">ObjectID</span><span class="p">(</span><span class="mi">20</span> <span class="o">*</span> <span class="n">b</span><span class="s">'a'</span><span class="p">)</span> +<span class="n">object_size</span> <span class="o">=</span> <span class="mi">1000</span> +<span class="nb">buffer</span> <span class="o">=</span> <span class="n">memoryview</span><span class="p">(</span><span class="n">client</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">object_id</span><span class="p">,</span> <span class="n">object_size</span><span class="p">))</span> + +<span class="c"># Write to the buffer.</span> +<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span> + <span class="nb">buffer</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span> + +<span class="c"># Seal the object making it immutable and available to other clients.</span> +<span class="n">client</span><span class="o">.</span><span class="n">seal</span><span class="p">(</span><span class="n">object_id</span><span class="p">)</span> +</code></pre> +</div> + +<p><strong>Getting an object:</strong> After an object has been sealed, any client who knows the +object ID can get the object.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Get the object from the store. 
This blocks until the object has been sealed.</span> +<span class="n">object_id</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">plasma</span><span class="o">.</span><span class="n">ObjectID</span><span class="p">(</span><span class="mi">20</span> <span class="o">*</span> <span class="n">b</span><span class="s">'a'</span><span class="p">)</span> +<span class="p">[</span><span class="n">buff</span><span class="p">]</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">get</span><span class="p">([</span><span class="n">object_id</span><span class="p">])</span> +<span class="nb">buffer</span> <span class="o">=</span> <span class="n">memoryview</span><span class="p">(</span><span class="n">buff</span><span class="p">)</span> +</code></pre> +</div> + +<p>If the object has not been sealed yet, then the call to <code class="highlighter-rouge">client.get</code> will block +until the object has been sealed.</p> + +<h3 id="a-sorting-application">A sorting application</h3> + +<p>To illustrate the benefits of Plasma, we demonstrate an <strong>11x speedup</strong> (on a +machine with 20 physical cores) for sorting a large pandas DataFrame (one +billion entries). The baseline is the built-in pandas sort function, which sorts +the DataFrame in 477 seconds. 
To leverage multiple cores, we implement the +following standard distributed sorting scheme.</p> + +<ul> + <li>We assume that the data is partitioned across K pandas DataFrames and that +each one already lives in the Plasma store.</li> + <li>We subsample the data, sort the subsampled data, and use the result to define +L non-overlapping buckets.</li> + <li>For each of the K data partitions and each of the L buckets, we find the +subset of the data partition that falls in the bucket, and we sort that +subset.</li> + <li>For each of the L buckets, we gather all of the K sorted subsets that fall in +that bucket.</li> + <li>For each of the L buckets, we merge the corresponding K sorted subsets.</li> + <li>We turn each bucket into a pandas DataFrame and place it in the Plasma store.</li> +</ul> + +<p>Using this scheme, we can sort the DataFrame (the data starts and ends in the +Plasma store) in 44 seconds, giving an 11x speedup over the baseline.</p> + +<h3 id="design">Design</h3> + +<p>The Plasma store runs as a separate process. It is written in C++ and is +designed as a single-threaded event loop based on the <a href="https://redis.io/">Redis</a> event loop library. +The Plasma client library can be linked into applications. Clients communicate +with the Plasma store via messages serialized using <a href="https://google.github.io/flatbuffers/">Google FlatBuffers</a>.</p> + +<h3 id="call-for-contributions">Call for contributions</h3> + +<p>Plasma is a work in progress, and the API is currently unstable. Today Plasma is +primarily used in <a href="https://github.com/ray-project/ray">Ray</a> as an in-memory cache for Arrow-serialized objects. +We are looking for a broader set of use cases to help refine Plasma’s API. In +addition, we are looking for contributions in a variety of areas, including +improving performance and building other language bindings. 
Please let us know +if you are interested in getting involved with the project.</p> + + + + <hr/> +<footer class="footer"> + <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p> + <p>© 2017 Apache Software Foundation</p> +</footer> + + </div> +</body> +</html> http://git-wip-us.apache.org/repos/asf/arrow-site/blob/3cd84682/build/blog/2017/08/08/plasma-in-memory-object-store/index.html ---------------------------------------------------------------------- diff --git a/build/blog/2017/08/08/plasma-in-memory-object-store/index.html b/build/blog/2017/08/08/plasma-in-memory-object-store/index.html new file mode 100644 index 0000000..528ad5f --- /dev/null +++ b/build/blog/2017/08/08/plasma-in-memory-object-store/index.html @@ -0,0 +1,274 @@ +<!DOCTYPE html> +<html lang="en-US"> + <head> + <meta charset="UTF-8"> + <title>Apache Arrow Homepage</title> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <meta name="generator" content="Jekyll v3.4.3"> + <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags --> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + + <link href="/css/main.css" rel="stylesheet"> + <link href="/css/syntax.css" rel="stylesheet"> + <script src="https://code.jquery.com/jquery-3.2.1.min.js" + integrity="sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4=" + crossorigin="anonymous"></script> + <script src="/assets/javascripts/bootstrap.min.js"></script> + + <!-- Global Site Tag (gtag.js) - Google Analytics --> +<script async src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script> +<script> + window.dataLayer = 
window.dataLayer || []; + function gtag(){dataLayer.push(arguments)}; + gtag('js', new Date()); + + gtag('config', 'UA-107500873-1'); +</script> + + + </head> + + + +<body class="wrap"> + <div class="container"> + <nav class="navbar navbar-default"> + <div class="container-fluid"> + <div class="navbar-header"> + <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#arrow-navbar"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="navbar-brand" href="/">Apache Arrow™ </a> + </div> + + <!-- Collect the nav links, forms, and other content for toggling --> + <div class="collapse navbar-collapse" id="arrow-navbar"> + <ul class="nav navbar-nav"> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Project Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/install/">Install</a></li> + <li><a href="/blog/">Blog</a></li> + <li><a href="/release/">Releases</a></li> + <li><a href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a></li> + <li><a href="https://github.com/apache/arrow">Source Code</a></li> + <li><a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Mailing List</a></li> + <li><a href="https://apachearrowslackin.herokuapp.com">Slack Channel</a></li> + <li><a href="/committers/">Committers</a></li> + <li><a href="/powered_by/">Powered By</a></li> + </ul> + </li> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Specification<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/memory_layout.html">Memory Layout</a></li> + <li><a href="/docs/metadata.html">Metadata</a></li> + <li><a href="/docs/ipc.html">Messaging / IPC</a></li> + </ul> + </li> + + <li 
class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Documentation<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/python">Python</a></li> + <li><a href="/docs/cpp">C++ API</a></li> + <li><a href="/docs/java">Java API</a></li> + <li><a href="/docs/c_glib">C GLib API</a></li> + <li><a href="/docs/js">Javascript API</a></li> + </ul> + </li> + <!-- <li><a href="/blog">Blog</a></li> --> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">ASF Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="http://www.apache.org/">ASF Website</a></li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Donate</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </li> + </ul> + <a href="http://www.apache.org/"> + <img style="float:right;" src="/img/asf_logo.svg" width="120px"/> + </a> + </div><!-- /.navbar-collapse --> + </div> + </nav> + + + <h2> + Plasma In-Memory Object Store + <a href="/blog/2017/08/08/plasma-in-memory-object-store/" class="permalink" title="Permalink">â</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 08 Aug 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://people.apache.org/~Philipp Moritz and Robert Nishihara"><i class="fa fa-user"></i> (Philipp Moritz and Robert Nishihara)</a> + </div> + </div> + </div> + + <!-- + +--> + +<p><em><a href="https://people.eecs.berkeley.edu/~pcmoritz/">Philipp Moritz</a> and <a 
href="http://www.robertnishihara.com">Robert Nishihara</a> are graduate students at UC + Berkeley.</em></p> + +<h2 id="plasma-a-high-performance-shared-memory-object-store">Plasma: A High-Performance Shared-Memory Object Store</h2> + +<h3 id="motivating-plasma">Motivating Plasma</h3> + +<p>This blog post presents Plasma, an in-memory object store that is being +developed as part of Apache Arrow. <strong>Plasma holds immutable objects in shared +memory so that they can be accessed efficiently by many clients across process +boundaries.</strong> In light of the trend toward larger and larger multicore machines, +Plasma enables critical performance optimizations in the big data regime.</p> + +<p>Plasma was initially developed as part of <a href="https://github.com/ray-project/ray">Ray</a>, and has recently been moved +to Apache Arrow in the hopes that it will be broadly useful.</p> + +<p>One of the goals of Apache Arrow is to serve as a common data layer enabling +zero-copy data exchange between multiple frameworks. A key component of this +vision is the use of off-heap memory management (via Plasma) for storing and +sharing Arrow-serialized objects between applications.</p> + +<p><strong>Expensive serialization and deserialization as well as data copying are a +common performance bottleneck in distributed computing.</strong> For example, a +Python-based execution framework that wishes to distribute computation across +multiple Python “worker” processes and then aggregate the results in a single +“driver” process may choose to serialize data using the built-in <code class="highlighter-rouge">pickle</code> +library. Assuming one Python process per core, each worker process would have to +copy and deserialize the data, resulting in excessive memory usage. 
The driver +process would then have to deserialize results from each of the workers, +resulting in a bottleneck.</p> + +<p>Using Plasma plus Arrow, the data being operated on would be placed in the +Plasma store once, and all of the workers would read the data without copying or +deserializing it (the workers would map the relevant region of memory into their +own address spaces). The workers would then put the results of their computation +back into the Plasma store, which the driver could then read and aggregate +without copying or deserializing the data.</p> + +<h3 id="the-plasma-api">The Plasma API:</h3> + +<p>Below we illustrate a subset of the API. The C++ API is documented more fully +<a href="https://github.com/apache/arrow/blob/master/cpp/apidoc/tutorials/plasma.md">here</a>, and the Python API is documented <a href="https://github.com/apache/arrow/blob/master/python/doc/source/plasma.rst">here</a>.</p> + +<p><strong>Object IDs:</strong> Each object is associated with a string of bytes.</p> + +<p><strong>Creating an object:</strong> Objects are stored in Plasma in two stages. First, the +object store <em>creates</em> the object by allocating a buffer for it. At this point, +the client can write to the buffer and construct the object within the allocated +buffer. 
When the client is done, the client <em>seals</em> the buffer making the object +immutable and making it available to other Plasma clients.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Create an object.</span> +<span class="n">object_id</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">plasma</span><span class="o">.</span><span class="n">ObjectID</span><span class="p">(</span><span class="mi">20</span> <span class="o">*</span> <span class="n">b</span><span class="s">'a'</span><span class="p">)</span> +<span class="n">object_size</span> <span class="o">=</span> <span class="mi">1000</span> +<span class="nb">buffer</span> <span class="o">=</span> <span class="n">memoryview</span><span class="p">(</span><span class="n">client</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">object_id</span><span class="p">,</span> <span class="n">object_size</span><span class="p">))</span> + +<span class="c"># Write to the buffer.</span> +<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span> + <span class="nb">buffer</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span> + +<span class="c"># Seal the object making it immutable and available to other clients.</span> +<span class="n">client</span><span class="o">.</span><span class="n">seal</span><span class="p">(</span><span class="n">object_id</span><span class="p">)</span> +</code></pre> +</div> + +<p><strong>Getting an object:</strong> After an object has been sealed, any client who knows the +object ID can get the object.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Get the object from the store. 
This blocks until the object has been sealed.</span> +<span class="n">object_id</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">plasma</span><span class="o">.</span><span class="n">ObjectID</span><span class="p">(</span><span class="mi">20</span> <span class="o">*</span> <span class="n">b</span><span class="s">'a'</span><span class="p">)</span> +<span class="p">[</span><span class="n">buff</span><span class="p">]</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">get</span><span class="p">([</span><span class="n">object_id</span><span class="p">])</span> +<span class="nb">buffer</span> <span class="o">=</span> <span class="n">memoryview</span><span class="p">(</span><span class="n">buff</span><span class="p">)</span> +</code></pre> +</div> + +<p>If the object has not been sealed yet, then the call to <code class="highlighter-rouge">client.get</code> will block +until the object has been sealed.</p> + +<h3 id="a-sorting-application">A sorting application</h3> + +<p>To illustrate the benefits of Plasma, we demonstrate an <strong>11x speedup</strong> (on a +machine with 20 physical cores) for sorting a large pandas DataFrame (one +billion entries). The baseline is the built-in pandas sort function, which sorts +the DataFrame in 477 seconds. 
To leverage multiple cores, we implement the +following standard distributed sorting scheme.</p> + +<ul> + <li>We assume that the data is partitioned across K pandas DataFrames and that +each one already lives in the Plasma store.</li> + <li>We subsample the data, sort the subsampled data, and use the result to define +L non-overlapping buckets.</li> + <li>For each of the K data partitions and each of the L buckets, we find the +subset of the data partition that falls in the bucket, and we sort that +subset.</li> + <li>For each of the L buckets, we gather all of the K sorted subsets that fall in +that bucket.</li> + <li>For each of the L buckets, we merge the corresponding K sorted subsets.</li> + <li>We turn each bucket into a pandas DataFrame and place it in the Plasma store.</li> +</ul> + +<p>Using this scheme, we can sort the DataFrame (the data starts and ends in the +Plasma store) in 44 seconds, giving an 11x speedup over the baseline.</p> + +<h3 id="design">Design</h3> + +<p>The Plasma store runs as a separate process. It is written in C++ and is +designed as a single-threaded event loop based on the <a href="https://redis.io/">Redis</a> event loop library. +The Plasma client library can be linked into applications. Clients communicate +with the Plasma store via messages serialized using <a href="https://google.github.io/flatbuffers/">Google FlatBuffers</a>.</p> + +<h3 id="call-for-contributions">Call for contributions</h3> + +<p>Plasma is a work in progress, and the API is currently unstable. Today Plasma is +primarily used in <a href="https://github.com/ray-project/ray">Ray</a> as an in-memory cache for Arrow-serialized objects. +We are looking for a broader set of use cases to help refine Plasma’s API. In +addition, we are looking for contributions in a variety of areas including +improving performance and building other language bindings.
Please let us know +if you are interested in getting involved with the project.</p> + + + + <hr/> +<footer class="footer"> + <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p> + <p>© 2017 Apache Software Foundation</p> +</footer> + + </div> +</body> +</html> http://git-wip-us.apache.org/repos/asf/arrow-site/blob/3cd84682/build/blog/2017/08/15/0.6.0-release/index.html ---------------------------------------------------------------------- diff --git a/build/blog/2017/08/15/0.6.0-release/index.html b/build/blog/2017/08/15/0.6.0-release/index.html new file mode 100644 index 0000000..4276b8c --- /dev/null +++ b/build/blog/2017/08/15/0.6.0-release/index.html @@ -0,0 +1,234 @@ +<!DOCTYPE html> +<html lang="en-US"> + <head> + <meta charset="UTF-8"> + <title>Apache Arrow Homepage</title> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <meta name="generator" content="Jekyll v3.4.3"> + <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags --> + <link rel="icon" type="image/x-icon" href="/favicon.ico"> + + <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Lato:300,300italic,400,400italic,700,700italic,900"> + + <link href="/css/main.css" rel="stylesheet"> + <link href="/css/syntax.css" rel="stylesheet"> + <script src="https://code.jquery.com/jquery-3.2.1.min.js" + integrity="sha256-hwg4gsxgFZhOsEEamdOYGBf13FyQuiTwlAQgxVSNgt4=" + crossorigin="anonymous"></script> + <script src="/assets/javascripts/bootstrap.min.js"></script> + + <!-- Global Site Tag (gtag.js) - Google Analytics --> +<script async src="https://www.googletagmanager.com/gtag/js?id=UA-107500873-1"></script> +<script> + window.dataLayer = window.dataLayer || []; + function
gtag(){dataLayer.push(arguments)}; + gtag('js', new Date()); + + gtag('config', 'UA-107500873-1'); +</script> + + + </head> + + + +<body class="wrap"> + <div class="container"> + <nav class="navbar navbar-default"> + <div class="container-fluid"> + <div class="navbar-header"> + <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#arrow-navbar"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + <a class="navbar-brand" href="/">Apache Arrow™ </a> + </div> + + <!-- Collect the nav links, forms, and other content for toggling --> + <div class="collapse navbar-collapse" id="arrow-navbar"> + <ul class="nav navbar-nav"> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Project Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/install/">Install</a></li> + <li><a href="/blog/">Blog</a></li> + <li><a href="/release/">Releases</a></li> + <li><a href="https://issues.apache.org/jira/browse/ARROW">Issue Tracker</a></li> + <li><a href="https://github.com/apache/arrow">Source Code</a></li> + <li><a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">Mailing List</a></li> + <li><a href="https://apachearrowslackin.herokuapp.com">Slack Channel</a></li> + <li><a href="/committers/">Committers</a></li> + <li><a href="/powered_by/">Powered By</a></li> + </ul> + </li> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Specification<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/memory_layout.html">Memory Layout</a></li> + <li><a href="/docs/metadata.html">Metadata</a></li> + <li><a href="/docs/ipc.html">Messaging / IPC</a></li> + </ul> + </li> + + <li class="dropdown"> + <a href="#" 
class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">Documentation<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="/docs/python">Python</a></li> + <li><a href="/docs/cpp">C++ API</a></li> + <li><a href="/docs/java">Java API</a></li> + <li><a href="/docs/c_glib">C GLib API</a></li> + </ul> + </li> + <!-- <li><a href="/blog">Blog</a></li> --> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" + role="button" aria-haspopup="true" + aria-expanded="false">ASF Links<span class="caret"></span> + </a> + <ul class="dropdown-menu"> + <li><a href="http://www.apache.org/">ASF Website</a></li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Donate</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + </ul> + </li> + </ul> + <a href="http://www.apache.org/"> + <img style="float:right;" src="/img/asf_logo.svg" width="120px"/> + </a> + </div><!-- /.navbar-collapse --> + </div> + </nav> + + + <h2> + Apache Arrow 0.6.0 Release + <a href="/blog/2017/08/15/0.6.0-release/" class="permalink" title="Permalink">∞</a> + </h2> + + + + <div class="panel"> + <div class="panel-body"> + <div> + <span class="label label-default">Published</span> + <span class="published"> + <i class="fa fa-calendar"></i> + 15 Aug 2017 + </span> + </div> + <div> + <span class="label label-default">By</span> + <a href="http://wesmckinney.com"><i class="fa fa-user"></i> Wes McKinney (wesm)</a> + </div> + </div> + </div> + + <!-- + +--> + +<p>The Apache Arrow team is pleased to announce the 0.6.0 release.
It includes +<a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.6.0"><strong>90 resolved JIRAs</strong></a>, covering the new Plasma shared memory object store as well as +improvements and bug fixes to the various language implementations. The Arrow +memory format has remained stable since the 0.3.x release.</p> + +<p>See the <a href="http://arrow.apache.org/install">Install Page</a> to learn how to get the libraries for your +platform. The <a href="http://arrow.apache.org/release/0.6.0.html">complete changelog</a> is also available.</p> + +<h2 id="plasma-shared-memory-object-store">Plasma Shared Memory Object Store</h2> + +<p>This release includes the <a href="http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/">Plasma Store</a>, which you can read more about in +the linked blog post. This system was originally developed as part of the <a href="https://ray-project.github.io/ray/">Ray +Project</a> at the <a href="https://rise.cs.berkeley.edu/">UC Berkeley RISELab</a>. We recognized that Plasma would be +highly valuable to the Arrow community as a tool for shared memory management +and zero-copy deserialization. Additionally, we believe we will be able to +develop a stronger software stack through sharing of IO and buffer management +code.</p> + +<p>The Plasma store is a server application which runs as a separate process. A +reference C++ client, with Python bindings, is made available in this +release. Clients can be developed in Java or other languages in the future to +enable simple sharing of complex datasets through shared memory.</p> + +<h2 id="arrow-format-addition-map-type">Arrow Format Addition: Map type</h2> + +<p>We added a Map logical type to represent ordered and unordered maps +in memory.
This corresponds to the <code class="highlighter-rouge">MAP</code> logical type annotation in the Parquet +format (where maps are represented as repeated structs).</p> + +<p>Map is represented as a list of structs. It is the first example of a logical +type whose physical representation is a nested type. We have not yet created +Map containers in any of the language implementations, but this can +be done in a future release.</p> + +<p>As an example, the Python data:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>data = [{'a': 1, 'bb': 2, 'cc': 3}, {'dddd': 4}] +</code></pre> +</div> + +<p>Could be represented in an Arrow <code class="highlighter-rouge">Map<String, Int32></code> as:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code>Map<String, Int32> = List<Struct<keys: String, values: Int32>> + is_valid: [true, true] + offsets: [0, 3, 4] + values: Struct<keys: String, values: Int32> + children: + - keys: String + is_valid: [true, true, true, true] + offsets: [0, 1, 3, 5, 9] + data: abbccdddd + - values: Int32 + is_valid: [true, true, true, true] + data: [1, 2, 3, 4] +</code></pre> +</div> +<h2 id="python-changes">Python Changes</h2> + +<p>Some highlights of Python development outside of bug fixes and general API +improvements include:</p> + +<ul> + <li>New <code class="highlighter-rouge">strings_to_categorical=True</code> option when calling <code class="highlighter-rouge">Table.to_pandas</code> will +yield pandas <code class="highlighter-rouge">Categorical</code> types from Arrow binary and string columns</li> + <li>Expanded Hadoop Filesystem (HDFS) functionality to improve compatibility with +Dask and other HDFS-aware Python libraries.</li> + <li>s3fs and other Dask-oriented filesystems can now be used with +<code class="highlighter-rouge">pyarrow.parquet.ParquetDataset</code></li> + <li>More graceful handling of pandas’s nanosecond timestamps when writing to +Parquet format.
You can now pass <code class="highlighter-rouge">coerce_timestamps='ms'</code> to cast to +milliseconds, or <code class="highlighter-rouge">'us'</code> for microseconds.</li> +</ul> + +<h2 id="toward-arrow-100-and-beyond">Toward Arrow 1.0.0 and Beyond</h2> + +<p>We are still discussing the roadmap to the 1.0.0 release on the <a href="http://mail-archives.apache.org/mod_mbox/arrow-dev/">developer mailing +list</a>. The focus of the 1.0.0 release will likely be memory format stability +and hardening integration tests across the remaining data types implemented in +Java and C++. Please join the discussion there.</p> + + + + <hr/> +<footer class="footer"> + <p>Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p> + <p>© 2017 Apache Software Foundation</p> +</footer> + + </div> +</body> +</html>
