Repository: arrow-site Updated Branches: refs/heads/asf-site b286da84c -> 3b67853c5
http://git-wip-us.apache.org/repos/asf/arrow-site/blob/3b67853c/docs/ipc.html ---------------------------------------------------------------------- diff --git a/docs/ipc.html b/docs/ipc.html index ffbe491..69bfa36 100644 --- a/docs/ipc.html +++ b/docs/ipc.html @@ -106,17 +106,22 @@ --> <!--- - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. See accompanying LICENSE file. + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. --> <h1 id="interprocess-messaging--communication-ipc">Interprocess messaging / communication (IPC)</h1> http://git-wip-us.apache.org/repos/asf/arrow-site/blob/3b67853c/docs/memory_layout.html ---------------------------------------------------------------------- diff --git a/docs/memory_layout.html b/docs/memory_layout.html index 7703a15..98cb556 100644 --- a/docs/memory_layout.html +++ b/docs/memory_layout.html @@ -106,17 +106,22 @@ --> <!--- - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. See accompanying LICENSE file. + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. --> <h1 id="arrow-physical-memory-layout">Arrow: Physical memory layout</h1> @@ -167,7 +172,11 @@ proprietary systems that utilize the open source components.</li> linearly in the nesting level</li> <li>Capable of representing fully-materialized and decoded / decompressed <a href="https://parquet.apache.org/documentation/latest/">Parquet</a> data</li> - <li>All contiguous memory buffers are aligned at 64-byte boundaries and padded to a multiple of 64 bytes.</li> + <li>It is required to have all the contiguous memory buffers in an IPC payload +aligned at 8-byte boundaries. In other words, each buffer must start at +an aligned 8-byte offset.</li> + <li>The general recommendation is to align the buffers at 64-byte boundary, but +this is not absolutely necessary.</li> <li>Any relative type can have null slots</li> <li>Arrays are immutable once created. Implementations can provide APIs to mutate an array, but applying mutations will require a new array data structure to @@ -218,9 +227,9 @@ via byte swapping.</p> <h2 id="alignment-and-padding">Alignment and Padding</h2> -<p>As noted above, all buffers are intended to be aligned in memory at 64 byte -boundaries and padded to a length that is a multiple of 64 bytes. The alignment -requirement follows best practices for optimized memory access:</p> +<p>As noted above, all buffers must be aligned in memory at 8-byte boundaries and padded +to a length that is a multiple of 8 bytes. The alignment requirement follows best +practices for optimized memory access:</p> <ul> <li>Elements in numeric arrays will be guaranteed to be retrieved via aligned access.</li> @@ -229,12 +238,14 @@ requirement follows best practices for optimized memory access:</p> data-structures over 64 bytes (which will be a common case for Arrow Arrays).</li> </ul> -<p>Requiring padding to a multiple of 64 bytes allows for using <a href="https://software.intel.com/en-us/node/600110">SIMD</a> instructions +<p>Recommending padding to a multiple of 64 bytes allows for using <a href="https://software.intel.com/en-us/node/600110">SIMD</a> instructions consistently in loops without additional conditional checks. -This should allow for simpler and more efficient code. +This should allow for simpler, efficient and CPU cache-friendly code. The specific padding length was chosen because it matches the largest known -SIMD instruction registers available as of April 2016 (Intel AVX-512). -Guaranteed padding can also allow certain compilers +SIMD instruction registers available as of April 2016 (Intel AVX-512). In other +words, we can load the entire 64-byte buffer into a 512-bit wide SIMD register +and get data-level parallelism on all the columnar values packed into the 64-byte +buffer. Guaranteed padding can also allow certain compilers to generate more optimized code directly (e.g. One can safely use Intelâs <code class="highlighter-rouge">-qopt-assume-safe-padding</code>).</p> http://git-wip-us.apache.org/repos/asf/arrow-site/blob/3b67853c/docs/metadata.html ---------------------------------------------------------------------- diff --git a/docs/metadata.html b/docs/metadata.html index 76da9eb..7382193 100644 --- a/docs/metadata.html +++ b/docs/metadata.html @@ -106,17 +106,22 @@ --> <!--- - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. See accompanying LICENSE file. + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. --> <h1 id="metadata-logical-types-schemas-data-headers">Metadata: Logical types, schemas, data headers</h1> http://git-wip-us.apache.org/repos/asf/arrow-site/blob/3b67853c/feed.xml ---------------------------------------------------------------------- diff --git a/feed.xml b/feed.xml index f01301e..453eee8 100644 --- a/feed.xml +++ b/feed.xml @@ -1,4 +1,125 @@ -<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.4.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2017-07-27T11:28:36-04:00</updated><id>/</id><entry><title type="html">Speeding up PySpark with Apache Arrow</title><link href="/blog/2017/07/26/spark-arrow/" rel="alternate" type="text/html" title="Speeding up PySpark with Apache Arrow" /><published>2017-07-26T12:00:00-04:00</published><updated>2017-07-26T12:00:00-04:00</updated><id>/blog/2017/07/26/spark-arrow</id><content type="html" xml:base="/blog/2017/07/26/spark-arrow/"><!-- +<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.4.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2017-08-08T10:25:08-04:00</updated><id>/</id><entry><title type="html">Plasma In-Memory Object Store</title><link href="/blog/2017/08/08/plasma-in-memory-object-store/" rel="alternate" type="text/html" title="Plasma In-Memory Object Store" /><published>2017-08-08T00:00:00-04:00</published><updated>2017-08-08T00:00:00-04:00</updated><id>/blog/2017/08/08/plasma-in-memory-object-store</id><content type="html" xml:base="/blog/2017/08/08/plasma-in-memory-object-store/"><!-- + +--> + +<p><em><a href="https://people.eecs.berkeley.edu/~pcmoritz/">Philipp Moritz</a> and <a href="http://www.robertnishihara.com">Robert Nishihara</a> are graduate students at UC + Berkeley.</em></p> + +<h2 id="plasma-a-high-performance-shared-memory-object-store">Plasma: A High-Performance Shared-Memory Object Store</h2> + +<h3 id="motivating-plasma">Motivating Plasma</h3> + +<p>This blog post presents Plasma, an in-memory object store that is being +developed as part of Apache Arrow. <strong>Plasma holds immutable objects in shared +memory so that they can be accessed efficiently by many clients across process +boundaries.</strong> In light of the trend toward larger and larger multicore machines, +Plasma enables critical performance optimizations in the big data regime.</p> + +<p>Plasma was initially developed as part of <a href="https://github.com/ray-project/ray">Ray</a>, and has recently been moved +to Apache Arrow in the hopes that it will be broadly useful.</p> + +<p>One of the goals of Apache Arrow is to serve as a common data layer enabling +zero-copy data exchange between multiple frameworks. A key component of this +vision is the use of off-heap memory management (via Plasma) for storing and +sharing Arrow-serialized objects between applications.</p> + +<p><strong>Expensive serialization and deserialization as well as data copying are a +common performance bottleneck in distributed computing.</strong> For example, a +Python-based execution framework that wishes to distribute computation across +multiple Python âworkerâ processes and then aggregate the results in a single +âdriverâ process may choose to serialize data using the built-in <code class="highlighter-rouge">pickle</code> +library. Assuming one Python process per core, each worker process would have to +copy and deserialize the data, resulting in excessive memory usage. The driver +process would then have to deserialize results from each of the workers, +resulting in a bottleneck.</p> + +<p>Using Plasma plus Arrow, the data being operated on would be placed in the +Plasma store once, and all of the workers would read the data without copying or +deserializing it (the workers would map the relevant region of memory into their +own address spaces). The workers would then put the results of their computation +back into the Plasma store, which the driver could then read and aggregate +without copying or deserializing the data.</p> + +<h3 id="the-plasma-api">The Plasma API:</h3> + +<p>Below we illustrate a subset of the API. The C++ API is documented more fully +<a href="https://github.com/apache/arrow/blob/master/cpp/apidoc/tutorials/plasma.md">here</a>, and the Python API is documented <a href="https://github.com/apache/arrow/blob/master/python/doc/source/plasma.rst">here</a>.</p> + +<p><strong>Object IDs:</strong> Each object is associated with a string of bytes.</p> + +<p><strong>Creating an object:</strong> Objects are stored in Plasma in two stages. First, the +object store <em>creates</em> the object by allocating a buffer for it. At this point, +the client can write to the buffer and construct the object within the allocated +buffer. When the client is done, the client <em>seals</em> the buffer making the object +immutable and making it available to other Plasma clients.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Create an object.</span> +<span class="n">object_id</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">plasma</span><span class="o">.</span><span class="n">ObjectID</span><span class="p">(</span><span class="mi">20</span> <span class="o">*</span> <span class="n">b</span><span class="s">'a'</span><span class="p">)</span> +<span class="n">object_size</span> <span class="o">=</span> <span class="mi">1000</span> +<span class="nb">buffer</span> <span class="o">=</span> <span class="n">memoryview</span><span class="p">(</span><span class="n">client</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">object_id</span><span class="p">,</span> <span class="n">object_size</span><span class="p">))</span> + +<span class="c"># Write to the buffer.</span> +<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span> + <span class="nb">buffer</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span> + +<span class="c"># Seal the object making it immutable and available to other clients.</span> +<span class="n">client</span><span class="o">.</span><span class="n">seal</span><span class="p">(</span><span class="n">object_id</span><span class="p">)</span> +</code></pre> +</div> + +<p><strong>Getting an object:</strong> After an object has been sealed, any client who knows the +object ID can get the object.</p> + +<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># Get the object from the store. This blocks until the object has been sealed.</span> +<span class="n">object_id</span> <span class="o">=</span> <span class="n">pyarrow</span><span class="o">.</span><span class="n">plasma</span><span class="o">.</span><span class="n">ObjectID</span><span class="p">(</span><span class="mi">20</span> <span class="o">*</span> <span class="n">b</span><span class="s">'a'</span><span class="p">)</span> +<span class="p">[</span><span class="n">buff</span><span class="p">]</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">get</span><span class="p">([</span><span class="n">object_id</span><span class="p">])</span> +<span class="nb">buffer</span> <span class="o">=</span> <span class="n">memoryview</span><span class="p">(</span><span class="n">buff</span><span class="p">)</span> +</code></pre> +</div> + +<p>If the object has not been sealed yet, then the call to <code class="highlighter-rouge">client.get</code> will block +until the object has been sealed.</p> + +<h3 id="a-sorting-application">A sorting application</h3> + +<p>To illustrate the benefits of Plasma, we demonstrate an <strong>11x speedup</strong> (on a +machine with 20 physical cores) for sorting a large pandas DataFrame (one +billion entries). The baseline is the built-in pandas sort function, which sorts +the DataFrame in 477 seconds. To leverage multiple cores, we implement the +following standard distributed sorting scheme.</p> + +<ul> + <li>We assume that the data is partitioned across K pandas DataFrames and that +each one already lives in the Plasma store.</li> + <li>We subsample the data, sort the subsampled data, and use the result to define +L non-overlapping buckets.</li> + <li>For each of the K data partitions and each of the L buckets, we find the +subset of the data partition that falls in the bucket, and we sort that +subset.</li> + <li>For each of the L buckets, we gather all of the K sorted subsets that fall in +that bucket.</li> + <li>For each of the L buckets, we merge the corresponding K sorted subsets.</li> + <li>We turn each bucket into a pandas DataFrame and place it in the Plasma store.</li> +</ul> + +<p>Using this scheme, we can sort the DataFrame (the data starts and ends in the +Plasma store), in 44 seconds, giving an 11x speedup over the baseline.</p> + +<h3 id="design">Design</h3> + +<p>The Plasma store runs as a separate process. It is written in C++ and is +designed as a single-threaded event loop based on the <a href="https://redis.io/">Redis</a> event loop library. +The plasma client library can be linked into applications. Clients communicate +with the Plasma store via messages serialized using <a href="https://google.github.io/flatbuffers/">Google Flatbuffers</a>.</p> + +<h3 id="call-for-contributions">Call for contributions</h3> + +<p>Plasma is a work in progress, and the API is currently unstable. Today Plasma is +primarily used in <a href="https://github.com/ray-project/ray">Ray</a> as an in-memory cache for Arrow serialized objects. +We are looking for a broader set of use cases to help refine Plasmaâs API. In +addition, we are looking for contributions in a variety of areas including +improving performance and building other language bindings. Please let us know +if you are interested in getting involved with the project.</p></content><author><name>Philipp Moritz and Robert Nishihara</name></author></entry><entry><title type="html">Speeding up PySpark with Apache Arrow</title><link href="/blog/2017/07/26/spark-arrow/" rel="alternate" type="text/html" title="Speeding up PySpark with Apache Arrow" /><published>2017-07-26T12:00:00-04:00</published><updated>2017-07-26T12:00:00-04:00</updated><id>/blog/2017/07/26/spark-arrow</id><content type="html" xml:base="/blog/2017/07/26/spark-arrow/"><!-- -->