This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/main by this push:
new fd7fc44a608 [Website] A journey with Apache Arrow - Part 1 - POST
(#340)
fd7fc44a608 is described below
commit fd7fc44a608539fc8671bad716968b122fa97207
Author: Laurent Quérel <[email protected]>
AuthorDate: Tue Apr 11 04:08:31 2023 -0700
[Website] A journey with Apache Arrow - Part 1 - POST (#340)
This PR is a markdown version of the article proposed on the mailing
list (see
https://lists.apache.org/thread/jxpypxwjh4jhpk2xvj0z3woy7yr0z0sk).
The `author` field currently contains my full name and not the
`apacheId` because I was unable to get `jekyll` to take into account the
change I made in `contributors.yml`.
---------
Co-authored-by: Andrew Lamb <[email protected]>
---
.DS_Store | Bin 0 -> 6148 bytes
_data/contributors.yml | 3 +
...1-our-journey-at-f5-with-apache-arrow-part-1.md | 382 +++++++++++++++++++++
img/journey-apache-arrow/data-types.svg | 1 +
img/journey-apache-arrow/dictionary-encoding.svg | 1 +
.../hierarchical-data-model.svg | 1 +
img/journey-apache-arrow/performance.svg | 1 +
img/journey-apache-arrow/row-vs-columnar.svg | 1 +
img/journey-apache-arrow/schema-optim-process.svg | 1 +
.../simple-vs-complex-data-model.svg | 1 +
10 files changed, 392 insertions(+)
diff --git a/.DS_Store b/.DS_Store
new file mode 100644
index 00000000000..940e942902c
Binary files /dev/null and b/.DS_Store differ
diff --git a/_data/contributors.yml b/_data/contributors.yml
index 0a01adb2555..a7d7b210c4b 100644
--- a/_data/contributors.yml
+++ b/_data/contributors.yml
@@ -58,4 +58,7 @@
- name: Stephanie Hazlitt
apacheId: stephhazlitt
githubId: stephhazlitt
+- name: Laurent Querel
+ apacheId: lquerel # Not a real apacheId
+ githubId: lquerel
# End contributors.yml
diff --git a/_posts/2023-04-11-our-journey-at-f5-with-apache-arrow-part-1.md
b/_posts/2023-04-11-our-journey-at-f5-with-apache-arrow-part-1.md
new file mode 100644
index 00000000000..045bf457826
--- /dev/null
+++ b/_posts/2023-04-11-our-journey-at-f5-with-apache-arrow-part-1.md
@@ -0,0 +1,382 @@
+---
+layout: post
+title: "Our journey at F5 with Apache Arrow (part 1)"
+date: "2023-04-11 00:00:00"
+author: Laurent Quérel
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+Apache Arrow is a technology widely adopted in big data, analytics, and
machine learning applications. In this article, we share
[F5](https://www.f5.com/)'s experience with Arrow, specifically its application
to telemetry, and the challenges we encountered while optimizing the
OpenTelemetry protocol to significantly reduce bandwidth costs. The promising
results we achieved inspired us to share our insights. This article
specifically focuses on transforming relatively complex data structu [...]
+
+## What is Apache Arrow
+
+[Apache Arrow](https://arrow.apache.org/docs/index.html) is an open-source
project offering a standardized, language-agnostic in-memory format for
representing structured and semi-structured data. This enables data sharing and
zero-copy data access between systems, eliminating the need for serialization
and deserialization when exchanging datasets between varying CPU architectures
and programming languages. Furthermore, Arrow libraries feature an extensive
set of high-performance, parall [...]
+
+People often ask how Arrow differs from [Apache
Parquet](https://parquet.apache.org/) and other columnar file formats. Arrow is
designed and optimized for in-memory processing, while Parquet is tailored for
disk-based storage. In reality, these technologies are complementary, with
bridges existing between them to simplify interoperability. In both cases, data
is represented in columns to optimize access, data locality and
compressibility. However, the tradeoffs diffe [...]
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl }}/img/journey-apache-arrow/row-vs-columnar.svg"
width="100%" class="img-responsive" alt="Memory representations: row vs
columnar data.">
+ <figcaption>Fig 1: Memory representations: row vs columnar data.</figcaption>
+</figure>
+
+Figure 1 illustrates the differences in memory representation between
row-oriented and column-oriented approaches. The column-oriented approach
groups data from the same column in a continuous memory area, which facilitates
parallel processing (SIMD) and enhances compression performance.
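
To make Figure 1 concrete, here is a minimal Go sketch contrasting a row-oriented slice of structs with a column-oriented struct of slices. The type and field names are illustrative only, not taken from the figure or from any Arrow API:

```go
package main

import "fmt"

// Row-oriented: all fields of one record sit together in memory.
type MetricRow struct {
	Name      string
	Timestamp int64
	Value     float64
}

// Column-oriented: all values of one field sit together in memory,
// enabling SIMD-friendly scans and better compression of each column.
type MetricColumns struct {
	Name      []string
	Timestamp []int64
	Value     []float64
}

// toColumns converts a row-oriented batch into a columnar batch.
func toColumns(rows []MetricRow) MetricColumns {
	var cols MetricColumns
	for _, r := range rows {
		cols.Name = append(cols.Name, r.Name)
		cols.Timestamp = append(cols.Timestamp, r.Timestamp)
		cols.Value = append(cols.Value, r.Value)
	}
	return cols
}

func main() {
	rows := []MetricRow{{"cpu", 1, 0.42}, {"cpu", 2, 0.43}, {"mem", 2, 512}}
	cols := toColumns(rows)
	// A scan over one column now touches only that column's bytes.
	fmt.Println(cols.Value) // [0.42 0.43 512]
}
```

A query that aggregates only `Value` never has to load the `Name` strings into cache, which is the essence of the columnar advantage.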
+
+## Why are we interested in Apache Arrow
+
+At [F5](https://www.f5.com/), we’ve adopted
[OpenTelemetry](https://opentelemetry.io/) (OTel) as the standard for all
telemetry across our products, such as BIG-IP and NGINX. These products may
generate large volumes of metrics and logs for various reasons, from
performance evaluation to forensic purposes. The data produced by these systems
is typically centralized and processed in dedicated systems. Transporting and
processing this data accounts for a significant portion of the cost asso [...]
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl }}/img/journey-apache-arrow/performance.svg"
width="100%" class="img-responsive" alt="Performance improvement in the
OpenTelemetry Arrow experimental project.">
+ <figcaption>Fig 2: Performance improvement in the OpenTelemetry Arrow
experimental project.</figcaption>
+</figure>
+
+This project has been divided into two phases. The first phase, which is
nearing completion, aims to enhance the protocol's compression ratio. The
second phase, planned for the future, focuses on improving end-to-end
performance by incorporating Apache Arrow throughout all levels, eliminating
the need for conversion between old and new protocols. The results so far are
promising, with our benchmarks showing compression ratio improvements ranging
from x1.5 to x5, depending on the data typ [...]
+
+Arrow relies on a schema to define the structure of data batches that it
processes and transports. The subsequent sections will discuss various
techniques that can be employed to optimize the creation of these schemas.
+
+## How to leverage Arrow to optimize network transport cost
+
+Apache Arrow is a complex project with a rapidly evolving ecosystem, which can
sometimes be overwhelming for newcomers. Fortunately, the Arrow community has
published three introductory articles
[1](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/),
[2](https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/),
and
[3](https://arrow.apache.org/blog/2022/10/17/arrow-parquet-encoding-part-3/)
that we recommend for those interested in exploring this technology.
+
+This article primarily focuses on transforming data from an arbitrary source format into an
efficient Arrow representation that optimizes both compression ratio and data
processing. There are numerous approaches to this transformation, and we will
examine how these methods can impact compression ratio, CPU usage, and memory
consumption during the conversion process, among other factors.
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl
}}/img/journey-apache-arrow/schema-optim-process.svg" width="100%"
class="img-responsive" alt="Fig 3: Optimization process for the definition of
an Arrow schema.">
+ <figcaption>Fig 3: Optimization process for the definition of an Arrow
schema.</figcaption>
+</figure>
+
+The complexity of your initial model significantly impacts the Arrow mapping
choices you need to make. To begin, it's essential to identify the properties
you want to optimize for your specific context. Compression rate, conversion
speed, memory consumption, speed and ease of use of the final model,
compatibility, and extensibility are all factors that can influence your final
mapping decisions. From there, you must explore multiple alternative schemas.
+
+The choice of the Arrow type and data encoding for each individual field will
affect the performance of your schema. There are various ways to represent
hierarchical data or highly dynamic data models, and multiple options need to
be evaluated in coordination with the configuration of the transport layer.
This transport layer should also be carefully considered. Arrow supports
compression mechanisms and dictionary deltas that may not be active by default.
+
+After several iterations of this process, you should arrive at an optimized
schema that meets the goals you initially set. It's crucial to compare the
performance of your different approaches using real data, as the distribution
of data in each individual field may influence whether you use dictionary
encoding or not. We will now examine these choices in greater detail throughout
the remainder of this article.
+
+## Arrow data type selection
+
+The principles of selecting an Arrow data type are quite similar to those used
when defining a data model for databases. Arrow supports a wide range of data
types. Some of these types are supported by all implementations, while others
are only available for languages with the strongest Arrow community support
(see this [page](https://arrow.apache.org/docs/status.html) for a comparison
matrix of the different implementations). For primitive types, it is generally
preferable to choose the [...]
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl }}/img/journey-apache-arrow/data-types.svg"
width="100%" class="img-responsive" alt="Fig 4: Data types supported by Apache
Arrow.">
+ <figcaption>Fig 4: Data types supported by Apache Arrow.</figcaption>
+</figure>
+
+When selecting the Arrow data type, it's important to consider the size of the
data before and after compression. It's quite possible that the size after
compression is the same for two different types, but the actual size in memory
may be two, four, or even eight times larger (e.g., uint8 vs. uint64). This
difference will impact your ability to process large batches of data and will
also significantly influence the speed of processing these data in memory
(e.g., cache optimization, SIMD [...]
+
+It's also possible to extend these types using an [extension
type](https://arrow.apache.org/docs/format/Columnar.html#extension-types)
mechanism that builds upon one of the currently supported primitive types while
adding specific semantics. This extension mechanism can simplify the use of
this data in your own project, while remaining transparent to intermediate
systems that will interpret this data as a basic primitive type.
+
+There are some variations in the encoding of primitive types, which we will
explore next.
+
+## Data encoding
+
+Another crucial aspect of optimizing your Arrow schema is analyzing the
cardinality of your data. Fields that can have only a limited number of values
will typically be more efficiently represented with a dictionary encoding.
+
+The maximum cardinality of a field determines the data type characteristics of
your dictionary. For instance, for a field representing the status code of an
HTTP transaction, it's preferable to use a dictionary with an index of type
‘uint8’ and a value of type ‘uint16’ (notation: ‘Dictionary<uint8, uint16>’).
This consumes less memory because the main array will be of type ‘[]uint8’.
Even if the range of possible values is greater than 255, as long as the number
of distinct values does n [...]
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl
}}/img/journey-apache-arrow/dictionary-encoding.svg" width="90%"
class="img-responsive" alt="Fig 5: Dictionary encoding.">
+ <figcaption>Fig 5: Dictionary encoding.</figcaption>
+</figure>
+
+Dictionary encoding is highly flexible in Apache Arrow, allowing the creation
of encodings for any Arrow primitive type. The size of the indices can also be
configured based on the context.
+
+In general, it is advisable to use dictionaries in the following cases:
+* Representation of enumerations.
+* Representation of textual or binary fields with a high probability of
redundant values.
+* Representation of fields with cardinalities known to be below 2^16 or 2^32.
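
All of these cases rely on the same underlying mechanism. The standard-library-only sketch below shows what a `Dictionary<uint8, String>` column conceptually stores; `dictEncode` is a hypothetical helper for illustration, not an Arrow API:

```go
package main

import "fmt"

// dictEncode replaces each string with a small integer index into a
// table of unique values, mirroring the layout of an Arrow
// Dictionary<uint8, String> column: a compact index array plus a
// dictionary of distinct values.
func dictEncode(values []string) (indices []uint8, dict []string) {
	seen := map[string]uint8{}
	for _, v := range values {
		idx, ok := seen[v]
		if !ok {
			idx = uint8(len(dict))
			seen[v] = idx
			dict = append(dict, v)
		}
		indices = append(indices, idx)
	}
	return indices, dict
}

func main() {
	indices, dict := dictEncode([]string{"GET", "POST", "GET", "GET"})
	fmt.Println(indices, dict) // [0 1 0 0] [GET POST]
}
```

The redundant strings are stored once; the main array holds only one byte per row, which is what drives both the memory savings and the improved compression ratio.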
+
+Sometimes, the cardinality of a field is not known a priori. For example, a
proxy that transforms a data stream from a row-oriented format into a series of
columnar-encoded batches (e.g., OpenTelemetry collector) may not be able to
predict in advance whether a field will have a fixed number of distinct values.
Two approaches are possible:
+1) a conservative approach using the largest data type (e.g., ‘int64’,
‘string’, etc., instead of dictionary),
+2) an adaptive approach that modifies the schema on the fly based on the
observed cardinality of the field(s). In this second approach, without
cardinality information, you can optimistically start by using a
‘Dictionary<uint8, original-field-type>’ dictionary, then detect a potential
dictionary overflow during conversion, and change the schema to a
‘Dictionary<uint16, original-field-type>’ in case of an overflow. This
technique of automatic management of dictionary overflows will be pre [...]
+
+Recent advancements in Apache Arrow include the implementation of [run-end
encoding](https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout),
a technique that efficiently represents data with sequences of repeated
values. This encoding method is particularly beneficial for handling data sets
containing long stretches of identical values, as it offers a more compact and
optimized representation.
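
The idea behind run-end encoding can be sketched with the standard library alone; `runEndEncode` below is illustrative (the actual Arrow layout stores the run values and run ends as two child arrays):

```go
package main

import "fmt"

// runEndEncode compresses consecutive repeats into (value, runEnd)
// pairs, where runEnd is the exclusive end offset of each run — the
// same idea as Arrow's run-end encoded layout, simplified.
func runEndEncode(values []int64) (runValues []int64, runEnds []int32) {
	for i, v := range values {
		if i == 0 || v != runValues[len(runValues)-1] {
			// Start a new run.
			runValues = append(runValues, v)
			runEnds = append(runEnds, int32(i+1))
		} else {
			// Extend the current run.
			runEnds[len(runEnds)-1] = int32(i + 1)
		}
	}
	return runValues, runEnds
}

func main() {
	vals, ends := runEndEncode([]int64{7, 7, 7, 7, 2, 2, 9})
	fmt.Println(vals, ends) // [7 2 9] [4 6 7]
}
```

Seven values collapse into three runs; the longer the stretches of identical values, the larger the gain.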
+
+In conclusion, dictionary encoding not only occupies less space in memory and
during transfers but also significantly improves the compression ratio and data
processing speed. However, this type of representation requires indirection
when extracting the initial values (although this isn’t always necessary, even
during some data processing operations). Additionally, it is important to
manage dictionary index overflow, especially when the encoded field doesn't
have a well-defined cardinality.
+
+## Hierarchical data
+
+Basic hierarchical data structures translate relatively well into Arrow.
However, as we will see, there are some complications to handle in more
general cases (see figure 6). While Arrow schemas do support nested structures,
maps, and unions, some components of the Arrow ecosystem do not fully support
them, which can make these Arrow data types unsuitable for certain scenarios.
Additionally, unlike most languages and formats, such as Protobuf, Arrow
doesn’t support the concept of a recu [...]
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl
}}/img/journey-apache-arrow/simple-vs-complex-data-model.svg" width="100%"
class="img-responsive" alt="Fig 6: simple vs complex data model.">
+ <figcaption>Fig 6: simple vs complex data model.</figcaption>
+</figure>
+
+### Natural representation
+
+The most straightforward and intuitive approach to representing a simple
hierarchical data model is to use Arrow's list, map, and union data types.
However, it's important to note that some of these data types are not fully
supported throughout the entire Arrow ecosystem. For example, the conversion of
unions to Parquet is [not directly
supported](https://issues.apache.org/jira/browse/PARQUET-756) and requires a
transformation step (see [denormalization & flattening representation](https
[...]
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl
}}/img/journey-apache-arrow/hierarchical-data-model.svg" width="80%"
class="img-responsive" alt="Fig 7: initial data model.">
+ <figcaption>Fig 7: initial data model.</figcaption>
+</figure>
+
+The following Go snippet defines an Arrow schema that uses these data types to
represent the model above.
+
+```go
+import "github.com/apache/arrow/go/v11/arrow"
+
+const (
+  GaugeMetricCode arrow.UnionTypeCode = 0
+  SumMetricCode   arrow.UnionTypeCode = 1
+)
+
+var (
+  // uint8Dictionary represents a Dictionary<Uint8, String>.
+  uint8Dictionary = &arrow.DictionaryType{
+    IndexType: arrow.PrimitiveTypes.Uint8,
+    ValueType: arrow.BinaryTypes.String,
+  }
+  // uint16Dictionary represents a Dictionary<Uint16, String>.
+  uint16Dictionary = &arrow.DictionaryType{
+    IndexType: arrow.PrimitiveTypes.Uint16,
+    ValueType: arrow.BinaryTypes.String,
+  }
+
+  Schema = arrow.NewSchema([]arrow.Field{
+    {Name: "resource_metrics", Type: arrow.ListOf(arrow.StructOf([]arrow.Field{
+      {Name: "scope", Type: arrow.StructOf([]arrow.Field{
+        // name and version are dictionary-encoded (Dictionary<Uint16, String>).
+        {Name: "name", Type: uint16Dictionary},
+        {Name: "version", Type: uint16Dictionary},
+      }...)},
+      {Name: "metrics", Type: arrow.ListOf(arrow.StructOf([]arrow.Field{
+        {Name: "name", Type: uint16Dictionary},
+        {Name: "unit", Type: uint8Dictionary},
+        {Name: "timestamp", Type: arrow.FixedWidthTypes.Timestamp_ns},
+        {Name: "metric_type", Type: arrow.PrimitiveTypes.Uint8},
+        {Name: "data_point", Type: arrow.ListOf(arrow.StructOf([]arrow.Field{
+          {Name: "metric", Type: arrow.DenseUnionOf(
+            []arrow.Field{
+              {Name: "gauge", Type: arrow.StructOf([]arrow.Field{
+                {Name: "data_point", Type: arrow.PrimitiveTypes.Float64},
+              }...)},
+              {Name: "sum", Type: arrow.StructOf([]arrow.Field{
+                {Name: "data_point", Type: arrow.PrimitiveTypes.Float64},
+                {Name: "is_monotonic", Type: arrow.FixedWidthTypes.Boolean},
+              }...)},
+            },
+            []arrow.UnionTypeCode{GaugeMetricCode, SumMetricCode},
+          )},
+        }...))},
+      }...))},
+    }...))},
+  }, nil)
+)
+```
+
+In this pattern, we use a union type to represent an inheritance relationship.
There are two types of Arrow union that are optimized for different cases. The
dense union type has a relatively succinct memory representation but doesn’t
support vectorizable operations, making it less efficient during the processing
phase. Conversely, a sparse union supports vectorization operations, but comes
with a memory overhead directly proportional to the number of variants in the
union. Dense and spa [...]
+
+In certain scenarios, it may be more idiomatic to represent the inheritance
relationship using multiple schemas (i.e., one schema per subtype), thereby
avoiding the use of the union type. However, applying this approach to the
aforementioned model may not be optimal, as the data preceding the inheritance
relationship (i.e., `ResourceMetrics`, `Scope`, and `Metrics`) could
potentially be duplicated numerous times. If the relationships between
`ResourceMetrics`, `Metrics`, and `DataPoint` [...]
+
+### Denormalization & Flattening representations
+
+If the `List` type is not supported in your telemetry pipeline, you can
denormalize your data model. This process is often used in the database world
to remove a join between two tables for optimization purposes. In the Arrow
world, denormalization is employed to eliminate the `List` type by duplicating
some data. Once transformed, the previous Arrow schema becomes:
+
+```go
+Schema = arrow.NewSchema([]arrow.Field{
+  {Name: "resource_metrics", Type: arrow.StructOf([]arrow.Field{
+    {Name: "scope", Type: arrow.StructOf([]arrow.Field{
+      // name and version are dictionary-encoded (Dictionary<Uint16, String>).
+      {Name: "name", Type: uint16Dictionary},
+      {Name: "version", Type: uint16Dictionary},
+    }...)},
+    {Name: "metrics", Type: arrow.StructOf([]arrow.Field{
+      {Name: "name", Type: uint16Dictionary},
+      {Name: "unit", Type: uint8Dictionary},
+      {Name: "timestamp", Type: arrow.FixedWidthTypes.Timestamp_ns},
+      {Name: "metric_type", Type: arrow.PrimitiveTypes.Uint8},
+      {Name: "data_point", Type: arrow.StructOf([]arrow.Field{
+        {Name: "metric", Type: arrow.DenseUnionOf(
+          []arrow.Field{
+            {Name: "gauge", Type: arrow.StructOf([]arrow.Field{
+              {Name: "value", Type: arrow.PrimitiveTypes.Float64},
+            }...)},
+            {Name: "sum", Type: arrow.StructOf([]arrow.Field{
+              {Name: "value", Type: arrow.PrimitiveTypes.Float64},
+              {Name: "is_monotonic", Type: arrow.FixedWidthTypes.Boolean},
+            }...)},
+          },
+          []arrow.UnionTypeCode{GaugeMetricCode, SumMetricCode},
+        )},
+      }...)},
+    }...)},
+  }...)},
+}, nil)
+```
+
+List types are eliminated at all levels. The initial semantics of the model
are preserved by duplicating the data of the levels below each data point
value. The memory representation will generally be much larger than the
previous one, but a query engine that does not support the `List` type will
still be able to process this data. Interestingly, once compressed, this way of
representing data may not necessarily be larger than the previous approach.
This is because the columnar represent [...]
+
+If the union type is not supported by some components of your pipeline, it is
also possible to eliminate it by merging the union variants (the nested
structure ‘metric’ is removed, see below).
+
+```go
+Schema = arrow.NewSchema([]arrow.Field{
+  {Name: "resource_metrics", Type: arrow.StructOf([]arrow.Field{
+    {Name: "scope", Type: arrow.StructOf([]arrow.Field{
+      // name and version are dictionary-encoded (Dictionary<Uint16, String>).
+      {Name: "name", Type: uint16Dictionary},
+      {Name: "version", Type: uint16Dictionary},
+    }...)},
+    {Name: "metrics", Type: arrow.StructOf([]arrow.Field{
+      {Name: "name", Type: uint16Dictionary},
+      {Name: "unit", Type: uint8Dictionary},
+      {Name: "timestamp", Type: arrow.FixedWidthTypes.Timestamp_ns},
+      {Name: "metric_type", Type: arrow.PrimitiveTypes.Uint8},
+      {Name: "data_point", Type: arrow.StructOf([]arrow.Field{
+        {Name: "value", Type: arrow.PrimitiveTypes.Float64},
+        {Name: "is_monotonic", Type: arrow.FixedWidthTypes.Boolean},
+      }...)},
+    }...)},
+  }...)},
+}, nil)
+```
+
+The final schema has evolved into a series of nested structures, where the
fields of the union variants are merged into one structure. The trade-off of
this approach is similar to that of sparse union - the more variants, the
higher the memory occupation. Arrow supports validity bitmaps to
identify null values (1 bit per entry) for various data types, including those
that do not have a unique null representation (e.g., primitive types). The use
of validity bitmaps makes the [...]
+
+In some extreme situations where nested structures are not supported, a
flattening approach can be used to address this problem.
+
+```go
+Schema = arrow.NewSchema([]arrow.Field{
+  {Name: "scope_name", Type: uint16Dictionary},
+  {Name: "scope_version", Type: uint16Dictionary},
+  {Name: "metrics_name", Type: uint16Dictionary},
+  {Name: "metrics_unit", Type: uint8Dictionary},
+  {Name: "metrics_timestamp", Type: arrow.FixedWidthTypes.Timestamp_ns},
+  {Name: "metrics_metric_type", Type: arrow.PrimitiveTypes.Uint8},
+  {Name: "metrics_data_point_value", Type: arrow.PrimitiveTypes.Float64},
+  {Name: "metrics_data_point_is_monotonic", Type: arrow.FixedWidthTypes.Boolean},
+}, nil)
+```
+
+The terminal fields (leaves) are renamed by concatenating the names of the
parent structures to provide proper scoping. This type of structure is
supported by all components of the Arrow ecosystem. This approach can be useful
if compatibility is a crucial criterion for your system. However, it shares the
same drawbacks as other alternative denormalization models.
+
+The Arrow ecosystem is evolving rapidly, so it is likely that support for
List, Map, and Union data types in query engines will improve quickly. If
kernel functions are sufficient or preferable for your application, it is
usually possible to utilize these nested types.
+
+### Adaptive/Dynamic representation
+
+Some data models can be more challenging to translate into an Arrow schema,
such as the following Protobuf example. In this example, a collection of
attributes is added to each data point. These attributes are defined using a
recursive definition that most languages and formats, like Protobuf, support
(see the ‘AnyValue’ definition below). Unfortunately, Arrow (like most
classical database schemas) does not support such recursive definition within
schemas.
+
+```protobuf
+syntax = "proto3";
+
+
+message Metric {
+ message DataPoint {
+ repeated Attribute attributes = 1;
+ oneof value {
+ int64 int_value = 2;
+ double double_value = 3;
+ }
+ }
+
+
+ enum MetricType {
+ UNSPECIFIED = 0;
+ GAUGE = 1;
+ SUM = 2;
+ }
+
+
+ message Gauge {
+ DataPoint data_point = 1;
+ }
+
+
+ message Sum {
+ DataPoint data_point = 1;
+ bool is_monotonic = 2;
+ }
+
+
+ string name = 1;
+ int64 timestamp = 2;
+ string unit = 3;
+ MetricType type = 4;
+ oneof metric {
+ Gauge gauge = 5;
+ Sum sum = 6;
+ }
+}
+
+
+message Attribute {
+ string name = 1;
+ AnyValue value = 2;
+}
+
+
+// Recursive definition of AnyValue. AnyValue can be a primitive value, a list
+// of AnyValues, or a list of key-value pairs where the key is a string and
+// the value is an AnyValue.
+message AnyValue {
+ message ArrayValue {
+ repeated AnyValue values = 1;
+ }
+ message KeyValueList {
+ message KeyValue {
+ string key = 1;
+ AnyValue value = 2;
+ }
+ repeated KeyValue values = 1;
+ }
+
+
+ oneof value {
+ int64 int_value = 1;
+ double double_value = 2;
+ string string_value = 3;
+ ArrayValue list_value = 4;
+ KeyValueList kvlist_value = 5;
+ }
+}
+```
+
+If the definition of the attributes were non-recursive, it would be possible
to translate them directly into an Arrow Map type.
+
+To address this kind of issue and further optimize Arrow schema definitions,
you can employ an adaptive and iterative method that automatically constructs
the Arrow schema based on the data being translated. With this approach, fields
are automatically dictionary-encoded according to their cardinalities, unused
fields are eliminated, and recursive structures are represented in a specific
manner. Another solution involves using a multi-schema approach, in which
attributes are depicted in [...]
+
+## Data transport
+
+Unlike Protobuf, an Arrow schema is generally not known a priori by the two
parties participating in an exchange. Before being able to exchange data in
Arrow format, the sender must first communicate the schema to the receiver, as
well as the contents of the dictionaries used in the data. Only after this
initialization phase has been completed can the sender transmit batches of data
in Arrow format. This process, known as [Arrow IPC
Stream](https://wesmckinney.com/blog/arrow-streaming [...]
+
+Using a stateless protocol is possible for large batches because the overhead
of the schema will be negligible compared to the compression gains achieved
using dictionary encoding and columnar representation. However, dictionaries
will have to be communicated for each batch, making this approach generally
less efficient than a stream-oriented approach.
+
+Arrow IPC Stream also supports the concept of "delta dictionaries," which
allows for further optimization of batch transport. When a batch adds data to
an existing dictionary (at the sender's end), Arrow IPC enables sending the
delta dictionary followed by the batch that references it. On the receiver
side, this delta is used to update the existing dictionary, eliminating the
need to retransmit the entire dictionary when changes occur. This optimization
is only possible with a stateful p [...]
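
Conceptually, a delta dictionary is just the suffix of entries appended since the last transmitted state. The sketch below illustrates only this idea; `deltaOf` is hypothetical, not the Arrow IPC API, and it assumes the dictionary only ever grows by appending (which dictionary builders guarantee when producing deltas):

```go
package main

import "fmt"

// deltaOf returns the dictionary entries added since the previously
// transmitted state; only this suffix needs to cross the wire, and the
// receiver appends it to its own copy of the dictionary.
func deltaOf(prev, curr []string) []string {
	return curr[len(prev):]
}

func main() {
	sent := []string{"GET", "POST"}
	now := []string{"GET", "POST", "PUT", "DELETE"}
	fmt.Println(deltaOf(sent, now)) // [PUT DELETE]
}
```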
+
+To fully leverage the column-oriented format of Apache Arrow, it is essential
to consider sorting and compression. If your data model is simple (i.e., flat)
and has one or more columns representing a natural order for your data (e.g.,
timestamp), it might be beneficial to sort your data to optimize the final
compression ratio. Before implementing this optimization, it is recommended to
perform tests on real data since the benefits may vary. In any case, using a
compression algorithm when [...]
+
+Lastly, some implementations (e.g., Arrow Go) are not configured by default to
support delta dictionaries and compression algorithms. Therefore, it is crucial
to ensure that your code employs these options to maximize data transport
efficiency.
+
+## Experiments
+
+If your initial data is complex, it is advisable to conduct your own
experiments to optimize the Arrow representation according to your data and
goals (e.g., optimizing the compression ratio or enhancing the query-ability of
your data in Arrow format). In our case, we developed an overlay for Apache
Arrow that enables us to carry out these experiments with ease, without having
to deal with the intrinsic complexity of Arrow APIs. However, this comes at the
expense of a slower conversion p [...]
+
+We also employed a "black box optimization" approach, which automatically
finds the best combination to meet the objectives we aimed to optimize (refer
to "[Optimize your applications using Google Vertex AI
Vizier](https://cloud.google.com/blog/products/ai-machine-learning/optimize-your-applications-using-google-vertex-ai-vizier)"
for a description of this approach).
+
+## Conclusion and next steps
+
+Essentially, the key concept behind Apache Arrow is that it eliminates the
need for serialization and deserialization, enabling zero-copy data sharing.
Arrow achieves this by defining a language-agnostic, in-memory format that
remains consistent across various implementations. Consequently, raw memory
bytes can be transmitted directly over a network without requiring any
serialization or deserialization, significantly enhancing data processing
efficiency.
+
+Converting a data model to Apache Arrow necessitates adaptation and
optimization work, as we have begun to describe in this article. Many
parameters must be considered, and it is recommended to perform a series of
experiments to validate the various choices made during this process.
+
+Handling highly dynamic data with Arrow can be challenging. Arrow requires the
definition of a static schema, which can sometimes make representing this type
of data complex or suboptimal, especially when the initial schema contains
recursive definitions. This article has discussed several approaches to address
this issue. The next article will be dedicated to a hybrid strategy that
involves adapting the Arrow schema on-the-fly to optimize memory usage,
compression ratio, and processing [...]
diff --git a/img/journey-apache-arrow/data-types.svg
b/img/journey-apache-arrow/data-types.svg
new file mode 100644
index 00000000000..7ded543a04b
--- /dev/null
+++ b/img/journey-apache-arrow/data-types.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="1025.16"
height="316.67"><g transform="translate(-6.19020715634764 -10.811466834104493)"
lucid:page-tab-id="C9CI3RRDGR7e"><path d="M0 0h1760v1360H0z" fill="#fff"/><g
filter="url(#a)"><path d="M27.2 37.8a6 6 0 0 1 6-6H490.5a6 6 0 0 1 6 6V300.5a6
6 0 0 1-6 6H33.2a6 6 0 0 1-6-6z" stroke="#000" stroke-width="2"
fill="#fff"/><use xlink:href="#b" transform="matrix(1,0,0,1,39.1902071563 [...]
\ No newline at end of file
diff --git a/img/journey-apache-arrow/dictionary-encoding.svg
b/img/journey-apache-arrow/dictionary-encoding.svg
new file mode 100644
index 00000000000..c448e28c9c7
--- /dev/null
+++ b/img/journey-apache-arrow/dictionary-encoding.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="800.36"
height="1164.33"><g transform="translate(-12.616853856282205
-3.6666666666276697)" lucid:page-tab-id="re8KJjouTQd5"><path d="M0
0h1760v1360H0z" fill="#fff"/><path d="M32.62 627.67a6 6 0 0 1 6-6h748.36a6 6 0
0 1 6 6v468a6 6 0 0 1-6 6H38.62a6 6 0 0 1-6-6z" fill="#fff"/><path d="M36.33
622.14l.43-.18.92-.22.94-.07h2m3.97 0h3.97m3.98 0h3.98m3.98 0h4m3.97
0h3.98m3.98 0h3.98m4 [...]
\ No newline at end of file
diff --git a/img/journey-apache-arrow/hierarchical-data-model.svg
b/img/journey-apache-arrow/hierarchical-data-model.svg
new file mode 100644
index 00000000000..e0c3a5e4972
--- /dev/null
+++ b/img/journey-apache-arrow/hierarchical-data-model.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="707.33"
height="614.17"><g transform="translate(-5.666666666666629
-0.6666666667055097)" lucid:page-tab-id=".aGI6iY1UkG4"><path d="M0
0h1760v1360H0z" fill="#fff"/><g filter="url(#a)"><path d="M166 27.67a6 6 0 0 1
6-6h206.67a6 6 0 0 1 6 6v51a6 6 0 0 1-6 6H172a6 6 0 0 1-6-6z" stroke="#3a414a"
fill="#fff"/><use xlink:href="#b"
transform="matrix(1,0,0,1,173.9999999999999,29.666666666 [...]
\ No newline at end of file
diff --git a/img/journey-apache-arrow/performance.svg
b/img/journey-apache-arrow/performance.svg
new file mode 100644
index 00000000000..d34522ae591
--- /dev/null
+++ b/img/journey-apache-arrow/performance.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="1090.67"
height="334.67"><g transform="translate(-4.0000000000388525
-4.333333333333371)" lucid:page-tab-id="0_0"><path d="M0 0h1760v1360H0z"
fill="#fff"/><path d="M24 172.33a6 6 0 0 1 6-6h461.33a6 6 0 0 1 6 6V313a6 6 0 0
1-6 6H30a6 6 0 0 1-6-6z" stroke="#000" stroke-opacity="0" fill="#fff"
fill-opacity="0"/><use xlink:href="#a"
transform="matrix(1,0,0,1,29.000000000038852,171.33 [...]
\ No newline at end of file
diff --git a/img/journey-apache-arrow/row-vs-columnar.svg
b/img/journey-apache-arrow/row-vs-columnar.svg
new file mode 100644
index 00000000000..44ae428c601
--- /dev/null
+++ b/img/journey-apache-arrow/row-vs-columnar.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="820.96"
height="724.86"><g transform="translate(-59.314749856245044
-55.31644444882227)" lucid:page-tab-id="hpDIo6YbIs.u"><path d="M0
0h1760v1360H0z" fill="#fff"/><path d="M636.6 524.92h223.67V738.1H636.6z"
fill="url(#a)"/><path d="M146.44 366.8a1 1 0 0 1 1-1h106.34a1 1 0 0 1 1
1v22.6a1 1 0 0 1-1 1H147.44a1 1 0 0 1-1-1z" stroke="#000" stroke-width="2"
fill="#1071e5"/><use xlink:h [...]
\ No newline at end of file
diff --git a/img/journey-apache-arrow/schema-optim-process.svg
b/img/journey-apache-arrow/schema-optim-process.svg
new file mode 100644
index 00000000000..bec27a07d41
--- /dev/null
+++ b/img/journey-apache-arrow/schema-optim-process.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="1478"
height="482.33"><g transform="translate(-10.333333333343532
-3.500000000000327)" lucid:page-tab-id="m6NIKMxDM3Yz"><path d="M0
0h1760v1360H0z" fill="#fff"/><g stroke="#000" stroke-opacity="0"
stroke-width="1.5"><path d="M1089.7 396.85c.13-.85.67-1.35
1.24-2.13l31.82-23.53c1.35-1.26 3.42-.98 4.38.6 1 1.28.73 3.25-.58 4.22l-29.25
21.87 22.1 28.86c1.22 1.32.95 3.3-.4 4.55-1.53. [...]
\ No newline at end of file
diff --git a/img/journey-apache-arrow/simple-vs-complex-data-model.svg
b/img/journey-apache-arrow/simple-vs-complex-data-model.svg
new file mode 100644
index 00000000000..2f2379dc0fb
--- /dev/null
+++ b/img/journey-apache-arrow/simple-vs-complex-data-model.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="1025.01"
height="636.5"><g transform="translate(0.8803084331024138 -0.6666666667054812)"
lucid:page-tab-id="Zg2IS9ZMycfS"><path d="M0 0h1760v1360H0z" fill="#fff"/><path
d="M19.12 453.17a6 6 0 0 1 6-6h358.54a6 6 0 0 1 6 6v62a6 6 0 0 1-6 6H25.12a6 6
0 0 1-6-6z" stroke="#000" stroke-opacity="0" fill="#fff" fill-opacity="0"/><use
xlink:href="#a" transform="matrix(1,0,0,1,24.119691566 [...]
\ No newline at end of file