This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/main by this push:
new fd7fc44a608 [Website] A journey with Apache Arrow - Part 1 - POST
(#340)
fd7fc44a608 is described below
commit fd7fc44a608539fc8671bad716968b122fa97207
Author: Laurent Quérel <[email protected]>
AuthorDate: Tue Apr 11 04:08:31 2023 -0700
[Website] A journey with Apache Arrow - Part 1 - POST (#340)
This PR is a markdown version of the article proposed on the mailing
list (see
https://lists.apache.org/thread/jxpypxwjh4jhpk2xvj0z3woy7yr0z0sk).
The `author` field currently contains my full name and not the
`apacheId` because I was unable to get `jekyll` to take into account the
change I made in `contributors.yml`.
---------
Co-authored-by: Andrew Lamb <[email protected]>
---
.DS_Store | Bin 0 -> 6148 bytes
_data/contributors.yml | 3 +
...1-our-journey-at-f5-with-apache-arrow-part-1.md | 382 +++++++++++++++++++++
img/journey-apache-arrow/data-types.svg | 1 +
img/journey-apache-arrow/dictionary-encoding.svg | 1 +
.../hierarchical-data-model.svg | 1 +
img/journey-apache-arrow/performance.svg | 1 +
img/journey-apache-arrow/row-vs-columnar.svg | 1 +
img/journey-apache-arrow/schema-optim-process.svg | 1 +
.../simple-vs-complex-data-model.svg | 1 +
10 files changed, 392 insertions(+)
diff --git a/.DS_Store b/.DS_Store
new file mode 100644
index 00000000000..940e942902c
Binary files /dev/null and b/.DS_Store differ
diff --git a/_data/contributors.yml b/_data/contributors.yml
index 0a01adb2555..a7d7b210c4b 100644
--- a/_data/contributors.yml
+++ b/_data/contributors.yml
@@ -58,4 +58,7 @@
- name: Stephanie Hazlitt
apacheId: stephhazlitt
githubId: stephhazlitt
+- name: Laurent Querel
+ apacheId: lquerel # Not a real apacheId
+ githubId: lquerel
# End contributors.yml
diff --git a/_posts/2023-04-11-our-journey-at-f5-with-apache-arrow-part-1.md
b/_posts/2023-04-11-our-journey-at-f5-with-apache-arrow-part-1.md
new file mode 100644
index 00000000000..045bf457826
--- /dev/null
+++ b/_posts/2023-04-11-our-journey-at-f5-with-apache-arrow-part-1.md
@@ -0,0 +1,382 @@
+---
+layout: post
+title: "Our journey at F5 with Apache Arrow (part 1)"
+date: "2023-04-11 00:00:00"
+author: Laurent Quérel
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+Apache Arrow is a technology widely adopted in big data, analytics, and
machine learning applications. In this article, we share
[F5](https://www.f5.com/)'s experience with Arrow, specifically its application
to telemetry, and the challenges we encountered while optimizing the
OpenTelemetry protocol to significantly reduce bandwidth costs. The promising
results we achieved inspired us to share our insights. This article
specifically focuses on transforming relatively complex data structu [...]
+
+## What is Apache Arrow
+
+[Apache Arrow](https://arrow.apache.org/docs/index.html) is an open-source
project offering a standardized, language-agnostic in-memory format for
representing structured and semi-structured data. This enables data sharing and
zero-copy data access between systems, eliminating the need for serialization
and deserialization when exchanging datasets between varying CPU architectures
and programming languages. Furthermore, Arrow libraries feature an extensive
set of high-performance, parall [...]
+
+People often ask how Arrow differs from [Apache
Parquet](https://parquet.apache.org/) and other columnar file formats. Arrow is
designed and optimized for in-memory processing, while Parquet is tailored for
disk-based storage. In reality, these technologies are complementary, with
bridges existing between them to simplify interoperability. In both cases, data
is represented in columns to optimize access, data locality and
compressibility. However, the tradeoffs diffe [...]
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl }}/img/journey-apache-arrow/row-vs-columnar.svg"
width="100%" class="img-responsive" alt="Memory representations: row vs
columnar data.">
+ <figcaption>Fig 1: Memory representations: row vs columnar data.</figcaption>
+</figure>
+
+Figure 1 illustrates the differences in memory representation between
row-oriented and column-oriented approaches. The column-oriented approach
groups data from the same column in a continuous memory area, which facilitates
parallel processing (SIMD) and enhances compression performance.
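
To make Figure 1 concrete, here is a minimal Go sketch contrasting a row-oriented slice of structs with a column-oriented struct of slices. The type and field names are illustrative only, not taken from the figure or from any Arrow API:

```go
package main

import "fmt"

// Row-oriented: all fields of one record sit together in memory.
type MetricRow struct {
	Name      string
	Timestamp int64
	Value     float64
}

// Column-oriented: all values of one field sit together in memory,
// enabling SIMD-friendly scans and better compression of each column.
type MetricColumns struct {
	Name      []string
	Timestamp []int64
	Value     []float64
}

// toColumns converts a row-oriented batch into a columnar batch.
func toColumns(rows []MetricRow) MetricColumns {
	var cols MetricColumns
	for _, r := range rows {
		cols.Name = append(cols.Name, r.Name)
		cols.Timestamp = append(cols.Timestamp, r.Timestamp)
		cols.Value = append(cols.Value, r.Value)
	}
	return cols
}

func main() {
	rows := []MetricRow{{"cpu", 1, 0.42}, {"cpu", 2, 0.43}, {"mem", 2, 512}}
	cols := toColumns(rows)
	// A scan over one column now touches only that column's bytes.
	fmt.Println(cols.Value) // [0.42 0.43 512]
}
```

A query that aggregates only `Value` never has to load the `Name` strings into cache, which is the essence of the columnar advantage.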
+
+## Why are we interested in Apache Arrow
+
+At [F5](https://www.f5.com/), we’ve adopted
[OpenTelemetry](https://opentelemetry.io/) (OTel) as the standard for all
telemetry across our products, such as BIG-IP and NGINX. These products may
generate large volumes of metrics and logs for various reasons, from
performance evaluation to forensic purposes. The data produced by these systems
is typically centralized and processed in dedicated systems. Transporting and
processing this data accounts for a significant portion of the cost asso [...]
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl }}/img/journey-apache-arrow/performance.svg"
width="100%" class="img-responsive" alt="Performance improvement in the
OpenTelemetry Arrow experimental project.">
+ <figcaption>Fig 2: Performance improvement in the OpenTelemetry Arrow
experimental project.</figcaption>
+</figure>
+
+This project has been divided into two phases. The first phase, which is
nearing completion, aims to enhance the protocol's compression ratio. The
second phase, planned for the future, focuses on improving end-to-end
performance by incorporating Apache Arrow throughout all levels, eliminating
the need for conversion between old and new protocols. The results so far are
promising, with our benchmarks showing compression ratio improvements ranging
from x1.5 to x5, depending on the data typ [...]
+
+Arrow relies on a schema to define the structure of data batches that it
processes and transports. The subsequent sections will discuss various
techniques that can be employed to optimize the creation of these schemas.
+
+## How to leverage Arrow to optimize network transport cost
+
+Apache Arrow is a complex project with a rapidly evolving ecosystem, which can
sometimes be overwhelming for newcomers. Fortunately, the Arrow community has
published three introductory articles
[1](https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/),
[2](https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/),
and
[3](https://arrow.apache.org/blog/2022/10/17/arrow-parquet-encoding-part-3/)
that we recommend for those interested in exploring this technology.
+
+This article primarily focuses on transforming data from an arbitrary source format into an
efficient Arrow representation that optimizes both compression ratio and data
processing. There are numerous approaches to this transformation, and we will
examine how these methods can impact compression ratio, CPU usage, and memory
consumption during the conversion process, among other factors.
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl
}}/img/journey-apache-arrow/schema-optim-process.svg" width="100%"
class="img-responsive" alt="Fig 3: Optimization process for the definition of
an Arrow schema.">
+ <figcaption>Fig 3: Optimization process for the definition of an Arrow
schema.</figcaption>
+</figure>
+
+The complexity of your initial model significantly impacts the Arrow mapping
choices you need to make. To begin, it's essential to identify the properties
you want to optimize for your specific context. Compression rate, conversion
speed, memory consumption, speed and ease of use of the final model,
compatibility, and extensibility are all factors that can influence your final
mapping decisions. From there, you must explore multiple alternative schemas.
+
+The choice of the Arrow type and data encoding for each individual field will
affect the performance of your schema. There are various ways to represent
hierarchical data or highly dynamic data models, and multiple options need to
be evaluated in coordination with the configuration of the transport layer.
This transport layer should also be carefully considered. Arrow supports
compression mechanisms and dictionary deltas that may not be active by default.
+
+After several iterations of this process, you should arrive at an optimized
schema that meets the goals you initially set. It's crucial to compare the
performance of your different approaches using real data, as the distribution
of data in each individual field may influence whether you use dictionary
encoding or not. We will now examine these choices in greater detail throughout
the remainder of this article.
+
+## Arrow data type selection
+
+The principles of selecting an Arrow data type are quite similar to those used
when defining a data model for databases. Arrow supports a wide range of data
types. Some of these types are supported by all implementations, while others
are only available for languages with the strongest Arrow community support
(see this [page](https://arrow.apache.org/docs/status.html) for a comparison
matrix of the different implementations). For primitive types, it is generally
preferable to choose the [...]
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl }}/img/journey-apache-arrow/data-types.svg"
width="100%" class="img-responsive" alt="Fig 4: Data types supported by Apache
Arrow.">
+ <figcaption>Fig 4: Data types supported by Apache Arrow.</figcaption>
+</figure>
+
+When selecting the Arrow data type, it's important to consider the size of the
data before and after compression. It's quite possible that the size after
compression is the same for two different types, but the actual size in memory
may be two, four, or even eight times larger (e.g., uint8 vs. uint64). This
difference will impact your ability to process large batches of data and will
also significantly influence the speed of processing these data in memory
(e.g., cache optimization, SIMD [...]
+
+It's also possible to extend these types using an [extension
type](https://arrow.apache.org/docs/format/Columnar.html#extension-types)
mechanism that builds upon one of the currently supported primitive types while
adding specific semantics. This extension mechanism can simplify the use of
this data in your own project, while remaining transparent to intermediate
systems that will interpret this data as a basic primitive type.
+
+There are some variations in the encoding of primitive types, which we will
explore next.
+
+## Data encoding
+
+Another crucial aspect of optimizing your Arrow schema is analyzing the
cardinality of your data. Fields that can have only a limited number of values
will typically be more efficiently represented with a dictionary encoding.
+
+The maximum cardinality of a field determines the data type characteristics of
your dictionary. For instance, for a field representing the status code of an
HTTP transaction, it's preferable to use a dictionary with an index of type
‘uint8’ and a value of type ‘uint16’ (notation: ‘Dictionary<uint8, uint16>’).
This consumes less memory because the main array will be of type ‘[]uint8’.
Even if the range of possible values is greater than 255, as long as the number
of distinct values does n [...]
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl
}}/img/journey-apache-arrow/dictionary-encoding.svg" width="90%"
class="img-responsive" alt="Fig 5: Dictionary encoding.">
+ <figcaption>Fig 5: Dictionary encoding.</figcaption>
+</figure>
+
+Dictionary encoding is highly flexible in Apache Arrow, allowing the creation
of encodings for any Arrow primitive type. The size of the indices can also be
configured based on the context.
+
+In general, it is advisable to use dictionaries in the following cases:
+* Representation of enumerations.
+* Representation of textual or binary fields with a high probability of
redundant values.
+* Representation of fields with cardinalities known to be below 2^16 or 2^32.
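
All of these cases rely on the same underlying mechanism. The standard-library-only sketch below shows what a `Dictionary<uint8, String>` column conceptually stores; `dictEncode` is a hypothetical helper for illustration, not an Arrow API:

```go
package main

import "fmt"

// dictEncode replaces each string with a small integer index into a
// table of unique values, mirroring the layout of an Arrow
// Dictionary<uint8, String> column: a compact index array plus a
// dictionary of distinct values.
func dictEncode(values []string) (indices []uint8, dict []string) {
	seen := map[string]uint8{}
	for _, v := range values {
		idx, ok := seen[v]
		if !ok {
			idx = uint8(len(dict))
			seen[v] = idx
			dict = append(dict, v)
		}
		indices = append(indices, idx)
	}
	return indices, dict
}

func main() {
	indices, dict := dictEncode([]string{"GET", "POST", "GET", "GET"})
	fmt.Println(indices, dict) // [0 1 0 0] [GET POST]
}
```

The redundant strings are stored once; the main array holds only one byte per row, which is what drives both the memory savings and the improved compression ratio.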
+
+Sometimes, the cardinality of a field is not known a priori. For example, a
proxy that transforms a data stream from a row-oriented format into a series of
columnar-encoded batches (e.g., OpenTelemetry collector) may not be able to
predict in advance whether a field will have a fixed number of distinct values.
Two approaches are possible:
+1) a conservative approach using the largest data type (e.g., ‘int64’,
‘string’, etc., instead of dictionary),
+2) an adaptive approach that modifies the schema on the fly based on the
observed cardinality of the field(s). In this second approach, without
cardinality information, you can optimistically start by using a
‘Dictionary<uint8, original-field-type>’ dictionary, then detect a potential
dictionary overflow during conversion, and change the schema to a
‘Dictionary<uint16, original-field-type>’ in case of an overflow. This
technique of automatic management of dictionary overflows will be pre [...]
+
+Recent advancements in Apache Arrow include the implementation of [run-end
encoding](https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout),
a technique that efficiently represents data with sequences of repeated
values. This encoding method is particularly beneficial for handling data sets
containing long stretches of identical values, as it offers a more compact and
optimized representation.
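
The idea behind run-end encoding can be sketched with the standard library alone; `runEndEncode` below is illustrative (the actual Arrow layout stores the run values and run ends as two child arrays):

```go
package main

import "fmt"

// runEndEncode compresses consecutive repeats into (value, runEnd)
// pairs, where runEnd is the exclusive end offset of each run — the
// same idea as Arrow's run-end encoded layout, simplified.
func runEndEncode(values []int64) (runValues []int64, runEnds []int32) {
	for i, v := range values {
		if i == 0 || v != runValues[len(runValues)-1] {
			// Start a new run.
			runValues = append(runValues, v)
			runEnds = append(runEnds, int32(i+1))
		} else {
			// Extend the current run.
			runEnds[len(runEnds)-1] = int32(i + 1)
		}
	}
	return runValues, runEnds
}

func main() {
	vals, ends := runEndEncode([]int64{7, 7, 7, 7, 2, 2, 9})
	fmt.Println(vals, ends) // [7 2 9] [4 6 7]
}
```

Seven values collapse into three runs; the longer the stretches of identical values, the larger the gain.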
+
+In conclusion, dictionary encoding not only occupies less space in memory and
during transfers but also significantly improves the compression ratio and data
processing speed. However, this type of representation requires indirection
when extracting the initial values (although this isn’t always necessary, even
during some data processing operations). Additionally, it is important to
manage dictionary index overflow, especially when the encoded field doesn't
have a well-defined cardinality.
+
+## Hierarchical data
+
+Basic hierarchical data structures translate relatively well into Arrow.
However, as we will see, there are some complications to handle in more
general cases (see figure 6). While Arrow schemas do support nested structures,
maps, and unions, some components of the Arrow ecosystem do not fully support
them, which can make these Arrow data types unsuitable for certain scenarios.
Additionally, unlike most languages and formats, such as Protobuf, Arrow
doesn’t support the concept of a recu [...]
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl
}}/img/journey-apache-arrow/simple-vs-complex-data-model.svg" width="100%"
class="img-responsive" alt="Fig 6: simple vs complex data model.">
+ <figcaption>Fig 6: simple vs complex data model.</figcaption>
+</figure>
+
+### Natural representation
+
+The most straightforward and intuitive approach to representing a simple
hierarchical data model is to use Arrow's list, map, and union data types.
However, it's important to note that some of these data types are not fully
supported throughout the entire Arrow ecosystem. For example, the conversion of
unions to Parquet is [not directly
supported](https://issues.apache.org/jira/browse/PARQUET-756) and requires a
transformation step (see [denormalization & flattening representation](https
[...]
+
+<figure style="text-align: center;">
+ <img src="{{ site.baseurl
}}/img/journey-apache-arrow/hierarchical-data-model.svg" width="80%"
class="img-responsive" alt="Fig 7: initial data model.">
+ <figcaption>Fig 7: initial data model.</figcaption>
+</figure>
+
+The following Go snippet defines an Arrow schema that uses these data types to
represent the model above.
+
+```go
+import "github.com/apache/arrow/go/v11/arrow"
+
+const (
+  GaugeMetricCode arrow.UnionTypeCode = 0
+  SumMetricCode   arrow.UnionTypeCode = 1
+)
+
+var (
+  // uint8Dictionary represents a Dictionary<Uint8, String>.
+  uint8Dictionary = &arrow.DictionaryType{
+    IndexType: arrow.PrimitiveTypes.Uint8,
+    ValueType: arrow.BinaryTypes.String,
+  }
+  // uint16Dictionary represents a Dictionary<Uint16, String>.
+  uint16Dictionary = &arrow.DictionaryType{
+    IndexType: arrow.PrimitiveTypes.Uint16,
+    ValueType: arrow.BinaryTypes.String,
+  }
+
+  Schema = arrow.NewSchema([]arrow.Field{
+    {Name: "resource_metrics", Type: arrow.ListOf(arrow.StructOf([]arrow.Field{
+      {Name: "scope", Type: arrow.StructOf([]arrow.Field{
+        // name and version are dictionary-encoded (Dictionary<Uint16, String>).
+        {Name: "name", Type: uint16Dictionary},
+        {Name: "version", Type: uint16Dictionary},
+      }...)},
+      {Name: "metrics", Type: arrow.ListOf(arrow.StructOf([]arrow.Field{
+        {Name: "name", Type: uint16Dictionary},
+        {Name: "unit", Type: uint8Dictionary},
+        {Name: "timestamp", Type: arrow.FixedWidthTypes.Timestamp_ns},
+        {Name: "metric_type", Type: arrow.PrimitiveTypes.Uint8},
+        {Name: "data_point", Type: arrow.ListOf(arrow.StructOf([]arrow.Field{
+          {Name: "metric", Type: arrow.DenseUnionOf(
+            []arrow.Field{
+              {Name: "gauge", Type: arrow.StructOf([]arrow.Field{
+                {Name: "data_point", Type: arrow.PrimitiveTypes.Float64},
+              }...)},
+              {Name: "sum", Type: arrow.StructOf([]arrow.Field{
+                {Name: "data_point", Type: arrow.PrimitiveTypes.Float64},
+                {Name: "is_monotonic", Type: arrow.FixedWidthTypes.Boolean},
+              }...)},
+            },
+            []arrow.UnionTypeCode{GaugeMetricCode, SumMetricCode},
+          )},
+        }...))},
+      }...))},
+    }...))},
+  }, nil)
+)
+```
+
+In this pattern, we use a union type to represent an inheritance relationship.
There are two types of Arrow union that are optimized for different cases. The
dense union type has a relatively succinct memory representation but doesn’t
support vectorizable operations, making it less efficient during the processing
phase. Conversely, a sparse union supports vectorization operations, but comes
with a memory overhead directly proportional to the number of variants in the
union. Dense and spa [...]
+
+In certain scenarios, it may be more idiomatic to represent the inheritance
relationship using multiple schemas (i.e., one schema per subtype), thereby
avoiding the use of the union type. However, applying this approach to the
aforementioned model may not be optimal, as the data preceding the inheritance
relationship (i.e., `ResourceMetrics`, `Scope`, and `Metrics`) could
potentially be duplicated numerous times. If the relationships between
`ResourceMetrics`, `Metrics`, and `DataPoint` [...]
+
+### Denormalization & Flattening representations
+
+If the `List` type is not supported in your telemetry pipeline, you can
denormalize your data model. This process is often used in the database world
to remove a join between two tables for optimization purposes. In the Arrow
world, denormalization is employed to eliminate the `List` type by duplicating
some data. Once transformed, the previous Arrow schema becomes:
+
+```go
+Schema = arrow.NewSchema([]arrow.Field{
+  {Name: "resource_metrics", Type: arrow.StructOf([]arrow.Field{
+    {Name: "scope", Type: arrow.StructOf([]arrow.Field{
+      // name and version are dictionary-encoded (Dictionary<Uint16, String>).
+      {Name: "name", Type: uint16Dictionary},
+      {Name: "version", Type: uint16Dictionary},
+    }...)},
+    {Name: "metrics", Type: arrow.StructOf([]arrow.Field{
+      {Name: "name", Type: uint16Dictionary},
+      {Name: "unit", Type: uint8Dictionary},
+      {Name: "timestamp", Type: arrow.FixedWidthTypes.Timestamp_ns},
+      {Name: "metric_type", Type: arrow.PrimitiveTypes.Uint8},
+      {Name: "data_point", Type: arrow.StructOf([]arrow.Field{
+        {Name: "metric", Type: arrow.DenseUnionOf(
+          []arrow.Field{
+            {Name: "gauge", Type: arrow.StructOf([]arrow.Field{
+              {Name: "value", Type: arrow.PrimitiveTypes.Float64},
+            }...)},
+            {Name: "sum", Type: arrow.StructOf([]arrow.Field{
+              {Name: "value", Type: arrow.PrimitiveTypes.Float64},
+              {Name: "is_monotonic", Type: arrow.FixedWidthTypes.Boolean},
+            }...)},
+          },
+          []arrow.UnionTypeCode{GaugeMetricCode, SumMetricCode},
+        )},
+      }...)},
+    }...)},
+  }...)},
+}, nil)
+```
+
+List types are eliminated at all levels. The initial semantics of the model
are preserved by duplicating the data of the levels below each data point
value. The memory representation will generally be much larger than the
previous one, but a query engine that does not support the `List` type will
still be able to process this data. Interestingly, once compressed, this way of
representing data may not necessarily be larger than the previous approach.
This is because the columnar represent [...]
+
+If the union type is not supported by some components of your pipeline, it is
also possible to eliminate it by merging the union variants (the nested
structure ‘metric’ is removed, see below).
+
+```go
+Schema = arrow.NewSchema([]arrow.Field{
+  {Name: "resource_metrics", Type: arrow.StructOf([]arrow.Field{
+    {Name: "scope", Type: arrow.StructOf([]arrow.Field{
+      // name and version are dictionary-encoded (Dictionary<Uint16, String>).
+      {Name: "name", Type: uint16Dictionary},
+      {Name: "version", Type: uint16Dictionary},
+    }...)},
+    {Name: "metrics", Type: arrow.StructOf([]arrow.Field{
+      {Name: "name", Type: uint16Dictionary},
+      {Name: "unit", Type: uint8Dictionary},
+      {Name: "timestamp", Type: arrow.FixedWidthTypes.Timestamp_ns},
+      {Name: "metric_type", Type: arrow.PrimitiveTypes.Uint8},
+      {Name: "data_point", Type: arrow.StructOf([]arrow.Field{
+        {Name: "value", Type: arrow.PrimitiveTypes.Float64},
+        {Name: "is_monotonic", Type: arrow.FixedWidthTypes.Boolean},
+      }...)},
+    }...)},
+  }...)},
+}, nil)
+```
+
+The final schema has evolved into a series of nested structures, where the
fields of the union variants are merged into one structure. The trade-off of
this approach is similar to that of sparse union - the more variants, the
higher the memory occupation. Arrow supports validity bitmaps to
identify null values (1 bit per entry) for various data types, including those
that do not have a unique null representation (e.g., primitive types). The use
of validity bitmaps makes the [...]
+
+In some extreme situations where nested structures are not supported, a
flattening approach can be used to address this problem.
+
+```go
+Schema = arrow.NewSchema([]arrow.Field{
+  {Name: "scope_name", Type: uint16Dictionary},
+  {Name: "scope_version", Type: uint16Dictionary},
+  {Name: "metrics_name", Type: uint16Dictionary},
+  {Name: "metrics_unit", Type: uint8Dictionary},
+  {Name: "metrics_timestamp", Type: arrow.FixedWidthTypes.Timestamp_ns},
+  {Name: "metrics_metric_type", Type: arrow.PrimitiveTypes.Uint8},
+  {Name: "metrics_data_point_value", Type: arrow.PrimitiveTypes.Float64},
+  {Name: "metrics_data_point_is_monotonic", Type: arrow.FixedWidthTypes.Boolean},
+}, nil)
+```
+
+The terminal fields (leaves) are renamed by concatenating the names of the
parent structures to provide proper scoping. This type of structure is
supported by all components of the Arrow ecosystem. This approach can be useful
if compatibility is a crucial criterion for your system. However, it shares the
same drawbacks as other alternative denormalization models.
+
+The Arrow ecosystem is evolving rapidly, so it is likely that support for
List, Map, and Union data types in query engines will improve quickly. If
kernel functions are sufficient or preferable for your application, it is
usually possible to utilize these nested types.
+
+### Adaptive/Dynamic representation
+
+Some data models can be more challenging to translate into an Arrow schema,
such as the following Protobuf example. In this example, a collection of
attributes is added to each data point. These attributes are defined using a
recursive definition that most languages and formats, like Protobuf, support
(see the ‘AnyValue’ definition below). Unfortunately, Arrow (like most
classical database schemas) does not support such recursive definition within
schemas.
+
+```protobuf
+syntax = "proto3";
+
+
+message Metric {
+ message DataPoint {
+ repeated Attribute attributes = 1;
+ oneof value {
+ int64 int_value = 2;
+ double double_value = 3;
+ }
+ }
+
+
+ enum MetricType {
+ UNSPECIFIED = 0;
+ GAUGE = 1;
+ SUM = 2;
+ }
+
+
+ message Gauge {
+ DataPoint data_point = 1;
+ }
+
+
+ message Sum {
+ DataPoint data_point = 1;
+ bool is_monotonic = 2;
+ }
+
+
+ string name = 1;
+ int64 timestamp = 2;
+ string unit = 3;
+ MetricType type = 4;
+ oneof metric {
+ Gauge gauge = 5;
+ Sum sum = 6;
+ }
+}
+
+
+message Attribute {
+ string name = 1;
+ AnyValue value = 2;
+}
+
+
+// Recursive definition of AnyValue. AnyValue can be a primitive value, a list
+// of AnyValues, or a list of key-value pairs where the key is a string and
+// the value is an AnyValue.
+message AnyValue {
+ message ArrayValue {
+ repeated AnyValue values = 1;
+ }
+ message KeyValueList {
+ message KeyValue {
+ string key = 1;
+ AnyValue value = 2;
+ }
+ repeated KeyValue values = 1;
+ }
+
+
+ oneof value {
+ int64 int_value = 1;
+ double double_value = 2;
+ string string_value = 3;
+ ArrayValue list_value = 4;
+ KeyValueList kvlist_value = 5;
+ }
+}
+```
+
+If the definition of the attributes were non-recursive, it would be possible
to translate them directly into an Arrow Map type.
+
+To address this kind of issue and further optimize Arrow schema definitions,
you can employ an adaptive and iterative method that automatically constructs
the Arrow schema based on the data being translated. With this approach, fields
are automatically dictionary-encoded according to their cardinalities, unused
fields are eliminated, and recursive structures are represented in a specific
manner. Another solution involves using a multi-schema approach, in which
attributes are depicted in [...]
+
+## Data transport
+
+Unlike Protobuf, an Arrow schema is generally not known a priori by the two
parties participating in an exchange. Before being able to exchange data in
Arrow format, the sender must first communicate the schema to the receiver, as
well as the contents of the dictionaries used in the data. Only after this
initialization phase has been completed can the sender transmit batches of data
in Arrow format. This process, known as [Arrow IPC
Stream](https://wesmckinney.com/blog/arrow-streaming [...]
+
+Using a stateless protocol is possible for large batches because the overhead
of the schema will be negligible compared to the compression gains achieved
using dictionary encoding and columnar representation. However, dictionaries
will have to be communicated for each batch, making this approach generally
less efficient than a stream-oriented approach.
+
+Arrow IPC Stream also supports the concept of "delta dictionaries," which
allows for further optimization of batch transport. When a batch adds data to
an existing dictionary (at the sender's end), Arrow IPC enables sending the
delta dictionary followed by the batch that references it. On the receiver
side, this delta is used to update the existing dictionary, eliminating the
need to retransmit the entire dictionary when changes occur. This optimization
is only possible with a stateful p [...]
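
Conceptually, a delta dictionary is just the suffix of entries appended since the last transmitted state. The sketch below illustrates only this idea; `deltaOf` is hypothetical, not the Arrow IPC API, and it assumes the dictionary only ever grows by appending (which dictionary builders guarantee when producing deltas):

```go
package main

import "fmt"

// deltaOf returns the dictionary entries added since the previously
// transmitted state; only this suffix needs to cross the wire, and the
// receiver appends it to its own copy of the dictionary.
func deltaOf(prev, curr []string) []string {
	return curr[len(prev):]
}

func main() {
	sent := []string{"GET", "POST"}
	now := []string{"GET", "POST", "PUT", "DELETE"}
	fmt.Println(deltaOf(sent, now)) // [PUT DELETE]
}
```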
+
+To fully leverage the column-oriented format of Apache Arrow, it is essential
to consider sorting and compression. If your data model is simple (i.e., flat)
and has one or more columns representing a natural order for your data (e.g.,
timestamp), it might be beneficial to sort your data to optimize the final
compression ratio. Before implementing this optimization, it is recommended to
perform tests on real data since the benefits may vary. In any case, using a
compression algorithm when [...]
+
+Lastly, some implementations (e.g., Arrow Go) are not configured by default to
support delta dictionaries and compression algorithms. Therefore, it is crucial
to ensure that your code employs these options to maximize data transport
efficiency.
+
+## Experiments
+
+If your initial data is complex, it is advisable to conduct your own
experiments to optimize the Arrow representation according to your data and
goals (e.g., optimizing the compression ratio or enhancing the query-ability of
your data in Arrow format). In our case, we developed an overlay for Apache
Arrow that enables us to carry out these experiments with ease, without having
to deal with the intrinsic complexity of Arrow APIs. However, this comes at the
expense of a slower conversion p [...]
+
+We also employed a "black box optimization" approach, which automatically
finds the best combination to meet the objectives we aimed to optimize (refer
to "[Optimize your applications using Google Vertex AI
Vizier](https://cloud.google.com/blog/products/ai-machine-learning/optimize-your-applications-using-google-vertex-ai-vizier)"
for a description of this approach).
+
+## Conclusion and next steps
+
+Essentially, the key concept behind Apache Arrow is that it eliminates the
need for serialization and deserialization, enabling zero-copy data sharing.
Arrow achieves this by defining a language-agnostic, in-memory format that
remains consistent across various implementations. Consequently, raw memory
bytes can be transmitted directly over a network without requiring any
serialization or deserialization, significantly enhancing data processing
efficiency.
+
+Converting a data model to Apache Arrow necessitates adaptation and
optimization work, as we have begun to describe in this article. Many
parameters must be considered, and it is recommended to perform a series of
experiments to validate the various choices made during this process.
+
+Handling highly dynamic data with Arrow can be challenging. Arrow requires the
definition of a static schema, which can sometimes make representing this type
of data complex or suboptimal, especially when the initial schema contains
recursive definitions. This article has discussed several approaches to address
this issue. The next article will be dedicated to a hybrid strategy that
involves adapting the Arrow schema on-the-fly to optimize memory usage,
compression ratio, and processing [...]
diff --git a/img/journey-apache-arrow/data-types.svg
b/img/journey-apache-arrow/data-types.svg
new file mode 100644
index 00000000000..7ded543a04b
--- /dev/null
+++ b/img/journey-apache-arrow/data-types.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="1025.16"
height="316.67"><g transform="translate(-6.19020715634764 -10.811466834104493)"
lucid:page-tab-id="C9CI3RRDGR7e"><path d="M0 0h1760v1360H0z" fill="#fff"/><g
filter="url(#a)"><path d="M27.2 37.8a6 6 0 0 1 6-6H490.5a6 6 0 0 1 6 6V300.5a6
6 0 0 1-6 6H33.2a6 6 0 0 1-6-6z" stroke="#000" stroke-width="2"
fill="#fff"/><use xlink:href="#b" transform="matrix(1,0,0,1,39.1902071563 [...]
\ No newline at end of file
diff --git a/img/journey-apache-arrow/dictionary-encoding.svg
b/img/journey-apache-arrow/dictionary-encoding.svg
new file mode 100644
index 00000000000..c448e28c9c7
--- /dev/null
+++ b/img/journey-apache-arrow/dictionary-encoding.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="800.36"
height="1164.33"><g transform="translate(-12.616853856282205
-3.6666666666276697)" lucid:page-tab-id="re8KJjouTQd5"><path d="M0
0h1760v1360H0z" fill="#fff"/><path d="M32.62 627.67a6 6 0 0 1 6-6h748.36a6 6 0
0 1 6 6v468a6 6 0 0 1-6 6H38.62a6 6 0 0 1-6-6z" fill="#fff"/><path d="M36.33
622.14l.43-.18.92-.22.94-.07h2m3.97 0h3.97m3.98 0h3.98m3.98 0h4m3.97
0h3.98m3.98 0h3.98m4 [...]
\ No newline at end of file
diff --git a/img/journey-apache-arrow/hierarchical-data-model.svg
b/img/journey-apache-arrow/hierarchical-data-model.svg
new file mode 100644
index 00000000000..e0c3a5e4972
--- /dev/null
+++ b/img/journey-apache-arrow/hierarchical-data-model.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="707.33"
height="614.17"><g transform="translate(-5.666666666666629
-0.6666666667055097)" lucid:page-tab-id=".aGI6iY1UkG4"><path d="M0
0h1760v1360H0z" fill="#fff"/><g filter="url(#a)"><path d="M166 27.67a6 6 0 0 1
6-6h206.67a6 6 0 0 1 6 6v51a6 6 0 0 1-6 6H172a6 6 0 0 1-6-6z" stroke="#3a414a"
fill="#fff"/><use xlink:href="#b"
transform="matrix(1,0,0,1,173.9999999999999,29.666666666 [...]
\ No newline at end of file
diff --git a/img/journey-apache-arrow/performance.svg
b/img/journey-apache-arrow/performance.svg
new file mode 100644
index 00000000000..d34522ae591
--- /dev/null
+++ b/img/journey-apache-arrow/performance.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="1090.67"
height="334.67"><g transform="translate(-4.0000000000388525
-4.333333333333371)" lucid:page-tab-id="0_0"><path d="M0 0h1760v1360H0z"
fill="#fff"/><path d="M24 172.33a6 6 0 0 1 6-6h461.33a6 6 0 0 1 6 6V313a6 6 0 0
1-6 6H30a6 6 0 0 1-6-6z" stroke="#000" stroke-opacity="0" fill="#fff"
fill-opacity="0"/><use xlink:href="#a"
transform="matrix(1,0,0,1,29.000000000038852,171.33 [...]
\ No newline at end of file
diff --git a/img/journey-apache-arrow/row-vs-columnar.svg
b/img/journey-apache-arrow/row-vs-columnar.svg
new file mode 100644
index 00000000000..44ae428c601
--- /dev/null
+++ b/img/journey-apache-arrow/row-vs-columnar.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="820.96"
height="724.86"><g transform="translate(-59.314749856245044
-55.31644444882227)" lucid:page-tab-id="hpDIo6YbIs.u"><path d="M0
0h1760v1360H0z" fill="#fff"/><path d="M636.6 524.92h223.67V738.1H636.6z"
fill="url(#a)"/><path d="M146.44 366.8a1 1 0 0 1 1-1h106.34a1 1 0 0 1 1
1v22.6a1 1 0 0 1-1 1H147.44a1 1 0 0 1-1-1z" stroke="#000" stroke-width="2"
fill="#1071e5"/><use xlink:h [...]
\ No newline at end of file
diff --git a/img/journey-apache-arrow/schema-optim-process.svg
b/img/journey-apache-arrow/schema-optim-process.svg
new file mode 100644
index 00000000000..bec27a07d41
--- /dev/null
+++ b/img/journey-apache-arrow/schema-optim-process.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="1478"
height="482.33"><g transform="translate(-10.333333333343532
-3.500000000000327)" lucid:page-tab-id="m6NIKMxDM3Yz"><path d="M0
0h1760v1360H0z" fill="#fff"/><g stroke="#000" stroke-opacity="0"
stroke-width="1.5"><path d="M1089.7 396.85c.13-.85.67-1.35
1.24-2.13l31.82-23.53c1.35-1.26 3.42-.98 4.38.6 1 1.28.73 3.25-.58 4.22l-29.25
21.87 22.1 28.86c1.22 1.32.95 3.3-.4 4.55-1.53. [...]
\ No newline at end of file
diff --git a/img/journey-apache-arrow/simple-vs-complex-data-model.svg
b/img/journey-apache-arrow/simple-vs-complex-data-model.svg
new file mode 100644
index 00000000000..2f2379dc0fb
--- /dev/null
+++ b/img/journey-apache-arrow/simple-vs-complex-data-model.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:lucid="lucid" width="1025.01"
height="636.5"><g transform="translate(0.8803084331024138 -0.6666666667054812)"
lucid:page-tab-id="Zg2IS9ZMycfS"><path d="M0 0h1760v1360H0z" fill="#fff"/><path
d="M19.12 453.17a6 6 0 0 1 6-6h358.54a6 6 0 0 1 6 6v62a6 6 0 0 1-6 6H25.12a6 6
0 0 1-6-6z" stroke="#000" stroke-opacity="0" fill="#fff" fill-opacity="0"/><use
xlink:href="#a" transform="matrix(1,0,0,1,24.119691566 [...]
\ No newline at end of file