[
https://issues.apache.org/jira/browse/FLINK-10929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826677#comment-16826677
]
Liya Fan edited comment on FLINK-10929 at 4/30/19 3:01 AM:
---
Hi [Stephan Ewen|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=StephanEwen]
and [Kurt Young|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ykt836],
thanks a lot for your valuable comments.
There seems to be little debate about point 1).
For point 2), I want to list a few advantages of using Arrow as the internal
data structure for Flink. These conclusions are based on our survey of
research papers, as well as our engineering experience from recent development
on Blink.
# Arrow adopts a columnar data format, which makes it possible to apply SIMD
([link|http://prestodb.rocks/code/simd/]).
SIMD lets one instruction process several values at once, which generally
means higher performance. Java has supported SIMD since Java 8, and later
versions of Java are expected to support it better.
# Arrow provides a more compact memory layout. Put simply, encoding a batch of
rows as Arrow vectors requires much less memory than encoding them as
BinaryRow. This has two consequences:
## With the same amount of memory space, more data can be loaded into memory,
and fewer spills are required.
## The cost of shuffling can be reduced significantly. For example, the amount
of shuffled data for TPC-H Q4 drops to almost half.
# A columnar format makes data compression much easier (see
[paper|http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf], for example).
Our initial experiments have shown some positive results.
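To make points 1 and 2 above a bit more concrete, here is a minimal Java sketch. It is only an illustration under stated assumptions, not Flink or Arrow code. The class LayoutSketch and all of its methods are hypothetical. The row estimate assumes an 8-byte-aligned null bitset plus one 8-byte slot per fixed-length field, and the columnar estimate assumes one packed value buffer plus a one-bit-per-value validity bitmap per column, ignoring buffer padding. The loop in addColumns is the kind of simple loop over contiguous primitive arrays that a JIT compiler may be able to vectorize; whether it actually does depends on the JVM and hardware.

```java
// Hypothetical sketch: not part of Flink or Arrow.
final class LayoutSketch {

    // Assumed row layout: an 8-byte-aligned null bitset, then one 8-byte slot per field.
    public static long rowFormatBytes(int numRows, int numFields) {
        long nullBitsBytes = ((numFields + 63) / 64) * 8L;
        return numRows * (nullBitsBytes + 8L * numFields);
    }

    // Assumed columnar layout: per column, a packed value buffer plus a
    // validity bitmap with one bit per value (buffer padding is ignored).
    public static long columnarBytes(int numRows, int numFields, int bytesPerValue) {
        long valueBytes = (long) numRows * bytesPerValue;
        long validityBytes = (numRows + 7) / 8;
        return numFields * (valueBytes + validityBytes);
    }

    // Element-wise addition of two columns stored as contiguous int arrays.
    public static int[] addColumns(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // 1024 rows with four 4-byte integer fields each.
        System.out.println("row bytes:      " + rowFormatBytes(1024, 4));
        System.out.println("columnar bytes: " + columnarBytes(1024, 4, 4));
        int[] sum = addColumns(new int[] {1, 2, 3}, new int[] {10, 20, 30});
        System.out.println(java.util.Arrays.toString(sum));
    }
}
```

Under these assumptions, 1024 rows of four 4-byte integer fields take 40960 bytes in the row layout and 16896 bytes in the columnar one. Real numbers depend on the actual schemas and encodings, so this is only meant to show where the difference comes from.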
Arrow has more advantages, but I list only a few here for brevity. The TPC-H
performance results in the attachment speak for themselves.
We understand and respect that Blink is currently being merged, and that some
changes should be postponed. We would be glad to help with this process.
However, the amount of change involved is not a good reason to reject Arrow if
we want to make Flink a world-class computing engine. These difficulties can
all be overcome, and we have a plan to carry out the changes incrementally.
was (Author: fan_li_ya):
Hi [Stephan
Ewen|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=StephanEwen]
and [Kurt
Young|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ykt836],
thanks a lot for your valuable comments.
It seems there is not much debate for point 1).
For point 2), I want to list a few advantages of using Arrow as internal data
structures for Flink. These conclusions are based on our investigations of
research papers, as well as our engineering experiences in secondary
development on Blink.
# Arrow adopts columnar data format, which makes it possible to apply SIMD
([link|http://prestodb.rocks/code/simd/]).
SIMD means higher performance. Java starts to support SIMD in Java 8, and it
is expected that high versions of Java will provide better support for SIMD.
# Arrow provides more compact memory layout. To make it simple, encoding a
batch of rows as Arrow vectors will require much less memory space, compared
with encoding by BinaryRow. This, in turn, leads to two results:
## With the same amount of memory space, more data can be loaded into memory,
and fewer spills are required.
## The cost of shuffling can be reduced significantly. For example, the amount
of shuffled data for TPC-H Q4 reduces to almost the half.
# Columnar format makes it much easier for data compression (see
[paper|http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf], for example).
Our initial experiments have shown some positive results.
There are more advantages for Arrow, but I just list a few, due to space
constraints. The TPC-H performance in the attachment speaks for itself.
We understand and respect that Blink is currently being merged, and some
changes should be postponed. We would like to provide our devoted help in this
process.
However, too much change is not a good reason to refuse Arrow, if we want to
make Flink the world-top-level computing engine. All these difficulties can be
overcome. We have a plan to carry out the changes incrementally.
> Add support for Apache Arrow
>
>
> Key: FLINK-10929
> URL: https://issues.apache.org/jira/browse/FLINK-10929
> Project: Flink
> Issue Type: Wish
> Components: Runtime / State Backends
> Reporter: Pedro Cardoso Silva
> Priority: Minor
> Attachments: image-2019-04-10-13-43-08-107.png
>
>
> Investigate the possibility of adding support for Apache Arrow as a
> standardized columnar, memory format for data.
> Given the activity that