[jira] [Comment Edited] (FLINK-10929) Add support for Apache Arrow

2019-04-29 Thread Liya Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826677#comment-16826677
 ] 

Liya Fan edited comment on FLINK-10929 at 4/30/19 3:01 AM:
---

Hi [Stephan 
Ewen|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=StephanEwen] 
and [Kurt 
Young|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ykt836], 
thanks a lot for your valuable comments. 

It seems there is not much debate for point 1). 

For point 2), I want to list a few advantages of using Arrow as internal data 
structures for Flink. These conclusions are based on our investigations of 
research papers, as well as our engineering experiences in recent development 
on Blink.
 # Arrow adopts columnar data format, which makes it possible to apply SIMD 
([link|[http://prestodb.rocks/code/simd/])|http://prestodb.rocks/code/simd/%5d).].
 SIMD means higher performance. Java starts to support SIMD in Java 8, and it 
is expected that high versions of Java will provide better support for SIMD. 
 # Arrow provides more compact memory layout. To make it simple, encoding a 
batch of rows as Arrow vectors will require much less memory space, compared 
with encoding by BinaryRow. This, in turn, leads to two results:
 ## With the same amount of memory space, more data can be loaded into memory, 
and fewer spills are required.
 ## The cost of shuffling can be reduced significantly. For example, the amount 
of shuffled data for TPC-H Q4 reduces to almost the half. 
 # Columnar format makes it much easier for data compression (see 
[paper|http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf], for example). 
Our initial experiments have shown some positive results.

There are more advantages for Arrow, but I just list a few, due to space 
constraints. The TPC-H performance in the attachment speaks for itself.

We understand and respect that Blink is currently being merged, and some 
changes should be postponed. We would like to provide our devoted help in this 
process.

However, too much change is not a good reason to refuse Arrow, if we want to 
make Flink the world-top-level computing engine. All these difficulties can be 
overcome. We have a plan to carry out the changes incrementally.

 

 


was (Author: fan_li_ya):
Hi [Stephan 
Ewen|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=StephanEwen] 
and [Kurt 
Young|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ykt836], 
thanks a lot for your valuable comments. 

It seems there is not much debate for point 1). 

For point 2), I want to list a few advantages of using Arrow as internal data 
structures for Flink. These conclusions are based on our investigations of 
research papers, as well as our engineering experiences in secondary 
development on Blink.
 # Arrow adopts columnar data format, which makes it possible to apply SIMD 
([link|[http://prestodb.rocks/code/simd/])|http://prestodb.rocks/code/simd/%5d).].
 SIMD means higher performance. Java starts to support SIMD in Java 8, and it 
is expected that high versions of Java will provide better support for SIMD. 
 # Arrow provides more compact memory layout. To make it simple, encoding a 
batch of rows as Arrow vectors will require much less memory space, compared 
with encoding by BinaryRow. This, in turn, leads to two results:
 ## With the same amount of memory space, more data can be loaded into memory, 
and fewer spills are required.
 ## The cost of shuffling can be reduced significantly. For example, the amount 
of shuffled data for TPC-H Q4 reduces to almost the half. 
 # Columnar format makes it much easier for data compression (see 
[paper|http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf], for example). 
Our initial experiments have shown some positive results.

There are more advantages for Arrow, but I just list a few, due to space 
constraints. The TPC-H performance in the attachment speaks for itself.

We understand and respect that Blink is currently being merged, and some 
changes should be postponed. We would like to provide our devoted help in this 
process.

However, too much change is not a good reason to refuse Arrow, if we want to 
make Flink the world-top-level computing engine. All these difficulties can be 
overcome. We have a plan to carry out the changes incrementally.

 

 

> Add support for Apache Arrow
> 
>
> Key: FLINK-10929
> URL: https://issues.apache.org/jira/browse/FLINK-10929
> Project: Flink
>  Issue Type: Wish
>  Components: Runtime / State Backends
>Reporter: Pedro Cardoso Silva
>Priority: Minor
> Attachments: image-2019-04-10-13-43-08-107.png
>
>
> Investigate the possibility of adding support for Apache Arrow as a 
> standardized columnar, memory format for data.
> Given the activity that 

[jira] [Comment Edited] (FLINK-10929) Add support for Apache Arrow

2019-04-11 Thread Liya Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815155#comment-16815155
 ] 

Liya Fan edited comment on FLINK-10929 at 4/11/19 7:45 AM:
---

Hi [~yanghua], thank you so much for your feedback. Below is an initial draft 
of our design document:

[https://docs.google.com/document/d/1LstaiGzlzTdGUmyG_-9GSWleKpLRid8SJWXQCTqcQB4/edit?usp=sharing]

Please give your valuable comments.


was (Author: fan_li_ya):
Hi [~yanghua], thank you so much for your feedback. Below is an initial draft 
of our design document:

https://docs.google.com/document/d/1LstaiGzlzTdGUmyG_-9GSWleKpLRid8SJWXQCTqcQB4/edit?usp=sharing
 

> Add support for Apache Arrow
> 
>
> Key: FLINK-10929
> URL: https://issues.apache.org/jira/browse/FLINK-10929
> Project: Flink
>  Issue Type: Wish
>  Components: Runtime / State Backends
>Reporter: Pedro Cardoso Silva
>Priority: Minor
> Attachments: image-2019-04-10-13-43-08-107.png
>
>
> Investigate the possibility of adding support for Apache Arrow as a 
> standardized columnar, memory format for data.
> Given the activity that [https://github.com/apache/arrow] is currently 
> getting and its claims objective of providing a zero-copy, standardized data 
> format across platforms, I think it makes sense for Flink to look into 
> supporting it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-10929) Add support for Apache Arrow

2019-04-10 Thread Fabian Hueske (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699093#comment-16699093
 ] 

Fabian Hueske edited comment on FLINK-10929 at 4/10/19 8:14 AM:


Arrow is a columnar in-memory data storage / exchange format. This means it was 
not designed with point updates / queries in mind which is the access pattern 
for a state backend in Flink. I'm not saying that there is no use case for 
Arrow in streaming, but IMO there is no obvious one. 

For batch SQL, yes, it might make sense but it would be a huge change 
(basically completely rewriting the data processing engine). 
One could also think of adding interfaces to consume or export Arrow data. 
There are probably more scenarios in which Arrow could be used.

I think we need more detail to decide whether this proposal is something that 
is useful (and achievable) for Flink or not.
[~pcless], what was the use case you had in mind when opening this Jira?





was (Author: fhueske):
Arrow is a columnar in-memory data storage / exchange format. This means it was 
not designed with point updates / queries in mind which is the access pattern 
for a state backend in Flink. I'm not saying that there is no use case for 
Arrow in streaming, but IMO there is obvious one. 

For batch SQL, yes, it might make sense but it would be a huge change 
(basically completely rewriting the data processing engine). 
One could also think of adding interfaces to consume or export Arrow data. 
There are probably more scenarios in which Arrow could be used.

I think we need more detail to decide whether this proposal is something that 
is useful (and achievable) for Flink or not.
[~pcless], what was the use case you had in mind when opening this Jira?




> Add support for Apache Arrow
> 
>
> Key: FLINK-10929
> URL: https://issues.apache.org/jira/browse/FLINK-10929
> Project: Flink
>  Issue Type: Wish
>  Components: Runtime / State Backends
>Reporter: Pedro Cardoso Silva
>Priority: Minor
> Attachments: image-2019-04-10-13-43-08-107.png
>
>
> Investigate the possibility of adding support for Apache Arrow as a 
> standardized columnar, memory format for data.
> Given the activity that [https://github.com/apache/arrow] is currently 
> getting and its claims objective of providing a zero-copy, standardized data 
> format across platforms, I think it makes sense for Flink to look into 
> supporting it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)