Hi Team,

I have two questions regarding Arrow and Spark integration:

1. I am joining two huge tables (about 1 PB each). Will I see a significant
performance improvement if I use the Arrow format before shuffling? In
particular, will the serialization/deserialization cost drop noticeably?

2. Can we store the final data in Arrow format on HDFS and read it back
in another Spark application? If so, how could I do that?
Note: the dataset is transient; splitting the work across applications is
only for easier management. Although Spark provides resiliency within a
single application, we use different languages across ours (in our case
Java and Python).

Thanks,
Subash
