Hi Team,

I have two questions regarding Arrow and Spark integration:

1. I am joining two huge tables (about 1 PB each). Will I see a significant
performance improvement if I use the Arrow format before shuffling? In
particular, will the serialization/deserialization cost drop noticeably?

2. Can we store the final data in Arrow format on HDFS and read it back
in another Spark application? If so, how could I do that?
Note: the dataset is transient; splitting the work across applications is
only for easier management. Although Spark provides resiliency within a
single application, we use different languages across ours (in our case
Java and Python).

Thanks,
Subash
