Hi Team,

I have two questions regarding Arrow and Spark integration:
1. I am joining two huge tables (about 1 PB each). Will there be a significant performance gain if I use the Arrow format before shuffling? In particular, will the serialization/deserialization cost improve noticeably?

2. Can we store the final data in Arrow format on HDFS and read it back in another Spark application? If so, how could we do that?

Note: the dataset is transient; the split into separate applications is purely for separation of responsibility and easier management. Although Spark handles resiliency within a single application, we keep the jobs separate because they use different languages (Java and Python in our case).
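To make the second question concrete, here is a rough PySpark sketch of the hand-off we have in mind. The paths, table names, and join key are placeholders, and the persisted format shown is Parquet; the question is whether an Arrow-based on-disk format could take that role instead.

    # Minimal sketch of the intended hand-off; paths/table names/join key
    # are placeholders, not our real layout.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("join-and-persist")
        # Arrow setting we are aware of; as far as we know it only affects
        # Spark <-> pandas conversion, which is part of what we are asking.
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .getOrCreate()
    )

    # Application 1 (Python): join the two large tables and persist the result.
    left = spark.read.parquet("hdfs:///data/table_a")
    right = spark.read.parquet("hdfs:///data/table_b")
    joined = left.join(right, on="id")

    # Today this would be Parquet; could Arrow be used here instead?
    joined.write.mode("overwrite").parquet("hdfs:///data/joined")

    # Application 2 (could be Java): read the persisted result back.
    reread = spark.read.parquet("hdfs:///data/joined")

Thanks,
Subash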