1. I'd also consider how you're structuring the data before applying the join; doing the join naively could be expensive, so a bit of data preparation (for example partitioning or bucketing on the join keys) may be necessary to improve join performance. Try to get a baseline first as well. Arrow would help improve it.
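To make (1) concrete, here is a rough PySpark sketch of what I mean by a baseline plus preparation. It assumes Spark 2.4, a hypothetical join key "id", and placeholder paths/table names - adapt to your own schema:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("join-prep-sketch")
    # Arrow speeds up JVM <-> Python transfers (toPandas, pandas UDFs);
    # it does not change Spark's internal shuffle format.
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)

# Placeholder inputs - substitute your real tables.
left = spark.read.parquet("hdfs:///data/left")
right = spark.read.parquet("hdfs:///data/right")

# Baseline: run the naive join once and note the runtime and shuffle size in the Spark UI.
left.join(right, "id").write.mode("overwrite").parquet("hdfs:///tmp/baseline_out")

# Preparation: bucket both sides on the join key with the same bucket count so the
# join can be planned without reshuffling both tables. Repartitioning on the key
# first keeps the number of bucket files per task down.
left.repartition("id").write.mode("overwrite") \
    .bucketBy(512, "id").sortBy("id").saveAsTable("left_bucketed")
right.repartition("id").write.mode("overwrite") \
    .bucketBy(512, "id").sortBy("id").saveAsTable("right_bucketed")

prepared = spark.table("left_bucketed").join(spark.table("right_bucketed"), "id")

Compare the prepared join's shuffle read/write against the baseline before deciding whether the extra write is worth it at 1PB scale.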
2. Try storing it back as Parquet, but written in a way that lets the next application take advantage of predicate pushdown (a sketch follows after the quoted message).

On Mon, 17 Feb 2020, 6:41 pm Subash Prabakar, <subashpraba...@gmail.com> wrote:

> Hi Team,
>
> I have two questions regarding Arrow and Spark integration,
>
> 1. I am joining two huge tables (1PB) each - will the performance be huge
> when I use Arrow format before shuffling ? Will the
> serialization/deserialization cost have significant improvement?
>
> 2. Can we store the final data in Arrow format to HDFS and read them back
> in another Spark application? If so how could I do that ?
> Note: The dataset is transient - separation of responsibility is for
> easier management. Though resiliency inside spark - we use different
> language (in our case Java and Python)
>
> Thanks,
> Subash
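For (2), a minimal sketch of what I had in mind, assuming the downstream job filters on a hypothetical "event_date" column; the paths and column names are placeholders. Partitioning on that column gives partition pruning, and Parquet row-group statistics give predicate pushdown for the remaining filters. Since Parquet is language-neutral, the second application can read it from Java or Python:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-pushdown-sketch").getOrCreate()

# Hypothetical output of the first job.
result = spark.read.parquet("hdfs:///tmp/joined_result")

# Partition on the column the next application filters by, and sort within
# partitions so Parquet row-group min/max statistics stay tight for pushdown.
(result
    .repartition("event_date")
    .sortWithinPartitions("id")
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("hdfs:///data/final"))

# In the next application, filter on those columns so Spark can prune
# partitions and push the remaining predicates down to the Parquet reader.
downstream = (spark.read.parquet("hdfs:///data/final")
              .filter(F.col("event_date") == "2020-02-17")
              .filter(F.col("id") > 1000))
downstream.explain()  # PartitionFilters / PushedFilters should show both predicates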