Apache Arrow support for Apache Spark

2020-02-16 Thread Subash Prabakar
Hi Team, I have two questions regarding Arrow and Spark integration: 1. I am joining two huge tables (1 PB each). Will performance improve substantially if I use the Arrow format before shuffling? Will the serialization/deserialization cost improve significantly? 2. Can we store the final data in

Connected components using GraphFrames is significantly slower than GraphX?

2020-02-16 Thread kant kodali
Hi All, I am trying to understand why the connected components algorithm runs much slower than the GraphX equivalent. The GraphX code creates 16 stages. GraphFrame graphFrame = GraphFrame.fromEdges(edges); Dataset connectedComponents = graphFrame.connectedComponents().setAlgorithm("graphx").run(); and the