Hi,

I'm struggling with a join of two large DataFrames. The join is extremely slow because it is executed on only one worker. At the first checkpoint Spark uses all four workers, but at the second it uses only one. At first I thought it might be related to Spark trying to load the netlib libraries in this stage, but I have no idea whether that has anything to do with the problem at all.

    19-04-08 15:31:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
    19-04-08 15:31:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
    19-04-08 15:31:50 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
    19-04-08 15:31:50 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPAC

Does anyone have a hint as to where I should look for the bottleneck?

    taxidataFiltered
      .withColumn("time_taxi", col("time_utc").cast(DoubleType))
      .select(col("time_taxi"), col("x_longitude_wgs84"), col("y_latitude_wgs84"), col("imsi_hash"))
      .checkpoint()
      .join(
        df,
        col("time_taxi") === df.col("time") && taxidataFiltered.col("hash") === df.col("hash"),
        "OUTER")
      .checkpoint()
      ....

Thanks in advance,
Paul