Hi,
I'm struggling with a join of two large DataFrames. The join is extremely slow 
because it is executed on only one worker. At the first checkpoint Spark uses 
all four workers, but at the second it uses only one.
At first I thought it might be related to Spark trying to load the netlib 
libraries in this stage, but I have no idea whether that has anything to do 
with this problem at all.
19-04-08 15:31:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
19-04-08 15:31:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
19-04-08 15:31:50 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
19-04-08 15:31:50 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPAC

Does anyone have a hint about where to look for the bottleneck?

taxidataFiltered
     .withColumn("time_taxi", col("time_utc").cast(DoubleType))
     .select(col("time_taxi"),
       col("x_longitude_wgs84"),
       col("y_latitude_wgs84"),
       col("imsi_hash"))
     .checkpoint()
     .join(df,
       col("time_taxi") === df.col("time")
         && taxidataFiltered.col("hash") === df.col("hash"),
       "OUTER")
     .checkpoint()
    ....
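
One check that might help narrow this down is to look at how the rows are spread over partitions and over the join keys, since a join that ends up on a single worker is often a sign of skewed (or mostly null) keys. The following is only a minimal sketch under that assumption; the helper names rowsPerPartition and topJoinKeys are hypothetical and not part of Spark, and it has not been run against this data.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, desc, spark_partition_id}

// Hypothetical helper (not a Spark API): number of rows in each partition,
// largest first. One dominant partition would point to skew.
def rowsPerPartition(df: DataFrame): DataFrame =
  df.groupBy(spark_partition_id().as("partition_id"))
    .count()
    .orderBy(desc("count"))

// Hypothetical helper (not a Spark API): the n most frequent combinations
// of the given key columns, including null keys, to spot a few values
// that carry most of the rows.
def topJoinKeys(df: DataFrame, keys: Seq[String], n: Int = 20): DataFrame =
  df.groupBy(keys.map(col): _*)
    .count()
    .orderBy(desc("count"))
    .limit(n)
```

For example, topJoinKeys(df, Seq("time", "hash")).show() and rowsPerPartition(df).show() could be run on each side of the join before the second checkpoint.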


Thanks in advance,
Paul
