Parallelize Join Problem

Paul.Bauriegel Mon, 08 Apr 2019 08:41:50 -0700

Hi,
I'm struggling with a join of two large DataFrames. The join is extremely slow 
because it is executed on only one worker. At the first checkpoint Spark uses 
all four workers, but at the second it uses only one.
At first I thought it might be related to Spark trying to load the netlib 
libraries in this stage, but I have no idea whether that has anything to do 
with this problem at all.
19-04-08 15:31:50 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS
19-04-08 15:31:50 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS
19-04-08 15:31:50 WARN LAPACK: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemLAPACK
19-04-08 15:31:50 WARN LAPACK: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefLAPAC


Does anyone have a hint about where to look for the bottleneck?

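import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

taxidataFiltered
     // cast the UTC timestamp to a double so it can be compared with df("time")
     .withColumn("time_taxi", col("time_utc").cast(DoubleType))
     .select(col("time_taxi"),
       col("x_longitude_wgs84"),
       col("y_latitude_wgs84"),
       col("imsi_hash"))
     // first checkpoint: runs on all four workers
     .checkpoint()
     // full outer join on time and hash
     .join(df,
       col("time_taxi") === df.col("time")
         && taxidataFiltered.col("hash") === df.col("hash"),
       "OUTER")
     // second checkpoint: runs on only one worker
     .checkpoint()
    ....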
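
A minimal sketch of one way to narrow this down, assuming the slowdown comes
from skewed join keys or from one oversized partition (an assumption, nothing
in the log output above confirms it). The helper names rowsPerPartition and
topJoinKeys are hypothetical and not part of the job above; they use only the
standard DataFrame API.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, desc, spark_partition_id}

// Hypothetical helper: number of rows in each partition of a DataFrame,
// largest partition first. One dominant partition would mean one busy task.
def rowsPerPartition(data: DataFrame): DataFrame =
  data.groupBy(spark_partition_id().as("partition_id"))
    .agg(count("*").as("rows"))
    .orderBy(desc("rows"))

// Hypothetical helper: the n most frequent combinations of the given key
// columns. A few very frequent combinations would indicate key skew.
def topJoinKeys(data: DataFrame, keys: Seq[String], n: Int = 20): DataFrame =
  data.groupBy(keys.map(k => col(k)): _*)
    .count()
    .orderBy(desc("count"))
    .limit(n)
```

For the join above this could be called as topJoinKeys(df, Seq("time",
"hash")).show() on the right-hand side, and as rowsPerPartition(joined).show()
on the join result, where joined is a placeholder name for the DataFrame
returned by the join.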


Thanks in advance,
Paul
