Re: Parallelize Join Problem

2019-04-17 Thread asma zgolli
How can I figure out whether the data is skewed? Are there any statistics I
can check?
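
One common way to look for skew is to count rows per join key and see whether a few keys dominate. The following is only a minimal sketch under assumptions: the object `SkewCheck` and its helpers `keyCounts` and `topKeyShare` are hypothetical names, and the column names `hash` and `time` are taken from the join condition in the original post, with toy data standing in for the real tables.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, desc, lit}

// Hypothetical helper object, not part of Spark or of the original post.
object SkewCheck {
  // Count rows per join-key combination, largest groups first.
  def keyCounts(df: DataFrame, keys: Seq[String]): DataFrame =
    df.groupBy(keys.map(col): _*)
      .agg(count(lit(1)).as("rows"))
      .orderBy(desc("rows"))

  // Fraction of all rows held by the most frequent key (0.0 for an empty DataFrame).
  def topKeyShare(df: DataFrame, keys: Seq[String]): Double = {
    val total = df.count()
    if (total == 0) 0.0
    else keyCounts(df, keys).head().getAs[Long]("rows").toDouble / total
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("skew-check")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for the real table: three of the four rows share hash "a".
    val toy = Seq(("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 4.0)).toDF("hash", "time")

    keyCounts(toy, Seq("hash")).show(20, truncate = false)
    val share = topKeyShare(toy, Seq("hash"))
    println("share of rows held by the most frequent key: " + share)

    spark.stop()
  }
}
```

If a handful of key values account for a large share of the rows on either side of the join, the tasks that receive those keys will do most of the work. Running the same count on both DataFrames, and on the full pair of join columns, gives a rough picture before looking at task-level metrics in the Spark UI.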

On Wed, Apr 17, 2019 at 20:12, Yeikel wrote:

> It is hard to tell, but your data may be skewed.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
Asma ZGOLLI

PhD student in data engineering - computer science


Re: Parallelize Join Problem

2019-04-17 Thread Yeikel
It is hard to tell, but your data may be skewed.






Parallelize Join Problem

2019-04-08 Thread Paul.Bauriegel
Hi,
I'm struggling with a join of two large DataFrames. The join is extremely slow 
because it is executed on only one worker. At the first checkpoint Spark uses 
all four workers, but at the second it uses only one.
At first I thought it might be related to Spark trying to load the netlib 
libraries in this stage, but I have no idea whether that has anything to do 
with this problem at all.
19-04-08 15:31:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
19-04-08 15:31:50 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
19-04-08 15:31:50 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
19-04-08 15:31:50 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPAC

Does anyone have a hint on where to look for the bottleneck?

taxidataFiltered
  .withColumn("time_taxi", col("time_utc").cast(DoubleType))
  .select(
    col("time_taxi"),
    col("x_longitude_wgs84"),
    col("y_latitude_wgs84"),
    col("imsi_hash"))
  .checkpoint()
  .join(df,
    col("time_taxi") === df.col("time")
      && taxidataFiltered.col("hash") === df.col("hash"),
    "OUTER")
  .checkpoint()
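
To see whether the rows really end up in a single partition after a join like the one above, one option is to count rows per physical partition with Spark's built-in `spark_partition_id()` function. This is a sketch under assumptions, not a diagnosis of this particular job: the object `PartitionCheck` and the helper `rowsPerPartition` are hypothetical names, and the function would be called on the joined DataFrame before the second checkpoint.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{desc, spark_partition_id}

// Hypothetical helper object, not part of Spark or of the original post.
object PartitionCheck {
  // Rows per physical partition, largest first.
  // A single dominant partition means a single long-running task.
  def rowsPerPartition(df: DataFrame): DataFrame =
    df.groupBy(spark_partition_id().as("partition"))
      .count()
      .orderBy(desc("count"))
}
```

Calling `PartitionCheck.rowsPerPartition(joined).show()` on the join result (here `joined` is a placeholder name) lists the partitions by size, and `joined.rdd.getNumPartitions` reports how many partitions exist at all. If nearly all rows sit in one partition, the join keys are a likely place to look; if the rows are spread evenly, the bottleneck is probably elsewhere.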



Thanks in advance,
Paul