Hi,
I am seeing some strange behaviour while running a Python script. Here is
roughly what the code looks like:
from pyspark.sql.functions import udf

def fn1(ip):
    # some code operating on the input value
    ...

def fn2(row):
    # some operations on the row
    ...
    return row1  # row1 is the transformed row

udf_fn1 = udf(fn1)

# Hive table is > 500 GB, with ~4500 partitions
cdf = spark.read.table("xxxx")

ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
    .drop("colz") \
    .withColumnRenamed("coly", "colz")  # restore the original column name

edf = ddf \
    .filter(ddf.colp == 'some_value') \
    .rdd.map(lambda row: fn2(row)) \
    .toDF()

# simple way to compare performance on both platforms
print(edf.count())
Now, when I run the same code in a brand-new Jupyter notebook it runs about
6x faster than when I run this Python script with spark-submit. The
configurations printed from both platforms are exactly the same. I even
tried running the whole script in a single Jupyter cell and still see the
same performance gap. I need to understand whether I am missing something
in the spark-submit setup that is causing this. I have trimmed the script
down to the minimum needed to reproduce the difference.
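
For reference, this is roughly how the configurations can be dumped on both
sides for a line-by-line diff (assuming the session object is called spark):

# Print the effective Spark configuration, sorted so the two
# platforms can be compared line by line.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)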
Both jobs run in client mode on a YARN-based Spark cluster. They are
launched from the same machine and by the same user.
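
The spark-submit invocation looks roughly like the following (the script
name here is illustrative, and no extra resource flags are passed beyond
what the notebook also gets from the defaults):

spark-submit \
    --master yarn \
    --deploy-mode client \
    my_script.py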
What I found from the stage metrics is that the median task duration for
the Jupyter run was ~1.3 minutes, while for the spark-submit run it was
~8.5 minutes. I am not able to figure out why this is happening.
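
(The task-time quantiles come from the stage pages in the Spark UI; the
stage-level numbers can also be pulled from the monitoring REST API along
these lines, with the driver host and application id below being
placeholders for the actual run:)

import requests

# Spark monitoring REST API; host:port and app id are placeholders.
url = "http://driver-host:4040/api/v1/applications/app-id/stages"
for stage in requests.get(url).json():
    print(stage["stageId"], stage["status"], stage["executorRunTime"])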
Has anyone faced this kind of issue before, or does anyone know how to
resolve it?
Regards,
Dhrub