It's not obvious from what you pasted, but perhaps the Jupyter notebook is already connected to a running Spark context, while spark-submit needs to get a new spot in the (YARN?) queue.
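One rough way to test that guess is to print what each run is actually attached to and compare the output from the notebook with the output from the spark-submit run. This is only a sketch: `describe_spark_app` and `diff_conf` are helper names I made up, and it assumes `spark` is the SparkSession that already exists in both environments. The attributes it reads (`applicationId`, `master`, `uiWebUrl`, `startTime`, `getConf().getAll()`) are standard PySpark `SparkContext` members.

```python
def describe_spark_app(spark):
    """Collect identifying details of the Spark application behind `spark`.

    `spark` is assumed to be an existing SparkSession (hypothetical helper).
    """
    sc = spark.sparkContext
    return {
        "application_id": sc.applicationId,  # on YARN, looks like application_<ts>_<n>
        "master": sc.master,
        "ui_url": sc.uiWebUrl,
        "start_time_ms": sc.startTime,       # epoch millis when the context started
        "conf": dict(sc.getConf().getAll()),
    }


def diff_conf(conf_a, conf_b):
    """Return only the settings whose values differ between two conf dicts."""
    keys = set(conf_a) | set(conf_b)
    return {
        key: (conf_a.get(key), conf_b.get(key))
        for key in sorted(keys)
        if conf_a.get(key) != conf_b.get(key)
    }
```

Calling `describe_spark_app(spark)` at the top of the script in both places should show a different application ID for every run. A `start_time_ms` in the notebook that is much older than the cell execution would suggest the context (and its executors) was already up before the timing started. `diff_conf` on the two `conf` dicts is a second check that the configurations really are identical.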
I would check the cluster job IDs for both to ensure you're getting new cluster tasks for each.

On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote:

> Hi,
>
> I am facing a weird behaviour while running a python script. Here is what
> the code looks like mostly:
>
> def fn1(ip):
>     some code...
>     ...
>
> def fn2(row):
>     ...
>     some operations
>     ...
>     return row1
>
>
> udf_fn1 = udf(fn1)
> cdf = spark.read.table("xxxx") //hive table is of size > 500 Gigs with
> ~4500 partitions
> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>     .drop("colz") \
>     .withColumnRenamed("colz", "coly")
>
> edf = ddf \
>     .filter(ddf.colp == 'some_value') \
>     .rdd.map(lambda row: fn2(row)) \
>     .toDF()
>
> print edf.count() // simple way for the performance test in both platforms
>
> Now when I run the same code in a brand new jupyter notebook it runs 6x
> faster than when I run this python script using spark-submit. The
> configurations are printed and compared from both the platforms and they
> are exact same. I even tried to run this script in a single cell of jupyter
> notebook and still have the same performance. I need to understand if I am
> missing something in the spark-submit which is causing the issue. I tried
> to minimise the script to reproduce the same error without much code.
>
> Both are run in client mode on a yarn based spark cluster. The machines
> from which both are executed are also the same and from same user.
>
> What i found is the the quantile values for median for one ran with
> jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins. I am not
> able to figure out why this is happening.
>
> Any one faced this kind of issue before or know how to resolve this?
>
> *Regards,*
> *Dhrub*
>

--
*Patrick McCarthy*
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016