Maybe you can try running it in a python shell or jupyter-console/ipython
instead of a spark-submit and check how much time it takes too.

Compare the env variables to check that no additional env configuration is
present in either environment.

Also is the python environment for both the exact same? I ask because it
looks like you're using a UDF and if the Jupyter python has (let's say)
numpy compiled with blas it would be faster than a numpy without it. Etc.
I.E. Some library you use may be using pure python and another may be using
a faster C extension...

What python libraries are you using in the UDFs? It you don't use UDFs at
all and use some very simple pure spark functions does the time difference
still exist?

Also are you using dynamic allocation or some similar spark config which
could vary performance between runs because the same resources we're not
utilized on Jupyter / spark-submit?


On Wed, Sep 11, 2019, 08:43 Stephen Boesch <java...@gmail.com> wrote:

> Sounds like you have done your homework to properly compare .   I'm
> guessing the answer to the following is yes .. but in any case:  are they
> both running against the same spark cluster with the same configuration
> parameters especially executor memory and number of workers?
>
> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati <
> dhruba.w...@gmail.com>:
>
>> No, i checked for that, hence written "brand new" jupyter notebook. Also
>> the time taken by both are 30 mins and ~3hrs as i am reading a 500  gigs
>> compressed base64 encoded text data from a hive table and decompressing and
>> decoding in one of the udfs. Also the time compared is from Spark UI not
>> how long the job actually takes after submission. Its just the running time
>> i am comparing/mentioning.
>>
>> As mentioned earlier, all the spark conf params even match in two scripts
>> and that's why i am puzzled what going on.
>>
>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <pmccar...@dstillery.com>
>> wrote:
>>
>>> It's not obvious from what you pasted, but perhaps the juypter notebook
>>> already is connected to a running spark context, while spark-submit needs
>>> to get a new spot in the (YARN?) queue.
>>>
>>> I would check the cluster job IDs for both to ensure you're getting new
>>> cluster tasks for each.
>>>
>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <dhruba.w...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am facing a weird behaviour while running a python script. Here is
>>>> what the code looks like mostly:
>>>>
>>>> def fn1(ip):
>>>>    some code...
>>>>     ...
>>>>
>>>> def fn2(row):
>>>>     ...
>>>>     some operations
>>>>     ...
>>>>     return row1
>>>>
>>>>
>>>> udf_fn1 = udf(fn1)
>>>> cdf = spark.read.table("xxxx") //hive table is of size > 500 Gigs with
>>>> ~4500 partitions
>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>     .drop("colz") \
>>>>     .withColumnRenamed("colz", "coly")
>>>>
>>>> edf = ddf \
>>>>     .filter(ddf.colp == 'some_value') \
>>>>     .rdd.map(lambda row: fn2(row)) \
>>>>     .toDF()
>>>>
>>>> print edf.count() // simple way for the performance test in both
>>>> platforms
>>>>
>>>> Now when I run the same code in a brand new jupyter notebook it runs 6x
>>>> faster than when I run this python script using spark-submit. The
>>>> configurations are printed and  compared from both the platforms and they
>>>> are exact same. I even tried to run this script in a single cell of jupyter
>>>> notebook and still have the same performance. I need to understand if I am
>>>> missing something in the spark-submit which is causing the issue.  I tried
>>>> to minimise the script to reproduce the same error without much code.
>>>>
>>>> Both are run in client mode on a yarn based spark cluster. The machines
>>>> from which both are executed are also the same and from same user.
>>>>
>>>> What i found is the  the quantile values for median for one ran with
>>>> jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins.  I am not
>>>> able to figure out why this is happening.
>>>>
>>>> Any one faced this kind of issue before or know how to resolve this?
>>>>
>>>> *Regards,*
>>>> *Dhrub*
>>>>
>>>
>>>
>>> --
>>>
>>>
>>> *Patrick McCarthy  *
>>>
>>> Senior Data Scientist, Machine Learning Engineering
>>>
>>> Dstillery
>>>
>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>
>>

Reply via email to