Just checked from where the script is submitted, i.e. w.r.t. the driver: the Python environments are different. The Jupyter one runs inside a virtual environment with Python 2.7.5, and the spark-submit one uses 2.6.6. But the executors have the same Python version, right? I tried doing a spark-submit from the Jupyter shell; it fails because Python 2.7 is not there, and hence throws an error.
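A quick way to confirm this is to collect the interpreter version on both the driver and the executors in the same session. This is only a sketch: the executor part assumes an already-running SparkSession named `spark`, so it is left commented out; the driver part is plain Python you can run under both Jupyter and spark-submit and compare.

```python
import sys

# On the driver: this is whatever interpreter launched the script
# (the virtualenv's 2.7.5 under Jupyter vs. the system 2.6.6 under
# spark-submit, in the setup described above).
driver_py = "%d.%d.%d" % sys.version_info[:3]
print("driver python: %s at %s" % (driver_py, sys.executable))

# On the executors (sketch; assumes an active SparkSession `spark`):
# versions = (spark.sparkContext
#             .parallelize(range(8), 8)
#             .map(lambda _: "%d.%d.%d" % __import__("sys").version_info[:3])
#             .distinct()
#             .collect())
# print("executor python(s): %s" % versions)
```

Running the driver part from both launch modes, and the executor part once from each, shows directly whether the mismatch is driver-only or reaches the executors.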
Here is the UDF which might take time:

    import base64
    import zlib

    def decompress(data):
        bytecode = base64.b64decode(data)
        d = zlib.decompressobj(32 + zlib.MAX_WBITS)
        decompressed_data = d.decompress(bytecode)
        return decompressed_data.decode('utf-8')

Could this be because of the mismatch between the two Python environments on the driver side? But the processing happens on the executor side?

*Regards,*
*Dhrub*

On Wed, Sep 11, 2019 at 8:59 AM Abdeali Kothari <abdealikoth...@gmail.com> wrote:

> Maybe you can try running it in a python shell or jupyter-console/ipython
> instead of a spark-submit and check how much time it takes too.
>
> Compare the env variables to check that no additional env configuration is
> present in either environment.
>
> Also, is the python environment for both the exact same? I ask because it
> looks like you're using a UDF, and if the Jupyter python has (let's say)
> numpy compiled with BLAS it would be faster than a numpy without it, etc.
> I.e. some library you use may be using pure python and another may be using
> a faster C extension...
>
> What python libraries are you using in the UDFs? If you don't use UDFs at
> all and use some very simple pure spark functions, does the time difference
> still exist?
>
> Also, are you using dynamic allocation or some similar spark config which
> could vary performance between runs because the same resources were not
> utilized on Jupyter / spark-submit?
>
> On Wed, Sep 11, 2019, 08:43 Stephen Boesch <java...@gmail.com> wrote:
>
>> Sounds like you have done your homework to properly compare. I'm
>> guessing the answer to the following is yes, but in any case: are they
>> both running against the same spark cluster with the same configuration
>> parameters, especially executor memory and number of workers?
>>
>> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati <
>> dhruba.w...@gmail.com>:
>>
>>> No, I checked for that, hence written "brand new" jupyter notebook.
>>> Also, the times taken by the two are 30 mins and ~3 hrs, as I am reading
>>> 500 gigs of compressed, base64-encoded text data from a hive table and
>>> decompressing and decoding it in one of the UDFs. Also, the time compared
>>> is from the Spark UI, not how long the job actually takes after
>>> submission. It's just the running time I am comparing/mentioning.
>>>
>>> As mentioned earlier, all the spark conf params match in the two
>>> scripts, and that's why I am puzzled about what's going on.
>>>
>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <
>>> pmccar...@dstillery.com> wrote:
>>>
>>>> It's not obvious from what you pasted, but perhaps the jupyter notebook
>>>> is already connected to a running spark context, while spark-submit needs
>>>> to get a new spot in the (YARN?) queue.
>>>>
>>>> I would check the cluster job IDs for both to ensure you're getting new
>>>> cluster tasks for each.
>>>>
>>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <dhruba.w...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am facing a weird behaviour while running a python script. Here is
>>>>> what the code looks like, mostly:
>>>>>
>>>>> def fn1(ip):
>>>>>     some code...
>>>>>     ...
>>>>>
>>>>> def fn2(row):
>>>>>     ...
>>>>>     some operations
>>>>>     ...
>>>>>     return row1
>>>>>
>>>>> udf_fn1 = udf(fn1)
>>>>> cdf = spark.read.table("xxxx")  # hive table is of size > 500 gigs
>>>>> # with ~4500 partitions
>>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>>     .drop("colz") \
>>>>>     .withColumnRenamed("colz", "coly")
>>>>>
>>>>> edf = ddf \
>>>>>     .filter(ddf.colp == 'some_value') \
>>>>>     .rdd.map(lambda row: fn2(row)) \
>>>>>     .toDF()
>>>>>
>>>>> print edf.count()  # simple way to test performance on both platforms
>>>>>
>>>>> Now when I run the same code in a brand new jupyter notebook, it runs
>>>>> 6x faster than when I run this python script using spark-submit. The
>>>>> configurations are printed and compared from both platforms and they
>>>>> are exactly the same.
>>>>> I even tried to run this script in a single cell of a jupyter
>>>>> notebook and it still shows the same performance. I need to understand
>>>>> if I am missing something in the spark-submit which is causing the
>>>>> issue. I tried to minimise the script so as to reproduce the same
>>>>> behaviour without much code.
>>>>>
>>>>> Both are run in client mode on a YARN-based spark cluster. The
>>>>> machine from which both are executed is also the same, and so is the
>>>>> user.
>>>>>
>>>>> What I found is that the median task time for the one run with
>>>>> jupyter was 1.3 mins and for the one run with spark-submit was
>>>>> ~8.5 mins. I am not able to figure out why this is happening.
>>>>>
>>>>> Has anyone faced this kind of issue before, or does anyone know how
>>>>> to resolve it?
>>>>>
>>>>> *Regards,*
>>>>> *Dhrub*
>>>>
>>>>
>>>> --
>>>>
>>>> *Patrick McCarthy*
>>>>
>>>> Senior Data Scientist, Machine Learning Engineering
>>>>
>>>> Dstillery
>>>>
>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>
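For anyone wanting to benchmark the decompress UDF from earlier in the thread outside Spark, here is a minimal local round-trip sketch. The "hello spark" payload is made up for illustration; `32 + zlib.MAX_WBITS` tells decompressobj to auto-detect a zlib or gzip header, so the round trip works with plain `zlib.compress` output.

```python
import base64
import zlib

def decompress(data):
    # base64 -> zlib/gzip stream -> utf-8 text, as in the UDF above
    bytecode = base64.b64decode(data)
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)  # auto-detect zlib/gzip header
    return d.decompress(bytecode).decode('utf-8')

# Round trip with a made-up payload: compress, base64-encode, decode back.
original = u"hello spark"
payload = base64.b64encode(zlib.compress(original.encode('utf-8')))
print(decompress(payload))  # hello spark
```

Timing this function in isolation (e.g. with timeit) under both the virtualenv's 2.7 and the system 2.6 interpreters would show whether the interpreter difference alone accounts for part of the gap.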