Just checked from where the script is submitted, i.e. on the driver side: the
Python environments are different. The Jupyter one runs inside a virtual
environment with Python 2.7.5, while the spark-submit one uses 2.6.6. But the
executors should have the same Python version, right? I tried doing a
spark-submit from the Jupyter shell; it fails because Python 2.7 is not
present there and hence throws an error.
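In case it helps, here is a minimal sketch (assuming the SparkSession is
available as spark, as in the script below) that prints which interpreter the
driver and the executors are actually using:

import sys
import platform

# interpreter used by the driver process
print("driver: %s %s" % (sys.executable, platform.python_version()))

# interpreter(s) used by the executor worker processes: run a tiny job and
# collect the distinct (executable, version) pairs they report
def worker_python(_):
    import sys, platform
    return [(sys.executable, platform.python_version())]

workers = (spark.sparkContext.parallelize(range(8), 8)
           .mapPartitions(worker_python)
           .distinct()
           .collect())
print("executors: %s" % workers)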
Here is the udf which might take time:

import base64
import zlib

def decompress(data):
    # base64-decode the column value, then gunzip it (32 + MAX_WBITS tells
    # zlib to accept a gzip header) and return the result as unicode text
    bytecode = base64.b64decode(data)
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    decompressed_data = d.decompress(bytecode)
    return decompressed_data.decode('utf-8')
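For reference, a sketch of how the decompress function could be wired up as a
Spark UDF (the table name "xxxx" and the columns colz/coly are taken from the
snippet in the original mail below):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# declare the return type explicitly; StringType matches the decoded text
decompress_udf = udf(decompress, StringType())

cdf = spark.read.table("xxxx")
ddf = cdf.withColumn("coly", decompress_udf(cdf.colz)).drop("colz")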
Could this be because of the Python environment mismatch on the driver side?
But the processing happens on the executor side, doesn't it?
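If the mismatch turns out to matter, one thing I can try (an untested sketch;
the interpreter path is hypothetical and would have to exist on the executor
nodes too) is pinning the executor Python explicitly and launching spark-submit
from inside the same virtualenv, so the driver matches as well:

import os
from pyspark.sql import SparkSession

# hypothetical virtualenv interpreter path; replace with the real one.
# PYSPARK_PYTHON controls which interpreter the executors spawn for python
# workers; it has to be set before the SparkContext is created.
os.environ["PYSPARK_PYTHON"] = "/path/to/venv/bin/python2.7"

spark = SparkSession.builder.appName("decompress-test").getOrCreate()

# the driver interpreter is simply whichever python runs this script, so for
# spark-submit that means invoking it from inside the same virtualenv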
*Regards,*
*Dhrub*
On Wed, Sep 11, 2019 at 8:59 AM Abdeali Kothari <[email protected]>
wrote:
> Maybe you can try running it in a python shell or jupyter-console/ipython
> instead of a spark-submit and check how much time it takes too.
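> A quick sketch for the timing part (the payload is generated on the spot, and
> it assumes the decompress() function from this thread is pasted into the shell):
>
> import io, gzip, base64, timeit
>
> # build a sample payload shaped like the table data: gzip-compressed, base64-encoded
> buf = io.BytesIO()
> gz = gzip.GzipFile(fileobj=buf, mode="wb")
> gz.write(("some text " * 100000).encode("utf-8"))
> gz.close()
> sample = base64.b64encode(buf.getvalue())
>
> # time the decompress() function outside Spark, in each interpreter
> print(timeit.timeit(lambda: decompress(sample), number=100))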
>
> Compare the env variables to check that no additional env configuration is
> present in either environment.
>
> Also, is the python environment the exact same for both? I ask because it
> looks like you're using a UDF, and if the Jupyter python has (let's say)
> numpy compiled with blas, it would be faster than a numpy without it, etc.
> I.e., some library you use may be pure python in one environment and a
> faster C extension in the other...
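> As a concrete check for the UDF in this thread (zlib is a C extension, so
> its build can differ between the two interpreters), something like:
>
> import sys, zlib
> print(sys.version)          # interpreter build
> print(zlib.ZLIB_VERSION)    # version of the underlying zlib C library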
>
> What python libraries are you using in the UDFs? If you don't use UDFs at
> all and use some very simple pure spark functions, does the time difference
> still exist?
>
> Also, are you using dynamic allocation or some similar spark config which
> could vary performance between runs because the same resources were not
> utilized on Jupyter / spark-submit?
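> For example (a sketch, assuming a SparkSession named spark), you could print
> the relevant settings in both runs:
>
> conf = spark.sparkContext.getConf()
> print(conf.get("spark.dynamicAllocation.enabled", "false"))
> print(conf.get("spark.executor.instances", "not set"))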
>
>
> On Wed, Sep 11, 2019, 08:43 Stephen Boesch <[email protected]> wrote:
>
>> Sounds like you have done your homework to properly compare. I'm
>> guessing the answer to the following is yes, but in any case: are they
>> both running against the same spark cluster with the same configuration
>> parameters, especially executor memory and number of workers?
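>> One way to check (a sketch, assuming a SparkSession named spark in both
>> environments) is to dump the effective configuration on each side and diff
>> the two outputs:
>>
>> # effective Spark configuration, sorted so the outputs diff cleanly
>> for key, value in sorted(spark.sparkContext.getConf().getAll()):
>>     print("%s=%s" % (key, value))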
>>
>> On Tue, Sep 10, 2019 at 8:05 PM Dhrubajyoti Hati <
>> [email protected]>:
>>
>>> No, I checked for that; hence I wrote "brand new" jupyter notebook. Also,
>>> the times taken by the two are 30 mins and ~3 hrs, as I am reading 500 GB of
>>> compressed, base64-encoded text data from a hive table and decompressing and
>>> decoding it in one of the udfs. Also, the times compared are from the Spark
>>> UI, not how long the job takes after submission; it's just the running time
>>> I am comparing.
>>>
>>> As mentioned earlier, all the spark conf params match in the two
>>> scripts, which is why I am puzzled about what is going on.
>>>
>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <
>>> [email protected]> wrote:
>>>
>>>> It's not obvious from what you pasted, but perhaps the jupyter notebook
>>>> already is connected to a running spark context, while spark-submit needs
>>>> to get a new spot in the (YARN?) queue.
>>>>
>>>> I would check the cluster job IDs for both to ensure you're getting new
>>>> cluster tasks for each.
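>>>> For instance (a sketch, assuming a SparkSession named spark), the YARN
>>>> application ID printed in each run shows whether every run really got its
>>>> own application:
>>>>
>>>> # distinct runs against YARN should report distinct application IDs
>>>> print(spark.sparkContext.applicationId)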
>>>>
>>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am facing a weird behaviour while running a python script. Here is
>>>>> what the code looks like mostly:
>>>>>
>>>>> def fn1(ip):
>>>>>     some code...
>>>>>     ...
>>>>>
>>>>> def fn2(row):
>>>>>     ...
>>>>>     some operations
>>>>>     ...
>>>>>     return row1
>>>>>
>>>>> udf_fn1 = udf(fn1)
>>>>> # hive table is > 500 GB in size, with ~4500 partitions
>>>>> cdf = spark.read.table("xxxx")
>>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>>     .drop("colz") \
>>>>>     .withColumnRenamed("coly", "colz")
>>>>>
>>>>> edf = ddf \
>>>>> .filter(ddf.colp == 'some_value') \
>>>>> .rdd.map(lambda row: fn2(row)) \
>>>>> .toDF()
>>>>>
>>>>> print edf.count()  # simple way to do the performance test on both platforms
>>>>>
>>>>> Now when I run the same code in a brand new jupyter notebook, it runs
>>>>> 6x faster than when I run this python script using spark-submit. The
>>>>> configurations are printed and compared from both platforms and they
>>>>> are exactly the same. I even tried to run this script in a single cell of a
>>>>> jupyter notebook and it still shows the same performance. I need to
>>>>> understand if I am missing something in the spark-submit run which is
>>>>> causing the issue. I tried to minimise the script so as to reproduce the
>>>>> same issue without much code.
>>>>>
>>>>> Both are run in client mode on a YARN-based spark cluster. The
>>>>> machine from which both are executed is also the same, and both run as
>>>>> the same user.
>>>>>
>>>>> What I found is that the median value in the task-time quantiles for the
>>>>> run with jupyter was 1.3 mins, while for the run with spark-submit it was
>>>>> ~8.5 mins. I am not able to figure out why this is happening.
>>>>>
>>>>> Has anyone faced this kind of issue before, or does anyone know how to
>>>>> resolve it?
>>>>>
>>>>> *Regards,*
>>>>> *Dhrub*
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> *Patrick McCarthy *
>>>>
>>>> Senior Data Scientist, Machine Learning Engineering
>>>>
>>>> Dstillery
>>>>
>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>
>>>