Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Abdeali Kothari
In a bash terminal, can you run *export PYSPARK_DRIVER_PYTHON=/path/to/venv/bin/python* and then run the *spark-shell* script? This should mimic the behaviour of Jupyter in spark-shell and should be fast (1-2 mins, similar to the Jupyter notebook). This would confirm the guess that the python2.7 venv
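
The same experiment can be scripted, which makes it easier to repeat with different interpreters. The following is only a sketch: the helper names `env_with_driver_python` and `run_with_driver_python` are hypothetical, and the command to launch is left as a parameter because it depends on the setup.

```python
import os
import subprocess


def env_with_driver_python(venv_python, base_env=None):
    """Return a copy of the environment with PYSPARK_DRIVER_PYTHON set.

    base_env defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_DRIVER_PYTHON"] = venv_python
    return env


def run_with_driver_python(command, venv_python):
    """Run `command` (a list of arguments) with the modified environment."""
    return subprocess.run(
        command, env=env_with_driver_python(venv_python), check=True
    )
```

For example, `run_with_driver_python(["spark-submit", "job.py"], "/path/to/venv/bin/python")`, where `job.py` is a placeholder for the script being timed.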

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
Also, the performance remains identical when running the same script from a Jupyter terminal instead of a normal terminal. In the script the Spark context is created by the spark = SparkSession \ .builder \ .. .. getOrCreate() command. On Wed, Sep 11, 2019 at 10:28 PM Dhrubajyoti Hati wrote: > If

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
If you say that libraries are not transferred by default, and in my case I haven't used any --py-files, then am I facing a 6x speed difference just because the driver Python is different? I am using client mode to submit the program, but the UDFs and all are executed in the executors, then why is

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Abdeali Kothari
The driver Python may not always be the same as the executor Python. You can set these using PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON. The dependent libraries are not transferred by Spark in any way unless you use --py-files or .addPyFile(). Could you try this: *import sys; print(sys.prefix)* on
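
One way to run that check on both sides is sketched below. It assumes an existing SparkSession is passed in as `spark`; the helper names `python_env_info` and `executor_python_envs` are hypothetical, and the number of tasks is an arbitrary choice meant to touch many executors.

```python
import platform
import sys


def python_env_info(_=None):
    """Describe the interpreter this function is running in."""
    return (sys.prefix, sys.executable, platform.python_version())


def executor_python_envs(spark, num_tasks=100):
    """Collect the distinct interpreter descriptions seen by executor tasks.

    `spark` is assumed to be an already created SparkSession.
    """
    rdd = spark.sparkContext.parallelize(range(num_tasks), num_tasks)
    return sorted(set(rdd.map(python_env_info).collect()))
```

Calling `python_env_info()` directly reports the driver's interpreter, while `executor_python_envs(spark)` reports what the executors use. If the two differ between the Jupyter run and the spark-submit run, that would support the guess that the environments are not the same.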

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
But would that be the case for multiple tasks running on the same worker? Also, both tasks are running in client mode, so whatever is true for one is true for both or for neither. As mentioned earlier, all the confs are the same. I have checked and compared each conf. As Abdeali mentioned it must be because

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Patrick McCarthy
Are you running in cluster mode? A large virtualenv zip for the driver sent into the cluster on a slow pipe could account for much of that eight minutes. On Wed, Sep 11, 2019 at 3:17 AM Dhrubajyoti Hati wrote: > Hi, > > I just ran the same script in a shell in jupyter notebook and find the >

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
Hi, I just ran the same script in a shell in the Jupyter notebook and found the performance to be similar. So I can confirm this is happening because the libraries used by the Jupyter notebook Python are different from those of the spark-submit Python. But now I have the following question. Are the dependent

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
Just checked: where the script is submitted from, i.e. with respect to the driver, the Python environments are different. The Jupyter one is running within a virtual environment, which is Python 2.7.5, and the spark-submit one uses 2.6.6. But the executors have the same Python version, right? I tried doing a spark-submit from

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Abdeali Kothari
Maybe you can try running it in a python shell or jupyter-console/ipython instead of a spark-submit and check how much time it takes too. Compare the env variables to check that no additional env configuration is present in either environment. Also is the python environment for both the exact
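
A sketch of how the environment variables could be compared: dump them to a file from inside each environment (the Jupyter kernel and the spark-submit script), then diff the two dumps. The function names and the file paths in the usage note are hypothetical.

```python
import json
import os


def dump_env(path):
    """Write the current process environment to `path` as JSON."""
    with open(path, "w") as handle:
        json.dump(dict(os.environ), handle, indent=2, sort_keys=True)


def load_env(path):
    """Read an environment dump written by dump_env."""
    with open(path) as handle:
        return json.load(handle)


def diff_env(env_a, env_b):
    """Return {name: (value_in_a, value_in_b)} for variables that differ.

    A variable missing from one side shows up with None on that side.
    """
    names = sorted(set(env_a) | set(env_b))
    return {
        name: (env_a.get(name), env_b.get(name))
        for name in names
        if env_a.get(name) != env_b.get(name)
    }
```

For example, call `dump_env("/tmp/env_jupyter.json")` in the notebook and `dump_env("/tmp/env_submit.json")` in the submitted script, then inspect `diff_env(load_env(...), load_env(...))`. Variables such as PATH, PYTHONPATH, PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are the ones most likely to matter here.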

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Ok. Can't think of why that would happen. On Tue, 10 Sep 2019 at 20:26, Dhrubajyoti Hati < dhruba.w...@gmail.com> wrote: > As mentioned in the very first mail: > * same cluster it is submitted. > * from same machine they are submitted and also from same user > * each of them has 128

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
As mentioned in the very first mail:
* both are submitted to the same cluster.
* both are submitted from the same machine and by the same user.
* each of them has 128 executors and 2 cores per executor with 8 GB of memory each, and both of them are getting that while running.
To clarify more, let me quote what

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Sounds like you have done your homework to compare them properly. I'm guessing the answer to the following is yes, but in any case: are they both running against the same Spark cluster with the same configuration parameters, especially executor memory and number of workers? On Tue, 10 Sep 2019

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
No, I checked for that, hence I wrote "brand new" Jupyter notebook. Also, the times taken by the two are 30 mins and ~3 hrs, as I am reading 500 GB of compressed, base64-encoded text data from a Hive table and decompressing and decoding it in one of the UDFs. Also, the time compared is from the Spark UI, not how
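
Since the heavy work is decoding and decompressing inside a Python UDF, it may help to time that step alone under each interpreter, outside Spark. The sketch below assumes zlib compression, which the thread does not state, and the function names are hypothetical stand-ins for the real UDF body.

```python
import base64
import zlib
from timeit import default_timer


def decode_and_decompress(b64_text):
    """Base64-decode, then zlib-decompress, then decode as UTF-8.

    zlib is an assumption; swap in the codec actually used by the data.
    """
    raw = base64.b64decode(b64_text)
    return zlib.decompress(raw).decode("utf-8")


def time_calls(fn, arg, repeat=1000):
    """Return the seconds taken to call fn(arg) `repeat` times."""
    start = default_timer()
    for _ in range(repeat):
        fn(arg)
    return default_timer() - start


if __name__ == "__main__":
    # Synthetic sample; a real row from the table would be more telling.
    sample = base64.b64encode(zlib.compress(b"some text payload " * 1000))
    print(time_calls(decode_and_decompress, sample))
```

Running this file with the virtualenv's Python and with the system Python gives a rough per-row comparison. A large gap would point at the interpreter or its libraries rather than at the cluster configuration.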

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Patrick McCarthy
It's not obvious from what you pasted, but perhaps the Jupyter notebook is already connected to a running Spark context, while spark-submit needs to get a new spot in the (YARN?) queue. I would check the cluster job IDs for both to ensure you're getting new cluster tasks for each. On Tue, Sep

script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
Hi, I am facing a weird behaviour while running a Python script. Here is what the code mostly looks like:

    def fn1(ip):
        some code...
        ...

    def fn2(row):
        ...
        some operations
        ...
        return row1

    udf_fn1 = udf(fn1)
    cdf = spark.read.table("")  # hive table is of size > 500 Gigs