Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
Just checked from where the script is submitted, i.e. wrt the driver: the Python environments are different. The Jupyter one is running within a virtual environment which is Python 2.7.5, and the spark-submit one uses 2.6.6. But the executors have the same Python version, right? I tried doing a spark-submit from
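A quick way to settle this is to print the Python version on both sides; a minimal PySpark sketch, assuming a live SparkSession named spark:

    import sys

    # Python version used by the driver process
    print("driver python:", sys.version)

    def worker_python_version(_):
        # Imported inside the function so the check runs in the executor's interpreter
        import sys
        return sys.version

    # Python versions used by the executor processes (one sample per partition)
    versions = set(
        spark.sparkContext.parallelize(range(8), 8)
             .map(worker_python_version)
             .collect()
    )
    print("executor python:", versions)

If the two sides differ, setting PYSPARK_PYTHON (and PYSPARK_DRIVER_PYTHON) before submitting makes the runs comparable.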

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Abdeali Kothari
Maybe you can try running it in a Python shell or jupyter-console/ipython instead of a spark-submit and check how much time it takes as well. Compare the env variables to check that no additional env configuration is present in either environment. Also, is the Python environment for both the exact
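A minimal sketch of that env-variable comparison, run once from Jupyter and once from the spark-submit driver so the two outputs can be diffed (kept Python 2/3 compatible, since both interpreters are in play in this thread):

    import os

    # Print environment variables sorted by name so two runs can be diffed line by line
    for name in sorted(os.environ):
        print(name + "=" + os.environ[name])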

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Ok. Can't think of why that would happen. On Tue., 10 Sept. 2019 at 20:26, Dhrubajyoti Hati < dhruba.w...@gmail.com> wrote: > As mentioned in the very first mail: > * same cluster it is submitted. > * from same machine they are submitted and also from same user > * each of them has 128

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
As mentioned in the very first mail: * it is submitted to the same cluster * they are submitted from the same machine and by the same user * each of them has 128 executors and 2 cores per executor with 8 GB of memory each, and both of them are getting that while running. To clarify more, let me quote what
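Those numbers can also be read back from inside each run rather than from the launch command; a small sketch assuming a SparkSession named spark (the keys are standard Spark properties):

    conf = spark.sparkContext.getConf()
    for key in ("spark.executor.instances",
                "spark.executor.cores",
                "spark.executor.memory",
                "spark.dynamicAllocation.enabled"):
        # "unset" means the default or cluster-level value applies
        print(key + " = " + conf.get(key, "unset"))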

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Sounds like you have done your homework to properly compare. I'm guessing the answer to the following is yes, but in any case: are they both running against the same Spark cluster with the same configuration parameters, especially executor memory and number of workers? On Tue., 10 Sept. 2019

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
No, I checked for that, hence wrote "brand new" Jupyter notebook. Also the times taken by the two are 30 mins and ~3 hrs, as I am reading ~500 GB of compressed, base64-encoded text data from a Hive table and decompressing and decoding it in one of the UDFs. Also, the time compared is from the Spark UI, not how
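For context, the decode-and-decompress UDF described above might look roughly like this sketch; the gzip framing, the table name, and the column name are assumptions, not taken from the original post:

    import base64
    import gzip
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def decode_and_decompress(payload):
        # base64-decode the stored text, then gunzip it back to the original string
        raw = base64.b64decode(payload)
        return gzip.decompress(raw).decode("utf-8")

    decode_udf = udf(decode_and_decompress, StringType())

    df = spark.read.table("some_db.encoded_table")             # placeholder table name
    out = df.withColumn("decoded", decode_udf(df["payload"]))  # placeholder column name

A row-at-a-time Python UDF like this pushes all 500 GB through the Python workers, so any difference in the worker-side Python environment gets amplified across the whole run.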

Access all of the custom streaming query listeners that were registered to spark session

2019-09-10 Thread Natalie Ruiz
Hello, is there a way to access all of the custom listeners that have been registered to a Spark session? I want to remove the listeners that I am no longer using, except I don't know what they were saved as; I just see testing output messages on my streaming query. I created a stack
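There is no obvious way to enumerate listeners after the fact, so the usual pattern is to hold onto the references at registration time; a minimal PySpark sketch, assuming a version recent enough to expose StreamingQueryListener in Python (older releases only offer the Scala/Java listener API):

    from pyspark.sql.streaming import StreamingQueryListener

    class LoggingListener(StreamingQueryListener):
        # Hypothetical listener that just prints streaming progress events
        def onQueryStarted(self, event):
            print("started:", event.id)

        def onQueryProgress(self, event):
            print("progress:", event.progress)

        def onQueryTerminated(self, event):
            print("terminated:", event.id)

    # Keep references at registration time so the listeners can be removed later
    registered = [LoggingListener()]
    for listener in registered:
        spark.streams.addListener(listener)

    # ... later, when the test output is no longer wanted ...
    for listener in registered:
        spark.streams.removeListener(listener)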

Re: question about pyarrow.Table to pyspark.DataFrame conversion

2019-09-10 Thread Bryan Cutler
Hi Artem, I don't believe this is currently possible, but it could be a great addition to PySpark since this would offer a convenient and efficient way to parallelize nested column data. I created the JIRA https://issues.apache.org/jira/browse/SPARK-29040 for this. On Tue, Aug 27, 2019 at 7:55
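Until something like SPARK-29040 exists, a common workaround is to round-trip through pandas; a minimal sketch, assuming Arrow-backed conversion is enabled and that the column types survive the pandas hop (deeply nested columns may not):

    import pyarrow as pa

    # Small in-memory Arrow table as stand-in data
    table = pa.Table.from_pydict({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})

    # Arrow-accelerated pandas <-> Spark conversion
    # (the key is spark.sql.execution.arrow.enabled on Spark 2.x)
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    df = spark.createDataFrame(table.to_pandas())
    df.show()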

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Patrick McCarthy
It's not obvious from what you pasted, but perhaps the Jupyter notebook is already connected to a running Spark context, while spark-submit needs to get a new spot in the (YARN?) queue. I would check the cluster job IDs for both to ensure you're getting new cluster tasks for each. On Tue, Sep
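A minimal sketch of that check, assuming a live SparkSession named spark in both environments; each fresh spark-submit should show a new application ID, while a long-lived notebook kernel keeps reusing the same one (and so pays the queue wait only once):

    sc = spark.sparkContext

    # Compare these across the Jupyter run and the spark-submit run
    print("application id:", sc.applicationId)
    print("master:", sc.master)
    print("start time (ms):", sc.startTime)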

script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
Hi, I am facing a weird behaviour while running a Python script. Here is roughly what the code looks like: def fn1(ip): some code... ... def fn2(row): ... some operations ... return row1 udf_fn1 = udf(fn1) cdf = spark.read.table("") // hive table is of size > 500 GB
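Spelled out, that skeleton corresponds roughly to the following sketch; the table name, column name, and return types are placeholders for the pieces elided in the original message:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-benchmark").getOrCreate()

    def fn1(ip):
        # some code... (elided in the original message)
        return ip

    def fn2(row):
        # ... some operations ... (elided in the original message)
        return row

    udf_fn1 = udf(fn1, StringType())

    # Hive table of size > 500 GB; the real name was omitted in the message
    cdf = spark.read.table("some_db.some_table")
    result = cdf.withColumn("fn1_out", udf_fn1(cdf["some_column"]))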

Deadlock using Barrier Execution

2019-09-10 Thread csmith
I'm using barrier execution in my Spark job but am occasionally seeing deadlocks where the task scheduler is unable to place all the tasks. The failure is logged but the job hangs indefinitely. I have 2 executors with 16 cores each, using standalone mode (I think? I'm using Databricks). The
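For reference, a minimal barrier-execution sketch; the deadlock described typically shows up when a barrier stage has more tasks than simultaneously free slots, because the scheduler must place all of them at once or none at all:

    from pyspark import BarrierTaskContext

    def process_partition(iterator):
        ctx = BarrierTaskContext.get()
        # Every task in the barrier stage must reach this point before any proceeds;
        # if the scheduler cannot launch all tasks together, the stage cannot start.
        ctx.barrier()
        for item in iterator:
            yield item

    # 8 partitions fits within 2 executors x 16 cores, so the barrier can be satisfied
    rdd = spark.sparkContext.parallelize(range(100), 8)
    result = rdd.barrier().mapPartitions(process_partition).collect()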

Custom encoders and udf's

2019-09-10 Thread jelmer
Hi, I am using a org.apache.spark.sql.Encoder to serialize a custom object. I now want to pass this column to a udf so it can do some operations on it but this gives me the error : Caused by: java.lang.ClassCastException: [B cannot be cast to The code included at the problem demonstrates the