Re: Thoughts on Spark 3 release, or a preview release

2019-09-11 Thread Jean Georges Perrin
As a user/non-committer, +1. I love the idea of an early 3.0.0 so we can test current dev against it. I know the final 3.x will probably need another round of testing when it gets out, but less for sure... I know I could check out and compile, but having a “packaged” pre-version is great if it

Re: Thoughts on Spark 3 release, or a preview release

2019-09-11 Thread Michael Heuer
I would love to see Spark + Hadoop + Parquet + Avro compatibility problems resolved, e.g. https://issues.apache.org/jira/browse/SPARK-25588 https://issues.apache.org/jira/browse/SPARK-27781

Thoughts on Spark 3 release, or a preview release

2019-09-11 Thread Sean Owen
I'm curious what current feelings are about ramping down towards a Spark 3 release. It feels close to ready. There is no fixed date, though in the past we had informally tossed around "back end of 2019". For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect Spark 2 to last longer,

Re: [VOTE] [SPARK-27495] SPIP: Support Stage level resource configuration and scheduling

2019-09-11 Thread Bryan Cutler
+1 (non-binding), looks good!

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Abdeali Kothari
In a bash terminal, can you do: *export PYSPARK_DRIVER_PYTHON=/path/to/venv/bin/python* and then run the *spark-shell* script? This should mimic the behaviour of jupyter in spark-shell and should be fast (1-2 mins, similar to the jupyter notebook). This would confirm the guess that the python2.7 venv
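The same effect as the *export* above can be sketched from inside a Python process before any PySpark launch. This is a minimal sketch; the venv path is a placeholder, not a real path:

```python
import os

# Point the PySpark driver at the same interpreter jupyter uses.
# "/path/to/venv/bin/python" is a hypothetical placeholder path.
os.environ["PYSPARK_DRIVER_PYTHON"] = "/path/to/venv/bin/python"

# Any PySpark session launched from this process should now pick up
# that driver interpreter instead of the system default.
print(os.environ["PYSPARK_DRIVER_PYTHON"])
```

Setting the variable in the shell before invoking spark-submit or spark-shell achieves the same thing and is the usual approach.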

Re: [VOTE] [SPARK-27495] SPIP: Support Stage level resource configuration and scheduling

2019-09-11 Thread Ryan Blue
+1. This is going to be really useful. Thanks for working on it!

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
Also, the performance remains identical when running the same script from the jupyter terminal instead of a normal terminal. In the script the spark context is created by the spark = SparkSession \ .builder \ .. .. getOrCreate() command. On Wed, Sep 11, 2019 at 10:28 PM Dhrubajyoti Hati wrote: > If

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
If you say that libraries are not transferred by default, and in my case I haven't used any --py-files, then just because the driver python is different am I facing a 6x speed difference? I am using client mode to submit the program, but the udfs and all are executed in the executors, then why is

Re: [VOTE] [SPARK-27495] SPIP: Support Stage level resource configuration and scheduling

2019-09-11 Thread Felix Cheung
+1 From: Thomas graves Sent: Wednesday, September 4, 2019 7:24:26 AM To: dev Subject: [VOTE] [SPARK-27495] SPIP: Support Stage level resource configuration and scheduling Hey everyone, I'd like to call for a vote on SPARK-27495 SPIP: Support Stage level

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Abdeali Kothari
The driver python may not always be the same as the executor python. You can set these using PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON. The dependent libraries are not transferred by Spark in any way unless you do a --py-files or .addPyFile(). Could you try this: *import sys; print(sys.prefix)* on
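The *sys.prefix* check suggested above can be run as-is under each interpreter (jupyter kernel vs. the spark-submit driver) and the outputs compared — a minimal sketch:

```python
import sys

# Which interpreter (and hence which site-packages) is actually running?
# Run this in the jupyter notebook and again via spark-submit, then compare.
print(sys.executable)   # path to the interpreter binary
print(sys.prefix)       # root of the (virtual) environment
```

If the two runs print different prefixes, the driver and the notebook are using different environments, with possibly different library builds.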

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
But would it be the case for multiple tasks running on the same worker? Also, both the tasks are running in client mode, so whatever is true for one is true for both, or for neither. As mentioned earlier, all the confs are the same. I have checked and compared each conf. As Abdeali mentioned, it must be because

Re: Python API for mapGroupsWithState

2019-09-11 Thread Stavros Kontopoulos
+1 I was looking at this today, so any idea why this was not added before? On Sat, Aug 3, 2019 at 1:57 AM Nicholas Chammas wrote: > Can someone succinctly describe the challenge in adding the > `mapGroupsWithState()` API to PySpark? > > I was hoping for some suboptimal but nonetheless working

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
Hi, I just ran the same script in a shell in jupyter notebook and found the performance to be similar. So I can confirm this is happening because the libraries used by the jupyter notebook python are different from those used by the spark-submit python. But now I have a following question. Are the dependent
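One way to confirm the library mismatch described above is to print the versions each interpreter sees and compare. This is a sketch using only the standard library; the package names are illustrative, not taken from the thread:

```python
import sys
from importlib import metadata

# Run this same snippet under the jupyter python and under the
# spark-submit driver python, then diff the two outputs.
for pkg in ("pandas", "numpy"):
    try:
        print(pkg, metadata.version(pkg), "from", sys.prefix)
    except metadata.PackageNotFoundError:
        print(pkg, "not installed in", sys.prefix)
```

Differing versions (or a package missing in one environment) would explain why the same script behaves differently under the two interpreters.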