Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
On 7/22/20 3:45 AM, Hyukjin Kwon wrote: > > Yeah, I tend to be positive about leveraging the Python type hints in > general. > > However, just to clarify, I don’t think we should just port the type > hints into the main codes yet but maybe think about > having/porting Maciej's work, pyi files as s
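The stub-file (`pyi`) approach discussed above keeps annotations out of the runtime modules entirely: a separate `.pyi` file carries the types, and checkers like mypy pick it up automatically. A minimal sketch, with hypothetical module and function names:

```python
# greeting.pyi (hypothetical stub shipped next to greeting.py) would contain
# only signatures, which type checkers read instead of the runtime module:
#
#     def greet(name: str, excited: bool = ...) -> str: ...
#
# The runtime module itself then needs no annotations at all:

def greet(name, excited=False):
    """Return a greeting; the types live externally in greeting.pyi."""
    suffix = "!" if excited else "."
    return "Hello, " + name + suffix

print(greet("Spark", excited=True))  # prints "Hello, Spark!"
```

This is why stubs are easy to backport or revert: removing the `.pyi` file changes nothing at runtime.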

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
On 7/21/20 9:40 PM, Holden Karau wrote: > Yeah I think this could be a great project now that we're only Python > 3.5+. One potential is making this an Outreachy project to get more > folks from different backgrounds involved in Spark. I am honestly not sure if that's really the case. At the mom

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
On 7/22/20 3:45 AM, Hyukjin Kwon wrote: > For now, I tend to think adding type hints to the codes make it > difficult to backport or revert and > more difficult to discuss about typing only especially considering > typing is arguably premature yet. About being premature ‒ since typing ecosystem e

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Driesprong, Fokko
That's probably one-time overhead so it is not a big issue. In my opinion, a bigger one is possible complexity. Annotations tend to introduce a lot of cyclic dependencies in the Spark codebase. This can be addressed, but it doesn't look great. This is not true (anymore). With Python 3.6 you can add strin
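The point about string annotations deserves a concrete illustration: writing a hint as a string (Python 3.6), or enabling postponed evaluation via `from __future__ import annotations` (PEP 563, Python 3.7+), means the annotated name is not resolved at import time, so forward references and potential import cycles stop being a problem. A minimal sketch with made-up class names:

```python
from __future__ import annotations  # PEP 563: annotations stay unevaluated strings
# On Python 3.6 the same effect comes from quoting the hint manually,
# e.g. def rdd(self) -> "RDD": ...

class DataFrame:
    # RDD is defined *below* this class; without postponed evaluation this
    # forward reference would raise NameError at class-definition time. The
    # same mechanism breaks import cycles between modules.
    def rdd(self) -> RDD:
        return RDD()

class RDD:
    def to_df(self) -> DataFrame:
        return DataFrame()

# The annotation is stored as a plain string until explicitly resolved:
print(DataFrame.rdd.__annotations__["return"])  # prints "RDD"
```

Type checkers resolve the strings statically, so nothing is lost for tooling while the runtime import graph stays acyclic.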

[DISCUSS][SQL] What is the best practice to add catalog support for customized storage format.

2020-07-22 Thread Kun H .
Hi Spark developers, My team has an internal storage format. It already has an implementation of Data Source V2. Now we want to adapt catalog support for it. I expect each partition can be stored in this format and the Spark catalog can manage partition columns, just like using ORC and Par

Re: [DISCUSS][SQL] What is the best practice to add catalog support for customized storage format.

2020-07-22 Thread Russell Spitzer
There is now a full catalog API you can implement which should give you the control you are looking for. It is in Spark 3.0 and here is an example implementation for supporting Cassandra. https://github.com/datastax/spark-cassandra-connector/blob/master/connector/src/main/scala/com/datastax/spark/
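For context on the suggestion above: in Spark 3.0 a custom catalog implementation is wired in purely through configuration, under the `spark.sql.catalog.<name>` prefix. A sketch of the relevant `spark-defaults.conf` entries, with placeholder catalog and class names:

```
# Illustrative only -- catalog name and implementation class are placeholders.
# The class must implement Spark's catalog plugin API (e.g. TableCatalog).
spark.sql.catalog.mycatalog             com.example.MyStorageCatalog
spark.sql.catalog.mycatalog.someOption  someValue
```

Tables in the custom format are then addressable in SQL as `mycatalog.db.table`, with Spark delegating table and partition management to the plugin.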

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
W dniu środa, 22 lipca 2020 Driesprong, Fokko napisał(a): > That's probably one-time overhead so it is not a big issue. In my > opinion, a bigger one is possible complexity. Annotations tend to introduce > a lot of cyclic dependencies in Spark codebase. This can be addressed, but > don't look gr

Re: [DISCUSS] Amend the committer guidelines on the subject of -1s & how we expect PR discussion to be treated.

2020-07-22 Thread Imran Rashid
Hi Holden, thanks for leading this discussion, I'm in favor in general. I have one specific question -- these two sections seem to contradict each other slightly: > If there is a -1 from a non-committer, multiple committers or the PMC should be consulted before moving forward. > >If the original

Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Steve Loughran
On Wed, 22 Jul 2020 at 00:51, Holden Karau wrote: > Hi Folks, > > In Spark SQL there is the ability to have Spark do its partition > discovery/file listing in parallel on the worker nodes and also avoid > locality lookups. I'd like to expose this in core, but given the Hadoop > APIs it's a bit m

Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Holden Karau
Wonderful. To be clear the patch is more to start the discussion about how we want to do it and less what I think is the right way. On Wed, Jul 22, 2020 at 10:47 AM Steve Loughran wrote: > > > On Wed, 22 Jul 2020 at 00:51, Holden Karau wrote: > >> Hi Folks, >> >> In Spark SQL there is the abili
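The idea under discussion (fan the expensive per-directory listing out across workers instead of doing it serially on the driver, and skip per-file locality lookups) can be sketched in plain Python. This is only an illustration of the shape of the approach, not Spark's actual API; Spark would distribute the listing as tasks rather than threads:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def list_partition(path):
    # Each "task" lists exactly one directory and makes no block-location
    # (locality) calls -- the cost the thread above wants to avoid.
    return [os.path.join(path, name) for name in sorted(os.listdir(path))]

# Build a toy partitioned layout: root/part=0/f.txt, root/part=1/f.txt
root = tempfile.mkdtemp()
for i in range(2):
    d = os.path.join(root, "part=%d" % i)
    os.makedirs(d)
    open(os.path.join(d, "f.txt"), "w").close()

dirs = [os.path.join(root, d) for d in sorted(os.listdir(root))]
# Listing happens in parallel, one directory per worker:
with ThreadPoolExecutor(max_workers=4) as pool:
    files = [f for batch in pool.map(list_partition, dirs) for f in batch]

print(len(files))  # prints 2
```

In Spark SQL this corresponds to distributing the leaf-directory listing as a job; exposing the same mechanism in core is what the patch is meant to start a discussion about.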

Re: [DISCUSS] Amend the committer guidelines on the subject of -1s & how we expect PR discussion to be treated.

2020-07-22 Thread Holden Karau
On Wed, Jul 22, 2020 at 7:39 AM Imran Rashid < iras...@apache.org > wrote: > Hi Holden, > > thanks for leading this discussion, I'm in favor in general. I have one > specific question -- these two sections seem to contradict each other > slightly: > > > If there is a -1 from a non-committer, mult

[DISCUSS] [Spark confs] Making spark.jars conf take precedence over spark default classpath

2020-07-22 Thread nupurshukla
Hello, I am prototyping a change in the behavior of spark.jars conf for my use-case. spark.jars conf is used to specify a list of jars to include on the driver and executor classpaths. *Current behavior:* spark.jars conf value is not read until after the JVM has already started and the system
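To make the precedence problem concrete: because `spark.jars` is only read after the JVM is up, a class already present on the launch classpath shadows the same class in a `spark.jars` jar. An illustrative invocation (all paths and jar names are placeholders):

```
# Illustrative only. legacy-lib.jar is on the driver's extra classpath,
# which is applied at JVM launch; new-lib.jar arrives later via spark.jars.
# Under current behavior, a class present in both resolves to legacy-lib.jar.
spark-submit \
  --conf spark.driver.extraClassPath=/opt/spark/lib/legacy-lib.jar \
  --conf spark.jars=/home/user/new-lib.jar \
  app.py
```

The proposed change would let the `spark.jars` entry win in such conflicts.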

Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Felix Cheung
+1 From: Holden Karau Sent: Wednesday, July 22, 2020 10:49:49 AM To: Steve Loughran Cc: dev Subject: Re: Exposing Spark parallelized directory listing & non-locality listing in core Wonderful. To be clear the patch is more to start the discussion about how we