Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Felix Cheung
+1

From: Holden Karau
Sent: Wednesday, July 22, 2020 10:49:49 AM
To: Steve Loughran
Cc: dev
Subject: Re: Exposing Spark parallelized directory listing & non-locality listing in core

Wonderful. To be clear the patch is more to start the discussion about how we

[DISCUSS] [Spark confs] Making spark.jars conf take precedence over spark default classpath

2020-07-22 Thread nupurshukla
Hello, I am prototyping a change in the behavior of spark.jars conf for my use-case. spark.jars conf is used to specify a list of jars to include on the driver and executor classpaths. *Current behavior:* spark.jars conf value is not read until after the JVM has already started and the
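
The precedence question in this thread can be sketched in plain Python (this is a hypothetical illustration of the ordering being discussed, not Spark's actual classpath-construction code; `build_classpath` is an invented helper):

```python
# Hypothetical sketch of the classpath-ordering question above: today
# the Spark default classpath entries come first, so a jar supplied via
# spark.jars cannot shadow a class already on the default classpath.

def build_classpath(default_cp, spark_jars, jars_first=False):
    """Return the effective classpath as an ordered list of entries.

    jars_first=False mimics the current behavior (defaults win);
    jars_first=True mimics the proposed precedence change.
    """
    return (spark_jars + default_cp) if jars_first else (default_cp + spark_jars)

default_cp = ["/opt/spark/jars/guava-14.0.jar"]
user_jars = ["/home/user/guava-27.0.jar"]

current = build_classpath(default_cp, user_jars)             # defaults first
proposed = build_classpath(default_cp, user_jars, jars_first=True)
```

With JVM classloading, whichever entry appears first wins for a duplicated class, which is why the ordering (and not just membership) matters here.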

Re: [DISCUSS] Amend the committer guidelines on the subject of -1s & how we expect PR discussion to be treated.

2020-07-22 Thread Holden Karau
On Wed, Jul 22, 2020 at 7:39 AM Imran Rashid < iras...@apache.org > wrote: > Hi Holden, > > thanks for leading this discussion, I'm in favor in general. I have one > specific question -- these two sections seem to contradict each other > slightly: > > > If there is a -1 from a non-committer,

Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Holden Karau
Wonderful. To be clear the patch is more to start the discussion about how we want to do it and less what I think is the right way. On Wed, Jul 22, 2020 at 10:47 AM Steve Loughran wrote: > > > On Wed, 22 Jul 2020 at 00:51, Holden Karau wrote: > >> Hi Folks, >> >> In Spark SQL there is the

Re: Exposing Spark parallelized directory listing & non-locality listing in core

2020-07-22 Thread Steve Loughran
On Wed, 22 Jul 2020 at 00:51, Holden Karau wrote: > Hi Folks, > > In Spark SQL there is the ability to have Spark do its partition > discovery/file listing in parallel on the worker nodes and also avoid > locality lookups. I'd like to expose this in core, but given the Hadoop > APIs it's a bit
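
The pattern under discussion, listing many directories concurrently instead of one at a time, can be sketched like this (a minimal illustration only; Spark SQL actually does this on executors via its file index, not with a local thread pool):

```python
# Hedged sketch of parallelized directory listing: fan the os.listdir
# calls out over a thread pool instead of looping sequentially.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def list_dirs_parallel(dirs, max_workers=8):
    """List each directory concurrently; returns {dir: [file names]}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        listings = pool.map(os.listdir, dirs)
        return dict(zip(dirs, listings))

# Build a few throwaway partition-style directories to demonstrate.
root = tempfile.mkdtemp()
dirs = []
for i in range(3):
    d = os.path.join(root, f"part={i}")
    os.mkdir(d)
    open(os.path.join(d, "data.parquet"), "w").close()
    dirs.append(d)

result = list_dirs_parallel(dirs)
```

Against an object store (the case Steve raises elsewhere in the thread), each listing is a network round trip, which is where parallelism and skipping locality lookups pay off.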

Re: [DISCUSS] Amend the committer guidelines on the subject of -1s & how we expect PR discussion to be treated.

2020-07-22 Thread Imran Rashid
Hi Holden, thanks for leading this discussion, I'm in favor in general. I have one specific question -- these two sections seem to contradict each other slightly: > If there is a -1 from a non-committer, multiple committers or the PMC should be consulted before moving forward. > >If the

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
On Wednesday, 22 July 2020, Driesprong, Fokko wrote: > That's probably one-time overhead so it is not a big issue. In my > opinion, a bigger one is possible complexity. Annotations tend to introduce > a lot of cyclic dependencies in the Spark codebase. This can be addressed, but > doesn't look

Re: [DISCUSS][SQL] What is the best practice to add catalog support for customized storage format.

2020-07-22 Thread Russell Spitzer
There is now a full catalog API you can implement which should give you the control you are looking for. It is in Spark 3.0 and here is an example implementation for supporting Cassandra.
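
For orientation, a Spark 3.0 catalog plugin is registered through configuration along these lines (the catalog name and class below are illustrative placeholders, not the Cassandra connector's real class; check the connector's documentation for the actual value):

```properties
# Register a custom TableCatalog implementation under the name "mycat".
# The implementation class shown here is a placeholder.
spark.sql.catalog.mycat=com.example.MyCustomCatalog
```

Tables managed by the plugin are then addressed with the catalog name as a prefix, e.g. `SELECT * FROM mycat.mydb.mytable`.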

[DISCUSS][SQL] What is the best practice to add catalog support for customized storage format.

2020-07-22 Thread Kun H .
Hi Spark developers, My team has an internal storage format. It already has an implementation of data source v2. Now we want to add catalog support for it. I expect each partition can be stored in this format and the Spark catalog can manage partition columns, just like using ORC and

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Driesprong, Fokko
That's probably one-time overhead so it is not a big issue. In my opinion, a bigger one is possible complexity. Annotations tend to introduce a lot of cyclic dependencies in the Spark codebase. This can be addressed, but doesn't look great. This is not true (anymore). With Python 3.6 you can add
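
The cyclic-dependency point has a standard mitigation in modern Python: string (forward-reference) annotations plus `typing.TYPE_CHECKING`, so type-only imports never run at runtime. A minimal sketch (the `pyspark.sql` import is just an example of a cross-module reference; the function is hypothetical):

```python
# Hedged sketch: annotations that never trigger a runtime import, which
# is one way to avoid the cyclic-import problem mentioned above.
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers, never executed at runtime, so it
    # cannot create an import cycle (or fail if the package is absent).
    from pyspark.sql import DataFrame

def cache_result(df: "DataFrame") -> "DataFrame":
    """The quoted annotation is resolved lazily, if at all."""
    return df
```

Because the annotation is a string, this module imports cleanly even when `pyspark` itself is unavailable; only a static checker like mypy ever resolves the name.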

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
On 7/22/20 3:45 AM, Hyukjin Kwon wrote: > For now, I tend to think adding type hints to the codes makes it > difficult to backport or revert and > more difficult to discuss typing only, especially considering > typing is arguably premature yet. About being premature ‒ since typing ecosystem

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
On 7/21/20 9:40 PM, Holden Karau wrote: > Yeah I think this could be a great project now that we're only Python > 3.5+. One potential is making this an Outreachy project to get more > folks from different backgrounds involved in Spark. I am honestly not sure if that's really the case. At the

Re: [PySpark] Revisiting PySpark type annotations

2020-07-22 Thread Maciej Szymkiewicz
On 7/22/20 3:45 AM, Hyukjin Kwon wrote: > > Yeah, I tend to be positive about leveraging the Python type hints in > general. > > However, just to clarify, I don’t think we should just port the type > hints into the main codes yet but maybe think about > having/porting Maciej's work, pyi files as
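
The pyi-file approach mentioned here keeps annotations in separate stub files rather than the main sources. A hedged sketch of what such a stub might contain (method names mirror the real RDD API, but this is an illustrative fragment, not Maciej's actual stubs):

```python
# Sketch of an rdd.pyi-style stub: bodies are always "...", only the
# signatures carry information, and the runtime code is left untouched.
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")

class RDD(Generic[T]):
    def map(self, f: Callable[[T], U]) -> "RDD[U]": ...
    def filter(self, f: Callable[[T], bool]) -> "RDD[T]": ...
    def collect(self) -> List[T]: ...
```

Because stubs live beside (or apart from) the implementation, they sidestep the backport/revert concern quoted above: the main codebase's diffs stay annotation-free.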