Need some help and contributions in PySpark API documentation

2020-08-04 Thread Hyukjin Kwon
Hi all,

I am trying to redesign the PySpark documentation at SPARK-31851.
Basically from:

   - https://spark.apache.org/docs/latest/api/python/index.html
   to:
   - https://hyukjin-spark.readthedocs.io/en/latest/index.html (draft)

The base work is done, and I am now adding new pages such as the user guide,
getting started, etc.

There is a lot of content to write. I am trying to do it one piece at a time,
but there is too much for me to do alone.

I would like to ask the Spark dev community for help building useful pages in
the PySpark documentation.
I filed some sub-tasks at SPARK-31851, so please pick up some of the
sub-tasks.

In addition, feel free to create more sub-tasks if you have good ideas
to include in the official PySpark documentation.

I wrote some tips as well, so new contributors should also be able to work on
some of them.
I hope we can manage to build a new and better PySpark documentation
together :-).

Thanks!


Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-08-04 Thread Holden Karau
This vote passes with only +1s. In conjunction with the discussion I
believe we have consensus. I will update the website this week with the
proposed change. Thank you all for your participation.

On Sun, Aug 2, 2020 at 9:33 PM Prashant Sharma  wrote:

> +1
>
> On Fri, Jul 31, 2020 at 10:18 PM Xiao Li  wrote:
>
>> +1
>>
>> Xiao
>>
>> On Fri, Jul 31, 2020 at 9:32 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> +1
>>>
>>> Thanks,
>>> Mridul
>>>
>>> On Thu, Jul 30, 2020 at 4:49 PM Holden Karau 
>>> wrote:
>>>
 Hi Spark Developers,

 After the discussion of the proposal to amend Spark committer
 guidelines, it appears folks are generally in agreement on policy
 clarifications. (See
 https://lists.apache.org/thread.html/r6706e977fda2c474a7f24775c933c2f46ea19afbfafb03c90f6972ba%40%3Cdev.spark.apache.org%3E,
 as well as some on the private@ list for PMC.) Therefore, I am calling
 for a majority VOTE, which will last at least 72 hours. See the ASF voting
 rules for procedural changes at
 https://www.apache.org/foundation/voting.html.

 The proposal is to add a new section entitled “When to Commit” to the
 Spark committer guidelines, currently at
 https://spark.apache.org/committers.html.

 ** START OF CHANGE **

 PRs shall not be merged during active, on-topic discussion unless they
 address issues such as critical security fixes of a public vulnerability.
 Under extenuating circumstances, PRs may be merged during active, off-topic
 discussion and the discussion directed to a more appropriate venue. Time
 should be given prior to merging for those involved with the conversation
 to explain if they believe they are on-topic.

 Lazy consensus requires giving time for discussion to settle while
 understanding that people may not be working on Spark as their full-time
 job and may take holidays. It is believed that by doing this, we can limit
 how often people feel the need to exercise their veto.

 All -1s with justification merit discussion.  A -1 from a non-committer
 can be overridden only with input from multiple committers, and suitable
 time must be offered for any committer to raise concerns. A -1 from a
 committer who cannot be reached requires a consensus vote of the PMC under
 ASF voting rules to determine the next steps within the ASF guidelines for
 code vetoes ( https://www.apache.org/foundation/voting.html ).

 These policies serve to reiterate the core principle that code must not
 be merged with a pending veto or before a consensus has been reached (lazy
 or otherwise).

 It is the PMC’s hope that vetoes continue to be infrequent, and when
 they occur, that all parties will take the time to build consensus prior to
 additional feature work.

 Being a committer means exercising your judgement while working in a
 community of people with diverse views. There is nothing wrong in getting a
 second (or third or fourth) opinion when you are uncertain. Thank you for
 your dedication to the Spark project; it is appreciated by the developers
 and users of Spark.

 It is hoped that these guidelines do not slow down development; rather,
 by removing some of the uncertainty, the goal is to make it easier for us
 to reach consensus. If you have ideas on how to improve these guidelines or
 other Spark project operating procedures, you should reach out on the dev@
 list to start the discussion.

 ** END OF CHANGE TEXT **

 I want to thank everyone who has been involved with the discussion
 leading to this proposal and those of you who take the time to vote on
 this. I look forward to our continued collaboration in building Apache
 Spark.

 I believe we share the goal of creating a welcoming community around
 the project. On a personal note, it is my belief that consistently applying
 this policy around commits can help to make a more accessible and welcoming
 community.

 Kind Regards,

 Holden


 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>
>> --
>> 
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Hyukjin Kwon
Oh I think I caused some confusion here.
Just for clarification, I wasn’t saying we must port this into a separate
repo now. I was saying it can be one of the options we can consider.

For a bit more context:
This option was previously considered, roughly speaking, invalid, and it
seemed it might need an incubation process as a separate project.
After some investigation, I found that this is still a valid option, and we
can take this as part of Apache Spark but in a separate repo.

FWIW, NumPy took this approach: they made a separate repo and merged it into
the main repo after it became stable.


My only major concerns are:

   - the possibility that we fundamentally change the approach taken in
   pyspark-stubs (https://github.com/zero323/pyspark-stubs). It’s not because
   how it was done is wrong, but because of how Python type hinting itself
   evolves.
   - If my understanding is correct, pyspark-stubs is still incomplete and
   does not annotate types in some other APIs (falling back to Any). Correct
   me if I am wrong, Maciej.

I’ll have a short sync with him and share what we discuss, since he probably
knows the context of PySpark type hints best, and I know some of the context
in the ASF and Apache Spark.



On Wed, Aug 5, 2020 at 6:31 AM Maciej Szymkiewicz  wrote:

> Indeed, though the possible advantage is that, in theory, you can have a
> different release cycle than for the main repo (I am not sure if that's
> feasible in practice or if that was the intention).
>
> I guess it all depends on how we envision the future of annotations
> (including, but not limited to, how conservative we want to be in the
> future), which is probably something that should be discussed here.
> On 8/4/20 11:06 PM, Felix Cheung wrote:
>
> So IMO maintaining it outside in a separate repo is going to be harder. That
> was why I asked.
>
>
>
> --
> *From:* Maciej Szymkiewicz 
> 
> *Sent:* Tuesday, August 4, 2020 12:59 PM
> *To:* Sean Owen
> *Cc:* Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau; Spark
> Dev List
> *Subject:* Re: [PySpark] Revisiting PySpark type annotations
>
>
> On 8/4/20 9:35 PM, Sean Owen wrote:
> > Yes, but the general argument you make here is: if you tie this
> > project to the main project, it will _have_ to be maintained by
> > everyone. That's good, but also exactly the downside I think we want
> > to avoid at this stage (I thought?) I understand for some
> > undertakings, it's just not feasible to start outside the main
> > project, but is there no proof of concept even possible before taking
> > this step -- which more or less implies it's going to be owned and
> > merged and have to be maintained in the main project.
>
>
> I think we have a somewhat different understanding here ‒ I believe we have
> reached the conclusion that maintaining annotations within the project is
> OK; we only differ on the specific form it should take.
>
> As for the POC ‒ we have stubs, which have been maintained for over three
> years now and cover versions from 2.3 (though these are fairly limited) to,
> with some lag, current master. There is some evidence they are used in
> the wild
> (https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
> there are a few contributors
> (https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
> least some use cases (https://stackoverflow.com/q/40163106/). So,
> subjectively speaking, it seems we're already beyond POC.
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>
>


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-08-04 Thread 郑瑞峰
Hi all,
I am going to prepare the release of 3.0.1 RC1, with the help of Wenchen.




-- Original Message --
From: "Jason Moore"
[quoted message garbled in the archive; it references
https://issues.apache.org/jira/browse/SPARK-32307]

On Tue, Jul 14, 2020 at 11:13 PM Sean Owen wrote:
[quoted message garbled in the archive; it references
https://issues.apache.org/jira/browse/SPARK-32234]

On Tue, Jul 14, 2020 at 9:57 AM Shivaram Venkataraman wrote:
[quoted message garbled in the archive]

Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
Indeed, though the possible advantage is that, in theory, you can have a
different release cycle than for the main repo (I am not sure if that's
feasible in practice or if that was the intention).

I guess it all depends on how we envision the future of annotations
(including, but not limited to, how conservative we want to be in the
future), which is probably something that should be discussed here.

On 8/4/20 11:06 PM, Felix Cheung wrote:
> So IMO maintaining it outside in a separate repo is going to be harder.
> That was why I asked.
>
>
>  
> 
> *From:* Maciej Szymkiewicz 
> *Sent:* Tuesday, August 4, 2020 12:59 PM
> *To:* Sean Owen
> *Cc:* Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau;
> Spark Dev List
> *Subject:* Re: [PySpark] Revisiting PySpark type annotations
>  
>
> On 8/4/20 9:35 PM, Sean Owen wrote:
> > Yes, but the general argument you make here is: if you tie this
> > project to the main project, it will _have_ to be maintained by
> > everyone. That's good, but also exactly the downside I think we want
> > to avoid at this stage (I thought?) I understand for some
> > undertakings, it's just not feasible to start outside the main
> > project, but is there no proof of concept even possible before taking
> > this step -- which more or less implies it's going to be owned and
> > merged and have to be maintained in the main project.
>
>
> I think we have a somewhat different understanding here ‒ I believe we have
> reached the conclusion that maintaining annotations within the project is
> OK; we only differ on the specific form it should take.
>
> As for the POC ‒ we have stubs, which have been maintained for over three
> years now and cover versions from 2.3 (though these are fairly limited) to,
> with some lag, current master. There is some evidence they are used in
> the wild
> (https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
> there are a few contributors
> (https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
> least some use cases (https://stackoverflow.com/q/40163106/). So,
> subjectively speaking, it seems we're already beyond POC.
>
> -- 
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>
>
-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC





Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Felix Cheung
So IMO maintaining it outside in a separate repo is going to be harder. That
was why I asked.




From: Maciej Szymkiewicz 
Sent: Tuesday, August 4, 2020 12:59 PM
To: Sean Owen
Cc: Felix Cheung; Hyukjin Kwon; Driesprong, Fokko; Holden Karau; Spark Dev List
Subject: Re: [PySpark] Revisiting PySpark type annotations


On 8/4/20 9:35 PM, Sean Owen wrote:
> Yes, but the general argument you make here is: if you tie this
> project to the main project, it will _have_ to be maintained by
> everyone. That's good, but also exactly the downside I think we want
> to avoid at this stage (I thought?) I understand for some
> undertakings, it's just not feasible to start outside the main
> project, but is there no proof of concept even possible before taking
> this step -- which more or less implies it's going to be owned and
> merged and have to be maintained in the main project.


I think we have a somewhat different understanding here ‒ I believe we have
reached the conclusion that maintaining annotations within the project is
OK; we only differ on the specific form it should take.

As for the POC ‒ we have stubs, which have been maintained for over three
years now and cover versions from 2.3 (though these are fairly limited) to,
with some lag, current master. There is some evidence they are used in
the wild
(https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
there are a few contributors
(https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
least some use cases (https://stackoverflow.com/q/40163106/). So,
subjectively speaking, it seems we're already beyond POC.

--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC




Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz

On 8/4/20 9:35 PM, Sean Owen wrote:
> Yes, but the general argument you make here is: if you tie this
> project to the main project, it will _have_ to be maintained by
> everyone. That's good, but also exactly the downside I think we want
> to avoid at this stage (I thought?) I understand for some
> undertakings, it's just not feasible to start outside the main
> project, but is there no proof of concept even possible before taking
> this step -- which more or less implies it's going to be owned and
> merged and have to be maintained in the main project.


I think we have a somewhat different understanding here ‒ I believe we have
reached the conclusion that maintaining annotations within the project is
OK; we only differ on the specific form it should take.

As for the POC ‒ we have stubs, which have been maintained for over three
years now and cover versions from 2.3 (though these are fairly limited) to,
with some lag, current master. There is some evidence they are used in
the wild
(https://github.com/zero323/pyspark-stubs/network/dependents?package_id=UGFja2FnZS02MzU1MTc4Mg%3D%3D),
there are a few contributors
(https://github.com/zero323/pyspark-stubs/graphs/contributors) and at
least some use cases (https://stackoverflow.com/q/40163106/). So,
subjectively speaking, it seems we're already beyond POC.

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC






Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Sean Owen
On Tue, Aug 4, 2020 at 2:32 PM Maciej Szymkiewicz wrote:
>
> First of all, why ASF ownership?
>
> For a project of this size, maintaining high-quality annotations independent
> of the actual codebase is far from trivial (it is not hard to use stubgen or
> monkeytype, but the resulting annotations are rather simplistic). For
> starters, changes that are mostly transparent to the final user (like the
> pyspark.ml changes in 3.0 / 3.1) might require significant changes in the
> annotations. Additionally, some signature changes are rather hard to track,
> and such separation can easily lead to divergence.
>
> Additionally, annotations are as much about describing facts as about
> showing intended usage (the simplest use case is documenting argument
> dependencies). This makes the process of annotation rather subjective and
> requires a good understanding of the author's intention.
>
> Finally, annotation-friendly signatures require conscious decisions (see for
> example https://github.com/python/mypy/issues/5621).
>
> Overall, ASF ownership is probably the best way to ensure long-term
> sustainability and quality of annotations.
>

Yes, but the general argument you make here is: if you tie this
project to the main project, it will _have_ to be maintained by
everyone. That's good, but also exactly the downside I think we want
to avoid at this stage (I thought?) I understand for some
undertakings, it's just not feasible to start outside the main
project, but is there no proof of concept even possible before taking
this step -- which more or less implies it's going to be owned and
merged and have to be maintained in the main project.




Re:

2020-08-04 Thread Sean Owen
I think that's fine to resolve as you did. I would recommend answering
on sites like StackOverflow rather than the JIRA. That said, because
the answer is pretty trivial to reply with, I'll post there. No point
in making it excessively hard to get a simple answer if it doesn't
become a pattern.

On Tue, Aug 4, 2020 at 2:26 PM Rohit Mishra  wrote:
>
> Hello Everyone,
>
> Someone asked this question on JIRA, and since it was a question, I asked
> him to check Stack Overflow. Personally, I don't have an answer to this
> question, so if anyone has an idea, please feel free to update the issue.
> I have marked it resolved for the time being, but thought I would get your
> opinion. Whenever you are free, the link is here:
> https://issues.apache.org/jira/browse/SPARK-32527
>
> Thanks in advance for your time.
>
> Regards,
> Rohit Mishra




Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Maciej Szymkiewicz
*First of all, why ASF ownership?*

For a project of this size, maintaining high-quality annotations independent
of the actual codebase is far from trivial (it is not hard to use stubgen or
monkeytype, but the resulting annotations are rather simplistic). For
starters, changes that are mostly transparent to the final user (like the
pyspark.ml changes in 3.0 / 3.1) might require significant changes in the
annotations. Additionally, some signature changes are rather hard to track,
and such separation can easily lead to divergence.

Additionally, annotations are as much about describing facts as about showing
intended usage (the simplest use case is documenting argument dependencies).
This makes the process of annotation rather subjective and requires a good
understanding of the author's intention.

Finally, annotation-friendly signatures require conscious decisions (see
for example https://github.com/python/mypy/issues/5621).

Overall, ASF ownership is probably the best way to ensure long-term
sustainability and quality of annotations.

*Now, why separate repo?*

Based on the discussion so far, it is clear that there is no consensus
about using inline annotations. There are three other options:

  * Stub files packaged alongside actual code.
  * Separate project within root, packaged separately.
  * Separate repository, packaged separately.

As already pointed out here and in the comments to
https://github.com/apache/spark/pull/29180, annotations are still somewhat
unstable. The ecosystem evolves quickly, and new features keep arriving, some
with the potential to fundamentally change the way we annotate code.

Therefore, it might be beneficial to maintain a subproject (for lack of a
better word) that can evolve faster than the code being annotated.

While I have no strong opinion about this part, it is definitely a
relatively unobtrusive way of bringing code and annotations closer
together.
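
For readers who have not seen the stub-file option in practice, here is a toy
sketch of what such a .pyi file can look like ‒ the class and signatures are
illustrative only, not the actual pyspark-stubs content:

    # rdd.pyi ‒ hypothetical stub; ships next to rdd.py and is read by
    # type checkers, never executed.
    from typing import Callable, Generic, List, TypeVar

    T = TypeVar("T")
    U = TypeVar("U")

    class RDD(Generic[T]):
        def map(self, f: Callable[[T], U]) -> "RDD[U]": ...
        def filter(self, f: Callable[[T], bool]) -> "RDD[T]": ...
        def collect(self) -> List[T]: ...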

On 8/4/20 7:44 PM, Sean Owen wrote:

> Maybe more specifically, why an ASF repo?
>
> On Tue, Aug 4, 2020 at 11:45 AM Felix Cheung  
> wrote:
>> What would be the reason for separate git repo?
>>
>> 
>> From: Hyukjin Kwon 
>> Sent: Monday, August 3, 2020 1:58:55 AM
>> To: Maciej Szymkiewicz 
>> Cc: Driesprong, Fokko ; Holden Karau 
>> ; Spark Dev List 
>> Subject: Re: [PySpark] Revisiting PySpark type annotations
>>
>> Okay, it seems like we can create a separate repo just as with apache/spark
>> (e.g. https://issues.apache.org/jira/browse/INFRA-20470).
>> We can also think about porting the files as they are.
>> I will try to have a short sync with the author Maciej, and share what we
>> discussed offline.
>>
-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC





[no subject]

2020-08-04 Thread Rohit Mishra
Hello Everyone,

Someone asked this question on JIRA, and since it was a question, I asked him
to check Stack Overflow. Personally, I don't have an answer to this question,
so if anyone has an idea, please feel free to update the issue. I have marked
it resolved for the time being, but thought I would get your opinion.
Whenever you are free, the link is here:
https://issues.apache.org/jira/browse/SPARK-32527

Thanks in advance for your time.

Regards,
Rohit Mishra


Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Sean Owen
Maybe more specifically, why an ASF repo?

On Tue, Aug 4, 2020 at 11:45 AM Felix Cheung  wrote:
>
> What would be the reason for separate git repo?
>
> 
> From: Hyukjin Kwon 
> Sent: Monday, August 3, 2020 1:58:55 AM
> To: Maciej Szymkiewicz 
> Cc: Driesprong, Fokko ; Holden Karau 
> ; Spark Dev List 
> Subject: Re: [PySpark] Revisiting PySpark type annotations
>
> Okay, it seems like we can create a separate repo just as with apache/spark
> (e.g. https://issues.apache.org/jira/browse/INFRA-20470).
> We can also think about porting the files as they are.
> I will try to have a short sync with the author Maciej, and share what we
> discussed offline.
>




Re: [SparkSql] Casting of Predicate Literals

2020-08-04 Thread Russell Spitzer
Thanks! That's exactly what I was hoping for! Thanks for finding the Jira
for me!

On Tue, Aug 4, 2020 at 11:46 AM Wenchen Fan  wrote:

> I think this is not a problem in 3.0 anymore, see
> https://issues.apache.org/jira/browse/SPARK-27638
>
> On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer 
> wrote:
>
>> I've just run into this issue again with another user, and I feel like
>> most folks here have seen some flavor of this at some point.
>>
>> The user registers a Datasource with a column of type Date (or some other
>> non-string type) then performs a query that looks like:
>>
>> *SELECT * from Source WHERE date_col > '2020-08-03'*
>>
>> Seeing that the predicate literal here is a String, Spark needs to make
>> both sides of the comparison the same type, so it places a "Cast" on the
>> Datasource column, and our plan ends up looking like:
>>
>> Cast(date_col as String) > '2020-08-03'
>>
>> Since the Datasource Strategies can't handle a push down of the "Cast"
>> function, we lose the predicate pushdown we could
>> have had. This can change a Job from a single partition lookup into a
>> full scan, leading to a very confusing situation for
>> the end user. I also wonder about the relative cost here, since we could
>> avoid doing X casts and instead do a single
>> one on the predicate literal. In addition, we could do the cast at the
>> Analysis phase and cut the run short before any work even
>> starts, rather than doing a perhaps meaningless comparison between a date
>> and a non-date string.
>>
>> I think we should seriously consider whether in cases like this we should
>> attempt to cast the literal rather than casting the
>> source column.
>>
>> Please let me know if anyone has thoughts on this, or has some previous
>> Jiras I could dig into if it's been discussed before,
>> Russ
>>
>


Re: [SparkSql] Casting of Predicate Literals

2020-08-04 Thread Xiao Li
Hi, Russell,

You might hit other cases in which a CAST blocks predicate pushdown.
If the Cast was added by users and it changes the actual type, we are
unable to optimize it automatically, because that could change the query's
correctness. If it was added by our type coercion rules to make types
consistent at query compile time, we can take a look at the specific rule.
If you think any of these rules is unreasonable, or behaves differently
from other database systems, we can discuss it in the PRs or JIRAs. In
general, we have to be very cautious about making any change in these
rules, since it could have a big impact and change the query results
silently.

Thanks,

On Tue, Aug 4, 2020 at 9:46 AM Wenchen Fan  wrote:

> I think this is not a problem in 3.0 anymore, see
> https://issues.apache.org/jira/browse/SPARK-27638
>
> On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer 
> wrote:
>
>> I've just run into this issue again with another user, and I feel like
>> most folks here have seen some flavor of this at some point.
>>
>> The user registers a Datasource with a column of type Date (or some other
>> non-string type) then performs a query that looks like:
>>
>> *SELECT * from Source WHERE date_col > '2020-08-03'*
>>
>> Seeing that the predicate literal here is a String, Spark needs to make
>> both sides of the comparison the same type, so it places a "Cast" on the
>> Datasource column, and our plan ends up looking like:
>>
>> Cast(date_col as String) > '2020-08-03'
>>
>> Since the Datasource Strategies can't handle a push down of the "Cast"
>> function, we lose the predicate pushdown we could
>> have had. This can change a Job from a single partition lookup into a
>> full scan, leading to a very confusing situation for
>> the end user. I also wonder about the relative cost here, since we could
>> avoid doing X casts and instead do a single
>> one on the predicate literal. In addition, we could do the cast at the
>> Analysis phase and cut the run short before any work even
>> starts, rather than doing a perhaps meaningless comparison between a date
>> and a non-date string.
>>
>> I think we should seriously consider whether in cases like this we should
>> attempt to cast the literal rather than casting the
>> source column.
>>
>> Please let me know if anyone has thoughts on this, or has some previous
>> Jiras I could dig into if it's been discussed before,
>> Russ
>>
>



Re: [SparkSql] Casting of Predicate Literals

2020-08-04 Thread Wenchen Fan
I think this is not a problem in 3.0 anymore, see
https://issues.apache.org/jira/browse/SPARK-27638

On Wed, Aug 5, 2020 at 12:08 AM Russell Spitzer 
wrote:

> I've just run into this issue again with another user, and I feel like most
> folks here have seen some flavor of this at some point.
>
> The user registers a Datasource with a column of type Date (or some other
> non-string type) then performs a query that looks like:
>
> *SELECT * from Source WHERE date_col > '2020-08-03'*
>
> Seeing that the predicate literal here is a String, Spark needs to make both
> sides of the comparison the same type, so it places a "Cast" on the
> Datasource column, and our plan ends up looking like:
>
> Cast(date_col as String) > '2020-08-03'
>
> Since the Datasource Strategies can't handle a push down of the "Cast"
> function, we lose the predicate pushdown we could
> have had. This can change a Job from a single partition lookup into a full
> scan, leading to a very confusing situation for
> the end user. I also wonder about the relative cost here, since we could
> avoid doing X casts and instead do a single
> one on the predicate literal. In addition, we could do the cast at the
> Analysis phase and cut the run short before any work even
> starts, rather than doing a perhaps meaningless comparison between a date
> and a non-date string.
>
> I think we should seriously consider whether in cases like this we should
> attempt to cast the literal rather than casting the
> source column.
>
> Please let me know if anyone has thoughts on this, or has some previous
> Jiras I could dig into if it's been discussed before,
> Russ
>


Re: [PySpark] Revisiting PySpark type annotations

2020-08-04 Thread Felix Cheung
What would be the reason for separate git repo?


From: Hyukjin Kwon 
Sent: Monday, August 3, 2020 1:58:55 AM
To: Maciej Szymkiewicz 
Cc: Driesprong, Fokko ; Holden Karau 
; Spark Dev List 
Subject: Re: [PySpark] Revisiting PySpark type annotations

Okay, it seems like we can create a separate repo just as with apache/spark
(e.g. https://issues.apache.org/jira/browse/INFRA-20470).
We can also think about porting the files as they are.
I will try to have a short sync with the author Maciej, and share what we
discussed offline.


On Wed, Jul 22, 2020 at 10:43 PM Maciej Szymkiewicz  wrote:


On Wednesday, July 22, 2020, Driesprong, Fokko  wrote:
That's probably a one-time overhead, so it is not a big issue. In my opinion,
a bigger one is the possible complexity: annotations tend to introduce a lot
of cyclic dependencies in the Spark codebase. This can be addressed, but it
doesn't look great.

This is not true (anymore). With Python 3.6 you can add string annotations -> 
'DenseVector', and in the future with Python 3.7 this is fixed by having 
postponed evaluation: https://www.python.org/dev/peps/pep-0563/

As far as I recall, the linked PEP addresses forward references, not cyclic
dependencies, which weren't a big issue in the first place.

What I mean is actual cyclic stuff - for example, pyspark.context depends on
pyspark.rdd and the other way around. These dependencies are not explicit at
the moment.
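
A minimal sketch of the usual workaround for such a cycle, using hypothetical
module names (the real cycle mentioned above is pyspark.context and
pyspark.rdd):

    # context.py ‒ hypothetical module illustrating how to keep a cyclic
    # dependency out of the runtime import graph.
    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        # Imported only while type checking, so no import cycle at runtime.
        from rdd import RDD

    class SparkContext:
        def parallelize(self, data: list) -> "RDD":
            # The string annotation is resolved lazily; the runtime import
            # is deferred into the function body.
            from rdd import RDD
            return RDD(self, data)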


Merging stubs into the project structure, on the other hand, has almost no overhead.

This feels awkward to me; it is like having the docstring in a separate file.
In my opinion you want to have the signatures and the functions together for
transparency and maintainability.


I guess that's a matter of preference. From a maintainability perspective it
is actually much easier to have separate objects.

For example, there are different types of objects that are required for
meaningful checking but don't really exist in the real code (protocols,
aliases, code-generated signatures for complex overloads), as well as some
monkey-patched entities.

Additionally, it is often easier to see inconsistencies when typing is
separate.

However, I am not implying that this should be a permanent state.

In general I see two non-breaking paths here:

- Merge pyspark-stubs as a separate subproject within the main Spark repo,
keep it in sync there with a common CI pipeline, and transfer ownership of
the PyPI package to the ASF.
- Move stubs directly into python/pyspark and then apply individual stubs to
the modules of choice.

Of course, the first proposal could be an initial step toward the latter one.


I think DBT is a very nice project where they use annotations very well: 
https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/core/dbt/graph/queue.py

Also, they left out the types in the docstrings, since they are available in
the annotations themselves.


In practice, the biggest advantage is actually support for completion, not type 
checking (which works in simple cases).

Agreed.

Would you be interested in writing up the Outreachy proposal for work on this?

I would be, and I am also happy to mentor. But I think we first need to agree
as a Spark community whether we want to add the annotations to the code, and
to what extent.




At some point (in general when things are heavy in generics, which is the case 
here), annotations become somewhat painful to write.

That's true, but that might also be a pointer that it is time to refactor the 
function/code :)

That might be the case, but it is more often a matter of capturing useful
properties, combined with the requirement to keep things in sync with the
Scala counterparts.


For now, I tend to think adding type hints to the code makes it difficult to
backport or revert, and more difficult to discuss typing on its own,
especially considering typing is arguably still premature.

This feels a bit weird to me, since you want to keep this in sync, right? Do
you provide different stubs for different versions of Python? I had to look
up the literals: https://www.python.org/dev/peps/pep-0586/

I think it is more about portability between Spark versions


Cheers, Fokko

Op wo 22 jul. 2020 om 09:40 schreef Maciej Szymkiewicz 
mailto:mszymkiew...@gmail.com>>:

On 7/22/20 3:45 AM, Hyukjin Kwon wrote:
> For now, I tend to think adding type hints to the code makes it
> difficult to backport or revert, and
> more difficult to discuss typing on its own, especially considering
> typing is arguably still premature.

About being premature ‒ since the typing ecosystem evolves much faster than
Spark, it might be preferable to keep annotations as a separate project
(preferably under the ASF / Spark umbrella). It allows for faster iterations
and supporting new features (for example, Literals proved to be very
useful), without waiting for the next Spark release.
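
As a small illustration of why Literals are useful in stubs ‒ a hypothetical
signature, loosely modeled on a join-type argument, not an actual PySpark
one:

    from typing import Literal  # Python 3.8+; typing_extensions before that

    JoinType = Literal["inner", "left", "right", "outer"]

    def join(other: object, how: JoinType = "inner") -> object:
        # A checker now rejects join(df, how="lleft") at type-check time,
        # instead of the typo surfacing as a runtime error.
        ...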

--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC

[SparkSql] Casting of Predicate Literals

2020-08-04 Thread Russell Spitzer
I've just run into this issue again with another user, and I feel like most
folks here have seen some flavor of this at some point.

The user registers a Datasource with a column of type Date (or some other
non-string type) then performs a query that looks like:

*SELECT * from Source WHERE date_col > '2020-08-03'*

Seeing that the predicate literal here is a String, Spark needs to make both
sides of the comparison the same type, so it places a "Cast" on the
Datasource column, and our plan ends up looking like:

Cast(date_col as String) > '2020-08-03'

Since the Datasource Strategies can't handle a push down of the "Cast"
function, we lose the predicate pushdown we could have had. This can change a
Job from a single partition lookup into a full scan, leading to a very
confusing situation for the end user. I also wonder about the relative cost
here, since we could avoid doing X casts and instead do a single one on the
predicate literal. In addition, we could do the cast at the Analysis phase
and cut the run short before any work even starts, rather than doing a
perhaps meaningless comparison between a date
and a non-date string.
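
A minimal way to see the effect for yourself ‒ a sketch assuming any
datasource table with a DateType column; the table and column names here are
made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cast-pushdown-demo").getOrCreate()

    # Hypothetical source: any table with a DateType column will do.
    spark.range(1).selectExpr("DATE '2020-08-01' AS date_col") \
        .write.saveAsTable("source")

    # On affected versions the physical plan shows the *column* being cast,
    # e.g. cast(date_col as string) > 2020-08-03, which defeats pushdown.
    spark.sql("SELECT * FROM source WHERE date_col > '2020-08-03'").explain(True)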

I think we should seriously consider whether in cases like this we should
attempt to cast the literal rather than casting the
source column.

Please let me know if anyone has thoughts on this, or has some previous
Jiras I could dig into if it's been discussed before,
Russ


Re: Renaming blacklisting feature input

2020-08-04 Thread Sean Owen
Sure, but these are English words. I don't think anybody argues that
it should be an issue to _everyone_, perhaps you. But you seem to
suggest that it shouldn't be an issue to _anyone_ because it isn't an
issue to many people. I don't think that works either. Ex: if someone
proposed a fix to the Chinese translation strings in an app, would we
say this is merely China-centric? I'd also suggest framing this as
"some people's personal issues" is dismissive, but maybe the choice of
words wasn't meant that way. (again, I think we'd never say, who cares
about the Chinese translation? that's just their personal issue.)

I read between the lines that you view this as just some US English
speaker pet issue, so people like me must just not be global-minded. I
think the backgrounds of people you're interacting with here might
surprise you!

Constructively: what do you mean by specialists and 'blindly
renaming'? like, we shouldn't just pick a rename but go with some
well-accepted alternative name? That would be great to add as input
here.
If a rename isn't a solution -- and yes, nobody thinks this
single-handedly fixes social problems -- what would be a better one?

The compatibility concern is very much important. But I think you'll
see the proposal is not to break compatibility. Your review and input
on that technical dimension is valuable.

On Tue, Aug 4, 2020 at 10:11 AM Alexander Shorin  wrote:
>
> Hi Sean!
>
> Your point is good and I accept it, but I thought it worth reminding yet
> again that the ASF isn't limited to the US, and the world is not limited to
> the English language, and as a result shouldn't be limited by some people's
> personal issues - there are specialists around who can help with these.
>
> P.S. Sorry if my response was a bit disrespectful, but the intention was to
> point out that blindly renaming everything is not a solution - let's think
> worldwide, or at least about the compatibility issues, which somehow need
> to be handled. And what will be the motivation for people to handle them?
>




Re: Renaming blacklisting feature input

2020-08-04 Thread Alexander Shorin
Hi Sean!

Your point is good and I accept it, but I thought it worth reminding yet
again that the ASF isn't limited to the US, and the world is not limited to
the English language, and as a result shouldn't be limited by some people's
personal issues - there are specialists around who can help with these.

P.S. Sorry if my response was a bit disrespectful, but the intention was to
point out that blindly renaming everything is not a solution - let's think
worldwide, or at least about the compatibility issues, which somehow need to
be handled. And what will be the motivation for people to handle them?

On Tue, Aug 4, 2020 at 5:57 PM Sean Owen  wrote:

> I know this kind of argument has bounced around not just within the
> ASF but outside too. While we should feel open to debate here, even if
> I don't think it will get anywhere new, let me suggest it won't matter
> to the decision process here, so, not worth it.
>
> We should discuss this type of change like any other. If a portion of
> the community, and committers/PMC accept that this is at the least a
> small, positive change for some, then we start with a legitimate
> proposal: there is an identifiable positive. Arguments against should
> be grounded in specific reasons there are more significant harms than
> benefits. (One clear issue: API compatibility, which I believe is
> still intended to be entirely preserved).
>
> I'd merely say that if one's position is only "meh, this does not
> matter to me, this change doesn't improve my world", it's not worth
> arguing with the people to which it matters at least a little. Changes
> happen here all the time that I don't care about or even distantly
> make my work a little harder. Doesn't mean either position is right-er
> even, we don't need to decide that.
>
> On Tue, Aug 4, 2020 at 9:33 AM Alexander Shorin  wrote:
> >
> >
> > Just no changes? The name poses no issues and is pretty clear about its
> > intentions. The racist connotations are quite overstated.
> >
> > --
> > ,,^..^,,
> >
> >
> > On Tue, Aug 4, 2020 at 5:19 PM Tom Graves 
> wrote:
> >>
> >> Hey Folks,
> >>
> >> We have jira https://issues.apache.org/jira/browse/SPARK-32037 to
> rename the blacklisting feature.  It would be nice to come to a consensus
> on what we want to call that.
> >> It doesn't look like we have any references to whitelist other than
> from other components.  There is some discussion on the jira and I linked
> to what some other projects have done so please take a look at that.
> >>
> >> A few options:
> >>  - blocklist
> >>  - denylist
> >>  - healthy /HealthTracker
> >>  - quarantined
> >>  - benched
> >>  - exiled
> >>  - banlist
> >>
> >> Please let me know thoughts and suggestions.
> >>
> >>
> >> Thanks,
> >> Tom
>


Re: Renaming blacklisting feature input

2020-08-04 Thread Sean Owen
I know this kind of argument has bounced around not just within the
ASF but outside too. While we should feel open to debate here, even if
I don't think it will get anywhere new, let me suggest it won't matter
to the decision process here, so, not worth it.

We should discuss this type of change like any other. If a portion of
the community, and committers/PMC accept that this is at the least a
small, positive change for some, then we start with a legitimate
proposal: there is an identifiable positive. Arguments against should
be grounded in specific reasons there are more significant harms than
benefits. (One clear issue: API compatibility, which I believe is
still intended to be entirely preserved).

I'd merely say that if one's position is only "meh, this does not
matter to me, this change doesn't improve my world", it's not worth
arguing with the people to which it matters at least a little. Changes
happen here all the time that I don't care about or even distantly
make my work a little harder. Doesn't mean either position is right-er
even, we don't need to decide that.

On Tue, Aug 4, 2020 at 9:33 AM Alexander Shorin  wrote:
>
>
> Just no changes? The name poses no issues and is pretty clear about its
> intentions. The racist connotations are quite overstated.
>
> --
> ,,^..^,,
>
>
> On Tue, Aug 4, 2020 at 5:19 PM Tom Graves  
> wrote:
>>
>> Hey Folks,
>>
>> We have jira https://issues.apache.org/jira/browse/SPARK-32037 to rename the 
>> blacklisting feature.  It would be nice to come to a consensus on what we 
>> want to call that.
>> It doesn't look like we have any references to whitelist other than from
>> other components.  There is some discussion on the jira and I linked to what 
>> some other projects have done so please take a look at that.
>>
>> A few options:
>>  - blocklist
>>  - denylist
>>  - healthy /HealthTracker
>>  - quarantined
>>  - benched
>>  - exiled
>>  - banlist
>>
>> Please let me know thoughts and suggestions.
>>
>>
>> Thanks,
>> Tom




Re: Removing references to Master

2020-08-04 Thread Russell Spitzer
I think we should use Scheduler, Comptroller, or Leader; something that
better evokes its purpose as a resource management service. I would rather we
didn't use controller, coordinator, application manager, or primary, because
I feel that those terms make it seem like the process is central to an
application's function, when in reality it does nothing other than turn
containers and processes on or off. The key example here for me would be: if
the StandaloneResourceManager goes down, a running app is basically
unaffected. The initial usage of "master" was misleading even in the context
of previous CS usage of the term, imho, and we should choose a much more
limited term to describe it now that we have a chance for a rename. Of
course, ymmv, and really anything would be better than the current status
quo, which is both misleading and insensitive.

On Tue, Aug 4, 2020 at 9:08 AM Holden Karau  wrote:

> I think this is a good idea, and yes keeping it backwards compatible
> initially is important since we missed the boat on Spark 3. I like the
> Controller/Leader one, since I think that does a good job of reflecting the
> code's role.
>
> On Tue, Aug 4, 2020 at 7:01 AM Tom Graves 
> wrote:
>
>> Hey everyone,
>>
>> I filed jira https://issues.apache.org/jira/browse/SPARK-32333 to remove
>> references to Master.  I realize this is a bigger change than the slave
>> jira, but I wanted to get folks' input on whether they are ok with making
>> the change, and if so, we would need to pick a name to use instead.  I
>> think we should keep it backwards compatible at first so as not to break
>> anyone and, depending on what we find, might break it up into multiple
>> smaller jiras.
>>
>> A few name possibilities:
>>  - ApplicationManager
>>  - StandaloneClusterManager
>>  - Coordinator
>>  - Primary
>>  - Controller
>>
>> Thoughts or suggestions?
>>
>> Thanks,
>> Tom
>>
>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Renaming blacklisting feature input

2020-08-04 Thread Alexander Shorin
Just no changes? The name poses no issues and is pretty clear about its
intentions. The racist connotations are quite overstated.

--
,,^..^,,


On Tue, Aug 4, 2020 at 5:19 PM Tom Graves 
wrote:

> Hey Folks,
>
> We have jira https://issues.apache.org/jira/browse/SPARK-32037 to rename
> the blacklisting feature.  It would be nice to come to a consensus on what
> we want to call that.
> It doesn't look like we have any references to whitelist other than from
> other components.  There is some discussion on the jira and I linked to
> what some other projects have done so please take a look at that.
>
> A few options:
>  - blocklist
>  - denylist
>  - healthy /HealthTracker
>  - quarantined
>  - benched
>  - exiled
>  - banlist
>
> Please let me know thoughts and suggestions.
>
>
> Thanks,
> Tom
>


Renaming blacklisting feature input

2020-08-04 Thread Tom Graves
Hey Folks,
We have jira https://issues.apache.org/jira/browse/SPARK-32037 to rename the
blacklisting feature.  It would be nice to come to a consensus on what we
want to call it.  It doesn't look like we have any references to whitelist
other than from other components.  There is some discussion on the jira, and
I linked to what some other projects have done, so please take a look at
that.

A few options:
 - blocklist
 - denylist
 - healthy /HealthTracker
 - quarantined
 - benched
 - exiled
 - banlist

Please let me know thoughts and suggestions.

Thanks,
Tom

Re: Removing references to Master

2020-08-04 Thread Holden Karau
I think this is a good idea, and yes keeping it backwards compatible
initially is important since we missed the boat on Spark 3. I like the
Controller/Leader one, since I think that does a good job of reflecting the
code's role.

On Tue, Aug 4, 2020 at 7:01 AM Tom Graves 
wrote:

> Hey everyone,
>
> I filed jira https://issues.apache.org/jira/browse/SPARK-32333 to remove
> references to Master.  I realize this is a bigger change than the slave
> jira, but I wanted to get folks' input on whether they are ok with making
> the change, and if so, we would need to pick a name to use instead.  I
> think we should keep it backwards compatible at first so as not to break
> anyone and, depending on what we find, might break it up into multiple
> smaller jiras.
>
> A few name possibilities:
>  - ApplicationManager
>  - StandaloneClusterManager
>  - Coordinator
>  - Primary
>  - Controller
>
> Thoughts or suggestions?
>
> Thanks,
> Tom
>
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Removing references to Master

2020-08-04 Thread Tom Graves
Hey everyone,
I filed jira https://issues.apache.org/jira/browse/SPARK-32333 to remove
references to Master.  I realize this is a bigger change than the slave jira,
but I wanted to get folks' input on whether they are ok with making the
change and, if so, we would need to pick a name to use instead.  I think we
should keep it backwards compatible at first so as not to break anyone and,
depending on what we find, might break it up into multiple smaller jiras.

A few name possibilities:
 - ApplicationManager
 - StandaloneClusterManager
 - Coordinator
 - Primary
 - Controller

Thoughts or suggestions?

Thanks,
Tom