Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-13 Thread Hyukjin Kwon
I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow and pandas combinations. Spark 3 should be a good time to increase. On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler wrote: > Hi All, > > We would like to discuss increasing the minimum supported version of > Pandas in Spark, which is

[DISCUSS] Increasing minimum supported version of Pandas

2019-06-13 Thread Bryan Cutler
Hi All, We would like to discuss increasing the minimum supported version of Pandas in Spark, which is currently 0.19.2. Pandas 0.19.2 was released nearly 3 years ago and there are some workarounds in PySpark that could be removed if such an old version is not required. This will help to keep

[VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-13 Thread Matt Cheah
Hi everyone, I would like to call a vote for the SPIP for SPARK-25299, which proposes to introduce a pluggable storage API for temporary shuffle data. You may find the SPIP document here. The discussion thread for the SPIP was conducted here. Please vote on whether or not this

Re: Exposing JIRA issue types at GitHub PRs

2019-06-13 Thread Dongjoon Hyun
Thank you for the feedback and requirements, Hyukjin, Reynold, Marco. Sure, we can do whatever we want. I'll wait for more feedback and proceed to the next steps. Bests, Dongjoon. On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido wrote: > Hi Dongjoon, > Thanks for the proposal! I like the

Re: Adding Custom finalize method to RDDs.

2019-06-13 Thread Phillip Henry
If you control the codebase, you control when an RDD goes out of scope. Or am I missing something? (Note that finalize will not necessarily be executed when an object goes out of scope but when the GC runs at some indeterminate point in the future. Please avoid using finalize for the kind of task

Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API

2019-06-13 Thread Saisai Shao
I think maybe we could start a vote on this SPIP. This has been discussed for a while, and the current doc is pretty complete as of now. Also we saw lots of demand in the community for building their own shuffle storage. Thanks, Saisai. On Tue, Jun 11, 2019 at 3:27 AM, Imran Rashid wrote: > I would be

Re: Exposing JIRA issue types at GitHub PRs

2019-06-13 Thread Marco Gaido
Hi Dongjoon, Thanks for the proposal! I like the idea. Maybe we can extend it to components too, and to some JIRA labels such as correctness, which may be worth highlighting in PRs too. My only concern is that in many cases JIRAs are created not very carefully so they may be incorrect at the moment

Re: Exposing JIRA issue types at GitHub PRs

2019-06-13 Thread Reynold Xin
Seems like a good idea. Can we test this with a component first? On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun wrote: > Hi, All. > > Since we use both Apache JIRA and GitHub actively for Apache Spark > contributions, we have lots of JIRAs and PRs consequently. One specific > thing I've been

Re: Exposing JIRA issue types at GitHub PRs

2019-06-13 Thread Hyukjin Kwon
Yea, I think we can automate this process via, for instance, https://github.com/apache/spark/blob/master/dev/github_jira_sync.py. +1 for this sort of automatic categorizing and matching of metadata between JIRA and GitHub. Adding Josh and Sean as well. On Thu, 13 Jun 2019, 13:17 Dongjoon Hyun,
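As a rough illustration of the kind of metadata matching discussed in this thread, the sketch below turns a JIRA issue type and component list into label strings for a pull request. It does not reflect what `dev/github_jira_sync.py` does: the function name `labels_for_issue` and the `type:` / `component:` label format are assumptions made only for this example, and no JIRA or GitHub API is called.

```python
def labels_for_issue(issue_type, components=()):
    """Build hypothetical PR label strings from JIRA issue metadata.

    The label format is an assumption for illustration only.
    """
    labels = ["type: " + issue_type.strip().lower()]
    for component in components:
        labels.append("component: " + component.strip().lower())
    return labels


# Example with made-up input values:
example = labels_for_issue("Bug", ["SQL", "PySpark"])
```

Keeping the mapping in one pure function would make it easy to try on a single field first, such as the component, before extending it to other metadata.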