Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-13 Thread Hyukjin Kwon
I am +1 on going to 0.23.2 - supporting such old versions brings some
overhead when testing PyArrow and pandas combinations. Spark 3 should be a
good time to increase it.

On Fri, Jun 14, 2019 at 9:46 AM, Bryan Cutler wrote:

> Hi All,
>
> We would like to discuss increasing the minimum supported version of
> Pandas in Spark, which is currently 0.19.2.
>
> Pandas 0.19.2 was released nearly 3 years ago and there are some
> workarounds in PySpark that could be removed if such an old version is not
> required. This will help to keep code clean and reduce maintenance effort.
>
> The change is targeted for Spark 3.0.0 release, see
> https://issues.apache.org/jira/browse/SPARK-28041. The current thought is
> to bump the version to 0.23.2, but we would like to discuss before making a
> change. Does anyone else have thoughts on this?
>
> Regards,
> Bryan
>


[DISCUSS] Increasing minimum supported version of Pandas

2019-06-13 Thread Bryan Cutler
Hi All,

We would like to discuss increasing the minimum supported version of Pandas
in Spark, which is currently 0.19.2.

Pandas 0.19.2 was released nearly 3 years ago and there are some
workarounds in PySpark that could be removed if such an old version is not
required. This will help to keep code clean and reduce maintenance effort.

The change is targeted for Spark 3.0.0 release, see
https://issues.apache.org/jira/browse/SPARK-28041. The current thought is
to bump the version to 0.23.2, but we would like to discuss before making a
change. Does anyone else have thoughts on this?

Regards,
Bryan


[VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-13 Thread Matt Cheah
Hi everyone,

I would like to call a vote for the SPIP for SPARK-25299, which proposes
to introduce a pluggable storage API for temporary shuffle data.
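
(For context only, a purely illustrative sketch of what a "pluggable
storage API for shuffle data" means; this is NOT the interface proposed in
the SPIP document linked below, just the general shape of the idea:)

  import java.io.{InputStream, OutputStream}

  // Hypothetical shape, for illustration only.
  trait ShuffleStoragePlugin {
    // Persist one map task's output for a given reduce partition.
    def writeBlock(shuffleId: Int, mapId: Long, reduceId: Int): OutputStream
    // Stream a block back to a reducer from wherever the plugin stored it.
    def readBlock(shuffleId: Int, mapId: Long, reduceId: Int): InputStream
    // Drop all data for a shuffle once it is no longer needed.
    def removeShuffle(shuffleId: Int): Unit
  }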

 

You may find the SPIP document here.

The discussion thread for the SPIP was conducted here.

Please vote on whether or not this proposal is agreeable to you.

Thanks!

-Matt Cheah





Re: Exposing JIRA issue types at GitHub PRs

2019-06-13 Thread Dongjoon Hyun
Thank you for the feedback and requirements, Hyukjin, Reynold, and Marco.

Sure, we can do whatever we want.

I'll wait for more feedback and then proceed to the next steps.

Bests,
Dongjoon.


On Wed, Jun 12, 2019 at 11:51 PM Marco Gaido  wrote:

> Hi Dongjoon,
> Thanks for the proposal! I like the idea. Maybe we can extend it to the
> component too, and to some JIRA labels, such as correctness, which may be
> worth highlighting in PRs as well. My only concern is that in many cases
> JIRAs are created not very carefully, so they may be incorrect at the
> moment of PR creation and only updated later: keeping them in sync may
> take some extra effort.
>
> On Thu, 13 Jun 2019, 08:09 Reynold Xin,  wrote:
>
>> Seems like a good idea. Can we test this with a component first?
>>
>> On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> Since we use both Apache JIRA and GitHub actively for Apache Spark
>>> contributions, we consequently have lots of JIRAs and PRs. One specific
>>> thing I've been longing to see is the `Jira Issue Type` in GitHub.
>>>
>>> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
>>> There are two main benefits:
>>> 1. It helps communication between contributors and reviewers by
>>> providing more information.
>>> (In some cases, some people only visit GitHub to see the PR and
>>> commits)
>>> 2. `Labels` are searchable, so we don't need to visit Apache JIRA to
>>> find PRs of a specific type.
>>> (For example, the reviewers can see and review 'BUG' PRs first by
>>> using `is:open is:pr label:BUG`.)
>>>
>>> Of course, this can be done automatically without human intervention.
>>> Since we already have a Jenkins job that accesses JIRA/GitHub, that job
>>> can add the labels from the beginning. If needed, I can volunteer to
>>> update the script.
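>>>
>>> (A purely illustrative sketch of that script update: the JIRA and GitHub
>>> REST endpoints below are real, but the JSON-parsing shortcut, the token
>>> handling, and the use of Java 11's HttpClient are all assumptions on my
>>> part, not the actual Jenkins script.)
>>>
>>>   import java.net.URI
>>>   import java.net.http.{HttpClient, HttpRequest, HttpResponse}
>>>
>>>   val client = HttpClient.newHttpClient()
>>>
>>>   // Look up the issue type (e.g. "Bug") via JIRA's public REST API.
>>>   def issueType(jiraId: String): String = {
>>>     val url = s"https://issues.apache.org/jira/rest/api/2/issue/$jiraId?fields=issuetype"
>>>     val req = HttpRequest.newBuilder().uri(URI.create(url)).GET().build()
>>>     val body = client.send(req, HttpResponse.BodyHandlers.ofString()).body()
>>>     // Crude extraction; a real script would use a JSON parser.
>>>     "\"name\"\\s*:\\s*\"([^\"]+)\"".r.findFirstMatchIn(body)
>>>       .map(_.group(1).toUpperCase).getOrElse("UNKNOWN")
>>>   }
>>>
>>>   // Add that type as a label on the PR (needs a token with write access).
>>>   def addLabel(prNumber: Int, label: String, token: String): Unit = {
>>>     val url = s"https://api.github.com/repos/apache/spark/issues/$prNumber/labels"
>>>     val req = HttpRequest.newBuilder().uri(URI.create(url))
>>>       .header("Authorization", s"token $token")
>>>       .POST(HttpRequest.BodyPublishers.ofString(s"""{"labels":["$label"]}"""))
>>>       .build()
>>>     client.send(req, HttpResponse.BodyHandlers.ofString())
>>>   }
>>>
>>>   // The Jenkins job would supply the PR number and its linked JIRA id.
>>>   addLabel(prNumber = 12345, label = issueType("SPARK-12345"),
>>>     token = sys.env("GITHUB_TOKEN"))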
>>>
>>> As a demo, I labeled several PRs manually. You can see the result right
>>> now on the Apache Spark PR page:
>>>
>>>   - https://github.com/apache/spark/pulls
>>>
>>> If you were surprised by those manual activities, I apologize. I hope we
>>> can take advantage of the existing GitHub features to serve the Apache
>>> Spark community better than we did yesterday.
>>>
>>> What do you think about this specific suggestion?
>>>
>>> Bests,
>>> Dongjoon
>>>
>>> PS. I saw that the `Request Review` and `Assign` features are already
>>> used for some purposes, but those features are out of scope for this
>>> email.
>>>
>>


Re: Adding Custom finalize method to RDDs.

2019-06-13 Thread Phillip Henry
If you control the codebase, you control when an RDD goes out of scope. Or
am I missing something?

(Note that finalize is not necessarily executed when an object goes out of
scope, but whenever the GC runs at some indeterminate point in the future.
Please avoid using finalize for the kind of task you're trying to do; it's
not what it was designed for. It is better to pay more attention to
house-keeping in your own code.)
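
To illustrate, a minimal sketch of the try/finally approach (the temp file
below is a stand-in for whatever your data source creates, and `sc` is
assumed to be an existing SparkContext):

  import java.nio.file.Files

  // Stand-in for the temporary data the data source writes out.
  val tmp = Files.createTempFile("rdd-backing-", ".txt")
  Files.write(tmp, "some temporary records".getBytes)

  try {
    val rdd = sc.textFile(tmp.toString)
    println(rdd.count())      // use the RDD while the file is guaranteed to exist
  } finally {
    Files.deleteIfExists(tmp) // deterministic cleanup; finalize would run only
                              // whenever the GC gets around to it, if ever
  }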



On Wed, Jun 12, 2019 at 9:11 PM Nasrulla Khan Haris <
nasrulla.k...@microsoft.com> wrote:

> We cannot control when an RDD goes out of scope in memory, as that is
> handled by the JVM, so I am not sure try/finally will help.
>
> Thus I wanted some mechanism to clean up the temporary data created by an
> RDD immediately, as soon as it goes out of scope.
>
>
>
> Any ideas ?
>
>
>
> Thanks,
>
> Nasrulla
>
>
>
> *From:* Phillip Henry 
> *Sent:* Tuesday, June 11, 2019 11:28 PM
> *To:* Nasrulla Khan Haris 
> *Cc:* Vinoo Ganesh ; dev@spark.apache.org
> *Subject:* Re: Adding Custom finalize method to RDDs.
>
>
>
> That's not the kind of thing a finalize method was ever supposed to do.
>
>
>
> Use a try/finally block instead.
>
>
>
> Phillip
>
>
>
>
>
> On Wed, 12 Jun 2019, 00:01 Nasrulla Khan Haris, <
> nasrulla.k...@microsoft.com.invalid> wrote:
>
> I want to delete some files that I created in my data source API as soon
> as the RDD is cleaned up.
>
>
>
> Thanks,
>
> Nasrulla
>
>
>
> *From:* Vinoo Ganesh 
> *Sent:* Monday, June 10, 2019 1:32 PM
> *To:* Nasrulla Khan Haris ;
> dev@spark.apache.org
> *Subject:* Re: Adding Custom finalize method to RDDs.
>
>
>
> Generally, overriding the finalize() method is an antipattern; it has in
> fact been deprecated since Java 9 (see
> https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Object.html#finalize()).
> What’s the use case here?
>
>
>
> *From: *Nasrulla Khan Haris 
> *Date: *Monday, June 10, 2019 at 15:44
> *To: *"dev@spark.apache.org" 
> *Subject: *RE: Adding Custom finalize method to RDDs.
>
>
>
> Hello Everyone,
>
> Is there a way to do it from user code?
>
>
>
> Thanks,
>
> Nasrulla
>
>
>
> *From:* Nasrulla Khan Haris 
> *Sent:* Sunday, June 9, 2019 5:30 PM
> *To:* dev@spark.apache.org
> *Subject:* Adding Custom finalize method to RDDs.
>
>
>
> Hi All,
>
>
>
> Is there a way to add a custom finalize method to RDD objects, to run
> custom logic when RDDs are destroyed by the JVM?
>
>
>
> Thanks,
>
> Nasrulla
>
>
>
>