[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...

2017-08-02 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/14579 Hey all - sorry I haven't been able to focus on this. It shouldn't be tough to do, but it will need some tests. If we can find someone who wants to take it over, I think it makes a decent starter task.

2017-07-01 Thread holdenk
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14579 @MLnick - or if you don't have a chance, would it be OK for us to find someone (perhaps someone new to the project) to take this over and bring it to the finish line?

2017-06-20 Thread ueshin
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/14579 @MLnick Hi, are you still working on this? If so, could you fix conflicts and update please?

2017-05-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14579 gentle ping.

2017-02-24 Thread holdenk
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14579 Do you have time to update this @MLnick or maybe would it be OK if someone else made an updated PR based on this? It would be a nice feature to have for 2.2 :)

2017-02-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14579 Is there any reason why it is not merged yet? I personally like this too.

2016-11-26 Thread holdenk
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14579 Just a gentle ping - would be cool to add this for 2.1 if we have the time :)

2016-11-02 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/14579 I hope to get to this soon - It's just the test cases that I need to get to also!

2016-11-01 Thread holdenk
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14579 Just pinging to see how it's going?

2016-08-25 Thread nchammas
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Looks good to me. 👍

2016-08-25 Thread holdenk
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14579 I like it personally - if no one has a good reason why not, it seems like a very reasonable approach.

2016-08-25 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/14579 @nchammas @holdenk @davies @rxin how about the approach of @MechCoder in https://github.com/apache/spark/pull/14579#discussion_r74813935? I think this will work well, so we could raise an…

2016-08-11 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/14579 @nchammas to answer your question above (https://github.com/apache/spark/pull/14579#issuecomment-239185438) - in short, no. The semantics of the utility method will be the same as for the…

2016-08-11 Thread nchammas
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Thanks for the quick overview. That's pretty straightforward, actually! I'll take a look at `PipelinedRDD` for the details. 👍

2016-08-11 Thread holdenk
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14579 Sure - so at a bit of a high level, and not exactly on point - copying data out of Python back to the JVM is kind of slow, so if we have multiple Python operations we can put together in the…
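
The idea holdenk describes can be sketched in plain Python (this is a toy illustration of why pipelining matters, not Spark's actual internals): chained per-element functions get composed and applied in a single pass, instead of a round trip through the Python/JVM boundary per transformation.

```python
# Toy sketch of pipelining: compose chained functions so the data is
# traversed once, rather than once per transformation. Function names
# here are made up for illustration.

def pipeline(*funcs):
    """Compose per-element functions so they run in a single pass."""
    def composed(x):
        for f in funcs:
            x = f(x)
        return x
    return composed

double = lambda x: x * 2
increment = lambda x: x + 1

data = [1, 2, 3]

# Unpipelined: two full passes over the data.
two_passes = [increment(v) for v in [double(v) for v in data]]

# Pipelined: one composed function, one pass.
one_pass = list(map(pipeline(double, increment), data))

assert two_passes == one_pass == [3, 5, 7]
```

In PySpark, `PipelinedRDD` plays the role of `pipeline` here, fusing consecutive Python-side `map`/`filter` steps into one task.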

2016-08-11 Thread nchammas
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Hmm, OK I see. (Apologies, I don't understand what pipelined RDDs are for, so the examples are going a bit over my head. 😅)

2016-08-11 Thread holdenk
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14579 @nchammas to be clear - subclassing only breaks pipelining if the persisted_rdd is later unpersisted (e.g. used with a `with` statement or otherwise) - otherwise you can't pipeline on top of a…

2016-08-11 Thread nchammas
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579

> So there is no chaining requirement, and it will only work in a with statement.

@MLnick - Couldn't we also create a scenario (like @holdenk did earlier) where a user does something like…

2016-08-11 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/14579 After looking at it and considering all the above, I would say the options are (1) do nothing; or (2) if we want to support this use case, then we implement a single utility method (I would say…

2016-08-11 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/14579 @nchammas a utility method (e.g. `cached`) - actually it will be a context manager class implemented in the same way as `closing` - will work because it only needs to return the context manager, not…
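
A `closing`-style helper like the one MLnick describes could look roughly like this (a sketch, not the PR's actual code; the stand-in `FakeRDD` class is invented here just to exercise it):

```python
# Sketch of a `cached` helper written the same way contextlib.closing is:
# a small context-manager class that persists on entry and unpersists on
# exit. `cached` and `FakeRDD` are illustrative names, not Spark API.

class cached:
    """Persist `thing` on entry; unpersist it on exit."""
    def __init__(self, thing):
        self.thing = thing

    def __enter__(self):
        self.thing.persist()
        return self.thing

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.thing.unpersist()
        return False  # do not swallow exceptions

# Minimal stand-in for an RDD/DataFrame, to show the helper in action.
class FakeRDD:
    def __init__(self):
        self.persisted = False
    def persist(self):
        self.persisted = True
        return self
    def unpersist(self):
        self.persisted = False
        return self

rdd = FakeRDD()
with cached(rdd) as r:
    assert r is rdd and r.persisted  # persisted inside the block
assert not rdd.persisted             # unpersisted on exit
```

Because `cached(rdd)` is an ordinary object (not a generator wrapper), it can also simply be handed around before the `with` statement, which is the advantage over `@contextmanager` discussed below.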

2016-08-10 Thread holdenk
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14579 Right, I wouldn't expect it to error with subclassing - just not pipeline successfully - but only in a very long-shot corner case. I think the try/finally with persistence is not an uncommon pattern.
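
The try/finally pattern holdenk refers to is presumably the manual idiom the context-manager support is meant to replace; a sketch with a stand-in class (`FakeRDD` is invented here, not Spark API):

```python
# Manual persistence management with try/finally - the pattern the
# `with`-statement support would tidy up. `FakeRDD` is a stand-in.

class FakeRDD:
    def __init__(self):
        self.persisted = False
    def persist(self):
        self.persisted = True
        return self
    def unpersist(self):
        self.persisted = False
        return self

rdd = FakeRDD()
rdd.persist()
try:
    # ... run several actions against the persisted rdd ...
    assert rdd.persisted
finally:
    # cleanup is guaranteed even if an action above raises
    rdd.unpersist()

assert not rdd.persisted
```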

2016-08-10 Thread nchammas
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Ah, I see. I don't fully understand how `PipelinedRDD` works or how it is used, so I'll have to defer to y'all on this. Does the `cached()` utility method have this same problem?

> We…

2016-08-10 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/14579 yeah it would break pipelining - I don't think it will necessarily throw an error though. e.g.

```
In [22]: rdd = sc.parallelize(["b", "a", "c"])

In [23]: type(rdd)
…
```

2016-08-10 Thread nchammas
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Sorry, you're right, `__exit__()`'s return value is not going to be consumed anywhere. What I meant is that `unpersist()` would return the base RDD or DataFrame object. But I'm not seeing…

2016-08-10 Thread holdenk
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14579 @nchammas so if we go with the subclassing approach but keep the current cache/persist interface (e.g. no special utility function), a user could easily write something like: `magic = …`

2016-08-10 Thread nchammas
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579

> the subclassing of RDD approach could cause us to miss out on pipelining if the RDD was used again after it was unpersisted

How so? Wouldn't `__exit__()` simply return the parent RDD or…

2016-08-10 Thread holdenk
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14579 One minor thing to keep in mind - the subclassing of RDD approach could cause us to miss out on pipelining if the RDD was used again after it was unpersisted - but I think that is a relatively…

2016-08-10 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14579 cc @davies

2016-08-10 Thread nchammas
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 None of our options seems great, but if I had to rank them I would say:

1. Add new `Persisted...` classes.
2. Make no changes.
3. Add a separate `persisted()` or `cached()` utility method.

2016-08-10 Thread nchammas
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Ah, you're right. So if we want to avoid needing magic methods in the main RDD/DataFrame classes and avoid needing a separate utility method like `cache()`, I think one option available…

2016-08-10 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/14579 @nchammas I looked at the `@contextmanager` decorator. It is an easy way to create a method that returns a context manager, but it is essentially _only_ usable in a `with` statement as it returns a…
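
MLnick's limitation can be demonstrated directly: `@contextmanager` hands back a one-shot wrapper object, not the RDD itself, so nothing RDD-like can be chained on it outside a `with` block (the `FakeRDD` stand-in below is invented for illustration):

```python
from contextlib import contextmanager

# Why @contextmanager is only usable in a `with` statement: the decorated
# function returns a generator-based wrapper, not the wrapped object.
# `cached` and `FakeRDD` are illustrative names, not Spark API.

class FakeRDD:
    def __init__(self):
        self.persisted = False
    def persist(self):
        self.persisted = True
        return self
    def unpersist(self):
        self.persisted = False
        return self

@contextmanager
def cached(rdd):
    rdd.persist()
    try:
        yield rdd
    finally:
        rdd.unpersist()

rdd = FakeRDD()
cm = cached(rdd)
# The wrapper is not an RDD - it exposes no persist/unpersist/map of its own.
assert not hasattr(cm, "persist")
with cm as r:
    assert r is rdd and r.persisted
assert not rdd.persisted
```

A small `__enter__`/`__exit__` class (as in `contextlib.closing`) avoids this, since the object it returns can be whatever you choose.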

2016-08-10 Thread nchammas
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Thanks @MLnick for taking this on and for breaking down what you've found so far. I took a look through [`contextlib`](https://docs.python.org/3/library/contextlib.html) for inspiration…

2016-08-10 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/14579 cc @nchammas @holdenk @rxin **Note:** This is implemented by adding the `__enter__` and `__exit__` methods to RDD/DataFrame directly. This allows some potentially weird stuff, such as…
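
The approach MLnick describes can be sketched on a stand-in class (again, `FakeRDD` is invented here; the real PR modifies Spark's `RDD`/`DataFrame`): with `__enter__`/`__exit__` on the class itself, *any* instance becomes usable in a `with` statement, including one that was never persisted, which is the "weird stuff" being flagged.

```python
# Sketch of putting __enter__/__exit__ directly on the RDD-like class.
# `FakeRDD` is a stand-in for illustration, not Spark code.

class FakeRDD:
    def __init__(self):
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False
        return self

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.unpersist()
        return False

# Intended use: persist() returns self, so it feeds `with` directly.
with FakeRDD().persist() as r:
    assert r.persisted

# Weird-but-legal use: a never-persisted RDD also works in `with`.
with FakeRDD() as r:
    assert not r.persisted
```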

2016-08-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14579 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63520/

2016-08-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14579 Merged build finished. Test PASSed.

2016-08-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14579 **[Test build #63520 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63520/consoleFull)** for PR 14579 at commit…

2016-08-10 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14579 **[Test build #63520 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63520/consoleFull)** for PR 14579 at commit…