Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/14579
Hey all - sorry I haven't been able to focus on this. It shouldn't be tough
to do, but it will need some tests. If we can find someone who wants to take
it over, I think it makes a decent starter
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/14579
@MLnick - or if you don't have a chance would it be ok for us to find
someone (perhaps someone new to the project) to take this over and bring it to
the finish line?
Github user ueshin commented on the issue:
https://github.com/apache/spark/pull/14579
@MLnick Hi, are you still working on this? If so, could you fix conflicts
and update please?
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14579
gentle ping.
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/14579
Do you have time to update this @MLnick or maybe would it be OK if someone
else made an updated PR based on this? It would be a nice feature to have for
2.2 :)
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14579
Is there any reason why it is not merged yet? I personally like this too.
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/14579
Just a gentle ping - would be cool to add this for 2.1 if we have the time :)
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/14579
I hope to get to this soon - it's just the test cases that I need to get to
as well!
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/14579
Just pinging to see how it's going?
Github user nchammas commented on the issue:
https://github.com/apache/spark/pull/14579
Looks good to me.
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/14579
I like it personally - if no one has a good reason why not, it seems like a
very reasonable approach.
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/14579
@nchammas @holdenk @davies @rxin how about the approach of @MechCoder in
https://github.com/apache/spark/pull/14579#discussion_r74813935?
I think this will work well, so we could raise an
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/14579
@nchammas to answer your question above
(https://github.com/apache/spark/pull/14579#issuecomment-239185438) - in short
no.
The semantics of the utility method will be the same as for the
Github user nchammas commented on the issue:
https://github.com/apache/spark/pull/14579
Thanks for the quick overview. That's pretty straightforward, actually!
I'll take a look at `PipelinedRDD` for the details.
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/14579
Sure - so at a bit of a high level, and not exactly on point - copying
data out of Python back to the JVM is kind of slow, so if we have multiple
Python operations we can put together in the
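For anyone following along, a rough sketch of the fusion being described (not code from this PR; it assumes an existing SparkContext `sc`):
```python
# Rough illustration: chained Python transformations are fused into a single
# PipelinedRDD, so data crosses the Python<->JVM boundary once for the whole
# chain instead of once per operation.
rdd = sc.parallelize(["b", "a", "c"])
mapped = rdd.map(str.upper).filter(lambda x: x != "A")

print(type(rdd))     # <class 'pyspark.rdd.RDD'>
print(type(mapped))  # <class 'pyspark.rdd.PipelinedRDD'> - map and filter run together
```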
Github user nchammas commented on the issue:
https://github.com/apache/spark/pull/14579
Hmm, OK I see. (Apologies, I don't understand what pipelined RDDs are for,
so the examples are going a bit over my head.)
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/14579
@nchammas to be clear - subclassing only breaks pipelining if the
persisted_rdd is later unpersisted (e.g. used with a `with` statement or
otherwise) - otherwise you can't pipeline on top of a
Github user nchammas commented on the issue:
https://github.com/apache/spark/pull/14579
> So there is no chaining requirement, and it will only work in a `with`
statement.
@MLnick - Couldn't we also create a scenario (like @holdenk did earlier)
where a user does something like
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/14579
After looking at it and considering all the above, I would say the options
are (1) do nothing; or (2) if we want to support this use case, then we
implement a single utility method (I would say
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/14579
@nchammas a utility method (e.g. `cached`) - actually it will be a context
manager class implemented in the same way as `closing` - will work because it
only needs to return the context manager, not
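A minimal sketch of what such a `closing`-style helper could look like (hypothetical, not the PR's actual code; it only assumes the existing `cache()`/`persist()`/`unpersist()` methods on RDD/DataFrame):
```python
# Hypothetical `cached` context manager modeled on contextlib.closing:
# persist on entry, unpersist on exit, and hand back the original
# RDD/DataFrame unchanged so pipelining is unaffected.
class cached(object):
    def __init__(self, data, storage_level=None):
        self.data = data                    # an RDD or DataFrame
        self.storage_level = storage_level

    def __enter__(self):
        if self.storage_level is not None:
            self.data.persist(self.storage_level)
        else:
            self.data.cache()
        return self.data                    # the plain RDD/DataFrame, not a wrapper

    def __exit__(self, exc_type, exc_value, traceback):
        self.data.unpersist()
        return False                        # do not swallow exceptions

# Usage:
# with cached(df) as c:
#     c.count()
```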
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/14579
Right, I wouldn't expect it to error with subclassing - just not pipeline
successfully - but only in a very long-shot corner case.
I think the try/finally with persistence is not an uncommon
Github user nchammas commented on the issue:
https://github.com/apache/spark/pull/14579
Ah, I see. I don't fully understand how `PipelinedRDD` works or how it is
used so I'll have to defer to y'all on this. Does the `cached()` utility method
have this same problem?
> We
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/14579
Yeah, it would break pipelining - I don't think it will necessarily throw an
error, though.
e.g.
```
In [22]: rdd = sc.parallelize(["b", "a", "c"])
In [23]: type(rdd)
```
Github user nchammas commented on the issue:
https://github.com/apache/spark/pull/14579
Sorry, you're right, `__exit__()`'s return value is not going to be
consumed anywhere. What I meant is that `unpersist()` would return the base RDD
or DataFrame object.
But I'm not seeing
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/14579
@nchammas so if we go with the subclassing approach but keep the current
cache/persist interface (e.g. no special utility function), a user could easily
write something like:
`magic =
Github user nchammas commented on the issue:
https://github.com/apache/spark/pull/14579
> the subclassing of RDD approach could cause us to miss out on pipelining
if the RDD was used again after it was unpersisted
How so? Wouldn't `__exit__()` simply return the parent RDD or
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/14579
One minor thing to keep in mind - the subclassing of RDD approach could
cause us to miss out on pipelining if the RDD was used again after it was
unpersisted - but I think that is a relatively
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/14579
cc @davies
Github user nchammas commented on the issue:
https://github.com/apache/spark/pull/14579
None of our options seems great, but if I had to rank them I would say:
1. Add new `Persisted...` classes.
2. Make no changes.
3. Add separate `persisted()` or `cached()` utility
Github user nchammas commented on the issue:
https://github.com/apache/spark/pull/14579
Ah, you're right.
So if we want to avoid needing magic methods in the main RDD/DataFrame
classes and avoid needing a separate utility method like `cache()`, I think one
option available
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/14579
@nchammas I looked at the `@contextmanager` decorator. It is an easy way to
create a method that returns a context manager, but it is essentially _only_
usable in a `with` statement as it returns a
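For reference, a sketch of that generator-based form (hypothetical, not the PR's code): the decorated function returns a generator-backed context manager rather than an RDD, so nothing can be chained onto it outside a `with` block.
```python
from contextlib import contextmanager

# Hypothetical generator-based caching helper. cached(rdd) returns a
# context-manager object, not an RDD, so it is only useful inside `with`.
@contextmanager
def cached(rdd):
    rdd.cache()
    try:
        yield rdd          # this is what `as` binds to inside the with block
    finally:
        rdd.unpersist()

# Works only like this:
#     with cached(rdd) as r:
#         r.count()
#
# Whereas cached(rdd).map(...) fails: the context manager has no map().
```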
Github user nchammas commented on the issue:
https://github.com/apache/spark/pull/14579
Thanks @MLnick for taking this on and for breaking down what you've found
so far.
I took a look through
[`contextlib`](https://docs.python.org/3/library/contextlib.html) for
inspiration,
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/14579
cc @nchammas @holdenk @rxin
**Note:**
This is implemented by adding the `__enter__` and `__exit__` methods to
RDD/DataFrame directly. This allows some potentially weird stuff, such as
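Roughly, the shape being described (a simplified sketch of the idea, not the actual diff; the exact `__exit__` behaviour is an assumption here):
```python
# Simplified sketch (an assumption about the shape, not the actual patch):
# __enter__/__exit__ are added to RDD (and similarly DataFrame) itself, so
# persist()/cache() need no change - they already return self.
class RDD(object):
    # ... existing RDD methods ...

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()   # assumed: leaving the block unpersists
        return False

# Intended usage:
#     with rdd.persist() as persisted:
#         persisted.count()
#
# The "potentially weird stuff": any RDD, persisted or not, can now be used
# in a `with` block, and __exit__ will call unpersist() on it either way.
```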
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14579
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63520/
Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14579
Merged build finished. Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14579
**[Test build #63520 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63520/consoleFull)**
for PR 14579 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14579
**[Test build #63520 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63520/consoleFull)**
for PR 14579 at commit