Github user markhamstra commented on the pull request:
https://github.com/apache/incubator-spark/pull/635#issuecomment-35825888
Huh? I don't get the point of these at all.
At first glance, allCollect looks like a really bad idea. Collecting the
entire contents of an RDD to the driver process only to immediately turn around
and push all of that data (or in this case, multiple copies of the data!) back
across the network is an anti-pattern and generally a very poor design choice
that cannot scale to large data -- if you can handle all of the data within the
driver process, then why are you using a distributed, big-data framework in the
first place?
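To make the objection concrete, here is a rough sketch of the data movement being criticized. This is not the PR's actual implementation, just my own illustration of the round trip it implies; it assumes a live SparkContext `sc`, and the variable names are made up.

```scala
// Illustrative sketch of the collect-then-redistribute anti-pattern,
// NOT the code from the PR. Assumes a live SparkContext `sc`.
val rdd = sc.parallelize(1 to 1000000)

// Step 1: funnel every element of the RDD through the driver's memory.
val everything: Array[Int] = rdd.collect()

// Step 2: immediately push all of that data back across the network
// as a new RDD -- the same bytes that just traveled worker -> driver
// now travel driver -> workers again, once per partition served.
val roundTripped = sc.parallelize(everything, numSlices = 8)
```

Both steps are bounded by the driver's memory and network link, which is exactly why this pattern cannot scale past what a single process could have handled on its own.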
allCollectBroadcast makes even less sense to me. Some workflows do demand
collecting a relatively small amount of data to the driver and then
broadcasting a small amount back to the workers for use in further
computations, but why would I then want to go through the extra step of pushing
the broadcast values into a strange-looking RDD instead of just using the
broadcast variable directly?
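For comparison, the workflow described above already has a direct expression in the existing API, with no wrapper RDD involved. A minimal sketch, assuming a live SparkContext `sc`; `smallRdd`, `bigRdd`, and the key types are placeholders:

```scala
// Collect a small lookup table to the driver
// (assumes smallRdd: RDD[(String, Int)] is genuinely small).
val lookup: Map[String, Int] = smallRdd.collect().toMap

// Broadcast it once; Spark ships it to each executor a single time.
val lookupBc = sc.broadcast(lookup)

// Use the broadcast value directly inside closures on the big RDD --
// no extra RDD wrapping the broadcast data is needed.
val annotated = bigRdd.map { key =>
  (key, lookupBc.value.getOrElse(key, 0))
}
```

The broadcast variable is already usable from any task closure, which is the point of the question: wrapping its contents in another RDD adds a step without adding capability.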
It's going to take a lot of persuading to convince me that either of these
is something we want to promote and support in the 1.0 API. That doesn't mean
that I'm not listening, but I am far from convinced at this point.