[ https://issues.apache.org/jira/browse/CRUNCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485924#comment-14485924 ]
Josh Wills commented on CRUNCH-510:
-----------------------------------
Should have said: the relevant bit of code in that Spark Dataflow file is
around line 162.
> PCollection.materialize with Spark should use collect()
> -------------------------------------------------------
>
> Key: CRUNCH-510
> URL: https://issues.apache.org/jira/browse/CRUNCH-510
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Reporter: Micah Whitacre
> Assignee: Josh Wills
>
> While troubleshooting some other code, I noticed that when using the
> SparkPipeline, code that forces a materialize() call...
> {code}
> delta = Aggregate.max(scores.parallelDo(new MapFn<Pair<String, PageRankData>, Float>() {
>   @Override
>   public Float map(Pair<String, PageRankData> input) {
>     PageRankData prd = input.second();
>     return Math.abs(prd.score - prd.lastScore);
>   }
> }, ptf.floats())).getValue();
> {code}
> ...actually results in the value being written out to HDFS:
> {noformat}
> 15/04/08 13:59:33 INFO DAGScheduler: Job 1 finished: saveAsNewAPIHadoopFile at SparkRuntime.java:332, took 0.223622 s
> {noformat}
> Since Spark's RDDs have a collect() method that accomplishes similar
> functionality, I wonder if we could switch to it and cut down on the need
> to persist to HDFS. I think the current behavior comes from the logic
> shared between MRPipeline and SparkPipeline, and I have no context on how
> easily we could break that apart.
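> For illustration only, here is a minimal sketch of what a collect()-based
> materialize path might look like; the CollectMaterializer class, the
> materializeViaCollect method, and the assumption that a JavaRDD backing the
> PCollection is already at hand are all hypothetical, not existing Crunch
> APIs:
> {code}
> import java.util.List;
>
> import org.apache.spark.api.java.JavaRDD;
>
> public final class CollectMaterializer {
>   private CollectMaterializer() {}
>
>   // Hypothetical helper: instead of writing the PCollection's contents to
>   // a temporary HDFS path via saveAsNewAPIHadoopFile and reading them back
>   // (the behavior shared with MRPipeline today), pull the results straight
>   // back to the driver with Spark's collect().
>   public static <T> Iterable<T> materializeViaCollect(JavaRDD<T> rdd) {
>     // collect() runs the job and returns the results as an in-memory List,
>     // so no round-trip through HDFS is needed.
>     List<T> results = rdd.collect();
>     return results;
>   }
> }
> {code}
> The trade-off is that collect() pulls the entire dataset into driver
> memory, so this only suits PCollections small enough to fit there.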