[
https://issues.apache.org/jira/browse/CRUNCH-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873169#comment-15873169
]
Attila Sasvari commented on CRUNCH-636:
---------------------------------------
One approach to do this:
- in {{createTempPath()}} of {{DistributedPipeline}}: keep track of the
temporary directories created. We can add a new entry to the pipeline
configuration, for example "crunch.tmp.dirs" (a colon-separated set of
directories),
- in {{MSCROutputHandler}}: introduce a new helper method to test whether we
are dealing with a temporary output directory. If so, set "dfs.replication" to
the user-given "crunch.tmp.dir.replication". This replication factor will be
used by MapReduce to produce the output file(s) in the subsequent
"configureForMapReduce()" call. We also need to make sure that the
original/default replication factor is used for non-intermediate nodes. To do
this, we can set something like "dfs.replication.initial" the first time
{{configure()}} of {{MSCROutputHandler}} is called, and use this replication
setting for leaf nodes.
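A minimal sketch of the two steps above, not the actual patch. A plain
{{Map}} stands in for the Hadoop {{Configuration}} so the example runs on its
own. The class and method names ({{TmpReplicationSketch}}, {{trackTempDir}},
{{isTempOutput}}, {{configure}}) are hypothetical, and the fallback of 3 is
the stock HDFS default for "dfs.replication".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: a Map replaces the Hadoop Configuration; all names are hypothetical.
class TmpReplicationSketch {
  static final String TMP_DIRS = "crunch.tmp.dirs";
  static final String TMP_DIR_REPLICATION = "crunch.tmp.dir.replication";
  static final String DFS_REPLICATION = "dfs.replication";
  static final String DFS_REPLICATION_INITIAL = "dfs.replication.initial";

  // Parse the colon-separated list of tracked temporary directories.
  static Set<String> tempDirs(Map<String, String> conf) {
    Set<String> dirs = new LinkedHashSet<>();
    String raw = conf.getOrDefault(TMP_DIRS, "");
    if (!raw.isEmpty()) {
      dirs.addAll(Arrays.asList(raw.split(":")));
    }
    return dirs;
  }

  // Step 1 (createTempPath()): record each temporary directory as it is created.
  static void trackTempDir(Map<String, String> conf, String dir) {
    Set<String> dirs = tempDirs(conf);
    dirs.add(dir);
    conf.put(TMP_DIRS, String.join(":", dirs));
  }

  // Step 2 helper: is this output path one of, or inside one of, the tracked dirs?
  static boolean isTempOutput(Map<String, String> conf, String outputPath) {
    for (String dir : tempDirs(conf)) {
      if (outputPath.equals(dir) || outputPath.startsWith(dir + "/")) {
        return true;
      }
    }
    return false;
  }

  // Step 2 (configure()): pick the replication factor for the given output.
  static void configure(Map<String, String> conf, String outputPath) {
    // Remember the original factor the first time configure() runs.
    if (!conf.containsKey(DFS_REPLICATION_INITIAL)) {
      conf.put(DFS_REPLICATION_INITIAL, conf.getOrDefault(DFS_REPLICATION, "3"));
    }
    String tmpReplication = conf.get(TMP_DIR_REPLICATION);
    if (tmpReplication != null && isTempOutput(conf, outputPath)) {
      conf.put(DFS_REPLICATION, tmpReplication);
    } else {
      // Leaf (non-intermediate) output: restore the original factor.
      conf.put(DFS_REPLICATION, conf.get(DFS_REPLICATION_INITIAL));
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(DFS_REPLICATION, "3");
    conf.put(TMP_DIR_REPLICATION, "1");
    trackTempDir(conf, "/tmp/crunch-1/p1");
    configure(conf, "/tmp/crunch-1/p1/out");
    System.out.println("temp output: " + conf.get(DFS_REPLICATION));
    configure(conf, "/user/alice/result");
    System.out.println("leaf output: " + conf.get(DFS_REPLICATION));
  }
}
```

One caveat with a colon-separated list: fully qualified paths of the form
hdfs://host:port/path contain colons themselves, so the sketch assumes
scheme-less paths. A real patch would probably need a different separator or
some escaping.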
I will attach a patch as soon as possible.
> Make replication factor for temporary files configurable
> --------------------------------------------------------
>
> Key: CRUNCH-636
> URL: https://issues.apache.org/jira/browse/CRUNCH-636
> Project: Crunch
> Issue Type: New Feature
> Reporter: Attila Sasvari
> Assignee: Attila Sasvari
>
> As of now, Crunch does not allow having different replication factors for
> temporary files and non-temporary files (e.g. final output data of leaf
> nodes) at the same time. If a user has a large amount of data (say hundreds
> of gigabytes) to process, they might want to have a lower replication factor
> for large temporary files between Crunch jobs.
> We could make this configurable via a new setting (e.g.
> {{crunch.tmp.dir.replication}}).