[ https://issues.apache.org/jira/browse/CRUNCH-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873169#comment-15873169 ]

Attila Sasvari commented on CRUNCH-636:
---------------------------------------

One approach to do this:
- in {{createTempPath()}} of {{DistributedPipeline}}: keep track of the temporary 
directories created. We can add a new entry to the pipeline configuration, for 
example ({{"crunch.tmp.dirs"}}, a colon-separated set of directories).
- in {{MSCROutputHandler}}: introduce a new helper method to test whether we 
are dealing with a temporary output directory. If so, set "dfs.replication" to 
the user-given "crunch.tmp.dir.replication". This replication factor will be 
used by MapReduce to produce the output file(s) in the subsequent 
{{configureForMapReduce()}}. We also need to make sure that the original/default 
replication factor is used for non-intermediate nodes. To do this, we can save 
something like "dfs.replication.initial" the first time {{configure()}} of 
{{MSCROutputHandler}} is called and use that replication setting for leaf 
nodes (see the sketch below).
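
A minimal sketch of both hooks, assuming the hypothetical helper names 
{{recordTempDir}} and {{applyReplication}} and the hypothetical keys 
"crunch.tmp.dirs" and "dfs.replication.initial"; the actual patch may wire 
this into {{DistributedPipeline}} and {{MSCROutputHandler}} differently:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

class TempReplicationSketch {

  static final String TMP_DIRS_KEY = "crunch.tmp.dirs";                    // hypothetical
  static final String TMP_REPLICATION_KEY = "crunch.tmp.dir.replication";  // proposed setting
  static final String INITIAL_REPLICATION_KEY = "dfs.replication.initial"; // hypothetical

  // Would be called from DistributedPipeline#createTempPath(): remember every
  // temporary directory created, as a colon-separated list in the pipeline
  // configuration.
  static void recordTempDir(Configuration conf, Path tmpDir) {
    String existing = conf.get(TMP_DIRS_KEY, "");
    conf.set(TMP_DIRS_KEY,
        existing.isEmpty() ? tmpDir.toString() : existing + ":" + tmpDir);
  }

  // Would be called from MSCROutputHandler#configure(): if the output path is
  // one of the recorded temporary directories, lower dfs.replication to the
  // user-given value; otherwise restore the original factor saved on the
  // first call, so leaf nodes keep the default replication.
  static void applyReplication(Configuration conf, Path outputPath) {
    if (conf.get(INITIAL_REPLICATION_KEY) == null) {
      conf.set(INITIAL_REPLICATION_KEY, conf.get("dfs.replication", "3"));
    }
    boolean isTemp = false;
    for (String dir : conf.get(TMP_DIRS_KEY, "").split(":")) {
      if (!dir.isEmpty() && outputPath.toString().startsWith(dir)) {
        isTemp = true;
        break;
      }
    }
    if (isTemp) {
      conf.setInt("dfs.replication",
          conf.getInt(TMP_REPLICATION_KEY, conf.getInt("dfs.replication", 3)));
    } else {
      conf.set("dfs.replication", conf.get(INITIAL_REPLICATION_KEY));
    }
  }
}
{code}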

I will attach a patch as soon as possible.

> Make replication factor for temporary files configurable
> --------------------------------------------------------
>
>                 Key: CRUNCH-636
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-636
>             Project: Crunch
>          Issue Type: New Feature
>            Reporter: Attila Sasvari
>            Assignee: Attila Sasvari
>
> As of now, Crunch does not allow using different replication factors for 
> temporary files and non-temporary files (e.g. the final output data of leaf 
> nodes) at the same time. If a user has a large amount of data (say hundreds 
> of gigabytes) to process, they might want a lower replication factor for the 
> large temporary files between Crunch jobs. 
> We could make this configurable via a new setting (e.g. 
> {{crunch.tmp.dir.replication}}).
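
For reference, a sketch of how a user could enable the proposed setting on a 
pipeline, assuming it is read from the job {{Configuration}} 
({{LowTempReplicationApp}} is only a placeholder class):

{code:java}
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class LowTempReplicationApp {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Proposed setting: keep only one replica of intermediate job outputs.
    conf.setInt("crunch.tmp.dir.replication", 1);
    Pipeline pipeline = new MRPipeline(LowTempReplicationApp.class, conf);
    // ... build and run the pipeline as usual ...
    pipeline.done();
  }
}
{code}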



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
