[
https://issues.apache.org/jira/browse/CRUNCH-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875038#comment-15875038
]
Attila Sasvari commented on CRUNCH-636:
---------------------------------------
I have a PoC suggesting that the approach I previously recommended is
fragile: I ran a sample dataflow three times, and the replication settings
were not applied deterministically.
[~joshwills] What is your opinion on this ticket/feature? If we allow users
to set a different replication factor for intermediate files and they set it
to 1, then if a disk storing that data fails before the pipeline finishes, the
whole Crunch pipeline would crash. If a job has both temporary and
non-temporary output, then the replication factor should be the one used for
the non-temporary output. I don't know all the possible cases, but it doesn't
seem trivial to me.
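To make the mixed-output rule concrete, here is a minimal sketch of the
decision only. It is not Crunch code: the class and method names are
hypothetical, a plain Map stands in for the job configuration, and
{{crunch.tmp.dir.replication}} is the setting proposed in this ticket, not an
existing one. It assumes the regular factor comes from {{dfs.replication}}
with a fallback of 3.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Crunch. */
final class ReplicationChooser {
    // Setting proposed in CRUNCH-636 (assumption: not an existing Crunch key).
    static final String TMP_REPLICATION_KEY = "crunch.tmp.dir.replication";
    // Standard HDFS replication property.
    static final String DFS_REPLICATION_KEY = "dfs.replication";
    // Fallback used when dfs.replication is not set in the map.
    static final int FALLBACK_REPLICATION = 3;

    /**
     * Picks the replication factor for one job.
     *
     * @param conf              stand-in for the job configuration
     * @param outputIsTemporary one flag per job output, true if temporary
     */
    static int chooseReplication(Map<String, String> conf,
                                 List<Boolean> outputIsTemporary) {
        int regular = Integer.parseInt(
            conf.getOrDefault(DFS_REPLICATION_KEY,
                              String.valueOf(FALLBACK_REPLICATION)));

        // Only a job that writes nothing but temporary data may use the
        // lower factor; any non-temporary output forces the regular one.
        boolean allTemporary = !outputIsTemporary.isEmpty()
            && outputIsTemporary.stream().allMatch(Boolean::booleanValue);
        if (!allTemporary) {
            return regular;
        }

        String tmp = conf.get(TMP_REPLICATION_KEY);
        return tmp == null ? regular : Integer.parseInt(tmp);
    }
}
```

This covers only the case I described above; it says nothing about how the
value would be applied to the files a job actually writes, which is where the
PoC was non-deterministic.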
> Make replication factor for temporary files configurable
> --------------------------------------------------------
>
> Key: CRUNCH-636
> URL: https://issues.apache.org/jira/browse/CRUNCH-636
> Project: Crunch
> Issue Type: New Feature
> Reporter: Attila Sasvari
> Assignee: Attila Sasvari
>
> As of now, Crunch does not allow a different replication factor for
> temporary files and non-temporary files (e.g. the final output data of leaf
> nodes) at the same time. If a user has a large amount of data (say hundreds
> of gigabytes) to process, they might want a lower replication factor
> for large temporary files between Crunch jobs.
> We could make this configurable via a new setting (e.g.
> {{crunch.tmp.dir.replication}}).
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)