[jira] [Commented] (CRUNCH-256) SequentialFileNamingScheme should cache the # of files in the target directory after the first read

Josh Wills (JIRA) Thu, 22 Aug 2013 20:33:06 -0700

    [ 
https://issues.apache.org/jira/browse/CRUNCH-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748251#comment-13748251
 ]


Josh Wills commented on CRUNCH-256:
-----------------------------------

IIRC, it had to do with handling the output of a union operation, i.e., I could 
have multiple jobs in a pipeline that were all writing to the same output 
directory for subsequent processing. If a target output directory already 
existed and had files in it, we wanted to write the output of the MR job to a 
temp directory first, and then move that output to the target directory for 
further processing. I suspect that to keep the logic simple, we just 
implemented things so that the move was always done, since it didn't seem to 
cost much most of the time.

Obviously this patch is just a band-aid for one of the more pernicious 
problems with this approach, and we should still see about tackling CRUNCH-252 
so that we can eliminate the move entirely when it's not necessary.
                
> SequentialFileNamingScheme should cache the # of files in the target 
> directory after the first read
> ---------------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-256
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-256
>             Project: Crunch
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>             Fix For: 0.8.0
>
>         Attachments: CRUNCH-256.patch
>
>
> After a job finishes running, the post-job hooks rename the files from a temp 
> output directory to the target output directory. When we have lots of files, 
> this move can take a long time, and I traced the performance issue to the 
> fact that SequentialFileNamingScheme does a listStatus() on the output 
> directory for every file that gets moved. If SequentialFileNamingScheme just 
> does this check once and then increments an internal counter, we can 
> significantly decrease the performance overhead involved with the move.
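
To illustrate the idea in the description above, here is a minimal sketch of
a naming scheme that lists the target directory once and then increments an
internal counter. It is not the attached patch and not the actual Crunch
class: the class name CachingSequentialNamer, the "out<N><ext>" name pattern,
and the use of java.nio.file in place of the Hadoop FileSystem API are all
assumptions made so the example is self-contained.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical illustration only; not the Crunch SequentialFileNamingScheme.
class CachingSequentialNamer {
    private final Path targetDir;
    // -1 means the target directory has not been listed yet.
    private int nextIndex = -1;

    CachingSequentialNamer(Path targetDir) {
        this.targetDir = targetDir;
    }

    // Returns the next sequential name. The directory is listed only on the
    // first call; later calls just increment the cached counter.
    synchronized String nextName(String extension) throws IOException {
        if (nextIndex < 0) {
            try (Stream<Path> entries = Files.list(targetDir)) {
                nextIndex = (int) entries.count();
            }
        }
        return String.format("out%d%s", nextIndex++, extension);
    }
}
```

With this shape, moving N files costs one directory listing instead of N. The
trade-off is that the cached count goes stale if anything else writes to the
target directory while the move is in progress, so the sketch assumes a single
writer for the duration of the move.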

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
