[jira] [Commented] (HADOOP-17833) Improve Magic Committer Performance
[ https://issues.apache.org/jira/browse/HADOOP-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451803#comment-17451803 ]

Steve Loughran commented on HADOOP-17833:
-----------------------------------------

Tagging as dependent on MAPREDUCE-7341 for changes in the common jar:
* statistic keys
* rate limiter
* remote iterator feed for Tasks, for incremental scheduling of manifest load and post

> Improve Magic Committer Performance
> -----------------------------------
>
> Key: HADOOP-17833
> URL: https://issues.apache.org/jira/browse/HADOOP-17833
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Affects Versions: 3.3.1
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 5h 20m
> Remaining Estimate: 0h
>
> Magic committer tasks can be slow because every file created with
> overwrite=false triggers a HEAD (to verify there's no file) and a LIST (to
> verify there's no dir). And because of delayed manifestation, the check may
> not behave as expected.
> ParquetOutputFormat is one example of a library which does this.
> We could fix Parquet to use overwrite=true, but (a) there may be surprises in
> other uses, (b) it'd still leave the LIST, and (c) it would do nothing for
> calls from other formats.
> Proposed: createFile() under a magic path to skip all probes for file/dir at
> end of path.
> Only a single task attempt will be writing to that directory and it should
> know what it is doing. If there are conflicting file names and parts across
> tasks, that won't even get picked up at this point. Oh, and none of the
> committers ever check for this: you'll get the last file manifested (s3a) or
> renamed (file).
> If we skip the checks we will save 2 HTTP requests/file.

--
This message was sent by Atlassian Jira (v8.20.1#820001)

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
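The request arithmetic in the issue description quoted above can be sketched as a toy model. This is not S3A code: the class, the method names, the example paths and the `__magic` path element are assumptions made for illustration. It encodes only what the description states: overwrite=false costs a HEAD plus a LIST, overwrite=true still leaves the LIST, and the proposal skips both probes under a magic path.

```java
import java.util.Arrays;

/**
 * Toy model of the probe cost described in HADOOP-17833.
 * Not S3A code: every name and path here is hypothetical.
 */
final class MagicCreateProbeSketch {

    /** Assumed name of the path element that marks a magic path. */
    static final String MAGIC_ELEMENT = "__magic";

    /** True if any element of the path is the magic marker. */
    static boolean isUnderMagicPath(String path) {
        return Arrays.asList(path.split("/")).contains(MAGIC_ELEMENT);
    }

    /**
     * Number of probe requests issued before a file is created.
     * overwrite=false: HEAD + LIST; overwrite=true: LIST only;
     * proposed behaviour: no probes at all under a magic path.
     */
    static int probeRequests(String path, boolean overwrite, boolean skipUnderMagic) {
        if (skipUnderMagic && isUnderMagicPath(path)) {
            return 0;
        }
        int requests = 1;      // LIST: check there is no directory at the path
        if (!overwrite) {
            requests++;        // HEAD: check there is no file at the path
        }
        return requests;
    }

    /** Total probes for a task writing fileCount files under a magic path. */
    static long totalProbes(int fileCount, boolean overwrite, boolean skipUnderMagic) {
        long total = 0;
        for (int i = 0; i < fileCount; i++) {
            String path = "s3a://bucket/output/" + MAGIC_ELEMENT
                + "/job/task/part-" + i + ".parquet";
            total += probeRequests(path, overwrite, skipUnderMagic);
        }
        return total;
    }

    public static void main(String[] args) {
        int files = 1000;
        long before = totalProbes(files, false, false);
        long after = totalProbes(files, false, true);
        System.out.println("probe requests without skip: " + before);
        System.out.println("probe requests with skip:    " + after);
        System.out.println("requests saved per file:     " + (before - after) / files);
    }
}
```

In this model the saving is two requests per file for overwrite=false writers such as the ParquetOutputFormat case mentioned in the description, and paths outside a magic directory keep their probes.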
[jira] [Commented] (HADOOP-17833) Improve Magic Committer Performance
[ https://issues.apache.org/jira/browse/HADOOP-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417773#comment-17417773 ]

Chao Sun commented on HADOOP-17833:
-----------------------------------

[~ste...@apache.org] is this targeting the 3.3.2 release? Since it is a point release, I'm thinking perhaps we should stick to bug fixes and avoid putting in new features or improvements (esp. since this looks like a big PR).

> Improve Magic Committer Performance
> -----------------------------------
>
> Key: HADOOP-17833
> URL: https://issues.apache.org/jira/browse/HADOOP-17833
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Affects Versions: 3.3.1
> Reporter: Steve Loughran
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 5h 10m
> Remaining Estimate: 0h
[jira] [Commented] (HADOOP-17833) Improve Magic Committer Performance
[ https://issues.apache.org/jira/browse/HADOOP-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419397#comment-17419397 ]

Steve Loughran commented on HADOOP-17833:
-----------------------------------------

Let's hold back on this. We can ship internally and test iteratively first. It's not a bug.

> Improve Magic Committer Performance
> -----------------------------------
>
> Key: HADOOP-17833
> URL: https://issues.apache.org/jira/browse/HADOOP-17833
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Affects Versions: 3.3.1
> Reporter: Steve Loughran
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 5h 10m
> Remaining Estimate: 0h
[jira] [Commented] (HADOOP-17833) Improve Magic Committer Performance
[ https://issues.apache.org/jira/browse/HADOOP-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513423#comment-17513423 ]

Steve Loughran commented on HADOOP-17833:
-----------------------------------------

See if we can identify and fix HADOOP-17935 in this.

> Improve Magic Committer Performance
> -----------------------------------
>
> Key: HADOOP-17833
> URL: https://issues.apache.org/jira/browse/HADOOP-17833
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Affects Versions: 3.3.1
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 5h 20m
> Remaining Estimate: 0h
[jira] [Commented] (HADOOP-17833) Improve Magic Committer Performance
[ https://issues.apache.org/jira/browse/HADOOP-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555728#comment-17555728 ]

Steve Loughran commented on HADOOP-17833:
-----------------------------------------

Merged into trunk; will make sure branch-3.3 is up to date, retest and merge there too.

> Improve Magic Committer Performance
> -----------------------------------
>
> Key: HADOOP-17833
> URL: https://issues.apache.org/jira/browse/HADOOP-17833
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Affects Versions: 3.3.1
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 13h
> Remaining Estimate: 0h