[jira] [Comment Edited] (HADOOP-9565) Add a Blobstore interface to add to blobstore FileSystems

2017-05-09 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002879#comment-16002879
 ] 

Steve Loughran edited comment on HADOOP-9565 at 5/9/17 3:35 PM:


Note how this is a query-only API, allowing for dynamic probes of FS features, 
rather than a static "does this FS implement a specific interface" check. Because 
it's done in the base FS class, there is no need to add a new interface and cast 
an FS to it to look for the feature. We just need to make the defaults valid for 
the filesystems, which we can do by having core-default.xml define the default FS 
behaviour and every FS we manage define its own.

What about external filesystems? Well, the default resolver looks for a 
resource called {{contract/fs-$SCHEMA-features.xml}}. If the people 
implementing filesystems copy their contract test XML file into that location 
in the production JAR, it will be picked up automatically. This allows the 
authors to update their JARs and have the capabilities be visible on Hadoop 
implementations with the new API, but still load/run against the old one.

see also: HDFS-11644 and {{StreamCapabilities}}
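
A minimal sketch of the probe-with-defaults idea described above, in plain Java with no Hadoop dependency. The class name {{FeatureProbe}}, its methods, and the feature key used in the usage note are hypothetical and only illustrate the lookup order (per-scheme declaration first, then the shared default); the real resolver would read the XML resources rather than in-memory maps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: not the API proposed in the patch.
class FeatureProbe {
    // Defaults shared by every filesystem (the role core-default.xml plays in the comment).
    private final Map<String, Boolean> defaults = new HashMap<>();
    // Per-scheme declarations (the role of each filesystem's own features file).
    private final Map<String, Map<String, Boolean>> perScheme = new HashMap<>();

    // Resource name an external filesystem would ship in its production JAR.
    static String resourceFor(String scheme) {
        return "contract/fs-" + scheme + "-features.xml";
    }

    void setDefault(String feature, boolean value) {
        defaults.put(feature, value);
    }

    void declare(String scheme, String feature, boolean value) {
        perScheme.computeIfAbsent(scheme, s -> new HashMap<>()).put(feature, value);
    }

    // Dynamic probe: the caller asks about a feature instead of casting to an interface.
    boolean hasFeature(String scheme, String feature) {
        Map<String, Boolean> own = perScheme.get(scheme);
        if (own != null && own.containsKey(feature)) {
            return own.get(feature);
        }
        // Unknown features are treated as unsupported in this sketch.
        return defaults.getOrDefault(feature, false);
    }
}
```

For example, after {{setDefault("supports-atomic-rename", true)}} and {{declare("s3a", "supports-atomic-rename", false)}}, a probe for {{s3a}} returns false while a probe for a scheme with no declaration of its own falls back to the default.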


> Add a Blobstore interface to add to blobstore FileSystems
> -
>
> Key: HADOOP-9565
> URL: https://issues.apache.org/jira/browse/HADOOP-9565
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: fs, fs/s3, fs/swift
>Affects Versions: 2.6.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Attachments: HADOOP-9565-001.patch, HADOOP-9565-002.patch, 
> HADOOP-9565-003.patch, HADOOP-9565-004.patch, HADOOP-9565-005.patch, 
> HADOOP-9565-006.patch, HADOOP-9565-008.patch, HADOOP-9565-010.patch, 
> HADOOP-9565-branch-2-007.patch
>
>
> We can make explicit the fact that some {{FileSystem}} implementations are 
> really blobstores, with different atomicity and consistency guarantees, by 
> adding a {{Blobstore}} interface to them. 
> This could also be a place to add a {{Copy(Path,Path)}} method, assuming that 
> all blobstores implement a server-side copy operation as a substitute for 
> rename.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HADOOP-9565) Add a Blobstore interface to add to blobstore FileSystems

2016-08-24 Thread Chen He (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435538#comment-15435538
 ] 

Chen He edited comment on HADOOP-9565 at 8/24/16 7:46 PM:
--

Hi [~steve_l], thank you for spending time on my question. The new version of 
FileOutputCommitter has algorithm 2, which does not do a serial rename of all 
tasks in commitJob. I just found the parameter. It should resolve our problem. 
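
To make the difference concrete, here is a toy model that only counts renames; it does not touch a filesystem or any Hadoop class. It assumes the commonly described behaviour: with algorithm 1 each task commit stages its output and commitJob then renames every task's output into the destination one after another, while with algorithm 2 each task commit renames straight into the destination. The property name in the constant is the one Hadoop MapReduce uses to select the algorithm; {{CommitModel}} and its members are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: counts renames under the assumptions stated in the lead-in.
class CommitModel {
    // Property that selects the commit algorithm (1 or 2).
    static final String ALGORITHM_VERSION_KEY =
        "mapreduce.fileoutputcommitter.algorithm.version";

    final int version;
    // Task outputs staged by commitTask and still waiting for commitJob (algorithm 1 only).
    final List<String> staged = new ArrayList<>();
    int taskRenames = 0; // renames performed by the tasks themselves, in parallel
    int jobRenames = 0;  // renames performed serially inside commitJob

    CommitModel(int version) {
        this.version = version;
    }

    void commitTask(String taskId) {
        taskRenames++;
        if (version == 1) {
            // Algorithm 1: output is only staged; the final rename happens in commitJob.
            staged.add(taskId);
        }
        // Algorithm 2: the rename above already put the output in the destination.
    }

    void commitJob() {
        jobRenames += staged.size();
        staged.clear();
    }

    // Number of serial renames commitJob performs for a job with the given task count.
    static int serialRenames(int version, int tasks) {
        CommitModel model = new CommitModel(version);
        for (int i = 0; i < tasks; i++) {
            model.commitTask("task_" + i);
        }
        model.commitJob();
        return model.jobRenames;
    }
}
```

Under these assumptions a 100-task job does 100 serial renames in commitJob with algorithm 1 and none with algorithm 2.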


> Add a Blobstore interface to add to blobstore FileSystems
> -
>
> Key: HADOOP-9565
> URL: https://issues.apache.org/jira/browse/HADOOP-9565
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: fs, fs/s3, fs/swift
>Affects Versions: 2.6.0
>Reporter: Steve Loughran
>Assignee: Pieter Reuse
> Attachments: HADOOP-9565-001.patch, HADOOP-9565-002.patch, 
> HADOOP-9565-003.patch, HADOOP-9565-004.patch, HADOOP-9565-005.patch, 
> HADOOP-9565-006.patch, HADOOP-9565-branch-2-007.patch
>
>
> We can make explicit the fact that some {{FileSystem}} implementations are 
> really blobstores, with different atomicity and consistency guarantees, by 
> adding a {{Blobstore}} interface to them. 
> This could also be a place to add a {{Copy(Path,Path)}} method, assuming that 
> all blobstores implement a server-side copy operation as a substitute for 
> rename.






[jira] [Comment Edited] (HADOOP-9565) Add a Blobstore interface to add to blobstore FileSystems

2016-08-19 Thread Chen He (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428975#comment-15428975
 ] 

Chen He edited comment on HADOOP-9565 at 8/19/16 10:52 PM:
---

From our experience, the main renaming overhead comes from 
"FileOutputCommitter.commitTask()", because it moves the files from the temp dir 
to the dest dir. Some frameworks may not care whether the final task files are 
under "dst/_temporary/0/_temporary/" or "dst/". Why don't we add a parameter 
such as "mapreduce.skip.task.commit" (default false), so that once a task is 
done, its output just stays in "dst/_temporary/0/_temporary/"? Then the next job 
or application just needs to take "dst/" as its input dir; it does not care 
whether the files are deep or not. This avoids the atomic write issue, provides 
compatibility, and avoids the rename overhead. If there is no objection, I am 
happy to create a JIRA to track that.
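
A rough sketch of the idea using plain {{java.nio.file}} on a local temp directory. Everything here is hypothetical: "mapreduce.skip.task.commit" is only the name suggested in this comment, not an existing property, "attempt_0" is an illustrative directory name, and the recursive listing is ordinary Java, so it does not model how Hadoop input formats list or filter their input paths.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the proposal; not FileOutputCommitter code.
class SkipTaskCommitSketch {
    // Parameter name suggested in the comment; it does not exist in Hadoop.
    static final String SKIP_TASK_COMMIT = "mapreduce.skip.task.commit";

    // With skip=true the task output stays where the task wrote it (no rename).
    static void commitTask(Path taskDir, Path dest, boolean skip) throws IOException {
        if (skip) {
            return;
        }
        List<Path> files;
        try (Stream<Path> listing = Files.list(taskDir)) {
            files = listing.collect(Collectors.toList());
        }
        for (Path file : files) {
            Files.move(file, dest.resolve(file.getFileName()));
        }
    }

    // A consumer that walks dest recursively finds the files at any depth.
    static List<String> listFiles(Path dest) throws IOException {
        try (Stream<Path> walk = Files.walk(dest)) {
            return walk.filter(Files::isRegularFile)
                .map(p -> dest.relativize(p).toString().replace('\\', '/'))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // Builds dst/_temporary/0/_temporary/attempt_0/part-00000, commits, lists the result.
    static List<String> demo(boolean skip) {
        try {
            Path dest = Files.createTempDirectory("dst");
            Path taskDir = Files.createDirectories(
                dest.resolve("_temporary/0/_temporary/attempt_0"));
            Files.write(taskDir.resolve("part-00000"),
                "data".getBytes(StandardCharsets.UTF_8));
            commitTask(taskDir, dest, skip);
            return listFiles(dest);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the flag set, the single output file is still found by the recursive listing, just at its deeper path; with the flag unset it has been moved directly under the destination.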





[jira] [Comment Edited] (HADOOP-9565) Add a Blobstore interface to add to blobstore FileSystems

2016-08-05 Thread Thomas Demoor (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409534#comment-15409534
 ] 

Thomas Demoor edited comment on HADOOP-9565 at 8/5/16 2:41 PM:
---

Steve, the "avoid data write" thing you mention is exactly why these direct 
output committers (and what I did for the FileOutputCommitter) work on object 
stores. Multiple writers can write to the same object concurrently. At any 
point, the last-started successfully-completed write is what is visible.

Regular put: 
* Content length (=N) is communicated at the start of the request. 
* Once N bytes hit S3, the object becomes visible.
* If the Hadoop task aborts before writing N bytes, the upload will time out and 
the object version is garbage collected by S3. 

Multipart upload:
* Requires an explicit API call to complete (or abort).
* The object becomes visible only when the complete API call is used.
* If the Hadoop task fails, the upload remains active (s3a has purge 
functionality to automatically clean these up after a certain period), but the 
object is NOT visible.

The interesting thing to think about is network partitions.
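
The visibility rules above can be sketched with a toy in-memory model. It is not the S3 API and makes no claim about S3's actual guarantees: {{ToyObjectStore}} and its methods are hypothetical, and "last-started" is modelled with a simple counter assigned when an upload begins.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the visibility rules in the comment; not an S3 client.
class ToyObjectStore {
    private static final class Version {
        final long startSeq;
        final String data;

        Version(long startSeq, String data) {
            this.startSeq = startSeq;
            this.data = data;
        }
    }

    private long nextSeq = 0;
    // Uploads that were started but neither completed nor aborted: upload id -> key.
    private final Map<Long, String> pending = new HashMap<>();
    // Currently visible version of each key.
    private final Map<String, Version> visible = new HashMap<>();

    // Starting an upload makes nothing visible; the id doubles as the start order.
    long startUpload(String key) {
        long id = nextSeq++;
        pending.put(id, key);
        return id;
    }

    // Only an explicit complete makes data visible, and the last-started
    // successfully completed upload wins over earlier-started ones.
    void complete(long id, String data) {
        String key = pending.remove(id);
        if (key == null) {
            throw new IllegalStateException("unknown or finished upload " + id);
        }
        Version current = visible.get(key);
        if (current == null || current.startSeq < id) {
            visible.put(key, new Version(id, data));
        }
    }

    void abort(long id) {
        pending.remove(id);
    }

    // Returns null when no completed upload exists for the key.
    String read(String key) {
        Version v = visible.get(key);
        return v == null ? null : v.data;
    }

    // Uploads left behind by failed writers; these are what a purge would clean up.
    int pendingUploads() {
        return pending.size();
    }
}
```

In this model a writer that dies after {{startUpload}} leaves a pending entry but never changes what {{read}} returns, which is the property the direct committers rely on.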




