[jira] [Commented] (HDFS-3154) Add a notion of immutable/mutable files
[ https://issues.apache.org/jira/browse/HDFS-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240244#comment-13240244 ]

M. C. Srivas commented on HDFS-3154:

OK, so the Andrew File System, from which Ceph draws a lot of inspiration, had extensive client-side caching, snapshots, and mirrors. Plus the files were 100% mutable, with stronger consistency semantics than NFS. And unlike Ceph, and more like HDFS, AFS ran on the local file system that came bundled with the operating system. So it is not hard to implement; the source code for AFS is in the public domain.

Making every file immutable introduces more complications for the users of HDFS. It may simplify some things for the engineers developing it, but the users have to deal with it on a daily basis. I'd rather the engineers solve the hard problems and provide simplicity to their users.

I suspect underlying all this is a use-case in the back of people's minds, for which one solution might involve making every file permanently immutable. It would be beneficial to illustrate that use-case and discuss it.

> Add a notion of immutable/mutable files
> ---
>
> Key: HDFS-3154
> URL: https://issues.apache.org/jira/browse/HDFS-3154
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: name-node
> Reporter: Tsz Wo (Nicholas), SZE
> Assignee: Tsz Wo (Nicholas), SZE
>
> The notion of immutable file is useful since it lets the system and tools
> optimize certain things as discussed in [this email thread|http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-dev/201203.mbox/%3CCAPn_vTuZomPmBTypP8_1xTr49Sj0fy7Mjhik4DbcAA+BLH53=g...@mail.gmail.com%3E].
> Also, many applications require only immutable files. Here is a proposal:
> - Immutable file means that the file content is immutable. Operations such as
>   append and truncate that change the file content are not allowed to act on
>   immutable files. However, metadata such as replication and permission of an
>   immutable file can be updated. Immutable files can also be deleted or renamed.
> - Users have to pass immutable/mutable as a flag at file creation. This is an
>   unmodifiable property of the created file.
> - If users want to change the data in an immutable file, the file can be
>   copied to another file which is created as mutable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3107) HDFS truncate
[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237344#comment-13237344 ]

M. C. Srivas commented on HDFS-3107:

@Zhanwei and @TszWo: A comment on truncate & read interaction:

The behavior of the read() system call in POSIX is to return fewer bytes than asked for if EOF is encountered early. For example, if a file is 100 bytes long and a thread comes along and tries to read 200 bytes starting at offset 20, then read() should return 80. Subsequent calls to read() then return 0, to indicate EOF.

The same principle can be applied to a file that gets truncated after it is opened for read: treat it like a file that got shortened, i.e., do a short read the first time, and raise the EOF exception subsequently.

> HDFS truncate
> ---
>
> Key: HDFS-3107
> URL: https://issues.apache.org/jira/browse/HDFS-3107
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: data-node, name-node
> Reporter: Lei Chang
> Attachments: HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf
>
> Original Estimate: 1,344h
> Remaining Estimate: 1,344h
>
> Systems with transaction support often need to undo changes made to the
> underlying storage when a transaction is aborted. Currently HDFS does not
> support truncate (a standard POSIX operation), which is the reverse operation
> of append. This makes upper-layer applications use ugly workarounds (such as
> keeping track of the discarded byte range per file in a separate metadata
> store, and periodically running a vacuum process to rewrite compacted files)
> to overcome this limitation of HDFS.
[jira] [Commented] (HDFS-3107) HDFS truncate
[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236305#comment-13236305 ]

M. C. Srivas commented on HDFS-3107:

One minor comment: please think about the maintenance problems that arise when you expose funky semantics tacked on to truncate(). People will start using them, and they will be hard or impossible to change later. It is easy to add code, but very difficult to remove it.

I see that you need something like what's being proposed in order to implement snapshots, but it should be an internal-only API and not exposed.
[jira] [Commented] (HDFS-2699) Store data and checksums together in block file
[ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172059#comment-13172059 ]

M. C. Srivas commented on HDFS-2699:

@Todd: no one is arguing that putting the CRC inline is not beneficial w.r.t. seek time.

Recalculating the CRC over a 4K block is substantially slower than over a 512-byte block (on average 2K vs 256 bytes of re-read, an 8x factor). Imagine appending continuously to the HBase WAL with the 128-byte records that you mentioned in another thread: the CPU burn will be much worse with 4K CRC blocks.

Secondly, disk manufacturers guarantee only 512-byte atomicity on disk. Linux doing a 4K block write guarantees almost nothing w.r.t. atomicity of that 4K write to disk. On a crash, unless you are running some sort of RAID or data journal, there is a likelihood of the in-flight 4K block getting corrupted.

> Store data and checksums together in block file
> ---
>
> Key: HDFS-2699
> URL: https://issues.apache.org/jira/browse/HDFS-2699
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: dhruba borthakur
> Assignee: dhruba borthakur
>
> The current implementation of HDFS stores the data in one block file and the
> metadata (checksum) in another block file. This means that every read from
> HDFS actually consumes two disk iops, one to the data file and one to the
> checksum file. This is a major problem for scaling HBase, because HBase is
> usually bottlenecked on the number of random disk iops that the
> storage hardware offers.
[jira] [Commented] (HDFS-2699) Store data and checksums together in block file
[ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171997#comment-13171997 ]

M. C. Srivas commented on HDFS-2699:

@dhruba:

>> a block size of 4096 is too large for the CRC
> the hbase block size is 16K. The hdfs checksum size is 4K. The hdfs block
> size is 256 MB. Which one are you referring to here? Can you please explain
> the read-modify-write cycle? HDFS does mostly large sequential writes (no
> overwrites).

The CRC block size (that is, the contiguous region of the file that a CRC covers). Modifying any portion of that region requires that the entire data for the region be read in, the CRC recomputed for the entire region, and the entire region written out again.

Note that it also introduces a new failure mode: data that was previously written safely a long time ago could now be deemed "corrupt" because the CRC is no longer good due to a minor modification during an append. The failure scenario is as follows:

1. A thread writes to a file and closes it. Let's say the file length is 9K. There are 3 CRCs embedded inline -- one for 0-4K, one for 4K-8K, and one for 8K-9K. Call the last one CRC3.
2. An append happens a few days later to extend the file from 9K to 11K. CRC3 is now recomputed for the 3K-sized region spanning offsets 8K-11K and written out as CRC3-new. But there is a crash, and the entire 3K is not all written out cleanly (CRC3-new and some of the data are written out before the crash -- all 3 copies crash and recover).
3. A subsequent read on the region 8K-9K now fails with a CRC error, even though that write was stable and used to succeed before.

If this file were the HBase WAL, wouldn't this result in data loss?
[jira] [Commented] (HDFS-2699) Store data and checksums together in block file
[ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171970#comment-13171970 ]

M. C. Srivas commented on HDFS-2699:

A couple of observations:

a. If you want to eventually support random IO, then a block size of 4096 is too large for the CRC, as it will cause a read-modify-write cycle over the entire 4K. 512 bytes reduces this overhead.

b. Can the value of the variable "io.bytes.per.checksum" be transferred from the *-site.xml file into the file properties at the NN at the time of file creation? Then, if someone messes around with it, old files will still work as before.
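For reference, the property in question lives in the site configuration today; the suggestion in (b) is to capture its value per file at creation time so later changes cannot invalidate old files. A sketch of the existing setting (512 is assumed here as the customary default):

```xml
<!-- *-site.xml: CRC granularity used when a file is created -->
<property>
  <name>io.bytes.per.checksum</name>
  <value>512</value>
</property>
```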
[jira] [Commented] (HDFS-2461) Support HDFS file name globbing in libhdfs
[ https://issues.apache.org/jira/browse/HDFS-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129303#comment-13129303 ]

M. C. Srivas commented on HDFS-2461:

o.a.h.fs.FileSystem defines

  globStatus(String pattern)
  globStatus(String pattern, PathFilter filter)

Can we give the libhdfs functions identical names and have them return identical values?

> Support HDFS file name globbing in libhdfs
> ---
>
> Key: HDFS-2461
> URL: https://issues.apache.org/jira/browse/HDFS-2461
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: libhdfs
> Reporter: Mariappan Asokan
> Priority: Minor
>
> This is to enhance the C API in libhdfs to support HDFS file name globbing.
> The proposal is to keep the new API simple and return a list of matched HDFS
> path names. Callers can use the existing hdfsGetPathInfo() to get additional
> information on each of the matched paths. The following code snippet shows
> the proposed API enhancements:
> {code:title=hdfs.h}
> /**
>  * hdfsGlob - Get all the HDFS file names that match a glob pattern. The
>  * returned result will be sorted by the file names. The last element in the
>  * array is NULL. The function hdfsFreeGlob() should be called to free this
>  * array and its contents.
>  * @param fs The configured filesystem handle.
>  * @param globPattern The glob pattern to match file names against. Note that
>  * this is not a POSIX regular expression but rather a POSIX glob pattern.
>  * @return Returns a dynamically-allocated array of strings; if there is no
>  * match, an array with one entry that has a NULL value will be returned. If
>  * there is an error, NULL will be returned.
>  */
> char **hdfsGlob(hdfsFS fs, const char *globPattern);
>
> /**
>  * hdfsFreeGlob - Free up the array returned by hdfsGlob().
>  * @param globResult The array of dynamically-allocated strings returned by
>  * hdfsGlob().
>  */
> void hdfsFreeGlob(char **globResult);
> {code}
> Please comment on the above proposed API. I will start the implementation
> and testing. However, I need a committer to work with.
> Thanks.
[jira] [Commented] (HDFS-2422) The NN should tolerate the same number of low-resource volumes as failed volumes
[ https://issues.apache.org/jira/browse/HDFS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125165#comment-13125165 ]

M. C. Srivas commented on HDFS-2422:

@Todd: With soft mounts, if the server goes down, I'd expect the fsync to fail. However, you wouldn't have any guarantee about what happened to the writes issued between the last successful fsync and the failed one: some of them might have succeeded and some might get lost. Conceivably, some of them might get performed again when the server recovers.

So I'd recommend that once you switch from one log to another, you unlink the previous one when you get the chance, before using it again, just to make sure you don't get any ghost writes showing up later.

> The NN should tolerate the same number of low-resource volumes as failed volumes
> ---
>
> Key: HDFS-2422
> URL: https://issues.apache.org/jira/browse/HDFS-2422
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.24.0
> Reporter: Jeff Bean
> Assignee: Aaron T. Myers
> Fix For: 0.24.0
>
> Attachments: HDFS-2422.patch
>
> We encountered a situation where the namenode dropped into safe mode after a
> temporary outage of an NFS mount.
> At 12:10 the NFS server goes offline:
> Oct 8 12:10:05 kernel: nfs: server not responding, timed out
> This caused the namenode to conclude resource issues:
> 2011-10-08 12:10:34,848 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker:
> Space available on volume '' is 0, which is below the configured reserved
> amount 104857600
> Temporary loss of NFS mount shouldn't cause safemode.
[jira] [Commented] (HDFS-2422) The NN should tolerate the same number of low-resource volumes as failed volumes
[ https://issues.apache.org/jira/browse/HDFS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124766#comment-13124766 ]

M. C. Srivas commented on HDFS-2422:

Konstantin and Todd, should the timeout be short, or long?

From the NFS FAQ (http://nfs.sourceforge.net/#faq_e4), soft mounts can cause silent data corruption, even in the middle of a file, when a brief outage occurs. Thus, during recovery, even though the edits-log looks up to date, it might contain bad pages in the middle.

If you wish to use soft mounts, then the recovery process should verify all the logs before picking one of them to use for replay. (I am not sure whether there are CRCs on every record of the edits-log -- are there?) Otherwise, with soft mounts, you will hit issues like HDFS-1382.
[jira] [Commented] (HDFS-2422) The NN should tolerate the same number of low-resource volumes as failed volumes
[ https://issues.apache.org/jira/browse/HDFS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124673#comment-13124673 ]

M. C. Srivas commented on HDFS-2422:

This patch does not really help. If one is using an NFS server, then one must hard-mount the server in order for the data writes to be reliable. But when hard-mounted, the NFS client (i.e., the NN machine) will hang until the NFS server recovers.
[jira] [Commented] (HDFS-2413) Add public APIs for safemode
[ https://issues.apache.org/jira/browse/HDFS-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123609#comment-13123609 ]

M. C. Srivas commented on HDFS-2413:

This should be part of the normal behavior of all file-system ops. It is not practical for a programmer to wrap every file access call (e.g., write, mkdir, open) with "wait for NN to leave safemode".

> Add public APIs for safemode
> ---
>
> Key: HDFS-2413
> URL: https://issues.apache.org/jira/browse/HDFS-2413
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs client
> Affects Versions: 0.23.0
> Reporter: Todd Lipcon
> Fix For: 0.23.0
>
> Currently the APIs for safe-mode are part of DistributedFileSystem, which is
> supposed to be a private interface. However, dependent software often wants
> to wait until the NN is out of safemode. Though it could poll by trying to
> create a file and catching SafeModeException, we should consider making some
> of these APIs public.