[jira] [Commented] (HDFS-3154) Add a notion of immutable/mutable files

2012-03-28 Thread M. C. Srivas (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240244#comment-13240244 ]

M. C. Srivas commented on HDFS-3154:


OK, so the Andrew File System, from which Ceph draws a lot of inspiration, had 
extensive client-side caching, snapshots, and mirrors. Moreover, its files were 
100% mutable, with stronger consistency semantics than NFS.  And unlike Ceph, 
but like HDFS, AFS ran on the local file system that came bundled with the 
operating system.

So it is not hard to implement. The source code for AFS is out there in the 
public domain.

Making every file immutable introduces more complications for the users of 
HDFS. It may simplify some things for the engineers developing it. But the 
users have to deal with it on a daily basis. I'd rather that the engineers 
solve the hard problems and provide simplicity to their users.

I suspect that underlying all this is a use case in the back of people's 
minds, for which one solution might involve making every file permanently 
immutable. It would be helpful to spell out that use case and discuss it.

> Add a notion of immutable/mutable files
> ---
>
> Key: HDFS-3154
> URL: https://issues.apache.org/jira/browse/HDFS-3154
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: name-node
>Reporter: Tsz Wo (Nicholas), SZE
>Assignee: Tsz Wo (Nicholas), SZE
>
> The notion of immutable file is useful since it lets the system and tools 
> optimize certain things as discussed in [this email 
> thread|http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-dev/201203.mbox/%3CCAPn_vTuZomPmBTypP8_1xTr49Sj0fy7Mjhik4DbcAA+BLH53=g...@mail.gmail.com%3E].
>   Also, many applications require only immutable files.  Here is a proposal:
> - An immutable file is one whose content is immutable.  Operations such as 
> append and truncate that change the file content are not allowed to act on 
> immutable files.  However, metadata such as the replication and permissions 
> of an immutable file can be updated.  Immutable files can also be deleted or 
> renamed.
> - Users have to pass immutable/mutable as a flag at file creation.  This is 
> an unmodifiable property of the created file.
> - If users want to change the data in an immutable file, the file can be 
> copied to another file which is created as mutable.
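For concreteness, a hypothetical sketch of what the proposed creation-time 
flag could look like in a libhdfs-style C API (the names and types below are 
illustrative assumptions; no such API exists):

{code}
/* Hypothetical sketch only -- these declarations do not exist in libhdfs;
 * they illustrate the semantics proposed above. */
typedef enum {
    HDFS_FILE_MUTABLE   = 0,
    HDFS_FILE_IMMUTABLE = 1    /* content frozen once written */
} hdfsMutability;

/* Mutability is fixed at creation for the life of the file. */
hdfsFile hdfsCreateFileWithMutability(hdfsFS fs, const char *path,
                                      hdfsMutability m);

/* On an immutable file, append/truncate would fail (e.g., EPERM), while
 * chmod, setrep, rename, and delete would still succeed. */
{code}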





[jira] [Commented] (HDFS-3107) HDFS truncate

2012-03-23 Thread M. C. Srivas (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237344#comment-13237344 ]

M. C. Srivas commented on HDFS-3107:


@Zhanwei and @TszWo:   A comment on the truncate & read interaction: the 
behavior of the read() system call in POSIX is to return fewer bytes than 
asked for if EOF is encountered early.  For example, if a file is 100 bytes 
long, and a thread comes along and tries to read 200 bytes starting at offset 
20, then read() should return 80.  Subsequent calls to read() then return 0, 
to indicate EOF.  The same principle can be applied to a file that gets 
truncated after it is opened for read: treat it like a file that got 
shortened, i.e., do a short read the first time, and raise the EOF exception 
on subsequent reads.
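A minimal C sketch of that POSIX contract (the path and the pre-existing 
100-byte file are assumptions for illustration):

{code}
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Demonstrates the short-read-then-EOF behavior described above. */
int main(void) {
    char buf[200];
    int fd = open("/tmp/hundred-byte-file", O_RDONLY);  /* assumed: 100 bytes */
    if (fd < 0) return 1;
    lseek(fd, 20, SEEK_SET);
    ssize_t n = read(fd, buf, sizeof(buf));  /* short read: 80 (bytes 20..99) */
    printf("first read:  %zd bytes\n", n);
    n = read(fd, buf, sizeof(buf));          /* EOF: returns 0 */
    printf("second read: %zd bytes\n", n);
    close(fd);
    return 0;
}
{code}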


> HDFS truncate
> -
>
> Key: HDFS-3107
> URL: https://issues.apache.org/jira/browse/HDFS-3107
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: data-node, name-node
>Reporter: Lei Chang
> Attachments: HDFS_truncate_semantics_Mar15.pdf, 
> HDFS_truncate_semantics_Mar21.pdf
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> Systems with transaction support often need to undo changes made to the 
> underlying storage when a transaction is aborted. Currently HDFS does not 
> support truncate (a standard Posix operation) which is a reverse operation of 
> append, which makes upper layer applications use ugly workarounds (such as 
> keeping track of the discarded byte range per file in a separate metadata 
> store, and periodically running a vacuum process to rewrite compacted files) 
> to overcome this limitation of HDFS.





[jira] [Commented] (HDFS-3107) HDFS truncate

2012-03-22 Thread M. C. Srivas (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236305#comment-13236305 ]

M. C. Srivas commented on HDFS-3107:


One minor comment:   please think about the maintenance problems when you 
expose funky semantics that have been tacked onto truncate() ... people will 
start using them, and they will be hard or impossible to change. It is easy to 
add code, but very difficult to remove it later.

I see that you need something like what's being proposed in order to implement 
snapshots, but it should be an internal-only API and not be exposed.


> HDFS truncate
> -
>
> Key: HDFS-3107
> URL: https://issues.apache.org/jira/browse/HDFS-3107
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: data-node, name-node
>Reporter: Lei Chang
> Attachments: HDFS_truncate_semantics_Mar15.pdf, 
> HDFS_truncate_semantics_Mar21.pdf
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> Systems with transaction support often need to undo changes made to the 
> underlying storage when a transaction is aborted. Currently HDFS does not 
> support truncate (a standard Posix operation) which is a reverse operation of 
> append, which makes upper layer applications use ugly workarounds (such as 
> keeping track of the discarded byte range per file in a separate metadata 
> store, and periodically running a vacuum process to rewrite compacted files) 
> to overcome this limitation of HDFS.





[jira] [Commented] (HDFS-2699) Store data and checksums together in block file

2011-12-18 Thread M. C. Srivas (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172059#comment-13172059 ]

M. C. Srivas commented on HDFS-2699:


@Todd: no one is arguing that putting the CRC inline is not beneficial wrt 
seek time. But recalculating a CRC over a 4K chunk is substantially slower 
than over a 512-byte chunk (on average 2K vs. 256 bytes re-read per update, 
roughly an 8x factor). Imagine appending continuously to the HBase WAL with 
the 128-byte records that you mentioned in another thread ... the CPU burn 
will be much worse with 4K CRC chunks.
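A back-of-envelope sketch of that cost difference (pure arithmetic; it assumes 
an append lands mid-chunk on average):

{code}
#include <stdio.h>

/* Average bytes that must be re-read and re-checksummed when a small
 * record is appended into a partially filled CRC chunk. */
int main(void) {
    const int record = 128;                   /* WAL record size from the thread */
    const int chunk_sizes[] = {512, 4096};
    for (int i = 0; i < 2; i++) {
        int avg_reread = chunk_sizes[i] / 2;  /* half a chunk on average */
        printf("%4d-byte chunks: ~%4d bytes checksummed per %d-byte append\n",
               chunk_sizes[i], avg_reread + record, record);
    }
    return 0;
}
{code}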

Secondly, disk manufacturers guarantee only 512-byte atomicity on disk. Linux 
doing a 4K block write guarantees almost nothing wrt the atomicity of that 4K 
write to disk. On a crash, unless you are running some sort of RAID or data 
journal, there is a real likelihood of an in-flight 4K block getting corrupted.


> Store data and checksums together in block file
> ---
>
> Key: HDFS-2699
> URL: https://issues.apache.org/jira/browse/HDFS-2699
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
>
> The current implementation of HDFS stores the data in one block file and the 
> metadata(checksum) in another block file. This means that every read from 
> HDFS actually consumes two disk iops, one to the datafile and one to the 
> checksum file. This is a major problem for scaling HBase, because HBase is 
> usually  bottlenecked on the number of random disk iops that the 
> storage-hardware offers.





[jira] [Commented] (HDFS-2699) Store data and checksums together in block file

2011-12-18 Thread M. C. Srivas (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171997#comment-13171997 ]

M. C. Srivas commented on HDFS-2699:


@dhruba:

>> a block size of 4096 is too large for the CRC

> the hbase block size is 16K. The hdfs checksum size is 4K. The hdfs block 
> size is 256 MB. Which one are you referring to here? Can you please explain 
> the read-modify-write cycle? HDFS does mostly large sequential writes (no 
> overwrites).

The CRC block size (that is, the contiguous region of the file that a single 
CRC covers).  Modifying any portion of that region requires reading the entire 
region back in, recomputing the CRC over the whole region, and writing the 
entire region out again.


Note that it also introduces a new failure mode ... data that was previously 
written safely, a long time ago, could now be deemed "corrupt", since its CRC 
is no longer good after a minor modification during an append. The failure 
scenario is as follows:

1. A thread writes to a file and closes it. Let's say the file length is 9K.  
There are 3 CRCs embedded inline -- one for 0-4K, one for 4K-8K, and one for 
8K-9K. Call the last one CRC3.

2. An append happens a few days later to extend the file from 9K to 11K. CRC3 
is now recomputed for the 3K-sized region spanning offsets 8K-11K and written 
out as CRC3-new. But there is a crash, and the entire 3K is not written out 
cleanly (CRC3-new and some of the data are written out before the crash -- all 
3 copies crash and recover).

3. A subsequent read on the region 8K-9K now fails with a CRC error ... even 
though the write was stable and used to succeed before.

If this file were the HBase WAL, wouldn't this result in data loss?
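A small sketch of the chunk arithmetic behind this scenario (the 4K chunk size 
and file lengths are taken from the steps above):

{code}
#include <stdio.h>

/* Shows how many previously stable bytes get rewritten -- and put at
 * risk -- when an append grows a 9K file to 11K under 4K CRC chunks. */
int main(void) {
    const long CHUNK = 4096;
    long old_len = 9 * 1024, new_len = 11 * 1024;
    long chunk_start = (old_len / CHUNK) * CHUNK;  /* 8K: start of CRC3's region */
    printf("append rewrites [%ld, %ld): %ld previously stable bytes at risk\n",
           chunk_start, new_len, old_len - chunk_start);
    return 0;
}
{code}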



> Store data and checksums together in block file
> ---
>
> Key: HDFS-2699
> URL: https://issues.apache.org/jira/browse/HDFS-2699
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
>
> The current implementation of HDFS stores the data in one block file and the 
> metadata(checksum) in another block file. This means that every read from 
> HDFS actually consumes two disk iops, one to the datafile and one to the 
> checksum file. This is a major problem for scaling HBase, because HBase is 
> usually  bottlenecked on the number of random disk iops that the 
> storage-hardware offers.





[jira] [Commented] (HDFS-2699) Store data and checksums together in block file

2011-12-18 Thread M. C. Srivas (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171970#comment-13171970 ]

M. C. Srivas commented on HDFS-2699:


Couple of observations:

a. If you want to eventually support random I/O, then a block size of 4096 is 
too large for the CRC, as it will cause a read-modify-write cycle over the 
entire 4K.  512 bytes reduces this overhead.

b. Can the value of the variable "io.bytes.per.checksum" be transferred from 
the *-site.xml file into the file properties at the NN at the time of file 
creation?  If someone later messes with the setting, old files will still work 
as before.
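The property in question, as it appears in a site configuration file (the 
value shown is the common default; the comment paraphrases the suggestion 
above):

{code:title=*-site.xml}
<property>
  <name>io.bytes.per.checksum</name>
  <value>512</value>
  <!-- Suggestion above: record this value in the file's properties at the
       NN on create, so a later change here cannot break existing files. -->
</property>
{code}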

> Store data and checksums together in block file
> ---
>
> Key: HDFS-2699
> URL: https://issues.apache.org/jira/browse/HDFS-2699
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
>
> The current implementation of HDFS stores the data in one block file and the 
> metadata(checksum) in another block file. This means that every read from 
> HDFS actually consumes two disk iops, one to the datafile and one to the 
> checksum file. This is a major problem for scaling HBase, because HBase is 
> usually  bottlenecked on the number of random disk iops that the 
> storage-hardware offers.





[jira] [Commented] (HDFS-2461) Support HDFS file name globbing in libhdfs

2011-10-17 Thread M. C. Srivas (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13129303#comment-13129303 ]

M. C. Srivas commented on HDFS-2461:


o.a.h.fs.FileSystem defines

   globStatus(Path pathPattern)
   globStatus(Path pathPattern, PathFilter filter)

Can we give the libhdfs functions identical names, and have them return 
identical values?
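A hypothetical sketch of what such mirrored signatures could look like (these 
functions do not exist in libhdfs; hdfsFileInfo is the existing libhdfs 
analogue of FileStatus):

{code}
/* Hypothetical declarations mirroring FileSystem#globStatus -- illustrative
 * only.  Both would return a malloc'd array of hdfsFileInfo, to be freed
 * with the existing hdfsFreeFileInfo(). */
hdfsFileInfo *hdfsGlobStatus(hdfsFS fs, const char *pathPattern,
                             int *numEntries);
hdfsFileInfo *hdfsGlobStatus2(hdfsFS fs, const char *pathPattern,
                              int (*filter)(const char *path),
                              int *numEntries);
{code}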


> Support HDFS file name globbing in libhdfs
> --
>
> Key: HDFS-2461
> URL: https://issues.apache.org/jira/browse/HDFS-2461
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs
>Reporter: Mariappan Asokan
>Priority: Minor
>
> This is to enhance the C API in libhdfs to support HDFS file name globbing.  
> The proposal is to keep the new API simple and return a list of matched HDFS 
> path names.  Callers can use existing hdfsGetPathInfo() to get additional 
> information on each of the matched path.  Following code snippet shows the 
> proposed API enhancements:
> {code:title=hdfs.h}
> /**
>  * hdfsGlob - Get all the HDFS file names that match a glob pattern.  The
>  * returned result will be sorted by the file names.  The last element in the
>  * array is NULL.  The function hdfsFreeGlob() should be called to free this
>  * array and its contents.
>  * @param fs The configured filesystem handle.
>  * @param globPattern The glob pattern to match file names against.  Note that
>  * this is not a POSIX regular expression but rather a POSIX glob pattern.
>  * @return Returns a dynamically-allocated array of strings; if there is no
>  * match, an array with one entry that has a NULL value will be returned.  If
>  * there is an error, NULL will be returned.
>  */
> char ** hdfsGlob(hdfsFS fs, const char *globPattern);
>
> /**
>  * hdfsFreeGlob - Free up the array returned by hdfsGlob().
>  * @param globResult The array of dynamically-allocated strings returned by
>  * hdfsGlob().
>  */
> void hdfsFreeGlob(char **globResult);
> {code}
> Please comment on the above proposed API.  I will start the implementation 
> and testing.  However, I need a committer to work with.
> Thanks.





[jira] [Commented] (HDFS-2422) The NN should tolerate the same number of low-resource volumes as failed volumes

2011-10-11 Thread M. C. Srivas (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125165#comment-13125165 ]

M. C. Srivas commented on HDFS-2422:


@Todd:  

With soft mounts, if the server goes down, I'd expect the fsync to fail.  
However, you wouldn't have any guarantee about what happened to the writes 
issued between the last successful fsync and the failed one.  Some of them 
might have succeeded and some might be lost.  Conceivably, some of them might 
get performed again when the server recovers.  So I'd recommend that once you 
switch from one log to another, you unlink the previous one when you get the 
chance, before ever using it again, just to make sure you don't get any ghost 
writes showing up later.
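A minimal sketch of that hygiene step (the path is illustrative; the rotation 
machinery itself is elided):

{code}
#include <stdio.h>
#include <unistd.h>

/* After all writers have cut over to a new edits log (and it has been
 * fsync'd), unlink the old log so that writes replayed by a recovering
 * NFS server cannot reappear in a file we might later reuse. */
void retire_old_log(const char *old_log_path) {
    if (unlink(old_log_path) != 0)
        perror("unlink old edits log");  /* retry later; never reuse the name */
}
{code}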

> The NN should tolerate the same number of low-resource volumes as failed 
> volumes
> 
>
> Key: HDFS-2422
> URL: https://issues.apache.org/jira/browse/HDFS-2422
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.24.0
>Reporter: Jeff Bean
>Assignee: Aaron T. Myers
> Fix For: 0.24.0
>
> Attachments: HDFS-2422.patch
>
>
> We encountered a situation where the namenode dropped into safe mode after a 
> temporary outage of an NFS mount.
> At 12:10 the NFS server goes offline
> Oct  8 12:10:05  kernel: nfs: server  not responding, 
> timed out
> This caused the namenode to conclude resource issues:
> 2011-10-08 12:10:34,848 WARN 
> org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space 
> available on volume '' is 0, which is below the configured reserved 
> amount 104857600
> Temporary loss of NFS mount shouldn't cause safemode.





[jira] [Commented] (HDFS-2422) The NN should tolerate the same number of low-resource volumes as failed volumes

2011-10-11 Thread M. C. Srivas (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124766#comment-13124766 ]

M. C. Srivas commented on HDFS-2422:


Konstantin and Todd, should the timeout be short, or long?

From the NFS FAQ (http://nfs.sourceforge.net/#faq_e4): soft mounts can cause 
silent data corruption, even in the middle of a file, when a brief outage 
occurs. Thus, during recovery, even though the edits log looks up-to-date, it 
might contain bad pages in the middle.

If you wish to use soft mounts, then the recovery process should verify all 
the logs before picking one of them to use for replay. (I am not sure if there 
are CRCs on every record of the edits log ... are there?)

Otherwise, with soft mounts, you will hit issues like HDFS-1382.
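A sketch of the verify-before-replay idea, under an assumed record framing of 
[length][crc][payload] (which, per the question above, may or may not match 
the actual edits-log format):

{code}
#include <stdint.h>
#include <stdio.h>
#include <zlib.h>  /* crc32() */

/* Returns 1 iff every [len][crc][payload] record in the log checks out. */
int log_is_clean(FILE *f) {
    uint32_t len, stored;
    static uint8_t buf[1 << 16];
    while (fread(&len, 4, 1, f) == 1 && fread(&stored, 4, 1, f) == 1) {
        if (len > sizeof(buf) || fread(buf, 1, len, f) != len)
            return 0;                      /* truncated or oversized record */
        if ((uint32_t)crc32(0L, buf, len) != stored)
            return 0;                      /* corrupt page in the middle */
    }
    return 1;
}
{code}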



> The NN should tolerate the same number of low-resource volumes as failed 
> volumes
> 
>
> Key: HDFS-2422
> URL: https://issues.apache.org/jira/browse/HDFS-2422
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.24.0
>Reporter: Jeff Bean
>Assignee: Aaron T. Myers
> Fix For: 0.24.0
>
> Attachments: HDFS-2422.patch
>
>
> We encountered a situation where the namenode dropped into safe mode after a 
> temporary outage of an NFS mount.
> At 12:10 the NFS server goes offline
> Oct  8 12:10:05  kernel: nfs: server  not responding, 
> timed out
> This caused the namenode to conclude resource issues:
> 2011-10-08 12:10:34,848 WARN 
> org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space 
> available on volume '' is 0, which is below the configured reserved 
> amount 104857600
> Temporary loss of NFS mount shouldn't cause safemode.





[jira] [Commented] (HDFS-2422) The NN should tolerate the same number of low-resource volumes as failed volumes

2011-10-10 Thread M. C. Srivas (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124673#comment-13124673 ]

M. C. Srivas commented on HDFS-2422:


This patch does not really help. If one is using an NFS server, then one must 
hard-mount the server in order for the data writes to be reliable.  But when 
hard-mounted, the NFS client (i.e., the NN machine) will hang until the NFS 
server recovers.


> The NN should tolerate the same number of low-resource volumes as failed 
> volumes
> 
>
> Key: HDFS-2422
> URL: https://issues.apache.org/jira/browse/HDFS-2422
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: name-node
>Affects Versions: 0.24.0
>Reporter: Jeff Bean
>Assignee: Aaron T. Myers
> Fix For: 0.24.0
>
> Attachments: HDFS-2422.patch
>
>
> We encountered a situation where the namenode dropped into safe mode after a 
> temporary outage of an NFS mount.
> At 12:10 the NFS server goes offline
> Oct  8 12:10:05  kernel: nfs: server  not responding, 
> timed out
> This caused the namenode to conclude resource issues:
> 2011-10-08 12:10:34,848 WARN 
> org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space 
> available on volume '' is 0, which is below the configured reserved 
> amount 104857600
> Temporary loss of NFS mount shouldn't cause safemode.





[jira] [Commented] (HDFS-2413) Add public APIs for safemode

2011-10-08 Thread M. C. Srivas (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-2413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123609#comment-13123609 ]

M. C. Srivas commented on HDFS-2413:


This should be part of the normal behavior of all file-system ops.  It is not 
practical for a programmer to wrap every file-access call (e.g., write, mkdir, 
open) with "wait for the NN to leave safemode".
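For reference, the polling workaround being argued against looks roughly like 
this in libhdfs (a sketch under assumptions: the probe path is illustrative, 
and any mkdir failure is optimistically treated as safemode):

{code}
#include <unistd.h>
#include "hdfs.h"

/* Poll until the NN accepts a namespace mutation, i.e. has left safemode.
 * This is the per-application boilerplate the comment above argues
 * should not be necessary. */
void wait_for_nn(hdfsFS fs) {
    const char *probe = "/tmp/.safemode-probe";  /* illustrative path */
    while (hdfsCreateDirectory(fs, probe) != 0)
        sleep(1);  /* assumes failure means safemode; real code must triage */
    hdfsDelete(fs, probe, 0);  /* remove the probe directory */
}
{code}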

> Add public APIs for safemode
> 
>
> Key: HDFS-2413
> URL: https://issues.apache.org/jira/browse/HDFS-2413
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs client
>Affects Versions: 0.23.0
>Reporter: Todd Lipcon
> Fix For: 0.23.0
>
>
> Currently the APIs for safe-mode are part of DistributedFileSystem, which is 
> supposed to be a private interface. However, dependent software often wants 
> to wait until the NN is out of safemode. Though it could poll trying to 
> create a file and catching SafeModeException, we should consider making some 
> of these APIs public.
