RE: corrupt unreplicated block in dfs (0.18.3)
Mike, you might want to look at the -move option in fsck.

bash-3.00$ hadoop fsck
Usage: DFSck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
        <path>          start checking from this path
        -move           move corrupted files to /lost+found
        -delete         delete corrupted files
        -files          print out files being checked
        -openforwrite   print out files opened for write
        -blocks         print out block report
        -locations      print out locations for every block
        -racks          print out network topology for data-node locations

I never use it myself, since I would rather have users' jobs fail than have them succeed with incomplete inputs.

Koji
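To make that workflow concrete, here is a minimal sketch of how -move might be used, assuming a hypothetical file /user/mike/data: run fsck read-only first to see which files have corrupt blocks, then rerun with -move to quarantine them under /lost+found:

    bash-3.00$ hadoop fsck /user/mike/data -files -blocks -locations
    bash-3.00$ hadoop fsck /user/mike/data -move

Note Koji's caveat above: once corrupt files have been moved to /lost+found, a job that globs over the parent directory will run on whatever remains, and may succeed with incomplete inputs rather than failing.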
Re: corrupt unreplicated block in dfs (0.18.3)
Just because a block is corrupt doesn't mean the entire file is corrupt. Furthermore, the presence or absence of a file in the namespace is a completely separate issue from the data in the file. I think it would be a surprising interface change if files suddenly disappeared just because 1 out of potentially many blocks was corrupt.

- Aaron
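A quick illustration of that separation, using the same hypothetical file /user/mike/data with a single corrupt replica: the namespace still lists the file, and the corruption only surfaces when a read touches the bad block:

    bash-3.00$ hadoop fs -ls /user/mike/data    # still listed; metadata lives on the namenode
    bash-3.00$ hadoop fs -cat /user/mike/data   # fails with a checksum error from the datanode

(The exact exception text varies; the point is that the failure shows up at read time, not as a change to the namespace.)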
corrupt unreplicated block in dfs (0.18.3)
I noticed that when a file with no replication (i.e., replication=1) develops a corrupt block, Hadoop takes no action aside from the datanode throwing an exception to the client trying to read the file. I manually corrupted a block in order to observe this.

Obviously, with replication=1 it's impossible to fix the block, but I thought perhaps Hadoop would take some other action, such as deleting the file outright, moving it to a "corrupt" directory, or marking it or keeping track of it somehow to note that there's unfixable corruption in the filesystem. As it stands, the current behaviour seems to sweep the corruption under the rug and allow its continued existence, aside from notifying the specific client doing the read with an exception.

If anyone has any information about this issue or how to work around it, please let me know.

On the other hand, I tested that corrupting a block in a replication=3 file causes Hadoop to re-replicate the block from another existing copy, which is good and is what I expected.

Best,
Mike

--
Permanent contact information at http://mikerandrews.com
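For anyone who wants to reproduce this, a rough sketch of the experiment; the paths, the block id, and the dfs.data.dir location are hypothetical and site-specific:

    bash-3.00$ hadoop fs -put input.dat /user/mike/data
    bash-3.00$ hadoop fs -setrep -w 1 /user/mike/data                  # drop to a single replica
    bash-3.00$ hadoop fsck /user/mike/data -files -blocks -locations   # note the block id and datanode
    # then, on that datanode, overwrite a few bytes of the block file in place:
    bash-3.00$ dd if=/dev/zero of=/data/dfs/current/blk_1234567890 bs=1 count=8 conv=notrunc
    bash-3.00$ hadoop fs -cat /user/mike/data > /dev/null              # read now fails with a checksum exception

With replication=3, skip the -setrep step; after one replica is corrupted, the read succeeds from another copy and the namenode re-replicates the block, matching what Mike observed.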