[jira] Commented: (HDFS-1542) Deadlock in Configuration.writeXml when serialized form is larger than one DFS block
[ https://issues.apache.org/jira/browse/HDFS-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972344#action_12972344 ]

Todd Lipcon commented on HDFS-1542:
------------------------------------

Yes, you can probably set the dfs block size very large for your job, but that will have the side effect of producing very large blocks in the output as well. The other workaround is not to use Configuration to store very large objects. Instead, please consider using the DistributedCache API - it should be more efficient anyway.

> Deadlock in Configuration.writeXml when serialized form is larger than one DFS block
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-1542
>                 URL: https://issues.apache.org/jira/browse/HDFS-1542
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>    Affects Versions: 0.20.2, 0.22.0, 0.23.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>         Attachments: Test.java
>
> Configuration.writeXml holds a lock on itself and then writes the XML to an output stream, during which DFSOutputStream will try to get a lock on ackQueue/dataQueue. Meanwhile the DataStreamer thread will call functions like conf.getInt() and deadlock against the other thread, since it could be the same conf object.
> This causes a deterministic deadlock whenever the serialized form is larger than block size.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
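A minimal sketch of the DistributedCache workaround Todd suggests, assuming the 0.20-era org.apache.hadoop.filecache.DistributedCache API; the payload path and helper method are hypothetical, not from the issue.

{code:java}
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LargeObjectViaCache {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Write the large payload to HDFS once, instead of conf.set("key", hugeValue).
    Path payload = new Path("/tmp/job-payload.bin"); // hypothetical path
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(payload);
    out.write(serializedObject());                   // hypothetical serializer
    out.close();

    // Register it with the DistributedCache; tasks then read the local copy
    // returned by DistributedCache.getLocalCacheFiles(conf) instead of
    // deserializing a huge value out of the job Configuration.
    DistributedCache.addCacheFile(new URI(payload.toString()), conf);
  }

  private static byte[] serializedObject() {
    return new byte[0]; // stand-in for the real serialized object
  }
}
{code}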
[jira] Commented: (HDFS-1542) Deadlock in Configuration.writeXml when serialized form is larger than one DFS block
[ https://issues.apache.org/jira/browse/HDFS-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972322#action_12972322 ]

Amit Nithian commented on HDFS-1542:
-------------------------------------

Is there a temporary workaround - changing the block size or something? This is a blocker for some work we are doing in production.

> Deadlock in Configuration.writeXml when serialized form is larger than one DFS block
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-1542
>                 URL: https://issues.apache.org/jira/browse/HDFS-1542
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>    Affects Versions: 0.20.2, 0.22.0, 0.23.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>         Attachments: Test.java
>
> Configuration.writeXml holds a lock on itself and then writes the XML to an output stream, during which DFSOutputStream will try to get a lock on ackQueue/dataQueue. Meanwhile the DataStreamer thread will call functions like conf.getInt() and deadlock against the other thread, since it could be the same conf object.
> This causes a deterministic deadlock whenever the serialized form is larger than block size.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1501) The logic that makes namenode exit safemode should be pluggable
[ https://issues.apache.org/jira/browse/HDFS-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Kling updated HDFS-1501:
--------------------------------

    Attachment: HDFS-1501.patch

This patch introduces two configuration parameters, dfs.namenode.safemode.policy and dfs.namenode.safemode.policy.manual, which specify the safe mode policy to use after name node start-up and when manually entering safe mode, respectively. This will make it easier to use custom safe mode policies (e.g., a policy that takes into account when files are RAIDed).

The default implementation for dfs.namenode.safemode.policy, StartupSafeModePolicy, leaves safe mode once a certain fraction of blocks have reached a safe replication level and a specified number of data nodes have checked in (after waiting for an additional extension period). It also initializes the replication queues once a certain block threshold has been reached (cf. HDFS-1476). This is the same behaviour currently implemented by FSNamesystem.SafeModeInfo.

The default class for dfs.namenode.safemode.policy.manual, ManualSafeModePolicy, never leaves safe mode and never initializes the replication queues. Currently, this is achieved by setting the thresholds in FSNamesystem.SafeModeInfo to values so high that they can never be reached.

With this patch, FSNamesystem.SafeModeMonitor periodically polls the safe mode policy whenever the name node is in safe mode. This differs from the current behaviour, which performs the check after every block report and only uses polling during the safe mode extension phase.

This patch is still a work in progress and I would appreciate any feedback on this idea.

> The logic that makes namenode exit safemode should be pluggable
> ----------------------------------------------------------------
>
>                 Key: HDFS-1501
>                 URL: https://issues.apache.org/jira/browse/HDFS-1501
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: dhruba borthakur
>            Assignee: Patrick Kling
>         Attachments: HDFS-1501.patch
>
> HDFS RAID creates parity blocks for data blocks, so even if all replicas of a block are missing, it is possible to recreate it from the parity blocks. Thus, when the namenode restarts, it should use a different, RAID-aware logic to figure out whether all blocks are healthy or not.
> My proposal is to make the code that the NN uses to exit safemode pluggable.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
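A minimal sketch of the kind of policy interface the description implies; the real interface lives in the attached HDFS-1501.patch, and every name below is an assumption.

{code:java}
// Hypothetical sketch only - the actual interface is in HDFS-1501.patch and
// may differ. A policy is polled by the safe mode monitor while in safe mode.
public interface SafeModePolicy {

  /** True once the name node may leave safe mode. */
  boolean canLeaveSafeMode(long safeBlocks, long totalBlocks, int liveDataNodes);

  /** True once the replication queues may be initialized (cf. HDFS-1476). */
  boolean canInitReplicationQueues(long safeBlocks, long totalBlocks);
}

// A manual policy in this shape would simply never allow either transition,
// replacing today's trick of setting unreachable thresholds:
class ManualPolicy implements SafeModePolicy {
  public boolean canLeaveSafeMode(long safe, long total, int nodes) { return false; }
  public boolean canInitReplicationQueues(long safe, long total) { return false; }
}
{code}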
[jira] Updated: (HDFS-1538) Refactor more startup and image loading code out of FSImage
[ https://issues.apache.org/jira/browse/HDFS-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-1538:
------------------------------

    Attachment: hdfs-1538-2.txt

Patch rebased on top of hdfs-1521.5.txt

> Refactor more startup and image loading code out of FSImage
> ------------------------------------------------------------
>
>                 Key: HDFS-1538
>                 URL: https://issues.apache.org/jira/browse/HDFS-1538
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: 0.22.0, 0.23.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-1538-1.txt, hdfs-1538-2.txt
>
> For HDFS-1073, we need to be able to continue to load images in the old "fsimage/edits/edits.new" layout for the purposes of upgrade. But that code will be only for backwards compatibility, and we want to be able to switch to new code for the new layout. This subtask is to separate out much of that code into an interface which we can implement for both the old and new layouts.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
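A rough sketch of the shape such an interface could take - one implementation per on-disk layout. All names here are illustrative assumptions, not code from the attached patches.

{code:java}
// Illustrative only; the attached patches define the real abstraction.
import java.io.File;
import java.io.IOException;

interface ImageLoader {
  /** Whether this loader understands the given on-disk layout version. */
  boolean canLoadVersion(int layoutVersion);

  /** Load the image (and any edits) found in the given storage directory. */
  void load(File storageDir) throws IOException;
}

// One implementation would handle the legacy "fsimage/edits/edits.new"
// layout for upgrades; a second would handle the new HDFS-1073 layout.
{code}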
[jira] Updated: (HDFS-1521) Persist transaction ID on disk between NN restarts
[ https://issues.apache.org/jira/browse/HDFS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-1521:
------------------------------

    Attachment: hdfs-1521.5.txt

Thanks for the thorough review. Here's a new patch.

bq. FSImage.loadFSEdits(StorageDirectory sd) should return boolean instead of FSEditLogLoader

Fixed.

bq. You can avoid introducing FSEditLogLoader.needResave by setting expectedStartingTxId before checking that logVersion != FSConstants.LAYOUT_VERSION. Then the old logic of counting this event as an extra transaction will work

I found the former logic here very confusing and somewhat of a hack. It's also important that the loader return the correct number of edits rather than potentially returning 1 when there are 0 edits; doing so would break many cases by potentially causing a skip in transaction IDs. Though the new code adds a new member, the member has a clear purpose and I think it's easier to understand from the caller's perspective, especially now that your point #1 above is addressed.

bq. It would be good if you could replace FSEditLogLoader.expectedStartingTxId member by the respective parameter to loadFSEdits
bq. I think after that you can also get rid of FSEditLogLoader.numEditsLoaded.

Fixed.

bq. Why don't we write first opCode, then txID, then Writable. There will be fewer code changes on the loading part

Very good call! This indeed cleaned up the loading code a lot.

bq. Should we introduce TransactionHeader at this point and write it as Writable. Just something to consider

Given that the header is still pretty simple, I don't think it's worth it at this point.

bq. Need to change JavaDoc for EditLogOutputStream.write(). Missing parameter

Fixed.

bq. I don't see any reason to have txID in the beginning of every edits file. You will have it in the name, right
bq. beginTransaction() instead of startTransaction, as it matches with endTransaction()

Fixed.

bq. Don't change rollEditLog() to return long. It is only used in the test

It's necessary that the transaction ID be returned inside the same synchronization block. If we used a separate call to getLastWrittenTxId(), another txid could have been written in between (note that the test is multithreaded).

bq. It looks to me that FSImage.checkpointTxId is simply currentTxId. If it is, it would be more intuitive

It's not really current - it's the txid of the image file, not including any edits that have been written to the edit log, somewhat like how checkpointTime is set only when an image is saved. Naming it "currentTxId" would imply that it is updated on every edit.

bq. BackupStorage.lastAppliedTxId isn't it just checkpointTxId, which is already defined in the base FSImage

Contrary to the above, lastAppliedTxId refers to the transaction ID that has been applied to the namespace. This is always >= checkpointTxId - checkpointTxId only changes when the BN saves an image, but lastAppliedTxId changes every time some edits are applied via RPC.

I'll run the new patch through the unit test suite one more time.

> Persist transaction ID on disk between NN restarts
> ---------------------------------------------------
>
>                 Key: HDFS-1521
>                 URL: https://issues.apache.org/jira/browse/HDFS-1521
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: name-node
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: 0.22.0
>         Attachments: hdfs-1521.3.txt, hdfs-1521.4.txt, hdfs-1521.5.txt, hdfs-1521.txt, hdfs-1521.txt
>
> For HDFS-1073 and other future work, we'd like to have the concept of a transaction ID that is persisted on disk with the image/edits. We already have this concept in the NameNode but it resets to 0 on restart. We can also use this txid to replace the _checkpointTime_ field, I believe.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1064) NN Availability - umbrella Jira
[ https://issues.apache.org/jira/browse/HDFS-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972195#action_12972195 ]

Sanjay Radia commented on HDFS-1064:
------------------------------------

Would users find the use of HA NFS in a failover solution to be a showstopper? I agree that it is somewhat embarrassing to say that HDFS failover depends on HA NFS. The reason I ask is that HA NFS as shared storage is one of the fastest ways for us to develop an HA solution.

Q. Do users already have an NFS server that they can use for this purpose? For example, at Yahoo we use NFS as one of several "disks" for edits and image.

I don't see this as a final solution but merely a first step. A shared dual-ported disk solution will require more work, especially for storage fencing. Using the BackupNN is, I suspect, also a little more complicated than using shared storage. (Btw, as noted above, the AvatarNN uses NFS as part of its controlled manual failover during an upgrade.)

> NN Availability - umbrella Jira
> -------------------------------
>
>                 Key: HDFS-1064
>                 URL: https://issues.apache.org/jira/browse/HDFS-1064
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Sanjay Radia
>
> This is an umbrella jira for discussing availability of the HDFS NN and providing references to other Jiras that improve its availability. This includes, but is not limited to, automatic failover.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HDFS-1206) TestFiHFlush fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Boudnik resolved HDFS-1206.
--------------------------------------

      Resolution: Fixed
   Fix Version/s: 0.23.0
                  0.22.0
                  0.21.1

I have just committed this to the 0.21 branch and up.

> TestFiHFlush fails intermittently
> ----------------------------------
>
>                 Key: HDFS-1206
>                 URL: https://issues.apache.org/jira/browse/HDFS-1206
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.21.0, 0.21.1, 0.22.0
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Konstantin Boudnik
>             Fix For: 0.21.1, 0.22.0, 0.23.0
>         Attachments: HDFS-1206.patch, HDFS-1206.patch
>
> When I was testing HDFS-1114, the patch passed all tests except TestFiHFlush. Then I tried to print out some debug messages; however, TestFiHFlush succeeded after I added the messages.
> TestFiHFlush probably depends on the speed of BlocksMap. If BlocksMap is slow enough, then it will pass.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Created: (HDFS-1542) Deadlock in Configuration.writeXml when serialized form is larger than one DFS block
Deadlock in Configuration.writeXml when serialized form is larger than one DFS block
-------------------------------------------------------------------------------------

                 Key: HDFS-1542
                 URL: https://issues.apache.org/jira/browse/HDFS-1542
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs client
    Affects Versions: 0.20.2, 0.22.0, 0.23.0
            Reporter: Todd Lipcon
            Priority: Critical
         Attachments: Test.java

Configuration.writeXml holds a lock on itself and then writes the XML to an output stream, during which DFSOutputStream will try to get a lock on ackQueue/dataQueue. Meanwhile the DataStreamer thread will call functions like conf.getInt() and deadlock against the other thread, since it could be the same conf object.

This causes a deterministic deadlock whenever the serialized form is larger than block size.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1542) Deadlock in Configuration.writeXml when serialized form is larger than one DFS block
[ https://issues.apache.org/jira/browse/HDFS-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-1542:
------------------------------

    Attachment: Test.java

Here's a test program which illustrates the deadlock.

> Deadlock in Configuration.writeXml when serialized form is larger than one DFS block
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-1542
>                 URL: https://issues.apache.org/jira/browse/HDFS-1542
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>    Affects Versions: 0.20.2, 0.22.0, 0.23.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>         Attachments: Test.java
>
> Configuration.writeXml holds a lock on itself and then writes the XML to an output stream, during which DFSOutputStream will try to get a lock on ackQueue/dataQueue. Meanwhile the DataStreamer thread will call functions like conf.getInt() and deadlock against the other thread, since it could be the same conf object.
> This causes a deterministic deadlock whenever the serialized form is larger than block size.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
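The attached Test.java is not reproduced in this archive; the following is a rough reconstruction of the kind of program that triggers the deadlock, assuming a running HDFS cluster and the 0.20-era dfs.block.size key.

{code:java}
// Rough reconstruction, not the attached Test.java: write a Configuration
// whose XML form is larger than one DFS block straight to a DFSOutputStream.
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteXmlDeadlock {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.block.size", 1024 * 1024); // 1 MB blocks for the test

    // Inflate the conf so its serialized XML exceeds one block.
    char[] big = new char[4 * 1024 * 1024];
    Arrays.fill(big, 'x');
    conf.set("big.value", new String(big));

    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/tmp/conf.xml"));

    // writeXml synchronizes on 'conf'. Once the stream fills a block, the
    // DataStreamer thread calls conf.getInt() on the same object and both
    // threads block - the deadlock described in this issue.
    conf.writeXml(out);
    out.close();
  }
}
{code}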
[jira] Commented: (HDFS-1509) Resync discarded directories in fs.name.dir during saveNamespace command
[ https://issues.apache.org/jira/browse/HDFS-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972154#action_12972154 ]

Hairong Kuang commented on HDFS-1509:
-------------------------------------

+1. The patch looks good to me.

> Resync discarded directories in fs.name.dir during saveNamespace command
> -------------------------------------------------------------------------
>
>                 Key: HDFS-1509
>                 URL: https://issues.apache.org/jira/browse/HDFS-1509
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: resyncBadNameDir1.txt, resyncBadNameDir2.txt, resyncBadNameDir3.txt
>
> In the current implementation, if the Namenode encounters an error while writing to a fs.name.dir directory, it stops writing new edits to that directory. My proposal is to make the namenode write the fsimage to all configured directories in fs.name.dir, and from then on, continue writing fsedits to all configured directories.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1206) TestFiHFlush fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HDFS-1206:
-----------------------------------------

    Hadoop Flags: [Reviewed]

+1 patch looks good.

> TestFiHFlush fails intermittently
> ----------------------------------
>
>                 Key: HDFS-1206
>                 URL: https://issues.apache.org/jira/browse/HDFS-1206
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.21.0, 0.21.1, 0.22.0
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Konstantin Boudnik
>         Attachments: HDFS-1206.patch, HDFS-1206.patch
>
> When I was testing HDFS-1114, the patch passed all tests except TestFiHFlush. Then I tried to print out some debug messages; however, TestFiHFlush succeeded after I added the messages.
> TestFiHFlush probably depends on the speed of BlocksMap. If BlocksMap is slow enough, then it will pass.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Review request: HDFS-1206
Can someone take a look at https://issues.apache.org/jira/browse/HDFS-1206

This addresses an intermittent failure on trunk. A very short patch.
--
Take care,
Cos

2CAC 8312 4870 D885 8616 6115 220F 6980 1F27 E622

"To take a significant step forward, you must make a series of finite improvements." Donald J. Atwood, General Motors
[jira] Commented: (HDFS-1509) Resync discarded directories in fs.name.dir during saveNamespace command
[ https://issues.apache.org/jira/browse/HDFS-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972033#action_12972033 ]

Konstantin Shvachko commented on HDFS-1509:
-------------------------------------------

I did not understand the description of this jira. Eli mentions that you use edits instead of image, or vice versa. Could you please edit the description to clarify what is proposed?

> Resync discarded directories in fs.name.dir during saveNamespace command
> -------------------------------------------------------------------------
>
>                 Key: HDFS-1509
>                 URL: https://issues.apache.org/jira/browse/HDFS-1509
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: resyncBadNameDir1.txt, resyncBadNameDir2.txt, resyncBadNameDir3.txt
>
> In the current implementation, if the Namenode encounters an error while writing to a fs.name.dir directory, it stops writing new edits to that directory. My proposal is to make the namenode write the fsimage to all configured directories in fs.name.dir, and from then on, continue writing fsedits to all configured directories.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (HDFS-1521) Persist transaction ID on disk between NN restarts
[ https://issues.apache.org/jira/browse/HDFS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HDFS-1521:
--------------------------------------

    Component/s: name-node

> Persist transaction ID on disk between NN restarts
> ---------------------------------------------------
>
>                 Key: HDFS-1521
>                 URL: https://issues.apache.org/jira/browse/HDFS-1521
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: name-node
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: 0.22.0
>         Attachments: hdfs-1521.3.txt, hdfs-1521.4.txt, hdfs-1521.txt, hdfs-1521.txt
>
> For HDFS-1073 and other future work, we'd like to have the concept of a transaction ID that is persisted on disk with the image/edits. We already have this concept in the NameNode but it resets to 0 on restart. We can also use this txid to replace the _checkpointTime_ field, I believe.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1521) Persist transaction ID on disk between NN restarts
[ https://issues.apache.org/jira/browse/HDFS-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972030#action_12972030 ]

Konstantin Shvachko commented on HDFS-1521:
-------------------------------------------

# {{FSImage.loadFSEdits(StorageDirectory sd)}} should return boolean instead of FSEditLogLoader. The boolean means whether the edits should be saved or not. It should be calculated inside loadFSEdits.
# You can avoid introducing {{FSEditLogLoader.needResave}} by setting expectedStartingTxId before checking that {{logVersion != FSConstants.LAYOUT_VERSION}}. Then the old logic of counting this event as an extra transaction will work. I see lots of new members that are used only in a few places and can be avoided, which complicates the code.
# It would be good if you could replace the {{FSEditLogLoader.expectedStartingTxId}} member by the respective parameter to {{loadFSEdits()}} instead of passing it to the {{FSEditLogLoader}} constructor.
# I think after that you can also get rid of {{FSEditLogLoader.numEditsLoaded}}.
# Why don't we write first opCode, then txID, then Writable? There will be fewer code changes on the loading part. You will still use OP_INVALID to determine the end of file. That eliminates a lot of changes, plus the extra constant EOF_TXID for -1.
# Should we introduce TransactionHeader at this point and write it as a Writable? Just something to consider; I did not evaluate the complexity of introducing it.
# Need to change the JavaDoc for {{EditLogOutputStream.write()}}: missing parameter. Could you also add an explanation of the transaction header, which in your patch is just a comment in the loading code?
# I don't see any reason to have the txID in the beginning of every edits file. You will have it in the name, right? So it is redundant; I'd rather remove it. Duplication of information brings the burden of keeping the copies in sync. Also, you will not need to carry the parameters inside create() through all the calls.
# {{beginTransaction()}} instead of startTransaction, as it matches with {{endTransaction()}}. I mean start-stop, begin-end.
# Don't change {{rollEditLog()}} to return long. It is only used in the test; you can getLastWrittenTxId() from editsLog there instead.
# It looks to me that {{FSImage.checkpointTxId}} is simply {{currentTxId}}. If it is, that name would be more intuitive. I also don't understand the JavaDoc comment for this member.
# Isn't {{BackupStorage.lastAppliedTxId}} just {{checkpointTxId}}, which is already defined in the base FSImage?

I hope this will simplify the patch.

> Persist transaction ID on disk between NN restarts
> ---------------------------------------------------
>
>                 Key: HDFS-1521
>                 URL: https://issues.apache.org/jira/browse/HDFS-1521
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: name-node
>    Affects Versions: 0.22.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: 0.22.0
>         Attachments: hdfs-1521.3.txt, hdfs-1521.4.txt, hdfs-1521.txt, hdfs-1521.txt
>
> For HDFS-1073 and other future work, we'd like to have the concept of a transaction ID that is persisted on disk with the image/edits. We already have this concept in the NameNode but it resets to 0 on restart. We can also use this txid to replace the _checkpointTime_ field, I believe.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
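To make point 5 concrete, here is a minimal sketch of that record layout; the names are illustrative assumptions, not code from any attached patch.

{code:java}
// Illustrative sketch of the layout in point 5 (opCode, then txid, then the
// op's Writable body); class and method names are hypothetical.
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

class EditRecordWriter {
  void writeRecord(DataOutputStream out, byte opCode, long txid, Writable op)
      throws IOException {
    out.writeByte(opCode);  // 1. operation code (OP_INVALID still marks EOF)
    out.writeLong(txid);    // 2. transaction id
    op.write(out);          // 3. operation payload
  }
}
{code}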
[jira] Updated: (HDFS-1509) Resync discarded directories in fs.name.dir during saveNamespace command
[ https://issues.apache.org/jira/browse/HDFS-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HDFS-1509:
-----------------------------------

    Attachment: resyncBadNameDir3.txt

I incorporated both of Hairong's comments. Thanks, Hairong and Eli, for reviewing this patch.

> Resync discarded directories in fs.name.dir during saveNamespace command
> -------------------------------------------------------------------------
>
>                 Key: HDFS-1509
>                 URL: https://issues.apache.org/jira/browse/HDFS-1509
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: resyncBadNameDir1.txt, resyncBadNameDir2.txt, resyncBadNameDir3.txt
>
> In the current implementation, if the Namenode encounters an error while writing to a fs.name.dir directory, it stops writing new edits to that directory. My proposal is to make the namenode write the fsimage to all configured directories in fs.name.dir, and from then on, continue writing fsedits to all configured directories.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HDFS-1509) Resync discarded directories in fs.name.dir during saveNamespace command
[ https://issues.apache.org/jira/browse/HDFS-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971992#action_12971992 ]

Hairong Kuang commented on HDFS-1509:
-------------------------------------

I agree with Eli that this looks good, except for a couple of improvements:
1. Could you restore the accidentally commented-out tests?
2. attemptRestoreRemovedStorage formats the newly added, previously removed storage directories, which calls saveCurrent, which saves the fsimage to disk. Later on, saveNamespace saves the namespace again. Could you make namespace saving happen only once?

> Resync discarded directories in fs.name.dir during saveNamespace command
> -------------------------------------------------------------------------
>
>                 Key: HDFS-1509
>                 URL: https://issues.apache.org/jira/browse/HDFS-1509
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: resyncBadNameDir1.txt, resyncBadNameDir2.txt
>
> In the current implementation, if the Namenode encounters an error while writing to a fs.name.dir directory, it stops writing new edits to that directory. My proposal is to make the namenode write the fsimage to all configured directories in fs.name.dir, and from then on, continue writing fsedits to all configured directories.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.