Re: What happens when .....?
Normally an MR job is used for batch processing, so I don't think this is a good use case for MR. Since you need to run the program periodically, you cannot submit a single MapReduce job for this. A possible way is to create a cron job that scans the folder size and submits an MR job when necessary (a sketch follows below). On Wed, Aug 27, 2014 at 7:38 PM, Kandoi, Nikhil nikhil.kan...@emc.com wrote: Hi All, I have a system where files arrive in HDFS at regular intervals, and I perform an operation every time the directory size goes above a particular point. My question is: when I submit a MapReduce job, will it only work on the files present at that point? Regards, Nikhil Kandoi -- Regards, Stanley Shi
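[Editor's note] A minimal sketch of the cron-driven check Stanley describes, using the HDFS FileSystem API; the path and threshold are made up for illustration, and the actual job submission is left as a comment:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirSizeCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Total bytes under the incoming directory (path is illustrative).
    long used = fs.getContentSummary(new Path("/data/incoming")).getLength();
    long threshold = 10L * 1024 * 1024 * 1024; // 10 GB, illustrative
    if (used > threshold) {
      // Build and submit the MR job here (Job.getInstance(conf), set the
      // mapper and input/output paths, then job.submit()). The input
      // listing is fixed when splits are computed at submission time, so
      // files that arrive afterwards are not picked up by that run --
      // which also answers Nikhil's original question.
    }
  }
}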
Re: What happens when .....?
Or, maybe have a look at Apache Falcon, a data management and processing platform: http://falcon.incubator.apache.org/
What happens when .....?
Hi All, I have a system where files arrive in HDFS at regular intervals, and I perform an operation every time the directory size goes above a particular point. My question is: when I submit a MapReduce job, will it only work on the files present at that point? Regards, Nikhil Kandoi
Re: What happens when you have fewer input files than mapper slots?
Apologies -- I don't understand this advice: "If the evenness is the goal you can also write your own input format that returns empty locations for each split and reads the small files in the map task directly." How would manually reading the files into the map task help me? Hadoop would still spawn multiple mappers per machine, which is what I'm trying to avoid. I'm trying to get one mapper per machine for this job. --Jeremy
Re: What happens when you have fewer input files than mapper slots?
Is there a way to force an even spread of data?
Re: What happens when you have fewer input files than mapper slots?
Short version: let's say you have 20 nodes, and each node has 10 mapper slots. You start a job with 20 very small input files. How is the work distributed to the cluster? Will it be even, with each node spawning one mapper task? Is there any way of predicting or controlling how the work will be distributed? You're right in expecting that the tasks of the small job will likely be evenly distributed among the 20 nodes, if the 20 files are evenly distributed among the nodes and there are free slots on every node. Long version: My cluster is currently used for two different jobs. The cluster is currently optimized for Job A, so each node has a maximum of 18 mapper slots. However, I also need to run Job B. Job B is VERY cpu-intensive, so we really only want one mapper to run on a node at any given time. I've done a bunch of research, and it doesn't seem like Hadoop gives you any way to set the maximum number of mappers per node on a per-job basis. I'm at my wit's end here, and considering some rather egregious workarounds. If you can think of anything that can help me, I'd very much appreciate it. Are you seeing that Job B tasks are not being evenly distributed to each node? You can check the locations of the files with hadoop fsck. If evenness is the goal, you can also write your own input format that returns empty locations for each split and reads the small files in the map task directly (see the sketch below). If you're using Hadoop 1.0.x and the fair scheduler, you might need to set mapred.fairscheduler.assignmultiple to false in mapred-site.xml (JT restart required) to work around a bug in the fair scheduler (MAPREDUCE-2905) that causes tasks to be assigned unevenly. The bug is fixed in Hadoop 1.1+. -- Luke
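[Editor's note] A minimal sketch of the input format Luke describes, assuming the new org.apache.hadoop.mapreduce API; the class name is hypothetical, and this variant keeps the stock record reader (which simply reads the blocks remotely) rather than reading the files by hand:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class LocationlessTextInputFormat extends TextInputFormat {
  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (InputSplit split : super.getSplits(job)) {
      FileSplit fs = (FileSplit) split;
      // Re-create each split with an empty host list, so the scheduler
      // has no locality preference and spreads tasks more evenly.
      splits.add(new FileSplit(fs.getPath(), fs.getStart(),
          fs.getLength(), new String[0]));
    }
    return splits;
  }
}

Giving up data locality is the price of the even spread, which is acceptable here because the input files are tiny.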
Re: What happens when you have fewer input files than mapper slots?
Which version of Hadoop are you using? MRv1 or MRv2 (YARN)? For MRv2 (YARN), you can pretty much achieve this using yarn.nodemanager.resource.memory-mb (system-wide setting) and mapreduce.map.memory.mb (job-level setting). E.g. if yarn.nodemanager.resource.memory-mb=100 and mapreduce.map.memory.mb=40, a maximum of two mappers can run on a node at any time (see the sketch below). For MRv1, the equivalent is to control mapper slots on each machine with mapred.tasktracker.map.tasks.maximum; of course, this does not give you per-job control over mappers. In both cases, you can additionally use a scheduler with pools/queues capability to restrict the overall use of grid resources. Do read the fair scheduler and capacity scheduler documentation... -Rahul On Tue, Mar 19, 2013 at 1:55 PM, jeremy p athomewithagroove...@gmail.com wrote: Short version: let's say you have 20 nodes, and each node has 10 mapper slots. You start a job with 20 very small input files. How is the work distributed to the cluster? Will it be even, with each node spawning one mapper task? Is there any way of predicting or controlling how the work will be distributed? Long version: My cluster is currently used for two different jobs. The cluster is currently optimized for Job A, so each node has a maximum of 18 mapper slots. However, I also need to run Job B. Job B is VERY cpu-intensive, so we really only want one mapper to run on a node at any given time. I've done a bunch of research, and it doesn't seem like Hadoop gives you any way to set the maximum number of mappers per node on a per-job basis. I'm at my wit's end here, and considering some rather egregious workarounds. If you can think of anything that can help me, I'd very much appreciate it. Thanks! --Jeremy
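[Editor's note] A sketch of the MRv2 memory trick with Rahul's numbers scaled up, assuming Hadoop 2.x and NodeManagers configured with yarn.nodemanager.resource.memory-mb=16384; the figures are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobBDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Request the whole node's memory per map task, so at most one
    // Job B map container fits on a node at a time. Job A keeps its
    // own, smaller per-task request and is unaffected.
    conf.set("mapreduce.map.memory.mb", "16384");
    Job job = Job.getInstance(conf, "job-b");
    // ... set mapper class, input/output formats and paths, then submit ...
  }
}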
Re: What happens when you have fewer input files than mapper slots?
The job we need to run executes some third-party code that utilizes multiple cores. The only way the job will get done in a timely fashion is if we give it all the cores available on the machine. This is not a task that can be split up. Yes, I know it's not ideal, but this is the situation I have to deal with. On Tue, Mar 19, 2013 at 3:15 PM, hari harib...@gmail.com wrote: This may not be what you were looking for, but I was just curious when you mentioned that you would only want to run one map task because it was cpu-intensive. Well, map tasks are supposed to be cpu-intensive, aren't they? If the maximum map slots are 10, that would suggest you have close to 10 cores available in each node. So if you run only one map task, no matter how cpu-intensive it is, it will only be able to max out one core, and the other 9 cores will go under-utilized; you could still run 9 more map tasks on that machine. Or maybe your node's core count is way less than 10, in which case you might be better off setting the mapper slots to a lower value anyway. On Tue, Mar 19, 2013 at 5:18 PM, jeremy p athomewithagroove...@gmail.com wrote: Thank you for your help. We're using MRv1. I've tried setting mapred.tasktracker.map.tasks.maximum and mapred.map.tasks, and neither one helped me at all. Per-job control is definitely what I need. I need to be able to say, "For Job A, only use one mapper per node, but for Job B, use 16 mappers per node." I have not found any way to do this. I will definitely look into schedulers. Are there any examples you can point me to where someone does what I'm needing to do? --Jeremy
Re: What happens when you have fewer input files than mapper slots?
You can leverage YARN's CPU core scheduling feature for this purpose. It was added in the 2.0.3 release via https://issues.apache.org/jira/browse/YARN-2 and seems to fit your need exactly. However, looking at that patch, it seems param-config support for MR apps wasn't added by it, so it may require some work before you can easily leverage it in MRv2. On MRv1, you can achieve the per-node memory supply vs. requirement hack Rahul suggested by using the CapacityScheduler instead; it does not have CPU-core-based scheduling directly, though. -- Harsh J
Re: What happens when you have fewer input files than mapper slots?
Correction to my previous post: I completely missed https://issues.apache.org/jira/browse/MAPREDUCE-4520, which already covers the MR config end in 2.0.3. My bad :) -- Harsh J
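[Editor's note] A sketch of the per-job CPU request this enables, assuming Hadoop 2.0.3+ and a scheduler that actually enforces vcores (e.g. the CapacityScheduler with the DominantResourceCalculator); the core count is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CpuHeavyJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Claim (nearly) all of a node's virtual cores for each map task,
    // so the scheduler places at most one such task per node at a time.
    conf.setInt("mapreduce.map.cpu.vcores", 16);
    Job job = Job.getInstance(conf, "cpu-heavy-job");
    // ... set mapper class, input/output paths, then submit ...
  }
}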
What happens when a block is being invalidated/deleted on the DataNode when it is being read?
Hi all, I am wondering if this situation can happen: a block is being invalidated/deleted on a DataNode (by the NameNode, for example, to remove an over-replicated block) while it is concurrently being read by some client. If this can happen, how does HDFS handle it? Thank you very much. Xiao Yu
what happens when a datanode rejoins?
Hi, What happens when an existing (not new) datanode rejoins a cluster in the following scenarios: 1. Some of the blocks it was managing have been deleted/modified? 2. The block size has been changed, say from 64MB to 128MB? 3. What if the block replication factor was one (not the case in most deployments, but just in case): does the namenode recreate a file once the datanode rejoins? Thanks, Mehul
Re: what happens when a datanode rejoins?
Hi Mehul, "Some of the blocks it was managing are deleted/modified?" The namenode will asynchronously replicate the blocks to other datanodes in order to maintain the replication factor after a datanode has not been in contact for 10 minutes. "The size of the blocks are now modified say from 64MB to 128MB?" Block size is a per-file setting, so new files will use 128MB blocks, but the old ones will remain at 64MB. "What if the block replication factor was one ... does the namenode recreate a file once the datanode rejoins?" (Assuming you didn't perform a decommission.) Blocks that lived only on that datanode will be declared missing, and the files associated with those blocks will not be fully readable until the datanode rejoins. George
Re: what happens when a datanode rejoins?
Mehul, Let me make an addition. "Some of the blocks it was managing are deleted/modified?" Blocks that were deleted in the interim will be deleted on the rejoining node as well, after it rejoins. Regarding modification: I'd advise against modifying blocks after they have been fully written. George
Re: what happens when a datanode rejoins?
George has answered most of these. I'll just add on: On Tue, Sep 11, 2012 at 12:44 PM, Mehul Choube mehul_cho...@symantec.com wrote: 1. Some of the blocks it was managing are deleted/modified? A DN runs a block report upon start and sends its list of blocks to the NN. The NN validates them, and if it finds any files missing block replicas after the report, it will schedule re-replication from one of the good DNs that still carry them. Blocks modified outside of HDFS fail their stored checksums, so they are treated as corrupt and deleted, and are re-replicated in the same manner. 2. The size of the blocks are now modified say from 64MB to 128MB? George's got this already. Changing the block size does not impact any existing blocks; it is a per-file metadata property. 3. What if the block replication factor was one (not in most deployments, but just in case): does the namenode recreate a file once the datanode rejoins? Files exist in the NN metadata (its fsimage/edits persist this); the blocks pertaining to a file exist on DNs. If a file had a single replica and that replica was lost, then the file's data is lost, and the NameNode will tell you as much in its metrics/fsck output. -- Harsh J
RE: what happens when a datanode rejoins?
"The namenode will asynchronously replicate the blocks to other datanodes in order to maintain the replication factor after a datanode has not been in contact for 10 minutes." What happens when the datanode rejoins after the namenode has already re-replicated the blocks it was managing? Will the namenode ask the datanode to discard the blocks and start managing new blocks? Or will the namenode discard the new blocks which were replicated due to the unavailability of this datanode? Thanks, Mehul
Re: what happens when a datanode rejoins?
Hi, Inline. On Tue, Sep 11, 2012 at 2:36 PM, Mehul Choube mehul_cho...@symantec.com wrote: What happens when the datanode rejoins after the namenode has already re-replicated the blocks it was managing? The total block count goes +1, and the file's block is treated as over-replicated. Will the namenode ask the datanode to discard the blocks and start managing new blocks? Yes, this may happen. Or will the namenode discard the new blocks which were replicated due to the unavailability of this datanode? It deletes the extra replicas while still keeping the block placement policy in mind: it may delete any block replica as long as the placement policy is not violated by doing so. -- Harsh J
RE: what happens when a datanode rejoins?
"DataNode rejoins take care of only NameNode." Sorry, I didn't get this. From: Narasingu Ramesh [ramesh.narasi...@gmail.com] Sent: Tuesday, September 11, 2012 2:38 PM To: user@hadoop.apache.org Subject: Re: what happens when a datanode rejoins? Hi Mehul, DataNode rejoins take care of only NameNode. Thanks and Regards, Ramesh Narasingu
Re: What happens when I do not output anything from my mapper
Hi Devaraj, Indeed, the previous email that I sent you contained -ls output of SequenceFileOutputFormat files with the signature of the class in them; hence they were 87 bytes. Hadoop was creating empty files (in fact, files containing only the signature) before I started to use LazyOutputFormat. Regards, Murat On Tue, Jun 5, 2012 at 7:22 AM, Devaraj k devara...@huawei.com wrote: The output files should be 0 bytes in size if you use FileOutputFormat/TextOutputFormat. I think your output format writer is writing some metadata into those files. Can you check what data is present in those files? Can you tell me which output format you are using? Thanks, Devaraj
Re: What happens when I do not output anything from my mapper
You can control your map outputs based on any condition you want. I have done that, and it worked for me. It could be a problem in your code that it's not working for you. Can you please share your map code or cross-check whether your conditions are correct? Regards, Praveenesh On Mon, Jun 4, 2012 at 5:52 PM, murat migdisoglu murat.migdiso...@gmail.com wrote: Hi, I have a small application where I have only a mapper class defined (no reducer, no combiner). Within the mapper class, I have an if condition according to which I decide whether I want to put something in the context or not. If my condition is not matched, I want the mapper to not write any output to HDFS. But apparently this does not work as I expected: once I run my job, there is a file per mapper in HDFS, 87 bytes in size. The if block that I'm using in the map method is the following:

if (ip == null || ip.equals(cip)) {
  Text value = new Text(mwrapper.toJson());
  word.set(ip);
  context.write(word, value);
} else {
  log.info("ip not match [" + ip + "]");
}
} // end of map method

How can I manage that? Does a mapper always need to have an output? -- Find a job you enjoy, and you'll never work a day in your life. Confucius
RE: What happens when I do not output anything from my mapper
Hi Murat, As Praveenesh explained, you can control the map outputs as you want. The map() function will be called for each input, i.e. map() is invoked multiple times with different inputs in the same mapper. You can check what is happening by adding logs in the map function. Thanks, Devaraj
Re: What happens when I do not output anything from my mapper
Hi, Thanks for your answer. After I read your emails, I decided to completely empty my mapper method to see if I could disable the output of the mapper class at all, but it seems it did not work. So here is my mapper method:

@Override
public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
    throws IOException, InterruptedException {
}

When I execute hadoop fs -ls, I still see many small output files, such as:

-rw-r--r-- 3 mmigdiso supergroup 87 2012-06-04 12:44 /user/mmigdiso/output/part-m-00034
-rw-r--r-- 3 mmigdiso supergroup 87 2012-06-04 12:45 /user/mmigdiso/output/part-m-00037
-rw-r--r-- 3 mmigdiso supergroup 87 2012-06-04 12:45 /user/mmigdiso/output/part-m-00039
-rw-r--r-- 3 mmigdiso supergroup 87 2012-06-04 12:45 /user/mmigdiso/output/part-m-00040
-rw-r--r-- 3 mmigdiso supergroup 87 2012-06-04 12:45 /user/mmigdiso/output/part-m-00042

Do you know if I have to put something special into the context to specify the empty output? Regards, Murat
Re: What happens when I do not output anything from my mapper - Solution
Ok, for the ones who face the same problem, here is how I solved it. First of all, there was a task created for this on Hadoop: https://issues.apache.org/jira/browse/HADOOP-4927, and http://hadoop.apache.org/mapreduce/docs/current/mapred_tutorial.html#Lazy+Output+Creation explains how to solve it. Hadoop does indeed create empty part-00x files irrespective of what you do in the mapper class, so you have to call the following static method of LazyOutputFormat:

LazyOutputFormat.setOutputFormatClass(job, SequenceFileOutputFormat.class);

Be aware: in my experience, this method should be called after you set the output format class with job.setOutputFormatClass(SequenceFileOutputFormat.class);
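[Editor's note] A minimal map-only driver showing the fix in context, assuming the new-API LazyOutputFormat (org.apache.hadoop.mapreduce.lib.output) and Hadoop 2.x's Job.getInstance; the paths, job name, and mapper are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FilterJobDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "filter-job");
    job.setJarByClass(FilterJobDriver.class);
    job.setNumReduceTasks(0); // map-only job
    // job.setMapperClass(FilterMapper.class); // your filtering mapper
    FileInputFormat.addInputPath(job, new Path("/user/mmigdiso/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/mmigdiso/output"));
    // Wrap the real output format so a part file is only created when the
    // first record is actually written; mappers that emit nothing leave no
    // file behind. Per Murat's note, call this last so a later
    // job.setOutputFormatClass(...) doesn't override the wrapper.
    LazyOutputFormat.setOutputFormatClass(job, SequenceFileOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}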
RE: What happens when I do not output anything from my mapper
The output files should be 0 bytes in size if you use FileOutputFormat/TextOutputFormat. I think your output format writer is writing some metadata into those files. Can you check what data is present in those files? Can you tell me which output format you are using? Thanks, Devaraj
Re: What happens when you do a ctrl-c on a big dfs -rmr
When you issue -rmr on a directory, the namenode gets the directory name and starts deleting files recursively. It adds the blocks belonging to those files to the invalidate list and then deletes the blocks lazily. So yes, it will issue commands to the datanodes to delete those blocks; just give it some time. You do not need to reformat HDFS. Lohit - Original Message From: bzheng bing.zh...@gmail.com To: core-user@hadoop.apache.org Sent: Wednesday, March 11, 2009 7:48:41 PM Subject: What happens when you do a ctrl-c on a big dfs -rmr I did a ctrl-c immediately after issuing a hadoop dfs -rmr command. The rmr target is no longer visible from the dfs -ls command. The number of files to delete is huge, and I don't think it can possibly have deleted them all between the time the command was issued and the ctrl-c. Does this mean it leaves behind unreachable files on the slave nodes, making them dead weight? We can always reformat HDFS to be sure, but is there a way to check? Thanks.
RE: What happens when a server loses all its state?
Thomas, in the scenario you give, you have two simultaneous failures with 3 nodes, so it will not recover correctly. A has failed because it is not up; B has failed because it lost all its data. It would be good for ZooKeeper not to come up in that scenario. Perhaps what we need is something similar to your safe-state proposal: basically, a server that has forgotten everything should not be allowed to vote in the leader election. That would avoid your scenario. We just need to put a flag file in the data directory to say that the data is valid and the server can thus vote. Ben From: thomas.john...@sun.com [thomas.john...@sun.com] Sent: Tuesday, December 16, 2008 4:02 PM To: zookeeper-user@hadoop.apache.org Subject: Re: What happens when a server loses all its state? Mahadev Konar wrote: Hi Thomas, "More generally, is it a safe assumption to make that the ZooKeeper service will maintain all its guarantees if a minority of servers lose persistent state (due to bad disks, etc) and restart at some point in the future?" Yes, that is true. Great - thanks Mahadev. Not to drag this on more than necessary, please bear with me for one more example of 'amnesia' that comes to mind. I have a set of ZooKeeper servers A, B, C.
- C is currently not running; A is the leader, B is the follower.
- A proposes zxid1 to A and B; both acknowledge.
- A asks A to commit (which it persists), but before the same commit request reaches B, all servers go down (say, a power failure).
- Later, B and C come up (A is slow to reboot), but B has lost all state due to disk failure.
- C becomes the new leader and perhaps continues with some more new transactions.
Likely I'm misunderstanding the protocol, but have I effectively lost zxid1 at this point? What would happen when A comes back up? Thanks.
RE: What happens when a server loses all its state?
Just as a supporting note: from what I read, to tolerate n simultaneous failures we need 2n+1 nodes. The scenario above has two simultaneous failures (n=2), so we would need 5 nodes to operate correctly. It might be a good idea to capture this formula and, if more than n failures occur, write the appropriate flags, which can then be used to recover to the right state. Cheers k/
Re: What happens when a server loses all its state?
Sorry, I should have been a little more explicit. At this point, the situation I'm considering is this: out of 3 servers, one server 'A' forgets its persistent state (due to a bad disk, say) and restarts. My guess from what I could understand/reason about the internals was that server 'A' will re-synchronize correctly on restart by getting the entire snapshot. I just wanted to make sure that this was a good assumption to make - or find out if I was missing corner cases where the fact that A has lost all memory could lead to inconsistencies (to take an example, in plain Paxos, no acceptor can forget the highest-numbered prepare request to which it has responded). More generally, is it a safe assumption to make that the ZooKeeper service will maintain all its guarantees if a minority of servers lose persistent state (due to bad disks, etc) and restart at some point in the future? Thanks. Mahadev Konar wrote: Hi Thomas, If a ZooKeeper server loses all state and there are enough servers in the ensemble to continue the ZooKeeper service (like 2 servers in the case of an ensemble of 3), then the server will get the latest snapshot from the leader and continue. The idea of ZooKeeper persisting its state on disk is just so that it does not lose state. All the guarantees that ZooKeeper makes are based on the understanding that we do not lose the state of the data we store on disk. There might be problems if we lose the state that we stored on disk: we might lose transactions that have been committed, and the ensemble might start with some snapshot from the past. You might want to read through how ZooKeeper internals work; this will help you understand why the persistence guarantees are required. http://wiki.apache.org/hadoop-data/attachments/ZooKeeper(2f)ZooKeeperPresentations/attachments/zk-talk-upc.pdf mahadev On 12/16/08 9:45 AM, Thomas Vinod Johnson thomas.john...@sun.com wrote: What is the expected behavior if a server in a ZooKeeper service restarts with all its prior state lost? Empirically, everything seems to work*. Is this something that one can count on, as part of ZooKeeper design, or are there known conditions under which this could cause problems, either liveness or violation of ZooKeeper guarantees? I'm really most interested in a situation where a single server loses state, but insights into issues when more than one server loses state and other interesting failure scenarios are appreciated. Thanks. * The restarted server appears to catch up to the latest snapshot (from the current leader?).
HDFS: What happens when a harddrive fails
I was wondering: 1) What happens if a datanode is alive but its hard drive fails? Does it throw an exception and die? 2) If it continues to run and continues to do block reporting, is there a console showing datanodes with healthy hard drives and unhealthy hard drives? I know the web UI of the namenode shows running and not-running datanodes, but I am not sure if it differentiates between datanodes with healthy and unhealthy hard drives. Thanks for your help, Cagdas
Re: HDFS: What happens when a harddrive fails
It depends on the failure. For some failure modes, the disk just becomes very slow.