Re: Timer jobs
01.09.11 21:55, Per Steffensen wrote:
> Vitalii Tymchyshyn wrote:
>> Hello.
>> AFAIK you still have the HDFS NameNode, and as soon as the NameNode is down, your cluster is down. So putting the scheduling on the same machine as the NameNode won't make your cluster any worse in terms of SPOF (at least for hardware failures).
>>
>> Best regards, Vitalii Tymchyshyn
>
> I believe this is why there is also a secondary namenode.

Hello.

Not at all. The secondary namenode is not even a hot standby. Your HDFS cluster address is namenode:port, and nothing that connects to it knows about the secondary namenode, so it is not an HA solution. AFAIR the secondary namenode is not even a backup, but simply a tool that helps the main namenode process its transaction logs on a schedule. 0.21 has a backup namenode, but 0.21 is unstable and its backup node does not work (I tried it). For 0.20, the backup solution mentioned in the docs is to have an NFS mount on the namenode and specify it as a second namenode data directory.

Best regards, Vitalii Tymchyshyn
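The 0.20-era NFS approach described above can be sketched as an hdfs-site.xml fragment; the directory paths here are hypothetical placeholders:

```xml
<!-- Sketch of the backup approach described above (Hadoop 0.20).
     /mnt/nfs/namenode is a hypothetical NFS mount point. The NameNode
     writes its image and edit log to every directory listed in
     dfs.name.dir, so losing the local disk still leaves a copy on the
     NFS server. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/namenode,/mnt/nfs/namenode</value>
</property>
```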
Re: Timer jobs
01.09.11 18:14, Per Steffensen wrote:
> Well, I am not sure I get you right, but anyway: basically I want a timer framework that triggers my jobs, and the triggering needs to keep working even if one or two particular machines go down. So the timer-triggering mechanism has to live in the cluster, so to speak. What I don't want is a timer framework driven from one particular machine, so that jobs stop being triggered if that machine goes down. If I have e.g. 10 machines in a Hadoop cluster, I can still run MapReduce jobs even if 3 of the 10 machines are down. I want my timer framework to also be clustered, distributed and coordinated, so that my timer jobs are still triggered even when 3 out of 10 machines are down.

Hello.

AFAIK you still have the HDFS NameNode, and as soon as the NameNode is down, your cluster is down. So putting the scheduling on the same machine as the NameNode won't make your cluster any worse in terms of SPOF (at least for hardware failures).

Best regards, Vitalii Tymchyshyn
Re: Which release to use?
19.07.11 14:50, Steve Loughran wrote:
> On 19/07/11 12:44, Rita wrote:
>> Arun, I second Joe's comment. Thanks for giving us a heads up. I will wait patiently until 0.23 is considered stable.
>
> API-wise, 0.21 is better. I know that as I'm working with 0.20.203 right now, and it is a step backwards. Regarding future releases, the best way to get them stable is to participate in release testing on your own infrastructure. Nothing else will find the problems unique to your setup of hardware, network and software.

My little Hadoop adoption story (or why I won't test 0.23):

I am among those who think that the latest release is the one that is supported, and so we went the 0.21 way. BTW: I've tried to find a release roadmap, but could not find anything up to date. We are using HDFS without Map/Reduce. As far as I can see now, 0.21 is nowhere near beta quality, with non-working new features like the backup node or append. There is also no option for unlucky adopters like us to fall back to 0.20 (at least searching for "hadoop downgrade" gives no good results). I have already filed 5 tickets in Jira, 3 of them with patches. Two have seen no activity at all; on the other three, my answer is the latest non-autogenerated message (and it is over 3 weeks old). I have sent a few messages to this list, and one to hdfs-user. No answers. With this level of project activity, I can't afford to test something that has not yet reached even 0.21's quality level: if I hit problems, I can't afford to wait months to be heard. I am more or less stable on my own patched 0.21 for now, and will either move forward if I see more project activity, or move somewhere else if it becomes less stable.

Best regards, Vitalii Tymchyshyn
HARs without Map/Reduce
Hello.

In my application I am using HDFS without Map/Reduce. Yesterday on this list I learned about HAR archives. They are a great solution for handling my archived data, so I decided to create such an archive and test it. The creation worked in local MapReduce mode, but each file took ~3 seconds, which overall is too slow (the archive was still created successfully and seems to suit my needs). Since the per-file time is quite exact (no 2- or 4-second intervals, though sometimes 6 seconds), I suspect this is some kind of polling timeout. Can anyone tell me whether I must set up Map/Reduce over my cluster just for the HAR creation task, or whether I can change this timeout (and if so, which one)?

Best regards, Vitalii Tymchyshyn
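For reference, a HAR archive is built with the `hadoop archive` tool, which runs as a MapReduce job; without a configured JobTracker it falls back to local mode, which is where the slowness above was seen. Paths and the archive name below are placeholders:

```
# Archive the contents of /user/me/input into data.har under
# /user/me/archives (all names are examples, not from the original post).
hadoop archive -archiveName data.har -p /user/me input /user/me/archives

# The result can then be read back through the har:// filesystem:
hadoop fs -ls har:///user/me/archives/data.har
```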
Data node check dir storm
Hello.

I can see that if a data node receives some IO error, this can cause a checkDir storm. What I mean:

1) Any error produces a DataNode.checkDiskError call.

2) This call locks the volume set while it scans the directory tree:

java.lang.Thread.State: RUNNABLE
    at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
    at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:228)
    at java.io.File.exists(File.java:733)
    at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:65)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:86)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:228)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSDir.checkDirTree(FSDataset.java:232)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.checkDirs(FSDataset.java:414)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:617)
    - locked 0x00080a8faec0 (a org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset.checkDataDir(FSDataset.java:1681)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:745)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:735)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:202)
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
    at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
    at java.lang.Thread.run(Thread.java:619)

3) This produces timeouts on other calls, e.g.:

2011-06-17 17:35:03,922 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception:
java.io.InterruptedIOException
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:260)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    at java.io.DataOutputStream.flush(DataOutputStream.java:106)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.close(BlockReceiver.java:183)
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:151)
    at org.apache.hadoop.io.IOUtils.closeStream(IOUtils.java:167)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:646)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.opWriteBlock(DataXceiver.java:352)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.opWriteBlock(DataTransferProtocol.java:390)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$Receiver.processOp(DataTransferProtocol.java:331)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:111)
    at java.lang.Thread.run(Thread.java:619)

4) These timeouts, in turn, are treated as errors and produce more dir-check calls.

5) The whole cluster works very slowly because of one half-working node.

Best regards, Vitalii Tymchyshyn
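One generic way to damp such a storm (a sketch of the pattern, not Hadoop's actual DataNode code) is to collapse concurrent and rapidly repeated check requests, so that at most one expensive directory scan runs at a time and rechecks are rate-limited:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: collapse concurrent disk-check requests so at most one
// expensive scan runs at a time, and scans are rate-limited.
// This is a generic guard pattern, not the real Hadoop fix.
public class DiskCheckGuard {
    private final AtomicBoolean checking = new AtomicBoolean(false);
    private final long minIntervalMillis;
    private volatile long lastCheck = 0;

    public DiskCheckGuard(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Runs the scan only if no scan is in flight and the rate limit
     * allows it. Returns true if the scan actually ran.
     */
    public boolean maybeCheck(Runnable scan) {
        long now = System.currentTimeMillis();
        if (now - lastCheck < minIntervalMillis) {
            return false;                    // too soon after the last scan
        }
        if (!checking.compareAndSet(false, true)) {
            return false;                    // another thread is already scanning
        }
        try {
            scan.run();
            lastCheck = System.currentTimeMillis();
            return true;
        } finally {
            checking.set(false);
        }
    }
}
```

With such a guard, the flood of checkDiskError calls triggered by one bad volume degenerates into a single scan per interval instead of a scan per IO error.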