Re: Problem with setting up the cluster
Have a look at the datanode log files on the datanode machines and see what the error is in there.

Cheers,
Tom

On Thu, Jun 25, 2009 at 6:21 AM, .ke. sivakumar kesivaku...@gmail.com wrote:

Hi all, I'm a student and I have been trying to set up the hadoop cluster for a while but have been unsuccessful till now. I'm trying to set up a 4-node cluster: 1 namenode, 1 job tracker, 2 datanodes / task trackers. Version of hadoop: 0.18.3.

Config in hadoop-site.xml (which I have replicated in the conf directory of all 4 nodes):

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>swin07.xx.xx.edu:9001</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://swin06.xx.xx.edu:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/kesivakumar/hadoop/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/kesivakumar/hadoop/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/mapred/system</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

The problem is that both of my datanodes are dead. The slave files are configured properly, and to my surprise the tasktrackers are running (checked through swin07:50030 and it showed 2 tasktrackers running; swin06:50070 showed the namenode is active but 0 datanodes active). So when I try copying the conf dir into the filesystem using the -put command it throws errors. Below is the last part of the error output:

09/06/25 00:36:30 WARN dfs.DFSClient: NotReplicatedYetException sleeping /user/kesivakumar/input/hadoop-env.sh retries left 1
09/06/25 00:36:34 WARN dfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/kesivakumar/input/hadoop-env.sh could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1123)
        at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
        at org.apache.hadoop.ipc.Client.call(Client.java:716)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2450)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2333)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1745)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1922)
09/06/25 00:36:34 WARN dfs.DFSClient: Error Recovery for block null bad datanode[0]
put: Could not get block locations. Aborting...
Exception closing file /user/kesivakumar/input/hadoop-env.sh
java.io.IOException: Could not get block locations. Aborting...
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2153)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1745)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1899)

When I tried bin/hadoop dfsadmin -report, it says that 0 datanodes are available. Also, while formatting the namenode using bin/hadoop namenode -format, the format is done at swin06/127.0.1.1 --- why is it not getting done at 127.0.0.1? How can I rectify these errors? Any help would be greatly appreciated. Thank you.
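For anyone hitting the same "could only be replicated to 0 nodes" error: the datanode log files Tom refers to are normally written under the logs directory on each datanode (the exact location depends on HADOOP_LOG_DIR), and dfsadmin shows whether any datanodes have managed to register with the namenode:

  $HADOOP_HOME/logs/hadoop-<user>-datanode-<hostname>.log
  bin/hadoop dfsadmin -report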
Re: Unable to run Jar file in Hadoop.
Hi Krishna,

You get this error when the jar file cannot be found. It looks like /user/hadoop/hadoop-0.18.0-examples.jar is an HDFS path, when in fact it should be a local path.

Cheers,
Tom

On Thu, Jun 25, 2009 at 9:43 AM, krishna prasanna svk_prasa...@yahoo.com wrote:

Oh! Thanks Shravan. Krishna.

From: Shravan Mahankali shravan.mahank...@catalytic.com
To: core-user@hadoop.apache.org
Sent: Thursday, 25 June, 2009 1:50:51 PM
Subject: RE: Unable to run Jar file in Hadoop.

I am having a similar problem as well... there is no solution yet!!!
Thank You, Shravan Kumar. M
Catalytic Software Ltd. [SEI-CMMI Level 5 Company]

-----Original Message-----
From: krishna prasanna [mailto:svk_prasa...@yahoo.com]
Sent: Thursday, June 25, 2009 1:01 PM
To: core-user@hadoop.apache.org
Subject: Unable to run Jar file in Hadoop.

Hi, when I am trying to run a jar in Hadoop, it is giving me the following error:

had...@krishna-dev:/usr/local/hadoop$ bin/hadoop jar /user/hadoop/hadoop-0.18.0-examples.jar
java.io.IOException: Error opening job jar: /user/hadoop/hadoop-0.18.0-examples.jar
        at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Caused by: java.util.zip.ZipException: error in opening zip file
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.<init>(ZipFile.java:114)
        at java.util.jar.JarFile.<init>(JarFile.java:133)
        at java.util.jar.JarFile.<init>(JarFile.java:70)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:88)
        ... 4 more

Here are the file permissions:

-rw-r--r--   2 hadoop supergroup      91176 2009-06-25 12:49 /user/hadoop/hadoop-0.18.0-examples.jar

Somebody help me on this. Thanks in advance, Krishna.
Re: Rebalancing Hadoop Cluster running 15.3
Hi Usman, Before the rebalancer was introduced one trick people used was to increase the replication on all the files in the system, wait for re-replication to complete, then decrease the replication to the original level. You can do this using hadoop fs -setrep. Cheers, Tom On Thu, Jun 25, 2009 at 10:33 AM, Usman Waheedusm...@opera.com wrote: Hi, One of our test clusters is running HADOOP 15.3 with replication level set to 2. The datanodes are not balanced at all. Datanode_1: 52% Datanode_2: 82% Datanode_3: 30% 15.3 does not have the rebalancer capability, we are planning to upgrade but not for now. If i take out Datanode_1 from the cluster (decommission for sometime) will HADOOP balance so that Datanode_2 and Datanode_3 will even out to 56%? Then i can re-introduce Datanode_1 back into the cluster. Comments/Suggestions please? Thanks, Usman
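For reference, the trick described above can be driven entirely from the command line; something like the following (the replication factors are only illustrative; re-run fsck until it reports no under-replicated blocks before dropping back down):

  bin/hadoop fs -setrep -R 3 /    # temporarily raise the replication factor everywhere
  bin/hadoop fsck /               # repeat until no under-replicated blocks are reported
  bin/hadoop fs -setrep -R 2 /    # drop back to the original level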
Re: Rebalancing Hadoop Cluster running 15.3
You can change the value of hadoop.root.logger in conf/log4j.properties to change the log level globally. See also the section Custom Logging levels in the same file to set levels on a per-component basis. You can also use hadoop daemonlog to set log levels on a temporary basis (they are reset on restart). I'm not sure if this was in Hadoop 0.15. Cheers, Tom On Thu, Jun 25, 2009 at 11:12 AM, Usman Waheedusm...@opera.com wrote: Hi Tom, Thanks for the trick :). I tried by setting the replication to 3 in the hadoop-default.xml but then the namenode-logfile in /var/log/hadoop started getting full with the messages marked in bold: 2009-06-24 14:39:06,338 INFO org.apache.hadoop.dfs.StateChange: STATE* SafeModeInfo.leave: Safe mode is OFF. 2009-06-24 14:39:06,339 INFO org.apache.hadoop.dfs.StateChange: STATE* Network topology has 1 racks and 3 datanodes 2009-06-24 14:39:06,339 INFO org.apache.hadoop.dfs.StateChange: STATE* UnderReplicatedBlocks has 48545 blocks 2009-06-24 14:39:07,655 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.20.11.45:50010 to replicate blk_-4602580985572290582 to datanode(s) 10.20.11.44:50010 2009-06-24 14:39:07,655 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.20.11.45:50010 to replicate blk_-4602036196619511999 to datanode(s) 10.20.11.44:50010 2009-06-24 14:39:07,666 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.20.11.43:50010 to replicate blk_-4601863051065326105 to datanode(s) 10.20.11.44:50010 2009-06-24 14:39:07,666 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.20.11.43:50010 to replicate blk_-4601770656364938220 to datanode(s) 10.20.11.44:50010 2009-06-24 14:39:10,829 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.20.11.44:50010 is added to blk_-4601770656364938220 2009-06-24 14:39:10,832 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.20.11.45:50010 to replicate blk_-4601706607039808418 to datanode(s) 10.20.11.44:50010 2009-06-24 14:39:10,833 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.20.11.45:50010 to replicate blk_-4601652202073012439 to datanode(s) 10.20.11.44:50010 2009-06-24 14:39:10,834 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.20.11.43:50010 to replicate blk_-4601470720696217621 to datanode(s) 10.20.11.44:50010 2009-06-24 14:39:10,834 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask 10.20.11.43:50010 to replicate blk_-4601267705629076611 to datanode(s) 10.20.11.44:50010 *2009-06-24 14:39:13,899 WARN org.apache.hadoop.fs.FSNamesystem: Not able to place enough replicas, still in need of 1 2009-06-24 14:39:13,899 WARN org.apache.hadoop.fs.FSNamesystem: Not able to place enough replicas, still in need of 1 2009-06-24 14:39:13,899 WARN org.apache.hadoop.fs.FSNamesystem: Not able to place enough replicas, still in need of 1 2009-06-24 14:39:13,900 WARN org.apache.hadoop.fs.FSNamesystem: Not able to place enough replicas, still in need of 1 2009-06-24 14:39:13,900 WARN org.apache.hadoop.fs.FSNamesystem: Not able to place enough replicas, still in need of 1 2009-06-24 14:39:13,900 WARN org.apache.hadoop.fs.FSNamesystem: Not able to place enough replicas, still in need of 1 2009-06-24 14:39:13,901 WARN org.apache.hadoop.fs.FSNamesystem: Not able to place enough replicas, still in need of 1 2009-06-24 14:39:13,901 WARN 
org.apache.hadoop.fs.FSNamesystem: Not able to place enough replicas, still in need of 1* It is a very small cluster with limited disk space. The disk was getting full because of all these extra messages there were being written to the logfile. Eventually the file system would file up and hadoop hangs. This happened when i set the dfs.replication = 3 in the hadoop-default.xml and restarted the cluster. Is there a way i can turn off these WARN messages which are filling up the file system. I can run the command on the command line like you advised with replication set to 3 and then once done, set it back to 2. Currently the dfs.replication is set to 2 in the hadoop-default.xml. Thanks, Usman Hi Usman, Before the rebalancer was introduced one trick people used was to increase the replication on all the files in the system, wait for re-replication to complete, then decrease the replication to the original level. You can do this using hadoop fs -setrep. Cheers, Tom On Thu, Jun 25, 2009 at 10:33 AM, Usman Waheedusm...@opera.com wrote: Hi, One of our test clusters is running HADOOP 15.3 with replication level set to 2. The datanodes are not balanced at all. Datanode_1: 52% Datanode_2: 82% Datanode_3: 30% 15.3 does not have the rebalancer capability, we are planning to upgrade but not for now. If i take out Datanode_1 from the cluster (decommission
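To silence the WARN lines above, the two mechanisms Tom mentions look roughly like this (the logger name is taken from the messages themselves; as noted, the daemonlog command may not exist in a release as old as 0.15):

  # conf/log4j.properties -- persistent, applied when the daemon is restarted
  log4j.logger.org.apache.hadoop.fs.FSNamesystem=ERROR

  # temporary, reset on restart (the port is the namenode's HTTP port)
  bin/hadoop daemonlog -setlevel namenode-host:50070 org.apache.hadoop.fs.FSNamesystem ERROR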
Re: HDFS Safemode and EC2 EBS?
Hi Chris,

You should really start all the slave nodes to be sure that you don't lose data. If you start fewer than #nodes - #replication + 1 nodes then you are virtually guaranteed to lose blocks. Starting 6 nodes out of 10 will cause the filesystem to remain in safe mode, as you've seen. BTW, I've just created a Jira for EBS support (https://issues.apache.org/jira/browse/HADOOP-6108) which you might be interested in.

Cheers,
Tom

On Thu, Jun 25, 2009 at 3:51 PM, Chris Curtin curtin.ch...@gmail.com wrote:

Hi, I am using 0.19.0 on EC2. The Hadoop execution and HDFS directories are on EBS volumes mounted to each node in my EC2 cluster. Only the install of Hadoop is in the AMI. We have 10 EBS volumes and when the cluster starts it randomly picks one for each slave. We don't always start all 10 slaves, depending on what type of work we are going to do. Every third or fourth start of the cluster the namenode goes into safe mode and won't come out automatically. Restarting datanodes and task trackers on each of the slaves doesn't help. Not much in the log files besides the error about waiting for the available %. Forcing it out of safe mode allows the cluster to start working. My only thought is that something is being stored on one of the EBS volumes not being mounted when starting a smaller configuration (say 6 nodes instead of 10). But isn't HDFS fault tolerant so that if there is a missing node it carries on? Any advice on why the namenode and datanodes can't find all the data blocks? Or where to look for more information about what might be going on? Thanks, Chris
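For reference, safe mode can be inspected and, if you are sure the missing blocks only live on nodes you chose not to start, overridden from the command line (use leave with care, since reads of missing blocks will fail):

  bin/hadoop dfsadmin -safemode get     # is the namenode in safe mode?
  bin/hadoop dfsadmin -safemode wait    # block until it leaves safe mode on its own
  bin/hadoop dfsadmin -safemode leave   # force it out immediately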
Re: EC2, Hadoop, copy file from CLUSTER_MASTER to CLUSTER, failing
Hi Saptarshi, The group permissions open the firewall ports to enable access, but there are no shared keys on the cluster by default. See https://issues.apache.org/jira/browse/HADOOP-4131 for a patch to the scripts that shares keys to allow SSH access between machines in the cluster. Cheers, Tom On Sat, Jun 20, 2009 at 7:09 AM, Saptarshi Guhasaptarshi.g...@gmail.com wrote: Hello, I have a cluster with 1 master and 1 slave (testing). In the EC2 scripts, in the hadoop-ec2-init-remote.sh file, I wish to copy a file from the MASTER to the CLUSTER i.e in the slave section scp $MASTER_HOST:/tmp/v /tmp/v However, this didnt work and when I logged in, ssh'd to the slave and tried the command, I got the following error: Permission denied (publickey,gssapi-with-mic) Yet, the group permissions appear to be valid i.e ec2-authorize $CLUSTER_MASTER -o $CLUSTER -u $AWS_ACCOUNT_ID ec2-authorize $CLUSTER -o $CLUSTER_MASTER -u $AWS_ACCOUNT_ID So I don't see why I can't ssh into the MASTER group from a slave. Any suggestion as to where I'm going wrong? Regards Saptarshi P.S I know I can copy a file from S3, but would like to know what is going on here.
Re: Looking for correct way to implements WritableComparable in Hadoop-0.17
Hi Kun,

The book's code is for 0.20.0. In Hadoop 0.17.x WritableComparable was not generic, so you need a declaration like:

public class IntPair implements WritableComparable {
}

And the compareTo() method should look like this:

public int compareTo(Object o) {
  IntPair ip = (IntPair) o;
  int cmp = compare(first, ip.first);
  if (cmp != 0) {
    return cmp;
  }
  return compare(second, ip.second);
}

Finally, if you are using Java 5 you should remove the @Override annotations.

Cheers,
Tom

On Sun, Jun 21, 2009 at 1:16 AM, Kunsheng Chen ke...@yahoo.com wrote:

Hello everyone, I am writing my own Comparator that inherits from WritableComparable. I got the following code from the Hadoop definitive guide, but it is not working at all; the compiler tells me WritableComparable does not take a parameter. The book might be using Hadoop-0.21. I also tried the old method for the 0.18 version as below: http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/io/WritableComparable.html but then it tells me I haven't implemented the compareTo method, which actually I did. I am wondering if I have to reinstall hadoop again (I prefer not to) or whether there is an old way to do it. Any idea is well appreciated! -Kun

--

import java.io.*;
import org.apache.hadoop.io.*;

public class IntPair implements WritableComparable<IntPair> {

  private int first;
  private int second;
  private Text third;

  public IntPair(int first, int second, Text third) {
    set(first, second, third);
  }

  public void set(int first, int second, Text third) {
    this.first = first;
    this.second = second;
    this.third = third;
  }

  public int getFirst() {
    return first;
  }

  public int getSecond() {
    return second;
  }

  public Text getThird() {
    return third;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);
    out.writeInt(second);
    third.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();
    second = in.readInt(); // Redundant
    third.readFields(in);
  }

  @Override
  public int hashCode() {
    return first * 163 + second + third.hashCode();
  }

  @Override
  public boolean equals(Object o) {
    if (o instanceof IntPair) {
      IntPair ip = (IntPair) o;
      return first == ip.first && second == ip.second && third.equals(ip.third);
    }
    return false;
  }

  @Override
  public String toString() {
    return first + "\t" + second + "\t" + third;
  }

  @Override
  public int compareTo(IntPair ip) {
    int cmp = compare(first, ip.first);
    if (cmp != 0) {
      return cmp;
    }
    return compare(second, ip.second);
  }

  /**
   * Convenience method for comparing two ints.
   */
  public static int compare(int a, int b) {
    return (a < b ? -1 : (a == b ? 0 : 1));
  }
}
Re: Is it possible? I want to group data blocks.
You might be interested in https://issues.apache.org/jira/browse/HDFS-385, where there is discussion about how to add pluggable block placement to HDFS. Cheers, Tom On Tue, Jun 23, 2009 at 5:50 PM, Alex Loddengaarda...@cloudera.com wrote: Hi Hyunsik, Unfortunately you can't control the servers that blocks go on. Hadoop does block allocation for you, and it tries its best to distribute data evenly among the cluster, so long as replicated blocks reside on different machines, on different racks (assuming you've made Hadoop rack-aware). Hope this clears things up. Alex 2009/6/23 Hyunsik Choi c0d3h...@gmail.com Hi all, I would like to give data locality. In other words, I want to place certain data blocks on one machine. In some problems, subsets of an entire dataset need one another for answer. Most of the graph problems are good examples. Is it possible? If impossible, can you advice about that? Thank you in advance. - Hyunsik Choi -
Re: Running Hadoop/Hbase in a OSGi container
Hi Ninad, I don't know if anyone has looked at this for Hadoop Core or HBase (although there is this Jira: https://issues.apache.org/jira/browse/HADOOP-4604), but there's some work for making ZooKeeper's jar OSGi compliant at https://issues.apache.org/jira/browse/ZOOKEEPER-425. Cheers, Tom On Thu, Jun 11, 2009 at 1:10 AM, Ninad Rautninad.evera...@gmail.com wrote: Hi, Our architecture team wants to run Hadoop/Hbase and the mapreduce jobs using OSGi container. This is to take advantages of the OSGi framework to have a pluggable architecture. I have searched through the net and looks like people are working or have achieved success in this. Can some one please help me understand the technical feasibility and if feasible the way to move forward? Regards, Ninad.
Re: Command-line jobConf options in 0.18.3
Actually, the space is needed, to be interpreted as a Hadoop option by ToolRunner. Without the space it sets a Java system property, which Hadoop will not automatically pick up. Ian, try putting the options after the classname and see if that helps. Otherwise, it would be useful to see a snippet of the program code. Thanks, Tom On Thu, Jun 4, 2009 at 8:23 PM, Vasyl Keretsman vasi...@gmail.com wrote: Perhaps, there should not be the space between -D and your option ? -Dprise.collopts= Vasyl 2009/6/4 Ian Soboroff ian.sobor...@nist.gov: bin/hadoop jar -files collopts -D prise.collopts=collopts p3l-3.5.jar gov.nist.nlpir.prise.mapred.MapReduceIndexer input output The 'prise.collopts' option doesn't appear in the JobConf. Ian Aaron Kimball aa...@cloudera.com writes: Can you give an example of the exact arguments you're sending on the command line? - Aaron On Wed, Jun 3, 2009 at 5:46 PM, Ian Soboroff ian.sobor...@nist.gov wrote: If after I call getConf to get the conf object, I manually add the key/ value pair, it's there when I need it. So it feels like ToolRunner isn't parsing my args for some reason. Ian On Jun 3, 2009, at 8:45 PM, Ian Soboroff wrote: Yes, and I get the JobConf via 'JobConf job = new JobConf(conf, the.class)'. The conf is the Configuration object that comes from getConf. Pretty much copied from the WordCount example (which this program used to be a long while back...) thanks, Ian On Jun 3, 2009, at 7:09 PM, Aaron Kimball wrote: Are you running your program via ToolRunner.run()? How do you instantiate the JobConf object? - Aaron On Wed, Jun 3, 2009 at 10:19 AM, Ian Soboroff ian.sobor...@nist.gov wrote: I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long story), and I'm finding that when I run a job and try to pass options with -D on the command line, that the option values aren't showing up in my JobConf. I logged all the key/value pairs in the JobConf, and the option I passed through with -D isn't there. This worked in 0.19.1... did something change with command-line options from 18 to 19? Thanks, Ian
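To make the generic-option handling concrete, a minimal 0.18-style driver using ToolRunner might look like the sketch below (MyDriver is a stand-in for Ian's MapReduceIndexer, and the property name is only echoed back to show it was picked up). It would be invoked with the generic options after the class name, e.g. bin/hadoop jar p3l-3.5.jar gov.nist.nlpir.prise.mapred.MapReduceIndexer -files collopts -D prise.collopts=collopts input output:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() returns the Configuration that ToolRunner/GenericOptionsParser
    // has already populated from any -D and -files arguments
    JobConf job = new JobConf(getConf(), MyDriver.class);
    System.out.println("prise.collopts = " + job.get("prise.collopts"));
    // ... set input/output paths, mapper class etc., then JobClient.runJob(job)
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyDriver(), args));
  }
}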
Re: InputFormat for fixed-width records?
Hi Stuart,

There isn't an InputFormat that comes with Hadoop to do this. Rather than pre-processing the file, it would be better to implement your own InputFormat. Subclass FileInputFormat and provide an implementation of getRecordReader() that returns your implementation of RecordReader to read fixed width records. In the next() method you would do something like:

byte[] buf = new byte[100];
IOUtils.readFully(in, buf, pos, 100);
pos += 100;

You would also need to check for the end of the stream. See LineRecordReader for some ideas. You'll also have to handle finding the start of records for a split, which you can do by looking at the offset and seeking to the next multiple of 100.

If the RecordReader was a RecordReader<NullWritable, BytesWritable> (no keys) then it would return each record as a byte array to the mapper, which would then break it into fields. Alternatively, you could do it in the RecordReader, and use your own type which encapsulates the fields for the value.

Hope this helps.

Cheers,
Tom

On Thu, May 28, 2009 at 1:15 PM, Stuart White stuart.whi...@gmail.com wrote:

I need to process a dataset that contains text records of fixed length in bytes. For example, each record may be 100 bytes in length, with the first field being the first 10 bytes, the second field being the second 10 bytes, etc... There are no newlines on the file. Field values have been either whitespace-padded or truncated to fit within the specific locations in these fixed-width records. Does Hadoop have an InputFormat to support processing of such files? I looked but couldn't find one. Of course, I could pre-process the file (outside of Hadoop) to put newlines at the end of each record, but I'd prefer not to require such a prep step. Thanks.
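To flesh that out a little, the read loop in such a RecordReader might look roughly like the fragment below (old 0.18-era API, untested; RECORD_LENGTH, in, pos and end are assumed fields of the hypothetical reader, with in an FSDataInputStream and end the split's end offset):

// Fragment of a RecordReader<NullWritable, BytesWritable> implementation.
public boolean next(NullWritable key, BytesWritable value) throws IOException {
  if (pos >= end) {
    return false;                          // no more whole records in this split
  }
  byte[] buf = new byte[RECORD_LENGTH];    // e.g. 100 bytes per record
  in.readFully(pos, buf);                  // positioned read of one whole record
  value.set(buf, 0, buf.length);
  pos += RECORD_LENGTH;
  return true;
}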
Re: SequenceFile and streaming
Hi Walter, On Thu, May 28, 2009 at 6:52 AM, walter steffe ste...@tiscali.it wrote: Hello I am a new user and I would like to use hadoop streaming with SequenceFile in both input and output side. -The first difficoulty arises from the lack of a simple tool to generate a SequenceFile starting from a set of files in a given directory. I would like to have something similar to tar -cvf file.tar foo/ This should work also in the opposite direction like tar -xvf file.tar There's a tool for turning tar files into sequence files here: http://stuartsierra.com/2008/04/24/a-million-little-files -An other important feature that I would like to see is the possibility to feed the mapper stdin with the whole content of a file (extracted from the file SequenceFile) disregarding the key. Have a look at SequenceFileAsTextInputFormat which will do this for you (except the key is the sequence file's key). Using each file as a tar archive I it would like to be able to do: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ -input /user/me/inputSequenceFile \ -output /user/me/outputSequenceFile \ -inputformat SequenceFile -outputformat SequenceFile -mapper myscript.sh -reducer NONE myscrip.sh should work as a filter which takes its input from stdin and put the output on stdout: tar -x do something on the generated dir and create an outputfile cat outputfile The output file should (automatically) go into the outputSequenceFile. I think that this would be a very usefull schema which fits well with the mapreduce requirements on one side and with the unix commands on the other side. It should not be too difficoult to implement the tools needed for that. I totally agree - having more tools to better integrate sequence files and map files with unix tools would be very handy. Tom Walter
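On the input side, the class mentioned above can be passed straight to streaming; an untested sketch of Walter's command with it plugged in:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input /user/me/inputSequenceFile \
  -output /user/me/output \
  -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
  -mapper myscript.sh \
  -reducer NONE

Each map task then sees key<TAB>value lines on stdin, with the sequence file's key and value rendered as text.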
Re: avoid custom crawler getting blocked
Have you had a look at Nutch (http://lucene.apache.org/nutch/)? It has solved this kind of problem. Cheers, Tom On Wed, May 27, 2009 at 9:58 AM, John Clarke clarke...@gmail.com wrote: My current project is to gather stats from a lot of different documents. We're are not indexing just getting quite specific stats for each document. We gather 12 different stats from each document. Our requirements have changed somewhat now, originally it was working on documents from our own servers but now it needs to fetch other ones from quite a large variety of sources. My approach up to now was to have the map function simply take each filepath (or now URL) in turn, fetch the document, calculate the stats and output those stats. My new problem is some of the locations we are now visiting don't like their IP being hit multiple times in a row. Is it possible to check a URL against a visited list of IPs and if recently visited either wait for a certain amount of time or push it back onto the input stack so it will be processed later in the queue? Or is there a better way? Thanks, John
Re: RandomAccessFile with HDFS
RandomAccessFile isn't supported directly, but you can seek when reading from files in HDFS (see FSDataInputStream's seek() method). Writing at an arbitrary offset in an HDFS file is not supported however. Cheers, Tom On Sun, May 24, 2009 at 1:33 PM, Stas Oskin stas.os...@gmail.com wrote: Hi. Any idea if RandomAccessFile is going to be supported in HDFS? Regards.
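A small, self-contained illustration of the read-side seeking described above (the path and offsets are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path("/user/me/data.bin"));
    byte[] buf = new byte[1024];
    in.seek(1000000L);                      // reposition the stream at byte 1,000,000
    in.readFully(buf);                      // read 1 KB from there
    in.read(5000000L, buf, 0, buf.length);  // positioned read: doesn't move the stream pointer
    in.close();
  }
}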
Re: Circumventing Hadoop's data placement policy
You can't use it yet, but https://issues.apache.org/jira/browse/HADOOP-3799 (Design a pluggable interface to place replicas of blocks in HDFS) would enable you to write your own policy so blocks are never placed locally. Might be worth following its development to check it can meet your need? Cheers, Tom On Sat, May 23, 2009 at 8:06 PM, jason hadoop jason.had...@gmail.com wrote: Can you give your machines multiple IP addresses, and bind the grid server to a different IP than the datanode With solaris you could put it in a different zone, On Sat, May 23, 2009 at 10:13 AM, Brian Bockelman bbock...@math.unl.eduwrote: Hey all, Had a problem I wanted to ask advice on. The Caltech site I work with currently have a few GridFTP servers which are on the same physical machines as the Hadoop datanodes, and a few that aren't. The GridFTP server has a libhdfs backend which writes incoming network data into HDFS. They've found that the GridFTP servers which are co-located with HDFS datanode have poor performance because data is incoming at a much faster rate than the HDD can handle. The standalone GridFTP servers, however, push data out to multiple nodes at one, and can handle the incoming data just fine (200MB/s). Is there any way to turn off the preference for the local node? Can anyone think of a good workaround to trick HDFS into thinking the client isn't on the same node? Brian -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Number of maps and reduces not obeying my configuration
On Thu, May 21, 2009 at 5:18 AM, Foss User foss...@gmail.com wrote: On Wed, May 20, 2009 at 3:18 PM, Tom White t...@cloudera.com wrote: The number of maps to use is calculated on the client, since splits are computed on the client, so changing the value of mapred.map.tasks only on the jobtracker will not have any effect. Note that the number of map tasks that you set is only a suggestion, and depends on the number of splits actually created. In your case it looks like 4 splits were created. As a rule, you shouldn't set the number of map tasks, since by default one map task is created for each HDFS block, which works well for most applications. This is explained further in the javadoc: http://hadoop.apache.org/core/docs/r0.19.1/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int) The number of reduces to use is determined by the JobConf that is created on the client, so it uses the client's hadoop-site.xml, not the jobtracker's one. This is why it is set to 1, even though you set it to 2 on the jobtracker. If you don't want to set configuration properties in code (and I agree it's often a good idea not to hardcode things like the number of maps or reduces in code), then you can make your driver use Tool and ToolRunner as Chuck explained. Finally, in general you should try to keep hadoop-site.xml the same across your clients and cluster nodes to avoid surprises about which value has been set. Hope this helps, Tom By client do you mean the machine where I logged in and invoked 'hadoop jar' command to submit and run my job? Yes.
Re: Shutdown in progress exception
On Wed, May 20, 2009 at 10:22 PM, Stas Oskin stas.os...@gmail.com wrote: You should only use this if you plan on manually closing FileSystems yourself from within your own shutdown hook. It's somewhat of an advanced feature, and I wouldn't recommend using this patch unless you fully understand the ramifications of modifying the shutdown sequence. Standard dfs.close() would do the trick, no? After you've performed your application shutdown actions you should call FileSystem's closeAll() method. Just uploaded a patch based on branch 18 for you to that JIRA. Thanks a lot!
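A minimal sketch of the ordering being discussed, assuming the HADOOP-4829 patch is applied so Hadoop's own hook no longer closes the filesystems first (copyPendingFilesToHdfs is a hypothetical application method):

Runtime.getRuntime().addShutdownHook(new Thread() {
  public void run() {
    try {
      copyPendingFilesToHdfs();     // application cleanup that still needs HDFS
      FileSystem.closeAll();        // close the cached FileSystem instances last
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
});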
Re: multiple results for each input line
You could combine them into one file using a reduce stage (with a single reducer), or by using hadoop fs -getmerge on the output directory. Cheers, Tom On Thu, May 21, 2009 at 3:14 PM, John Clarke clarke...@gmail.com wrote: Hi, I want one output file not multiple but I think your reply has steered me in the right direction! Thanks John 2009/5/20 Tom White t...@cloudera.com Hi John, You could do this with a map only-job (using NLineInputFormat, and setting the number of reducers to 0), and write the output key as docnameN,stat1,stat2,stat3,stat12 and a null value. This assumes that you calculate all 12 statistics in one map. Each output file would have a single line in it. Cheers, Tom On Wed, May 20, 2009 at 10:21 AM, John Clarke clarke...@gmail.com wrote: Hi, I'm having some trouble implementing what I want to achieve... essentially I have a large input list of documents that I want to get statistics on. For each document I have 12 different stats to work out. So my input file is a text file with one document filepath on each line. The documents are stored on a remote server. I want to fetch each document and calculate certain stats from it. My problem is with the output. I want my output to be similar to this: docname1,stat1,stat2,stat3,stat12 docname2,stat1,stat2,stat3,stat12 docname3,stat1,stat2,stat3,stat12 . . . docnameN,stat1,stat2,stat3,stat12 I can fetch the document in my map code and perform my stats calculation on it but dont know how to return more than one value for a key, the key in this case being the document name. Cheers, John
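The getmerge route is a one-liner once the job has finished (paths are illustrative):

  bin/hadoop fs -getmerge /user/john/output stats.txt
  # concatenates every part-* file in the output directory into a single local file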
Re: Number of maps and reduces not obeying my configuration
The number of maps to use is calculated on the client, since splits are computed on the client, so changing the value of mapred.map.tasks only on the jobtracker will not have any effect. Note that the number of map tasks that you set is only a suggestion, and depends on the number of splits actually created. In your case it looks like 4 splits were created. As a rule, you shouldn't set the number of map tasks, since by default one map task is created for each HDFS block, which works well for most applications. This is explained further in the javadoc: http://hadoop.apache.org/core/docs/r0.19.1/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int) The number of reduces to use is determined by the JobConf that is created on the client, so it uses the client's hadoop-site.xml, not the jobtracker's one. This is why it is set to 1, even though you set it to 2 on the jobtracker. If you don't want to set configuration properties in code (and I agree it's often a good idea not to hardcode things like the number of maps or reduces in code), then you can make your driver use Tool and ToolRunner as Chuck explained. Finally, in general you should try to keep hadoop-site.xml the same across your clients and cluster nodes to avoid surprises about which value has been set. Hope this helps, Tom On Wed, May 20, 2009 at 5:21 AM, Foss User foss...@gmail.com wrote: On Wed, May 20, 2009 at 3:39 AM, Chuck Lam chuck@gmail.com wrote: Can you set the number of reducers to zero and see if it becomes a map only job? If it does, then it's able to read in the mapred.reduce.tasks property correctly but just refuse to have 2 reducers. In that case, it's most likely you're running in local mode, which doesn't allow more than 1 reducer. As I have already mentioned in my original mail, I am not running it in local mode. Quoting from my original mail: My configuration file is set as follows: mapred.map.tasks = 2 mapred.reduce.tasks = 2 However, the description of these properties mention that these settings would be ignored if mapred.job.tracker is set as 'local'. Mine is set properly with IP address, port number. If setting zero doesn't change anything, then your config file is not being read, or it's being overridden. As an aside, if you use ToolRunner in your Hadoop program, then it will support generic options such that you can run your program with the option -D mapred.reduce.tasks=2 to tell it to use 2 reducers. This allows you to set the number of reducers on a per-job basis. I understand that it is being overridden by something else. What I want to know is which file is overriding it. Also, please note that I have these settings only in the conf/hadoop-site.xml of job tracker node. Is that enough?
Re: multiple results for each input line
Hi John,

You could do this with a map-only job (using NLineInputFormat, and setting the number of reducers to 0), and write the output key as docnameN,stat1,stat2,stat3,stat12 with a null value. This assumes that you calculate all 12 statistics in one map. Each output file would have a single line in it.

Cheers,
Tom

On Wed, May 20, 2009 at 10:21 AM, John Clarke clarke...@gmail.com wrote:

Hi, I'm having some trouble implementing what I want to achieve... essentially I have a large input list of documents that I want to get statistics on. For each document I have 12 different stats to work out. So my input file is a text file with one document filepath on each line. The documents are stored on a remote server. I want to fetch each document and calculate certain stats from it. My problem is with the output. I want my output to be similar to this:

docname1,stat1,stat2,stat3,stat12
docname2,stat1,stat2,stat3,stat12
docname3,stat1,stat2,stat3,stat12
.
.
.
docnameN,stat1,stat2,stat3,stat12

I can fetch the document in my map code and perform my stats calculation on it but don't know how to return more than one value for a key, the key in this case being the document name.

Cheers,
John
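In code, the map-only setup described above amounts to something like this fragment from a driver's run() method (the class name and output types are illustrative; the linespermap property name is the old-API one):

JobConf job = new JobConf(getConf(), DocStatsDriver.class);       // DocStatsDriver is hypothetical
job.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
job.setInt("mapred.line.input.format.linespermap", 1);            // one document path per map task
job.setNumReduceTasks(0);                                         // map-only: map output goes straight to the output dir
job.setOutputKeyClass(org.apache.hadoop.io.Text.class);           // "docnameN,stat1,...,stat12"
job.setOutputValueClass(org.apache.hadoop.io.NullWritable.class);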
Re: Linking against Hive in Hadoop development tree
On Fri, May 15, 2009 at 11:06 PM, Owen O'Malley omal...@apache.org wrote: On May 15, 2009, at 2:05 PM, Aaron Kimball wrote: In either case, there's a dependency there. You need to split it so that there are no cycles in the dependency tree. In the short term it looks like: avro: core: avro hdfs: core mapred: hdfs, core Why does mapred depend on hdfs? MapReduce should only depend on the FileSystem interface, shouldn't it? Tom hive: mapred, core pig: mapred, core Adding a dependence from core to hive would be bad. To integrate with Hive, you need to add a contrib module to Hive that adds it. -- Owen
Re: Shutdown in progress exception
Looks like you are trying to copy a file to HDFS in a shutdown hook. Since you can't control the order in which shutdown hooks run, this won't work. There is a patch to allow Hadoop's FileSystem shutdown hook to be disabled so it doesn't close filesystems on exit. See https://issues.apache.org/jira/browse/HADOOP-4829.

Cheers,
Tom

On Tue, May 19, 2009 at 8:44 AM, Stas Oskin stas.os...@gmail.com wrote:

Hi. Does anyone have any idea on this issue? Thanks!

2009/5/17 Stas Oskin stas.os...@gmail.com

Hi. I have an issue where my application, when shutting down (at ShutdownHook level), is unable to copy files to HDFS. Each copy throws the following exception:

java.lang.IllegalStateException: Shutdown in progress
        at java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:39)
        at java.lang.Runtime.addShutdownHook(Runtime.java:192)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1353)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213)
        at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:189)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1185)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1161)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1133)
        at app.util.FileUtils.copyToCluster(FileUtils.java:392)

I read some reports, including this one (https://issues.apache.org/jira/browse/HADOOP-3818), but there was no definite answer on how to do it. Is there any good solution to this issue? Thanks in advance.
Re: Access to local filesystem working folder in map task
Hi Chris, The task-attempt local working folder is actually just the current working directory of your map or reduce task. You should be able to pass your legacy command line exe and other files using the -files option (assuming you are using the Java interface to write your job, and you are implementing Tool; streaming also supports the -files option) and they will appear in the local working folder. You shouldn't have to use the DistributedCache class directly at all. Cheers, Tom On Tue, May 19, 2009 at 2:21 PM, Chris Carman kri...@redlab.ee wrote: hi users, I have started writing my first project on Hadoop and am now seeking some guidance from more experienced members. The project is about running some CPU intensive computations in parallel and should be a straightforward application for MapReduce, as the input dataset can easily be partitioned to independent jobs and the final aggregation is a low cost step. The application, however, relies on a legacy command line exe file (which runs OK under wine). It reads about 10 small files (5mb) from its working folder and produces another 10 as a result. I can easily send those files and the app to all nodes via DistributedCache so that they get stored read-only to the local file system. I now need to get a local working folder for the task-attempt, where I could copy or symlink the relevant inputs, execute the legacy exe, and read off the output. As I understand, the task is returning an HDFS location when I ask for FileOutputFormat.getWorkOutputPath(job); I read from docs that there should be task-attempt local working folder, but I struggle to find a way to get the filesystem path to it, so that I could copy files and pass it in to my app for local processing. Tell me it's an easy one that I've missed. Many Thanks, Chris
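Concretely, the pattern described above looks something like this (myjob.jar, MyDriver, legacy.exe and the input files are placeholders for Chris's tool and data):

  bin/hadoop jar myjob.jar MyDriver -files legacy.exe,input1.dat,input2.dat in out

// In the map task, the -files payload is available in the task's current
// working directory, so relative paths just work:
File exe = new File("legacy.exe");                       // resolves inside the task's cwd
Process p = Runtime.getRuntime().exec(
    new String[] { "wine", exe.getAbsolutePath() });
p.waitFor();                                             // then read the output files the tool wrote into cwd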
Re: Is there any performance issue with Jrockit JVM for Hadoop
On Mon, May 18, 2009 at 11:44 AM, Steve Loughran ste...@apache.org wrote:

Grace wrote: To follow up this question, I have also asked for help on the JRockit forum. They kindly offered some useful and detailed suggestions according to the JRA results. After updating the option list, the performance did become better to some extent. But it is still not comparable with the Sun JVM. Maybe it is due to the use case with short duration and the different implementation at the JVM layer between Sun and JRockit. I would like to go back to using the Sun JVM for now. Thanks all for your time and help.

What about flipping the switch that says "run tasks in the TT's own JVM"? That should handle startup costs, and reduce the memory footprint.

The property mapred.job.reuse.jvm.num.tasks allows you to set how many tasks the JVM may be reused for (within a job), but it always runs in a separate JVM to the tasktracker. (BTW https://issues.apache.org/jira/browse/HADOOP-3675 has some discussion about running tasks in the tasktracker's JVM.)

Tom
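The reuse property Tom mentions is set per job, for example in the job's configuration:

  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <!-- -1 means no limit: reuse each task JVM for any number of tasks of the same job -->
    <value>-1</value>
  </property>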
Re: public IP for datanode on EC2
Hi Joydeep, The problem you are hitting may be because port 50001 isn't open, whereas from within the cluster any node may talk to any other node (because the security groups are set up to do this). However I'm not sure this is a good approach. Configuring Hadoop to use public IP addresses everywhere should work, but you have to pay for all data transfer between nodes (see http://aws.amazon.com/ec2/, Public and Elastic IP Data Transfer). This is going to get expensive fast! So to get this to work well, we would have to make changes to Hadoop so it was aware of both public and private addresses, and use the appropriate one: clients would use the public address, while daemons would use the private address. I haven't looked at what it would take to do this or how invasive it would be. Cheers, Tom On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma jssa...@facebook.com wrote: I changed the ec2 scripts to have fs.default.name assigned to the public hostname (instead of the private hostname). Now I can submit jobs remotely via the socks proxy (the problem below is resolved) - but the map tasks fail with an exception: 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. Already tried 9 time(s). 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.IOException: Call to ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on local exception: Connection refused at org.apache.hadoop.ipc.Client.call(Client.java:699) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy1.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:177) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120) at org.apache.hadoop.mapred.Child.main(Child.java:153) strangely enough - job submissions from nodes within the ec2 cluster work just fine. I looked at the job.xml files of jobs submitted locally and remotely and don't see any relevant differences. Totally foxed now. Joydeep -Original Message- From: Joydeep Sen Sarma [mailto:jssa...@facebook.com] Sent: Wednesday, May 13, 2009 9:38 PM To: core-user@hadoop.apache.org Cc: Tom White Subject: RE: public IP for datanode on EC2 Thanks Philip. Very helpful (and great blog post)! This seems to make basic dfs command line operations work just fine. 
However - I am hitting a new error during job submission (running hadoop-0.19.0): 2009-05-14 00:15:34,430 ERROR exec.ExecDriver (SessionState.java:printError(279)) - Job Submission failed with exception 'java.net.UnknownHostException(unknown host: domU-12-31-39-00-51-94.compute-1.internal)' java.net.UnknownHostException: unknown host: domU-12-31-39-00-51-94.compute-1.internal at org.apache.hadoop.ipc.Client$Connection.init(Client.java:195) at org.apache.hadoop.ipc.Client.getConnection(Client.java:791) at org.apache.hadoop.ipc.Client.call(Client.java:686) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:176) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) at org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:469) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788) looking at the stack trace and the code - it seems that this is happening because the jobclient asks for the mapred system directory from the jobtracker - which replies back with a path name that's qualified against the fs.default.name setting of the jobtracker. Unfortunately the standard
Re: public IP for datanode on EC2
Yes, you're absolutely right. Tom On Thu, May 14, 2009 at 2:19 PM, Joydeep Sen Sarma jssa...@facebook.com wrote: The ec2 documentation point to the use of public 'ip' addresses - whereas using public 'hostnames' seems safe since it resolves to internal addresses from within the cluster (and resolve to public ip addresses from outside). The only data transfer that I would incur while submitting jobs from outside is the cost of copying the jar files and any other files meant for the distributed cache). That would be extremely small. -Original Message- From: Tom White [mailto:t...@cloudera.com] Sent: Thursday, May 14, 2009 5:58 AM To: core-user@hadoop.apache.org Subject: Re: public IP for datanode on EC2 Hi Joydeep, The problem you are hitting may be because port 50001 isn't open, whereas from within the cluster any node may talk to any other node (because the security groups are set up to do this). However I'm not sure this is a good approach. Configuring Hadoop to use public IP addresses everywhere should work, but you have to pay for all data transfer between nodes (see http://aws.amazon.com/ec2/, Public and Elastic IP Data Transfer). This is going to get expensive fast! So to get this to work well, we would have to make changes to Hadoop so it was aware of both public and private addresses, and use the appropriate one: clients would use the public address, while daemons would use the private address. I haven't looked at what it would take to do this or how invasive it would be. Cheers, Tom On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma jssa...@facebook.com wrote: I changed the ec2 scripts to have fs.default.name assigned to the public hostname (instead of the private hostname). Now I can submit jobs remotely via the socks proxy (the problem below is resolved) - but the map tasks fail with an exception: 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. Already tried 9 time(s). 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.IOException: Call to ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on local exception: Connection refused at org.apache.hadoop.ipc.Client.call(Client.java:699) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy1.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:177) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120) at org.apache.hadoop.mapred.Child.main(Child.java:153) strangely enough - job submissions from nodes within the ec2 cluster work just fine. I looked at the job.xml files of jobs submitted locally and remotely and don't see any relevant differences. Totally foxed now. Joydeep -Original Message- From: Joydeep Sen Sarma [mailto:jssa...@facebook.com] Sent: Wednesday, May 13, 2009 9:38 PM To: core-user@hadoop.apache.org Cc: Tom White Subject: RE: public IP for datanode on EC2 Thanks Philip. Very helpful (and great blog post)! 
This seems to make basic dfs command line operations work just fine. However - I am hitting a new error during job submission (running hadoop-0.19.0): 2009-05-14 00:15:34,430 ERROR exec.ExecDriver (SessionState.java:printError(279)) - Job Submission failed with exception 'java.net.UnknownHostException(unknown host: domU-12-31-39-00-51-94.compute-1.internal)' java.net.UnknownHostException: unknown host: domU-12-31-39-00-51-94.compute-1.internal at org.apache.hadoop.ipc.Client$Connection.init(Client.java:195) at org.apache.hadoop.ipc.Client.getConnection(Client.java:791) at org.apache.hadoop.ipc.Client.call(Client.java:686) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:176) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367) at org.apache.hadoop.fs.FileSystem.access$200
Re: HDFS to S3 copy problems
Ian - Thanks for the detailed analysis. It was these issues that lead me to create a temporary file in NativeS3FileSystem in the first place. I think we can get NativeS3FileSystem to report progress though, see https://issues.apache.org/jira/browse/HADOOP-5814. Ken - I can't see why you would be getting that error. Does it work with hadoop fs, but not hadoop distcp? Cheers, Tom On Sat, May 9, 2009 at 6:48 AM, Nowland, Ian nowl...@amazon.com wrote: Hi Tom, Not creating a temp file is the ideal as it saves you from having to waste using the local hard disk by writing an output file just before uploading same to Amazon S3. There are a few problems though: 1) Amazon S3 PUTs need the file length up front. You could use a chunked POST, but then you have the disadvantage of having to Base64 encode all your data, increasing bandwidth usage, and also you still have the next problems; 2) You would still want to have MD5 checking. In Amazon S3 both PUT and POST require the MD5 to be supplied before the contents. To work around this then you would have to upload the object without MD5, then check its metadata to make sure the MD5 is correct, then delete it if it is not. This is all possible, but would be difficult to make bulletproof, whereas in the current version, if the MD5 is different the PUT fails atomically and you can easily just retry. 3) Finally, you would have to be careful in reducers that output only very rarely. If there is too big a gap between data being uploaded through the socket, then S3 may determine the connection has timed out, closing the connection and meaning your task has to rerun (perhaps just to hit the same problem again). All of this means that the current solution may be best for now as far as general upload. The best I think we can so is fix the fact that the task is not progressed in close(). The best way I can see to do this is introducing a new interface say called ExtendedClosable which defines a close(Progressable p) method. Then, have the various clients of FileSystem output streams (e.g. Distcp, TextOutputFormat) test if their DataOutputStream supports the interface, and if so call this in preference to the default. In the case of NativeS3FileSystem then, this method spins up a thread to keep the Progressable updated as the upload progresses. As an additional optimization to Distcp, where the source file already exists we could have some extended interface say ExtendedWriteFileSystem that has a create() method that takes the MD5 and the file size, then test for this interface in the Distcp mapper call the extended method. The trade off here is the fact that the checksum HDFS stored is not the MD5 needed by S3, and so two (perhaps distributed) reads would be needed so the tradeoff is these two distributed reads vs a distributed read and a local write then local read. What do you think? Cheers, Ian Nowland Amazon.com -Original Message- From: Tom White [mailto:t...@cloudera.com] Sent: Friday, May 08, 2009 1:36 AM To: core-user@hadoop.apache.org Subject: Re: HDFS to S3 copy problems Perhaps we should revisit the implementation of NativeS3FileSystem so that it doesn't always buffer the file on the client. We could have an option to make it write directly to S3. Thoughts? Regarding the problem with HADOOP-3733, you can work around it by setting fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey in your hadoop-site.xml. Cheers, Tom On Fri, May 8, 2009 at 1:17 AM, Andrew Hitchcock adpow...@gmail.com wrote: Hi Ken, S3N doesn't work that well with large files. 
When uploading a file to S3, S3N saves it to local disk during write() and then uploads to S3 during the close(). Close can take a long time for large files and it doesn't report progress, so the call can time out. As a work around, I'd recommend either increasing the timeout or uploading the files by hand. Since you only have a few large files, you might want to copy the files to local disk and then use something like s3cmd to upload them to S3. Regards, Andrew On Thu, May 7, 2009 at 4:42 PM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi all, I have a few large files (4 that are 1.8GB+) I'm trying to copy from HDFS to S3. My micro EC2 cluster is running Hadoop 0.19.1, and has one master/two slaves. I first tried using the hadoop fs -cp command, as in: hadoop fs -cp output/dir/ s3n://bucket/dir/ This seemed to be working, as I could walk the network traffic spike, and temp files were being created in S3 (as seen with CyberDuck). But then it seemed to hang. Nothing happened for 30 minutes, so I killed the command. Then I tried using the hadoop distcp command, as in: hadoop distcp hdfs://host:50001/path/dir/ s3://public key:private key@bucket/dir2/ This failed, because my secret key has a '/' in it (http://issues.apache.org/jira/browse/HADOOP-3733) Then I tried using hadoop distcp
Re: Mixing s3, s3n and hdfs
Hi Kevin, The s3n filesystem treats each file as a single block, however you may be able to split files by setting the number of mappers appropriately (or setting mapred.max.split.size in the new MapReduce API in 0.20.0). S3 supports range requests, and the s3n implementation uses them, so it wouldn't try to download the entire file for each split. You don't need to run a namenode for S3 filesystems, it is only needed for HDFS. So it is feasible to run S3 and HDFS in parallel, copying data from one to the other. Cheers, Tom On Fri, May 8, 2009 at 8:55 AM, Kevin Peterson kpeter...@biz360.com wrote: Currently, we are running our cluster in EC2 with HDFS stored on the local (i.e. transient) disk. We don't want to deal with EBS, because it complicates being able to spin up additional slaves as needed. We're looking at moving to a combination of s3 (block) or s3n for data that we care about, and leaving lower value data that we can recreate on HDFS. My thinking is that s3n has significant advantages in terms of how easy it is to import data from non-Hadoop processes, and also the ease of sampling data, but I'm not sure how well it actually works. I'm guessing that it wouldn't be able to split files, or maybe it would need to download the entire file from S3 multiple times to split it? Is the issue with writes buffering the entire file on the local machine significant? Our jobs tend to be more CPU intensive than the usual kind of log processing type jobs, so we usually end up with smaller files. Is it feasible to run s3 (block) and hdfs in parallel? Would I need two namenodes to do this? Is this a good idea? Has anyone tried either of these configurations in EC2?
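Copying between the two filesystems is then just a distcp with different URI schemes (the bucket and paths are placeholders; with fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey set in hadoop-site.xml the credentials can stay out of the URI):

  bin/hadoop distcp hdfs://namenode:9000/data/logs s3n://my-bucket/backup/logs
  bin/hadoop distcp s3n://my-bucket/backup/logs hdfs://namenode:9000/data/logs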
Re: About Hadoop optimizations
On Thu, May 7, 2009 at 6:05 AM, Foss User foss...@gmail.com wrote: Thanks for your response again. I could not understand a few things in your reply. So, I want to clarify them. Please find my questions inline. On Thu, May 7, 2009 at 2:28 AM, Todd Lipcon t...@cloudera.com wrote: On Wed, May 6, 2009 at 1:46 PM, Foss User foss...@gmail.com wrote: 2. Is the meta data for file blocks on data node kept in the underlying OS's file system on namenode or is it kept in RAM of the name node? The block locations are kept in the RAM of the name node, and are updated whenever a Datanode does a block report. This is why the namenode is in safe mode at startup until it has received block locations for some configurable percentage of blocks from the datanodes. What is safe mode in namenode? This concept is new to me. Could you please explain this? Safe mode is described here: http://hadoop.apache.org/core/docs/r0.20.0/hdfs_design.html#Safemode 3. If no mapper more mapper functions can be run on the node that contains the data on which the mapper has to act on, is Hadoop intelligent enough to run the new mappers on some machines within the same rack? Yes, assuming you have configured a network topology script. Otherwise, Hadoop has no magical knowledge of your network infrastructure, and it treats the whole cluster as a single rack called /default-rack Is it a network topology script or is it a Java plugin code? AFAIK, we need to write an implementation of org.apache.hadoop.net.DNSToSwitchMapping interface. Can we write it as a script or configuration file and avoid Java coding to achieve this? If so, how? To tell Hadoop about your network topology you can either write a Java implementation of org.apache.hadoop.net.DNSToSwitchMapping or you can write a script in another language. There are more details at http://hadoop.apache.org/core/docs/r0.20.0/cluster_setup.html#Hadoop+Rack+Awareness and a sample script at http://www.nabble.com/Hadoop-topology.script.file.name-Form-td17683521.html
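For the Java route, a minimal sketch of a DNSToSwitchMapping implementation; the rack-from-hostname convention below is invented for illustration. The class would be wired in with the topology.node.switch.mapping.impl property, while the script route uses topology.script.file.name instead.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

public class SimpleRackMapping implements DNSToSwitchMapping {

  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String name : names) {
      // e.g. host "dn-rack3-07" maps to "/rack3"; anything else falls back
      // to the single default rack
      int i = name.indexOf("-rack");
      if (i >= 0) {
        racks.add("/" + name.substring(i + 1).split("-")[0]);
      } else {
        racks.add("/default-rack");
      }
    }
    return racks;
  }
}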
Re: move tasks to another machine on the fly
Hi David, The MapReduce framework will attempt to rerun failed tasks automatically. However, if a task is running out of memory on one machine, it's likely to run out of memory on another, isn't it? Have a look at the mapred.child.java.opts configuration property for the amount of memory that each task VM is given (200MB by default). You can also control the memory that each daemon gets using the HADOOP_HEAPSIZE variable in hadoop-env.sh. Or you can specify it on a per-daemon basis using the HADOOP_DAEMON_NAME_OPTS variables in the same file. Tom On Wed, May 6, 2009 at 1:28 AM, David Batista dsbati...@gmail.com wrote: I get this error when running Reduce tasks on a machine: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:597) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.(DFSClient.java:2591) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:454) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:190) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:387) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:117) at org.apache.hadoop.mapred.lib.MultipleTextOutputFormat.getBaseRecordWriter(MultipleTextOutputFormat.java:44) at org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.write(MultipleOutputFormat.java:99) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410) is it possible to move a reduce task to other machine in the cluster on the fly? -- ./david
Re: Using multiple FileSystems in hadoop input
Hi Ivan, I haven't tried this combination, but I think it should work. If it doesn't it should be treated as a bug. Tom On Wed, May 6, 2009 at 11:46 AM, Ivan Balashov ibalas...@iponweb.net wrote: Greetings to all, Could anyone suggest if Paths from different FileSystems can be used as input of Hadoop job? Particularly I'd like to find out whether Paths from HarFileSystem can be mixed with ones from DistributedFileSystem. Thanks, -- Kind regards, Ivan
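A small sketch of what Ivan describes, assuming the old org.apache.hadoop.mapred API; the host, archive and directory names are placeholders and the har:// path layout is only indicative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

// inside the job driver
JobConf conf = new JobConf(MyDriver.class);  // MyDriver is a placeholder
FileInputFormat.setInputPaths(conf,
    new Path("hdfs://namenode:9000/data/current"),
    new Path("har://hdfs-namenode:9000/user/archives/old.har/data"));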
Re: multi-line records and file splits
Hi Rajarshi, FileInputFormat (SDFInputFormat's superclass) will break files into splits, typically on HDFS block boundaries (if the defaults are left unchanged). This is not a problem for your code however, since it will read every record that starts within a split (even if it crosses a split boundary). This is just like how TextInputFormat works. So you don't need to use MultiFileInputFormat - it should work as is. You could demonstrate this to yourself by writing a multi-block file, and doing an identity MapReduce on it. You should find that no records are lost. You might be able to use org.apache.hadoop.streaming.StreamXmlRecordReader (and StreamInputFormat), which does something similar. Despite its name it is not only for Streaming applications, and it isn't restricted to XML. It can parse records that begin with a certain sequence of characters, and end with another sequence. Cheers, Tom On Wed, May 6, 2009 at 2:06 AM, Nick Cen cenyo...@gmail.com wrote: I think your SDFInputFormat should implement the MultiFileInputFormat instead of the TextInputFormat, which will not splid the file into chunk. 2009/5/6 Rajarshi Guha rg...@indiana.edu Hi, I have implemented a subclass of RecordReader to handle a plain text file format where a record is multi-line and of variable length. Schematically each record is of the form some_title foo bar another_title foo foo bar where is the marker for the end of the record. My code is at http://blog.rguha.net/?p=293 and it seems to work fine on my input data. However, I realized that when I run the program, Hadoop will 'chunk' the input file. As a result, the SDFRecordReader might get a chunk of input text, such that the last record is actually incomplete (a missing ). Is this correct? If so, how would the RecordReader implementation recover from this situation? Or is there a way to indicate to Hadoop that the input file should be chunked keeping in mind end of record delimiters? Thanks --- Rajarshi Guha rg...@indiana.edu GPG Fingerprint: D070 5427 CC5B 7938 929C DD13 66A1 922C 51E7 9E84 --- Q: What's polite and works for the phone company? A: A deferential operator. -- http://daily.appspot.com/food/
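If StreamXmlRecordReader fits, a sketch of driving it from a regular Java job; the driver class and the marker strings are placeholders for the actual start-of-record and end-of-record sequences in the SDF files, and the hadoop-streaming jar is assumed to be on the job classpath.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

// inside the job driver
JobConf conf = new JobConf(SdfDriver.class);  // SdfDriver is a placeholder
conf.setInputFormat(StreamInputFormat.class);
conf.set("stream.recordreader.class",
    "org.apache.hadoop.streaming.StreamXmlRecordReader");
conf.set("stream.recordreader.begin", "RECORD_START");  // placeholder marker
conf.set("stream.recordreader.end", "RECORD_END");      // placeholder marker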
Re: Specifying System Properties in the had
Another way to do this would be to set a property in the Hadoop config itself. In the job launcher you would have something like: JobConf conf = ... conf.setProperty(foo, test); Then you can read the property in your map or reduce task. Tom On Thu, Apr 30, 2009 at 3:25 PM, Aaron Kimball aa...@cloudera.com wrote: So you want a different -Dfoo=test on each node? It's probably grabbing the setting from the node where the job was submitted, and this overrides the settings on each task node. Try adding finaltrue/final to the property block on the tasktrackers, then restart Hadoop and try again. This will prevent the job from overriding the setting. - Aaron On Thu, Apr 30, 2009 at 9:25 AM, Marc Limotte mlimo...@feeva.com wrote: I'm trying to set a System Property in the Hadoop config, so my jobs will know which cluster they are running on. I think I should be able to do this with -Dname=value in mapred.child.java.opts (example below), but the setting is ignored. In hadoop-site.xml I have: property namemapred.child.java.opts/name value-Xmx200m -Dfoo=test/value /property But the job conf through the web server indicates: mapred.child.java.opts -Xmx1024M -Duser.timezone=UTC I'm using Hadoop-0.17.2.1. Any tips on why my setting is not picked up? Marc PRIVATE AND CONFIDENTIAL - NOTICE TO RECIPIENT: THIS E-MAIL IS MEANT FOR ONLY THE INTENDED RECIPIENT OF THE TRANSMISSION, AND MAY BE A COMMUNICATION PRIVILEGE BY LAW. IF YOU RECEIVED THIS E-MAIL IN ERROR, ANY REVIEW, USE, DISSEMINATION, DISTRIBUTION, OR COPYING OF THIS EMAIL IS STRICTLY PROHIBITED. PLEASE NOTIFY US IMMEDIATELY OF THE ERROR BY RETURN E-MAIL AND PLEASE DELETE THIS MESSAGE FROM YOUR SYSTEM.
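A sketch of the job-configuration approach described above: the launcher sets a value on the JobConf and each task reads it back in configure(). The property name my.cluster.name and the mapper are made up for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ClusterAwareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String clusterName;

  @Override
  public void configure(JobConf job) {
    clusterName = job.get("my.cluster.name", "unknown");
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    output.collect(new Text(clusterName), value);
  }
}

// In the job launcher:
//   JobConf conf = new JobConf(ClusterAwareMapper.class);
//   conf.set("my.cluster.name", "prod-east");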
Re: Patching and bulding produces no libcordio or libhdfs
Have a look at the instructions on http://wiki.apache.org/hadoop/HowToRelease under the Building section. It tells you which environment settings and Ant targets you need to set. Tom On Tue, Apr 28, 2009 at 9:09 AM, Sid123 itis...@gmail.com wrote: HI I have applied a small patch for version 0.20 to my old 0.19.1... After i ran the ant tar I found 3 directories libhdfs and libcodio and c++ were missing from th tared build. Where do you get those from? I cant really use 0.20 because of massive library changes... So if some one can help me out here... Greatly appreciated.. Here is the patch https://issues.apache.org/jira/secure/attachment/12397007/patch-4963.txt i applied -- View this message in context: http://www.nabble.com/Patching-and-bulding-produces-no-libcordio-or-libhdfs-tp23272263p23272263.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: How to run many jobs at the same time?
You need to start each JobControl in its own thread so they can run concurrently. Something like: Thread t = new Thread(jobControl); t.start(); Then poll the jobControl.allFinished() method. Tom On Tue, Apr 21, 2009 at 10:02 AM, nguyenhuynh.mr nguyenhuynh...@gmail.com wrote: Hi all! I have some jobs: job1, job2, job3,... . Each job working with the group. To control jobs, I have JobControllers, each JobController control jobs follow the specified group. Example: - Have 2 Group: g1 and g2 - 2 JobController: jController1, jcontroller2 + jController1 contains jobs: job1, job2, job3, ... + jController2 contains jobs: job1, job2, job3, ... * To run jobs, I sue: for (i=0; i2; i++){ jCtrl[i]= new jController(group i); jCtrl[i].run(); } * I want jController1 and jController2 run parallel. But actual, when jController1 finished, jController2 begin run. Why? Please help me! * P/s: jController use org.apache.hadoop.mapred.jobcontrol.JobControl Thanks, cheer, Nguyen.
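A sketch of the pattern Tom describes, assuming the jobs for each group have already been added to their JobControl instances.

import org.apache.hadoop.mapred.jobcontrol.JobControl;

JobControl jController1 = new JobControl("group1");
JobControl jController2 = new JobControl("group2");
// jController1.addJob(...); jController2.addJob(...);

Thread t1 = new Thread(jController1);
Thread t2 = new Thread(jController2);
t1.start();   // both groups now run concurrently
t2.start();

while (!(jController1.allFinished() && jController2.allFinished())) {
  try {
    Thread.sleep(1000);
  } catch (InterruptedException e) {
    break;
  }
}
jController1.stop();
jController2.stop();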
Re: Interesting Hadoop/FUSE-DFS access patterns
Not sure if will affect your findings, but when you read from a FSDataInputStream you should see how many bytes were actually read by inspecting the return value and re-read if it was fewer than you want. See Hadoop's IOUtils readFully() method. Tom On Mon, Apr 13, 2009 at 4:22 PM, Brian Bockelman bbock...@cse.unl.edu wrote: Hey Todd, Been playing more this morning after thinking about it for the night -- I think the culprit is not the network, but actually the cache. Here's the output of your script adjusted to do the same calls as I was doing (you had left out the random I/O part). [br...@red tmp]$ java hdfs_tester Mean value for reads of size 0: 0.0447 Mean value for reads of size 16384: 10.4872 Mean value for reads of size 32768: 10.82925 Mean value for reads of size 49152: 6.2417 Mean value for reads of size 65536: 7.0511003 Mean value for reads of size 81920: 9.411599 Mean value for reads of size 98304: 9.378799 Mean value for reads of size 114688: 8.99065 Mean value for reads of size 131072: 5.1378503 Mean value for reads of size 147456: 6.1324 Mean value for reads of size 163840: 17.1187 Mean value for reads of size 180224: 6.5492 Mean value for reads of size 196608: 8.45695 Mean value for reads of size 212992: 7.4292 Mean value for reads of size 229376: 10.7843 Mean value for reads of size 245760: 9.29095 Mean value for reads of size 262144: 6.57865 Copy of the script below. So, without the FUSE layer, we don't see much (if any) patterns here. The overhead of randomly skipping through the file is higher than the overhead of reading out the data. Upon further inspection, the biggest factor affecting the FUSE layer is actually the Linux VFS caching -- if you notice, the bandwidth in the given graph for larger read sizes is *higher* than 1Gbps, which is the limit of the network on that particular node. If I go in the opposite direction - starting with the largest reads first, then going down to the smallest reads, the graph entirely smooths out for the small values - everything is read from the filesystem cache in the client RAM. Graph attached. So, on the upside, mounting through FUSE gives us the opportunity to speed up reads for very complex, non-sequential patterns - for free, thanks to the hardworking Linux kernel. On the downside, it's incredibly difficult to come up with simple cases to demonstrate performance for an application -- the cache performance and size depends on how much activity there's on the client, the previous file system activity that the application did, and the amount of concurrent activity on the server. I can give you results for performance, but it's not going to be the performance you see in real life. (Gee, if only file systems were easy...) Ok, sorry for the list noise -- it seems I'm going to have to think more about this problem before I can come up with something coherent. 
Brian import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.FileStatus; import org.apache.hadoop.fs.Path; import org.apache.hadoop.fs.FSDataInputStream; import org.apache.hadoop.conf.Configuration; import java.io.IOException; import java.net.URI; import java.util.Random; public class hdfs_tester { public static void main(String[] args) throws Exception { URI uri = new URI("hdfs://hadoop-name:9000/"); FileSystem fs = FileSystem.get(uri, new Configuration()); Path path = new Path("/user/uscms01/pnfs/unl.edu/data4/cms/store/phedex_monarctest/Nebraska/LoadTest07_Nebraska_33"); FSDataInputStream dis = fs.open(path); Random rand = new Random(); FileStatus status = fs.getFileStatus(path); long file_len = status.getLen(); int iters = 20; for (int size=0; size < 1024*1024; size += 4*4096) { long csum = 0; for (int i = 0; i < iters; i++) { int pos = rand.nextInt((int)((file_len-size-1)/8))*8; byte buf[] = new byte[size]; if (pos < 0) pos = 0; long st = System.nanoTime(); dis.read(pos, buf, 0, size); long et = System.nanoTime(); csum += et-st; //System.out.println(String.valueOf(size) + "\t" + String.valueOf(pos) + "\t" + String.valueOf(et - st)); } float csum2 = csum; csum2 /= iters; System.out.println("Mean value for reads of size " + size + ": " + (csum2/1000/1000)); } fs.close(); } } On Apr 13, 2009, at 3:14 AM, Todd Lipcon wrote: On Mon, Apr 13, 2009 at 1:07 AM, Todd Lipcon t...@cloudera.com wrote: Hey Brian, This is really interesting stuff. I'm curious - have you tried these same experiments using the Java API? I'm wondering whether this is FUSE-specific or inherent to all HDFS reads. I'll try to reproduce this over here as well. This smells sort of nagle-related to me... if you get a chance, you may want to edit DFSClient.java and change TCP_WINDOW_SIZE to 256 * 1024, and see if the magic number jumps up to 256KB. If so, I think it should be a pretty easy bugfix. Oops -
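Following Tom's note at the top of this thread about checking how many bytes read() actually returned, the single dis.read(...) call in the benchmark above could be replaced by a loop, or by the positioned readFully() helper, roughly:

// replaces: dis.read(pos, buf, 0, size);
int off = 0;
while (off < size) {
  int n = dis.read(pos + off, buf, off, size - off);
  if (n < 0) {
    throw new java.io.EOFException("EOF before reading " + size + " bytes");
  }
  off += n;
}
// or simply: dis.readFully(pos, buf, 0, size);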
Re: Example of deploying jars through DistributedCache?
Does it work if you use addArchiveToClassPath()? Also, it may be more convenient to use GenericOptionsParser's -libjars option. Tom On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball aa...@cloudera.com wrote: Hi all, I'm stumped as to how to use the distributed cache's classpath feature. I have a library of Java classes I'd like to distribute to jobs and use in my mapper; I figured the DCache's addFileToClassPath() method was the correct means, given the example at http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html. I've boiled it down to the following non-working example: in TestDriver.java: private void runJob() throws IOException { JobConf conf = new JobConf(getConf(), TestDriver.class); // do standard job configuration. FileInputFormat.addInputPath(conf, new Path(input)); FileOutputFormat.setOutputPath(conf, new Path(output)); conf.setMapperClass(TestMapper.class); conf.setNumReduceTasks(0); // load aaronTest2.jar into the dcache; this contains the class ValueProvider FileSystem fs = FileSystem.get(conf); fs.copyFromLocalFile(new Path(aaronTest2.jar), new Path(tmp/aaronTest2.jar)); DistributedCache.addFileToClassPath(new Path(tmp/aaronTest2.jar), conf); // run the job. JobClient.runJob(conf); } and then in TestMapper: public void map(LongWritable key, Text value, OutputCollectorLongWritable, Text output, Reporter reporter) throws IOException { try { ValueProvider vp = (ValueProvider) Class.forName(ValueProvider).newInstance(); Text val = vp.getValue(); output.collect(new LongWritable(1), val); } catch (ClassNotFoundException e) { throw new IOException(not found: + e.toString()); // newInstance() throws to here. } catch (Exception e) { throw new IOException(Exception: + e.toString()); } } The class ValueProvider is to be loaded from aaronTest2.jar. I can verify that this code works if I put ValueProvider into the main jar I deploy. I can verify that aaronTest2.jar makes it into the ${mapred.local.dir}/taskTracker/archive/ But when run with ValueProvider in aaronTest2.jar, the job fails with: $ bin/hadoop jar aaronTest1.jar TestDriver 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process : 10 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process : 10 09/03/01 22:36:04 INFO mapred.JobClient: Running job: job_200903012210_0005 09/03/01 22:36:05 INFO mapred.JobClient: map 0% reduce 0% 09/03/01 22:36:14 INFO mapred.JobClient: Task Id : attempt_200903012210_0005_m_00_0, Status : FAILED java.io.IOException: not found: java.lang.ClassNotFoundException: ValueProvider at TestMapper.map(Unknown Source) at TestMapper.map(Unknown Source) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207) Do I need to do something else (maybe in Mapper.configure()?) to actually classload the jar? The documentation makes me believe it should already be in the classpath by doing only what I've done above. I'm on Hadoop 0.18.3. Thanks, - Aaron
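A sketch of the -libjars route Tom mentions, assuming the driver is run through ToolRunner so that GenericOptionsParser handles the option; the command line shown in the comment is an example invocation, not taken from the original mails.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// launched with, for example:
//   hadoop jar aaronTest1.jar TestDriver -libjars aaronTest2.jar
public class TestDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), TestDriver.class);
    // usual job configuration (input/output paths, mapper class, ...)
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new TestDriver(), args));
  }
}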
Re: RecordReader design heuristic
Hi Josh, The other aspect to think about when writing your own record reader is input splits. As Jeff mentioned you really want mappers to be processing about one HDFS block's worth of data. If your inputs are significantly smaller, the overhead of creating mappers will be high and your jobs will be inefficient. On the other hand, if your inputs are significantly larger then you need to split them otherwise each mapper will take a very long time processing each split. Some file formats are inherently splittable, meaning you can re-align with record boundaries from an arbitrary point in the file. Examples include line-oriented text (split at newlines), and bzip2 (has a unique block marker). If your format is splittable then you will be able to take advantage of this to make MR processing more efficient. Cheers, Tom On Wed, Mar 18, 2009 at 5:00 PM, Patterson, Josh jpatters...@tva.gov wrote: Jeff, Yeah, the mapper sitting on a dfs block is pretty cool. Also, yes, we are about to start crunching on a lot of energy smart grid data. TVA is sorta like Switzerland for smart grid power generation and transmission data across the nation. Right now we have about 12TB, and this is slated to be around 30TB by the end of next 2010 (possibly more, depending on how many more PMUs come online). I am very interested in Mahout and have read up on it, it has many algorithms that I am familiar with from grad school. I will be doing some very simple MR jobs early on like finding the average frequency for a range of data, and I've been selling various groups internally on what CAN be done with good data mining and tools like Hadoop/Mahout. Our production cluster wont be online for a few more weeks, but that part is already rolling so I've moved on to focus on designing the first jobs to find quality results/benefits that I can sell in order to campaign for more ambitious projects I have drawn up. I know time series data lends itself to many machine learning applications, so, yes, I would be very interested in talking to anyone who wants to talk or share notes on hadoop and machine learning. I believe Mahout can be a tremendous resource for us and definitely plan on running and contributing to it. Josh Patterson TVA -Original Message- From: Jeff Eastman [mailto:j...@windwardsolutions.com] Sent: Wednesday, March 18, 2009 12:02 PM To: core-user@hadoop.apache.org Subject: Re: RecordReader design heuristic Hi Josh, It seemed like you had a conceptual wire crossed and I'm glad to help out. The neat thing about Hadoop mappers is - since they are given a replicated HDFS block to munch on - the job scheduler has replication factor number of node choices where it can run each mapper. This means mappers are always reading from local storage. On another note, I notice you are processing what looks to be large quantities of vector data. If you have any interest in clustering this data you might want to look at the Mahout project (http://lucene.apache.org/mahout/). We have a number of Hadoop-ready clustering algorithms, including a new non-parametric Dirichlet Process Clustering implementation that I committed recently. We are pulling it all together for a 0.1 release and I would be very interested in helping you to apply these algorithms if you have an interest. Jeff Patterson, Josh wrote: Jeff, ok, that makes more sense, I was under the mis-impression that it was creating and destroying mappers for each input record. I dont know why I had that in my head. 
My design suddenly became a lot clearer, and this provides a much more clean abstraction. Thanks for your help! Josh Patterson TVA
Re: Problem with com.sun.pinkdots.LogHandler
Hi Paul, Looking at the stack trace, the exception is being thrown from your map method. Can you put some debugging in there to diagnose it? Detecting and logging the size of the array and the index you are trying to access should help. You can write to standard error and look in the task logs. Another way is to use Reporter's setStatus() method as a quick way to see messages in the web UI. Cheers, Tom On Mon, Mar 16, 2009 at 11:51 PM, psterk paul.st...@sun.com wrote: Hi, I have been running a hadoop cluster successfully for a few months. During today's run, I am seeing a new error and it is not clear to me how to resolve it. Below are the stack traces and the configure file I am using. Please share any tips you may have. Thanks, Paul 09/03/16 16:28:25 INFO mapred.JobClient: Task Id : task_200903161455_0003_m_000127_0, Status : FAILED java.lang.ArrayIndexOutOfBoundsException: 3 at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71) at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124) task_200903161455_0003_m_000127_0: Starting null.task_200903161455_0003_m_000127_0 task_200903161455_0003_m_000127_0: Closing task_200903161455_0003_m_000127_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapred.TaskRu task_200903161455_0003_m_000127_0: log4j:WARN Please initialize the log4j system properly. 09/03/16 16:28:27 INFO mapred.JobClient: Task Id : task_200903161455_0003_m_000128_0, Status : FAILED java.lang.ArrayIndexOutOfBoundsException: 3 at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71) at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124) task_200903161455_0003_m_000128_0: Starting null.task_200903161455_0003_m_000128_0 task_200903161455_0003_m_000128_0: Closing 09/03/16 16:28:32 INFO mapred.JobClient: Task Id : task_200903161455_0003_m_000128_1, Status : FAILED java.lang.ArrayIndexOutOfBoundsException: 3 at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71) at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124) task_200903161455_0003_m_000128_1: Starting null.task_200903161455_0003_m_000128_1 task_200903161455_0003_m_000128_1: Closing 09/03/16 16:28:37 INFO mapred.JobClient: Task Id : task_200903161455_0003_m_000127_1, Status : FAILED java.lang.ArrayIndexOutOfBoundsException: 3 at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71) at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124) :qsk_200903161455_0003_m_000127_1: Starting null.task_200903161455_0003_m_000127_1 clear200903161455_0003_m_000127_1: Closing task_200903161455_0003_m_000127_1: log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Client). task_200903161455_0003_m_000127_1: log4j:WARN Please initialize the log4j system properly. 
09/03/16 16:28:40 INFO mapred.JobClient: Task Id : task_200903161455_0003_m_000128_2, Status : FAILED java.lang.ArrayIndexOutOfBoundsException: 3 at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71) at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124) task_200903161455_0003_m_000128_2: Starting null.task_200903161455_0003_m_000128_2 task_200903161455_0003_m_000128_2: Closing 09/03/16 16:28:46 INFO mapred.JobClient: map 100% reduce 100% java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062) at com.sun.pinkdots.Main.handleLogs(Main.java:63) at com.sun.pinkdots.Main.main(Main.java:35) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at
Re: contrib EC2 with hadoop 0.17
I haven't used Eucalyptus, but you could start by trying out the Hadoop EC2 scripts (http://wiki.apache.org/hadoop/AmazonEC2) with your Eucalyptus installation. Cheers, Tom On Tue, Mar 3, 2009 at 2:51 PM, falcon164 mujahid...@gmail.com wrote: I am new to hadoop. I want to run hadoop on eucalyptus. Please let me know how to do this. -- View this message in context: http://www.nabble.com/contrib-EC2-with-hadoop-0.17-tp17711758p22310068.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Hadoop AMI for EC2
Hi Richa, Yes there is. Please see http://wiki.apache.org/hadoop/AmazonEC2. Tom On Thu, Mar 5, 2009 at 4:13 PM, Richa Khandelwal richa...@gmail.com wrote: Hi All, Is there an existing Hadoop AMI for EC2 which has Hadoop set up on it? Thanks, Richa Khandelwal University Of California, Santa Cruz. Ph:425-241-7763
Re: MapReduce jobs with expensive initialization
On any particular tasktracker slot, task JVMs are shared only between tasks of the same job. When the job is complete the task JVM will go away. So there is certainly no sharing between jobs. I believe the static singleton approach outlined by Scott will work since the map classes are in a single classloader (but I haven't actually tried this). Cheers, Tom On Mon, Mar 2, 2009 at 1:39 AM, jason hadoop jason.had...@gmail.com wrote: If you have to you can reach through all of the class loaders and find the instance of your singleton class that has the data loaded. It is awkward, and I haven't done this in java since the late 90's. It did work the last time I did it. On Sun, Mar 1, 2009 at 11:21 AM, Scott Carey sc...@richrelevance.comwrote: You could create a singleton class and reference the dictionary stuff in that. You would probably want this separate from other classes as to control exactly what data is held on to for a long time and what is not. class Singleton { private static final _instance Singleton = new Singleton(); private Singleton() { ... initialize here, only ever called once per classloader or JVM; } public Singleton getSingleton() { return _instance; } in mapper: Singleton dictionary = Singleton.getSingleton(); This assumes that each mapper doesn't live in its own classloader space (which would make even static singletons not shareable), and has the drawback that once initialized, that memory associated with the singleton won't go away until the JVM or classloader that hosts it dies. I have not tried this myself, and do not know the exact classloader semantics used in the new 'persistent' task JVMs. They could have a classloader per job, and dispose of those when the job is complete -- though then it is impossible to persist data across jobs but only within them. Or there could be one permanent persisted classloader, or one per task. All will behave differently with respect to statics like the above example. From: Stuart White [stuart.whi...@gmail.com] Sent: Saturday, February 28, 2009 6:06 AM To: core-user@hadoop.apache.org Subject: MapReduce jobs with expensive initialization I have a mapreduce job that requires expensive initialization (loading of some large dictionaries before processing). I want to avoid executing this initialization more than necessary. I understand that I need to call setNumTasksToExecutePerJvm to -1 to force mapreduce to reuse JVMs when executing tasks. How I've been performing my initialization is, in my mapper, I override MapReduceBase#configure, read my parms from the JobConf, and load my dictionaries. It appears, from the tests I've run, that even though NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class are being created for each task, and therefore I'm still re-running this expensive initialization for each task. So, my question is: how can I avoid re-executing this expensive initialization per-task? Should I move my initialization code out of my mapper class and into my main class? If so, how do I pass references to the loaded dictionaries from my main class to my mapper? Thanks!
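A cleaned-up sketch of the static-singleton approach described in this thread; the class name DictionaryHolder is made up here. One instance is created per task JVM, so with JVM reuse enabled the dictionaries are loaded once and shared by all tasks of the same job that run in that JVM.

public class DictionaryHolder {

  private static final DictionaryHolder INSTANCE = new DictionaryHolder();

  private DictionaryHolder() {
    // expensive initialization: load the large dictionaries here
  }

  public static DictionaryHolder getInstance() {
    return INSTANCE;
  }
}

// In the mapper's configure() or map():
//   DictionaryHolder dict = DictionaryHolder.getInstance();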
Re: OutOfMemory error processing large amounts of gz files
Do you experience the problem with and without native compression? Set hadoop.native.lib to false to disable native compression. Cheers, Tom On Tue, Feb 24, 2009 at 9:40 PM, Gordon Mohr goj...@archive.org wrote: If you're doing a lot of gzip compression/decompression, you *might* be hitting this 6+-year-old Sun JVM bug: Instantiating Inflater/Deflater causes OutOfMemoryError; finalizers not called promptly enough http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4797189 A workaround is listed in the issue: ensuring you call close() or end() on the Deflater; something similar might apply to Inflater. (This is one of those fun JVM situations where having more heap space may make OOMEs more likely: less heap memory pressure leaves more un-GCd or un-finalized heap objects around, each of which is holding a bit of native memory.) - Gordon @ IA bzheng wrote: I have about 24k gz files (about 550GB total) on hdfs and has a really simple java program to convert them into sequence files. If the script's setInputPaths takes a Path[] of all 24k files, it will get a OutOfMemory error at about 35% map complete. If I make the script process 2k files per job and run 12 jobs consecutively, then it goes through all files fine. The cluster I'm using has about 67 nodes. Each nodes has 16GB memory, max 7 map, and max 2 reduce. The map task is really simple, it takes LongWritable as key and Text as value, generate a Text newKey, and output.collect(Text newKey, Text value). It doesn't have any code that can possibly leak memory. There's no stack trace for the vast majority of the OutOfMemory error, there's just a single line in the log like this: 2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker: java.lang.OutOfMemoryError: Java heap space I can't find the stack trace right now, but rarely the OutOfMemory error originates from some hadoop config array copy opertaion. There's no special config for the script.
Re: Reporter for Hadoop Streaming?
You can retrieve them from the command line using bin/hadoop job -counter job-id group-name counter-name Tom On Wed, Feb 11, 2009 at 12:20 AM, scruffy323 steve.mo...@gmail.com wrote: Do you know how to access those counters programmatically after the job has run? S D-5 wrote: This does it. Thanks! On Thu, Feb 5, 2009 at 9:14 PM, Arun C Murthy a...@yahoo-inc.com wrote: On Feb 5, 2009, at 1:40 PM, S D wrote: Is there a way to use the Reporter interface (or something similar such as Counters) with Hadoop streaming? Alternatively, can how could STDOUT be intercepted for the purpose of updates? If anyone could point me to documentation or examples that cover this I'd appreciate it. http://hadoop.apache.org/core/docs/current/streaming.html#How+do+I+update+counters+in+streaming+applications%3F http://hadoop.apache.org/core/docs/current/streaming.html#How+do+I+update+status+in+streaming+applications%3F Arun -- View this message in context: http://www.nabble.com/Reporter-for-Hadoop-Streaming--tp21861786p21945843.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
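For fetching the counters from Java after the job has run, a sketch along these lines should work, assuming a version where JobClient.getJob(JobID) is available and the JobTracker still remembers the (recently) completed job; the argument layout is made up for illustration.

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class PrintCounter {
  public static void main(String[] args) throws Exception {
    // args[0] = job id, args[1] = group name, args[2] = counter name
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob(JobID.forName(args[0]));
    if (job != null) {
      Counters counters = job.getCounters();
      System.out.println(counters.findCounter(args[1], args[2]).getCounter());
    }
  }
}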
Re: can't read the SequenceFile correctly
Hi Mark, Not all the bytes stored in a BytesWritable object are necessarily valid. Use BytesWritable#getLength() to determine how much of the buffer returned by BytesWritable#getBytes() to use. Tom On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I have written binary files to a SequenceFile, seemingly successfully, but when I read them back with the code below, after a first few reads I get the same number of bytes for the different files. What could go wrong? Thank you, Mark reader = new SequenceFile.Reader(fs, path, conf); Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf); Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf); long position = reader.getPosition(); while (reader.next(key, value)) { String syncSeen = reader.syncSeen() ? "*" : ""; byte[] fileBytes = ((BytesWritable) value).getBytes(); System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, fileBytes.length); position = reader.getPosition(); // beginning of next record }
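Following Tom's advice, a minimal change to the body of the loop above: only the first getLength() bytes of the backing array are valid for the current record.

BytesWritable bytes = (BytesWritable) value;
byte[] fileBytes = new byte[bytes.getLength()];
System.arraycopy(bytes.getBytes(), 0, fileBytes, 0, bytes.getLength());
System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, fileBytes.length);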
Re: Problem with Counters
Hi Sharath, The code you posted looks right to me. Counters#getCounter() will return the counter's value. What error are you getting? Tom On Thu, Feb 5, 2009 at 10:09 AM, some speed speed.s...@gmail.com wrote: Hi, Can someone help me with the usage of counters please? I am incrementing a counter in Reduce method but I am unable to collect the counter value after the job is completed. It's something like this: public static class Reduce extends MapReduceBase implements Reducer<Text, FloatWritable, Text, FloatWritable> { static enum MyCounter{ct_key1}; public void reduce(..) throws IOException { reporter.incrCounter(MyCounter.ct_key1, 1); output.collect(..); } } -main method { RunningJob running = null; running=JobClient.runJob(conf); Counters ct = running.getCounters(); /* How do I collect the ct_key1 value? */ long res = ct.getCounter(MyCounter.ct_key1); } Thanks, Sharath
Re: Problem with Counters
Try moving the enum to inside the top level class (as you already did) and then use getCounter() passing the enum value: public class MyJob { static enum MyCounter{ct_key1}; // Mapper and Reducer defined here public static void main(String[] args) throws IOException { // ... RunningJob running =JobClient.runJob(conf); Counters ct = running.getCounters(); long res = ct.getCounter(MyCounter.ct_key1); // ... } } BTW org.apache.hadoop.mapred.Task$Counter is a built-in MapReduce counter, so that won't help you retrieve your custom counter. Cheers, Tom On Thu, Feb 5, 2009 at 2:22 PM, Rasit OZDAS rasitoz...@gmail.com wrote: Sharath, You're using reporter.incrCounter(enumVal, intVal); to increment counter, I think method to get should also be similar. Try to use findCounter(enumVal).getCounter() or getCounter(enumVal). Hope this helps, Rasit 2009/2/5 some speed speed.s...@gmail.com: In fact I put the enum in my Reduce method as the following link (from Yahoo) says so: http://public.yahoo.com/gogate/hadoop-tutorial/html/module5.html#metrics ---Look at the section under Reporting Custom Metrics. 2009/2/5 some speed speed.s...@gmail.com Thanks Rasit. I did as you said. 1) Put the static enum MyCounter{ct_key1} just above main() 2) Changed result = ct.findCounter(org.apache.hadoop.mapred.Task$Counter, 1, Reduce.MyCounter).getCounter(); Still is doesnt seem to help. It throws a null pointer exception.Its not able to find the Counter. Thanks, Sharath On Thu, Feb 5, 2009 at 8:04 AM, Rasit OZDAS rasitoz...@gmail.com wrote: Forgot to say, value 0 means that the requested counter does not exist. 2009/2/5 Rasit OZDAS rasitoz...@gmail.com: Sharath, I think the static enum definition should be out of Reduce class. Hadoop probably tries to find it elsewhere with MyCounter, but it's actually Reduce.MyCounter in your example. Hope this helps, Rasit 2009/2/5 some speed speed.s...@gmail.com: I Tried the following...It gets compiled but the value of result seems to be 0 always. RunningJob running = JobClient.runJob(conf); Counters ct = new Counters(); ct = running.getCounters(); long result = ct.findCounter(org.apache.hadoop.mapred.Task$Counter, 0, *MyCounter*).getCounter(); //even tried MyCounter.Key1 Does anyone know whay that is happening? Thanks, Sharath On Thu, Feb 5, 2009 at 5:59 AM, some speed speed.s...@gmail.com wrote: Hi Tom, I get the error : Cannot find Symbol* **MyCounter.ct_key1 * On Thu, Feb 5, 2009 at 5:51 AM, Tom White t...@cloudera.com wrote: Hi Sharath, The code you posted looks right to me. Counters#getCounter() will return the counter's value. What error are you getting? Tom On Thu, Feb 5, 2009 at 10:09 AM, some speed speed.s...@gmail.com wrote: Hi, Can someone help me with the usage of counters please? I am incrementing a counter in Reduce method but I am unable to collect the counter value after the job is completed. Its something like this: public static class Reduce extends MapReduceBase implements ReducerText, FloatWritable, Text, FloatWritable { static enum MyCounter{ct_key1}; public void reduce(..) throws IOException { reporter.incrCounter(MyCounter.ct_key1, 1); output.collect(..); } } -main method { RunningJob running = null; running=JobClient.runJob(conf); Counters ct = running.getCounters(); /* How do I Collect the ct_key1 value ***/ long res = ct.getCounter(MyCounter.ct_key1); } Thanks, Sharath -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
Re: SequenceFiles, checkpoints, block size (Was: How to flush SequenceFile.Writer?)
Hi Brian, Writes to HDFS are not guaranteed to be flushed until the file is closed. In practice, as each (64MB) block is finished it is flushed and will be visible to other readers, which is what you were seeing. The addition of appends in HDFS changes this and adds a sync() method to FSDataOutputStream. You can read about the semantics of the new operations here: https://issues.apache.org/jira/secure/attachment/12370562/Appends.doc. Unfortunately, there are some problems with sync() that are still being worked through (https://issues.apache.org/jira/browse/HADOOP-4379). Also, even with sync() working, the append() on SequenceFile does not do an implicit sync() - it is not atomic. Furthermore, there is no way to get hold of the FSDataOutputStream to call sync() yourself - see https://issues.apache.org/jira/browse/HBASE-1155. (And don't get confused by the sync() method on SequenceFile.Writer - it is for another purpose entirely.) As Jason points out, the simplest way to achieve what you're trying to so is to close the file and start a new one. If you start to get too many small files, then you can have another process to merge the smaller files in the background. Tom On Tue, Feb 3, 2009 at 3:57 AM, jason hadoop jason.had...@gmail.com wrote: If you have to do a time based solution, for now, simply close the file and stage it, then open a new file. Your reads will have to deal with the fact the file is in multiple parts. Warning: Datanodes get pokey if they have large numbers of blocks, and the quickest way to do this is to create a lot of small files. On Mon, Feb 2, 2009 at 9:54 AM, Brian Long br...@dotspots.com wrote: Let me rephrase this problem... as stated below, when I start writing to a SequenceFile from an HDFS client, nothing is visible in HDFS until I've written 64M of data. This presents three problems: fsck reports the file system as corrupt until the first block is finally written out, the presence of the file (without any data) seems to blow up my mapred jobs that try to make use of it under my input path, and finally, I want to basically flush every 15 minutes or so so I can mapred the latest data. I don't see any programmatic way to force the file to flush in 17.2. Additionally, dfs.checkpoint.period does not seem to be obeyed. Does that not do what I think it does? What controls the 64M limit, anyway? Is it dfs.checkpoint.size or dfs.block.size? Is the buffering happening on the client, or on data nodes? Or in the namenode? It seems really bad that a SequenceFile, upon creation, is in an unusable state from the perspective of a mapred job, and also leaves fsck in a corrupt state. Surely I must be doing something wrong... but what? How can I ensure that a SequenceFile is immediately usable (but empty) on creation, and how can I make things flush on some regular time interval? Thanks, Brian On Thu, Jan 29, 2009 at 4:17 PM, Brian Long br...@dotspots.com wrote: I have a SequenceFile.Writer that I obtained via SequenceFile.createWriter and write to using append(key, value). Because the writer volume is low, it's not uncommon for it to take over a day for my appends to finally be flushed to HDFS (e.g. the new file will sit at 0 bytes for over a day). Because I am running map/reduce tasks on this data multiple times a day, I want to flush the sequence file so the mapred jobs can pick it up when they run. What's the right way to do this? I'm assuming it's a fairly common use case. Also -- are writes to the sequence files atomic? (e.g. 
if I am actively appending to a sequence file, is it always safe to read from that same file in a mapred job?) To be clear, I want the flushing to be time based (controlled explicitly by the app), not size based. Will this create waste in HDFS somehow? Thanks, Brian
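A sketch of the "close and roll" workaround suggested in this thread: close the current SequenceFile every flush interval so its records become visible to MapReduce jobs, and start a new file for subsequent appends. The class name, file naming scheme and 15-minute interval are placeholders; a separate background job could later merge the small files.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RollingSequenceFileWriter {

  private static final long ROLL_INTERVAL_MS = 15 * 60 * 1000L;

  private final FileSystem fs;
  private final Configuration conf;
  private final Path dir;
  private SequenceFile.Writer writer;
  private long lastRoll;

  public RollingSequenceFileWriter(FileSystem fs, Configuration conf, Path dir)
      throws IOException {
    this.fs = fs;
    this.conf = conf;
    this.dir = dir;
    roll();
  }

  public synchronized void append(Text key, Text value) throws IOException {
    if (System.currentTimeMillis() - lastRoll > ROLL_INTERVAL_MS) {
      roll();
    }
    writer.append(key, value);
  }

  private void roll() throws IOException {
    if (writer != null) {
      writer.close();  // closing makes the data visible to readers
    }
    Path file = new Path(dir, "part-" + System.currentTimeMillis());
    writer = SequenceFile.createWriter(fs, conf, file, Text.class, Text.class);
    lastRoll = System.currentTimeMillis();
  }

  public synchronized void close() throws IOException {
    writer.close();
  }
}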
Re: hadoop to ftp files into hdfs
NLineInputFormat is ideal for this purpose. Each split will be N lines of input (where N is configurable), so each mapper can retrieve N files for insertion into HDFS. You can set the number of redcers to zero. Tom On Tue, Feb 3, 2009 at 4:23 AM, jason hadoop jason.had...@gmail.com wrote: If you have a large number of ftp urls spread across many sites, simply set that file to be your hadoop job input, and force the input split to be a size that gives you good distribution across your cluster. On Mon, Feb 2, 2009 at 3:23 PM, Steve Morin steve.mo...@gmail.com wrote: Does any one have a good suggestion on how to submit a hadoop job that will split the ftp retrieval of a number of files for insertion into hdfs? I have been searching google for suggestions on this matter. Steve
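A sketch of the NLineInputFormat setup: the input is a text file of FTP URLs, one per line, each mapper gets N lines (URLs) to fetch and write into HDFS, and there is no reduce phase. The driver class, input path and lines-per-map value are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

// inside the job driver
JobConf conf = new JobConf(FtpFetchDriver.class);        // placeholder driver
conf.setInputFormat(NLineInputFormat.class);
conf.setInt("mapred.line.input.format.linespermap", 10);  // 10 URLs per map task
conf.setNumReduceTasks(0);                                // map-only job
FileInputFormat.setInputPaths(conf, new Path("/urls/ftp-list.txt"));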
Re: best way to copy all files from a file system to hdfs
Is there any reason why it has to be a single SequenceFile? You could write a local program to write several block compressed SequenceFiles in parallel (to HDFS), each containing a portion of the files on your PC. Tom On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner markkerz...@gmail.com wrote: Truly, I do not see any advantage to doing this, as opposed to writing (Java) code which will copy files to HDFS, because then tarring becomes my bottleneck. Unless I write code measure the file sizes and prepare pointers for multiple tarring tasks. It becomes pretty complex though, and I thought of something simple. I might as well accept that copying one hard drive to HDFS is not going to be parallelized. Mark On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer f...@infochimps.orgwrote: Could you tar.bz2 them up (setting up the tar so that it made a few dozen files), toss them onto the HDFS, and use http://stuartsierra.com/2008/04/24/a-million-little-files to go into SequenceFile? This lets you preserve the originals and do the sequence file conversion across the cluster. It's only really helpful, of course, if you also want to prepare a .tar.bz2 so you can clear out the sprawl flip On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I am writing an application to copy all files from a regular PC to a SequenceFile. I can surely do this by simply recursing all directories on my PC, but I wonder if there is any way to parallelize this, a MapReduce task even. Tom White's books seems to imply that it will have to be a custom application. Thank you, Mark -- http://www.infochimps.org Connected Open Free Data
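A sketch of one of the parallel writers suggested above: it walks a slice of the local files and appends each one to a block-compressed SequenceFile on HDFS, keyed by the original file name. Paths and the slicing scheme are placeholders; running several of these against different slices gives the parallelism.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class LocalFilesToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem local = FileSystem.getLocal(conf);
    FileSystem hdfs = FileSystem.get(conf);   // default FS assumed to be HDFS

    SequenceFile.Writer writer = SequenceFile.createWriter(hdfs, conf,
        new Path("/archive/slice-0.seq"),      // placeholder output path
        Text.class, BytesWritable.class, CompressionType.BLOCK);
    try {
      // args[0] = a local directory holding this writer's slice of the files
      for (Path file : FileUtil.stat2Paths(local.listStatus(new Path(args[0])))) {
        writer.append(new Text(file.toString()),
            new BytesWritable(readFile(local, file)));
      }
    } finally {
      writer.close();
    }
  }

  private static byte[] readFile(FileSystem fs, Path file) throws IOException {
    byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
    FSDataInputStream in = fs.open(file);
    try {
      in.readFully(0, buf);
    } finally {
      in.close();
    }
    return buf;
  }
}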
Re: A record version mismatch occured. Expecting v6, found v32
The SequenceFile format is described here: http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html. The format of the keys and values depends on the serialization classes used. For example, BytesWritable writes out the length of its byte array followed by the actual bytes in the array (see the write() method in BytesWritable). Hope this helps. Tom On Mon, Feb 2, 2009 at 3:21 PM, Rasit OZDAS rasitoz...@gmail.com wrote: I tried to use SequenceFile.Writer to convert my binaries into Sequence Files, I read the binary data with FileInputStream, getting all bytes with reader.read(byte[]) , wrote it to a file with SequenceFile.Writer, with parameters NullWritable as key, BytesWritable as value. But the content changes, (I can see that by converting to Base64) Binary File: 73 65 65 65 81 65 65 65 65 65 81 81 65 119 84 81 65 111 67 81 65 52 57 81 65 103 54 81 65 65 97 81 65 65 65 81 ... Sequence File: 73 65 65 65 65 69 65 65 65 65 65 65 65 69 66 65 65 77 66 77 81 103 67 103 67 69 77 65 52 80 86 67 65 73 68 114 ... Thanks for any points.. Rasit 2009/2/2 Rasit OZDAS rasitoz...@gmail.com Hi, I tried to use SequenceFileInputFormat, for this I appended SEQ as first bytes of my binary files (with hex editor). but I get this exception: A record version mismatch occured. Expecting v6, found v32 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412) at org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43) at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.Child.main(Child.java:155) What could it be? Is it not enough just to add SEQ to binary files? I use Hadoop v.0.19.0 . Thanks in advance.. Rasit different *version* of *Hadoop* between your server and your client. -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
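A small sketch of the point Tom makes about BytesWritable: its write() method emits a 4-byte length before the payload, so the record bytes inside a SequenceFile are not a verbatim copy of the original binary data, quite apart from the file header and sync markers.

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.DataOutputBuffer;

public class BytesWritableFraming {
  public static void main(String[] args) throws Exception {
    DataOutputBuffer out = new DataOutputBuffer();
    new BytesWritable(new byte[] { 1, 2, 3 }).write(out);
    // out.getData() starts with 00 00 00 03 (the length), then 01 02 03
    System.out.println("serialized length = " + out.getLength());  // prints 7
  }
}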
Re: MapFile.Reader and seek
You can use the get() method to seek and retrieve the value. It will return null if the key is not in the map. Something like: Text value = (Text) indexReader.get(from, new Text()); while (value != null ...) Tom On Thu, Jan 29, 2009 at 10:45 PM, schnitzi mark.schnitz...@fastsearch.com wrote: Greetings all... I have a situation where I want to read a range of keys and values out of a MapFile. So I have something like this: MapFile.Reader indexReader = new MapFile.Reader(fs, path.toString(), configuration) boolean seekSuccess = indexReader.seek(from); boolean readSuccess = indexReader.next(keyValue, value); while (readSuccess ...) The problem seems to be that while seekSuccess is returning true, when I call next() to get the value there, it's returning the value *after* the key that I called seek() on. So if, say, my keys are Text(id0) through Text(id9), and I seek for Text(id3), calling next() will return Text(id4) and its associated value, not Text(id3). I would expect next() to return the key/value at the seek location, not the one after it. Am I doing something wrong? Otherwise, what good is seek(), really? -- View this message in context: http://www.nabble.com/MapFile.Reader-and-seek-tp21737717p21737717.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
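A sketch of an inclusive range scan along the lines Tom suggests, assuming the keys and values are Text and that the start key is present in the map; fs, path, configuration, from and to are as in the original snippet.

MapFile.Reader indexReader = new MapFile.Reader(fs, path.toString(), configuration);
Text key = new Text();
Text value = new Text();
Text startValue = (Text) indexReader.get(from, value);
if (startValue != null) {
  // handle the entry for 'from' itself
}
// get() leaves the reader positioned just past 'from', so next() continues
// with the following entries
while (indexReader.next(key, value) && key.compareTo(to) <= 0) {
  // handle (key, value)
}
indexReader.close();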
Re: best way to copy all files from a file system to hdfs
Yes. SequenceFile is splittable, which means it can be broken into chunks, called splits, each of which can be processed by a separate map task. Tom On Mon, Feb 2, 2009 at 3:46 PM, Mark Kerzner markkerz...@gmail.com wrote: No, no reason for a single file - just a little simpler to think about. By the way, can multiple MapReduce workers read the same SequenceFile simultaneously? On Mon, Feb 2, 2009 at 9:42 AM, Tom White t...@cloudera.com wrote: Is there any reason why it has to be a single SequenceFile? You could write a local program to write several block compressed SequenceFiles in parallel (to HDFS), each containing a portion of the files on your PC. Tom On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner markkerz...@gmail.com wrote: Truly, I do not see any advantage to doing this, as opposed to writing (Java) code which will copy files to HDFS, because then tarring becomes my bottleneck. Unless I write code measure the file sizes and prepare pointers for multiple tarring tasks. It becomes pretty complex though, and I thought of something simple. I might as well accept that copying one hard drive to HDFS is not going to be parallelized. Mark On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer f...@infochimps.orgwrote: Could you tar.bz2 them up (setting up the tar so that it made a few dozen files), toss them onto the HDFS, and use http://stuartsierra.com/2008/04/24/a-million-little-files to go into SequenceFile? This lets you preserve the originals and do the sequence file conversion across the cluster. It's only really helpful, of course, if you also want to prepare a .tar.bz2 so you can clear out the sprawl flip On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I am writing an application to copy all files from a regular PC to a SequenceFile. I can surely do this by simply recursing all directories on my PC, but I wonder if there is any way to parallelize this, a MapReduce task even. Tom White's books seems to imply that it will have to be a custom application. Thank you, Mark -- http://www.infochimps.org Connected Open Free Data
Re: tools for scrubbing HDFS data nodes?
Each datanode has a web page at http://datanode:50075/blockScannerReport where you can see details about the scans. Tom On Thu, Jan 29, 2009 at 7:29 AM, Raghu Angadi rang...@yahoo-inc.com wrote: Owen O'Malley wrote: On Jan 28, 2009, at 6:16 PM, Sriram Rao wrote: By scrub I mean, have a tool that reads every block on a given data node. That way, I'd be able to find corrupted blocks proactively rather than having an app read the file and find it. The datanode already has a thread that checks the blocks periodically for exactly that purpose. since Hadoop 0.16.0. scans all the blocks every 3 weeks (by default, interval can be changed). Raghu.
Re: Set the Order of the Keys in Reduce
Hi Brian, The CAT_A and CAT_B keys will be processed by different reducer instances, so they run independently and may run in any order. What's the output that you're trying to get? Cheers, Tom On Thu, Jan 22, 2009 at 3:25 PM, Brian MacKay brian.mac...@medecision.com wrote: Hello, Any tips would be greatly appreciated. Is there a way to set the order of the keys in reduce as shown below, no matter what order the collection in MAP occurs in. Thanks, Brian public void map(WritableComparable key, Text values, OutputCollectorText, Text output, Reporter reporter) throws IOException { //collect many CAT_A and CAT_B in random order output.collect(CAT_A, details); output.collect(CAT_B, details); } public void reduce(Text key, IteratorText values, OutputCollectorText, Text output, Reporter reporter) throws IOException { //always reduce CAT_A first, then reduce CAT_B } _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this message in error, please contact the sender and delete the material from any computer.
Re: Archive?
Hi Mark, The archives are listed on http://wiki.apache.org/hadoop/MailingListArchives Tom On Thu, Jan 22, 2009 at 3:41 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, is there an archive to the messages? I am a newcomer, granted, but google groups has all the discussion capabilities, and it has a searchable archive. It is strange to have just a mailing list. Am I missing something? Thank you, Mark
Re: Set the Order of the Keys in Reduce
Reducers run independently and without knowledge of one another, so you can't get one reducer to depend on the output of another. I think having two jobs is the simplest way to achieve what you're trying to do. Tom On Thu, Jan 22, 2009 at 3:48 PM, Brian MacKay brian.mac...@medecision.com wrote: Hello Tom, Would like to apply some rules To CAT_A, then use the output of CAT_A to reduce CAT_B. I'd rather not run two JOBS, so perhaps I need two reducers? First Reducer processes CAT_A, then when complete second reducer does CAT_B? I suppose this would accomplish the same thing? -Original Message- From: Tom White [mailto:t...@cloudera.com] Sent: Thursday, January 22, 2009 10:41 AM To: core-user@hadoop.apache.org Subject: Re: Set the Order of the Keys in Reduce Hi Brian, The CAT_A and CAT_B keys will be processed by different reducer instances, so they run independently and may run in any order. What's the output that you're trying to get? Cheers, Tom On Thu, Jan 22, 2009 at 3:25 PM, Brian MacKay brian.mac...@medecision.com wrote: Hello, Any tips would be greatly appreciated. Is there a way to set the order of the keys in reduce as shown below, no matter what order the collection in MAP occurs in. Thanks, Brian public void map(WritableComparable key, Text values, OutputCollectorText, Text output, Reporter reporter) throws IOException { //collect many CAT_A and CAT_B in random order output.collect(CAT_A, details); output.collect(CAT_B, details); } public void reduce(Text key, IteratorText values, OutputCollectorText, Text output, Reporter reporter) throws IOException { //always reduce CAT_A first, then reduce CAT_B } _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this message in error, please contact the sender and delete the material from any computer. _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this message in error, please contact the sender and delete the material from any computer.
Re: Why does Hadoop need ssh access to master and slaves?
Hi Matthias, It is not necessary to have SSH set up to run Hadoop, but it does make things easier. SSH is used by the scripts in the bin directory which start and stop daemons across the cluster (the slave nodes are defined in the slaves file), see the start-all.sh script as a starting point. These scripts are a convenient way to control Hadoop, but there are other possibilities. If you had another system to control daemons on your cluster then you wouldn't need SSH. Tom On Wed, Jan 21, 2009 at 1:20 PM, Matthias Scherer matthias.sche...@1und1.de wrote: Hi Steve and Amit, Thanks for your answers. I agree with you that key-based ssh is nothing to worry about. But I'm wondering what exactly - that means wich grid administration tasks - hadoop does via ssh?! Does it restart crashed data nodes or tasks trackers on the slaves? Oder does it transfer data over the grid with ssh access? How can I find a short description what exactly hadoop needs ssh for? The documentation says only that I have to configure it. Thanks Regards Matthias -Ursprüngliche Nachricht- Von: Steve Loughran [mailto:ste...@apache.org] Gesendet: Mittwoch, 21. Januar 2009 13:59 An: core-user@hadoop.apache.org Betreff: Re: Why does Hadoop need ssh access to master and slaves? Amit k. Saha wrote: On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer matthias.sche...@1und1.de wrote: Hi all, we've made our first steps in evaluating hadoop. The setup of 2 VMs as a hadoop grid was very easy and works fine. Now our operations team wonders why hadoop has to be able to connect to the master and slaves via password-less ssh?! Can anyone give us an answer to this question? 1. There has to be a way to connect to the remote hosts- slaves and a secondary master, and SSH is the secure way to do it 2. It has to be password-less to enable automatic logins SSH is *a * secure way to do it, but not the only way. Other management tools can bring up hadoop clusters. Hadoop ships with scripted support for SSH as it is standard with Linux distros and generally the best way to bring up a remote console. Matthias, Your ops team should not be worrying about the SSH security, as long as they keep their keys under control. (a) Key-based SSH is more secure than passworded SSH, as man-in-middle attacks are prevented. passphrase protected SSH keys on external USB keys even better. (b) once the cluster is up, that filesystem is pretty vulnerable to anything on the LAN. You do need to lock down your datacentre, or set up the firewall/routing of the servers so that only trusted hosts can talk to the FS. SSH becomes a detail at that point.
Re: @hadoop on twitter
Thanks flip. I've signed up for the hadoop account - be great to get some help with getting it going. Tom On Wed, Jan 14, 2009 at 6:33 AM, Philip (flip) Kromer f...@infochimps.org wrote: Hey all, There is no @hadoop on twitter, but there should be. http://twitter.com/datamapper and http://twitter.com/rails both set good examples. I'd be glad to either help get that going or to nod approvingly if someone on core does so. flip
Re: Re: getting null from CompressionCodecFactory.getCodec(Path file)
LZO was removed due to license incompatibility: https://issues.apache.org/jira/browse/HADOOP-4874 Tom On Wed, Jan 14, 2009 at 11:18 AM, Gert Pfeifer pfei...@se.inf.tu-dresden.de wrote: I got it. For some reason getDefaultExtension() returns .lzo_deflate. Is that a bug? Shouldn't it be .lzo? In the head revision I couldn't find it at all in http://svn.apache.org/repos/asf/hadoop/core/trunk/src/core/org/apache/hadoop/io/compress/ There should be a class LzoCodec.java. Was that moved somewhere else? Gert Gert Pfeifer wrote: Arun C Murthy wrote: On Jan 13, 2009, at 7:29 AM, Gert Pfeifer wrote: Hi, I want to use an lzo file as input for a mapper. The record reader determines the codec using a CompressionCodecFactory, like this: (Hadoop version 0.19.0) http://hadoop.apache.org/core/docs/r0.19.0/native_libraries.html I should have mentioned that I have these native libs running: 2009-01-14 10:00:21,107 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library 2009-01-14 10:00:21,111 INFO org.apache.hadoop.io.compress.LzoCodec: Successfully loaded & initialized native-lzo library Is that what you tried to point out with this link? Gert hth, Arun compressionCodecs = new CompressionCodecFactory(job); System.out.println("Using codecFactory: " + compressionCodecs.toString()); final CompressionCodec codec = compressionCodecs.getCodec(file); System.out.println("Using codec: " + codec + " for file " + file.getName()); The output that I get is: Using codecFactory: { etalfed_ozl.: org.apache.hadoop.io.compress.LzoCodec } Using codec: null for file test.lzo Of course, the mapper does not work without a codec. What could be the problem? Thanks, Gert
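For reference, a minimal sketch of how the codec lookup above works, assuming an LzoCodec class is still on the classpath and registered via io.compression.codecs; getCodec matches on each codec's getDefaultExtension(), which is why a codec reporting .lzo_deflate returns null for a .lzo file:

    Configuration conf = new Configuration();
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec,"
        + "org.apache.hadoop.io.compress.GzipCodec,"
        + "org.apache.hadoop.io.compress.LzoCodec");
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(new Path("test.lzo"));
    if (codec == null) {
      // no registered codec has a default extension matching this file name
      System.err.println("No codec found for test.lzo");
    }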
Re: Problem with Hadoop and concatenated gzip files
I've opened https://issues.apache.org/jira/browse/HADOOP-5014 for this. Do you get this behaviour when you use the native libraries? Tom On Sat, Jan 10, 2009 at 12:26 AM, Oscar Gothberg oscar.gothb...@platform-a.com wrote: Hi, I'm having trouble with Hadoop (tested with 0.17 and 0.19) not fully processing certain gzipped input files. Basically it only actually reads and processes a first part of the gzipped file, and just ignores the rest without any kind of warning. It affects at least (but is maybe not limited to?) any gzip files that are the result of concatenation, which is legal in the gzip format: http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage Repro case, using the WordCount example from the hadoop distribution: $ echo 'one two three' > f1 $ echo 'four five six' > f2 $ gzip -c f1 > combined_file.gz $ gzip -c f2 >> combined_file.gz Now, if I run WordCount with combined_file.gz as input, it will only find the words 'one', 'two', 'three', but not 'four', 'five', 'six'. It seems Java's GZIPInputStream may have a similar issue: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425 Now, if I unzip and re-gzip this 'combined_file.gz' manually, the problem goes away. It's especially dangerous since Hadoop doesn't show any errors or complaints in the least. It just ignores this extra input. The only way of noticing is to run one's app with gzipped and unzipped data side by side and notice the record counts being different. Is anyone else familiar with this problem? Any solutions, workarounds, short of re-gzipping very large amounts of data? Thanks! / Oscar The information transmitted in this email is intended only for the person(s) or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this email in error, please contact the sender and permanently delete the email from any computer.
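For illustration, a minimal standalone sketch of the JDK behaviour referenced above (bug 4691425): on JDKs affected by that bug, GZIPInputStream stops at the end of the first gzip member, so reading the combined_file.gz from the repro prints only the first three words.

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.zip.GZIPInputStream;

    public class GzipConcatCheck {
      public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream("combined_file.gz"))));
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line); // on affected JDKs this prints only "one two three"
        }
        reader.close();
      }
    }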
Re: Concatenating PDF files
Hi Richard, Are you running out of memory after many PDFs have been processed by one mapper, or during the first? The former would suggest that memory isn't being released; the latter that the task VM doesn't have enough memory to start with. Are you setting the memory available to map tasks by setting mapred.child.java.opts? You can try to see how much memory the processes are using by logging into a machine when the job is running and running 'top' or 'ps'. It won't help the memory problems, but it sounds like you could run with zero reducers for this job (conf.setNumReduceTasks(0)). Also, EC2 XL instances can run more than two tasks per node (they have 4 virtual cores, see http://aws.amazon.com/ec2/instance-types/). And you should configure them to take advantage of multiple disks - https://issues.apache.org/jira/browse/HADOOP-4745. Tom On Fri, Jan 2, 2009 at 8:50 PM, Zak, Richard [USA] zak_rich...@bah.com wrote: All, I have a project that I am working on involving PDF files in HDFS. There are X directories and each directory contains Y PDFs, and per directory all the PDFs are to be concatenated. At the moment I am running a test with 5 directories and 15 PDFs in each directory. I am also using iText to handle the PDFs, and I wrote a wrapper class to take PDFs and add them to an internal PDF that grows. I am running this on Amazon's EC2 using Extra Large instances, which have a total of 15 GB RAM. Each Java process, two per Instance, has a 7GB maximum (-Xmx7000m). There is one Master Instance and 4 Slave Instances. I am able to confirm that the Slave processes are connected to the Master and have been working. I am using Hadoop 0.19.0. The problem is that I run out of memory when the concatenation class reads in a PDF. I have tried both the iText library version 2.1.4 and the Faceless PDF library, and both have the error in the middle of concatenating the documents. I looked into Multivalent, but that one just uses Strings to determine paths and it opens the files directly, while I am using a wrapper class to interact with items in HDFS, so Multivalent is out. The PDFs aren't enormous (17 MB or less) and each Instance has plenty of memory, so why am I running out of memory? The mapper works like this: it gets a text file with a list of directories, and per directory it reads in the contents and adds them to the concatenation class. The reducer pretty much does nothing. Is this the best way to do this, or is there a better way? Thank you! Richard J. Zak
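A minimal sketch of the two configuration suggestions above (0.19-era API); PdfConcatJob and the heap size are example values only:

    JobConf conf = new JobConf(PdfConcatJob.class);     // PdfConcatJob is a hypothetical driver class
    conf.set("mapred.child.java.opts", "-Xmx2000m");    // heap for each child task JVM
    conf.setNumReduceTasks(0);                          // map-only job: map output is written straight to the output path
    JobClient.runJob(conf);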
Re: Predefined counters
Hi Jim, Try something like: Counters counters = job.getCounters(); counters.findCounter("org.apache.hadoop.mapred.Task$Counter", "REDUCE_INPUT_RECORDS").getCounter() The pre-defined counters are unfortunately not public and are not in one place in the source code, so you'll need to hunt to find them (search the source for the counter name you see in the web UI). I opened https://issues.apache.org/jira/browse/HADOOP-4043 a while back to address the fact they are not public. Please consider voting for it if you think it would be useful. Cheers, Tom On Mon, Dec 22, 2008 at 2:47 AM, Jim Twensky jim.twen...@gmail.com wrote: Hello, I need to collect some statistics using some of the counters defined by the Map/Reduce framework, such as "Reduce input records". I know I should use the getCounter method from Counters.Counter but I couldn't figure out how to use it. Can someone give me a two-line example of how to read the values for those counters and where I can find the names/groups of the predefined counters? Thanks in advance, Jim
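Putting that fragment into context, a minimal sketch of reading a built-in counter once a job has finished (0.19-era API); the group and counter names are the ones quoted in the reply above:

    RunningJob job = JobClient.runJob(conf);
    Counters counters = job.getCounters();
    long reduceInputRecords = counters.findCounter(
        "org.apache.hadoop.mapred.Task$Counter",
        "REDUCE_INPUT_RECORDS").getCounter();
    System.out.println("Reduce input records: " + reduceInputRecords);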
Re: contrib/ec2 USER_DATA not used
Hi Stefan, The USER_DATA line is a hangover from the way that these parameters used to be passed to the node. This line can safely be removed, since the scripts now pass the data in the USER_DATA_FILE as you rightly point out. Tom On Thu, Dec 18, 2008 at 10:09 AM, Stefan Groschupf s...@101tec.com wrote: Hi, can someone tell me what the variable USER_DATA in launch-hadoop-master is all about? I can't see that it is used in that script or any other script. Isn't the USER_DATA_FILE the way those parameters are passed to the nodes? The line is: USER_DATA=MASTER_HOST=master,MAX_MAP_TASKS=$MAX_MAP_TASKS,MAX_REDUCE_TASKS=$MAX_REDUCE_TASKS,COMPRESS=$COMPRESS Any hints? Thanks, Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com
Re: EC2 Usage?
Hi Ryan, The ec2-describe-instances command in the API tools reports the launch time for each instance, so you could work out the machine hours of your cluster using that information. Tom On Thu, Dec 18, 2008 at 4:59 PM, Ryan LeCompte lecom...@gmail.com wrote: Hello all, Somewhat of an off-topic question, but I know there are Hadoop + EC2 users here. Does anyone know if there is a programmatic API to find out how many machine hours have been used by a Hadoop cluster (or anything) running on EC2? I know that you can log into the EC2 web site and see this, but I'm wondering if there's a way to access this data programmatically via web services? Thanks, Ryan
Re: API Documentation question - WritableComparable
I've opened https://issues.apache.org/jira/browse/HADOOP-4881 and attached a patch to fix this. Tom On Fri, Dec 12, 2008 at 2:18 AM, Tarandeep Singh tarand...@gmail.com wrote: The example is just to illustrate how one should implement one's own WritableComparable class, and in the compareTo method it is just showing how it works in the case of IntWritable, with value as its member variable. You are right, the example's code is misleading. It should have used either timestamp or counter or both, and not value. -Taran On Thu, Dec 11, 2008 at 3:55 PM, Andy Sautins andy.saut...@returnpath.net wrote: I have a question regarding the Hadoop API documentation for 0.19. The question is in regard to: http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/WritableComparable.html. The document shows the following for the compareTo method: public int compareTo(MyWritableComparable w) { int thisValue = this.value; int thatValue = ((IntWritable)o).value; return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1)); } Taking the full class example doesn't compile. What I _think_ would be right would be: public int compareTo(Object o) { int thisValue = this.value; int thatValue = ((MyWritableComparable)o).value; return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1)); } But even at that it's unclear why the compareTo function is comparing value (which isn't a member of the class in the example) and not the counter and timestamp variables in the class. Am I understanding this right? Is there something amiss with the documentation? Thanks Andy
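For completeness, a minimal sketch of what the API doc example appears to intend, comparing on the class's own counter and timestamp fields; this is an illustrative reconstruction, not the shipped Javadoc:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class MyWritableComparable implements WritableComparable {
      private int counter;
      private long timestamp;

      public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
      }

      public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
      }

      public int compareTo(Object o) {
        MyWritableComparable that = (MyWritableComparable) o;
        if (counter != that.counter) {
          return counter < that.counter ? -1 : 1;
        }
        return timestamp < that.timestamp ? -1 : (timestamp == that.timestamp ? 0 : 1);
      }
    }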
Re: When I system.out.println() in a map or reduce, where does it go?
You can also see the logs from the web UI (http://jobtracker:50030 by default), by clicking through to the map or reduce task that you are interested in and looking at the page for task attempts. Tom On Wed, Dec 10, 2008 at 10:41 PM, Tarandeep Singh [EMAIL PROTECTED] wrote: You can see the output in the hadoop log directory (if you have used default settings, it would be $HADOOP_HOME/logs/userlogs) On Wed, Dec 10, 2008 at 1:31 PM, David Coe [EMAIL PROTECTED] wrote: I've noticed that if I put a System.out.println in the run() method I see the result on my console. If I put it in the map or reduce class, I never see the result. Where does it go? Is there a way to get this result easily (e.g. dump it in a log file)? David
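As an alternative to System.out.println, a minimal sketch of logging from inside a mapper with Commons Logging (0.19-era API); MyMapper is a hypothetical class, and the messages end up in the same per-task logs described above:

    import java.io.IOException;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private static final Log LOG = LogFactory.getLog(MyMapper.class);

      public void map(LongWritable key, Text value,
          OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        LOG.info("Processing record: " + value);  // appears in the task attempt's logs
        // ... map logic ...
      }
    }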
Re: Auto-shutdown for EC2 clusters
I've just created a basic script to do something similar for running a benchmark on EC2. See https://issues.apache.org/jira/browse/HADOOP-4382. As it stands the code for detecting when Hadoop is ready to accept jobs is simplistic, to say the least, so any ideas for improvement would be great. Thanks, Tom On Fri, Oct 24, 2008 at 11:53 PM, Chris K Wensel [EMAIL PROTECTED] wrote: fyi, the src/contrib/ec2 scripts do just what Paco suggests. minus the static IP stuff (you can use the scripts to login via cluster name, and spawn a tunnel for browsing nodes) that is, you can spawn any number of uniquely named, configured, and sized clusters, and you can increase their size independently as well. (shrinking is another matter altogether) ckw On Oct 24, 2008, at 1:58 PM, Paco NATHAN wrote: Hi Karl, Rather than using separate key pairs, you can use EC2 security groups to keep track of different clusters. Effectively, that requires a new security group for every cluster -- so just allocate a bunch of different ones in a config file, then have the launch scripts draw from those. We also use EC2 static IP addresses and then have a DNS entry named similarly to each security group, associated with a static IP once that cluster is launched. It's relatively simple to query the running instances and collect them according to security groups. One way to handle detecting failures is just to attempt SSH in a loop. Our rough estimate is that approximately 2% of the attempted EC2 nodes fail at launch. So we allocate more than enough, given that rate. In a nutshell, that's one approach for managing a Hadoop cluster remotely on EC2. Best, Paco On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson [EMAIL PROTECTED] wrote: On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote: This workflow could be initiated from a crontab -- totally automated. However, we still see occasional failures of the cluster, and must restart manually, but not often. Stability for that has improved much since the 0.18 release. For us, it's getting closer to total automation. FWIW, that's running on EC2 m1.xl instances. Same here. I've always had the namenode and web interface be accessible, but sometimes I don't get the slave nodes - usually zero slaves when this happens, sometimes I only miss one or two. My rough estimate is that this happens 1% of the time. I currently have to notice this and restart manually. Do you have a good way to detect it? I have several Hadoop clusters running at once with the same AWS image and SSH keypair, so I can't count running instances. I could have a separate keypair per cluster and count instances with that keypair, but I'd like to be able to start clusters opportunistically, with more than one cluster doing the same kind of job on different data. Karl Anderson [EMAIL PROTECTED] http://monkey.org/~kra -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
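One possible (hypothetical) refinement of the readiness check discussed above, sketched against the 0.18-era client API: poll the jobtracker until at least one tasktracker has reported in before submitting work. The host and port are example values.

    // A hypothetical helper: block until the jobtracker reports a live tasktracker.
    static void waitForCluster(Configuration conf) throws IOException, InterruptedException {
      InetSocketAddress jobTracker = new InetSocketAddress("jobtracker.example.com", 9001);
      while (true) {
        try {
          JobClient client = new JobClient(jobTracker, conf);
          ClusterStatus status = client.getClusterStatus();
          client.close();
          if (status.getTaskTrackers() > 0) {
            return;                       // cluster is up and has capacity
          }
        } catch (IOException e) {
          // jobtracker RPC not reachable yet; fall through and retry
        }
        Thread.sleep(10 * 1000);
      }
    }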
Google Terasort Benchmark
From the Google Blog, http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html We are excited to announce we were able to sort 1TB (stored on the Google File System as 10 billion 100-byte records in uncompressed text files) on 1,000 computers in 68 seconds. By comparison, the previous 1TB sorting record [using Hadoop] is 209 seconds on 910 computers. Something for the Hadoop community to aim for: a threefold performance increase. Tom
Re: Hadoop Book
The Rough Cut of the book is now available from http://oreilly.com/catalog/9780596521998/index.html. There are a few chapters available already, at various stages of completion. I'd love to hear any suggestions for improvements that you may have. You can give feedback on the Safari website where the book is hosted (please don't post it on this thread). As the Rough Cuts FAQ explains (http://oreilly.com/roughcuts/faq.csp), the most valuable feedback is on missing topics, if something is not understandable, and technical mistakes. Thanks, Tom 2008/9/4 叶双明 [EMAIL PROTECTED]: waiting for it!!! 2008/9/5, Owen O'Malley [EMAIL PROTECTED]: On Sep 4, 2008, at 6:36 AM, 叶双明 wrote: what book? To summarize, Tom White is writing a book about Hadoop. He will post a message to the list when a draft is ready. -- Owen
Re: Parameterized deserializers?
If you make your Serialization implement Configurable it will be given a Configuration object that it can pass to the Deserializer on construction. Also, this thread may be related: http://www.nabble.com/Serialization-with-additional-schema-info-td19260579.html Tom On Sat, Sep 13, 2008 at 12:38 AM, Pete Wyckoff [EMAIL PROTECTED] wrote: I should mention this is out of the context of SequenceFiles, where we get the class names in the file itself. Here there is some information inserted into the JobConf that tells me the class of the records in the input file. -- pete On 9/12/08 3:26 PM, Pete Wyckoff [EMAIL PROTECTED] wrote: If I have generic Serializers/Deserializers that take some runtime information to instantiate, how would this work in the current serializer/deserializer APIs? And depending on this runtime information, they may return different objects, although they may all derive from the same class. For example, for Thrift, I may have something called a ThriftSerializer that is general: {code} public class ThriftDeserializer<T extends ThriftBase> implements Deserializer { T deserialize(T t); } {code} How would I instantiate this, since the current getDeserializer takes only the Class but not a configuration object? How would I implement createKey in RecordReader? In other words, I think we need a {code} Class<?> getClass(); {code} method in Deserializer and a {code} Deserializer getDeserializer(Class, Configuration conf); {code} method in Serializer.java. Or is there another way to do this? If not, I can open a JIRA for implementing parameterized serializers. Thanks, pete
Re: Hadoop EC2
On Thu, Sep 4, 2008 at 1:46 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: I'm noticing that using bin/hadoop fs -put ... s3://... is uploading multi-gigabyte files in ~64MB chunks. That's because the S3 FileSystem stores files as 64MB blocks on S3. Then this is copied from S3 into HDFS using bin/hadoop distcp. Once the files are there and the job begins, it looks like it's breaking up the 4 multi-gigabyte text files into about 225 maps. Does this mean that each map is processing roughly 64MB of data? Yes, HDFS stores files as 64MB blocks too, and map input is split by default so each map processes one block. If so, is there any way to change this so that I can get my map tasks to process more data at a time? I'm curious if this will shorten the time it takes to run the program. You could try increasing the HDFS block size. 128MB is actually usually a better value, for this very reason. In the future https://issues.apache.org/jira/browse/HADOOP-2560 will help here too. Tom, in your article about Hadoop + EC2 you mention processing about 100GB of logs in under 6 minutes or so. In this article: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873, it took 35 minutes to run the job. I'm planning on doing some benchmarking on EC2 fairly soon, which should help us improve the performance of Hadoop on EC2. It's worth remarking that this was running on small instances. The larger instances perform a lot better in my experience. Do you remember how many EC2 instances you had running, and also how many map tasks did you have to operate on the 100GB? Was each map task handling about 1GB each? I was running 20 nodes, and each map task was handling an HDFS block, 64MB. Hope this helps, Tom
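A minimal sketch of the block size suggestion above; dfs.block.size is given in bytes and only affects files written with that configuration (existing files keep the block size they were written with):

    Configuration conf = new Configuration();
    conf.setLong("dfs.block.size", 128 * 1024 * 1024);   // 128MB blocks for newly written files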
Re: Reading and writing Thrift data from MapReduce
Hi Juho, I think you should be able to use the Thrift serialization stuff that I've been working on in https://issues.apache.org/jira/browse/HADOOP-3787 - at least as a basis. Since you are not using sequence files, you will need to write an InputFormat (probably one that extends FileInputFormat) and an associated RecordReader that knows how to break the input into logical records. See SequenceFileInputFormat for the kind of thing. Also, since you store the Thrift type's name in the data, you can use a variant of ThriftDeserializer that first reads the type name and instantiates an instance of the type before reading its fields from the stream. Hope this helps. Tom On Wed, Sep 3, 2008 at 8:32 AM, Juho Mäkinen [EMAIL PROTECTED] wrote: Thanks Jeff. I believe that you mean the serde module inside hadoop (hadoop-core-trunk\src\contrib\hive\serde)? I'm currently looking into it, but it seems to lack a lot of useful documentation so it'll take me some time to figure it out (all additional info is appreciated). I've already put some effort into this and designed a partial solution for my log analysis which so far seems ok to me. As I don't know the details of serde yet, I'm not sure if this is the way I should go, or whether I should change my implementation and plans so that I could use serde (if it makes my job easier). I'm not yet interested in Hive, but I'd like to keep the option open in the future, so that I could easily run Hive on my data (so that I would not need to transform my data for Hive if I choose to use it in the future). Currently I've come up with the following design: 1) Each log event type has its own Thrift structure. The structure is compiled into PHP code. The log entry creators create and populate the structure PHP object with data and send it to be stored. 2) The log sender object receives this object ($tbase) and serializes it using TBinaryTransport, adds the structure name to the beginning and sends the byte array to the log receiver using UDP. The following code does this: $this->transport = new TResetableMemoryBuffer(); // a TMemoryBuffer with a reset() method $this->protocol = new TBinaryProtocol($this->transport); $this->transport->open(); $this->transport->reset(); // Reset the memory buffer array $this->protocol->writeByte(1); // version 1: we have the TBase name in string $this->protocol->writeString($tbase->getName()); // Name of the structure $tbase->write($this->protocol); // Serialize our thrift structure to the memory buffer $this->sendBytes($this->transport->getBuffer()); 3) The log receiver reads the structure name and stores the byte array (without the version byte and structure name) into the HDFS file /events/insert structure name here/week number/timestamp.datafile My plan is that I could read the stored entries using MapReduce, deserialize them into Java objects (the map-reducer would need to have the Thrift-compiled structures available) and use the structures directly in Map operations. (How) can serde help me with this part? Should I modify my plans so that I could use Hive directly in the future? How does Hive store the Thrift-serialized log data in HDFS? - Juho Mäkinen On Wed, Sep 3, 2008 at 7:37 AM, Jeff Hammerbacher [EMAIL PROTECTED] wrote: Hey Juho, You should check out Hive (https://issues.apache.org/jira/browse/HADOOP-3601), which was just committed to the Hadoop trunk today. It's what we use at Facebook to query our collection of Thrift-serialized logfiles. Inside of the Hive code, you'll find a pure-Java (using JavaCC) parser for Thrift-serialized data structures. 
Regards, Jeff On Tue, Sep 2, 2008 at 6:57 AM, Stuart Sierra [EMAIL PROTECTED] wrote: On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen [EMAIL PROTECTED] wrote: What's the current status of Thrift with Hadoop? Is there any documentation online or even some code in the SVN which I could look into? I think you have two choices: 1) wrap your Thrift code in a class that implements Writable, or 2) use Thrift to serialize your data to byte arrays and store them as BytesWritable. -Stuart
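To make the InputFormat suggestion in the reply above concrete, here is a rough skeleton against the 0.18-era org.apache.hadoop.mapred API; ThriftEventInputFormat and ThriftEventRecordReader are invented names, and the record-boundary logic is left out:

    public class ThriftEventInputFormat extends FileInputFormat<Text, BytesWritable> {

      protected boolean isSplitable(FileSystem fs, Path file) {
        // records are variable length and not block aligned, so don't split files
        return false;
      }

      public RecordReader<Text, BytesWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        // ThriftEventRecordReader would read the type name from each record,
        // then hand the remaining bytes to a Thrift deserializer for that type
        return new ThriftEventRecordReader((FileSplit) split, job);
      }
    }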
Re: Error while uploading large file to S3 via Hadoop 0.18
For the s3:// filesystem, files are split into 64MB blocks which are sent to S3 individually. Rather than increase the jets3t.properties retry buffer and retry count, it is better to change the Hadoop properties fs.s3.maxRetries and fs.s3.sleepTimeSeconds, since the Hadoop-level retry mechanism retries the whole block transfer, and the block is stored on disk, so it doesn't consume memory. (The jets3t mechanism is still useful for metadata operation retries.) See https://issues.apache.org/jira/browse/HADOOP-997 for background. Tom On Tue, Sep 2, 2008 at 4:23 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Actually not if you're using the s3:// as opposed to s3n:// ... Thanks, Ryan On Tue, Sep 2, 2008 at 11:21 AM, James Moore [EMAIL PROTECTED] wrote: On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello, I'm trying to upload a fairly large file (18GB or so) to my AWS S3 account via bin/hadoop fs -put ... s3://... Isn't the maximum size of a file on s3 5GB? -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
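A minimal sketch of setting the Hadoop-level retry properties named above; the values shown are arbitrary examples:

    Configuration conf = new Configuration();
    conf.setInt("fs.s3.maxRetries", 7);          // retries per block transfer to S3
    conf.setInt("fs.s3.sleepTimeSeconds", 30);   // pause between retries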
Re: Hadoop EC2
There's a case study with some numbers in it from a presentation I gave on Hadoop and AWS in London last month, which you may find interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf. tim robertson [EMAIL PROTECTED] wrote: For these small datasets, you might find it useful - let me know if I should spend time finishing it (Or submit help?) - it is really very simple. This sounds very useful. Please consider creating a Jira and submitting the code (even if it's not finished folks might like to see it). Thanks. Tom Cheers Tim On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hi Tim, Are you mostly just processing/parsing textual log files? How many maps/reduces did you configure in your hadoop-ec2-env.sh file? How many did you configure in your JobConf? Just trying to get an idea of what to expect in terms of performance. I'm noticing that it takes about 16 minutes to transfer about 15GB of textual uncompressed data from S3 into HDFS after the cluster has started with 15 nodes. I was expecting this to take a shorter amount of time, but maybe I'm incorrect in my assumptions. I am also noticing that it takes about 15 minutes to parse through the 15GB of data with a 15 node cluster. Thanks, Ryan On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote: I have been processing only 100s GBs on EC2, not 1000's and using 20 nodes and really only in exploration and testing phase right now. On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote: Hi Ryan, Just a heads up, if you require more than the 20 node limit, Amazon provides a form to request a higher limit: http://www.amazon.com/gp/html-forms-controller/ec2-request Andrew On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello all, I'm curious to see how many people are using EC2 to execute their Hadoop cluster and map/reduce programs, and how many are using home-grown datacenters. It seems like the 20 node limit with EC2 is a bit crippling when one wants to process many gigabytes of data. Has anyone found this to be the case? How much data are people processing with their 20 node limit on EC2? Curious what the thoughts are... Thanks, Ryan
Re: Hadoop Book
Lukáš, Feris, I'll be sure to post a message to the list when the book's available as a Rough Cut. Tom 2008/8/28 Feris Thia [EMAIL PROTECTED]: Agree... I will be glad to be early notified about the release :) Regards, Feris 2008/8/29 Lukáš Vlček [EMAIL PROTECTED] Tom, Do you think you could drop a small note into this list once it is available? Lukas
Re: Hadoop EC2
On Wed, Sep 3, 2008 at 3:05 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Tom, I noticed that you mentioned using Amazon's new elastic block store as an alternative to using S3. Right now I'm testing pushing data to S3, then moving it from S3 into HDFS once the Hadoop cluster is up and running in EC2. It works pretty well -- moving data from S3 to HDFS is fast when the data in S3 is broken up into multiple files, since bin/hadoop distcp uses a Map/Reduce job to efficiently transfer the data. Yes, this is a good-enough solution for many applications. Are there any real advantages to using the new elastic block store? Is moving data from the elastic block store into HDFS any faster than doing it from S3? Or can HDFS essentially live inside of the elastic block store? Bandwidth between EBS and EC2 is better than between S3 and EC2, so if you intend to run MapReduce on your data then you might consider running an elastic Hadoop cluster that stores data on EBS-backed HDFS. The nice thing is that you can shut down the cluster when you're not using it and then restart it later. But if you have other applications that need to access data from S3, then this may not be appropriate. Also, it may not be as fast as HDFS using local disks for storage. This is a new area, and I haven't done any measurements, so a lot of this is conjecture on my part. Hadoop on EBS doesn't exist yet - but it looks like a natural fit. Thanks! Ryan On Wed, Sep 3, 2008 at 9:54 AM, Tom White [EMAIL PROTECTED] wrote: There's a case study with some numbers in it from a presentation I gave on Hadoop and AWS in London last month, which you may find interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf. tim robertson [EMAIL PROTECTED] wrote: For these small datasets, you might find it useful - let me know if I should spend time finishing it (Or submit help?) - it is really very simple. This sounds very useful. Please consider creating a Jira and submitting the code (even if it's not finished folks might like to see it). Thanks. Tom Cheers Tim On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hi Tim, Are you mostly just processing/parsing textual log files? How many maps/reduces did you configure in your hadoop-ec2-env.sh file? How many did you configure in your JobConf? Just trying to get an idea of what to expect in terms of performance. I'm noticing that it takes about 16 minutes to transfer about 15GB of textual uncompressed data from S3 into HDFS after the cluster has started with 15 nodes. I was expecting this to take a shorter amount of time, but maybe I'm incorrect in my assumptions. I am also noticing that it takes about 15 minutes to parse through the 15GB of data with a 15 node cluster. Thanks, Ryan On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote: I have been processing only 100s GBs on EC2, not 1000's and using 20 nodes and really only in exploration and testing phase right now. On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote: Hi Ryan, Just a heads up, if you require more than the 20 node limit, Amazon provides a form to request a higher limit: http://www.amazon.com/gp/html-forms-controller/ec2-request Andrew On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello all, I'm curious to see how many people are using EC2 to execute their Hadoop cluster and map/reduce programs, and how many are using home-grown datacenters. 
It seems like the 20 node limit with EC2 is a bit crippling when one wants to process many gigabytes of data. Has anyone found this to be the case? How much data are people processing with their 20 node limit on EC2? Curious what the thoughts are... Thanks, Ryan
Re: EC2 AMI for Hadoop 0.18.0
I've just created public AMIs for 0.18.0. Note that they are in the hadoop-images bucket. Tom On Fri, Aug 29, 2008 at 9:22 PM, Karl Anderson [EMAIL PROTECTED] wrote: On 29-Aug-08, at 6:49 AM, Stuart Sierra wrote: Anybody have one? Any success building it with create-hadoop-image? Thanks, -Stuart I was able to build one following the instructions in the wiki. You'll need to find the Java download url (see wiki) and put it and your own S3 bucket name in hadoop-ec2-env.sh.
Re: Hadoop Book
That's right, I'm writing a book on Hadoop for O'Reilly. It will be a part of the Rough Cuts program (http://oreilly.com/roughcuts/), which means it'll be available as writing progresses. Tom 2008/8/28 Lukáš Vlček [EMAIL PROTECTED]: BTW: I found (http://skillsmatter.com/custom/presentations/ec2-talk.pdf) that Tom White is working on Hadoop book now. Lukas 2008/8/26 Feris Thia [EMAIL PROTECTED] Hi Lukas, I've check on Youtube.. and yes, there are many explanations on Hadoop. Thanks for your guide :) Regards, Feris On Tue, Aug 26, 2008 at 1:39 AM, Lukáš Vlček [EMAIL PROTECTED] wrote: Hi, As far as I know, there is no Hadoop specific book yet. However; you can find several interesting video presentations from Google or Yahoo! Hadoop meetings. There are good tutorials on the net as well as several interesting blog posts (sevearl people involved in Hadoop development do regularly blog about Hadoop) and you can read user and dev mail lists (and you can also ask questions there! - you can not do this with the book). On the other hand Hadoop is under development and as a such API can change and new fatures can be added every day. Hadoop is not settled down the same way the Oracle is now. But I am *sure* the book about Hadoop is comming in the future because there is a demand... Regards, Lukas -- http://blog.lukas-vlcek.com/ -- http://blog.lukas-vlcek.com/
Re: Namenode Exceptions with S3
On Thu, Jul 17, 2008 at 6:16 PM, Doug Cutting [EMAIL PROTECTED] wrote: Can't one work around this by using a different configuration on the client than on the namenodes and datanodes? The client should be able to set fs.default.name to an s3: uri, while the namenode and datanode must have it set to an hdfs: uri, no? Yes, that's a good solution. It might be less confusing if the HDFS daemons didn't use fs.default.name to define the namenode host and port. Just like mapred.job.tracker defines the host and port for the jobtracker, dfs.namenode.address (or similar) could define the namenode. Would this be a good change to make? Probably. For back-compatibility we could leave it empty by default, deferring to fs.default.name, only if folks specify a non-empty dfs.namenode.address would it be used. I've opened https://issues.apache.org/jira/browse/HADOOP-3782 for this. Tom
Re: Namenode Exceptions with S3
On Thu, Jul 10, 2008 at 10:06 PM, Lincoln Ritter [EMAIL PROTECTED] wrote: Thank you, Tom. Forgive me for being dense, but I don't understand your reply: Sorry! I'll try to explain it better (see below). Do you mean that it is possible to use the Hadoop daemons with S3 but the default filesystem must be HDFS? The HDFS daemons use the value of fs.default.name to set the namenode host and port, so if you set it to a s3 URI, you can't run the HDFS daemons. So in this case you would use the start-mapred.sh script instead of start-all.sh. If that is the case, can I specify the output filesystem on a per-job basis and can that be an S3 FS? Yes, that's exactly how you do it. Also, is there a particular reason to not allow S3 as the default FS? You can allow S3 as the default FS, it's just that then you can't run HDFS at all in this case. You would only do this if you don't want to use HDFS at all, for example, if you were running a MapReduce job which read from S3 and wrote to S3. It might be less confusing if the HDFS daemons didn't use fs.default.name to define the namenode host and port. Just like mapred.job.tracker defines the host and port for the jobtracker, dfs.namenode.address (or similar) could define the namenode. Would this be a good change to make? Tom
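A minimal sketch of the per-job arrangement described above (0.18-era API): HDFS stays the default filesystem, and the job's output path is simply given as an s3 URI. The class, host and bucket names are placeholders:

    JobConf conf = new JobConf(MyJob.class);   // MyJob is a hypothetical driver class
    FileInputFormat.setInputPaths(conf, new Path("hdfs://namenode:9000/user/me/input"));
    FileOutputFormat.setOutputPath(conf, new Path("s3://my-bucket/output"));
    JobClient.runJob(conf);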
Re: Namenode Exceptions with S3
On Fri, Jul 11, 2008 at 9:09 PM, slitz [EMAIL PROTECTED] wrote: a) Use S3 only, without HDFS and configuring fs.default.name as s3://bucket - PROBLEM: we are getting ERROR org.apache.hadoop.dfs.NameNode: java.lang.RuntimeException: Not a host:port pair: X What command are you using to start Hadoop? b) Use HDFS as the default FS, specifying S3 only as input for the first Job and output for the last(assuming one has multiple jobs on same data) - PROBLEM: https://issues.apache.org/jira/browse/HADOOP-3733 Yes, this is a problem. I've added a comment to the Jira description describing a workaround. Tom
Re: Namenode Exceptions with S3
I get (where the all-caps portions are the actual values...): 2008-07-01 19:05:17,540 ERROR org.apache.hadoop.dfs.NameNode: java.lang.NumberFormatException: For input string: [EMAIL PROTECTED] at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) at java.lang.Integer.parseInt(Integer.java:447) at java.lang.Integer.parseInt(Integer.java:497) at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:128) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857) These exceptions are taken from the namenode log. The datanode logs show the same exceptions. If you make the default filesystem S3 then you can't run HDFS daemons. If you want to run HDFS and use an S3 filesystem, you need to make the default filesystem a hdfs URI, and use s3 URIs to reference S3 filesystems. Hope this helps. Tom
Re: Hadoop on EC2 + S3 - best practice?
Hi Tim, The steps you outline look about right. Because your file is 5GB you will need to use the S3 block file system, which has a s3 URL. (See http://wiki.apache.org/hadoop/AmazonS3) You shouldn't have to build your own AMI unless you have dependencies that can't be submitted as a part of the MapReduce job. To read and write to S3 you can just use s3 URLs. Otherwise you can use distcp to copy between S3 and HDFS before and after running your job. This article I wrote has some more tips: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873 Hope that helps, Tom On Sat, Jun 28, 2008 at 10:24 AM, tim robertson [EMAIL PROTECTED] wrote: Hi all, I have data in a file (150million lines at 100Gb or so) and have several MapReduce classes for my processing (custom index generation). Can someone please confirm the following is the best way to run on EC2 and S3 (both of which I am new to..) 1) load my 100Gb file into S3 2) create a class that will load the file from S3 and use as input to mapreduce (S3 not used during processing) and save output back to S3 3) create an AMI with the Hadoop + dependencies and my Jar file (loading the S3 input and the MR code) - I will base this on the public Hadoop AMI I guess 4) run using the standard scripts Is this best practice? I assume this is pretty common... is there a better way where I can submit my Jar at runtime and just pass in the URL for the input and output files in S3? If not, has anyone an example that takes input from S3 and writes output to S3 also? Thanks for advice, or suggestions of best way to run. Tim
Re: hadoop on Solaris
I've successfully run Hadoop on Solaris 5.10 (on Intel). The path included /usr/ucb so whoami was picked up correctly. Satoshi, you say you added /usr/ucb to you path too, so I'm puzzled why you get a LoginException saying whoami: not found - did you export your changes to path? I've also managed to test and build Hadoop on Solaris. From 0.17 there's support for building the native libraries on Solaris, which are useful for performance (see https://issues.apache.org/jira/browse/HADOOP-3123). Tom On Tue, Jun 17, 2008 at 11:47 AM, Steve Loughran [EMAIL PROTECTED] wrote: Satoshi YAMADA wrote: From hadoop doc, only Linux and Windows are supported platforms. Is it possible to run hadoop on Solaris? Is hadoop implemented in pure java? What kinds of problems are there in order to port to Solaris? Thanks in advance. hi, no one seems to reply to the previous hadoop on Solaris Thread. I just tried running hadoop on Solaris 5.10 and somehow got error message. If you can give some advices, I would appreciate it. (single operation seems to work). You are probably the first person trying this. This means you have more work, but it gives you an opportunity to contribute code back into the next release. I'd recommend you check out the trunk and try building it and running the tests on solaris. Then when the tests fail, you can file bug reports (with stack traces) against specific tests. Then -possibly- other people might pick up and fix the problems, or you can fix them one by one, submitting patches to the bugreps as you go. I'm sure the Hadoop team would be happy to have Solaris support, its just a matter of whoever has the need sitting down to do it. -steve
Re: distcp/ls fails on Hadoop-0.17.0 on ec2.
Hi Einar, How did you put the data onto S3, using Hadoop's S3 FileSystem or using other S3 tools? If it's the latter then it won't work as the s3 scheme is for Hadoop's block-based S3 storage. Native S3 support is coming - see https://issues.apache.org/jira/browse/HADOOP-930, but it's not integrated yet. Tom On Thu, May 29, 2008 at 10:15 PM, Einar Vollset [EMAIL PROTECTED] wrote: Hi, I'm using the current Hadoop ec2 image (ami-ee53b687), and am having some trouble getting hadoop to access S3. Specifically, I'm trying to copy files from my bucket, into HDFS on the running cluster, so (on the master on the booted cluster) I do: hadoop-0.17.0 einar$ bin/hadoop distcp s3://ID:[EMAIL PROTECTED]/ input 08/05/29 14:10:44 INFO util.CopyFiles: srcPaths=[ s3://ID:[EMAIL PROTECTED]/] 08/05/29 14:10:44 INFO util.CopyFiles: destPath=input 08/05/29 14:10:46 WARN fs.FileSystem: localhost:9000 is a deprecated filesystem name. Use hdfs://localhost:9000/ instead. With failures, global counters are inaccurate; consider running with -i Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source s3://ID:[EMAIL PROTECTED]/ does not exist. at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:578) at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:594) at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763) ..which clearly doesn't work. The ID:SECRET are right - as if I change them I get : org.jets3t.service.S3ServiceException: S3 HEAD request failed. ResponseCode=403, ResponseMessage=Forbidden ..etc I suspect it might be a generic problem, as if I do: bin/hadoop fs -ls s3://ID:[EMAIL PROTECTED]/ I get: ls: Cannot access s3://ID:[EMAIL PROTECTED]/ : No such file or directory. ..even though the bucket is there and has a lot of data in it. Any thoughts? Cheers, Einar
Re: Hadoop 0.17 AMI?
Hi Jeff, I've built two public 0.17.0 AMIs (32-bit and 64-bit), so you should be able to use the 0.17 scripts to launch them now. Cheers, Tom On Thu, May 22, 2008 at 6:37 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi Jeff, 0.17.0 was released yesterday, from what I can tell. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Jeff Eastman [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Wednesday, May 21, 2008 11:18:56 AM Subject: Re: Hadoop 0.17 AMI? Any word on 0.17? I was able to build an AMI from a trunk checkout and deploy a single node cluster but the create-hadoop-image-remote script really wants a tarball in the archive. I'd rather not waste time munging the scripts if a release is near. Jeff Nigel Daley wrote: Hadoop 0.17 hasn't been released yet. I (or Mukund) is hoping to call a vote this afternoon or tomorrow. Nige On May 14, 2008, at 12:36 PM, Jeff Eastman wrote: I'm trying to bring up a cluster on EC2 using (http://wiki.apache.org/hadoop/AmazonEC2) and it seems that 0.17 is the version to use because of the DNS improvements, etc. Unfortunately, I cannot find a public AMI with this build. Is there one that I'm not finding or do I need to create one? Jeff
Re: Hadoop 0.17 AMI?
Hi Jeff, There is no public 0.17 AMI yet - we need 0.17 to be released first. So in the meantime you'll have to build your own. Tom On Wed, May 14, 2008 at 8:36 PM, Jeff Eastman [EMAIL PROTECTED] wrote: I'm trying to bring up a cluster on EC2 using (http://wiki.apache.org/hadoop/AmazonEC2) and it seems that 0.17 is the version to use because of the DNS improvements, etc. Unfortunately, I cannot find a public AMI with this build. Is there one that I'm not finding or do I need to create one? Jeff
Re: Not able to back up to S3
Part of the problem here is that the error message is confusing. It looks like there's a problem with the AWS credentials, when in fact the host name is malformed (but URI isn't telling us). I've created a patch to make the error message more helpful: https://issues.apache.org/jira/browse/HADOOP-3301. Tom On Fri, Apr 18, 2008 at 11:20 AM, Steve Loughran [EMAIL PROTECTED] wrote: Chris K Wensel wrote: you cannot have underscores in a bucket name. it freaks out java.net.URI. freaks out DNS, too, which is why the java.net classes whine. minus signs should work -- Steve Loughran http://www.1060.org/blogxter/publish/5 Author: Ant in Action http://antbook.org/
Re: distcp fails when copying from s3 to hdfs
Hi Siddhartha, This is a problem in 0.16.1 (https://issues.apache.org/jira/browse/HADOOP-3027) that is fixed in 0.16.2, which was released yesterday. Tom On 04/04/2008, Siddhartha Reddy [EMAIL PROTECTED] wrote: I am trying to run a Hadoop cluster on Amazon EC2 and backup all the data on Amazon S3 between the runs. I am using Hadoop 0.16.1 on a cluster made up of CentOS 5 images (ami-08f41161). I am able to copy from hdfs to S3 using the following command: bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt But copying from S3 to hdfs with the following command fails: bin/hadoop distcp s3://id:[EMAIL PROTECTED]/file.txt file2.txt with the following error: With failures, global counters are inaccurate; consider running with -i Copy failed: java.lang.IllegalArgumentException: Hook previously registered at java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:45) at java.lang.Runtime.addShutdownHook(Runtime.java:192) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1194) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148) at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:81) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1180) at org.apache.hadoop.fs.FileSystem.access$400(FileSystem.java:53) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1197) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:482) at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:504) at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:580) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:596) Can someone please point out if and what I am doing wrong? Thanks, Siddhartha Reddy
Re: distcp fails :Input source not found
However, when I try it on 0.15.3, it doesn't allow a folder copy. I have 100+ files in my S3 bucket, and I had to run distcp on each one of them to get them onto HDFS on EC2. Not a nice experience! This sounds like a bug - could you log a Jira issue for this please? Thanks, Tom
Re: S3 Support in 16.1
Hi Jake, Yes, this is a known issue that is fixed in 0.16.2 - see https://issues.apache.org/jira/browse/HADOOP-3027. Tom On 31/03/2008, Jake Thompson [EMAIL PROTECTED] wrote: So I am new to hadoop, but everything is working well so far. Except. I can use S3 fs in 15.3 without a problem. However, if I try the same in 16.1 I get: Exception in thread main java.lang.IllegalArgumentException: Hook previously registered at java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:45) at java.lang.Runtime.addShutdownHook(Runtime.java:192) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1194) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148) at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:81) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1180) at org.apache.hadoop.fs.FileSystem.access$400(FileSystem.java:53) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1197) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148) at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:122) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:94) at org.apache.hadoop.fs.FsShell.init(FsShell.java:79) at org.apache.hadoop.fs.FsShell.run(FsShell.java:1567) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1704) Same config both place, any thoughts on this? I checked out Jira and could not find a related bug report. -Jake -- Blog: http://www.lexemetech.com/
Re: Map/reduce with input files on S3
I wonder if it is related to: https://issues.apache.org/jira/browse/HADOOP-3027 I think it is - the same problem is fixed for me when using HADOOP-3027. Tom
Re: Input file globbing
Thanks Hairong, I've just created https://issues.apache.org/jira/browse/HADOOP-3064 for this. Tom On 20/03/2008, Hairong Kuang [EMAIL PROTECTED] wrote: Yes, this is a bug. This only occurs when a job's input path contains closures. JobConf.getInputPaths interprets mr/input/glob/2008/02/{02,08} as two input paths: mr/input/glob/2008/02/{02 and 08}. Let's see how to fix it. Hairong On 3/20/08 9:43 AM, Tom White [EMAIL PROTECTED] wrote: I'm trying to use file globbing to select various input paths, like so: conf.setInputPath(new Path("mr/input/glob/2008/02/{02,08}")); But this gives an exception: Exception in thread "main" java.io.IOException: Illegal file pattern: Expecting set closure character or end of range, or } for glob {02 at 3 at org.apache.hadoop.fs.FileSystem$GlobFilter.error(FileSystem.java:1023) at org.apache.hadoop.fs.FileSystem$GlobFilter.setRegex(FileSystem.java:1008) at org.apache.hadoop.fs.FileSystem$GlobFilter.init(FileSystem.java:926) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:826) at org.apache.hadoop.fs.FileSystem.globPaths(FileSystem.java:873) at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:131) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:541) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:809) Looking at the code for JobConf.getInputPaths I see it tokenizes using a comma as the delimiter, producing two paths mr/input/glob/2008/02/{02 and 08}. This looks like a bug to me. I'm surprised as this feature has been around for some time - are folks not using it like this? Tom
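Until the issue is fixed, one possible workaround is to add the paths explicitly rather than using a comma closure (0.16-era JobConf API):

    conf.addInputPath(new Path("mr/input/glob/2008/02/02"));
    conf.addInputPath(new Path("mr/input/glob/2008/02/08"));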
Re: Hadoop on EC2 for large cluster
Yes, this isn't ideal for larger clusters. There's a jira to address this: https://issues.apache.org/jira/browse/HADOOP-2410. Tom On 20/03/2008, Prasan Ary [EMAIL PROTECTED] wrote: Hi All, I have been trying to configure Hadoop on EC2 for large number of clusters ( 100 plus). It seems that I have to copy EC2 private key to all the machines in the cluster so that they can have SSH connections. For now it seems I have to run a script to copy the key file to each of the EC2 instances. I wanted to know if there is a better way to accomplish this. Thanks, PA - Never miss a thing. Make Yahoo your homepage. -- Blog: http://www.lexemetech.com/
Re: Issue with cluster over EC2 and different AMI types
Unfortunately there is no way to discover the rack that EC2 instances are running on, so you won't be able to use this optimization. Tom On 18/03/2008, Andrey Pankov [EMAIL PROTECTED] wrote: Hi, I apologize. It was my fault - I forgot to run the tasktracker on the slaves. But anyway, can anyone share their experience of how to use racks? Thanks. Andrey Pankov wrote: Hi all, I'm trying to configure a Hadoop cluster over Amazon EC2, one m1.small instance for the master node, and some m1.large instances for slaves. Both the master's and slaves' AMIs have the same version of Hadoop, 0.16.0. I run ec2 instances using ec2-run-instances, with the same --group parameter, but in two steps: one call for the master, a second call for the slaves. It looks like EC2 instances with different AMI types are started in different networks, for example the external and internal DNS names: * ec2-67-202-59-12.compute-1.amazonaws.com ip-10-251-74-181.ec2.internal - for small instance * ec2-67-202-3-191.compute-1.amazonaws.com domU-12-31-38-00-5C-C1.compute-1.internal - for large The trouble is that the slaves could not contact the master. When I specify the fs.default.name parameter in hadoop-site.xml on the slave boxes to be the full DNS name of the master (either external or internal) and try to start a datanode on it (bin/hadoop-daemon.sh ... start datanode), Hadoop replaces fs.default.name with just 'ip-10-251-74-181' and puts in the log: 2008-03-18 07:08:16,028 ERROR org.apache.hadoop.dfs.DataNode: java.net.UnknownHostException: unknown host: ip-10-251-74-181 ... So the DataNode could not be started. I tried to specify the IP address of ip-10-251-74-181 in /etc/hosts for each slave instance and it helped to start the DataNode on the slaves. And it became possible to store something in HDFS. But when I try to run a map-reduce job (in a jar file), it doesn't work. I mean that the job is still running but there is no progress at all. Hadoop writes Map 0% Reduce 0% and just freezes. I cannot find anything in the logs that could help, either on the master or on the slave boxes. I found that dfs.network.script could be used to specify a network location for a machine, but I have no idea how to use it. Can racks help me with it? Thanks in advance. --- Andrey Pankov --- Andrey Pankov
Re: Amazon S3 questions
One other note: When you use S3 URIs, you get a port out of range error on startup but that doesn't appear to be fatal. I spent a few hours on that one before I realized it didn't seem to matter. It seems like the S3 URI format where ':' is used to separate ID and secret key is confusing someone. Do you have a stacktrace for this? Sounds like something we could improve, if only by printing a warning message. Tom