Hadoop and Eclipse integration
Hello everybody, I attempted to use the Eclipse IDE for Hadoop development and I followed the instructions here: http://wiki.apache.org/hadoop/EclipseEnvironment Everything goes well until I start importing projects into Eclipse, particularly HDFS. When I follow the instructions for the HDFS import I get the following error from Eclipse:

Project 'hadoop-hdfs' is missing required library: '/home/nick/.m2/repository/org/aspectj/aspectjtools/1.6.5/aspectjtools-1.6.5.jar'

I should mention that the hadoop-common directory I checked out is located at /home/nick/hadoop-common and I am using Ubuntu 10.04. Similar errors appear when I attempt to import the MapReduceTools:

Project 'MapReduceTools' is missing required library: 'classes'
Project 'MapReduceTools' is missing required library: 'lib/hadoop-core.jar'

How can I resolve these issues? And once I have resolved them, how can I execute a simple Wordcount job from Eclipse? Thank you
How to mapreduce in the scenario
Hi, I wonder whether Hadoop can effectively solve the following problem:

input files: a.txt, b.txt
result: c.txt

a.txt:
id1,name1,age1,...
id2,name2,age2,...
id3,name3,age3,...
id4,name4,age4,...

b.txt:
id1,address1,...
id2,address2,...
id3,address3,...

c.txt:
id1,name1,age1,address1,...
id2,name2,age2,address2,...

I know that it can be done well by a database. But I want to handle it with Hadoop if possible. Can Hadoop meet the requirement? Any suggestion would help me. Thank you very much!

Best Regards,
Gump
Re: How to mapreduce in the scenario
Hive? Sure. Assuming you mean that the id is a FK common amongst the tables...

Mike Segel

On May 29, 2012, at 5:29 AM, liuzhg liu...@cernet.com wrote: [...]
Re: How to mapreduce in the scenario
Hive is one approach (similar to conventional databases, but not exactly the same). If you are looking at a MapReduce program, then use MultipleInputs:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html

On Tue, May 29, 2012 at 4:02 PM, Michel Segel michael_se...@hotmail.com wrote: [...]

--
Nitin Pawar
Re: How to Integrate LDAP in Hadoop ?
Which release? Version? I believe there are variables in the *-site.xml files that allow LDAP integration...

Mike Segel

On May 26, 2012, at 7:40 AM, samir das mohapatra samir.help...@gmail.com wrote:

Hi All, Has anyone worked on Hadoop with LDAP integration? Please help me with the same. Thanks, samir
RE: How to mapreduce in the scenario
Hi Gump,

MapReduce fits well for solving these types of problems (joins). I hope this will help you to solve the described problem:

1. Map output key and value classes: Write a map output key class (Text.class) and a value class (CombinedValue.class). Here the value class should be able to hold the values from both files (a.txt and b.txt), as shown below.

class CombinedValue implements Writable {
    String name;
    int age;
    String address;
    boolean isLeft; // flag to identify which file the record came from
}

2. Mapper: Write a map() function which can parse both files (a.txt, b.txt) and produce the common output key and value classes.

3. Partitioner: Write the partitioner in such a way that it will send all the (key, value) pairs with the same key to the same reducer.

4. Reducer: In the reduce() function, you will receive the records from both files and you can combine those easily.

Thanks
Devaraj

From: liuzhg [liu...@cernet.com]
Sent: Tuesday, May 29, 2012 3:45 PM
To: common-user@hadoop.apache.org
Subject: How to mapreduce in the scenario
[...]
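For reference, a minimal sketch of the value class Devaraj outlines, completed with the serialization methods the Writable contract requires. The field names follow his outline; a full join would additionally need the tagging mapper and partitioner he describes. This is an illustrative sketch, not code from the thread:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class CombinedValue implements Writable {
    boolean isLeft;      // true if the record came from a.txt
    String name = "";
    int age;
    String address = "";

    public void write(DataOutput out) throws IOException {
        out.writeBoolean(isLeft);
        out.writeUTF(name);
        out.writeInt(age);
        out.writeUTF(address);
    }

    public void readFields(DataInput in) throws IOException {
        isLeft = in.readBoolean();
        name = in.readUTF();
        age = in.readInt();
        address = in.readUTF();
    }
}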
Re: How to mapreduce in the scenario
Hi,

You can also try to use the Hadoop reduce-side join functionality. Look into contrib/datajoin/hadoop-datajoin-*.jar for the base Map and Reduce classes to do the same.

Regards,
Soumya.

On Tue, May 29, 2012 at 4:10 PM, Devaraj k devara...@huawei.com wrote: [...]
Re: How to Integrate LDAP in Hadoop ?
It is the Cloudera version of 0.20.

On Tue, May 29, 2012 at 4:14 PM, Michel Segel michael_se...@hotmail.com wrote: [...]
Re: How to mapreduce in the scenario
Yes, it is possible by using MultipleInputs with multiple mappers (basically 2 different mappers).

Step 1:

MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, Mapper2.class);

While defining the two mappers, put some identifier into the value, e.g.

output.collect(new Text(key), new Text(identifier + "~" + value));

related to a.txt and b.txt, so that it is easy to distinguish the two mappers' output within the reducer. (A minimal sketch of such a tagging mapper follows this message.)

Step 2: put b.txt in the distributed cache and compare the reducer values against the b.txt list:

String currValue = values.next().toString();
String valueSplitted[] = currValue.split("~");
if (valueSplitted[0].equals("A")) { // "A": identifier from the a.txt mapper
    // process the a.txt record here
} else if (valueSplitted[0].equals("B")) { // "B": identifier from the b.txt mapper
    // process the b.txt record here
}
output.collect(new Text(key), new Text(/* value formatted the way you want to display it */));

Decide on the key according to the result you want to produce. After that, you have to use one reducer to produce the output.

thanks
samir

On Tue, May 29, 2012 at 3:45 PM, liuzhg liu...@cernet.com wrote: [...]
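As a companion to samir's step 1, a minimal sketch of one of the two tagging mappers (old org.apache.hadoop.mapred API, identifier "A" as in his reply; splitting on the first comma to get the id is an assumption based on Gump's a.txt layout):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Tags each a.txt record with "A~" so the reducer can tell the two sources apart.
public class Mapper1 extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String[] fields = line.toString().split(",", 2); // id, rest of record
        if (fields.length < 2) return; // skip malformed lines
        output.collect(new Text(fields[0]), new Text("A~" + fields[1]));
    }
}

// Mapper2 would be identical except that it emits "B~" for b.txt records.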
distributed cache symlink
I'm trying to use the DistributedCache but I'm having an issue resolving the symlinks to my files. My Driver class writes some hashmaps to files in the DC like this:

Path tPath = new Path("/data/cache/fd", UUID.randomUUID().toString());
os = new ObjectOutputStream(fs.create(tPath));
os.writeObject(myHashMap);
os.close();
URI uri = new URI(tPath.toString() + "#" + "q_map");
DistributedCache.addCacheFile(uri, config);
DistributedCache.createSymlink(config);

But what Path() do I need to access to read the symlinks? I tried variations of "q_map" and "work/q_map" but neither works. The files are definitely there, because I can set a config var to the path and read the files in my reducer. For example, in my Driver class I set a variable via

config.set("q_map", tPath.toString());

and then in my Reducer's setup() I do something like

Path q_map_path = new Path(config.get("q_map_path"));
if (fs.exists(q_map_path)) {
    HashMap<String,String> qMap = loadmap(conf, q_map_path);
}

I tried to resolve the path to the symlinks via ${mapred.local.dir}/work but that doesn't work either. In the STDOUT of my mapper attempt I see:

2012-05-29 03:59:54,369 - INFO [main:TaskRunner@759] - Creating symlink: /tmp/hadoop-mapred/mapred/local/taskTracker/distcache/-3168904771265144450_-884848596_406879224/varuna010/data/cache/fd/6dc9d5c0-98be-4105-bd59-b344924dd989 <- /tmp/hadoop-mapred/mapred/local/taskTracker/root/jobcache/job_201205250826_0020/attempt_201205250826_0020_m_00_0/work/q_map

which says it's creating the symlinks, BUT I also see this output:

mapred.local.dir: /tmp/hadoop-mapred/mapred/local/taskTracker/root/jobcache/job_201205250826_0020/attempt_201205250826_0020_m_00_0
 job.local.dir: /tmp/hadoop-mapred/mapred/local/taskTracker/root/jobcache/job_201205250826_0020/work
 mapred.task.id: attempt_201205250826_0020_m_00_0
Path [work/q_map] does not exist
Path [/tmp/hadoop-mapred/mapred/local/taskTracker/root/jobcache/job_201205250826_0020/attempt_201205250826_0020_m_00_0/work/q_map] does not exist

which is from this code in my mapper's setup() method:

try {
    System.out.printf("mapred.local.dir: %s\n", conf.get("mapred.local.dir"));
    System.out.printf(" job.local.dir: %s\n", conf.get("job.local.dir"));
    System.out.printf(" mapred.task.id: %s\n", conf.get("mapred.task.id"));
    fs = FileSystem.get(conf);
    Path symlink = new Path("work/q_map");
    Path fullpath = new Path(conf.get("mapred.local.dir") + "/work/q_map");
    System.out.printf("Path [%s] ", symlink.toString());
    if (fs.exists(symlink)) {
        System.out.println("exists");
    } else {
        System.out.println("does not exist");
    }
    System.out.printf("Path [%s] ", fullpath.toString());
    if (fs.exists(fullpath)) {
        System.out.println("exists");
    } else {
        System.out.println("does not exist");
    }
} catch (IOException e1) {
    e1.printStackTrace();
}

Regards,
Alan
Re: EOFException at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)......
So my question is: do Hadoop 0.20 and 1.0.3 differ in their support for writing or reading SequenceFiles? The same code works fine with Hadoop 0.20, but the problem occurs when it is run under Hadoop 1.0.3.

On Sun, May 27, 2012 at 6:15 PM, waqas latif waqas...@gmail.com wrote:

But the thing is, it works with Hadoop 0.20, even with 100x100 (and even bigger) matrices. But when it comes to Hadoop 1.0.3, then there is a problem even with a 3x3 matrix.

On Sun, May 27, 2012 at 12:00 PM, Prashant Kommireddi prash1...@gmail.com wrote:

I have seen this issue with large file writes using the SequenceFile writer. I have not found the same issue when testing with writing fairly small files (< 1GB).

On Fri, May 25, 2012 at 10:33 PM, Kasi Subrahmanyam kasisubbu...@gmail.com wrote:

Hi, if you are using a custom writable object while passing data from the mapper to the reducer, make sure that the readFields and the write methods have the same number of variables. It might be possible that you wrote data to a file using a custom writable but later modified the custom writable (like adding a new attribute to the writable) which the old data doesn't have. It might be a possibility, please check once.

On Friday, May 25, 2012, waqas latif wrote:

Hi Experts, I am fairly new to Hadoop MapReduce and I was trying to run a matrix multiplication example presented by Mr. Norstadt under the following link: http://www.norstad.org/matrix-multiply/index.html. I can run it successfully with Hadoop 0.20.2, but when I tried to run it with Hadoop 1.0.3 I got the following error. Is it a problem with my Hadoop configuration, or is it a compatibility problem in the code, which was written against Hadoop 0.20 by the author? Also, please guide me on how I can fix this error in either case. Here is the error I am getting:

Exception in thread main java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at java.io.DataInputStream.readFully(DataInputStream.java:152)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470)
at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:60)
at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:87)
at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:112)
at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:150)
at TestMatrixMultiply.testRandom(TestMatrixMultiply.java:278)
at TestMatrixMultiply.main(TestMatrixMultiply.java:308)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Thanks in advance.
Regards,
waqas
Re: How to Integrate LDAP in Hadoop ?
I believe that their CDH3u3 or later has this... parameter. (Possibly even earlier.)

On May 29, 2012, at 6:12 AM, samir das mohapatra wrote: [...]
Re: Multiple fs.FSInputChecker: Found checksum error .. because of load ?
Found the problem. Shifting the VMs from VirtualBox to KVM worked for me; all other configurations of the VMs were kept the same. So the checksum errors were indeed showing a problem with hardware... though virtual in this case.

-Akshay

From: Akshay Singh akshay_i...@yahoo.com
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Sent: Wednesday, 23 May 2012 4:38 PM
Subject: Multiple fs.FSInputChecker: Found checksum error .. because of load ?

Hi, I am trying to run a few benchmarks on a small Hadoop cluster of 4 VMs (2 on each of 2 physical hosts, each VM having 1 CPU core, 2GB RAM, an individual disk, and Gbps bridged connectivity). I am using VirtualBox as the VMM. The workload reads a good number of random small files (64MB each) concurrently from all the HDFS datanodes, through clients running on the same set of VMs. I am using FsShell cat to read the files, and I see these checksum errors:

12/05/22 10:10:12 INFO fs.FSInputChecker: Found checksum error: b[3072, 3584]=cb93678dc0259c978731af408f2cb493b510c948b45039a4853688fd21c2a070fc03ff7b807f33d20100080027cf09e308002761d4480800450005dc2af04000400633ca816169cf816169d0c35a87c1b090973e78aa5ef880100e24446b0101080a020fcf7b020fcea7d85a506ff1eaea5383eea539137745249aebc25e86d0feac89c4e0c9b91bc09ee146af7e9bd103c8269486a8c748091cfc42e178f461d9127f6c9676f47fa6863bb19f2e51142725ae643ffdfbe7027798e1f11314d9aa877db99a86db25f2f6d18d5b86062de737147b918e829fb178cfbbb57e932ab082197b1f4fa4315eae67210018c3c034b3f52481c4cebc53d1e2fd5ad4b67d87823f5e0923fa1ff579de88768f79a6df5f86a8a7eb3a68b3366063408b7292eef8f909580e3866676838ba8417bb810d9a9e8d12c49de4522214e1c6a22b64394a1e60e020b12d5803d2b6a53fe64d00b85dc63c67a8a94758f71a7a06a786e168ea234030806026ffed07770ba6d407437a4a83b96c2b3a3c767d834a19c438a0d6f56ca6fc9099d375ae1f95839c62f36a466818eb816d4d3ef6f3951ce3a19a3364a827bac8fd70833587c89084b847e4ceeae48df9256ef629c6325f67872478838777885f930710b71c02256b0cc66242d4974fbfb0ebcf85ef6cf4b67656dc6918bc57083dc8868e34662c98e183163a9fc82a42fddc
org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_2250776182612718654:of:/user/hduser/15-3/part-00197 at 52284416
at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1457)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2172)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
at org.apache.hadoop.fs.FsShell.printToStdout(FsShell.java:114)
at org.apache.hadoop.fs.FsShell.access$100(FsShell.java:49)
at org.apache.hadoop.fs.FsShell$1.process(FsShell.java:349)
at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1913)
at org.apache.hadoop.fs.FsShell.cat(FsShell.java:346)
at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1557)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1776)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1895)
12/05/22 10:10:13 WARN hdfs.DFSClient: Found Checksum error for blk_2250776182612718654_6078 from XX.XX.XX.207:50010 at 52284416
12/05/22 10:10:13 INFO hdfs.DFSClient: Could not obtain block blk_2250776182612718654_6078 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
cat: Checksum error: /blk_2250776182612718654:of:/user/hduser/15-3/part-00197 at 52284416
cat: Checksum error: /blk_-5591790629390980895:of:/user/hduser/15-1/part-00192 at 30324736

Hadoop fsck does not report any corrupt blocks after writing the data, but after every iteration of reading the data I see new corrupt blocks (with output as above). Interestingly, the higher the load (concurrent sequential reads) I put on the DFS cluster, the higher the chances of blocks getting corrupted. I (mostly) do not see any corruption happening when there is no or less contention at the DFS servers for reads.

I see a few other people on the web who also faced the same problem:
http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/508
http://tinyurl.com/7rsckwo

It has been suggested on these threads that faulty hardware may be causing this issue, and these checksum errors are likely to tell so. So, I diagnosed my RAM (non
Re: Pragmatic cluster backup strategies?
Hi,

That's not a backup strategy. You could still have joe luser take out a key file or directory. What do you do then?

On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:

Hi,

We are about to build a 10-machine cluster with 40TB of storage. Obviously, as this gets full, actually trying to create an offsite backup becomes a problem unless we build another 10-machine cluster (too expensive right now). Not sure if it will help, but we have planned the cabinet into an upper and a lower half with separate redundant power; then we plan to put half of the cluster in the top and half in the bottom, effectively 2 racks, so in theory we could lose half the cluster and still have copies of all the blocks with a replication factor of 3? Apart from the data centre burning down or some other disaster that would render the machines totally unrecoverable, is this approach good enough?

I realise this is a very open question and everyone's circumstances are different, but I'm wondering what other people's experiences/opinions are for backing up cluster data?

Thanks
Darrell.
Re: Pragmatic cluster backup strategies?
Yes, you will have redundancy, so no single point of hardware failure can wipe out your data, short of a major catastrophe. But you can still have an errant or malicious hadoop fs -rm -rf shut you down. If you still have the original source of your data somewhere else, you may be able to recover by reprocessing the data, but if this cluster is your single repository for all your data, you may have a problem.

--Bobby Evans

On 5/29/12 11:40 AM, Michael Segel michael_se...@hotmail.com wrote: [...]
Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)
Hello Hadoop community,

I have been trying to set up a two-node Hadoop cluster (following the instructions at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/) and am very close to running it, apart from one small glitch - when I start the DFS (using start-dfs.sh), it says:

10.63.88.53: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-ubuntu.out
10.63.88.109: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-pandro51-OptiPlex-960.out
10.63.88.109: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-secondarynamenode-pandro51-OptiPlex-960.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-jobtracker-pandro51-OptiPlex-960.out
10.63.88.109: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-pandro51-OptiPlex-960.out
10.63.88.53: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-ubuntu.out

which looks like it has been successful in starting all the nodes. However, when I check on them by running 'jps', this is what I see:

27531 SecondaryNameNode
27879 Jps

As you can see, there is no datanode or namenode. I have been racking my brains at this for quite a while now. Checked all the inputs and everything. Anyone know what the problem might be?

--
Thanks in advance,
Rohit
about hadoop webapps
I have another question. I want to use Hadoop's classes and XML messages to get information about Hadoop's NameNode, DataNodes, jobs, etc. in my application and monitor it, so I want to deploy a web application (Struts 2.0) in Hadoop's webapps directory. I'm reading through Hadoop's source, but I couldn't find a good function to solve this. Do you have any good suggestions, or is there a users community for this?
Help with DFSClient Exception.
Hi,

We are frequently observing the exception "java.io.IOException: DFSClient_attempt_201205232329_28133_r_02_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_02_0/part-r-2. Giving up." on our cluster. The exception occurs while writing a file. We are using Hadoop 0.20.2. It's a ~250-node cluster, and on average 1 box goes down every 3 days.

Detailed stack trace:

12/05/27 23:26:54 INFO mapred.JobClient: Task Id : attempt_201205232329_28133_r_02_0, Status : FAILED
java.io.IOException: DFSClient_attempt_201205232329_28133_r_02_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_02_0/part-r-2. Giving up.
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3331)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3240)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:106)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

Our investigation: We have the min replication factor set to 2. As mentioned here (http://kazman.shidler.hawaii.edu/ArchDocDecomposition.html), "A call to complete() will not return true until all the file's blocks have been replicated the minimum number of times. Thus, DataNode failures may cause a client to call complete() several times before succeeding", so we should retry complete() several times. The org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal() method calls the complete() function and retries it 20 times. But in spite of that, the file blocks are not replicated the minimum number of times. The retry count is not configurable. Changing the min replication factor to 1 is also not a good idea, since there are continuously jobs running on our cluster.

Do we have any solution or workaround for this problem? What min replication factor is generally used in industry? Let me know if any further inputs are required.

Thanks,
-Akshay
Re: Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)
Can you see the logs for the NN and DN?

On May 27, 2012, at 1:21 PM, Rohit Pandey rohitpandey...@gmail.com wrote: [...]
Re: Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)
Rohit,

The SNN may start and run indefinitely without doing any work. The NN and DN have probably not started because the NN has an issue (perhaps the NN name directory isn't formatted) and the DN can't find the NN (or has data directory issues as well). So this isn't a glitch but a real issue, and you'll have to take a look at your logs for it.

On Sun, May 27, 2012 at 10:51 PM, Rohit Pandey rohitpandey...@gmail.com wrote: [...]

--
Harsh J
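As a concrete starting point for Harsh's advice (a sketch, assuming the 0.20-era install layout from the tutorial Rohit followed; the exact log file names are illustrative): inspect the NameNode and DataNode logs first, and only if the NN log complains about an unformatted or inconsistent storage directory, format it - note that this erases any existing HDFS metadata.

less /usr/local/hadoop/logs/hadoop-pandro51-namenode-*.log
less /usr/local/hadoop/logs/hadoop-pandro51-datanode-*.log
# only if the NN log shows the name directory was never formatted:
bin/hadoop namenode -format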
Best Practices for Upgrading Hadoop Version?
Hi,

I'd like to upgrade my Hadoop cluster from version 0.20.2-CDH3B4 to 1.0.3. I'm running a pretty small cluster of just 4 nodes, and it's not really being used by too many people at the moment, so I'm OK if things get dirty or it goes offline for a bit. I was looking at the tutorial at http://wiki.apache.org/hadoop/Hadoop_Upgrade, but it seems either outdated or missing information. Namely, from what I've noticed so far, it doesn't specify which user any of the commands should be run as. Since I'm sure this is something a lot of people have needed to do, is there a better tutorial somewhere for upgrading the Hadoop version in general?

Eli
Re: distributed cache symlink
Should be ./q_map .

Koji

On 5/29/12 7:38 AM, Alan Miller alan.mil...@synopsys.com wrote: [...]
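To spell out Koji's one-liner, a minimal sketch under the following assumptions: the DistributedCache symlink is created in the task's current working directory, so it should be opened with local java.io APIs rather than through FileSystem.get(conf), which points at HDFS in Alan's setup. Names follow Alan's q_map example and are otherwise illustrative:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.HashMap;

public class QMapLoader {
    // Call from the Reducer's configure()/setup(): loads the HashMap the driver serialized.
    public static HashMap<String, String> loadQMap() throws IOException {
        File link = new File("./q_map"); // symlink created by the DistributedCache
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(link));
        try {
            @SuppressWarnings("unchecked")
            HashMap<String, String> qMap = (HashMap<String, String>) in.readObject();
            return qMap;
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        } finally {
            in.close();
        }
    }
}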
Re: How to mapreduce in the scenario
Yes, you can do it. In Pig you would write something like:

A = load 'a.txt' as (id, name, age, ...);
B = load 'b.txt' as (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C into 'c.txt';

Hive can do it similarly too. Or you could write your own join directly in map/reduce, or using the data_join jar.

--Bobby Evans

On 5/29/12 4:08 AM, lzg lzg_...@163.com wrote: [...]
Re: different input/output formats
Hi Mark,

public void map(LongWritable offset, Text val, OutputCollector<FloatWritable, Text> output, Reporter reporter) throws IOException {
    output.collect(new FloatWritable(1), val); // change 1 to 1.0f, then it will work.
}

Let me know the status after the change.

On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com wrote:

Hi guys, this is a very simple program, trying to use TextInputFormat and SequenceFileOutputFormat. Should be easy, but I get the same error. Here are my configurations:

conf.setMapperClass(myMapper.class);
conf.setMapOutputKeyClass(FloatWritable.class);
conf.setMapOutputValueClass(Text.class);
conf.setNumReduceTasks(0);
conf.setOutputKeyClass(FloatWritable.class);
conf.setOutputValueClass(Text.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
TextInputFormat.addInputPath(conf, new Path(args[0]));
SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));

The myMapper class is:

public class myMapper extends MapReduceBase implements Mapper<LongWritable, Text, FloatWritable, Text> {
    public void map(LongWritable offset, Text val, OutputCollector<FloatWritable, Text> output, Reporter reporter) throws IOException {
        output.collect(new FloatWritable(1), val);
    }
}

But I get the following error:

12/05/29 12:54:31 INFO mapreduce.Job: Task Id : attempt_201205260045_0032_m_00_0, Status : FAILED
java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59)
at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.Use

Where is the writing of LongWritable coming from??

Thank you,
Mark
Re: different input/output formats
Thanks for the reply, but I already tried this option, and this is the error:

java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:60)
at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.Use

Mark

On Tue, May 29, 2012 at 1:05 PM, samir das mohapatra samir.help...@gmail.com wrote: [...]
Re: different input/output formats
Hi Mark,

See the output for that same application. I am not getting any error.

On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com wrote: [...]
Re: different input/output formats
Hi Samir, can you email me your main class? Or you can check mine; it is as follows:

public class SortByNorm1 extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: bin/hadoop jar norm1.jar inputDir outputDir\n");
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        JobConf conf = new JobConf(new Configuration(), SortByNorm1.class);
        conf.setJobName("SortDocByNorm1");
        conf.setMapperClass(Norm1Mapper.class);
        conf.setMapOutputKeyClass(FloatWritable.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setNumReduceTasks(0);
        conf.setReducerClass(Norm1Reducer.class);
        conf.setOutputKeyClass(FloatWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        TextInputFormat.addInputPath(conf, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new SortByNorm1(), args);
        System.exit(exitCode);
    }
}

On Tue, May 29, 2012 at 1:55 PM, samir das mohapatra samir.help...@gmail.com wrote: [...]
Re: How to mapreduce in the scenario
Hi Mike, Nitin, Devaraj, Soumya, samir, Robert,

Thank you all for your suggestions. Actually, I want to know whether Hadoop has any performance advantage over a conventional database for solving this kind of problem (joining data).

Best Regards,
Gump

On Tue, May 29, 2012 at 6:53 PM, Soumya Banerjee soumya.sbaner...@gmail.com wrote: [...]
about rebalance
Hi,

I added 5 new datanodes and I want to do a rebalance. I started the rebalance on the namenode, and it gave me the notice:

starting balancer, logging to /hadoop/logs/hadoop-hdfs-balancer-hadoop220.out

Today I checked the log file, and the detail is:

Another balancer is running. Exiting...
Balancing took 5.0203 minutes

1) I am not sure whether I should start the rebalance on the namenode or on each new datanode.
2) Should I set the bandwidth on each datanode or only on the namenode?
3) Once the rebalance has started, will the data on the other nodes be decreased?
4) Does the log mean the balancer was killed by another one?

If you have some suggestions, please give me some notice. Thank you.

Best Regards
Malone
2012-05-30

Yingnan.Ma
yingnan...@ipinyou.com
http://www.ipinyou.com
Re: How to mapreduce in the scenario
If you have a huge dataset (huge meaning around terabytes, or at the least a few GBs), then yes, Hadoop has the advantage of distributed systems and is much faster; but on a smaller set of records it is not as good as an RDBMS.

On Wed, May 30, 2012 at 6:53 AM, liuzhg liu...@cernet.com wrote: [...]

--
Nitin Pawar
RE: about rebalance
1) I am not sure whether I should start the rebalance on the namenode or on each new datanode.

You can run the balancer on any node. It is not suggested to run it on the namenode; it would be better to run it on a node which has less load.

2) Should I set the bandwidth on each datanode or only on the namenode?

Each datanode has a limited bandwidth for rebalancing. The default value for the bandwidth is 5MB/s.

3) Once the rebalance has started, will the data on the other nodes be decreased?

Yes, after the balancer runs, data will be moved from over-utilized nodes to under-utilized nodes.

4) Does the log mean the balancer was killed by another one?

We cannot run multiple balancers at a time. Only one balancer is allowed to run in the cluster at any time, to avoid data corruption.

You can refer to the document below for more details:
https://issues.apache.org/jira/secure/attachment/12368261/RebalanceDesign6.pdf

Thanks
Devaraj

From: yingnan.ma [yingnan...@ipinyou.com]
Sent: Wednesday, May 30, 2012 7:06 AM
To: common-user
Subject: about rebalance
[...]
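To make the bandwidth point concrete (a sketch, assuming a 0.20-era release where the property carries this name; the 10 MB/s value is only an example): the per-datanode rebalance bandwidth is a dfs setting read by each datanode, and the balancer itself is launched once, from any node.

In hdfs-site.xml on each datanode:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <!-- bytes per second; 10485760 = 10 MB/s -->
  <value>10485760</value>
</property>

Then, from any lightly loaded node:

bin/start-balancer.sh -threshold 10

where -threshold is the allowed deviation from average utilization, in percent.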