how to decode the metadata file of a block
Hi, can somebody give me some insight into how to read the contents of a block's metadata file using the HDFS APIs, and into the encoding that is used? Thanks, Vidur
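Treat the following as a hedged sketch only: as I understand the 0.20-era layout (check BlockMetadataHeader and DataChecksum in the source tree to confirm), a block's .meta file starts with a 2-byte version, then a 1-byte checksum type (0 = NULL, 1 = CRC32) and a 4-byte bytesPerChecksum, followed by one 4-byte CRC32 per bytesPerChecksum-sized chunk of the block file. The file name below is hypothetical.

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class ReadBlockMeta {
  public static void main(String[] args) throws IOException {
    DataInputStream in = new DataInputStream(
        new FileInputStream("blk_1234567890_1001.meta")); // hypothetical name
    short version = in.readShort();      // header version
    byte checksumType = in.readByte();   // 0 = NULL, 1 = CRC32
    int bytesPerChecksum = in.readInt(); // chunk size each CRC covers
    System.out.println("version=" + version + " type=" + checksumType
        + " bytesPerChecksum=" + bytesPerChecksum);
    // the remainder is one 4-byte CRC32 per chunk of the block file
    int chunk = 0;
    while (in.available() >= 4) {
      long crc = in.readInt() & 0xffffffffL;
      System.out.println("chunk " + chunk++ + " crc32=0x"
          + Long.toHexString(crc));
    }
    in.close();
  }
}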
Is it possible ....!!!
Hi, I wanted to ask if it is possible to intercept every communication that takes place between Hadoop's MapReduce daemons, i.e. between the JobTracker and the TaskTracker, and make it pass through my own communication library. So, if the JobTracker and TaskTracker talk over HTTP or RPC, I would like to intercept the call and route it through my communication library. If that is possible, can anyone tell me which classes in Hadoop's distribution I need to look at? Similarly, for HDFS, is it possible to route all the communication between the NameNode and the DataNodes through my communication library? The reason is that I want all communication to go through a library that resolves every communication problem we can have, e.g. firewalls, NAT, non-routed paths, multihoming, etc. With that library, all the communication headaches would be gone, so we would be able to use Hadoop quite easily and there would be no communication problems. That is my master's project, so I want to know how to start and where to look. I would really appreciate a reply. Regards, Ahmad Shahzad
Re: Is it possible ....!!! COOL!
Hey, this is a really neat idea. If anyone has a way to do this, could you share? I'll bet this could be very interesting! Thanks... Best, HAL
Re: Is it possible ....!!! COOL!
Sounds like it could be a SPOF.
Appending and seeking files while writing
Hi. Was the append functionality finally added in the 0.20.1 version? Also, is the ability to seek within a file that is being written, and to write data at another position, supported? Thanks in advance!
Java run-time error while executing my application - unable to find the files on the HDFS
Hello friends, I have built my own Java application that performs some map-reduce operations on the input files. I have loaded my files into HDFS under the following paths:

/user/sam/input/1.txt
/user/sam/input/corrected
/user/sam/input/in

When I use the command $ hadoop dfs -cat /user/sam/input/1.txt it outputs the contents of the file correctly. My application refers to the files on HDFS through Java strings, as follows:

String str = "hdfs://192.168.1.1:9000/user/sam/input";
String file1 = str + "1.txt";
String file2 = str + "Corrected";

Here file1 and file2 are fed as input to my mapper functions. After starting my daemons, I ran my application as follows:

$ hadoop jar maximum.jar /user/sam/input/in output

It generates this error:

java.io.FileNotFoundException: hdfs://192.168.1.1:9000/user/sam/input/1.txt (No such file or directory)

But when I type $ hadoop dfs -cat hdfs://192.168.1.1:9000/user/sam/input/1.txt it outputs the contents of the file correctly. I tried other possible paths, such as:

String str = "/user/sam/input/";
String str = "hdfs:/user/sam/input";

But none of the above paths works. Could anyone point out the possible mistake? Any suggestion is welcome. Thanks, Bharath
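A java.io.FileNotFoundException that names a full hdfs:// URL usually means the string is being handed to java.io classes (File, FileInputStream, FileReader), which only know the local filesystem. A minimal sketch of opening the same file through Hadoop's FileSystem API instead; the path and NameNode address are taken from the post above, the rest is illustrative:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws IOException {
    Path file1 = new Path("hdfs://192.168.1.1:9000/user/sam/input/1.txt");
    // FileSystem resolves the hdfs:// scheme; java.io.File never will
    FileSystem fs = FileSystem.get(file1.toUri(), new Configuration());
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(file1)));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}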
Re: Is it possible ....!!!
On Jun 10, 2010, at 3:25 AM, Ahmad Shahzad wrote: The reason is that I want all communication to go through a library that resolves every communication problem we can have, e.g. firewalls, NAT, non-routed paths, multihoming, etc. I know Owen pointed you towards using proxies, but anything remotely complex would probably be better done in an interposer library, as then it is application agnostic.
Help:how to read a xml file in hadoop framework
Dear all, I need to read an XML file in my application. When I run it as an application in Eclipse it runs correctly, but when I run it on Hadoop it gives this error:

10/06/10 15:52:52 INFO input.FileInputFormat: Total input paths to process : 22
10/06/10 15:52:53 INFO mapred.JobClient: Running job: job_201006101455_0010
10/06/10 15:52:54 INFO mapred.JobClient: map 0% reduce 0%
10/06/10 15:53:07 INFO mapred.JobClient: Task Id : attempt_201006101455_0010_m_00_0, Status : FAILED
java.lang.NullPointerException
    at WordCountMapper.map(WordCountMapper.java:61)
    at WordCountMapper.map(WordCountMapper.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
attempt_201006101455_0010_m_00_0: java.io.FileNotFoundException: /tmp/hadoop-hadoop/mapred/local/taskTracker/jobcache/job_201006101455_0010/attempt_201006101455_0010_m_00_0/work/readme.xml (No such file or directory)
attempt_201006101455_0010_m_00_0:     at java.io.FileInputStream.open(Native Method)
attempt_201006101455_0010_m_00_0:     at java.io.FileInputStream.<init>(FileInputStream.java:106)

From the error message, we can see the XML file can't be found. Hope you can help me, thanks. Best regards, Jander
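The task is looking for readme.xml in its local working directory on the TaskTracker, where nothing has placed it; reading it in Eclipse works only because the file happens to sit in the local project directory. One common fix is to ship the file with the job via the DistributedCache. A sketch, assuming the file has already been uploaded to HDFS at /user/jander/readme.xml (that path is made up for illustration):

import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class XmlJobDriver {
  public static void main(String[] args)
      throws IOException, URISyntaxException {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "job-with-side-xml");
    // ship the HDFS file to every task's local disk before the job starts
    DistributedCache.addCacheFile(new URI("/user/jander/readme.xml"),
        job.getConfiguration());
    // ... set mapper/reducer/input/output and submit as usual ...
  }
}

// In the mapper, pick up the local copy in setup() instead of hard-coding
// a path that only exists on the client machine.
class XmlAwareMapper extends Mapper<Object, Object, Object, Object> {
  private File xmlFile;

  @Override
  protected void setup(Context context) throws IOException {
    Path[] cached =
        DistributedCache.getLocalCacheFiles(context.getConfiguration());
    xmlFile = new File(cached[0].toString()); // local copy of readme.xml
  }
}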
Re: Is it possible ....!!!
You can define your own socket factory by setting the configuration parameter hadoop.rpc.socket.factory.class.default to the class name of a SocketFactory. It is also possible to define socket factories on a protocol-by-protocol basis. Look at the code in NetUtils.getSocketFactory. -- Owen
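A minimal sketch of such a factory; the class name InterceptingSocketFactory is made up, and the interesting hook is the no-argument createSocket(), since Hadoop's RPC client creates unconnected sockets and connects them itself:

import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import javax.net.SocketFactory;

public class InterceptingSocketFactory extends SocketFactory {

  // Hadoop's RPC client calls this variant and connects the socket later,
  // so returning your library's Socket subclass here intercepts everything.
  @Override
  public Socket createSocket() throws IOException {
    return new Socket(); // swap in the custom transport's Socket here
  }

  @Override
  public Socket createSocket(String host, int port) throws IOException {
    Socket s = createSocket();
    s.connect(new InetSocketAddress(host, port));
    return s;
  }

  @Override
  public Socket createSocket(String host, int port, InetAddress localAddr,
      int localPort) throws IOException {
    Socket s = createSocket();
    s.bind(new InetSocketAddress(localAddr, localPort));
    s.connect(new InetSocketAddress(host, port));
    return s;
  }

  @Override
  public Socket createSocket(InetAddress host, int port) throws IOException {
    return createSocket(host.getHostAddress(), port);
  }

  @Override
  public Socket createSocket(InetAddress host, int port,
      InetAddress localAddr, int localPort) throws IOException {
    return createSocket(host.getHostAddress(), port, localAddr, localPort);
  }
}

Wire it up with conf.set("hadoop.rpc.socket.factory.class.default", "InterceptingSocketFactory"), or the equivalent property in core-site.xml; per Owen's note, NetUtils.getSocketFactory also consults a per-protocol variant of that key.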
Re: Is it possible ....!!!
Aaron Kimball wrote: Hadoop has some classes for controlling how sockets are used. See org.apache.hadoop.net.StandardSocketFactory and SocksSocketFactory. The socket factory implementation chosen is controlled by the hadoop.rpc.socket.factory.class.default configuration parameter. You could probably write your own SocketFactory that gives back socket implementations that tee the conversation to another port, or to a file, etc. So it's possible, but I don't know that anyone has implemented this. I think others may have examined Hadoop's protocols via Wireshark or other external tools, but those don't have much insight into Hadoop's internals. (Neither, for that matter, would the socket factory. You'd probably need to be pretty clever to introspect as to exactly what type of message is being sent and actually do semantic analysis, etc.)

Also worry about anything opening a URL, for which there are JVM-level factories, and about Jetty, which opens its own listeners, though presumably it's the clients you'd want to play with. I'm going to be honest and say this is a fairly ambitious project for a master's thesis, because you are going to be nestling deep into code across the system, possibly making changes whose benefits people who run well-managed datacentres won't see (they don't have connectivity problems, as they set up the machines and the network properly; it's only people like me whose home desktop is badly configured: https://issues.apache.org/jira/browse/HADOOP-3426 ). Now, what might be handy is better diagnostics of the configuration:

1. Code to run on every machine to test the network, look at the config, play with DNS, detect problems, and report them with meaningful errors that point to wiki pages with hints.
2. Every service that opens ports logging this event somewhere (ideally in a service base class), so instead of trying to work out which ports Hadoop is using by playing with netstat -p and jps -v, you can make a query of the nodes (command line, signal, and GET /ports) and get each service's list of active protocols, ports, and IP addresses as text or JSON.
3. Some class to take that JSON list and then try to access the various things, logging failures (a minimal probe is sketched after this message).
4. Some MR jobs to run the code in (3) and see what happens.
5. Some MR jobs whose aim in life is to measure network bandwidth and do stats on round-trip times.

Just a thought :) See also some thoughts of mine on Hadoop/university collaboration: http://www.slideshare.net/steve_l/hadoop-and-universities
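As a flavour of item (3), a minimal probe, assuming the endpoint list has already been fetched from the nodes (the hostnames and ports below are placeholders):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
  public static void main(String[] args) {
    // placeholder endpoints; in practice these would come from the
    // JSON list described in item (2)
    String[] endpoints = {"namenode:8020", "jobtracker:8021"};
    for (String endpoint : endpoints) {
      String[] parts = endpoint.split(":");
      Socket s = new Socket();
      try {
        s.connect(new InetSocketAddress(parts[0],
            Integer.parseInt(parts[1])), 2000); // 2 second timeout
        System.out.println(endpoint + " reachable");
      } catch (IOException e) {
        System.out.println(endpoint + " UNREACHABLE: " + e.getMessage());
      } finally {
        try { s.close(); } catch (IOException ignored) { }
      }
    }
  }
}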
Re: just because you can, it doesn't mean you should....
All, okay, I was being facetious earlier with the 'COOL' comment. This is a very bad idea. Well, not so much bad, but think about the ramifications of what you are proposing. Putting a 'comm' code library together that facilitates comms and 'helps' with architecture issues also creates a SPOF (as another gent pointed out); moreover, it creates a nice target for exploitation, as the library will undoubtedly become a repository of embedded passwords, alternate dummy accounts, bypass routes, and all sorts of goop to make things 'easier'. And since it has to be world-readable and easy to get access to, it will be very tough to protect - or easy to DoS/DDoS. Anything and everything from random timing attacks, substitution spoofs, TOCTOUs, you name it. This whole thing is already a very nice open highway to distribute embedded and tunneled 'items' of a certain unnatural nature; don't try to override what little security you have already by 'punching holes in the firewall' and other silly stuff. In the long run, what might be better is a discovery agent that provides continual validation of paths and service availability specific to Hadoop and its subprograms. That way any outage or problem can be immediately addressed or brought to the attention of the SysAds/networkers - like a service monitoring program. Just don't make it simple for the 'hats out there to own you in under five minutes flat (especially with an RPC or SOAP call to some lib or flat file - and ssh/ssl abso-lu-tely does not matter, trust me). You can disagree, and I really don't mean to be a 'buzz kill', but if you ask your local 'Sheriff', I think you'll be advised not to pursue this path too heavily. Have a good computational day... Best, Hal

Aaron Kimball wrote: [...] Allen's suggestion is probably more correct, but might incur additional work on your part. Cheers, - Aaron
Re: the same key in different reducers
Hi, and thank you for the answers. I didn't check the email and now I see 7 answers; it is really great. Let me explain in more detail why I am asking such a strange question :-) As I wrote before, I write to HBase from a Hadoop job; the writing actually happens in the reduce part of the job. Assume I have 3 reducers (all of them write to HBase) and suppose reducer 1 and reducer 3 produce the same key. In that case I need to check whether HBase already contains the key (which requires a select operation against HBase); if yes, I have to merge with the already-inserted record and write it back to HBase. BUT in my case the information is organized in such a way that I have no problem with duplicate keys, so I can skip the expensive HBase select and use only insert operations. But to use only inserts, I need to know that each reducer has a unique output key (that K3 is unique across reducers).

input: InputFormat<K1,V1>
mapper: Mapper<K1,V1,K2,V2>
combiner: Reducer<K2,V2,K2,V2>
reducer: Reducer<K2,V2,K3,V3>
output: RecordWriter<K3,V3>

On Thu, Jun 10, 2010 at 12:40 AM, James Seigel ja...@tynt.com wrote: Oleg, are you wanting to have them in different reducers? If so, then you can write a Comparable object to make that happen. If you want them to be on the same reducer, then that is what Hadoop will do. :) On 2010-06-09, at 3:06 PM, Ted Yu wrote: Can you disclose more about how K3 is generated? From your description below, it is possible. On Wed, Jun 9, 2010 at 1:17 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi, my Hadoop job writes the results of map/reduce to HBase. I have 3 reducers. Here is the sequence of input and output parameters for the Mapper, Combiner and Reducer:

input: InputFormat<K1,V1>
mapper: Mapper<K1,V1,K2,V2>
combiner: Reducer<K2,V2,K2,V2>
reducer: Reducer<K2,V2,K3,V3>
output: RecordWriter<K3,V3>

My question: is it possible that more than one reducer has the same output key K3? Meaning, in case I have 3 reducers, is it possible that

reducer1: K3 = 1, V3 = [1,2,3]
reducer2: K3 = 2, V3 = [5,6,9]
reducer3: K3 = 1, V3 = [10,15,22]

As you can see, reducer1 has K3 = 1 and reducer3 also has K3 = 1. So is that case possible, or does each reducer have a unique output key? Thanks in advance, Oleg.
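For what it's worth, the uniqueness guarantee applies to the reducer input key, not the output key: the default partitioner routes every record with a given K2 to exactly one reducer, but nothing constrains the K3 a reducer chooses to emit. So if K3 is derived from K2 by a one-to-one function, no two reducers can emit the same K3; otherwise collisions are possible. A sketch after Hadoop's org.apache.hadoop.mapreduce.lib.partition.HashPartitioner (the class name here is made up; the logic is the stock behavior):

import org.apache.hadoop.mapreduce.Partitioner;

// Identical keys always hash to the same partition number, so all
// records with a given K2 are routed to exactly one reducer.
public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // mask the sign bit so a negative hashCode still yields a valid index
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}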
Re: Delivery Status Notification (Failure)
Hi Simon, MapReduce is a framework developed by Google that uses a programming model based on two functions called Map and Reduce; both the framework and the programming model are called MapReduce. Hadoop is an open-source implementation of MapReduce. HTH, -- Edson Ramiro Lucas Filho http://www.inf.ufpr.br/erlf07/ On 10 June 2010 17:40, Simon Narowki simon.naro...@gmail.com wrote: Thanks, Abhishek, for your answer. But sorry, I still don't understand... What do you mean by the runtime/programming support needed for MapReduce? Could you please mention some other implementations of MapReduce? Cheers, Simon
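To make the two functions concrete, here is the canonical word count sketched against the org.apache.hadoop.mapreduce API (class names are illustrative): map emits a (word, 1) pair per word, and the framework gathers all the pairs for one word and hands them to a single reduce call to sum.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for every line of input, emit a (word, 1) pair per word.
  public static class WCMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce: the framework groups every 1 emitted for a word into one call.
  public static class WCReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WCMapper.class);
    job.setReducerClass(WCReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}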
copyToLocal
Hi, so OK, I am using copyToLocal through an automation script we have and seeing odd results. I am not sure if this is something I am doing wrong, a defect, or there is a known good reason for it. Let me know; I would like to correct this either in my own script (happy to have a go at fixing a bug in the fs code) or in my own understanding, if there is some good reason for this. Scenario:

hadoop fs -copyToLocal event/2010_06_10/81ae7c24745211df9f6d002590008422 /data/2010_06_10/81ae7c24745211df9f6d002590008422

results in my part files showing up in /data/2010_06_10/81ae7c24745211df9f6d002590008422/81ae7c24745211df9f6d002590008422

If I try [Hadoop 0.20.1]

hadoop fs -copyToLocal event/2010_06_10/81ae7c24745211df9f6d002590008422 /data/2010_06_10

the part files show up in just /data/2010_06_10 (with no creation of the UUID directory like before). My desired result is to take the files from event/2010_06_10/81ae7c24745211df9f6d002590008422 and end up with them in /data/2010_06_10/81ae7c24745211df9f6d002590008422, with the trailing directory creating itself as it does in the first scenario, but without duplicating it as it does... weirdly. /* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */
Re: Delivery Status Notification (Failure)
Hadoop is an open source implementation of the runtime/programming support needed for MapReduce. Several different implementations of MapReduce are possible. Google has its own that is different from Hadoop. Abhishek
present at seajug?
hey guys, anyone from your group interested in presenting on hadoop or something related at seajug by any chance? cheers, -- Nimret Sandhu http://www.nimret.com http://www.nimsoft.biz On Wednesday, June 09, 2010 06:41:07 pm Sean Jensen-Grey wrote: Hello Fellow Hadoopists, We are meeting at 7:15 pm on June 17th at the University Heights Community Center, 5031 University Way NE, Seattle WA 98105, Room #110. We are looking for people to present, so if you would like to get the word out, please contact either myself or Chris Wilkes. The meetings are informal and highly conversational. If you have questions about Hadoop and map reduce, this is a great place to ask them. Sean Jensen-Grey se...@seattlehadoop.org Chris Wilkes cwil...@seattlehadoop.org Seattle Hadoop Distributed Computing User Meeting == Bringing Hadoopists Together On the 3rd Thursday of the Month. We focus predominantly on distributed data processing using a map reduce style. The meetings are open to all and free of charge. When: Thursday June 17th, 7:15 prompt start - 8:45. Where: University Heights Community Center, Room 110. Outline for June 17th: Hands-on Pig UDF: Chris Wilkes will walk through code samples on how to create and use your own User Defined Functions with Pig. Compute-bound map reduce: Sean Jensen-Grey will show some strategies for running compute-bound tasks on Hadoop. Please sign up to the list annou...@seattlehadoop.org for late-breaking meeting information and post-meeting communication. Subscribe via email seattlehadoop-announce+subscr...@googlegroups.com or http://groups.google.com/group/seattlehadoop-announce Regards, Sean & Chris http://seattlehadoop.org/
Re: Delivery Status Notification (Failure)
You can find more about MapReduce here: http://labs.google.com/papers/mapreduce.html Some of the implementations (like Hadoop) are listed on this page: http://en.wikipedia.org/wiki/MapReduce Zeev
Re: Delivery Status Notification (Failure)
Hi Edson, thank you for the answer. That's right, MapReduce is the Google framework based on the two functions Map and Reduce. If I understood it correctly, Hadoop is an implementation of the Map and Reduce functions of MapReduce. My question is: does Hadoop include Google's MapReduce framework as well? Regards, Simon
Delivery Status Notification (Failure)
Dear all, I am a new Hadoop user and am a little confused about the difference between Hadoop and MapReduce. Could anyone please clarify it for me? Thanks! Simon
Re: copyToLocal
I think 3 weeks of no sleep caused this. My automation script failed, leaving the directory there, so when it re-ran, THEN it caused this weirdness. I guess if copyToLocal sees the destination directory already existing, it appends the source directory as a child of the local one (so in my first scenario, /data/2010_06_10/81ae7c24745211df9f6d002590008422 already existed because my script created it the first time). Still odd behavior; whatever, it is fine, sorry to bother... I just added a step to my automation script to remove the directory before I do a copyToLocal. -- /* Joe Stein http://www.linkedin.com/in/charmalloc */
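For anyone doing the same thing from Java instead of a shell script, a sketch of that guard (the paths are taken from the messages above, everything else is illustrative): delete the stale local directory first, so copyToLocal recreates it rather than nesting the source directory inside it.

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyToLocalGuard {
  public static void main(String[] args) throws IOException {
    Path src = new Path("event/2010_06_10/81ae7c24745211df9f6d002590008422");
    File dst = new File("/data/2010_06_10/81ae7c24745211df9f6d002590008422");
    // a failed earlier run can leave the target behind, which makes
    // copyToLocal nest the source directory inside it; remove it first
    if (dst.exists()) {
      FileUtil.fullyDelete(dst);
    }
    FileSystem fs = FileSystem.get(new Configuration());
    fs.copyToLocalFile(src, new Path(dst.getAbsolutePath()));
  }
}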