Re: how to get all different values for each key
Hey, I feel a HashSet is a good way to dedup. To increase the overall efficiency you could also look into a Combiner running the same Reducer code. That would ensure less data in the sort-shuffle phase. Regards, Matthew

On Wed, Aug 3, 2011 at 11:52 AM, Jianxin Wang wangjx...@gmail.com wrote: hi, harsh. After the map I can get all values for one key, but I want to dedup these values and get only the unique ones. Now I just do it like the image. I think the following code is not efficient (it uses a HashSet to dedup). Thanks :)

private static class MyReducer
    extends Reducer<LongWritable, LongWritable, LongWritable, LongsWritable> {
  HashSet<Long> uids = new HashSet<Long>();
  LongsWritable unique_uids = new LongsWritable();

  public void reduce(LongWritable key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    uids.clear();
    for (LongWritable v : values) {
      uids.add(v.get());
    }
    int size = uids.size();
    long[] l = new long[size];
    int i = 0;
    for (long uid : uids) {
      l[i] = uid;
      i++;
    }
    unique_uids.set(l);
    context.write(key, unique_uids);
  }
}

2011/8/3 Harsh J ha...@cloudera.com: Use MapReduce :) If the map outputs (key, value), then the reduce input becomes (key, [iterator of values across all maps with (key, value)]). I believe this is very similar to the wordcount example, minus the summing. For a given key, you get all the values that carry that key in the reducer. Have you tried to run a simple program to achieve this before asking? Or is something specifically not working?

On Wed, Aug 3, 2011 at 9:20 AM, Jianxin Wang wangjx...@gmail.com wrote: Hi, I have many key,value pairs now and want to get all different values for each key. Which way is efficient for this work? Such as input: 1,2 1,3 1,4 1,3 2,1 2,2 output: 1,2/3/4 2,1/2 Thanks! walter -- Harsh J
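[Editorial note] A minimal sketch of the Combiner idea suggested above, assuming the new (org.apache.hadoop.mapreduce) API. Because the reducer above emits a LongsWritable value while the map output value is LongWritable, the combiner here is a separate dedup class whose output types match the map output types; the class name and driver lines are illustrative, not from the thread.

import java.io.IOException;
import java.util.HashSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Map-side dedup: emit each (key, value) pair at most once per group so that
// fewer duplicate values travel through the sort-shuffle phase.
public class DedupCombiner
    extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {
  private final HashSet<Long> seen = new HashSet<Long>();

  @Override
  public void reduce(LongWritable key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    seen.clear();
    for (LongWritable v : values) {
      if (seen.add(v.get())) {      // add() returns false for values already seen
        context.write(key, v);
      }
    }
  }
}

// In the (assumed) driver:
//   job.setCombinerClass(DedupCombiner.class);
//   job.setReducerClass(MyReducer.class);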
Global array in OutputFormat
Hi Guys, I intend to record the write pattern of a job using the following record: timestamp, size of the buffer written. In order to obtain this, I was thinking of maintaining a global buffer (Collection<String>) and appending to it whenever a write is called via the OutputFormat class. But I am not really able to figure out under which instance (class hierarchy) to declare such a static buffer so that it is accessible from all OutputFormat write streams. Please help me if you've got some idea on this. Thanks, Matthew John
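[Editorial note] A minimal sketch of one way to do this, assuming a custom RecordWriter is already in play; the class and method names are made up for illustration. One caveat worth stating: each map/reduce task runs in its own child JVM, so a static buffer is only "global" within a single task and has to be flushed per task (for example from RecordWriter.close()).

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;

// Hypothetical holder for the write pattern: one "timestamp,bytes" entry per write.
public final class WriteLog {
  private static final Collection<String> RECORDS =
      Collections.synchronizedCollection(new ArrayList<String>());

  private WriteLog() {}

  public static void record(long bytesWritten) {
    RECORDS.add(System.currentTimeMillis() + "," + bytesWritten);
  }

  public static Collection<String> entries() {
    return RECORDS;
  }
}

// Inside the custom RecordWriter (assumed), every write() would add an entry:
//   WriteLog.record(sizeOfBufferWritten);
// and close() would dump WriteLog.entries() somewhere durable (local log, HDFS, ...).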
Re: Benchmarks with different workloads
I am looking for a compute-intensive benchmark (cpu usage > 60%) for my hadoop cluster. If there is something readily available, it would be great. Thanks, Matthew On Tue, May 31, 2011 at 8:30 PM, Cristina Abad cristina.a...@gmail.com wrote: You could try SWIM [1]. -Cristina [1] Yanpei Chen, Archana Ganapathi, Rean Griffith, Randy Katz. SWIM - Statistical Workload Injector for MapReduce. Available at: http://www.eecs.berkeley.edu/~ychen2/SWIM.html -- Forwarded message -- From: Matthew John tmatthewjohn1...@gmail.com To: common-user common-user@hadoop.apache.org Date: Tue, 31 May 2011 20:01:25 +0530 Subject: Benchmarks with different workloads Hi, I am looking out for Hadoop benchmarks that could characterize the following workloads: 1) IO intensive workload 2) CPU intensive workload 3) Mixed (IO + CPU) workloads. Someone please throw some pointers on these! Thanks, Matthew
IO benchmark ingesting data into HDFS
Hi all, I wanted to use an IO benchmark that reads/writes data from/into HDFS using MapReduce. I thought TestDFSIO does this, but what I understand is that TestDFSIO merely creates the files in a temp folder in the local filesystem of the TaskTracker nodes. Is this correct? How can such an approach test the IOPS delivered by an IO-intensive MapReduce workload? Matthew
Benchmarks with different workloads
Hi, I am looking out for Hadoop benchmarks that could characterize the following workloads: 1) IO intensive workload 2) CPU intensive workload 3) Mixed (IO + CPU) workloads. Someone please throw some pointers on these! Thanks, Matthew
Host-address or Hostname
Hi all, The String[] that is output by InputSplit.getLocations() gives the list of nodes where the input split resides. But the node detail is represented either as the IP address or as the hostname (for example, an entry in the list could be either 10.72.147.109 or mattHDFS1). Is it possible to make this consistent? I am trying to do some work by parsing an ID number embedded in the hostname, and this mixed representation is giving me a hell of a lot of problems. How do I resolve this? Thanks, Matthew
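[Editorial note] One way to normalize the strings on the consumer side, sketched here under the assumption that reverse DNS works for the cluster nodes; the helper class name is made up.

import java.net.InetAddress;
import java.net.UnknownHostException;

public final class Hosts {
  // Turn an InputSplit location into a canonical hostname, whether it arrives
  // as an IP address (10.72.147.109) or as a name (mattHDFS1).
  public static String toHostname(String location) {
    try {
      return InetAddress.getByName(location).getCanonicalHostName();
    } catch (UnknownHostException e) {
      return location;   // fall back to the raw string if it cannot be resolved
    }
  }
}

The embedded ID could then be parsed from the canonical name regardless of which form getLocations() happened to return.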
Re: Host-address or Hostname
Is it possible to get a Host-address to Host-name mapping in the JIP ? Someone please help me with this! Thanks, Matthew On Thu, May 12, 2011 at 5:36 PM, Matthew John tmatthewjohn1...@gmail.comwrote: Hi all, The String[] that is output by the InputSplit.getLocations() gives the list of nodes where the input split resides. But the node detail is either represented as the ip-address or the hostname (for eg - an entry in the list could be either 10.72.147.109 or mattHDFS1 (hostname). Is it possible to make this consistent. I am trying to do some work by parsing an ID number embedded in the Hostname and this mixed representation is giving me hell lot of problems. How to resolve this ? Thanks, Matthew
Bad connection to FS. command aborted
Hi all! I have been trying to figure out why I'm getting this error! All that I did was: 1) Use a single node cluster 2) Made some modifications in the core (in some MapRed modules). Successfully compiled it 3) Tried bin/start-dfs.sh alone. All the required daemons (NN and DN) are up. The NameNode and DataNode logs aren't showing any errors/exceptions. The only interesting thing I found was: *WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 10.72.147.109:40048 got version 94 expected version 3* in the NameNode logs. Someone please help me out of this! Matthew
Re: Bad connection to FS. command aborted
I did an ant jar after modifying two files in the mapred module. This, from what I understand, creates a hadoop-*-core.jar in the build folder. Now I assume that will be used henceforth for any execution. So how can this be a problem if I am running a single-node cluster? Version mismatch with whom? On Wed, May 11, 2011 at 7:07 PM, Habermaas, William william.haberm...@fatwire.com wrote: The Hadoop IPCs are version specific. That is done to prevent an older version from talking to a newer one. Even if nothing has changed in the internal protocols the version check is enforced. Make sure the new hadoop-core.jar from your modification is on the classpath used by the hadoop shell script. Bill -Original Message- From: Matthew John [mailto:tmatthewjohn1...@gmail.com] Sent: Wednesday, May 11, 2011 9:27 AM To: common-user Subject: Bad connection to FS. command aborted Hi all! I have been trying to figure out why I'm getting this error! All that I did was: 1) Use a single node cluster 2) Made some modifications in the core (in some MapRed modules). Successfully compiled it 3) Tried bin/start-dfs.sh alone. All the required daemons (NN and DN) are up. The NameNode and DataNode logs aren't showing any errors/exceptions. The only interesting thing I found was: *WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 10.72.147.109:40048 got version 94 expected version 3* in the NameNode logs. Someone please help me out of this! Matthew
Re: Bad connection to FS. command aborted
org.apache.hadoop.ipc.Server: IPC Server handler 0 on 54310: starting
2011-05-11 19:34:36,622 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 54310: starting
2011-05-11 19:34:36,631 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 54310: starting
2011-05-11 19:34:36,639 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 54310: starting
2011-05-11 19:34:36,640 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 54310: starting
2011-05-11 19:34:36,655 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 54310: starting
2011-05-11 19:34:36,656 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 54310: starting
2011-05-11 19:34:36,658 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 54310: starting
2011-05-11 19:34:36,658 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 54310: starting
2011-05-11 19:34:36,669 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310: starting
2011-05-11 19:37:36,548 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 10.72.147.109:50010 storage DS-1515207802-10.72.147.109-50010-1305118592183
2011-05-11 19:37:36,551 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.72.147.109:50010

Thanks, Matthew

On Wed, May 11, 2011 at 7:13 PM, Habermaas, William william.haberm...@fatwire.com wrote: If the hadoop script is picking up a different hadoop-core jar then the classes that ipc to the NN will be using a different version. Bill -Original Message- From: Matthew John [mailto:tmatthewjohn1...@gmail.com] Sent: Wednesday, May 11, 2011 9:41 AM To: common-user@hadoop.apache.org Subject: Re: Bad connection to FS. command aborted I did an ant jar after modifying two files in the mapred module. This, from what I understand, creates a hadoop-*-core.jar in the build folder. Now I assume that will be used henceforth for any execution. So how can this be a problem if I am running a single-node cluster? Version mismatch with whom? On Wed, May 11, 2011 at 7:07 PM, Habermaas, William william.haberm...@fatwire.com wrote: The Hadoop IPCs are version specific. That is done to prevent an older version from talking to a newer one. Even if nothing has changed in the internal protocols the version check is enforced. Make sure the new hadoop-core.jar from your modification is on the classpath used by the hadoop shell script. Bill -Original Message- From: Matthew John [mailto:tmatthewjohn1...@gmail.com] Sent: Wednesday, May 11, 2011 9:27 AM To: common-user Subject: Bad connection to FS. command aborted Hi all! I have been trying to figure out why I'm getting this error! All that I did was: 1) Use a single node cluster 2) Made some modifications in the core (in some MapRed modules). Successfully compiled it 3) Tried bin/start-dfs.sh alone. All the required daemons (NN and DN) are up. The NameNode and DataNode logs aren't showing any errors/exceptions. The only interesting thing I found was: *WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 10.72.147.109:40048 got version 94 expected version 3* in the NameNode logs. Someone please help me out of this! Matthew
Which datanode serves the data for MR
Hi all, I wanted to know, for an MR job, which tasktracker (node level) works on data (input split) from which datanode (node level). Can some logs provide this data? Or do I need to print it myself - if so, what should I print and how? Thanks, Matthew
bin/start-dfs/mapred.sh with input slave file
Hi all, I see that there is an option to provide a slaves_file as input to bin/start-dfs.sh and bin/start-mapred.sh so that slaves are parsed from this input file rather than the default conf/slaves. Can someone please help me with the syntax for this. I am not able to figure this out. Thanks, Matthew John
Tweak the Daemon start-up
Hi all, Assume I have got (m+n = p) p nodes (excluding the NameNode) in a hadoop cluster. I wanted to initialize the cluster with TaskTracker alone running on m nodes and DataNode alone running on the rest n nodes. How can I achieve such a configuration ? Can I do this by modifying the bin/start-all.sh ? Suggestions please.. Matthew John
Re: HDFS - MapReduce coupling
Someone kindly give some pointers on this! On Mon, May 2, 2011 at 12:46 PM, Matthew John tmatthewjohn1...@gmail.com wrote: Any documentation on how the different daemons do the write/read on HDFS and the Local File System (direct)? I mean the different protocols used in the interactions. I basically wanted to figure out how intricate the coupling between the Storage (HDFS + Local) and other processes in the Hadoop infrastructure is. On Mon, May 2, 2011 at 12:26 PM, Ted Dunning tdunn...@maprtech.com wrote: Yes. There is quite a bit of need for the local file system in clustered mode. For one thing, all of the shuffle intermediate files are on local disk. For another, the distributed cache is actually stored on local disk. HDFS is a frail vessel that cannot cope with all the needs. On Sun, May 1, 2011 at 11:48 PM, Matthew John tmatthewjohn1...@gmail.com wrote: ... 2) Does the Hadoop system utilize the local storage directly for any purpose (without going through HDFS) in clustered mode?
Read and Write throughputs via JVM
Hi all, I wanted to figure out the read and write throughputs that occur in a Map task (read - reading from the input splits, write - writing the map output back) inside a JVM. Do we have any counters that can help me with this? Or where exactly should I focus on tweaking the code to add some additional timestamp outputs (for example, a timestamp at the start and end of a map read)? Thanks, Matthew John
HDFS Compatibility
Hi all, Can HDFS run over a raw disk which is mounted at a mount point with no file system? Or does it interact only with a POSIX-compliant file system? Thanks, Matthew
DFSIO benchmark
Can someone provide pointers/links to DFSIO benchmarks to check the IO performance of HDFS? Thanks, Matthew John
Awareness of Map tasks
Hi all, I had some queries on a Map task's awareness. From what I understand, every map task instance is destined to process the data in a specific input split (which can span HDFS blocks). 1) Do these map tasks have a unique instance number? If yes, are they mapped to their specific input splits, and using what parameters is the mapping done (say, for example, map task number to input file byte offset)? Where exactly is this mapping preserved (at what level - jobtracker, tasktracker or each task)? 2) Coming to a practical scenario, when I run hadoop in local mode: I run a mapreduce job with 10 maps. Since there is an inherent JVM parallelism (say the node can afford to run 2 map task JVMs simultaneously), I assume that some map tasks run concurrently. Since HDFS does not play a role in this case, how is the map-task-instance-to-input-split mapping carried out? Or do we have a concept of input split at all (will all the maps start scanning from the start of the input file)? Please help me with these queries. Thanks, Matthew John
Hadoop code base splits
Hi, Can someone provide me some pointers on the following details of Hadoop code base: 1) breakdown of HDFS code base (approximate lines of code) into following modules: - HDFS at the Datanodes - Namenode - Zookeeper - MapReduce based - Any other relevant split 2) breakdown of Hbase code into following modules: - HMaster - RegionServers - MapReduce - Any other relevant split Matthew John
Iostat on Hadoop
Hi all, Can someone give pointers on using Iostat to account for IO overheads (disk read/writes) in a MapReduce job. Matthew John
Re: hadoop installation problem(single-node)
Hey Manish, Are you giving the commands in the Hadoop_home directory? If yes, please give bin/hadoop namenode -format. Don't forget to prepend bin/ to your commands, because all the scripts reside in the bin directory. Matthew On Wed, Mar 2, 2011 at 2:29 PM, Manish Yadav manish.ya...@orkash.com wrote: Dear Sir/Madam, I'm very new to hadoop. I'm trying to install hadoop on my computer. I followed a weblink and tried to install it. I want to install hadoop on my single node cluster. I'm using Ubuntu 10.04 64-bit as my operating system. I have installed java in /usr/java/jdk1.6.0_24. The steps I took to install hadoop are the following: 1: Make a group hadoop and a user hadoop with home directory in the hadoop directory. I have a directory called projects and downloaded the hadoop binary there, then extracted it there; I configured ssh also. Then I made changes to some files, which are the following. I'm attaching them with this mail, please check them: 1: hadoop_env_sh 2: core-site.xml 3: mapreduce-site.xml 4: hdfs-site.xml 5: hadoop user's .bashrc 6: hadoop user's .profile After making changes to these files, I just entered the hadoop account and entered a few commands, and the following happened:

hadoop@ws40-man-lin:~$ echo $HADOOP_HOME
/home/hadoop/project/hadoop-0.20.0
hadoop@ws40-man-lin:~$ hadoop namenode -format
hadoop: command not found
hadoop@ws40-man-lin:~$ namenode -format
namenode: command not found
hadoop@ws40-man-lin:~$

Now I'm completely stuck and I don't know what to do. Please help me, as there is no more help around the net. I'm attaching the files I changed as well; can you tell me the exact configuration which I should use to install hadoop?
Re: hadoop installation problem(single-node)
Hey Manish, I am not very sure if you have got your configurations correct, including the java path. Can you try re-installing hadoop following the guidelines given in the following link, step by step? That would take care of any glitches possible. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ Thanks, Matthew On Thu, Mar 3, 2011 at 10:42 AM, manish.yadav manish.ya...@orkash.com wrote: Thanks for the help, now the command is working, but I got the following errors. Will you help me solve them? I'm giving you the error list which I faced in installing hadoop on a single node cluster; all the configuration files are attached to the earlier post. I just use the command hadoop@ws40-man-lin:~/project/hadoop-0.20.0$ bin/hadoop namenode -format and I get the following result:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/server/namenode/NameNode
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.server.namenode.NameNode
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
Could not find the main class: org.apache.hadoop.hdfs.server.namenode.NameNode. Program will exit.

Now what am I doing wrong? -- View this message in context: http://lucene.472066.n3.nabble.com/hadoop-installation-problem-single-node-tp2613742p2623014.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: Cost of bytecode execution in MapReduce
Hi Ted, Can u provide a link to the same ? Not able to find it :( . On Thu, Feb 17, 2011 at 9:54 PM, Ted Yu yuzhih...@gmail.com wrote: There was a discussion thread about why hadoop was developed in Java. Please read it. On Wed, Feb 16, 2011 at 10:39 PM, Matthew John tmatthewjohn1...@gmail.comwrote: hi Ted, wanted to know if its development environment specific. Can u throw some light on whether there is any inherent bytecode ececution cost ? I am not using any specific development environment now (like Eclipse) . Matthew On Thu, Feb 17, 2011 at 11:52 AM, Ted Yu yuzhih...@gmail.com wrote: Is your target development environment using C++ ? On Wed, Feb 16, 2011 at 9:49 PM, Matthew John tmatthewjohn1...@gmail.comwrote: Hi all, I wanted to know if the Map/Reduce (Mapper and Reducer) code incurs any fixed cost of ByteCode execution. And how do the mappers (say of WordCount MR) look like in detail (in bytecode detail) ?? Any good pointers to this ? Thanks, Matthew John
Re: Cost of bytecode execution in MapReduce
Hi Ted, I want to basically analyse the cost functions of the MapReduce framework in Hadoop. That would include a good understanding of the byte code execution costs which comes with Mappers and Reducers. I know it might change for different MRs. So I am thinking of taking WordCount and analysing it in-depth. The intention is trying to optimize / tweak MapReduce for different workloads and commodity resources. It will be great if someone can provide some links to work already done. Or help me with some framework which enables bytecode level analysis ( I guess Eclipse could be a good option. But never tried it ). Thanks, Matthew John On Fri, Feb 18, 2011 at 2:54 AM, Ted Yu yuzhih...@gmail.com wrote: Are you investigating alternative map-reduce framework ? Please read: http://www.craighenderson.co.uk/mapreduce/ On Thu, Feb 17, 2011 at 9:45 AM, Matthew John tmatthewjohn1...@gmail.comwrote: Hi Ted, Can u provide a link to the same ? Not able to find it :( . On Thu, Feb 17, 2011 at 9:54 PM, Ted Yu yuzhih...@gmail.com wrote: There was a discussion thread about why hadoop was developed in Java. Please read it. On Wed, Feb 16, 2011 at 10:39 PM, Matthew John tmatthewjohn1...@gmail.comwrote: hi Ted, wanted to know if its development environment specific. Can u throw some light on whether there is any inherent bytecode ececution cost ? I am not using any specific development environment now (like Eclipse) . Matthew On Thu, Feb 17, 2011 at 11:52 AM, Ted Yu yuzhih...@gmail.com wrote: Is your target development environment using C++ ? On Wed, Feb 16, 2011 at 9:49 PM, Matthew John tmatthewjohn1...@gmail.comwrote: Hi all, I wanted to know if the Map/Reduce (Mapper and Reducer) code incurs any fixed cost of ByteCode execution. And how do the mappers (say of WordCount MR) look like in detail (in bytecode detail) ?? Any good pointers to this ? Thanks, Matthew John
Cost of bytecode execution in MapReduce
Hi all, I wanted to know if the Map/Reduce (Mapper and Reducer) code incurs any fixed cost of ByteCode execution. And how do the mappers (say of WordCount MR) look like in detail (in bytecode detail) ?? Any good pointers to this ? Thanks, Matthew John
Re: Cost of bytecode execution in MapReduce
Hi Ted, I wanted to know if it's development environment specific. Can you throw some light on whether there is any inherent bytecode execution cost? I am not using any specific development environment now (like Eclipse). Matthew On Thu, Feb 17, 2011 at 11:52 AM, Ted Yu yuzhih...@gmail.com wrote: Is your target development environment using C++? On Wed, Feb 16, 2011 at 9:49 PM, Matthew John tmatthewjohn1...@gmail.com wrote: Hi all, I wanted to know if the Map/Reduce (Mapper and Reducer) code incurs any fixed cost of bytecode execution. And how do the mappers (say of the WordCount MR) look in detail (in bytecode detail)? Any good pointers to this? Thanks, Matthew John
Mechanism of MapReduce in Hadoop
Hi all, I want to know if anyone has already done an in-depth analysis of the MapReduce mechanism. Has anyone really gone into a bytecode-level understanding of the Map and Reduce mechanism? It would be good if we can take a simple MapReduce (say WordCount) and then try the analysis. Please send me pointers if there's already some work done in this respect. Or please help me with how to proceed with the same analysis if you feel a specific technique/software/development environment has ready plugins to help in this regard. Thanks, Matthew John
HBase documentation
Hi guys, can someone send me some good documentation on HBase (other than the hadoop wiki)? I am also looking for a good HBase tutorial. Regards, Matthew
Re: Could I write outputs in multiple directories?
Hi Junyoung Kim, You can try out MultipleOutputs.addNamedOutput(). The second parameter you pass in is supposed to be the name of the output to which you are writing the reducer output. Therefore, if your output folder is X (set using setOutputPath()), you can try giving A/output, B/output, C/output in the 2nd parameter. It should write the corresponding data to X/A/output, X/B/output and X/C/output respectively, I guess. In the reducer, depending on the key, you can use getCollector() to write to the different output paths. For example: if (Key == A) multipleoutputs.getCollector("A/output", reporter).collect(Key, Value); Regards, Matthew On Mon, Feb 14, 2011 at 11:27 AM, Jun Young Kim juneng...@gmail.com wrote: Hi, As I understand it, Hadoop can write multiple files in a directory, but it can't write output files to multiple directories. Isn't that so? MultipleOutputs is for generating multiple files. FileInputFormat.addInputPaths is for setting several input files simultaneously. What could I do if I want to write output files to multiple directories depending on the key? For example: A type key - MMdd/A/output B type key - MMdd/B/output C type key - MMdd/C/output. Thanks. -- Junyoung Kim (juneng...@gmail.com)
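[Editorial note] For comparison, a minimal old-API sketch of the MultipleOutputs wiring described above, using plain alphanumeric named outputs (A, B); this sketch does not assume that named outputs containing a '/' are accepted, and all files still land under the job's single output directory. Class names and types are illustrative.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// Driver side (assumed): one named output per key type.
//   MultipleOutputs.addNamedOutput(conf, "A", TextOutputFormat.class, Text.class, Text.class);
//   MultipleOutputs.addNamedOutput(conf, "B", TextOutputFormat.class, Text.class, Text.class);

public class TypedReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  private MultipleOutputs mos;

  @Override
  public void configure(JobConf conf) {
    mos = new MultipleOutputs(conf);
  }

  @SuppressWarnings("unchecked")
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Route each record to the named output matching its key type.
    String named = key.toString().startsWith("A") ? "A" : "B";
    while (values.hasNext()) {
      mos.getCollector(named, reporter).collect(key, values.next());
    }
  }

  @Override
  public void close() throws IOException {
    mos.close();
  }
}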
some doubts Hadoop MR
Hi all, I had some doubts regarding the functioning of Hadoop MapReduce: 1) I understand that every MapReduce job is parameterized using an XML file (with all the job configurations). So whenever I set certain parameters using my MR code (say I set the split size to be 32 KB), it does get reflected in the job (number of mappers). How exactly does that happen? Do the parameters coded in the MR module override the default parameters set in the configuration XML? And how does the JobTracker ensure that the configuration is followed by all the TaskTrackers? What is the mechanism followed? 2) Assume I am running cascading (chained) MR modules. In this case I feel there is a huge overhead when the output of MR1 is written back to HDFS and then read from there as the input of MR2. Can this be avoided (maybe stored in some memory without hitting HDFS and the NameNode)? Please let me know if there's some means of doing this, because it would increase the efficiency of chained MR to a great extent. Matthew
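[Editorial note] On (1), a small illustration of the precedence in the driver: values set programmatically are written into the merged job configuration (job.xml), which is what every TaskTracker reads for the job, so they override the XML defaults unless a site property is marked final. The class name and the 32 KB figure below are illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ConfPrecedenceDemo {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ConfPrecedenceDemo.class);
    // Whatever mapred-default.xml / mapred-site.xml say for this property,
    // the value set here wins (unless the site file marks it <final>), and it
    // ships with the job to every TaskTracker as part of job.xml.
    conf.setLong("mapred.min.split.size", 32 * 1024L);   // illustrative property
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}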
Strange byte [] size conflict
Hi all, I have a BytesWritable key that comes to the mapper. If I call key.getLength(), it returns 32. Then I tried creating a new byte[] array, initializing its size to 32 (byte[] keybytes = new byte[32];) and giving: keybytes = key.getBytes(); Now keybytes.length (which should return 32) is returning 48! I don't understand why this is happening! Please help me with this! Thanks, Matthew
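[Editorial note] For what it's worth, BytesWritable.getBytes() returns the internal backing array, which is usually larger than getLength() because the buffer grows with spare capacity (32 bytes of valid data in a 48-byte buffer fits that pattern), so the valid bytes have to be copied out explicitly. A small sketch, with the helper name made up:

import java.util.Arrays;
import org.apache.hadoop.io.BytesWritable;

public final class Bytes {
  // getBytes() is the padded backing array; only the first getLength() bytes
  // are valid, so copy out exactly that many.
  public static byte[] validBytes(BytesWritable w) {
    return Arrays.copyOf(w.getBytes(), w.getLength());
  }
}

// validBytes(key).length would then be 32, matching key.getLength().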
Map-Reduce-Reduce
Hi all, I was working on a MapReduce program which does BytesWritable data processing. But currently I am basically running two MapReduces consecutively to get the final output: Input --(MapReduce1)--> Intermediate --(MapReduce2)--> Output. Here I am running MapReduce2 only to sort the intermediate data on the basis of a key comparator logic. I wanted to cut the number of MapReduces down to just one. I have figured out a logic to do the same, but the only problem is that in my logic I need to run a sort on the Reduce output to get the final output. The flow looks like this: Input --(MapReduce1)--> Output (not sorted). I want to know if it's possible to attach one more Reduce module to the dataflow so that it can perform the inherent sort before the 2nd reduce call. It would look like: Input --(Map)--> MapOutput --(Reduce1)--> Output (not sorted) --(Reduce2, for which Reduce1 acts as a Mapper)--> Output. Please let me know if there can be some means of sorting the output without invoking a separate MapReduce just for the sake of sorting it. Thanks, Matthew
Re: help for using mapreduce to run different code?
Hi Jander, If I understand what you want, you would like to run the map instances of two different mapreduces (so obviously different mapper codes) simultaneously on the same machine. If I am correct, it has got more to do with the setting for the number of simultaneous mapper instances (I guess its default is 2 or 4). And there should be a way to divide the map instances among the two MR modules (to fill up the slot of 4) you want to run together. Please correct me if I am wrong. Wanted to try clearing the air regarding the query :) :). Matthew On Wed, Dec 29, 2010 at 5:47 AM, maha m...@umail.ucsb.edu wrote: Hi Jander, You mean write Map in another language, like python or C? Then yes. Check this http://hadoop.apache.org/common/docs/r0.18.0/streaming.html for Hadoop Streaming. Maha On Dec 28, 2010, at 2:53 PM, Jander g wrote: Hi all, Does Hadoop support the map function running different code? If yes, how to realize this? Thanks in advance! -- Regards, Jander
hdfs with raid
Hi all, I got to know about an HDFS-with-RAID implementation from the following documentation: http://wiki.apache.org/hadoop/HDFS-RAID In the documentation, it says you can find the hadoop-*-raid.jar file which has the libraries to run RAID-HDFS. Where do I get this file? I searched a lot, but could not get my hands on it. Thanks, Matthew
Re: InputFormat for a big file
//So can you guide me to write a InputFormat which splits the file //into multiple Splits The more mappers you assign, the more input splits in the mapreduce; in effect, the number of input splits is equal to the number of mappers assigned. That should take care of the problem, I guess. Matthew On Fri, Dec 17, 2010 at 9:28 PM, madhu phatak phatak@gmail.com wrote: Hi, I have a very large file of size 1.4 GB. Each line of the file is a number. I want to find the sum of all those numbers. I wanted to use NLineInputFormat as the InputFormat, but it sends only one line to the Mapper, which is very inefficient. So can you guide me to write an InputFormat which splits the file into multiple splits, so each mapper can read multiple lines from its split? Regards Madhukar
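[Editorial note] As a side note on the original problem (summing one number per line), plain TextInputFormat already produces multi-line splits sized from the HDFS block size, so an old-API job along the following lines would avoid the one-line-per-mapper behaviour. Class names are made up, and the reducer can double as a combiner since it just adds longs.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: parse the single number on each line and emit it under one constant key.
public class SumMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, LongWritable> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<NullWritable, LongWritable> out, Reporter reporter)
      throws IOException {
    out.collect(NullWritable.get(),
                new LongWritable(Long.parseLong(line.toString().trim())));
  }
}

// Reduce (and combine): add up all the values for the single key.
class SumReducer extends MapReduceBase
    implements Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
  public void reduce(NullWritable key, Iterator<LongWritable> values,
                     OutputCollector<NullWritable, LongWritable> out, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    out.collect(key, new LongWritable(sum));
  }
}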
Hadoop 0.20.2 with eclipse in windows
Hi all, I have been working with Hadoop 0.20.2 on linux nodes. Now I want to try the same version with eclipse on a windows xp machine. Could someone provide a tutorial/guidelines on how to install this setup? Thanks, Matthew
Re: Hadoop 0.20.2 with eclipse in windows
I tried installing using this link, but as in the tutorial, when I try to run bin/hadoop namenode -format it gives the following error: bin/hadoop: line 2: $'\r': command not found and many such statements. I've given the local jdk folder as the java_home. Not sure why this is showing up. I've not used Cygwin till now. Matthew On Tue, Dec 14, 2010 at 9:38 AM, Harsh J qwertyman...@gmail.com wrote: Hi, On Tue, Dec 14, 2010 at 9:22 AM, Matthew John tmatthewjohn1...@gmail.com wrote: Hi all, I have been working with Hadoop 0.20.2 on linux nodes. Now I want to try the same version with eclipse on a windows xp machine. Could someone provide a tutorial/guidelines on how to install this setup. This page's instructions still work for running a Hadoop cluster on Windows + the Plugin w/ Cygwin: http://ebiquity.umbc.edu/Tutorials/Hadoop/00%20-%20Intro.html thanks, Matthew -- Harsh J www.harshj.com
Hadoop Certification Progamme
Hi all, Is there any valid Hadoop certification available? Something which adds credibility to your Hadoop expertise. Matthew
Tweaking the File write in HDFS
Hi all, I have been working with MapReduce and HDFS for some time. The procedure I normally follow is: 1) copy the input file from the Local File System to HDFS 2) run the map reduce module 3) copy the output file back to the Local File System from HDFS. But I feel steps 1 and 3 add a lot of overhead to the entire process! My queries are: 1) I am getting the files into the Local File System by establishing a port connection with another node. So can I ensure that the data which is ported into the hadoop node is directly written to HDFS instead of going through the Local File System and then performing a CopyFromLocal? 2) Can I copy the reduce output (which creates the final output file) directly to the Local File System instead of injecting it into HDFS (effectively onto different nodes in HDFS), so that I can minimize the overhead? I expect this procedure to take much less time than copying to HDFS and then performing a CopyToLocal. Finally, I should be able to send this file back to another node using socket communication. Looking forward to your suggestions! Thanks, Matthew John
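[Editorial note] On (1), the FileSystem API can stream straight into HDFS from any InputStream, so data arriving over the port connection never has to touch the local file system first. A sketch with assumed host, port and path names:

import java.io.InputStream;
import java.net.Socket;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SocketToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();            // picks up core-site.xml
    FileSystem fs = FileSystem.get(conf);
    Socket socket = new Socket("source-host", 5555);     // assumed data source
    InputStream in = socket.getInputStream();
    FSDataOutputStream out = fs.create(new Path("/user/matthew/input/infile"));
    try {
      IOUtils.copyBytes(in, out, conf, false);           // stream socket -> HDFS
    } finally {
      out.close();
      in.close();
      socket.close();
    }
  }
}

The direction in (2) is symmetric: fs.open(path) gives an InputStream that can be copied straight to a socket's OutputStream without an intermediate CopyToLocal.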
Multiple Input
Hi all, I modified a MapReduce code which had only a single input path to accommodate multiple inputs. The changes I made (in the driver file):

Path FpdbInputPath = new Path(args[0]);
Path ClogInputPath = new Path(args[1]);
FpdbInputPath = FpdbInputPath.makeQualified(FpdbInputPath.getFileSystem(job));
ClogInputPath = ClogInputPath.makeQualified(ClogInputPath.getFileSystem(job));
MultipleInputs.addInputPath(job, FpdbInputPath, Dup1InputFormat.class, Dup1FpdbMapper.class);
MultipleInputs.addInputPath(job, ClogInputPath, Dup1InputFormat.class, Dup1ClogMapper.class);

But when I run the program it gives the exception: java.io.IOException: No input paths specified in job at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:152) and so on. Is it because there is a default input directory set, and when it finds there is nothing there it gives out the error? Please help me out of this. Thanks, Matthew
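[Editorial note] For comparison only, a minimal old-API (org.apache.hadoop.mapred) wiring that submits a job through MultipleInputs; this is a sketch, not a diagnosis of the error above, and it assumes the same JobConf instance that receives the addInputPath() calls is the one actually submitted. Dup1Reducer is a made-up name; the other classes are the ones named in the message.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class TwoInputDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TwoInputDriver.class);
    // Each input path gets its own InputFormat and Mapper; MultipleInputs
    // records the mapping in the job configuration before submission.
    MultipleInputs.addInputPath(conf, new Path(args[0]),
        Dup1InputFormat.class, Dup1FpdbMapper.class);
    MultipleInputs.addInputPath(conf, new Path(args[1]),
        Dup1InputFormat.class, Dup1ClogMapper.class);
    conf.setReducerClass(Dup1Reducer.class);             // assumed reducer
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    JobClient.runJob(conf);
  }
}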
Multiple input not working
Hi all, I modified a MapReduce code which had only a single input path to accommodate multiple inputs. The changes I made (in the driver file):

Path FpdbInputPath = new Path(args[0]);
Path ClogInputPath = new Path(args[1]);
FpdbInputPath = FpdbInputPath.makeQualified(FpdbInputPath.getFileSystem(job));
ClogInputPath = ClogInputPath.makeQualified(ClogInputPath.getFileSystem(job));
MultipleInputs.addInputPath(job, FpdbInputPath, Dup1InputFormat.class, Dup1FpdbMapper.class);
MultipleInputs.addInputPath(job, ClogInputPath, Dup1InputFormat.class, Dup1ClogMapper.class);

But when I run the program it gives the exception: java.io.IOException: No input paths specified in job at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:152) and so on. Is it because there is a default input directory set, and when it finds there is nothing there it gives out the error? Please help me out of this. Note: I am using Hadoop 0.20.2. Is it by any chance because of the version that MultipleInputs is not updating map.input.dir with the paths added? Thanks, Matthew
Reduce groups
Hi all, The number of reducer groups in my MapReduce is always the same as the number of records output by the MapReduce. So what I understand is that every record from the shuffle/sort is going to a different Reducer.reduce call. How can I change this? My key is BytesWritable, and I tried writing my own comparator and setting it with setOutputValueGroupingClass, but still no more than one record is entering the same reduce group. Someone please tell me the mechanism behind this so that I can fix the problem. I am not worrying about the Partitioner since I am using a single reducer. Thanks, Matthew
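[Editorial note] For reference, the grouping comparator only merges keys that it reports as equal and that the sort order has already placed next to each other, so it has to return 0 for keys that should share a reduce() call. A sketch that groups BytesWritable keys on their first 8 bytes; the 8-byte grouping field is an assumption, not something from the thread.

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.WritableComparator;

// Keys that agree on their first 8 bytes end up in the same reduce() call,
// even though their remaining bytes differ.
public class PrefixGroupingComparator extends WritableComparator {
  public PrefixGroupingComparator() {
    super(BytesWritable.class, true);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // BytesWritable serializes a 4-byte length followed by the payload,
    // so skip the length header and compare only the 8-byte prefix.
    return compareBytes(b1, s1 + 4, 8, b2, s2 + 4, 8);
  }
}

// Old-API driver (assumed): conf.setOutputValueGroupingClass(PrefixGroupingComparator.class);

Note that the sort comparator must order keys so that records sharing the prefix are adjacent; otherwise the grouping comparator never sees them together.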
Multiple Input Data Processing using MapReduce
Hi all , I have been recently working on a task where I need to take in two input (types) files , compare them and produce a result from it using a logic. But as I understand simple MapReduce implementations are for processing a single input type. The closest implementation I could think of similar to my work is Join MapReduce. But I am not able to understand much from the example provided in Hadoop .. Can someone provide a good pointer to such multiple input data processing ( or Join ) in mapreduce . It will also be great if you can send in some sample code for the same. Thanks , Matthew
doubts
Hi all, I had some doubts: 1) What happens when a mapper running on node A needs data from a block it doesn't have? (The block might be present on some other node in the cluster.) 2) The Sort/Shuffle phase is just a logical representation of all map outputs sorted together, right? And again, what happens when a reduce on node C needs access to some map outputs not in its memory? Matthew.
Re: Easy Question
Hi Maha, try the following: go to your dfs.data.dir/current. You will find a file VERSION. Just modify the namespace id in it to your namespace id found in the log (in the previous post -- 200395975). Restart hadoop (bin/start-all.sh) and see if all the daemons are up. Regards, Matthew
changing SequenceFile format
Hi guys, I wanted to take in a file with the layout <key1><value1><key2><value2>... - a binary sequence file (key and value lengths are constant) - as input for the Sort (examples). But as I understand it, the data in a standard Hadoop SequenceFile is in the format <record length><key length><key><value>. Where should I modify the code so as to use my input file as input to the record reader? Please pour in your views. Matthew
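[Editorial note] One alternative to modifying SequenceFile is a custom old-API RecordReader that reads the fixed-width records directly; a sketch assuming an 8-byte key, a 32-byte value and splits aligned to the 40-byte record size (all three are assumptions).

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

// Reads records laid out as <8-byte key><32-byte value>... with no per-record
// framing (unlike SequenceFile's <record length><key length><key><value>).
public class FixedLengthRecordReader implements RecordReader<BytesWritable, BytesWritable> {
  private static final int KEY_LEN = 8, VAL_LEN = 32;    // assumed record layout
  private final FSDataInputStream in;
  private final long start, end;
  private long pos;

  public FixedLengthRecordReader(JobConf conf, FileSplit split) throws IOException {
    FileSystem fs = split.getPath().getFileSystem(conf);
    in = fs.open(split.getPath());
    start = split.getStart();
    end = start + split.getLength();
    in.seek(start);
    pos = start;
  }

  public boolean next(BytesWritable key, BytesWritable value) throws IOException {
    if (pos + KEY_LEN + VAL_LEN > end) {
      return false;                                      // no full record left in this split
    }
    byte[] k = new byte[KEY_LEN];
    byte[] v = new byte[VAL_LEN];
    in.readFully(k);
    in.readFully(v);
    key.set(k, 0, KEY_LEN);
    value.set(v, 0, VAL_LEN);
    pos += KEY_LEN + VAL_LEN;
    return true;
  }

  public BytesWritable createKey() { return new BytesWritable(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() throws IOException { return pos; }
  public float getProgress() throws IOException {
    return end == start ? 0.0f : (pos - start) / (float) (end - start);
  }
  public void close() throws IOException { in.close(); }
}

A matching FileInputFormat subclass would hand this reader out from getRecordReader() and should either mark files non-splittable or size splits to a multiple of 40 bytes.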
Re: changing SequenceFile format
When it comes to the Writer, I can see the append and appendRaw methods. But the (many!) next methods in the Reader are confusing. Can you give further info on them? Matthew
Error: Java heap space
Hi all, I tried to run a customised sort with the following details: * I have a metafile to be sorted. So, on a testing basis, I created a SequenceFile version of the metafile by concatenating a SequenceFile-generated header with the record part of the metafile (I kept it in the same sequence -- record length, key length, key, value). * I also implemented the writables for the key and value in my records. * I also implemented the input/output formats for my records (not sure whether they are correct). * I tried running this customized Sort with the new parameters and input file. I also gave the number of maps and reduces both as 1. I am getting the following error: *Task Id : attempt_201009082009_0006_m_00_0, Status : FAILED* *Error: Java heap space* Someone please throw some light on this. Thanks, Matthew John
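[Editorial note] If the failure really is heap exhaustion inside the map task's child JVM (rather than a bug in the custom record handling), the usual knob in this Hadoop generation is the child JVM heap; a tiny driver fragment, with the class name and the 512 MB figure picked arbitrarily.

import org.apache.hadoop.mapred.JobConf;

public class HeapSettingDemo {
  public static void main(String[] args) {
    JobConf conf = new JobConf(HeapSettingDemo.class);
    // The default child heap in 0.20 is -Xmx200m; raise it for memory-heavy tasks.
    conf.set("mapred.child.java.opts", "-Xmx512m");
  }
}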
Re: How to rebuild Hadoop ??
Thanks Jeff! Following what you have said, I built my hadoop core jar first (command - ant jar). That created a hadoop-core.jar in the build. Now can you please tell me how to use this as a dependency for building examples.jar? Because if I give ant examples, it gives errors saying the new classes I've included in the core are not found. I suppose that's because it's using the old hadoop-core.jar. Thanks, Matthew John
Re: How to rebuild Hadoop ??
Hey Jeff, I gave the command: bin/hadoop jar hadoop-0.20.2-examples.jar sort -libjars ./build/hadoop-0.20.3-dev-core.jar -inFormat org.apache.hadoop.mapred.MetafileInputFormat -outFormat org.apache.hadoop.mapred.MetafileOutputFormat -outKey org.apache.hadoop.io.FpMetaId -outValue org.apache.hadoop.io.FpMetadata fp_input fp_output where hadoop-0.20.3-dev-core.jar is the new core jar (built using the command ant jar), whereas hadoop-0.20.2-examples.jar is still the same old examples jar file (I could not make the new examples jar using ant examples, since it doesn't have the latest dependencies on the new classes I have defined). The other parameters are the new classes I want to use for running Sort. I feel I should make the new examples jar but don't know how to :( :( Please tell me how to give the new core jar as a parameter when running ant examples. I am getting the following errors when I ran the command: java.lang.ClassNotFoundException: org.apache.hadoop.mapred.MetafileInputFormat and so on. Thanks, Matthew
Re: How to rebuild Hadoop ??
<target name="examples" depends="jar, compile-examples" description="Make the Hadoop examples jar.">
  <jar jarfile="${build.dir}/${final.name}-examples.jar" basedir="${build.examples}">
    <manifest>
      <attribute name="Main-Class" value="org/apache/hadoop/examples/ExampleDriver"/>
    </manifest>
  </jar>
</target>

This is the part of the build.xml which creates the examples jar. Here (from what I understand) it is stated that it depends on jar, which is the hadoop core jar. I have a feeling it's still depending on the older version of the core jar, and so it's not able to find the classes which are not present in that older core jar. Therefore it gives ClassNotFound. I want to make a new examples jar which depends on the new core jar. Please guide me on that and let me know if my understanding is wrong. Thanks, Matthew
Re: How to rebuild Hadoop ??
Hey Guys! Finally my examples.jar got built :) :) It was just a small error - I hadn't declared the package in some of the newly written files :P Now I will run the command: bin/hadoop jar hadoop-0.20.2-examples.jar (the new one) sort -inFormat org.apache.hadoop.mapred.MetafileInputFormat -outFormat org.apache.hadoop.mapred.MetafileOutputFormat -outKey org.apache.hadoop.io.FpMetaId -outValue org.apache.hadoop.io.FpMetadata fp_input fp_output and see what happens! Thanks a lot for your time. Matthew
Re: Sort with customized input/output !!
Thanks for the reply Ted !! What I understand is that a SequenceFile will have a header followed by the records in a format : Recordlength,Keylength,Key,Value with a sync marker coming at some regular interval.. It would be great if someone can take a look at the following.. Q 1) The thing is my file is basically in the format : header ( a different one) followed by Record (Key Value). In this case the size of Record and Key is fixed.I would like to know* if I can modify the core code to make the SequenceFile format like this *. If yes what code should I look at ?? Q 2) *What is a Sync marker (can we define it )* ? Obviously my file would not be having this. Can someone suggest a way to get around this obstacle. My final aim is to take this file in , sort it with respect to Key and print the sorted file .. Thanks, Matthew
Re: SequenceFile Header
Hi Edward , Thanks for your reply. My aim is not to generate a SequenceFile. It is to take a file (of a certain format) and sort it. So I guess I should create a input SequenceFile from the original file and feed it to the Sort as input. Now the output will again be SequenceFile format and I will have to convert it back to my original file format. So I am right now more concerned about step 1 (conversion of original file to input sequence file) and step 3 (conversion of output sequence file to original file format) .. It would be great if you can suggest some ways of doing that. Also please correct me if my approach is wrong.. Thanks, Matthew
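[Editorial note] For step 1, SequenceFile.Writer can do the conversion without touching the example code at all: read each fixed-length record from the original file and append it as a (key, value) pair, and the writer takes care of the header and sync markers itself. A sketch with assumed record sizes and command-line paths (args[0] = local metafile, args[1] = HDFS output); step 3 would be the mirror image using SequenceFile.Reader.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;

public class MetafileToSequenceFile {
  public static void main(String[] args) throws Exception {
    final int KEY_LEN = 8, VAL_LEN = 32;                 // assumed record layout
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), BytesWritable.class, BytesWritable.class);
    DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
    long records = new File(args[0]).length() / (KEY_LEN + VAL_LEN);
    byte[] k = new byte[KEY_LEN];
    byte[] v = new byte[VAL_LEN];
    try {
      for (long i = 0; i < records; i++) {
        in.readFully(k);
        in.readFully(v);
        // The writer adds the record/key lengths and periodic sync markers.
        writer.append(new BytesWritable(k), new BytesWritable(v));
      }
    } finally {
      writer.close();
      in.close();
    }
  }
}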
Sort with customized input/output !!
Hey, I'm pretty new to Hadoop. I need to sort a metafile (TBs) and thought of using the Hadoop Sort (in examples) for it. My input metafile is a binary stream (only 1's and 0's). It basically contains records of 40 bytes. Every record goes like this: long a; key -- 8 bytes. The rest of the structure will be the value -- 32 bytes: long b; int c; int d; int e; int unprocessed; int compress_attempted; int gatherer; I have created a *FpMetaId.java (extends BytesWritable)* corresponding to the value and *FpMetadata.java (extends BytesWritable)* corresponding to the key. My sole aim is to get these records (40 bytes) sorted with the fp (double) as the key. And I need to write these sorted records back into a metafile (exactly my old metafile, but with the records sorted - binaries only). I also implemented: *MetafileInputFormat.java (extends SequenceFileAsBinaryInputFormat)* --- a file making an input file format compatible with my record. *MetafileOutputFormat<K, V> extends SequenceFileOutputFormat* --- a file making the output file format compatible with my record. *MetafileRecordReader.java (extends SequenceFileAsBinaryInputFormat.SequenceFileAsBinaryRecordReader)* --- a file implementing the record reader compatible with my record. The MetafileRecordWriter class has been implemented within my MetafileOutputFormat.java file. Let me kindly take you through the sequence of events which followed: 1) I resolved all the errors in the writable classes (FpMetaId, FpMetadata), in the in/out formats (MetafileInputFormat, MetafileOutputFormat) and in the RecordReaders I implemented. 2) The Writables I copied to the /io folder. The other new files were copied to the /mapred folder. I successfully built it. 3) I modified the Sort file (the function I want to run with FpMetaId as key and FpMetadata as value, and imported these new classes in the file). I changed the default conf settings to these required Writables and RecordReaders. I built hadoop using the ant command after this. It successfully got built. *Q) Does this ensure all the new changes have got reflected in the jar (am I ready to go execute the sort function?)* 4) As I had already mentioned before, I am working with a sequential file format (binary) with a datastructure (key, value) repeating. So I wrote a C code which generates random values for my datastructure and populated a file, sequentially writing (in binary) my (key, value) datastructure. I gave this as my input for the sort, which should sort my (key, value)s with respect to the keys. I got the error: fp_input not a SequenceFile (fp_input is my input file). I thought SeqFiles would just be streams of binaries. Do they contain any specific format? *Command used: bin/hadoop jar hadoop-0.20.2-examples.jar sort fp_input fp_output* *Q) What does this imply? I have no clue how to proceed further. Again, is it because the jar file used to execute doesn't have the latest libraries? I could not get any good tutorials on this.* It would be great if someone can offer a helping hand to this noob. Thanks, Matthew John
How to rebuild Hadoop ??
Hi all, I wrote some new writable files corresponding to my data input. I added them to /src/org//io/ where all the writables reside. Similarly, I also wrote input/output format files and a recordreader and added them to src/mapred/./mapred/ where all the related files reside. I want to run the Sort function (in examples) with these new classes (writables, recordreader, i/o format). So I also modified Sort to incorporate these files and imported them in the Sort.java file. After all this, I gave an ant clean and then an ant command to build everything fresh. But nothing really happened, I guess, because when I run the program it gives ClassNotFoundException for the classes I give as parameters in the command. Someone please help me out! How do I modify the core files (incorporate more core io/mapred files) in HADOOP? Thanks, Matthew John
Re: How to rebuild Hadoop ??
Thanks a lot Jeff! The problem is that every time I build (using ant) a build folder is created, but there is no examples.jar created inside it. I wanted to add some files into the io package and the mapred package, so I suppose I should put the files appropriately (inside the io and mapred folders respectively). I want to run the Sort in examples.jar using these added classes. I guess I can import these new files in the Sort code and build the entire thing again. But I am not able to figure out how to rebuild the core-containing jar and the examples jar with the modified sort.