Re: Are SequenceFiles split? If so, how?
In addition to what Aaron mentioned, you can configure the minimum split size in hadoop-site.xml to have smaller or larger input splits depending on your application. -Jim On Mon, Apr 20, 2009 at 12:18 AM, Aaron Kimball aa...@cloudera.com wrote: Yes, there can be more than one InputSplit per SequenceFile. The file will be split more-or-less along 64 MB boundaries. (The actual edges of the splits will be adjusted to hit the next block of key-value pairs, so they might be a few kilobytes off.) The SequenceFileInputFormat regards mapred.map.tasks (conf.setNumMapTasks()) as a hint, not a set-in-stone metric. (The number of reduce tasks, though, is always 100% user-controlled.) If you need exact control over the number of map tasks, you'll need to subclass it and modify this behavior. That having been said -- are you sure you actually need to precisely control this value? Or is it enough to know how many splits were created? - Aaron On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman b.wag...@comcast.net wrote: Suppose a SequenceFile (containing keys and values that are BytesWritable) is used as input. Will it be divided into InputSplits? If so, what criteria are used for splitting? I'm interested in this because I need to control the number of map tasks used, which (if I understand it correctly) is equal to the number of InputSplits. thanks, bw
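For reference, a job-setup fragment along the lines of the two knobs mentioned above (0.19-era API) might look roughly like the sketch below; the job class name is illustrative, the map task count is only a hint, and the minimum split size just keeps splits from dropping below a chosen threshold.

    JobConf conf = new JobConf(MySequenceFileJob.class);        // illustrative job class
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setNumMapTasks(20);                                     // treated as a hint, not an exact count
    conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);   // lower bound on split size, in bytes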
Re: getting DiskErrorException during map
Yes, here is how it looks: <property> <name>hadoop.tmp.dir</name> <value>/scratch/local/jim/hadoop-${user.name}</value> </property> so I don't know why it still writes to /tmp. As a temporary workaround, I created a symbolic link from /tmp/hadoop-jim to /scratch/... and it works fine now, but if you think this might be considered a bug, I can report it. Thanks, Jim On Thu, Apr 16, 2009 at 12:44 PM, Alex Loddengaard a...@cloudera.com wrote: Have you set hadoop.tmp.dir away from /tmp as well? If hadoop.tmp.dir is set somewhere in /scratch vs. /tmp, then I'm not sure why Hadoop would be writing to /tmp. Hope this helps! Alex On Wed, Apr 15, 2009 at 2:37 PM, Jim Twensky jim.twen...@gmail.com wrote: Alex, Yes, I bounced the Hadoop daemons after I changed the configuration files. I also tried setting $HADOOP_CONF_DIR to the directory where my hadoop-site.xml file resides but it didn't work. However, I'm sure that HADOOP_CONF_DIR is not the issue because other properties that I changed in hadoop-site.xml seem to be properly set. Also, here is a section from my hadoop-site.xml file: <property> <name>hadoop.tmp.dir</name> <value>/scratch/local/jim/hadoop-${user.name}</value> </property> <property> <name>mapred.local.dir</name> <value>/scratch/local/jim/hadoop-${user.name}/mapred/local</value> </property> I also created /scratch/local/jim/hadoop-jim/mapred/local on each task tracker since I know directories that do not exist are ignored. When I manually ssh to the task trackers, I can see that the directory /scratch/local/jim/hadoop-jim/dfs is automatically created, so it seems like hadoop.tmp.dir is set properly. However, hadoop still creates /tmp/hadoop-jim/mapred/local and uses that directory for the local storage. I'm starting to suspect that mapred.local.dir is overwritten to a default value of /tmp/hadoop-${user.name} somewhere inside the binaries. -jim On Tue, Apr 14, 2009 at 4:07 PM, Alex Loddengaard a...@cloudera.com wrote: First, did you bounce the Hadoop daemons after you changed the configuration files? I think you'll have to do this. Second, I believe 0.19.1 has hadoop-default.xml baked into the jar. Try setting $HADOOP_CONF_DIR to the directory where hadoop-site.xml lives. For whatever reason your hadoop-site.xml (and the hadoop-default.xml you tried to change) are probably not being loaded. $HADOOP_CONF_DIR should fix this. Good luck! Alex On Mon, Apr 13, 2009 at 11:25 AM, Jim Twensky jim.twen...@gmail.com wrote: Thank you Alex, you are right. There are quotas on the systems that I'm working on. However, I tried to change mapred.local.dir as follows: --inside hadoop-site.xml: <property> <name>mapred.child.tmp</name> <value>/scratch/local/jim</value> </property> <property> <name>hadoop.tmp.dir</name> <value>/scratch/local/jim</value> </property> <property> <name>mapred.local.dir</name> <value>/scratch/local/jim</value> </property> and observed that the intermediate map outputs are still being written under /tmp/hadoop-jim/mapred/local. I'm confused at this point since I also tried setting these values directly inside the hadoop-default.xml and that didn't work either. Is there any other property that I'm supposed to change? I tried searching for /tmp in the hadoop-default.xml file but couldn't find anything else. Thanks, Jim On Tue, Apr 7, 2009 at 9:35 PM, Alex Loddengaard a...@cloudera.com wrote: The getLocalPathForWrite function that throws this Exception assumes that you have space on the disks that mapred.local.dir is configured on. Can you verify with `df` that those disks have space available?
You might also try moving mapred.local.dir off of /tmp if it's configured to use /tmp right now; I believe some systems have quotas on /tmp. Hope this helps. Alex On Tue, Apr 7, 2009 at 7:22 PM, Jim Twensky jim.twen...@gmail.com wrote: Hi, I'm using Hadoop 0.19.1 and I have a very small test cluster with 9 nodes, 8 of them being task trackers. I'm getting the following error and my jobs keep failing when map processes start hitting 30%: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_200904072051_0001/attempt_200904072051_0001_m_00_1/output/file.out at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335
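One quick way to check which values actually get resolved from your configuration directory (and therefore whether hadoop-site.xml under $HADOOP_CONF_DIR is being loaded at all) is a small diagnostic like the sketch below; this is not from the thread, just a hedged suggestion using the standard configuration API.

    import org.apache.hadoop.mapred.JobConf;

    public class PrintLocalDirs {
      public static void main(String[] args) {
        JobConf conf = new JobConf();   // loads hadoop-default.xml and hadoop-site.xml from the classpath
        System.out.println("hadoop.tmp.dir   = " + conf.get("hadoop.tmp.dir"));
        System.out.println("mapred.local.dir = " + conf.get("mapred.local.dir"));
      }
    }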
Re: Hadoop basic question
http://wiki.apache.org/hadoop/FAQ#7 On Thu, Apr 16, 2009 at 6:52 PM, Jae Joo jaejo...@gmail.com wrote: Will anyone guide me on how to avoid the single point of failure of the master node? This is what I know: if the master node is down for some reason, the hadoop system is down and there is no way to have a failover system for the master node. Please correct me if I am not understanding correctly. Jae
Re: getting DiskErrorException during map
Alex, Yes, I bounced the Hadoop daemons after I changed the configuration files. I also tried setting $HADOOP_CONF_DIR to the directory where my hadoop-site.xml file resides but it didn't work. However, I'm sure that HADOOP_CONF_DIR is not the issue because other properties that I changed in hadoop-site.xml seem to be properly set. Also, here is a section from my hadoop-site.xml file: <property> <name>hadoop.tmp.dir</name> <value>/scratch/local/jim/hadoop-${user.name}</value> </property> <property> <name>mapred.local.dir</name> <value>/scratch/local/jim/hadoop-${user.name}/mapred/local</value> </property> I also created /scratch/local/jim/hadoop-jim/mapred/local on each task tracker since I know directories that do not exist are ignored. When I manually ssh to the task trackers, I can see that the directory /scratch/local/jim/hadoop-jim/dfs is automatically created, so it seems like hadoop.tmp.dir is set properly. However, hadoop still creates /tmp/hadoop-jim/mapred/local and uses that directory for the local storage. I'm starting to suspect that mapred.local.dir is overwritten to a default value of /tmp/hadoop-${user.name} somewhere inside the binaries. -jim On Tue, Apr 14, 2009 at 4:07 PM, Alex Loddengaard a...@cloudera.com wrote: First, did you bounce the Hadoop daemons after you changed the configuration files? I think you'll have to do this. Second, I believe 0.19.1 has hadoop-default.xml baked into the jar. Try setting $HADOOP_CONF_DIR to the directory where hadoop-site.xml lives. For whatever reason your hadoop-site.xml (and the hadoop-default.xml you tried to change) are probably not being loaded. $HADOOP_CONF_DIR should fix this. Good luck! Alex On Mon, Apr 13, 2009 at 11:25 AM, Jim Twensky jim.twen...@gmail.com wrote: Thank you Alex, you are right. There are quotas on the systems that I'm working on. However, I tried to change mapred.local.dir as follows: --inside hadoop-site.xml: <property> <name>mapred.child.tmp</name> <value>/scratch/local/jim</value> </property> <property> <name>hadoop.tmp.dir</name> <value>/scratch/local/jim</value> </property> <property> <name>mapred.local.dir</name> <value>/scratch/local/jim</value> </property> and observed that the intermediate map outputs are still being written under /tmp/hadoop-jim/mapred/local. I'm confused at this point since I also tried setting these values directly inside the hadoop-default.xml and that didn't work either. Is there any other property that I'm supposed to change? I tried searching for /tmp in the hadoop-default.xml file but couldn't find anything else. Thanks, Jim On Tue, Apr 7, 2009 at 9:35 PM, Alex Loddengaard a...@cloudera.com wrote: The getLocalPathForWrite function that throws this Exception assumes that you have space on the disks that mapred.local.dir is configured on. Can you verify with `df` that those disks have space available? You might also try moving mapred.local.dir off of /tmp if it's configured to use /tmp right now; I believe some systems have quotas on /tmp. Hope this helps. Alex On Tue, Apr 7, 2009 at 7:22 PM, Jim Twensky jim.twen...@gmail.com wrote: Hi, I'm using Hadoop 0.19.1 and I have a very small test cluster with 9 nodes, 8 of them being task trackers.
I'm getting the following error and my jobs keep failing when map processes start hitting 30%: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_200904072051_0001/attempt_200904072051_0001_m_00_1/output/file.out at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124) at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1209) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:867) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) at org.apache.hadoop.mapred.Child.main(Child.java:158) I googled many blogs and web pages but I could neither understand why this happens nor find a solution to it. What does that error message mean, and how can I avoid it? Any suggestions? Thanks in advance, -jim
Re: Total number of records processed in mapper
Hi Andy, Take a look at this piece of code: Counters counters = job.getCounters(); counters.findCounter("org.apache.hadoop.mapred.Task$Counter", "REDUCE_INPUT_RECORDS").getCounter() This is for reduce input records, but I believe there is also a counter for reduce output records. You should dig into the source code to find out what it is because, unfortunately, the default counters associated with the map/reduce jobs are not public yet. -Jim On Tue, Apr 14, 2009 at 11:19 AM, Andy Liu andyliu1...@gmail.com wrote: Is there a way for all the reducers to have access to the total number of records that were processed in the Map phase? For example, I'm trying to perform a simple document frequency calculation. During the map phase, I emit (word, 1) pairs for every unique word in every document. During the reduce phase, I sum the values for each word group. Then I want to divide that value by the total number of documents. I suppose I can create a whole separate m/r job whose sole purpose is to count all the records, then pass that number on. Is there a more straightforward way of doing this? Andy
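If you do go with the separate counting pass Andy mentions, a rough driver sketch (0.19-era API; the class name and the config key are illustrative, and the counter group/name strings are the ones discussed above) could read the framework counter from the finished first job and hand the total to the second job through its JobConf, where the reducers can pick it up in configure():

    import java.io.IOException;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class DocFrequencyDriver {
      public static void main(String[] args) throws IOException {
        JobConf countPass = new JobConf(DocFrequencyDriver.class);
        // ... configure input/output paths, mapper, reducer for the counting pass ...
        RunningJob counting = JobClient.runJob(countPass);
        long totalDocs = counting.getCounters()
            .findCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_INPUT_RECORDS")
            .getCounter();

        JobConf dfPass = new JobConf(DocFrequencyDriver.class);
        dfPass.setLong("my.total.documents", totalDocs);  // reducers read this back in configure()
        // ... configure input/output paths, mapper, reducer for the document-frequency pass ...
        JobClient.runJob(dfPass);
      }
    }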
Re: Map-Reduce Slow Down
Mithila, You said all the slaves were being utilized in the 3 node cluster. Which application did you run to test that and what was your input size? If you tried the word count application on a 516 MB input file on both cluster setups, then some of your nodes in the 15 node cluster may not be running at all. Generally, one map task is assigned to each input split, and if you are running your cluster with the defaults, the splits are 64 MB each. I got confused when you said the Namenode seemed to do all the work. Can you check conf/slaves and make sure you put the names of all task trackers there? I also suggest comparing both clusters with a larger input size, say at least 5 GB, to really see a difference. Jim On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball aa...@cloudera.com wrote: in hadoop-*-examples.jar, use randomwriter to generate the data and sort to sort it. - Aaron On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi forpan...@gmail.com wrote: Your data is too small I guess for 15 clusters ..So it might be overhead time of these clusters making your total MR jobs more time consuming. I guess you will have to try with larger set of data.. Pankil On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra mnage...@asu.edu wrote: Aaron That could be the issue, my data is just 516MB - wouldn't this see a bit of speed up? Could you guide me to the example? I'll run my cluster on it and see what I get. Also for my program I had a java timer running to record the time taken to complete execution. Does Hadoop have an inbuilt timer? Mithila On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball aa...@cloudera.com wrote: Virtually none of the examples that ship with Hadoop are designed to showcase its speed. Hadoop's speedup comes from its ability to process very large volumes of data (starting around, say, tens of GB per job, and going up in orders of magnitude from there). So if you are timing the pi calculator (or something like that), its results won't necessarily be very consistent. If a job doesn't have enough fragments of data to allocate one per each node, some of the nodes will also just go unused. The best example for you to run is to use randomwriter to fill up your cluster with several GB of random data and then run the sort program. If that doesn't scale up performance from 3 nodes to 15, then you've definitely got something strange going on. - Aaron On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra mnage...@asu.edu wrote: Hey all I recently set up a three node hadoop cluster and ran an example on it. It was pretty fast, and all three nodes were being used (I checked the log files to make sure that the slaves are utilized). Now I've set up another cluster consisting of 15 nodes. I ran the same example, but instead of speeding up, the map-reduce task seems to take forever! The slaves are not being used for some reason. This second cluster has a lower, per node processing power, but should that make any difference? How can I ensure that the data is being mapped to all the nodes? Presently, the only node that seems to be doing all the work is the Master node. Does 15 nodes in a cluster increase the network cost? What can I do to setup the cluster to function more efficiently? Thanks! Mithila Nagendra Arizona State University
Re: Grouping Values for Reducer Input
I'm not sure if this is exactly what you want but, can you emit map records as: cat, doc5 -> 3 cat, doc1 -> 1 cat, doc5 -> 1 and so on.. This way, your reducers will get the intermediate key,value pairs as cat, doc5 -> 3 cat, doc5 -> 1 cat, doc1 -> 1 then you can split the keys (cat, doc*) inside the reducer and perform your additions. -Jim On Mon, Apr 13, 2009 at 4:53 PM, Streckfus, William [USA] streckfus_will...@bah.com wrote: Hi Everyone, I'm working on a relatively simple MapReduce job with a slight complication with regards to the ordering of my key/values heading into the reducer. The output from the mapper might be something like cat -> doc5, 1 cat -> doc1, 1 cat -> doc5, 3 ... Here, 'cat' is my key and the value is the document ID and the count (my own WritableComparable.) Originally I was going to create a HashMap in the reduce method and add an entry for each document ID and sum the counts for each. I realized the method would be better if the values were in order like so: cat -> doc1, 1 cat -> doc5, 1 cat -> doc5, 3 ... Using this style I can continue summing until I reach a new document ID and just collect the output at this point thus avoiding data structures and object creation costs. I tried setting JobConf.setOutputValueGroupingComparator() but this didn't seem to do anything. In fact, I threw an exception from the Comparator I supplied but this never showed up when running the job. My map output value consists of a UTF and a Long so perhaps the Comparator I'm using (identical to Text.Comparator) is incorrect: public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) { int n1 = WritableUtils.decodeVIntSize(b1[s1]); int n2 = WritableUtils.decodeVIntSize(b2[s2]); return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2); } In my final output I'm basically running into the same word -> documentID being output multiple times. So for the above example I have multiple lines with cat -> doc5, X. Reducer method just in case: public void reduce(Text key, Iterator<TermFrequencyWritable> values, OutputCollector<Text, TermFrequencyWritable> output, Reporter reporter) throws IOException { long sum = 0; String lastDocID = null; // Iterate through all values while (values.hasNext()) { TermFrequencyWritable value = values.next(); // Encountered new document ID => record and reset if (!value.getDocumentID().equals(lastDocID)) { // Ignore first go through if (sum != 0) { termFrequency.setDocumentID(lastDocID); termFrequency.setFrequency(sum); output.collect(key, termFrequency); } sum = 0; lastDocID = value.getDocumentID(); } sum += value.getFrequency(); } // Record last one termFrequency.setDocumentID(lastDocID); termFrequency.setFrequency(sum); output.collect(key, termFrequency); } Any ideas (Using Hadoop .19.1)? Thanks, - Bill
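For what it's worth, the grouping-comparator route Bill is attempting usually needs a matching sort comparator and partitioner alongside it. A job-setup fragment might look roughly like the sketch below (0.19-era API; every class name here is hypothetical): the keys sort on (word, docID), reduce() calls are grouped on the word alone, and the partitioner keeps each word on one reducer.

    JobConf conf = new JobConf(TermFrequencyJob.class);              // hypothetical job class
    conf.setMapOutputKeyClass(TermDocPair.class);                    // composite (word, docID) key
    conf.setMapOutputValueClass(LongWritable.class);                 // the count
    conf.setOutputKeyComparatorClass(TermDocPairComparator.class);   // sort by word, then docID
    conf.setOutputValueGroupingComparator(TermComparator.class);     // group reduce() calls by word only
    conf.setPartitionerClass(TermPartitioner.class);                 // send all (word, *) keys to one reducer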
Re: Grouping Values for Reducer Input
Oh, I forgot to mention that you should change your partitioner to send all the keys in the form of cat,* to the same reducer, but it seems like Jeremy has been much faster than me :) -Jim On Mon, Apr 13, 2009 at 5:24 PM, Jim Twensky jim.twen...@gmail.com wrote: I'm not sure if this is exactly what you want but, can you emit map records as: cat, doc5 -> 3 cat, doc1 -> 1 cat, doc5 -> 1 and so on.. This way, your reducers will get the intermediate key,value pairs as cat, doc5 -> 3 cat, doc5 -> 1 cat, doc1 -> 1 then you can split the keys (cat, doc*) inside the reducer and perform your additions. -Jim On Mon, Apr 13, 2009 at 4:53 PM, Streckfus, William [USA] streckfus_will...@bah.com wrote: Hi Everyone, I'm working on a relatively simple MapReduce job with a slight complication with regards to the ordering of my key/values heading into the reducer. The output from the mapper might be something like cat -> doc5, 1 cat -> doc1, 1 cat -> doc5, 3 ... Here, 'cat' is my key and the value is the document ID and the count (my own WritableComparable.) Originally I was going to create a HashMap in the reduce method and add an entry for each document ID and sum the counts for each. I realized the method would be better if the values were in order like so: cat -> doc1, 1 cat -> doc5, 1 cat -> doc5, 3 ... Using this style I can continue summing until I reach a new document ID and just collect the output at this point thus avoiding data structures and object creation costs. I tried setting JobConf.setOutputValueGroupingComparator() but this didn't seem to do anything. In fact, I threw an exception from the Comparator I supplied but this never showed up when running the job. My map output value consists of a UTF and a Long so perhaps the Comparator I'm using (identical to Text.Comparator) is incorrect: public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) { int n1 = WritableUtils.decodeVIntSize(b1[s1]); int n2 = WritableUtils.decodeVIntSize(b2[s2]); return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2); } In my final output I'm basically running into the same word -> documentID being output multiple times. So for the above example I have multiple lines with cat -> doc5, X. Reducer method just in case: public void reduce(Text key, Iterator<TermFrequencyWritable> values, OutputCollector<Text, TermFrequencyWritable> output, Reporter reporter) throws IOException { long sum = 0; String lastDocID = null; // Iterate through all values while (values.hasNext()) { TermFrequencyWritable value = values.next(); // Encountered new document ID => record and reset if (!value.getDocumentID().equals(lastDocID)) { // Ignore first go through if (sum != 0) { termFrequency.setDocumentID(lastDocID); termFrequency.setFrequency(sum); output.collect(key, termFrequency); } sum = 0; lastDocID = value.getDocumentID(); } sum += value.getFrequency(); } // Record last one termFrequency.setDocumentID(lastDocID); termFrequency.setFrequency(sum); output.collect(key, termFrequency); } Any ideas (Using Hadoop .19.1)? Thanks, - Bill
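A minimal sketch of the partitioner change mentioned above, assuming the composite key is simply a Text of the form "word,docID" (the class name and value type are illustrative): partition on the word alone so that every "cat,*" key lands on the same reducer.

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class WordPartitioner implements Partitioner<Text, LongWritable> {
      public void configure(JobConf job) {}

      public int getPartition(Text key, LongWritable value, int numPartitions) {
        String word = key.toString().split(",", 2)[0];           // natural key (the word) only
        return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

Registering it with conf.setPartitionerClass(WordPartitioner.class) keeps the grouping consistent with the reducer-side splitting of the keys.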
Re: Map-Reduce Slow Down
Can you ssh between the nodes? -jim On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra mnage...@asu.edu wrote: Thanks Aaron. Jim: The three node cluster I set up had Ubuntu running on the nodes and the dfs was accessed at port 54310. The new cluster which I've set up has Red Hat Linux release 7.2 (Enigma) running on it. Now when I try to access the dfs from one of the slaves I get the following response: dfs cannot be accessed. When I access the DFS through the master there's no problem. So I feel there's a problem with the port. Any ideas? I did check the list of slaves, it looks fine to me. Mithila On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky jim.twen...@gmail.com wrote: Mithila, You said all the slaves were being utilized in the 3 node cluster. Which application did you run to test that and what was your input size? If you tried the word count application on a 516 MB input file on both cluster setups, then some of your nodes in the 15 node cluster may not be running at all. Generally, one map task is assigned to each input split, and if you are running your cluster with the defaults, the splits are 64 MB each. I got confused when you said the Namenode seemed to do all the work. Can you check conf/slaves and make sure you put the names of all task trackers there? I also suggest comparing both clusters with a larger input size, say at least 5 GB, to really see a difference. Jim On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball aa...@cloudera.com wrote: in hadoop-*-examples.jar, use randomwriter to generate the data and sort to sort it. - Aaron On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi forpan...@gmail.com wrote: Your data is too small I guess for 15 clusters ..So it might be overhead time of these clusters making your total MR jobs more time consuming. I guess you will have to try with larger set of data.. Pankil On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra mnage...@asu.edu wrote: Aaron That could be the issue, my data is just 516MB - wouldn't this see a bit of speed up? Could you guide me to the example? I'll run my cluster on it and see what I get. Also for my program I had a java timer running to record the time taken to complete execution. Does Hadoop have an inbuilt timer? Mithila On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball aa...@cloudera.com wrote: Virtually none of the examples that ship with Hadoop are designed to showcase its speed. Hadoop's speedup comes from its ability to process very large volumes of data (starting around, say, tens of GB per job, and going up in orders of magnitude from there). So if you are timing the pi calculator (or something like that), its results won't necessarily be very consistent. If a job doesn't have enough fragments of data to allocate one per each node, some of the nodes will also just go unused. The best example for you to run is to use randomwriter to fill up your cluster with several GB of random data and then run the sort program. If that doesn't scale up performance from 3 nodes to 15, then you've definitely got something strange going on. - Aaron On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra mnage...@asu.edu wrote: Hey all I recently set up a three node hadoop cluster and ran an example on it. It was pretty fast, and all three nodes were being used (I checked the log files to make sure that the slaves are utilized). Now I've set up another cluster consisting of 15 nodes. I ran the same example, but instead of speeding up, the map-reduce task seems to take forever! The slaves are not being used for some reason.
This second cluster has a lower, per node processing power, but should that make any difference? How can I ensure that the data is being mapped to all the nodes? Presently, the only node that seems to be doing all the work is the Master node. Does 15 nodes in a cluster increase the network cost? What can I do to setup the cluster to function more efficiently? Thanks! Mithila Nagendra Arizona State University
getting DiskErrorException during map
Hi, I'm using Hadoop 0.19.1 and I have a very small test cluster with 9 nodes, 8 of them being task trackers. I'm getting the following error and my jobs keep failing when map processes start hitting 30%: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_200904072051_0001/attempt_200904072051_0001_m_00_1/output/file.out at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124) at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1209) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:867) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) at org.apache.hadoop.mapred.Child.main(Child.java:158) I googled many blogs and web pages but I could neither understand why this happens nor find a solution to it. What does that error message mean, and how can I avoid it? Any suggestions? Thanks in advance, -jim
Re: Please help!
See the original Map Reduce paper by Google at http://labs.google.com/papers/mapreduce.html and please don't spam the list. -jim On Tue, Mar 31, 2009 at 6:15 PM, Hadooper kusanagiyang.had...@gmail.com wrote: Dear developers, Is there any detailed example of how Hadoop processes input? The article http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives a good idea, but I want to see input data being passed from class to class, and how each class manipulates data. The purpose is to analyze the time and space complexity of Hadoop as a generalized computational model/algorithm. I tried to search the web and could not find more detail. Any pointer/hint? Thanks a million. -- Cheers! Hadoop core
Re: Using HDFS for common purpose
You may also want to have a look at this to reach a decision based on your needs: http://www.swaroopch.com/notes/Distributed_Storage_Systems Jim On Tue, Jan 27, 2009 at 1:22 PM, Jim Twensky jim.twen...@gmail.com wrote: Rasit, What kind of data will you be storing on Hbase or directly on HDFS? Do you aim to use it as a data source to do some key/value lookups for small strings/numbers or do you want to store larger files labeled with some sort of a key and retrieve them during a map reduce run? Jim On Tue, Jan 27, 2009 at 11:51 AM, Jonathan Gray jl...@streamy.com wrote: Perhaps what you are looking for is HBase? http://hbase.org HBase is a column-oriented, distributed store that sits on top of HDFS and provides random access. JG -Original Message- From: Rasit OZDAS [mailto:rasitoz...@gmail.com] Sent: Tuesday, January 27, 2009 1:20 AM To: core-user@hadoop.apache.org Cc: arif.yil...@uzay.tubitak.gov.tr; emre.gur...@uzay.tubitak.gov.tr; hilal.tara...@uzay.tubitak.gov.tr; serdar.ars...@uzay.tubitak.gov.tr; hakan.kocaku...@uzay.tubitak.gov.tr; caglar.bi...@uzay.tubitak.gov.tr Subject: Using HDFS for common purpose Hi, I wanted to ask, if HDFS is a good solution just as a distributed db (no running jobs, only get and put commands) A review says that HDFS is not designed for low latency and besides, it's implemented in Java. Do these disadvantages prevent us using it? Or could somebody suggest a better (faster) one? Thanks in advance.. Rasit
Re: Suitable for Hadoop?
Ricky, Hadoop is primarily optimized for large files, usually files larger than one input split. However, there is an input format called MultiFileInputFormat which can be used to make Hadoop work efficiently on smaller files. You can also override the isSplitable method of an input format to return false and ensure that a file is not split into pieces but rather processed by only one mapper. Jim On Wed, Jan 21, 2009 at 9:14 AM, Ricky Ho r...@adobe.com wrote: Hmmm ... From a space efficiency perspective, given HDFS (with large block size) is expecting large files, is Hadoop optimized for processing a large number of small files? Does each file take up at least 1 block, or can multiple files sit on the same block? Rgds, Ricky -Original Message- From: Zak, Richard [USA] [mailto:zak_rich...@bah.com] Sent: Wednesday, January 21, 2009 6:42 AM To: core-user@hadoop.apache.org Subject: RE: Suitable for Hadoop? You can do that. I did a Map/Reduce job for about 6 GB of PDFs to concatenate them, and the New York Times used Hadoop to process a few TB of PDFs. What I would do is this: - Use the iText library, a Java library for PDF manipulation (don't know what you would use for reading Word docs) - Don't use any Reducers - Have the input be a text file with the directory(ies) to process, since the mapper takes in file contents (and you don't want to read in one line of binary) - Have the map process all contents for that one given directory from the input text file - Break down the documents into more directories to go easier on the memory - Use Amazon's EC2, and the scripts in hadoop_dir/src/contrib/ec2/bin/ (there is a script which passes environment variables to launched instances, modify the script to allow Hadoop to use more memory by setting the HADOOP_HEAPSIZE environment variable and having the variable properly passed) I realize this isn't the strong point of Map/Reduce or Hadoop, but it still uses the HDFS in a beneficial manner, and the distributed part is very helpful! Richard J. Zak -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: Wednesday, January 21, 2009 08:08 To: core-user@hadoop.apache.org Subject: Suitable for Hadoop? Hi, I have a task to process large quantities of files by converting them into other formats. Each file is processed as a whole and converted to a target format. Since there are 100's of GB of data I thought it suitable for Hadoop, but the problem is, I don't think the files can be broken apart and processed. For example, how would mapreduce work to convert a Word Document to PDF if the file is reduced to blocks? I'm not sure that's possible, or is it? thanks for any advice Darren
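As a small sketch of the "don't split my files" option mentioned above (0.19-era API; the class name is illustrative), an input format can extend an existing one and override isSplitable so that each input file goes to exactly one map task:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class NonSplittingTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;   // one map task per file, regardless of size
      }
    }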
Re: Indexed Hashtables
Delip, Why do you think Hbase will be an overkill? I do something similar to what you're trying to do with Hbase and I haven't encountered any significant problems so far. Can you give some more info on the size of the data you have? Jim On Wed, Jan 14, 2009 at 8:47 PM, Delip Rao delip...@gmail.com wrote: Hi, I need to lookup a large number of key/value pairs in my map(). Is there any indexed hashtable available as a part of Hadoop I/O API? I find Hbase an overkill for my application; something on the lines of HashStore (www.cellspark.com/hashstore.html) should be fine. Thanks, Delip
Re: Merging reducer outputs into a single part-00000 file
Owen and Rasit, Thank you for the responses. I've figured out that mapred.reduce.tasks was set to 1 in my hadoop-default.xml and I didn't overwrite it in my hadoop-site.xml configuration file. Jim On Wed, Jan 14, 2009 at 11:23 AM, Owen O'Malley omal...@apache.org wrote: On Jan 14, 2009, at 12:46 AM, Rasit OZDAS wrote: Jim, As far as I know, there is no operation done after the Reducer. Correct, other than output promotion, which moves the output file to the final filename. But if you are a little experienced, you already know these. Ordered list means one final file, or am I missing something? There is no value and a lot of cost associated with creating a single file for the output. The question is how you want the keys divided between the reduces (and therefore output files). The default partitioner hashes the key and mods by the number of reduces, which stripes the keys across the output files. You can use the mapred.lib.InputSampler to generate good partition keys and mapred.lib.TotalOrderPartitioner to get completely sorted output based on the partition keys. With the total order partitioner, each reduce gets an increasing range of keys and thus has all of the nice properties of a single file without the costs. -- Owen
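Roughly, the classes Owen points to get wired up in the job driver like the fragment below (0.19-era API; the job class, key/value types, sampling parameters, and partition-file path are all illustrative, so treat this as a sketch rather than a recipe):

    JobConf conf = new JobConf(SortedOutputJob.class);          // hypothetical job class
    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(0.01, 1000);
    TotalOrderPartitioner.setPartitionFile(conf, new Path("/tmp/partitions"));
    InputSampler.writePartitionFile(conf, sampler);
    conf.setPartitionerClass(TotalOrderPartitioner.class);

Each reduce then receives a contiguous, increasing key range, so the concatenation of part-00000, part-00001, ... is globally sorted without ever funnelling everything through a single reducer.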
Merging reducer outputs into a single part-00000 file
Hello, The original map-reduce paper states: After successful completion, the output of the map-reduce execution is available in the R output files (one per reduce task, with file names as specified by the user). However, when using Hadoop's TextOutputFormat, all the reducer outputs are combined in a single file called part-00000. I was wondering how and when this merging process is done. When the reducer calls output.collect(key,value), is this record written to a local temporary output file on the reducer's disk and then these local files (a total of R) are later merged into one single file with a final thread, or is it directly written to the final output file (part-00000)? I am asking this because I'd like to get an ordered sample of the final output data, i.e. one record per every 1000 records or something similar, and I don't want to run a serial process that iterates over the final output file. Thanks, Jim
Re: Combiner run specification and questions
Hello Saptarshi, "E.g. if there are only 10 values corresponding to a key (as outputted by the mapper), will these 10 values go straight to the reducer or to the reducer via the combiner?" It depends on whether or not you use the method JobConf.setCombinerClass(). If you don't, Hadoop does not run any combiners by default. If you use your reducer class as the combiner, you must make sure that your mapper and reducer outputs are of the same type, because otherwise you will get a runtime error about types not matching. In your case, I strongly recommend using a combiner to reduce the size of the intermediate data. My understanding is that combiners are just local reducers that run right after the completion of the map step. Jim On Fri, Jan 2, 2009 at 11:57 AM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, I would just like to confirm, when does the Combiner run (since it might not be run at all, see below)? I read somewhere that it is run if there is at least one reduce (which in my case I can be sure of). I also read that the combiner is an optimization. However, it is also a chance for a function to transform the key/value (keeping the class the same, i.e. the combiner semantics are not changed) and deal with a smaller set (this could be done in the reducer but the number of values for a key might be relatively large). However, I guess it would be a mistake for the reducer to expect its input coming from a combiner? E.g. if there are only 10 values corresponding to a key (as outputted by the mapper), will these 10 values go straight to the reducer or to the reducer via the combiner? Here I am assuming my reduce operation does not need all the values for a key to work (so that a combiner can be used), i.e. additive operations. Thank you Saptarshi On Sun, Nov 16, 2008 at 6:18 PM, Owen O'Malley omal...@apache.org wrote: The Combiner may be called 0, 1, or many times on each key between the mapper and reducer. Combiners are just an application specific optimization that compress the intermediate output. They should not have side effects or transform the types. Unfortunately, since there isn't a separate interface for Combiners, there isn't a great place to document this requirement. I've just filed HADOOP-4668 to improve the documentation. -- Saptarshi Guha - saptarshi.g...@gmail.com
Re: Shared thread safe variables?
Aaron, I actually do something different than word count. I count all possible phrases for every sentence in my corpus. So for instance, if I have a sentence like "Hello world", my mappers emit: "Hello" 1, "World" 1, "Hello World" 1. As you can easily realize, for longer sentences the number of intermediate records grows much more than the original input size. Anyway, I did what I said last week based on your previous replies and it worked well. Thank you for the advice. Jim On Wed, Dec 31, 2008 at 4:06 AM, Aaron Kimball aa...@cloudera.com wrote: Hmm. Check your math on the data set size. Your input corpus may be a few (dozen, hundred) TB, but how many distinct words are there? The output data set should be at least a thousand times smaller. If you've got the hardware to do that initial word count step on a few TB of data, the second pass will not be a major performance concern. MapReduce is, to borrow from a tired analogy, a lot like driving a freight train. The raw speed of any given algorithm on it might not sound impressive, but even if it's got a much higher constant-factor of time associated with it, the ability to provide nearly-flat parallelism as your data set grows really large more than makes up for it in the long run. - Aaron On Thu, Dec 25, 2008 at 2:22 AM, Jim Twensky jim.twen...@gmail.com wrote: Hello again, I think I found an answer to my question. If I write a new WritableComparable object that extends IntWritable and then overwrite the compareTo method, I can change the sorting order from ascending to descending. That will solve my problem for getting the top 100 most frequent words at each combiner/reducer. Jim On Wed, Dec 24, 2008 at 12:19 PM, Jim Twensky jim.twen...@gmail.com wrote: Hi Aaron, Thanks for the advice. I actually thought of using multiple combiners and a single reducer but I was worried about the key sorting phase being a waste for my purpose. If the input is just a bunch of (word,count) pairs which are in the order of terabytes, wouldn't sorting be overkill? That's why I thought a single serial program might perform better but I'm not sure how long it would take to sort the keys in such a case, so probably it is nothing beyond speculation and I should go and give it a try to see how well it performs. Secondly, I didn't quite understand how I can take advantage of the sorted keys if I use an inverting mapper that transforms (k,v) --> (v,k) pairs. In both cases, the combiners and the single reducer will still have to iterate over all the (v,k) pairs to find the top 100, right? Or is there a way to say something like "Give me the last 100 keys" at each reducer/combiner? Thanks in advance, Jim On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball aa...@cloudera.com wrote: (Addendum to my own post -- an identity mapper is probably not what you want. You'd actually want an inverting mapper that transforms (k, v) --> (v, k), to take advantage of the key-based sorting.) - Aaron On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball aa...@cloudera.com wrote: Hi Jim, The ability to perform locking of shared mutable state is a distinct anti-goal of the MapReduce paradigm. One of the major benefits of writing MapReduce programs is knowing that you don't have to worry about deadlock in your code. If mappers could lock objects, then the failure and restart semantics of individual tasks would be vastly more complicated. (What happens if a map task crashes after it obtains a lock? Does it automatically release the lock?
Does some rollback mechanism undo everything that happened after the lock was acquired? How would that work if--by definition--the mapper node is no longer available?) A word frequency histogram function can certainly be written in MapReduce without such state. You've got the right intuition, but a serial program is not necessarily the best answer. Take the existing word count program. This converts bags of words into (word, count) pairs. Then pass this through a second pass, via an identity mapper to a set of combiners that each emit the 100 most frequent words, to a single reducer that emits the 100 most frequent words obtained by the combiners. Many other more complicated problems which seem to require shared state, in truth, only require a second (or n+1'th) MapReduce pass. Adding multiple passes is a very valid technique for building more complex dataflows. Cheers, - Aaron On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky jim.twen...@gmail.com wrote: Hello, I was wondering if Hadoop provides thread safe shared variables that can be accessed from individual mappers/reducers along with a proper locking mechanism. To clarify things
Re: Shared thread safe variables?
Hello again, I think I found an answer to my question. If I write a new WritableComparable object that extends IntWritable and then overwrite the compareTo method, I can change the sorting order from ascending to descending. That will solve my problem for getting the top 100 most frequent words at each combiner/reducer. Jim On Wed, Dec 24, 2008 at 12:19 PM, Jim Twensky jim.twen...@gmail.com wrote: Hi Aaron, Thanks for the advice. I actually thought of using multiple combiners and a single reducer but I was worried about the key sorting phase being a waste for my purpose. If the input is just a bunch of (word,count) pairs which are in the order of terabytes, wouldn't sorting be overkill? That's why I thought a single serial program might perform better but I'm not sure how long it would take to sort the keys in such a case, so probably it is nothing beyond speculation and I should go and give it a try to see how well it performs. Secondly, I didn't quite understand how I can take advantage of the sorted keys if I use an inverting mapper that transforms (k,v) --> (v,k) pairs. In both cases, the combiners and the single reducer will still have to iterate over all the (v,k) pairs to find the top 100, right? Or is there a way to say something like "Give me the last 100 keys" at each reducer/combiner? Thanks in advance, Jim On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball aa...@cloudera.com wrote: (Addendum to my own post -- an identity mapper is probably not what you want. You'd actually want an inverting mapper that transforms (k, v) --> (v, k), to take advantage of the key-based sorting.) - Aaron On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball aa...@cloudera.com wrote: Hi Jim, The ability to perform locking of shared mutable state is a distinct anti-goal of the MapReduce paradigm. One of the major benefits of writing MapReduce programs is knowing that you don't have to worry about deadlock in your code. If mappers could lock objects, then the failure and restart semantics of individual tasks would be vastly more complicated. (What happens if a map task crashes after it obtains a lock? Does it automatically release the lock? Does some rollback mechanism undo everything that happened after the lock was acquired? How would that work if--by definition--the mapper node is no longer available?) A word frequency histogram function can certainly be written in MapReduce without such state. You've got the right intuition, but a serial program is not necessarily the best answer. Take the existing word count program. This converts bags of words into (word, count) pairs. Then pass this through a second pass, via an identity mapper to a set of combiners that each emit the 100 most frequent words, to a single reducer that emits the 100 most frequent words obtained by the combiners. Many other more complicated problems which seem to require shared state, in truth, only require a second (or n+1'th) MapReduce pass. Adding multiple passes is a very valid technique for building more complex dataflows. Cheers, - Aaron On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky jim.twen...@gmail.com wrote: Hello, I was wondering if Hadoop provides thread safe shared variables that can be accessed from individual mappers/reducers along with a proper locking mechanism. To clarify things, let's say that in the word count example, I want to know the word that has the highest frequency and how many times it occurred.
I believe that the latter can be done using the counters that come with the Hadoop framework but I don't know how to get the word itself as a String. Of course, the problem can be more complicated like the top 100 words or so. I thought of writing a serial program which can go over the final output of the word count but this wouldn't be a good idea if the output file gets too large. However, if there is a way to define and use shared variables, this would be really easy to do on the fly during the word count's reduce phase. Thanks, Jim
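An alternative sketch of the same descending-order idea, in case it is useful: rather than a new key type, a raw comparator that reverses IntWritable's natural order can be plugged in as the output key comparator (the class name is illustrative).

    import org.apache.hadoop.io.IntWritable;

    public class DescendingIntComparator extends IntWritable.Comparator {
      @Override
      public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return -super.compare(b1, s1, l1, b2, s2, l2);   // flip ascending to descending
      }
    }

Registering it with conf.setOutputKeyComparatorClass(DescendingIntComparator.class) then makes the counts arrive at each combiner/reducer in descending order, so the first 100 records seen are the top 100.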
Shared thread safe variables?
Hello, I was wondering if Hadoop provides thread safe shared variables that can be accessed from individual mappers/reducers along with a proper locking mechanism. To clarify things, let's say that in the word count example, I want to know the word that has the highest frequency and how many times it occurred. I believe that the latter can be done using the counters that come with the Hadoop framework but I don't know how to get the word itself as a String. Of course, the problem can be more complicated, like the top 100 words or so. I thought of writing a serial program which can go over the final output of the word count but this wouldn't be a good idea if the output file gets too large. However, if there is a way to define and use shared variables, this would be really easy to do on the fly during the word count's reduce phase. Thanks, Jim
Re: Shared thread safe variables?
Hi Aaron, Thanks for the advice. I actually thought of using multiple combiners and a single reducer but I was worried about the key sorting phase being a waste for my purpose. If the input is just a bunch of (word,count) pairs which are in the order of terabytes, wouldn't sorting be overkill? That's why I thought a single serial program might perform better but I'm not sure how long it would take to sort the keys in such a case, so probably it is nothing beyond speculation and I should go and give it a try to see how well it performs. Secondly, I didn't quite understand how I can take advantage of the sorted keys if I use an inverting mapper that transforms (k,v) --> (v,k) pairs. In both cases, the combiners and the single reducer will still have to iterate over all the (v,k) pairs to find the top 100, right? Or is there a way to say something like "Give me the last 100 keys" at each reducer/combiner? Thanks in advance, Jim On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball aa...@cloudera.com wrote: (Addendum to my own post -- an identity mapper is probably not what you want. You'd actually want an inverting mapper that transforms (k, v) --> (v, k), to take advantage of the key-based sorting.) - Aaron On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball aa...@cloudera.com wrote: Hi Jim, The ability to perform locking of shared mutable state is a distinct anti-goal of the MapReduce paradigm. One of the major benefits of writing MapReduce programs is knowing that you don't have to worry about deadlock in your code. If mappers could lock objects, then the failure and restart semantics of individual tasks would be vastly more complicated. (What happens if a map task crashes after it obtains a lock? Does it automatically release the lock? Does some rollback mechanism undo everything that happened after the lock was acquired? How would that work if--by definition--the mapper node is no longer available?) A word frequency histogram function can certainly be written in MapReduce without such state. You've got the right intuition, but a serial program is not necessarily the best answer. Take the existing word count program. This converts bags of words into (word, count) pairs. Then pass this through a second pass, via an identity mapper to a set of combiners that each emit the 100 most frequent words, to a single reducer that emits the 100 most frequent words obtained by the combiners. Many other more complicated problems which seem to require shared state, in truth, only require a second (or n+1'th) MapReduce pass. Adding multiple passes is a very valid technique for building more complex dataflows. Cheers, - Aaron On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky jim.twen...@gmail.com wrote: Hello, I was wondering if Hadoop provides thread safe shared variables that can be accessed from individual mappers/reducers along with a proper locking mechanism. To clarify things, let's say that in the word count example, I want to know the word that has the highest frequency and how many times it occurred. I believe that the latter can be done using the counters that come with the Hadoop framework but I don't know how to get the word itself as a String. Of course, the problem can be more complicated, like the top 100 words or so. I thought of writing a serial program which can go over the final output of the word count but this wouldn't be a good idea if the output file gets too large.
However, if there is a way to define and use shared variables, this would be really easy to do on the fly during the word count's reduce phase. Thanks, Jim
Re: Predefined counters
Hello Tom, Thanks for the swift response. That really worked, I'll vote for what you suggested now. Cheers, Jim On Mon, Dec 22, 2008 at 5:09 AM, Tom White t...@cloudera.com wrote: Hi Jim, Try something like: Counters counters = job.getCounters(); counters.findCounter("org.apache.hadoop.mapred.Task$Counter", "REDUCE_INPUT_RECORDS").getCounter() The pre-defined counters are unfortunately not public and are not in one place in the source code, so you'll need to hunt to find them (search the source for the counter name you see in the web UI). I opened https://issues.apache.org/jira/browse/HADOOP-4043 a while back to address the fact they are not public. Please consider voting for it if you think it would be useful. Cheers, Tom On Mon, Dec 22, 2008 at 2:47 AM, Jim Twensky jim.twen...@gmail.com wrote: Hello, I need to collect some statistics using some of the counters defined by the Map/Reduce framework, such as "Reduce input records". I know I should use the getCounter method from Counters.Counter but I couldn't figure out how to use it. Can someone give me a two-line example of how to read the values for those counters and where I can find the names/groups of the predefined counters? Thanks in advance, Jim
Predefined counters
Hello, I need to collect some statistics using some of the counters defined by the Map/Reduce framework, such as "Reduce input records". I know I should use the getCounter method from Counters.Counter but I couldn't figure out how to use it. Can someone give me a two-line example of how to read the values for those counters and where I can find the names/groups of the predefined counters? Thanks in advance, Jim
Re: Can hadoop sort by values rather than keys?
Sorting according to keys is a requirement of the map/reduce algorithm. I'd suggest running a second map/reduce phase on the output files of your application and using the values as keys in that second phase. I know that will increase the running time, but this is how I do it when I need to get my output files sorted according to their values rather than keys. Jim On Wed, Sep 24, 2008 at 9:28 PM, Qin Gao [EMAIL PROTECTED] wrote: Why not use the values as keys? On Wed, Sep 24, 2008 at 10:22 PM, Jeremy Chow [EMAIL PROTECTED] wrote: Hi list, The default way Hadoop does its sorting is by keys; can it sort by values rather than keys? Regards, Jeremy -- My research interests are distributed systems, parallel computing and bytecode based virtual machine. http://coderplay.javaeye.com
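As a rough illustration of that second pass (the types are illustrative, assuming the first job's output is read back as Text keys and IntWritable values): the mapper simply swaps key and value so the shuffle sorts on the original values, and an identity reducer writes them back out in that order.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class InvertingMapper extends MapReduceBase
        implements Mapper<Text, IntWritable, IntWritable, Text> {
      public void map(Text key, IntWritable value,
                      OutputCollector<IntWritable, Text> output, Reporter reporter)
          throws IOException {
        output.collect(value, key);   // the old value becomes the sort key
      }
    }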
Re: debugging hadoop application!
As far as I know, there is a Hadoop plug-in for Eclipse, but it is not possible to debug when running on a real cluster. If you want to add watches and expressions to trace your programs or profile your code, I'd suggest looking at the log files or using other tracing tools such as X-Trace (http://www.x-trace.net/wiki/doku.php). Somebody please correct me if I'm wrong. Jim On Wed, Sep 24, 2008 at 4:41 PM, Gerardo Velez [EMAIL PROTECTED] wrote: Hi everybody! I'm a newbie on hadoop, and after following up some hadoop examples and studying them, I will start my own application, but I have a question. Is there any way I could debug my own hadoop application? Actually I've been working on the IntelliJ IDE, but I'm comfortable with netbeans and eclipse as well. Note: So far I've attached the jboss server on IntelliJ using this: -Xdebug -Xnoagent -Djava.compiler=NONE -Xrunjdwp:transport=dt_shmem,server=y,suspend=n,address=javadebug I just attached the last config line to the jboss execution file (run.sh). I was wondering if I could do anything similar to this for hadoop! Thanks in advance!
Re: installing hadoop on a OS X cluster
Apparently you have one node with 2 processors where each processor has 4 cores. What do you want to use Hadoop for? If you have a single disk drive and multiple cores on one node, then a pseudo-distributed environment seems like the best approach to me as long as you are not dealing with large amounts of data. If you have a single disk drive and a huge amount of data to process, then the disk drive might be a bottleneck for your applications. Hadoop is usually used for data-intensive applications, whereas your hardware seems more likely to be designed for CPU-intensive jobs, considering the 8 cores on a single node. Tim On Wed, Sep 10, 2008 at 4:59 PM, Sandy [EMAIL PROTECTED] wrote: I am starting an install of hadoop on a new cluster. However, I am a little confused about what set of instructions I should follow, having only installed and played around with hadoop on a single node ubuntu box with 2 cores (on a single board) and 2 GB of RAM. The new machine has 2 internal nodes, each with 4 cores. I would like Hadoop to run in a distributed context over these 8 cores. One of my biggest issues is the definition of the word node. From the Hadoop wiki and documentation, it seems that node means machine, and not a board. So, by this definition, our cluster is really one node. Is this correct? If this is the case, then I shouldn't be using the cluster setup instructions, located here: http://hadoop.apache.org/core/docs/r0.17.2/cluster_setup.html But this one: http://hadoop.apache.org/core/docs/r0.17.2/quickstart.html Which is what I've been doing. But what should the operation be? I don't think it should be standalone. Should it be Pseudo-distributed? If so, how can I guarantee that it will be spread over all the 8 processors? What is necessary for the hadoop-site.xml file? Here are the specs of the machine. -Mac Pro RAID Card 065-7214 -Two 3.0GHz Quad-Core Intel Xeon (8-core) 065-7534 -16GB RAM (4 x 4GB) 065-7179 -1TB 7200-rpm Serial ATA 3Gb/s 065-7544 -1TB 7200-rpm Serial ATA 3Gb/s 065-7546 -1TB 7200-rpm Serial ATA 3Gb/s 065-7193 -1TB 7200-rpm Serial ATA 3Gb/s 065-7548 Could someone please point me to the correct mode of operation/instructions to install things correctly on this machine? I found some information on how to install on an OS X machine in the archives, but they are a touch outdated and seem to be missing some things. Thank you very much for your time. -SM
Re: Hadoop Streaming and Multiline Input
If I understand your question correctly, you need to write your own FileInputFormat. Please see http://hadoop.apache.org/core/docs/r0.18.0/api/index.html for details. Regards, Tim On Sat, Sep 6, 2008 at 9:20 PM, Dennis Kubes [EMAIL PROTECTED] wrote: Is is possible to set a multiline text input in streaming to be used as a single record? For example say I wanted to scan a webpage for a specific regex that is multiline, is this possible in streaming? Dennis
Question on Streaming
Hello, I need to use Hadoop Streaming to run several instances of a single program on different files. Before doing it, I wrote a simple test application as the mapper, which basically outputs the standard input without doing anything useful. So it looks like the following: ---echo.sh-- echo "Running mapper, input is $1" ---echo.sh-- For the input, I created a single text file input.txt that has the numbers from 1 to 10, one on each line, so it goes like: ---input.txt--- 1 2 .. 10 ---input.txt--- I uploaded input.txt to the hdfs://stream/ directory and then ran the Hadoop Streaming utility as follows: bin/hadoop jar hadoop-0.18.0-streaming.jar \ -input /stream \ -output /trash \ -mapper echo.sh \ -file echo.sh \ -jobconf mapred.reduce.tasks=0 and from what I understood in the streaming tutorial, I expected that each mapper would run an instance of echo.sh with one of the lines in input.txt, so I expected to get an output in the form of "Running mapper, input is 2" "Running mapper, input is 5" ... and so on, but I got only two output files, part-00000 and part-00001, that contain the string "Running mapper, input is ". As far as I can see, the mappers ran the mapper script echo.sh without the standard input. I basically followed the tutorial and I'm confused now, so could you please tell me what I'm missing here? Thanks in advance, Jim
Different Map and Reduce output types - weird error message
Hello, I am working on a Hadoop application that produces different (key,value) types after the map and reduce phases, so I'm aware that I need to use JobConf.setMapOutputKeyClass and JobConf.setMapOutputValueClass. However, I still keep getting the following runtime error when I run my application: java.io.IOException: wrong value class: org.apache.hadoop.io.FloatWritable is not class org.apache.hadoop.io.IntWritable at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:938) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$1.collect(MapTask.java:414) at test.DistributionCreator$Reduce.reduce(DistributionCreator.java:104) at test.DistributionCreator$Reduce.reduce(DistributionCreator.java:85) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:439) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:418) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:604) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:193) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1804) My mapper class goes like: public static class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, IntWritable> { (...) public void map(LongWritable key, Text value, OutputCollector<IntWritable, IntWritable> output, Reporter reporter) throws IOException { (...) } } and my Reducer goes like: public static class Reduce extends MapReduceBase implements Reducer<IntWritable, IntWritable, IntWritable, FloatWritable> { (...) public void reduce(IntWritable key, Iterator<IntWritable> values, OutputCollector<IntWritable, FloatWritable> output, Reporter reporter) throws IOException { float sum = 0; (...) output.collect(key, new FloatWritable(sum)); } } and the corresponding part of my configuration goes as follows: conf.setMapOutputValueClass(IntWritable.class); conf.setMapOutputKeyClass(IntWritable.class); conf.setOutputKeyClass(IntWritable.class); conf.setOutputValueClass(FloatWritable.class); which I believe is consistent with the mapper and the reducer classes. Can you please let me know what I'm missing here? Thanks in advance, Jim
Re: Different Map and Reduce output types - weird error message
Here is the relevant part of my mapper: (...) private final static IntWritable one = new IntWritable(1); private IntWritable bound = new IntWritable(); (...) while(...) { output.collect(bound,one); } so I'm not sure why my mapper tries to output a FloatWritable. On Fri, Aug 29, 2008 at 6:17 PM, Owen O'Malley [EMAIL PROTECTED] wrote: The error message is saying that your map tried to output a FloatWritable.
Re: Different Map and Reduce output types - weird error message
I think I've found the problem. When I removed the following line: conf.setCombinerClass(Reduce.class); everything worked fine. During the map phase, when the combiner uses Reduce.class as the reducer, the final map (key,value) pairs are written with the Reducer output types, which contradicts the specified Mapper output types. If I'm correct, am I supposed to write a separate reducer for the local combiner in order to speed things up? Jim On Fri, Aug 29, 2008 at 6:30 PM, Jim Twensky [EMAIL PROTECTED] wrote: Here is the relevant part of my mapper: (...) private final static IntWritable one = new IntWritable(1); private IntWritable bound = new IntWritable(); (...) while(...) { output.collect(bound, one); } so I'm not sure why my mapper tries to output a FloatWritable. On Fri, Aug 29, 2008 at 6:17 PM, Owen O'Malley [EMAIL PROTECTED] wrote: The error message is saying that your map tried to output a FloatWritable.
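For what it's worth, a separate combiner class whose output types match the map output types is the usual fix. A minimal sketch (the class name is illustrative, and it assumes the reduce step is summing the counts, as the float sum above suggests) would nest inside the same job class as MapClass and Reduce, with java.util.Iterator and the usual org.apache.hadoop.io / org.apache.hadoop.mapred imports at the top of the file:

    public static class Combine extends MapReduceBase
        implements Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
      public void reduce(IntWritable key, Iterator<IntWritable> values,
                         OutputCollector<IntWritable, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();   // partial sum of the map-side counts
        }
        output.collect(key, new IntWritable(sum));   // same types as the map output
      }
    }

Registering it with conf.setCombinerClass(Combine.class) keeps the map-side aggregation without the value-class mismatch, since only the real reducer emits FloatWritable values.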