Re: How to stop a mapper within a map-reduce job when you detect bad input
Hello, The MapRunner classes looks promising. I noticed it is in the deprecated mapred package but I didn't see an equivalent class in the mapreduce package. Is this going to ported to mapreduce or is it no longer being supported? Thanks! ~Ed On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.com wrote: If it occurs eventually as your record reader reads it, then you may use a MapRunner class instead of a Mapper IFace/Subclass. This way, you may try/catch over the record reader itself, and call your map function only on valid next()s. I think this ought to work. You can set it via JobConf.setMapRunnerClass(...). Ref: MapRunner API @ http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote: Hello, I have a simple map-reduce job that reads in zipped files and converts them to lzo compression. Some of the files are not properly zipped which results in Hadoop throwing an java.io.EOFException: Unexpected end of input stream error and causes the job to fail. Is there a way to catch this exception and tell hadoop to just ignore the file and move on? I think the exception is being thrown by the class reading in the Gzip file and not my mapper class. Is this correct? Is there a way to handle this type of error gracefully? Thank you! ~Ed -- Harsh J www.harshj.com
Re: How to stop a mapper within a map-reduce job when you detect bad input
Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs before) and it doesn't look like MapRunner is deprecated so I'll try catching the error there and will report back if it's a good solution. Thanks! ~Ed On Thu, Oct 21, 2010 at 11:23 AM, ed hadoopn...@gmail.com wrote: Hello, The MapRunner classes looks promising. I noticed it is in the deprecated mapred package but I didn't see an equivalent class in the mapreduce package. Is this going to ported to mapreduce or is it no longer being supported? Thanks! ~Ed On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.com wrote: If it occurs eventually as your record reader reads it, then you may use a MapRunner class instead of a Mapper IFace/Subclass. This way, you may try/catch over the record reader itself, and call your map function only on valid next()s. I think this ought to work. You can set it via JobConf.setMapRunnerClass(...). Ref: MapRunner API @ http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote: Hello, I have a simple map-reduce job that reads in zipped files and converts them to lzo compression. Some of the files are not properly zipped which results in Hadoop throwing an java.io.EOFException: Unexpected end of input stream error and causes the job to fail. Is there a way to catch this exception and tell hadoop to just ignore the file and move on? I think the exception is being thrown by the class reading in the Gzip file and not my mapper class. Is this correct? Is there a way to handle this type of error gracefully? Thank you! ~Ed -- Harsh J www.harshj.com
Re: How to stop a mapper within a map-reduce job when you detect bad input
Sorry to keep spamming this thread. It looks like the correct way to implement MapRunnable using the new mapreduce classes (instead of the deprecated mapred) is to override the run() method of the mapper class. This is actually nice and convenient since everyone should already be using Mapper class (org.apache.hadoop.mapreduce.MaperKEYIN, VALUEIN, KEYOUT, VALUEOUT for their mappers. ~Ed On Thu, Oct 21, 2010 at 12:14 PM, ed hadoopn...@gmail.com wrote: Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs before) and it doesn't look like MapRunner is deprecated so I'll try catching the error there and will report back if it's a good solution. Thanks! ~Ed On Thu, Oct 21, 2010 at 11:23 AM, ed hadoopn...@gmail.com wrote: Hello, The MapRunner classes looks promising. I noticed it is in the deprecated mapred package but I didn't see an equivalent class in the mapreduce package. Is this going to ported to mapreduce or is it no longer being supported? Thanks! ~Ed On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.com wrote: If it occurs eventually as your record reader reads it, then you may use a MapRunner class instead of a Mapper IFace/Subclass. This way, you may try/catch over the record reader itself, and call your map function only on valid next()s. I think this ought to work. You can set it via JobConf.setMapRunnerClass(...). Ref: MapRunner API @ http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote: Hello, I have a simple map-reduce job that reads in zipped files and converts them to lzo compression. Some of the files are not properly zipped which results in Hadoop throwing an java.io.EOFException: Unexpected end of input stream error and causes the job to fail. Is there a way to catch this exception and tell hadoop to just ignore the file and move on? I think the exception is being thrown by the class reading in the Gzip file and not my mapper class. Is this correct? Is there a way to handle this type of error gracefully? Thank you! ~Ed -- Harsh J www.harshj.com
Re: How to stop a mapper within a map-reduce job when you detect bad input
Thanks Tom! Didn't see your post before posting =) On Thu, Oct 21, 2010 at 1:28 PM, ed hadoopn...@gmail.com wrote: Sorry to keep spamming this thread. It looks like the correct way to implement MapRunnable using the new mapreduce classes (instead of the deprecated mapred) is to override the run() method of the mapper class. This is actually nice and convenient since everyone should already be using Mapper class (org.apache.hadoop.mapreduce.MaperKEYIN, VALUEIN, KEYOUT, VALUEOUT for their mappers. ~Ed On Thu, Oct 21, 2010 at 12:14 PM, ed hadoopn...@gmail.com wrote: Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs before) and it doesn't look like MapRunner is deprecated so I'll try catching the error there and will report back if it's a good solution. Thanks! ~Ed On Thu, Oct 21, 2010 at 11:23 AM, ed hadoopn...@gmail.com wrote: Hello, The MapRunner classes looks promising. I noticed it is in the deprecated mapred package but I didn't see an equivalent class in the mapreduce package. Is this going to ported to mapreduce or is it no longer being supported? Thanks! ~Ed On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.com wrote: If it occurs eventually as your record reader reads it, then you may use a MapRunner class instead of a Mapper IFace/Subclass. This way, you may try/catch over the record reader itself, and call your map function only on valid next()s. I think this ought to work. You can set it via JobConf.setMapRunnerClass(...). Ref: MapRunner API @ http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote: Hello, I have a simple map-reduce job that reads in zipped files and converts them to lzo compression. Some of the files are not properly zipped which results in Hadoop throwing an java.io.EOFException: Unexpected end of input stream error and causes the job to fail. Is there a way to catch this exception and tell hadoop to just ignore the file and move on? I think the exception is being thrown by the class reading in the Gzip file and not my mapper class. Is this correct? Is there a way to handle this type of error gracefully? Thank you! ~Ed -- Harsh J www.harshj.com
Re: How to stop a mapper within a map-reduce job when you detect bad input
I overwrote the run() method in the mapper with a run() method (below) that catches the EOFException. The mapper and reducer now complete but the outputted lzo file from the reducer throws an Unexpected End of File error when decompressing it indicating something did not clean up properly. I can't think of why this could be happening as the map() method should only be called on input that was properly decompressed (anything that can't be decompressed will throw an Exception that is being caught). The reducer then should not even know that the mapper hit an EOFException in the input gzip file, and yet the output lzo file still has the unexpected end of file problem (I'm using the kevinweil lzo libraries). Is there some call that needs to be made that will close out the mapper and ensure that the lzo output from the reducer is formatted properly? Thank you! @Override public void run(Context context) throw InterruptedException{ try{ setup(context); while(context.nextKeyValue()){ map(context.getCurrentKey(), context.getCurrentValue(), context); } cleanup(context); } catch(EOFException){ logError(context, EOFException: Corrupt gzip file + mFileName); } } On Thu, Oct 21, 2010 at 1:29 PM, ed hadoopn...@gmail.com wrote: Thanks Tom! Didn't see your post before posting =) On Thu, Oct 21, 2010 at 1:28 PM, ed hadoopn...@gmail.com wrote: Sorry to keep spamming this thread. It looks like the correct way to implement MapRunnable using the new mapreduce classes (instead of the deprecated mapred) is to override the run() method of the mapper class. This is actually nice and convenient since everyone should already be using Mapper class (org.apache.hadoop.mapreduce.MaperKEYIN, VALUEIN, KEYOUT, VALUEOUT for their mappers. ~Ed On Thu, Oct 21, 2010 at 12:14 PM, ed hadoopn...@gmail.com wrote: Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs before) and it doesn't look like MapRunner is deprecated so I'll try catching the error there and will report back if it's a good solution. Thanks! ~Ed On Thu, Oct 21, 2010 at 11:23 AM, ed hadoopn...@gmail.com wrote: Hello, The MapRunner classes looks promising. I noticed it is in the deprecated mapred package but I didn't see an equivalent class in the mapreduce package. Is this going to ported to mapreduce or is it no longer being supported? Thanks! ~Ed On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.comwrote: If it occurs eventually as your record reader reads it, then you may use a MapRunner class instead of a Mapper IFace/Subclass. This way, you may try/catch over the record reader itself, and call your map function only on valid next()s. I think this ought to work. You can set it via JobConf.setMapRunnerClass(...). Ref: MapRunner API @ http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote: Hello, I have a simple map-reduce job that reads in zipped files and converts them to lzo compression. Some of the files are not properly zipped which results in Hadoop throwing an java.io.EOFException: Unexpected end of input stream error and causes the job to fail. Is there a way to catch this exception and tell hadoop to just ignore the file and move on? I think the exception is being thrown by the class reading in the Gzip file and not my mapper class. Is this correct? Is there a way to handle this type of error gracefully? Thank you! ~Ed -- Harsh J www.harshj.com
Re: Setting num reduce tasks
You could also try job.setNumReduceTasks(yourNumber); ~Ed On Thu, Oct 21, 2010 at 4:45 PM, Alex Kozlov ale...@cloudera.com wrote: Hi Matt, it might be that the parameter does not end up in the final configuration for a number of reasons. Can you check the job config xml in jt:/var/log/hadoop/history or in the JT UI and see what the mapred.reduce.tasks setting is? -- Alex K On Thu, Oct 21, 2010 at 1:39 PM, Matt Tanquary matt.tanqu...@gmail.com wrote: I am using the following to set my number of reduce tasks, however when I run my job it's always using just 1 reducer. conf.setInt(mapred.reduce.tasks, 20); 1 reducer will never finish this job. Please help me to understand why the setting I choose is not used. Thanks, -M@
LZO Compression Libraries don't appear to work properly with MultipleOutputs
Hello everyone, I am having problems using MultipleOutputs with LZO compression (could be a bug or something wrong in my own code). In my driver I set MultipleOutputs.addNamedOutput(job, test, TextOutputFormat.class, NullWritable.class, Text.class); In my reducer I have: MultipleOutputsNullWritable, Text mOutput = new MultipleOutputsNullWritable, Text(context); public String generateFileName(Key key){ return custom_file_name; } Then in the reduce() method I have: mOutput.write(mNullWritable, mValue, generateFileName(key)); This results in creating LZO files that do not decompress properly (lzop -d throws the error lzop: unexpected end of file: outputFile.lzo) If I switch back to the regular context.write(mNullWritable, mValue); everything works fine. Am I forgetting a step needed when using MultipleOutputs or is this a bug/non-feature of using LZO compression in Hadoop. Thank you! ~Ed
Re: How to stop a mapper within a map-reduce job when you detect bad input
So the overwritten run() method was a red herring. The real problem appears to be that I use MultipleOutputs (the new mapreduce API version) for my reducer output. I posted a different thread since it's not really related to the original question here. For everyone that was curious, it turns our overriding the run() method and catching the EOFException works beautifully for processing files that might be corrupt or have errors. Thanks! ~Ed On Thu, Oct 21, 2010 at 2:07 PM, ed hadoopn...@gmail.com wrote: I overwrote the run() method in the mapper with a run() method (below) that catches the EOFException. The mapper and reducer now complete but the outputted lzo file from the reducer throws an Unexpected End of File error when decompressing it indicating something did not clean up properly. I can't think of why this could be happening as the map() method should only be called on input that was properly decompressed (anything that can't be decompressed will throw an Exception that is being caught). The reducer then should not even know that the mapper hit an EOFException in the input gzip file, and yet the output lzo file still has the unexpected end of file problem (I'm using the kevinweil lzo libraries). Is there some call that needs to be made that will close out the mapper and ensure that the lzo output from the reducer is formatted properly? Thank you! @Override public void run(Context context) throw InterruptedException{ try{ setup(context); while(context.nextKeyValue()){ map(context.getCurrentKey(), context.getCurrentValue(), context); } cleanup(context); } catch(EOFException){ logError(context, EOFException: Corrupt gzip file + mFileName); } } On Thu, Oct 21, 2010 at 1:29 PM, ed hadoopn...@gmail.com wrote: Thanks Tom! Didn't see your post before posting =) On Thu, Oct 21, 2010 at 1:28 PM, ed hadoopn...@gmail.com wrote: Sorry to keep spamming this thread. It looks like the correct way to implement MapRunnable using the new mapreduce classes (instead of the deprecated mapred) is to override the run() method of the mapper class. This is actually nice and convenient since everyone should already be using Mapper class (org.apache.hadoop.mapreduce.MaperKEYIN, VALUEIN, KEYOUT, VALUEOUT for their mappers. ~Ed On Thu, Oct 21, 2010 at 12:14 PM, ed hadoopn...@gmail.com wrote: Just checked the Hadoop 0.21.0 API docs (I was looking in the wrong docs before) and it doesn't look like MapRunner is deprecated so I'll try catching the error there and will report back if it's a good solution. Thanks! ~Ed On Thu, Oct 21, 2010 at 11:23 AM, ed hadoopn...@gmail.com wrote: Hello, The MapRunner classes looks promising. I noticed it is in the deprecated mapred package but I didn't see an equivalent class in the mapreduce package. Is this going to ported to mapreduce or is it no longer being supported? Thanks! ~Ed On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.comwrote: If it occurs eventually as your record reader reads it, then you may use a MapRunner class instead of a Mapper IFace/Subclass. This way, you may try/catch over the record reader itself, and call your map function only on valid next()s. I think this ought to work. You can set it via JobConf.setMapRunnerClass(...). Ref: MapRunner API @ http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote: Hello, I have a simple map-reduce job that reads in zipped files and converts them to lzo compression. 
Some of the files are not properly zipped which results in Hadoop throwing an java.io.EOFException: Unexpected end of input stream error and causes the job to fail. Is there a way to catch this exception and tell hadoop to just ignore the file and move on? I think the exception is being thrown by the class reading in the Gzip file and not my mapper class. Is this correct? Is there a way to handle this type of error gracefully? Thank you! ~Ed -- Harsh J www.harshj.com
Re: LZO Compression Libraries don't appear to work properly with MultipleOutputs
Hi Todd, I don't have the code in front of me right but I was looking over the API docs and it looks like I forgot to call close() on the MultipleOutput. I'll post back if that fixes the problem. If not I'll put together a unit test. Thanks! ~Ed On Thu, Oct 21, 2010 at 6:31 PM, Todd Lipcon t...@cloudera.com wrote: Hi Ed, Sounds like this might be a bug, either in MultipleOutputs or in LZO. Does it work properly with gzip compression? Which LZO implementation are you using? The one from google code or the more up to date one from github (either kevinweil's or mine)? Any chance you could write a unit test that shows the issue? Thanks -Todd On Thu, Oct 21, 2010 at 2:52 PM, ed hadoopn...@gmail.com wrote: Hello everyone, I am having problems using MultipleOutputs with LZO compression (could be a bug or something wrong in my own code). In my driver I set MultipleOutputs.addNamedOutput(job, test, TextOutputFormat.class, NullWritable.class, Text.class); In my reducer I have: MultipleOutputsNullWritable, Text mOutput = new MultipleOutputsNullWritable, Text(context); public String generateFileName(Key key){ return custom_file_name; } Then in the reduce() method I have: mOutput.write(mNullWritable, mValue, generateFileName(key)); This results in creating LZO files that do not decompress properly (lzop -d throws the error lzop: unexpected end of file: outputFile.lzo) If I switch back to the regular context.write(mNullWritable, mValue); everything works fine. Am I forgetting a step needed when using MultipleOutputs or is this a bug/non-feature of using LZO compression in Hadoop. Thank you! ~Ed -- Todd Lipcon Software Engineer, Cloudera
Re: Upgrading Hadoop from CDH3b3 to CDH3
I don't think there is a stable CDH3 yet although we've been using CDH3B2 and it has been pretty stable for us. (at least I don't see it available on their website and they JUST announced CDH3B3 last week at HadoopWorld. ~Ed On Wed, Oct 20, 2010 at 5:57 AM, Abhinay Mehta abhinay.me...@gmail.comwrote: Hi all, We currently have Cloudera's Hadoop beta 3 installed on our cluster, we would like to upgrade to the latest stable release CDH3. Is there documentation or recommended steps on how to do this? We found some docs on how to upgrade from CDH2 and CDHb2 to CDHb3 here: https://docs.cloudera.com/display/DOC/Hadoop+Upgrade+from+CDH2+or+CDH3b2+to+CDH3b3 Are the same steps recommended to upgrade to CDH3? I'm hoping it's a lot easier to upgrade from beta3 to the latest stable version than that document states? Thank you. Abhinay Mehta
Re: Reduce function
Keys are partitioned among the reducers using a partition function which is specified in the aptly named Partitioner class. By default, Hadoop will hash the key (and probably mods the hash by the number of reducers) to determine which reducer to send your key to (I say probably because I haven't looked at the actual code). What this means for you is that if you set a custom bit in the key field, keys with different bits are not guaranteed to go to the same reducers even if they rest of key is the same. For example Key1 = (DataX+BitA) -- Reducer1 Key2 = (DataX+BitB) -- Reducer2 What you want is for any key with the same Data to go to the same reducer regardless of the bit value. To do this you need to write your own partitioner class and set your job to use that class using job.setPartitionerClass(MyCustomPartitioner.class) Your custom partitioner will need to break apart your key and only hash on the DataX part of it. The partitioner class is really easy to override and will look something like this: public class MyCustomPartitioner extends PartitionerKey, Value { public int getPartition(Key key, Value value, int numPartitions){ //split my key so that the bit flag is removed //take the modified key and mod it by numPartitions return the result } } Of course Key and Value would be whatever Key and Value class you're using. Hope that helps. ~Ed On Mon, Oct 18, 2010 at 8:58 PM, Brad Tofel b...@archive.org wrote: Whoops, just re-read your message, and see you may be asking about targeting a reduce callback function, not a reduce task.. If that's the case, I'm not sure I understand what your bit/tag is for, and what you're trying to do with it. Can you provide a concrete example (not necessarily code) of some keys which need to group together? Is there a way to embed the bit within the value, so keys are always common? If you really need to fake out the system so different keys arrive in the same reduce, you might be able to do it with a combination of: org.apache.hadoop.mapreduce.Job .setSortComparatorClass() .setGroupingComparatorClass() .setPartitionerClass() Brad On 10/18/2010 05:41 PM, Brad Tofel wrote: The Partitioner implementation used with your job should define which reduce target receives a given map output key. I don't know if an existing Partitioner implementation exists which meets your needs, but it's not a very complex interface to develop, if nothing existing works for you. Brad On 10/18/2010 04:43 PM, Shi Yu wrote: How many tags you have? If you have several number of tags, you'd better create a Vector class to hold those tags. And define sum function to increment the values of tags. Then the value class should be your new Vector class. That's better and more decent than the Textpair approach. Shi On 2010-10-18 5:19, Matthew John wrote: Hi all, I had a small doubt regarding the reduce module. What I understand is that after the shuffle / sort phase , all the records with the same key value goes into a reduce function. If thats the case, what is the attribute of the Writable key which ensures that all the keys go to the same reduce ? I am working on a reduce side Join where I need to tag all the keys with a bit which might vary but still want all those records to go into same reduce. In Hadoop the Definitive Guide, pg. 235 they are using TextPair for the key. But I dont understand how the keys with different tag information goes into the same reduce. Matthew
Re: io.sort.mb maximum limit
HI Donovan, This is sort of tangential to your question but we tried upping our io.sort.mb to a really high value and it actually resulted in slower performance (I think we bumped it up to 1400MB and this was slower than leaving it at 256 on a machine with 32GB of RAM). I'm not entirely sure why this was the case. It could have been a garbage collection issue or some other secondary effect that was slowing things down. Keep in mind that Hadoop will always spill map outputs to disk no matter how large your sort buffer is in case the reducer crashes, the data needs to exist on disk somewhere for the next reducer making the attempt so it might be counterproductive to try and eliminate spills. ~Ed On Tue, Oct 19, 2010 at 8:02 AM, Donovan Hide donovanh...@gmail.com wrote: Hi, is there a reason why the io.sort.mb setting is hard-coded to the maximum of 2047MB? MapTask.java 789-791 if ((sortmb 0x7FF) != sortmb) { throw new IOException(Invalid \io.sort.mb\: + sortmb); } Given that the EC2 High-Memory Quadruple Extra Large Instance has 68.4GB of memory and 8 cores, it would make sense to be able to set the io.sort.mb to close to 8GB. I have map task that outputs 144,586,867 records of average size 12 bytes, and a greater than 2047MB sort buffer would allow me to prevent the inevitable spills. I know I can reduce the size of the map inputs to solve the problem, but 2047MB seems a bit arbitrary given the spec of EC2 instances. Cheers, Donovan.
How to stop a mapper within a map-reduce job when you detect bad input
Hello, I have a simple map-reduce job that reads in zipped files and converts them to lzo compression. Some of the files are not properly zipped which results in Hadoop throwing an java.io.EOFException: Unexpected end of input stream error and causes the job to fail. Is there a way to catch this exception and tell hadoop to just ignore the file and move on? I think the exception is being thrown by the class reading in the Gzip file and not my mapper class. Is this correct? Is there a way to handle this type of error gracefully? Thank you! ~Ed
Re: Set number Reducer per machines.
Ah yes, It looks like both the mapper and reducer are using a map structure which will be created on the heap. All the values from the reducer are being inserted into the map structure. If you have lots of values for a single key then you're going to run out of heap memory really fast. Do you have a rough estimate for the number of values per key? We had this problem when we first started using map-reduce (we'd create large arrays in the reducer to hold data to sort). Turns out this is generally a very bad idea (it's particularly bad when the number of values per key is not bounded since sometimes you're algorithm will work and other times you'll get out of memory errors). In our case we redesigned our algorithm to not require holding lots of values in memory by taking advantage of Hadoop's sorting capability and secondary sorting capability. My guess is you won't be able to use the cloud9 mapper and reducer unless your data changes so that the number of unique values per key is much lower. It's also possible that you're running out of heap space in the mapper as your create the map there. How many items are in the terms array? I String[] terms = text.split(\\s+); Sorry that's probably not much help to you. ~Ed On Wed, Oct 6, 2010 at 8:04 AM, Pramy Bhats pramybh...@googlemail.comwrote: Hi Ed, I was using the following file for mapreduce job. Cloud9/src/dist/edu/umd/cloud9/example/cooccur/ComputeCooccurrenceMatrixStripes.java thanks, --Pramod On Tue, Oct 5, 2010 at 10:51 PM, ed hadoopn...@gmail.com wrote: What are the exact files you are using for the mapper and reducer from the cloud9 package? On Tue, Oct 5, 2010 at 2:15 PM, Pramy Bhats pramybh...@googlemail.com wrote: Hi Ed, I was trying to benchmark some application code available online. http://github.com/lintool/Cloud9 For the program computing concurrentmatrix strips. However, the code itself is problematic because it throws heap-space error for even very small data sets. thanks, --Pramod On Tue, Oct 5, 2010 at 5:50 PM, ed hadoopn...@gmail.com wrote: Hi Pramod, How much memory does each node in your cluster have? What type of processors do those nodes have? (dual core, quad core, dual quad core? etc..) In what step are you seeing the heap space error (mapper or reducer?) It's quite possible that you're mapper or reducer code could be improved to reduce heap space usage. ~Ed On Tue, Oct 5, 2010 at 10:05 AM, Marcos Medrado Rubinelli marc...@buscape-inc.com wrote: You can set the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties in your mapred-site.xml file, but you may also want to check your current mapred.child.java.opts and mapred.child.ulimit values to make sure they aren't overriding the 4GB you set globally. Cheers, Marcos Hi, I am trying to run a job on my hadoop cluster, where I get consistently get heap space error. I increased the heap-space to 4 GB in hadoop-env.sh and reboot the cluster. However, I still get the heap space error. One of things, I want to try is to reduce the number of map / reduce process per machine. Currently each machine can have 2 maps and 2 reduce process running. I want to configure the hadoop to run 1 map and 1 reduce per machine to give more heap space per process. How can I configure the number of maps and number of reducer per node ? thanks in advance, -- Pramod
Re: Read/Writing into HDFS
I haven't tried it out yet but you theoretically can mount HDFS as a standard file system in linux using Fuse http://wiki.apache.org/hadoop/MountableHDFS If you're using Cloudera's distro of Hadoop it should come with fuse prepackaged for you: https://wiki.cloudera.com/display/DOC/Mountable+HDFS ~Ed On Thu, Sep 30, 2010 at 7:59 AM, Adarsh Sharma adarsh.sha...@orkash.comwrote: Dear all, I have set up a Hadoop cluster of 10 nodes. I want to know that how we can read/write file from HDFS (simple). Yes I know there are commands, i read the whole HDFS commands. bin/hadoop -copyFromLocal tells that the file should be in localfilesystem. But I want to know that how we can read these files from the cluster. What are the different ways to read files from HDFS. *Can a extra node ( other than the cluster nodes ) read file from the cluster. If yes , how? *Thanks in Advance* *
Re: How to config Map only job to read .gz input files and output result in .lzo
I've had luck doing the following in main (assuming lzo is setup properly) (I'm using Hadoop 20.2) FileOutputFormat.setCompressOutput(job, true); FileOutputFormat.setOutputCompressorClass(job, com.hadoop.compression.lzo.LzopCodec.class) Make sure kevin weil's jar file is accessible when building your jar, and is available on the cluster. You should see Lzo being loaded each time you run a job at the beginning Something like: INFO lzo.GPLNaitveCodeLoader: Loaded native gpl library INFO lzo.LzoCodec: Succesfully loaded initialized native-lzo library (you should see both lines to make sure hadoop sees your jar and native library) Hope that works! ~Ed On Tue, Sep 28, 2010 at 3:06 PM, Steve Kuo kuosen...@gmail.com wrote: We have TB worth of XML data in .gz format where each file is about 20 MB. This dataset is not expected to change. My goal is to write a map-only job to read in one .gz file at a time and output the result in .lzo format. Since there are a large number of .gz files, the map parallelism is expected to be maximized. I am using Kevin Weil's LZO distribution and there does not seem to be a LzoTextOutputFormat. When I got lzo to work before, I set InputFormatClass to LzoTextInputFormat.class and map's output got lzo compressed automatically. What does one configure for LZO output. Current Job configuration code listed below does not work. XmlInputFormat is my custom input format to read XML files. job.setInputFormatClass(XmlInputFormat.class); job.setMapperClass(XmlAnalyzer.XmlAnalyzerMapper.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(Text.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(Text.class); String mapredOutputCompress = conf.get(mapred.output.compress); if (true.equals(mapredOutputCompress)) // this reads input and write output in lzo format job.setInputFormatClass(LzoTextInputFormat.class);
Re: Proper blocksize and io.sort.mb setting when using compressed LZO files
Ah okay, I did not the fs.inmemory.size.mb setting in any of the default config files located here: http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html http://hadoop.apache.org/common/docs/r0.20.2/core-default.html http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html Should this be something that needs to be added? Thank you for the help! ~Ed On Mon, Sep 27, 2010 at 11:18 AM, Ted Yu yuzhih...@gmail.com wrote: The setting should be fs.inmemory.size.mb On Mon, Sep 27, 2010 at 7:15 AM, pig hadoopn...@gmail.com wrote: HI Sriguru, Thank you for the tips. Just to clarify a few things. Our machines have 32 GB of RAM. I'm planning on setting each machine to run 12 mappers and 2 reducers with the heap size set to 2048MB so total memory usage for the heap at 28GB. If this is the case should io.sort.mb be set to 70% of 2048MB (so ~1400 MB)? Also, I did not see a fs.inmemorysize.mb setting in any of the hadoop configuration files. Is that the correct setting I should be looking for? Should this also be set to 70% of the heap size or does it need to share with the io.sort.mb setting. I assume if I'm bumping up io.sort.mb that much I also need to increase io.sort.factor from the default of 10. Is there a recommended relation between these two? Thank you for your help! ~Ed On Sun, Sep 26, 2010 at 3:05 AM, Srigurunath Chakravarthi srig...@yahoo-inc.com wrote: Ed, Tuning io.sort.mb will be certainly worthwhile if you have enough RAM to allow for a higher Java heap per map task without risking swapping. Similarly, you can decrease spills on the reduce side using fs.inmemorysize.mb. You can use the following thumb rules for tuning those two: - Set these to ~70% of Java heap size. Pick heap sizes to utilize ~80% RAM across all processes (maps, reducers, TT, DN, other) - Set it small enough to avoid swap activity, but - Set it large enough to minimize disk spills. - Ensure that io.sort.factor is set large enough to allow full use of buffer space. - Balance space for output records (default 95%) record meta-data (5%). Use io.sort.spill.percent and io.sort.record.percent Your mileage may vary. We've seen job exec time improvements worth 1-3% via spill-avoidance for miscellaneous applications. Your other option of running a map per 32MB or 64MB of input should give you better performance if your map task execution time is significant (i.e., much larger than a few seconds) compared to the overhead of launching map tasks and reading input. Regards, Sriguru -Original Message- From: pig [mailto:hadoopn...@gmail.com] Sent: Saturday, September 25, 2010 2:36 AM To: common-user@hadoop.apache.org Subject: Proper blocksize and io.sort.mb setting when using compressed LZO files Hello, We just recently switched to using lzo compressed file input for our hadoop cluster using Kevin Weil's lzo library. The files are pretty uniform in size at around 200MB compressed. Our block size is 256MB. Decompressed the average LZO input file is around 1.0GB. I noticed lots of our jobs are now spilling lots of data to disk. We have almost 3x more spilled records than map input records for example. I'm guessing this is because each mapper is getting a 200 MB lzo file which decompresses into 1GB of data per mapper. Would you recommend solving this by reducing the block size to 64MB, or even 32MB and then using the LZO indexer so that a single 200MB lzo file is actually split among 3 or 4 mappers? Would it be better to play with the io.sort.mb value? Or, would it be best to play with both? 
Right now the io.sort.mb value is the default 200MB. Have other lzo users had to adjust their block size to compensate for the expansion of the data after decompression? Thank you for any help! ~Ed
Re: Reducer-side join example
Hi, Your question has an academic sound, so I'll give it an academic answer ;). Unfortunately, there are not really any good generalized (ie. cross join a large matrix with a large matrix) methods for doing joins in map-reduce. The fundamental reason for this is that in the general case you're comparing everything to everything, and so for each pair of possible rows, you must actually generate each pair of rows. This means every node ships all its data to every other node, no matter what (in the general case). I bring this up not because you're looking to optimize cross joining, but because it demonstrates the point that you will exploit the characteristics of your data no matter what strategy you choose, and each will have domain-specific flaws and advantages. The typical strategy for a reduce side join is to use hadoop's sorting functionality to group rows by their keys, such that the entire data set for a particular key will be resident on a single reducer. The key insight is that you're thinking about the join as a sorting problem. Yes this means you risk producing data sets that fill your reducers, but thats a trade-off that you accept to reduce the complexity of the original problem. If the existing join framework in hadoop (whose javadocs are quite thorough) is inadequate, you shouldn't be afraid to invent, implement, and test join strategies that are specific to your domain. On Tue, Apr 6, 2010 at 11:01 AM, M B machac...@gmail.com wrote: Thanks, I appreciate the example - what happens if File A and B have many more columns (all different data types)? The logic doesn't seem to work in that case - unless we set up the values in the Map function to include the file name (maybe the output value is a HashMap or something, which might work). Also, I was asking to see a reduce-side join as we have other things going on in the Mapper and I'm not sure if we can tweak it's output (we send output to multiple places). Does anyone have an example using the contrib/DataJoin or something similar? thanks On Mon, Apr 5, 2010 at 7:03 PM, He Chen airb...@gmail.com wrote: For the Map function: Input key: default input value: File A and File B lines output key: A, B, C,(first colomn of the final result) output value: 12, 24, Car, 13, Van, SUV... Reduce function: take the Map output and do: for each key { if the value of a key is integer then same it to array1; else save it to array2 } for ith element in array1 for jth element in array2 output(key, array1[i]+\t+array2[j]); done Hope this helps. On Mon, Apr 5, 2010 at 4:10 PM, M B machac...@gmail.com wrote: Hi, I need a good java example to get me started with some joining we need to do, any examples would be appreciated. File A: Field1 Field2 A12 B13 C22 A24 File B: Field1 Field2 Field3 ACar ... BTruck... BSUV ... BVan ... So, we need to first join File A and B on Field1 (say both are string fields). The result would just be: A 12 Car ... A 24 Car ... B 13 Truck ... B 13 SUV ... B 13 Van ... and so on - with all the fields from both files returning. Once we have that, we sometimes need to then transform it so we have a single record per key (Field1): A (12,Car) (24,Car) B (13,Truck) (13,SUV) (13,Van) --however it looks, basically tuples for each key (we'll modify this later to return a conatenated set of fields from B, etc) At other times, instead of transforming to a single row, we just need to modify rows based on values. So if B.Field2 equals Van, we need to set Output.Field2 = whatever then output to file ... 
Are there any good examples of this in native java (we can't use pig/hive/etc)? thanks. -- Best Wishes! -- Chen He PhD. student of CSE Dept. Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588
Re: question on shuffle and sort
On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote: Did all key-value pairs of the map output, which have the same key, will be sent to the same reducer tasknode? Yes, this is at the core of the MapReduce model. There is one call to the user reduce function per unique map output key. This grouping is achieved by sorting which means you see keys in increasing order. Ed
Re: Strange behavior regarding stout,stderr,syslog
On Sat, Mar 13, 2010 at 10:57 PM, patektek wrote: Hello, I am using hadoop-0.20.1. Something very strange is happening with the log files (stdout, stderr, syslog). Basically, no log files are created for most of the tasks (in HOME_HADOOP/logs/userlogs). However, when I check the history for each individual task in the website interface I can see that all the data is available for each individual task. Furthermore, the few log files available (for a few tasks) seem to have the log files from multiple tasks mingled together. Any hint? Do you have JVM reuse enabled? With it enabled, I've observed that all tasks associated with a particular JVM go to the same log. Ed
Re: sort done parallel or after copy ?
Hi Prasen, The data that reduce tasks receive during shuffle (copy) has already been sorted by map tasks, so they just have to be merged. This merge happens in parallel with the shuffle. When a reduce task's in-memory buffer of sorted map output files reaches a certain threshold, they are merged and written to disk. If the number of on-disk files created through this process exceeds 2n-1 where n=io.sort.factor (10 by default), n of these files get merged so that there are n remaining. When the shuffle ends, there can be anywhere between 0 to 2n-1 files on disk to be merged still. These get merged down to (at most) n files and a final merge goes directly into the user reduce function. Ed On Fri, Mar 5, 2010 at 12:36 AM, prasenjit mukherjee prasen@gmail.com wrote: if I understand correctly reduce has 3 stages : copy,sort,reduce. Copy happens parallely with mappers still running. Reduce has to wait till all the mappers are done. For sorting we could have 2 options : 1) Entire sorting happens after copy ( in a single shot ) OR 2) It could happen along with copy where each block is sorted and later merged ( via merge-sort ) How is it being currently done in hadoop's latest version ? -Thanks, Prasen
Re: Writing a simple sort application for Hadoop
Hi Abhishek, If you use input lines as your output keys in map, Hadoop internals will do the work for you and the keys will appear in sorted order in your reduce (you can use IdentityReducer). This needs a slight adjustment if your input lines aren't unique. If you have R reducers, this will create R sorted files. If you want a single sorted file, you can merge the R files or use 1 reducer. Another way is to use TotalOrderPartitioner which will ensure all keys in reduce N come after all keys in reduce N-1. Owen O'Malley and Arun C. Murthy's paper [1] about using Hadoop to win a sorting competition might be of interest to you. Ed [1] http://sortbenchmark.org/Yahoo2009.pdf On Sun, Feb 28, 2010 at 1:53 PM, aa...@buffalo.edu wrote: Hello, I am trying to write a simple sorting application for hadoop. This is what I have thought till now. Suppose I have 100 lines of data and 10 mappers, each of the 10 mappers will sort the data given to it. But I am unable to figure out is how to join these outputs to one big sorted array. In other words what should be the code to be written in the reduce ? Best Regards from Buffalo Abhishek Agrawal SUNY- Buffalo (716-435-7122)
Re: How are intermediate key/value pairs materialized between map and reduce?
As you noticed, your map tasks are spilling three times as many records as they are outputting. In general, if the map output buffer is large enough to hold all records in memory, these values will be equal. If there isn't enough room, as was the case with your job, the buffer makes additional intermediate spills. To fix this, you can try tuning the per-job configurables io.sort.mb and io.sort.record.percent. Look at the counters of a few map tasks to get an idea of how much data (io.sort.mb) and how many records (io.sort.record.percent) they produce. Ed On Wed, Feb 24, 2010 at 2:45 AM, Tim Kiefer tim-kie...@gmx.de wrote: Sure, I see: Map input eecords: 10,000 Map output records: 600,000 Map output bytes: 307,216,800,000 (each reacord is about 500kb - that fits the application and is to be expected) Map spilled records: 1,802,965 (ahhh... now that you ask for it - here there also is a factor of 3 between output and spilled). So - question now is: why are three times as many records spilled than actually produced by the mappers? In my map function, I do not perform any additional file writing besides the context.write() for the intermediate records. Thanks, Tim Am 24.02.2010 05:28, schrieb Amogh Vasekar: Hi, Can you let us know what is the value for : Map input records Map spilled records Map output bytes Is there any side effect file written? Thanks, Amogh On 2/23/10 8:57 PM, Tim Kiefertim-kie...@gmx.de wrote: No... 900GB is in the map column. Reduce adds another ~70GB of FILE_BYTES_WRITTEN and the total column consequently shows ~970GB. Am 23.02.2010 16:11, schrieb Ed Mazur: Hi Tim, I'm guessing a lot of these writes are happening on the reduce side. On the JT web interface, there are three columns: map, reduce, overall. Is the 900GB figure from the overall column? The value in the map column will probably be closer to what you were expecting. There are writes on the reduce side too during the shuffle and multi-pass merge. Ed 2010/2/23 Tim Kiefertim-kie...@gmx.de: Hi Gang, thanks for your reply. To clarify: I look at the statistics through the job tracker. In the webinterface for my job I have columns for map, reduce and total. What I was refering to is map - i.e. I see FILE_BYTES_WRITTEN = 3 * Map Output Bytes in the map column. About the replication factor: I would expect the exact same thing - changing to 6 has no influence on FILE_BYTES_WRITTEN. About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10. Furthermore, I have 40 mappers and map output data is ~300GB. I can't see how that ends up in a factor 3? - tim Am 23.02.2010 14:39, schrieb Gang Luo: Hi Tim, the intermediate data is materialized to local file system. Before it is available for reducers, mappers will sort them. If the buffer (io.sort.mb) is too small for the intermediate data, multi-phase sorting happen, which means you read and write the same bit more than one time. Besides, are you looking at the statistics per mapper through the job tracker, or just the information output when a job finish? If you look at the information given out at the end of the job, note that this is an overall statistics which include sorting at reduce side. It also include the amount of data written to HDFS (I am not 100% sure). And, the FILE-BYTES_WRITTEN has nothing to do with the replication factor. I think if you change the factor to 6, FILE_BYTES_WRITTEN is still the same. -Gang Hi there, can anybody help me out on a (most likely) simple unclarity. I am wondering how intermediate key/value pairs are materialized. 
I have a job where the map phase produces 600,000 records and map output bytes is ~300GB. What I thought (up to now) is that these 600,000 records, i.e., 300GB, are materialized locally by the mappers and that later on reducers pull these records (based on the key). What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is as high as ~900GB. So - where does the factor 3 come from between Map output bytes and FILE_BYTES_WRITTEN??? I thought about the replication factor of 3 in the file system - but that should be HDFS only?! Thanks - tim
Re: How are intermediate key/value pairs materialized between map and reduce?
Hi Tim, I'm guessing a lot of these writes are happening on the reduce side. On the JT web interface, there are three columns: map, reduce, overall. Is the 900GB figure from the overall column? The value in the map column will probably be closer to what you were expecting. There are writes on the reduce side too during the shuffle and multi-pass merge. Ed 2010/2/23 Tim Kiefer tim-kie...@gmx.de: Hi Gang, thanks for your reply. To clarify: I look at the statistics through the job tracker. In the webinterface for my job I have columns for map, reduce and total. What I was refering to is map - i.e. I see FILE_BYTES_WRITTEN = 3 * Map Output Bytes in the map column. About the replication factor: I would expect the exact same thing - changing to 6 has no influence on FILE_BYTES_WRITTEN. About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10. Furthermore, I have 40 mappers and map output data is ~300GB. I can't see how that ends up in a factor 3? - tim Am 23.02.2010 14:39, schrieb Gang Luo: Hi Tim, the intermediate data is materialized to local file system. Before it is available for reducers, mappers will sort them. If the buffer (io.sort.mb) is too small for the intermediate data, multi-phase sorting happen, which means you read and write the same bit more than one time. Besides, are you looking at the statistics per mapper through the job tracker, or just the information output when a job finish? If you look at the information given out at the end of the job, note that this is an overall statistics which include sorting at reduce side. It also include the amount of data written to HDFS (I am not 100% sure). And, the FILE-BYTES_WRITTEN has nothing to do with the replication factor. I think if you change the factor to 6, FILE_BYTES_WRITTEN is still the same. -Gang - 原始邮件 发件人: Tim Kiefer tim-kie...@gmx.de 收件人: common-user@hadoop.apache.org common-user@hadoop.apache.org 发送日期: 2010/2/23 (周二) 6:44:28 上午 主 题: How are intermediate key/value pairs materialized between map and reduce? Hi there, can anybody help me out on a (most likely) simple unclarity. I am wondering how intermediate key/value pairs are materialized. I have a job where the map phase produces 600,000 records and map output bytes is ~300GB. What I thought (up to now) is that these 600,000 records, i.e., 300GB, are materialized locally by the mappers and that later on reducers pull these records (based on the key). What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is as high as ~900GB. So - where does the factor 3 come from between Map output bytes and FILE_BYTES_WRITTEN??? I thought about the replication factor of 3 in the file system - but that should be HDFS only?! Thanks - tim ___ 好玩贺卡等你发,邮箱贺卡全新上线! http://card.mail.cn.yahoo.com/
Re: hadoop under cygwin issue
Brian, It looks like you're confusing your local file system with HDFS. HDFS sits on top of your file system and is where data for (non-standalone) Hadoop jobs comes from. You can poll it with fs -ls ..., so do something like hadoop fs -lsr / to see everything in HDFS. This will probably shed some light on why your first attempt failed. /user/brian/input should be a directory with several xml files. Ed On Wed, Feb 3, 2010 at 5:17 PM, Brian Wolf brw...@gmail.com wrote: Alex Kozlov wrote: Live Nodes http://localhost:50070/dfshealth.jsp#LiveNodes : 0 You datanode is dead. Look at the logs in the $HADOOP_HOME/logs directory (or where your logs are) and check the errors. Alex K On Mon, Feb 1, 2010 at 1:59 PM, Brian Wolf brw...@gmail.com wrote: Thanks for your help, Alex, I managed to get past that problem, now I have this problem: However, when I try to run this example as stated on the quickstart webpage: bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' I get this error; = java.io.IOException: Not a file: hdfs://localhost:9000/user/brian/input/conf = so it seems to default to my home directory looking for input it apparently needs an absolute filepath, however, when I run that way: $ bin/hadoop jar hadoop-*-examples.jar grep /usr/local/hadoop-0.19.2/input output 'dfs[a-z.]+' == org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/usr/local/hadoop-0.19.2/input == It still isn't happy although this part - /usr/local/hadoop-0.19.2/input - does exist Aaron, Thanks or your help. I carefully went through the steps again a couple times , and ran after this bin/hadoop namenode -format (by the way, it asks if I want to reformat, I've tried it both ways) then bin/start-dfs.sh and bin/start-all.sh and then bin/hadoop fs -put conf input now the return for this seemed cryptic: put: Target input/conf is a directory (??) and when I tried bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' It says something about 0 nodes (from log file) 2010-02-01 13:26:29,874 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=brian,None,Administrators,Users ip=/127.0.0.1 cmd=create src=/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar dst=null perm=brian:supergroup:rw-r--r-- 2010-02-01 13:26:30,045 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 9000, call addBlock(/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar, DFSClient_725490811) from 127.0.0.1:3003: error: java.io.IOException: File /cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar could only be replicated to 0 nodes, instead of 1 java.io.IOException: File /cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1287) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) To maybe rule out something regarding ports or ssh , when I run netstat: TCP 127.0.0.1:9000 0.0.0.0:0 LISTENING TCP 127.0.0.1:9001 0.0.0.0:0 LISTENING and when I browse to http://localhost:50070/ Cluster Summary * * * 21 files and directories, 0 blocks = 21 total. 
Heap Size is 8.01 MB / 992.31 MB (0%) * Configured Capacity : 0 KB DFS Used : 0 KB Non DFS Used : 0 KB DFS Remaining : 0 KB DFS Used% : 100 % DFS Remaining% : 0 % Live Nodes http://localhost:50070/dfshealth.jsp#LiveNodes : 0 Dead Nodes http://localhost:50070/dfshealth.jsp#DeadNodes : 0 so I'm a bit still in the dark, I guess. Thanks Brian Aaron Kimball wrote: Brian, it looks like you missed a step in the instructions. You'll need to format the hdfs filesystem instance before starting the NameNode server: You need to run: $ bin/hadoop namenode -format .. then you can do bin/start-dfs.sh Hope this helps, - Aaron On Sat, Jan 30, 2010 at 12:27 AM, Brian Wolf brw...@gmail.com wrote: Hi, I am trying to run Hadoop 0.19.2 under cygwin as per directions on the hadoop quickstart web page. I know sshd is running and I can ssh localhost without a password. This is from my hadoop-site.xml configuration property
Re: Failed to install Hadoop on WinXP
I tried running 0.20.0 on XP too a few weeks ago and stuck at the same spot. No problems with standalone mode. Any insight would be appreciated, thanks. Ed On Wed, Jan 27, 2010 at 11:41 AM, Yura Taras yura.ta...@gmail.com wrote: Hi all I'm trying to deploy pseudo-distributed cluster on my devbox which runs under WinXP. I did following steps: 1. Installed cygwin with ssh, configured ssh 2. Downloaded hadoop and extracted it, set JAVA_HOME and HADOOP_HOME env vars (I made a symlink to java home, so it don't contain spaces) 3. Adjusted conf/hadoop-env.sh to point to correct JAVA_HOME 4. Adjusted conf files to following values: * core-site.xml: configuration property namehadoop.tmp.dir/name value/hdfs/hadoop/value descriptionA base for other temporary directories./description /property property namefs.default.name/name valuehdfs://localhost:/value /property /configuration * hdfs-site.xml: configuration property namedfs.replication/name value1/value /property /configuration * mapred-site.xml: configuration property namemapred.job.tracker/name valuelocalhost:/value /property /configuration 5. Next I execute following line: $ bin/hadoop namenode -format bin/start-all.sh bin/hadoop fs -put conf input bin/hadoop jar hadoop-0.20.1-examples.jar grep input output 'dfs[a-z.]' I receive following exception: localhost: starting tasktracker, logging to /home/ytaras/hadoop/bin/../logs/hadoop-ytaras-tasktracker-bueno.out 10/01/27 18:23:55 INFO mapred.FileInputFormat: Total input paths to process : 13 10/01/27 18:23:56 INFO mapred.JobClient: Running job: job_201001271823_0001 10/01/27 18:23:57 INFO mapred.JobClient: map 0% reduce 0% 10/01/27 18:24:09 INFO mapred.JobClient: Task Id : attempt_201001271823_0001_m_14_0, Status : FAILED java.io.FileNotFoundException: File D:/hdfs/hadoop/mapred/local/taskTracker/jobcache/job_201001271823_0001/attempt_201001271823_0001_m_14_0/work/tmp does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.mapred.TaskRunner.setupWorkDir(TaskRunner.java:519) at org.apache.hadoop.mapred.Child.main(Child.java:155) !! SKIP - above exception few times !! 10/01/27 18:24:51 INFO mapred.JobClient: Job complete: job_201001271823_0001 10/01/27 18:24:51 INFO mapred.JobClient: Counters: 0 java.io.IOException: Job failed! 
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252) at org.apache.hadoop.examples.Grep.run(Grep.java:69) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.examples.Grep.main(Grep.java:93) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68) at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Am I doing something wrong (don't say just 'use Linux' :-) )? Thanks
Re: do all mappers finish before reducer starts
You're right that the user reduce function cannot be applied until all maps have completed. The values being reported about job completion are a bit misleading in this sense. The reduce percentage you're seeing actually encompasses three parts:
1. Fetching map output data
2. Merging map output data
3. Applying the user reduce function
Only the third part has the constraint of waiting for all maps; the other two can be done in parallel, hence the reduce percentage increasing before map completes. 0-33% reduce corresponds to step 1, 33-67% to step 2, and 67-100% to step 3. There is overlap between parts 1 and 2 as the reduce memory buffer fills up, merges, and spills to disk. There is also overlap between parts 2 and 3 because the final merge is fed directly into the user reduce function to minimize the amount of data written to disk. Ed On Tue, Jan 26, 2010 at 5:27 PM, adeelmahmood adeelmahm...@gmail.com wrote: I just have a conceptual question. My understanding is that all the mappers have to complete their job before the reducers can start working, because mappers don't know about each other; we need the values for a given key from all the different mappers, so we have to wait until all mappers have collectively given the system every possible value for a key, so that those can then be passed on to the reducer. But when I ran these jobs, almost every time the reducers started working before the mappers were all done, so it would say map 60% reduce 30%. How does this work? Does it find all possible values for a single key from all mappers, pass those to the reducer, and then work on other keys? Any help is appreciated. -- View this message in context: http://old.nabble.com/do-all-mappers-finish-before-reducer-starts-tp27330927p27330927.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
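To make the three-phase weighting above concrete, here is a tiny sketch in Java (illustrative only, not the actual Hadoop source): the reported reduce percentage can be read as an equal-weighted average of the three phases.

    // Illustrative only -- not the real JobClient code, just the weighting described above.
    public class ReduceProgressExample {
        // Each of the three phases contributes one third of the reported reduce percentage.
        static double reduceProgress(double copyFraction, double mergeFraction, double reduceFraction) {
            return 100.0 * (copyFraction + mergeFraction + reduceFraction) / 3.0;
        }

        public static void main(String[] args) {
            // Shuffle finished, merge half done, user reduce not started:
            System.out.println(reduceProgress(1.0, 0.5, 0.0)); // prints 50.0
        }
    }

So in the original example, "map 60% reduce 30%" means the reducers are still in step 1, fetching the map output that is already available; no user reduce code has run yet.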
Re: Help on processing large amount of videos on hadoop
Hi Huazhong, Sounds like an interesting application. Here are a few tips.
1. If the frames are not independent, you should find a way to key them according to their order before dumping them into Hadoop so that they can be sorted as part of your map reduce task. BTW, the video won't appear split while it's in HDFS; HDFS does use a block splitting scheme for replication and (sort of) job distribution, but this isn't mandatory, and there are lots of facilities to customize this behavior.
2. Is the audio needed? If not, it may make sense to preprocess the data as (key, image) pairs, where the key is something like what I mentioned above, and the image is a custom writable that handles the data in a common format like gif, jpg, png, whatever.
3. You'll need to have some in-depth knowledge of your video codec for this. While I'm not an expert on video codecs, I think that many of them do their compression by specifying key frames that have complete data, and then representing subsequent frames as differences from a key frame. You can use a custom input format and input split to split on frame boundaries, but you will need to be an expert on your codec to do so. It might be easier to use a framework to pre-transcode the data into whatever format you will use for your map reduce jobs.
On Thu, Dec 17, 2009 at 2:25 PM, Huazhong Ning n...@akiira.com wrote: Hi, I set up a hadoop platform and I am going to use it to process a large number of videos (each about 500M-1G in size). But I've run into some hard issues:
1. The frames in each video are not independent, so we may have problems if we split the video into blocks and distribute them in HDFS.
2. The video is compressed, but we hope the input to the map class is video frames. In other words, we need to put the codec somewhere.
3. Our codec (third party source code) takes the video file name as input. Can we get the file name?
Any suggestions and comments are welcome. Thanks a lot. Ning
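To illustrate point 2, here is a minimal sketch of what such a custom value type could look like. The class name is hypothetical, and it assumes the frame bytes have already been decoded into a standard image format; the key would be something like a LongWritable frame index so frames can be re-sorted into their original order.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Hypothetical value type holding one decoded frame (e.g. PNG or JPEG bytes).
    public class FrameWritable implements Writable {
        private byte[] imageBytes = new byte[0];

        public void set(byte[] bytes) { this.imageBytes = bytes; }
        public byte[] get()           { return imageBytes; }

        public void write(DataOutput out) throws IOException {
            // Length-prefixed so readFields knows how much to read back.
            out.writeInt(imageBytes.length);
            out.write(imageBytes);
        }

        public void readFields(DataInput in) throws IOException {
            int len = in.readInt();
            imageBytes = new byte[len];
            in.readFully(imageBytes);
        }
    }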
Re: Can hadoop 0.20.1 programs runs on Amazon Elastic Mapreduce?
Last time I checked, EMR only runs 0.18.3. You can use EC2 though, which winds up being cheaper anyway. On Wed, Dec 16, 2009 at 8:51 PM, 松柳 lamfeeli...@gmail.com wrote: Hi all, I'm wondering whether Amazon has started to support the newest stable version of Hadoop, or whether we can still only use 0.18.3? Song Liu
Re: multiple file input
One important thing to note is that, with cross products, you'll almost always get better performance if you can fit both files on a single node's disk rather than distributing the files. On Tue, Dec 8, 2009 at 9:18 AM, laser08150815 la...@laserxyz.de wrote: pmg wrote: I am evaluating hadoop for a problem that does a Cartesian product of the input from one file of 600K records (FileA) with another set of files (FileB1, FileB2, FileB3) with 2 million lines in total. Each line from FileA gets compared with every line from FileB1, FileB2, etc. FileB1, FileB2, etc. are in a different input directory. So there are two input directories:
1. input1 directory with a single file of 600K records - FileA
2. input2 directory segmented into different files with 2 million records - FileB1, FileB2, etc.
How can I have a map that reads a line from FileA in directory input1 and compares the line with each line from input2? What is the best way forward? I have seen plenty of examples that map each record from a single input file and reduce it into an output. Thanks. I had a similar problem and solved it by writing a custom InputFormat (see attachment). You should improve the methods ACrossBInputSplit.getLength, ACrossBRecordReader.getPos and ACrossBRecordReader.getProgress. -- View this message in context: http://old.nabble.com/multiple-file-input-tp24095358p26694569.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
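As a complement to the custom InputFormat approach, here is a rough sketch of the "keep one file local to every node" idea using the old mapred API of that era: ship the smaller FileA to every node via the DistributedCache (registered at job setup with DistributedCache.addCacheFile) and stream the FileB files through the mappers. The class name and file handling are illustrative assumptions, and FileA is assumed to fit in memory.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch: FileA (600K records) is loaded once per task from the distributed
    // cache; FileB1, FileB2, ... are the normal job input, split across mappers.
    public class CrossProductMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private final List<String> fileALines = new ArrayList<String>();

        public void configure(JobConf conf) {
            try {
                // Assumes FileA was added with DistributedCache.addCacheFile at job setup.
                Path[] cached = DistributedCache.getLocalCacheFiles(conf);
                BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
                String line;
                while ((line = reader.readLine()) != null) {
                    fileALines.add(line);
                }
                reader.close();
            } catch (IOException e) {
                throw new RuntimeException("Could not load FileA from the distributed cache", e);
            }
        }

        public void map(LongWritable offset, Text fileBLine,
                        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            // Pair this FileB line with every FileA line.
            for (String aLine : fileALines) {
                out.collect(new Text(aLine), fileBLine);
            }
        }
    }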
Re: RE: Using Hadoop in non-typical large scale user-driven environment
As far as replication goes, you should look at a project called Pastry. Apparently some people have used Hadoop MapReduce on top of it. You will need to be clever, however, in how you do your MapReduce, because you probably won't want the job to eat all of the users' CPU time. On Dec 2, 2009 5:11 PM, Habermaas, William william.haberm...@fatwire.com wrote: Hadoop isn't going to like losing its datanodes when people shut down their computers. More importantly, when the datanodes are running, your users will be impacted by data replication. Unlike SETI, Hadoop doesn't know when the user's screensaver is running, so it will start doing things whenever it feels like it. Can someone else comment on whether HOD (hadoop-on-demand) would fit this scenario? Bill -Original Message- From: Maciej Trebacz [mailto: maciej.treb...@gmail.com] Sent: Wednesday,...
Re: New graphic interface for Hadoop - Contains: FileManager, Daemon Admin, Quick Stream Job Setup, etc
The tool looks interesting. You should consider providing the source for it. Is it written in a language that can run on platforms besides Windows? On Nov 17, 2009 10:40 AM, Cubic cubicdes...@gmail.com wrote: Hi list. This tool is a graphic interface for Hadoop. It may improve your productivity quite a bit, especially if you work intensively with files inside HDFS. Note: on my computer it is functional, but it hasn't yet been tested on other computers. Download link: http://www.dnabaser.com/hadoop-gui-shell/index.html Please feel free to send feedback. :)
Re: About Distribute Cache
Hi, What you can fit in the distributed cache generally depends on the available disk space on your nodes. With most clusters 300 MB will not be a problem, but it depends on the cluster and the workload you're processing. On Sat, Nov 14, 2009 at 10:34 PM, 于凤东 fengdon...@gmail.com wrote: I have a 300 MB file that I want to put into the distributed cache, but I want to know whether that is too large a file for the distributed cache. And normally, how big are the files we put into the DC?
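For reference, registering such a file with the classic API looks roughly like this (the HDFS path and class name are made up for illustration):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetupExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CacheSetupExample.class);

            // Register a ~300 MB HDFS file; each task node pulls one local copy,
            // so the practical limit is the local disk space on the nodes.
            DistributedCache.addCacheFile(new URI("/user/hadoop/lookup/bigfile.dat"), conf);

            // Inside a task (e.g. in Mapper.configure), the local copies are
            // retrieved with DistributedCache.getLocalCacheFiles(conf).
        }
    }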