Re: Reduce doesn't start until map finishes
On Wed, Mar 4, 2009 at 2:09 AM, Chris Douglas wrote:
> This is normal behavior. The Reducer is guaranteed to receive all the
> results for its partition in sorted order. No reduce can start until all
> the maps are completed, since any running map could emit a result that
> would violate the order for the results it currently has. -C

_Reducers_ usually start almost immediately and begin downloading the data emitted by mappers as they go; this is their first phase. Their second phase can start only after all mappers have completed: in it, they sort the received data. In their third phase they do the real reduction.

--
WBR, Mikhail Yakshin
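As an aside, how early the reducers are even scheduled is tunable. Here is a minimal sketch, assuming a Hadoop release that supports the mapred.reduce.slowstart.completed.maps parameter (check mapred-default.xml for your version before relying on it):

  import org.apache.hadoop.mapred.JobConf;

  public class SlowstartExample {
      public static void main(String[] args) {
          JobConf conf = new JobConf();
          // Don't launch reducers until 80% of the maps have finished,
          // so reduce slots aren't tied up during a long map phase.
          conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);
      }
  }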
Re: Can anyone verify Hadoop FS shell command return codes?
On Mon, Feb 23, 2009 at 4:02 PM, S D wrote:
> I'm attempting to use the Hadoop FS shell
> (http://hadoop.apache.org/core/docs/current/hdfs_shell.html) within a
> Ruby script. My challenge is that I'm unable to get the return value of
> the commands I'm invoking. As an example, I try to run get as follows:
>
>   hadoop fs -get /user/hadoop/testFile.txt .
>
> From the command line this generally works, but I need to be able to
> verify that it is working during execution of my Ruby script. The
> command should return 0 on success and -1 on error. Based on
> http://pasadenarb.com/2007/03/ruby-shell-commands.html I am using
> backticks to make the hadoop call and get the return value. Here is a
> dialogue within irb (Ruby's interactive shell) in which the command was
> not successful:
>
>   irb(main):001:0> `hadoop dfs -get testFile.txt .`
>   get: null
>   => ""
>
> and a dialogue within irb in which the command was successful:
>
>   irb(main):010:0> `hadoop dfs -get testFile.txt .`
>   => ""
>
> In both cases, neither a 0 nor a -1 appeared as a return value; indeed,
> nothing was returned. Can anyone who is using the FS shell return values
> within any scripting language (Ruby, PHP, Perl, ...) please confirm that
> it is working as expected, or send an example snippet?

You seem to be confusing the captured stdout output with the exit status. Backticks return whatever the command printed to stdout; the exit status lives in $?. Try analyzing $?.exitstatus in Ruby:

  irb(main):001:0> `true`
  => ""
  irb(main):002:0> $?.exitstatus
  => 0
  irb(main):003:0> `false`
  => ""
  irb(main):004:0> $?.exitstatus
  => 1

--
WBR, Mikhail Yakshin
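For the record, the stdout-vs-exit-status distinction is the same in any language. A minimal Java sketch (the class name is made up, and it assumes the hadoop binary is on the PATH):

  import java.io.IOException;

  public class HadoopGetStatus {
      public static void main(String[] args) throws IOException, InterruptedException {
          // Start "hadoop dfs -get testFile.txt ." as a child process.
          Process p = new ProcessBuilder("hadoop", "dfs", "-get", "testFile.txt", ".").start();
          // waitFor() returns the exit status, regardless of what the
          // command printed to stdout or stderr.
          int status = p.waitFor();
          System.out.println("exit status: " + status);
      }
  }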
Re: Using Hadoop for near real-time processing of log data
> Hi, Is anyone using Hadoop as more of a near/almost real-time processing
> system for log data, to aggregate stats, etc.?

We do, although "near real-time" is a pretty relative term and your mileage may vary. For example, startup and shutdown of Hadoop jobs are pretty expensive: it can take anything from 5-10 seconds up to several minutes to get a job started, and almost the same goes for job finalization. Generally, if your near real-time requirements can tolerate a lag of 3-5 minutes, it's possible to use Hadoop.

--
WBR, Mikhail Yakshin
Re: Using Hadoop for near real-time processing of log data
On Wed, Feb 25, 2009 at 10:09 PM, Edward Capriolo wrote:
>>> Is anyone using Hadoop as more of a near/almost real-time processing
>>> system for log data, to aggregate stats, etc.?
>>
>> We do, although "near real-time" is a pretty relative term and your
>> mileage may vary. For example, startup and shutdown of Hadoop jobs are
>> pretty expensive: it can take anything from 5-10 seconds up to several
>> minutes to get a job started, and almost the same goes for job
>> finalization. Generally, if your near real-time requirements can
>> tolerate a lag of 3-5 minutes, it's possible to use Hadoop.
>
> I was thinking about this. Assuming your datasets are small, would
> running a local jobtracker, or even running the MiniMR cluster from the
> test case, be an interesting way to run small jobs confined to one CPU?

Yeah, but what's the point of using Hadoop then? I.e., we lose all the parallelism.

--
WBR, Mikhail Yakshin
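For completeness, the local jobtracker Edward mentions is just a configuration switch. A minimal sketch of the classic local-mode settings (the class name is made up):

  import org.apache.hadoop.mapred.JobConf;

  public class LocalModeExample {
      public static void main(String[] args) {
          JobConf conf = new JobConf();
          // "local" selects the LocalJobRunner: the whole job runs in a
          // single JVM, with no cluster and hence no parallelism.
          conf.set("mapred.job.tracker", "local");
          // Keep input/output on the local filesystem as well.
          conf.set("fs.default.name", "file:///");
      }
  }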
Re: HDD benchmark/checking tool
On Tue, Feb 3, 2009 at 8:53 PM, Dmitry Pushkarev wrote:
> Recently I have had a number of drive failures that slowed down
> processes a lot until they were discovered. Is there any easy way, or
> tool, to check HDD performance and see if there are any IO errors?
> Currently I have a simple script that looks at /var/log/messages and
> greps everything abnormal for /dev/sdaX, but if you have a better
> solution I'd appreciate it if you'd share it.

If you have any hardware RAIDs you'd like to monitor/manage, chances are good that you'll want to use Einarc to access them: http://www.inquisitor.ru/doc/einarc/ - in fact, it won't hurt even if you use just a bunch of HDDs or software RAIDs :)

--
WBR, Mikhail Yakshin
Hadoop 0.19, Cascading 1.0 and MultipleOutputs problem
Hi,

We have a system based on Hadoop 0.18 / Cascading 0.8.1, and now I'm trying to port it to Hadoop 0.19 / Cascading 1.0. The first serious problem I've run into is that we extensively use MultipleOutputs in jobs dealing with sequence files that store Cascading's Tuples. Since Cascading 0.9, Tuple stopped being a WritableComparable and implemented the generic Hadoop serialization framework instead. However, in Hadoop 0.19, MultipleOutputs requires the older WritableComparable interface. Thus, trying to do something like:

  MultipleOutputs.addNamedOutput(conf, "output-name",
      MySpecialMultiSplitOutputFormat.class, Tuple.class, Tuple.class);
  mos = new MultipleOutputs(conf);
  ...
  mos.getCollector("output-name", reporter).collect(tuple1, tuple2);

yields an error:

  java.lang.RuntimeException: java.lang.RuntimeException: class cascading.tuple.Tuple not org.apache.hadoop.io.WritableComparable
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752)
    at org.apache.hadoop.mapred.lib.MultipleOutputs.getNamedOutputKeyClass(MultipleOutputs.java:252)
    at org.apache.hadoop.mapred.lib.MultipleOutputs$InternalFileOutputFormat.getRecordWriter(MultipleOutputs.java:556)
    at org.apache.hadoop.mapred.lib.MultipleOutputs.getRecordWriter(MultipleOutputs.java:425)
    at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:511)
    at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:476)
    at my.namespace.MyReducer.reduce(MyReducer.java:xxx)

Is there any known workaround for this? Is there any progress toward making MultipleOutputs use generic Hadoop serialization?

--
WBR, Mikhail Yakshin
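One possible stopgap, sketched below under heavy assumptions: pre-serialize each Tuple to bytes (using Cascading's own serialization) and hand MultipleOutputs a WritableComparable shim that carries those bytes. TupleWrapper is a made-up class, and note that the resulting files then contain wrappers rather than raw Tuples, so anything reading them back has to unwrap:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;

  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.WritableComparable;

  // Hypothetical shim: carries a Tuple that was already serialized to
  // bytes elsewhere, so MultipleOutputs sees a WritableComparable.
  public class TupleWrapper implements WritableComparable<TupleWrapper> {
      private final BytesWritable bytes = new BytesWritable();

      public void set(byte[] serializedTuple) {
          bytes.set(serializedTuple, 0, serializedTuple.length);
      }

      public byte[] get() {
          return bytes.getBytes();
      }

      public void write(DataOutput out) throws IOException {
          bytes.write(out);
      }

      public void readFields(DataInput in) throws IOException {
          bytes.readFields(in);
      }

      public int compareTo(TupleWrapper other) {
          // Compare by raw bytes; this preserves no Tuple semantics,
          // it only satisfies the WritableComparable contract.
          return bytes.compareTo(other.bytes);
      }
  }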
Re: what does this error mean
On Wed, Sep 24, 2008 at 9:24 PM, Elia Mazzawi wrote:
> I got these errors; I don't know what they mean. Any help is
> appreciated. I suspect that either it's a H/W error or the cluster is
> out of space to store intermediate results? There is still lots of free
> space left on the cluster.
>
> 08/09/24 00:23:31 INFO mapred.JobClient: map 79% reduce 24%
> 08/09/24 00:24:53 INFO mapred.JobClient: map 80% reduce 24%
> 08/09/24 00:26:28 INFO mapred.JobClient: map 80% reduce 0%
> 08/09/24 00:26:28 INFO mapred.JobClient: Task Id : task_200809041356_0037_r_00_2, Status : FAILED
> java.io.IOException: task_200809041356_0037_r_00_2The reduce copier failed
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:329)
>   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

Nope. In fact, it can be much more complicated than that %) There are several possible causes of this error, as far as I know:

* An error in the partitioner. For example, watch out for negative keys combined with the default key % numReducers algorithm: Java's strange modulo operator (%) yields negative values for negative keys, effectively sending output to non-existing reducers (see the sketch below).
* An error in a group/key/value comparator.

--
WBR, Mikhail Yakshin
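To illustrate the first point, here is a minimal sketch of a partitioner that avoids the negative-modulo trap (SafePartitioner is a made-up name; the classic mapred API is assumed):

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class SafePartitioner implements Partitioner<IntWritable, Text> {
      public void configure(JobConf job) {}

      public int getPartition(IntWritable key, Text value, int numPartitions) {
          // key.get() % numPartitions is negative for negative keys in
          // Java, which would address a non-existent reducer. Masking the
          // sign bit first keeps the result in [0, numPartitions).
          return (key.get() & Integer.MAX_VALUE) % numPartitions;
      }
  }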
Re: Slaves Hot-Swaping
On Tue, Sep 2, 2008 at 7:33 PM, Camilo Gonzalez wrote:
> I was wondering if there is a way to hot-swap slave machines. For
> example, in case a slave machine fails while the cluster is running and
> I want to mount a new slave machine to replace the old one, is there a
> way to tell the master that a new slave machine is online without having
> to stop and start the cluster again?

You don't have to restart the entire cluster; you just have to run the datanode (DFS support) and/or tasktracker processes on the fresh node. You can do this using hadoop-daemon.sh:

  hadoop-daemon.sh start datanode
  hadoop-daemon.sh start tasktracker

There's no need for hot-swapping and replacing old slave machines with new ones pretending to be the old ones. You just plug a new one in, with a new IP/hostname, and it will eventually start to do tasks like all the other nodes. You don't really need hot standby or any other high-availability scheme: just plug all possible slaves in and the cluster will balance everything out.

--
WBR, Mikhail Yakshin