Re: Reduce doesn't start until map finishes

2009-03-03 Thread Mikhail Yakshin
On Wed, Mar 4, 2009 at 2:09 AM, Chris Douglas wrote:
 This is normal behavior. The Reducer is guaranteed to receive all the
 results for its partition in sorted order. No reduce can start until all the
 maps are completed, since any running map could emit a result that would
 violate the order for the results it currently has. -C

_Reducers_ usually start almost immediately and begin downloading data
emitted by mappers as they go. This is their first phase (copy, a.k.a.
shuffle). Their second phase, sorting the received data, can start only
after all mappers have completed, and the real reduction happens only
in their third phase.
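
As an aside: if what you actually want to tune is how early that copy
phase kicks in, I believe Hadoop 0.19+ exposes this as the
mapred.reduce.slowstart.completed.maps property (please check your
release). A minimal sketch, with a hypothetical helper class just to
show the knob in context:

import org.apache.hadoop.mapred.JobConf;

public class ReduceSlowstart {
    // Hypothetical helper; the property name is the interesting part.
    public static void tuneCopyPhaseStart(JobConf conf) {
        // Launch reducers (and hence their copy phase) only once 50% of
        // the maps have finished, instead of the default of about 5%.
        // This does not change when reduce() itself runs: that still
        // waits for all maps and for the sort.
        conf.set("mapred.reduce.slowstart.completed.maps", "0.50");
    }
}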

-- 
WBR, Mikhail Yakshin


Re: Can anyone verify Hadoop FS shell command return codes?

2009-02-26 Thread Mikhail Yakshin
On Mon, Feb 23, 2009 at 4:02 PM, S D wrote:
 I'm attempting to use the Hadoop FS shell
 (http://hadoop.apache.org/core/docs/current/hdfs_shell.html) within a
 Ruby script. My
 challenge is that I'm unable to get the function return value of the
 commands I'm invoking. As an example, I try to run get as follows

 hadoop fs -get /user/hadoop/testFile.txt .

 From the command line this generally works but I need to be able to verify
 that it is working during execution in my ruby script. The command should
 return 0 on success and -1 on error. Based on

 http://pasadenarb.com/2007/03/ruby-shell-commands.html

 I am using backticks to make the hadoop call and get the return value. Here
 is a dialogue within irb (Ruby's interactive shell) in which the command was
 not successful:

 irb(main):001:0> `hadoop dfs -get testFile.txt .`
 get: null
 => ""

 and a dialogue within irb in which the command was successful

 irb(main):010:0> `hadoop dfs -get testFile.txt .`
 => ""

 In both cases, neither a 0 nor a 1 appeared as a return value; indeed
 nothing was returned. Can anyone who is using the FS command shell return
 values within any scripting language (Ruby, PHP, Perl, ...) please confirm
 that it is working as expected or send an example snippet?

You seem to be confusing the captured stdout output with the exit
status. Backticks return whatever the command printed to stdout; the
exit status goes into $?. Try analyzing $?.exitstatus in Ruby:

irb(main):001:0> `true`
=> ""
irb(main):002:0> $?.exitstatus
=> 0
irb(main):003:0> `false`
=> ""
irb(main):004:0> $?.exitstatus
=> 1

-- 
WBR, Mikhail Yakshin


Re: Using Hadoop for near real-time processing of log data

2009-02-25 Thread Mikhail Yakshin
Hi,

 Is anyone using Hadoop as more of a near/almost real-time processing
 of log data for their systems to aggregate stats, etc?

We do, although "near realtime" is a pretty relative term and your
mileage may vary. For example, startup and shutdown of Hadoop jobs are
fairly expensive: it can take anywhere from 5-10 seconds up to several
minutes to get a job started, and roughly the same goes for job
finalization. Generally, if your notion of near realtime can tolerate a
lag of 3-5 minutes, it's possible to use Hadoop.

-- 
WBR, Mikhail Yakshin


Re: Using Hadoop for near real-time processing of log data

2009-02-25 Thread Mikhail Yakshin
On Wed, Feb 25, 2009 at 10:09 PM, Edward Capriolo wrote:
 Is anyone using Hadoop as more of a near/almost real-time processing
 of log data for their systems to aggregate stats, etc?

 We do, although "near realtime" is a pretty relative term and your
 mileage may vary. For example, startup and shutdown of Hadoop jobs are
 fairly expensive: it can take anywhere from 5-10 seconds up to several
 minutes to get a job started, and roughly the same goes for job
 finalization. Generally, if your notion of near realtime can tolerate a
 lag of 3-5 minutes, it's possible to use Hadoop.

 I was thinking about this. Assuming your datasets are small, would
 running a local jobtracker or even running the MiniMR cluster from the
 test case be an interesting way to run small jobs confined to one CPU?

Yeah, but what's the point of using Hadoop then, i.e. if we lose all
the parallelism?

-- 
WBR, Mikhail Yakshin


Re: HDD benchmark/checking tool

2009-02-04 Thread Mikhail Yakshin
On Tue, Feb 3, 2009 at 8:53 PM, Dmitry Pushkarev wrote:

 Recently I have had a number of drive failures that slowed down processes
 a lot until they were discovered. Is there any easy way or tool to check
 HDD performance and see if there are any IO errors?

 Currently I wrote a simple script that looks at /var/log/messages and greps
 everything abnormal for /dev/sdaX. But if you have a better solution I'd
 appreciate it if you shared it.

If you have any hardware RAIDs you'd like to monitor/manage, chances
are good that you'd want to use Einarc to access them:
http://www.inquisitor.ru/doc/einarc/ - in fact, it won't hurt even if
you use just a bunch of HDDs or software RAIDs :)

-- 
WBR, Mikhail Yakshin


Hadoop 0.19, Cascading 1.0 and MultipleOutputs problem

2009-01-28 Thread Mikhail Yakshin
Hi,

We have a system based on Hadoop 0.18 / Cascading 0.8.1 and now I'm
trying to port it to Hadoop 0.19 / Cascading 1.0. The first serious
problem I've run into is that we're extensively using MultipleOutputs in
our jobs dealing with sequence files that store Cascading's Tuples.

Since Cascading 0.9, Tuple stopped being a WritableComparable and
implements the generic Hadoop serialization framework instead. However,
in Hadoop 0.19, MultipleOutputs requires the older WritableComparable
interface. Thus, trying to do something like:

MultipleOutputs.addNamedOutput(conf, "output-name",
    MySpecialMultiSplitOutputFormat.class, Tuple.class, Tuple.class);
mos = new MultipleOutputs(conf);
...
mos.getCollector("output-name", reporter).collect(tuple1, tuple2);

yields an error:

java.lang.RuntimeException: java.lang.RuntimeException: class
cascading.tuple.Tuple not org.apache.hadoop.io.WritableComparable
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752)
at 
org.apache.hadoop.mapred.lib.MultipleOutputs.getNamedOutputKeyClass(MultipleOutputs.java:252)
at 
org.apache.hadoop.mapred.lib.MultipleOutputs$InternalFileOutputFormat.getRecordWriter(MultipleOutputs.java:556)
at 
org.apache.hadoop.mapred.lib.MultipleOutputs.getRecordWriter(MultipleOutputs.java:425)
at 
org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:511)
at 
org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:476)
at my.namespace.MyReducer.reduce(MyReducer.java:xxx)

Is there any known workaround for that? Any progress going on to make
MultipleOutputs use generic Hadoop serialization?

-- 
WBR, Mikhail Yakshin


Re: what does this error mean

2008-09-24 Thread Mikhail Yakshin
On Wed, Sep 24, 2008 at 9:24 PM, Elia Mazzawi wrote:
 I got these errors and I don't know what they mean; any help is appreciated.
 I suspect that it's either an H/W error or the cluster is out of space to
 store intermediate results? There is still lots of free space left on the
 cluster.

 08/09/24 00:23:31 INFO mapred.JobClient:  map 79% reduce 24%
 08/09/24 00:24:53 INFO mapred.JobClient:  map 80% reduce 24%
 08/09/24 00:26:28 INFO mapred.JobClient:  map 80% reduce 0%
 08/09/24 00:26:28 INFO mapred.JobClient: Task Id :
 task_200809041356_0037_r_00_2, Status : FAILED
 java.io.IOException: task_200809041356_0037_r_00_2The reduce copier
 failed
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:329)
   at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

Nope. In fact, it can be much more complicated than that %)

There are several possible causes of this error, as far as I know:

* An error in the partitioner (for example, look out for negative keys
combined with the usual key % numReducers approach: Java's modulo
operator (%) yields negative results for negative operands, effectively
sending output to non-existing reducers); see the sketch below.
* An error in a group/key/value comparator.
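
To illustrate the first point, here is a minimal sketch of a custom
partitioner (class and key/value types are made up, not from your job)
that avoids the negative-modulo trap the same way Hadoop's own
HashPartitioner does, by masking off the sign bit:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SafeModuloPartitioner implements Partitioner<LongWritable, Text> {
    public void configure(JobConf conf) {
        // Nothing to configure in this sketch.
    }

    public int getPartition(LongWritable key, Text value, int numReduceTasks) {
        // A plain (int) key.get() % numReduceTasks would go negative for
        // negative keys, pointing at a reducer that doesn't exist.
        // Clearing the sign bit keeps the result in [0, numReduceTasks).
        return (((int) key.get()) & Integer.MAX_VALUE) % numReduceTasks;
    }
}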

-- 
WBR, Mikhail Yakshin


Re: Slaves Hot-Swaping

2008-09-02 Thread Mikhail Yakshin
On Tue, Sep 2, 2008 at 7:33 PM, Camilo Gonzalez wrote:
 I was wondering if there is a way to Hot-Swap Slave machines. For example,
 in case a Slave machine fails while the Cluster is running and I want to
 mount a new Slave machine to replace the old one, is there a way to tell
 the Master that a new Slave machine is Online without having to stop and
 start the Cluster again?

You don't have to restart the entire cluster, you just have to run the
datanode (DFS support) and/or tasktracker processes on the fresh node.
You can do it using hadoop-daemon.sh; the commands are "start datanode"
and "start tasktracker" respectively, as shown below. There's no need
for hot swapping, i.e. replacing old slave machines with new ones
pretending to be the old ones. You just plug a new one in with a new
IP/hostname and it will eventually start doing tasks like all the other
nodes.
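
On the fresh node, assuming it already has the same Hadoop installation
and configuration as the rest of the cluster (the bin/ path here is just
illustrative), that boils down to:

bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker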

You don't really need hot standby or any other high-availability
scheme. You just plug all available slaves in and the cluster will
balance everything out.

-- 
WBR, Mikhail Yakshin