RE: Difference between Hadoop Streaming and "Normal" mode

2008-08-12 Thread John DeTreville
I think you will find that the Streaming model buys you convenience,
but costs you performance and generality. I'll let others quantify
how much of each.

Cheers,
John

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
Gaurav Veda
Sent: Tuesday, August 12, 2008 3:10 PM
To: core-user@hadoop.apache.org
Subject: Difference between Hadoop Streaming and "Normal" mode

Hi All,

This might seem too silly, but I couldn't find a satisfactory answer
to this yet. What are the advantages / disadvantages of using Hadoop
Streaming over the normal mode (wherein you write your own mapper and
reducer in Java)? From what I gather, the real advantage of Hadoop
Streaming is that you can use any executable (written in C, Perl, Python,
etc.) as a mapper / reducer.
A slight disadvantage is that the default is to read from standard input
and write to standard output, though one can specify one's own input and
output formats (and package them with the default Hadoop Streaming jar
file).
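
To make the standard input / standard output contract concrete, here is a minimal word-count sketch in Python. It assumes the usual Streaming convention of one record per line with key and value separated by a tab, and it assumes the reducer sees its input sorted by key. The function names (map_lines, reduce_lines, run) are hypothetical and not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper logic: emit one 'word<TAB>1' record per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer logic: records arrive sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run(step, stdin=sys.stdin, stdout=sys.stdout):
    """Hypothetical helper: wire a step to the streams a Streaming task uses."""
    for record in step(stdin):
        stdout.write(record + "\n")
```

A real job would call run(map_lines) in the script given to -mapper and run(reduce_lines) in the script given to -reducer. Locally, the sort between the two phases can be imitated with sorted(map_lines(...)).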

My point is, why should I ever use the normal mode? Streaming seems
just as good. Is there a performance problem or do I have only limited
control over my job if I use the streaming mode or some other issue?

Thanks!
Gaurav
-- 
Share what you know, learn what you don't !


Random block placement

2008-08-12 Thread John DeTreville
My understanding is that HDFS places blocks randomly. As I would expect,
then, when I use "hadoop fsck" to look at block placements for my files,
I see that some nodes have more blocks than the average. I would expect
that these hot spots would cause a performance hit relative to a more
even placement of blocks.

I'd like to experiment with non-random block placement to see if this
can provide a performance improvement. Where in the code would I start
looking to find the existing code for random placement?
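
The hot-spot effect can be illustrated with a toy simulation. This is only a sketch under a simplifying assumption: each block lands on a uniformly random node, with no replication and no rack awareness, so it does not model HDFS's actual placement code. All function names here are hypothetical.

```python
import random
from collections import Counter


def random_placement(num_blocks, num_nodes, rng):
    """Assign each block to a uniformly random node; return per-node counts."""
    counts = Counter(rng.randrange(num_nodes) for _ in range(num_blocks))
    return [counts.get(node, 0) for node in range(num_nodes)]


def round_robin_placement(num_blocks, num_nodes):
    """Assign blocks to nodes in turn; counts differ by at most one."""
    base, extra = divmod(num_blocks, num_nodes)
    return [base + (1 if node < extra else 0) for node in range(num_nodes)]


def imbalance(counts):
    """Ratio of the busiest node's block count to the mean block count."""
    return max(counts) / (sum(counts) / len(counts))
```

Comparing imbalance(random_placement(1000, 10, random.Random(0))) with imbalance(round_robin_placement(1000, 10)) shows how far the busiest node sits above the mean in each scheme. The round-robin ratio is exactly 1.0 when the block count divides evenly.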

Cheers,
John


RE: "Join" example

2008-08-08 Thread John DeTreville
When I try the map-side join example (under Hadoop 0.17.1, running in
standalone mode under Win32), it attempts to dereference a null pointer.

$ cat One/some.txt
A   1
B   1
C   1
E   1
$ cat Two/some.txt
A   2
B   2
C   2
D   2
$ bin/hadoop jar *examples.jar join \
   -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
   -outKey org.apache.hadoop.io.Text \
   -joinOp outer One/some.txt Two/some.txt output
cygpath: cannot create short name of c:\Documents and Settings\jdd\Desktop\hadoop-0.17.1\logs
08/08/08 15:41:34 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
Job started: Fri Aug 08 15:41:34 PDT 2008
08/08/08 15:41:34 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
08/08/08 15:41:34 INFO mapred.FileInputFormat: Total input paths to process : 1
08/08/08 15:41:34 INFO mapred.FileInputFormat: Total input paths to process : 1
java.lang.NullPointerException
    at org.apache.hadoop.mapred.KeyValueTextInputFormat.isSplitable(KeyValueTextInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:247)
    at org.apache.hadoop.mapred.join.Parser$WNode.getSplits(Parser.java:305)
    at org.apache.hadoop.mapred.join.Parser$CNode.getSplits(Parser.java:375)
    at org.apache.hadoop.mapred.join.CompositeInputFormat.getSplits(CompositeInputFormat.java:129)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:712)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
    at org.apache.hadoop.examples.Join.run(Join.java:154)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.examples.Join.main(Join.java:163)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
$ 

I'll look around a little to see what the problem is. The attempt to
initialize the JVM metrics twice also seems suspicious.

Here's one other thing I don't understand. Suppose my directory One
contains some number of files, and directory Two contains the same
number, named the same and partitioned the same. If I give the directory
names One and Two to the example program, will it match up the files by
name for performing the join? I haven't found the code yet to do that,
although I'm imagining that perhaps that's what it does.

Cheers,
John


RE: "Join" example

2008-08-08 Thread John DeTreville
Thanks very much, Chris!

Cheers,
John

-----Original Message-----
From: Chris Douglas [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 08, 2008 1:57 PM
To: core-user@hadoop.apache.org
Subject: Re: "Join" example

The contrib/data_join framework is different from the map-side join  
framework, under o.a.h.mapred.join.

To see what the example is doing in an outer join, generate a few
sample text input files, tab-separated:

join/a.txt:

a0
a1
a2
a3

join/b.txt:

b0
b1
b2
b3

join/c.txt:

c0
c1
c2
c3

Run the example with each as an input:

host$ bin/hadoop jar hadoop-*-examples.jar join \
   -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
   -outKey org.apache.hadoop.io.Text \
   -joinOp outer \
   join/a.txt join/b.txt join/c.txt joinout

Examine the result in joinout/part-0:

host$ bin/hadoop fs -text joinout/part-0 | less
[a0,b0,c0]
[a1,b1,c1]
[a1,b2,c1]
[a1,b3,c1]
[a2,,]
[a3,,]
[,,c2]
[,,c3]
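
The shape of that output (one tuple per key, an empty slot where an input has no record for the key, and one tuple per combination when a key repeats) can be imitated with a small in-memory sketch. This illustrates outer-join semantics only; it is not how o.a.h.mapred.join is implemented, and the function name and the keys k0..k3 below are hypothetical.

```python
from itertools import product


def outer_join(*sources):
    """Outer-join lists of (key, value) pairs.

    For every key seen in any source, yield (key, values) once per
    combination of that key's values, with "" standing in for a source
    that has no record for the key.
    """
    tables = []
    for source in sources:
        table = {}
        for key, value in source:
            table.setdefault(key, []).append(value)
        tables.append(table)
    for key in sorted(set().union(*tables)):
        columns = [table.get(key, [""]) for table in tables]
        for combo in product(*columns):
            yield key, list(combo)


# Hypothetical keyed inputs, standing in for three tab-separated files.
a = [("k0", "a0"), ("k1", "a1"), ("k2", "a2")]
b = [("k0", "b0"), ("k1", "b1"), ("k1", "b2")]
c = [("k0", "c0"), ("k1", "c1"), ("k3", "c2")]
rows = [values for _, values in outer_join(a, b, c)]
```

With these inputs, key k1 contributes two rows because b holds two records for it, while k2 and k3 each contribute one row with empty slots.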

-C

On Aug 7, 2008, at 11:39 PM, Wei Wu wrote:

> There are some examples in $HADOOPHOME/src/contrib/data_join, which I
> hope would help.
>
> Wei
>
> -----Original Message-----
> From: John DeTreville [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 08, 2008 2:34 AM
> To: core-user@hadoop.apache.org
> Subject: "Join" example
>
> Hadoop ships with a few example programs. One of these is "join," which
> I believe demonstrates map-side joins. I'm finding its usage
> instructions a little impenetrable; could anyone send me instructions
> that are more like "type this" then "type this" then "type this"?
>
> Thanks in advance.
>
> Cheers,
> John
>



"Join" example

2008-08-07 Thread John DeTreville
Hadoop ships with a few example programs. One of these is "join," which
I believe demonstrates map-side joins. I'm finding its usage
instructions a little impenetrable; could anyone send me instructions
that are more like "type this" then "type this" then "type this"?

Thanks in advance.

Cheers,
John