[ANN] hbase-0.2.0 release

2008-08-08 Thread stack
The HBase 0.2.0 release includes 291 changes [1]. New features include a
richer API, a new ruby irb-based shell, an improved UI, and many
improvements to overall stability.  To download, visit [4].

HBase 0.2.0 is not backward compatible with the HBase 0.1 API (see [2] for an
overview of the changes). To migrate your 0.1-era HBase data to 0.2, see the
Migration Guide [3].
HBase 0.2.0 runs on Hadoop 0.17.x. To run 0.2.0 on Hadoop 0.18.x, replace
the Hadoop 0.17.1 jars under $HBASE_HOME/lib with their 0.18.x equivalents
and then recompile.
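
Roughly, the jar swap looks like this (a sketch only; the exact jar names
under lib/ and the build target are assumptions, so check your checkout):

# Swap the bundled Hadoop 0.17.1 jars for 0.18.x ones, then rebuild HBase.
cd "$HBASE_HOME"
rm lib/hadoop-0.17.1-*.jar                       # assumed jar naming
cp /path/to/hadoop-0.18.x/hadoop-*-core.jar lib/ # assumed 0.18.x build location
ant jar                                          # recompile (target name may differ)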

Thanks to all who contributed to this release.

Yours,
The HBase Team

1.
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=12310753&fixfor=12312955
2. http://wiki.apache.org/hadoop/Hbase/Plan-0.2/APIChanges
3. http://wiki.apache.org/hadoop/Hbase/HowToMigrate
4. http://www.apache.org/dyn/closer.cgi/hadoop/hbase/


Hadoop problem

2008-08-08 Thread Mr.Thien
Hi everybody,
When I was running Hadoop 0.17.1 it gave me some WARNs like this:

2008-08-09 10:53:37,728 WARN org.apache.hadoop.dfs.StateChange: DIR*
FSDirectory.unprotectedDelete: failed to
remove /tmp/hadoop-thientd/mapred/system because it does not exist
2008-08-09 10:56:05,836 WARN org.apache.hadoop.dfs.StateChange: DIR*
FSDirectory.unprotectedDelete: failed to
remove /tmp/hadoop-thientd/mapred/system/job_200808091053_0001 because
it does not exist

It still runs, but I don't know why it logs WARNs like the above and why it
can't create the system folder in mapred.

Could anyone tell me the possible reason for the problem?
Thanks in advance.
thientd



RE: java.io.IOException: Could not get block locations. Aborting...

2008-08-08 Thread Koji Noguchi
If restarting the entire dfs helped, then you might be hitting 
http://issues.apache.org/jira/browse/HADOOP-3633

When we were running 0.17.1, I had to grep for OutOfMemory in the
datanode ".out" files at least once a day and restart those zombie
datanodes.

Once a datanode gets into this state, as Konstantin mentioned in the JIRA,

" it appears to happily sending heartbeats, but in fact cannot
do any data processing because the server thread is dead."

Koji
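
A sketch of that kind of check, assuming the datanode .out files sit under
$HADOOP_HOME/logs and that hadoop-daemon.sh manages the local datanode (both
assumptions about a typical 0.17 install):

# Look for OutOfMemory in the datanode .out files and bounce the daemon if found.
HADOOP_HOME=${HADOOP_HOME:-/usr/local/hadoop}
if grep -q OutOfMemory "$HADOOP_HOME"/logs/*-datanode-*.out 2>/dev/null; then
  "$HADOOP_HOME"/bin/hadoop-daemon.sh stop datanode
  "$HADOOP_HOME"/bin/hadoop-daemon.sh start datanode
fi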

-Original Message-
From: Piotr Kozikowski [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 08, 2008 5:42 PM
To: core-user@hadoop.apache.org
Subject: Re: java.io.IOException: Could not get block locations.
Aborting...

Thank you for the reply. Apparently whatever it was is now gone after a
hadoop restart, but I'll keep that in mind should it happen again.

Piotr

On Fri, 2008-08-08 at 17:31 -0700, Dhruba Borthakur wrote:
> It is possible that your namenode is overloaded and is not able to
> respond to RPC requests from clients. Please check the namenode logs
> to see if you see lines of the form "discarding calls...".
> 
> dhrua
> 
> On Fri, Aug 8, 2008 at 3:41 AM, Alexander Aristov
> <[EMAIL PROTECTED]> wrote:
> > I come across the same issue and also with hadoop 0.17.1
> >
> > would be interesting if someone say the cause of the issue.
> >
> > Alex
> >
> > 2008/8/8 Steve Loughran <[EMAIL PROTECTED]>
> >
> >> Piotr Kozikowski wrote:
> >>
> >>> Hi there:
> >>>
> >>> We would like to know what are the most likely causes of this sort of
> >>> error:
> >>>
> >>> Exception closing
> >>> file
> >>> /data1/hdfs/tmp/person_url_pipe_59984_3405334/_temporary/_task_200807311534_0055_m_22_0/part-00022
> >>> java.io.IOException: Could not get block locations. Aborting...
> >>>at
> >>> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2080)
> >>>at
> >>> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
> >>>at
> >>> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)
> >>>
> >>> Our map-reduce job does not fail completely but over 50% of the map tasks
> >>> fail with this same error.
> >>> We recently migrated our cluster from 0.16.4 to 0.17.1, previously we
> >>> didn't have this problem using the same input data in a similar map-reduce
> >>> job
> >>>
> >>> Thank you,
> >>>
> >>> Piotr
> >>>
> >>>
> >> When I see this, its because the filesystem isnt completely up: there are
> >> no locations for a specific file, meaning the client isn't getting back the
> >> names of any datanodes holding the data from the name nodes.
> >>
> >> I've got a patch in JIRA that prints out the name of the file in question,
> >> as that could be useful.
> >>
> >
> >
> >
> > --
> > Best Regards
> > Alexander Aristov
> >



Re: Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-08 Thread Lucas Nazário dos Santos
Thanks Andreas. I'll try it.


On Fri, Aug 8, 2008 at 5:47 PM, Andreas Kostyrka <[EMAIL PROTECTED]>wrote:

> On Friday 08 August 2008 15:43:46 Lucas Nazário dos Santos wrote:
> > You are completely right. It's not safe at all. But this is what I have
> for
> > now:
> > two computers distributed across the Internet. I would really appreciate
> if
> > anyone could give me spark on how to configure the namenode's IP in a
> > datanode. As I could identify in log files, the datanode keeps trying to
> > connect
> > to the IP 10.1.1.5, which is the internal IP of the namenode. I just
> need a
> > way
> > to say to the datanode "Hey, could you instead connect to the IP
> 172.1.23.2
> > "?
>
> Your only bet is to set it up in a VPNed environment. That would make it
> securitywise okay too.
>
> Andreas
>
> >
> > Lucas
> >
> > On Fri, Aug 8, 2008 at 10:25 AM, Lukáš Vlček <[EMAIL PROTECTED]>
> wrote:
> > > HI,
> > >
> > > I am not an expert on Hadoop configuration but is this safe? As far as
> I
> > > understand the IP address is public and connection to the datanode port
> > > is not secured. Am I correct?
> > >
> > > Lukas
> > >
> > > On Fri, Aug 8, 2008 at 8:35 AM, Lucas Nazário dos Santos <
> > >
> > > [EMAIL PROTECTED]> wrote:
> > > > Hello again,
> > > >
> > > > In fact I can get the cluster up and running with two nodes in
> > > > different LANs. The problem appears when executing a job.
> > > >
> > > > As you can see in the piece of log bellow, the datanode tries to
> > >
> > > comunicate
> > >
> > > > with the namenode using the IP 10.1.1.5. The issue is that the
> datanode
> > > > should be using a valid IP, and not 10.1.1.5.
> > > >
> > > > Is there a way of manually configuring the datanode with the
> namenode's
> > >
> > > IP,
> > >
> > > > so I can change from 10.1.1.5 to, say 189.11.131.172?
> > > >
> > > > Thanks,
> > > > Lucas
> > > >
> > > >
> > > > 2008-08-08 02:34:23,335 INFO org.apache.hadoop.mapred.TaskTracker:
> > > > TaskTracker up at: localhost/127.0.0.1:60394
> > > > 2008-08-08 02:34:23,335 INFO org.apache.hadoop.mapred.TaskTracker: Starting
> > > > tracker tracker_localhost:localhost/127.0.0.1:60394
> > > > 2008-08-08 02:34:23,589 INFO org.apache.hadoop.mapred.TaskTracker: Starting
> > > > thread: Map-events fetcher for all reduce tasks on
> > > > tracker_localhost:localhost/127.0.0.1:60394
> > > > 2008-08-08 03:06:43,239 INFO org.apache.hadoop.mapred.TaskTracker:
> > > > LaunchTaskAction: task_200808080234_0001_m_00_0
> > > > 2008-08-08 03:07:43,989 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > > to server: /10.1.1.5:9000. Already tried 1 time(s).
> > > > 2008-08-08 03:08:44,999 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > > to server: /10.1.1.5:9000. Already tried 2 time(s).
> > > > 2008-08-08 03:09:45,999 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > > to server: /10.1.1.5:9000. Already tried 3 time(s).
> > > > 2008-08-08 03:10:47,009 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > > to server: /10.1.1.5:9000. Already tried 4 time(s).
> > > > 2008-08-08 03:11:48,009 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > > to server: /10.1.1.5:9000. Already tried 5 time(s).
> > > > 2008-08-08 03:12:49,026 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > > to server: /10.1.1.5:9000. Already tried 6 time(s).
> > > > 2008-08-08 03:13:50,036 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > > to server: /10.1.1.5:9000. Already tried 7 time(s).
> > > > 2008-08-08 03:14:51,046 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > > to server: /10.1.1.5:9000. Already tried 8 time(s).
> > > > 2008-08-08 03:15:52,056 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > > to server: /10.1.1.5:9000. Already tried 9 time(s).
> > > > 2008-08-08 03:16:53,066 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > > to server: /10.1.1.5:9000. Already tried 10 time(s).
> > > > 2008-08-08 03:17:54,077 WARN org.apache.hadoop.mapred.TaskTracker:
> > > > Error initializing task_200808080234_0001_m_00_0:
> > > > java.net.SocketTimeoutException
> > > >at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:109)
> > > >at
> > > > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:174)
> > > >at org.apache.hadoop.ipc.Client.getConnection(Client.java:623)
> > > >at org.apache.hadoop.ipc.Client.call(Client.java:546)
> > > >at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
> > > >at org.apache.hadoop.dfs.$Proxy5.getProtocolVersion(Unknown Source)
> > > >at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
> > > >at
> > > > org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
> > > >at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:

Re: java.io.IOException: Could not get block locations. Aborting...

2008-08-08 Thread Piotr Kozikowski
Thank you for the reply. Apparently whatever it was is now gone after a
hadoop restart, but I'll keep that in mind should it happen again.

Piotr

On Fri, 2008-08-08 at 17:31 -0700, Dhruba Borthakur wrote:
> It is possible that your namenode is overloaded and is not able to
> respond to RPC requests from clients. Please check the namenode logs
> to see if you see lines of the form "discarding calls...".
> 
> dhrua
> 
> On Fri, Aug 8, 2008 at 3:41 AM, Alexander Aristov
> <[EMAIL PROTECTED]> wrote:
> > I come across the same issue and also with hadoop 0.17.1
> >
> > would be interesting if someone say the cause of the issue.
> >
> > Alex
> >
> > 2008/8/8 Steve Loughran <[EMAIL PROTECTED]>
> >
> >> Piotr Kozikowski wrote:
> >>
> >>> Hi there:
> >>>
> >>> We would like to know what are the most likely causes of this sort of
> >>> error:
> >>>
> >>> Exception closing
> >>> file
> >>> /data1/hdfs/tmp/person_url_pipe_59984_3405334/_temporary/_task_200807311534_0055_m_22_0/part-00022
> >>> java.io.IOException: Could not get block locations. Aborting...
> >>>at
> >>> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2080)
> >>>at
> >>> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
> >>>at
> >>> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)
> >>>
> >>> Our map-reduce job does not fail completely but over 50% of the map tasks
> >>> fail with this same error.
> >>> We recently migrated our cluster from 0.16.4 to 0.17.1, previously we
> >>> didn't have this problem using the same input data in a similar map-reduce
> >>> job
> >>>
> >>> Thank you,
> >>>
> >>> Piotr
> >>>
> >>>
> >> When I see this, its because the filesystem isnt completely up: there are
> >> no locations for a specific file, meaning the client isn't getting back the
> >> names of any datanodes holding the data from the name nodes.
> >>
> >> I've got a patch in JIRA that prints out the name of the file in question,
> >> as that could be useful.
> >>
> >
> >
> >
> > --
> > Best Regards
> > Alexander Aristov
> >



Re: java.io.IOException: Could not get block locations. Aborting...

2008-08-08 Thread Dhruba Borthakur
It is possible that your namenode is overloaded and is not able to
respond to RPC requests from clients. Please check the namenode logs
to see if you see lines of the form "discarding calls...".

dhrua
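
A quick way to look for that (sketch only; the namenode log location and file
name pattern are assumptions about a default install):

# Search the namenode log for the "discarding calls" lines mentioned above.
grep -i "discarding calls" "$HADOOP_HOME"/logs/hadoop-*-namenode-*.log* | tail -20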

On Fri, Aug 8, 2008 at 3:41 AM, Alexander Aristov
<[EMAIL PROTECTED]> wrote:
> I come across the same issue and also with hadoop 0.17.1
>
> would be interesting if someone say the cause of the issue.
>
> Alex
>
> 2008/8/8 Steve Loughran <[EMAIL PROTECTED]>
>
>> Piotr Kozikowski wrote:
>>
>>> Hi there:
>>>
>>> We would like to know what are the most likely causes of this sort of
>>> error:
>>>
>>> Exception closing
>>> file
>>> /data1/hdfs/tmp/person_url_pipe_59984_3405334/_temporary/_task_200807311534_0055_m_22_0/part-00022
>>> java.io.IOException: Could not get block locations. Aborting...
>>>at
>>> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2080)
>>>at
>>> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
>>>at
>>> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)
>>>
>>> Our map-reduce job does not fail completely but over 50% of the map tasks
>>> fail with this same error.
>>> We recently migrated our cluster from 0.16.4 to 0.17.1, previously we
>>> didn't have this problem using the same input data in a similar map-reduce
>>> job
>>>
>>> Thank you,
>>>
>>> Piotr
>>>
>>>
>> When I see this, its because the filesystem isnt completely up: there are
>> no locations for a specific file, meaning the client isn't getting back the
>> names of any datanodes holding the data from the name nodes.
>>
>> I've got a patch in JIRA that prints out the name of the file in question,
>> as that could be useful.
>>
>
>
>
> --
> Best Regards
> Alexander Aristov
>


Re: performance not great, or did I miss something?

2008-08-08 Thread Allen Wittenauer
On 8/8/08 1:25 PM, "James Graham (Greywolf)" <[EMAIL PROTECTED]> wrote:
> 226GB of available disk space on each one;
> 4 processors (2 x dualcore)
> 8GB of RAM each.

Some simple stuff:

(Assuming SATA):
Are you using AHCI?
Do you have the write cache enabled?

Is the topologyProgram providing proper results?
Is DNS performing as expected? Is it fast?
How many tasks per node?
How much heap does your name node have?  Is it going into garbage collection
or swapping?
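
A few of these can be checked from a shell; a rough sketch, assuming Linux, a
SATA disk visible as /dev/sda, and jps/jstat available for the namenode JVM
(all assumptions):

hdparm -W /dev/sda                   # is the disk write cache enabled?
time getent hosts "$(hostname -f)"   # is DNS resolution fast?
vmstat 1 5                           # is the box swapping?
NN_PID=$(jps | awk '$2=="NameNode" {print $1}')
[ -n "$NN_PID" ] && jstat -gcutil "$NN_PID" 1000 5   # is the namenode busy with GC?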



RE: "Join" example

2008-08-08 Thread John DeTreville
When I try the map-side join example (under Hadoop 0.17.1, running in
standalone mode under Win32), it attempts to dereference a null pointer.

$ cat One/some.txt
A   1
B   1
C   1
E   1
$ cat Two/some.txt
A   2
B   2
C   2
D   2
$ bin/hadoop jar *examples.jar join -inFormat
org.apache.hadoop.mapred.KeyValueTextInputFormat -outKey
org.apache.hadoop.io.Text -joinOp outer One/some.txt Two/some.txt output
cygpath: cannot create short name of c:\Documents and
Settings\jdd\Desktop\hadoop-0.17.1\logs
08/08/08 15:41:34 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
Job started: Fri Aug 08 15:41:34 PDT 2008
08/08/08 15:41:34 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics
with processName=JobTracker, sessionId= - already initialized
08/08/08 15:41:34 INFO mapred.FileInputFormat: Total input paths to
process : 1
08/08/08 15:41:34 INFO mapred.FileInputFormat: Total input paths to
process : 1
java.lang.NullPointerException
        at org.apache.hadoop.mapred.KeyValueTextInputFormat.isSplitable(KeyValueTextInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:247)
        at org.apache.hadoop.mapred.join.Parser$WNode.getSplits(Parser.java:305)
        at org.apache.hadoop.mapred.join.Parser$CNode.getSplits(Parser.java:375)
        at org.apache.hadoop.mapred.join.CompositeInputFormat.getSplits(CompositeInputFormat.java:129)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:712)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
        at org.apache.hadoop.examples.Join.run(Join.java:154)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.examples.Join.main(Join.java:163)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:53)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
$ 

I'll look around a little to see what the problem is. The attempt to
initialize the JVM metrics twice also seems suspicious.

Here's one other thing I don't understand. Suppose my directory One
contains some number of files, and directory Two contains the same
number, named the same and partitioned the same. If I give the directory
names One and Two to the example program, will it match up the files by
name for performing the join? I haven't found the code yet to do that,
although I'm imagining that perhaps that's what it does.

Cheers,
John

-Original Message-
From: Chris Douglas [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 08, 2008 1:57 PM
To: core-user@hadoop.apache.org
Subject: Re: "Join" example

The contrib/data_join framework is different from the map-side join  
framework, under o.a.h.mapred.join.

To see what the example is doing in an outer join, generate a few  
sample, text input files, tab-separated:

join/a.txt:

a0
a1
a2
a3

join/b.txt:

b0
b1
b2
b3

join/c.txt:

c0
c1
c2
c3

Run the example with each as an input:

host$ bin/hadoop jar hadoop-*-examples.jar join \
   -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
   -outKey org.apache.hadoop.io.Text \
   -joinOp outer \
   join/a.txt join/b.txt join/c.txt joinout

Examine the result in joinout/part-0:

host$ bin/hadoop fs -text joinout/part-0 | less
[a0,b0,c0]
[a1,b1,c1]
[a1,b2,c1]
[a1,b3,c1]
[a2,,]
[a3,,]
[,,c2]
[,,c3]

-C

On Aug 7, 2008, at 11:39 PM, Wei Wu wrote:

> There are some examples in $HADOOPHOME/src/contrib/data_join, which  
> I hope
> would help.
>
> Wei
>
> -Original Message-
> From: John DeTreville [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 08, 2008 2:34 AM
> To: core

Re: namenode & jobtracker: joint or separate, which is better?

2008-08-08 Thread James Graham (Greywolf)

Thus spake lohit::
It depends on your machine configuration, how much resource it has and
what you can afford to lose in case of failures.
It would be good to run the NameNode and JobTracker on their own dedicated
nodes, and the datanodes and tasktrackers on the rest of the nodes. We have
seen cases where tasktrackers take down nodes because of malicious programs;
in such cases you do not want your JobTracker or NameNode to be on those nodes.
Also, running multiple JVMs might slow down the node and your process. I
would recommend you run at least the NameNode on a dedicated node.

Thanks,
Lohit


Good to know; thank you.

--
James Graham (Greywolf)   |
650.930.1138|925.768.4053 *
[EMAIL PROTECTED] |
Check out what people are saying about SearchMe! -- click below
http://www.searchme.com/stack/109aa


Re: How to enable compression of blockfiles?

2008-08-08 Thread lohit
I think at present only SequenceFiles can be compressed. 
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html
If you have plain text files, they are stored as is into blocks. You can store 
them as .gz and Hadoop recognizes and processes the gz files, but they are not 
splittable, meaning each map will consume a whole .gz file.
Thanks,
Lohit
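
For illustration, the .gz route looks roughly like this (file and path names
are made up):

gzip -9 access_log                          # hypothetical local text file
bin/hadoop dfs -put access_log.gz /logs/    # stored compressed, read transparently
# note: the whole .gz goes to a single map task, since it cannot be split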



- Original Message 
From: Michael K. Tung <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Friday, August 8, 2008 1:09:01 PM
Subject: How to enable compression of blockfiles?

Hello, I have a simple question.  How do I configure DFS to store compressed
block files?  I've noticed by looking at the "blk_" files that the text
documents I am storing are uncompressed.  Currently our hadoop deployment is
taking up 10x the diskspace as compared to our system before moving to
hadoop. I've tried modifying the io.seqfile.compress.blocksize option
without success and haven't been able to find anything online regarding
this.  Is there any way to do this or do I need to manually compress my data
before storing to HDFS?

Thanks,

Michael Tung


Re: namenode & jobtracker: joint or separate, which is better?

2008-08-08 Thread lohit
It depends on your machine configuration, how much resource it has and what you 
can afford to lose in case of failures. 
It would be good to run the NameNode and JobTracker on their own dedicated nodes, 
and the datanodes and tasktrackers on the rest of the nodes. We have seen cases 
where tasktrackers take down nodes because of malicious programs; in such cases 
you do not want your JobTracker or NameNode to be on those nodes. 
Also, running multiple JVMs might slow down the node and your process. I would 
recommend you run at least the NameNode on a dedicated node.
Thanks,
Lohit



- Original Message 
From: James Graham (Greywolf) <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Friday, August 8, 2008 1:29:08 PM
Subject: namenode & jobtracker: joint or separate, which is better?

Which is better, to have the namenode and jobtracker as distinct nodes
or as a single node, and are there pros/cons regarding using either or
both as datanodes?
-- 
James Graham (Greywolf)  |
650.930.1138|925.768.4053  *
[EMAIL PROTECTED]  |
Check out what people are saying about SearchMe! -- click below
http://www.searchme.com/stack/109aa



Re: what is the correct usage of hdfs metrics

2008-08-08 Thread lohit
I have tried to connect to it via jconsole.
Apart from that, I have seen people on this list use Ganglia to collect metrics, 
or just dump them to a file. 
To start off you could easily use FileContext (dumping metrics to a file). Check 
out the metrics config file (hadoop-metrics.properties) under the conf directory.  
Specify a file name and period to monitor the metrics.
Thanks,
Lohit
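
A minimal FileContext setup along those lines might look like this (the output
path and 10-second period below are just examples):

# Append a FileContext configuration for the dfs metrics context.
cat >> "$HADOOP_HOME"/conf/hadoop-metrics.properties <<'EOF'
dfs.class=org.apache.hadoop.metrics.file.FileContext
dfs.period=10
dfs.fileName=/tmp/dfsmetrics.log
EOF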



- Original Message 
From: Ivan Georgiev <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Friday, August 8, 2008 4:39:36 AM
Subject: what is the correct usage of hdfs metrics

Hi,

I have been unable to find any examples on how to use the MBeans 
provided from HDFS.
Could anyone that has any experience on the topic share some info.
What is the URL to use to connect to the MBeanServer ?
Is it done through rmi, or only through jvm ?

Any help is highly appreciated.

Please cc me as i am not a member of the list.

Regards:
Ivan



RE: "Join" example

2008-08-08 Thread John DeTreville
Thanks very much, Chris!

Cheers,
John

-Original Message-
From: Chris Douglas [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 08, 2008 1:57 PM
To: core-user@hadoop.apache.org
Subject: Re: "Join" example

The contrib/data_join framework is different from the map-side join  
framework, under o.a.h.mapred.join.

To see what the example is doing in an outer join, generate a few  
sample, text input files, tab-separated:

join/a.txt:

a0
a1
a2
a3

join/b.txt:

b0
b1
b2
b3

join/c.txt:

c0
c1
c2
c3

Run the example with each as an input:

host$ bin/hadoop jar hadoop-*-examples.jar join \
   -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
   -outKey org.apache.hadoop.io.Text \
   -joinOp outer \
   join/a.txt join/b.txt join/c.txt joinout

Examine the result in joinout/part-0:

host$ bin/hadoop fs -text joinout/part-0 | less
[a0,b0,c0]
[a1,b1,c1]
[a1,b2,c1]
[a1,b3,c1]
[a2,,]
[a3,,]
[,,c2]
[,,c3]

-C

On Aug 7, 2008, at 11:39 PM, Wei Wu wrote:

> There are some examples in $HADOOPHOME/src/contrib/data_join, which  
> I hope
> would help.
>
> Wei
>
> -Original Message-
> From: John DeTreville [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 08, 2008 2:34 AM
> To: core-user@hadoop.apache.org
> Subject: "Join" example
>
> Hadoop ships with a few example programs. One of these is "join,"  
> which
> I believe demonstrates map-side joins. I'm finding its usage
> instructions a little impenetrable; could anyone send me instructions
> that are more like "type this" then "type this" then "type this"?
>
> Thanks in advance.
>
> Cheers,
> John
>



Re: "Join" example

2008-08-08 Thread Chris Douglas
The contrib/data_join framework is different from the map-side join  
framework, under o.a.h.mapred.join.


To see what the example is doing in an outer join, generate a few  
sample, text input files, tab-separated:


join/a.txt:

a0
a1
a2
a3

join/b.txt:

b0
b1
b2
b3

join/c.txt:

c0
c1
c2
c3

Run the example with each as an input:

host$ bin/hadoop jar hadoop-*-examples.jar join \
  -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
  -outKey org.apache.hadoop.io.Text \
  -joinOp outer \
  join/a.txt join/b.txt join/c.txt joinout

Examine the result in joinout/part-0:

host$ bin/hadoop fs -text joinout/part-0 | less
[a0,b0,c0]
[a1,b1,c1]
[a1,b2,c1]
[a1,b3,c1]
[a2,,]
[a3,,]
[,,c2]
[,,c3]

-C

On Aug 7, 2008, at 11:39 PM, Wei Wu wrote:

There are some examples in $HADOOPHOME/src/contrib/data_join, which  
I hope

would help.

Wei

-Original Message-
From: John DeTreville [mailto:[EMAIL PROTECTED]
Sent: Friday, August 08, 2008 2:34 AM
To: core-user@hadoop.apache.org
Subject: "Join" example

Hadoop ships with a few example programs. One of these is "join,"  
which

I believe demonstrates map-side joins. I'm finding its usage
instructions a little impenetrable; could anyone send me instructions
that are more like "type this" then "type this" then "type this"?

Thanks in advance.

Cheers,
John





Re: Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-08 Thread Andreas Kostyrka
On Friday 08 August 2008 15:43:46 Lucas Nazário dos Santos wrote:
> You are completely right. It's not safe at all. But this is what I have for
> now:
> two computers distributed across the Internet. I would really appreciate if
> anyone could give me spark on how to configure the namenode's IP in a
> datanode. As I could identify in log files, the datanode keeps trying to
> connect
> to the IP 10.1.1.5, which is the internal IP of the namenode. I just need a
> way
> to say to the datanode "Hey, could you instead connect to the IP 172.1.23.2
> "?

Your only bet is to set it up in a VPNed environment. That would make it 
securitywise okay too.

Andreas

>
> Lucas
>
> On Fri, Aug 8, 2008 at 10:25 AM, Lukáš Vlček <[EMAIL PROTECTED]> wrote:
> > HI,
> >
> > I am not an expert on Hadoop configuration but is this safe? As far as I
> > understand the IP address is public and connection to the datanode port
> > is not secured. Am I correct?
> >
> > Lukas
> >
> > On Fri, Aug 8, 2008 at 8:35 AM, Lucas Nazário dos Santos <
> >
> > [EMAIL PROTECTED]> wrote:
> > > Hello again,
> > >
> > > In fact I can get the cluster up and running with two nodes in
> > > different LANs. The problem appears when executing a job.
> > >
> > > As you can see in the piece of log bellow, the datanode tries to
> >
> > comunicate
> >
> > > with the namenode using the IP 10.1.1.5. The issue is that the datanode
> > > should be using a valid IP, and not 10.1.1.5.
> > >
> > > Is there a way of manually configuring the datanode with the namenode's
> >
> > IP,
> >
> > > so I can change from 10.1.1.5 to, say 189.11.131.172?
> > >
> > > Thanks,
> > > Lucas
> > >
> > >
> > > 2008-08-08 02:34:23,335 INFO org.apache.hadoop.mapred.TaskTracker:
> > > TaskTracker up at: localhost/127.0.0.1:60394
> > > 2008-08-08 02:34:23,335 INFO org.apache.hadoop.mapred.TaskTracker: Starting
> > > tracker tracker_localhost:localhost/127.0.0.1:60394
> > > 2008-08-08 02:34:23,589 INFO org.apache.hadoop.mapred.TaskTracker: Starting
> > > thread: Map-events fetcher for all reduce tasks on
> > > tracker_localhost:localhost/127.0.0.1:60394
> > > 2008-08-08 03:06:43,239 INFO org.apache.hadoop.mapred.TaskTracker:
> > > LaunchTaskAction: task_200808080234_0001_m_00_0
> > > 2008-08-08 03:07:43,989 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > to server: /10.1.1.5:9000. Already tried 1 time(s).
> > > 2008-08-08 03:08:44,999 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > to server: /10.1.1.5:9000. Already tried 2 time(s).
> > > 2008-08-08 03:09:45,999 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > to server: /10.1.1.5:9000. Already tried 3 time(s).
> > > 2008-08-08 03:10:47,009 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > to server: /10.1.1.5:9000. Already tried 4 time(s).
> > > 2008-08-08 03:11:48,009 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > to server: /10.1.1.5:9000. Already tried 5 time(s).
> > > 2008-08-08 03:12:49,026 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > to server: /10.1.1.5:9000. Already tried 6 time(s).
> > > 2008-08-08 03:13:50,036 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > to server: /10.1.1.5:9000. Already tried 7 time(s).
> > > 2008-08-08 03:14:51,046 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > to server: /10.1.1.5:9000. Already tried 8 time(s).
> > > 2008-08-08 03:15:52,056 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > to server: /10.1.1.5:9000. Already tried 9 time(s).
> > > 2008-08-08 03:16:53,066 INFO org.apache.hadoop.ipc.Client: Retrying connect
> > > to server: /10.1.1.5:9000. Already tried 10 time(s).
> > > 2008-08-08 03:17:54,077 WARN org.apache.hadoop.mapred.TaskTracker:
> > > Error initializing task_200808080234_0001_m_00_0:
> > > java.net.SocketTimeoutException
> > >at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:109)
> > >at
> > > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:174)
> > >at org.apache.hadoop.ipc.Client.getConnection(Client.java:623)
> > >at org.apache.hadoop.ipc.Client.call(Client.java:546)
> > >at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
> > >at org.apache.hadoop.dfs.$Proxy5.getProtocolVersion(Unknown Source)
> > >at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
> > >at
> > > org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
> > >at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:178)
> > >at
> > > org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:68)
> > >at
> > > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1280)
> > >at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
> > >at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291)
> > >at org.apache.hadoop.fs.F

namenode & jobtracker: joint or separate, which is better?

2008-08-08 Thread James Graham (Greywolf)

Which is better, to have the namenode and jobtracker as distinct nodes
or as a single node, and are there pros/cons regarding using either or
both as datanodes?
--
James Graham (Greywolf)   |
650.930.1138|925.768.4053 *
[EMAIL PROTECTED] |
Check out what people are saying about SearchMe! -- click below
http://www.searchme.com/stack/109aa


performance not great, or did I miss something?

2008-08-08 Thread James Graham (Greywolf)

Greetings,

I'm very very new to this (as you could probably tell from my other postings).

I have 20 nodes available as a cluster, less one as the namenode and one as
the jobtracker (unless I can use them too).  Specs are:

226GB of available disk space on each one;
4 processors (2 x dualcore)
8GB of RAM each.

The RandomWriter takes just over 17 minutes to complete;
the Sorter takes well over three to four hours or more to complete
on only about a half terabyte of data.

This is certainly not the speed or power I had been led to expect from
Hadoop, so I am guessing I have some things tuned wrong (actually, I'm
certain some are tuned wrong as during the reduce phase, I'm seeing processes
die from lack of memory...).

Given the above hardware specs, what should I expect as a theoretical maximum
throughput?  machines 3-10 are on 1GbE, machines 11-20 are on a second 1GbE,
connected by a mutual 1GbE upstream (another switch).



--
James Graham (Greywolf)   |
650.930.1138|925.768.4053 *
[EMAIL PROTECTED] |
Check out what people are saying about SearchMe! -- click below
http://www.searchme.com/stack/109aa


How to enable compression of blockfiles?

2008-08-08 Thread Michael K. Tung
Hello, I have a simple question.  How do I configure DFS to store compressed
block files?  I've noticed by looking at the "blk_" files that the text
documents I am storing are uncompressed.  Currently our hadoop deployment is
taking up 10x the diskspace as compared to our system before moving to
hadoop. I've tried modifying the io.seqfile.compress.blocksize option
without success and haven't been able to find anything online regarding
this.  Is there any way to do this or do I need to manually compress my data
before storing to HDFS?

Thanks,

Michael Tung






Re: access jobconf in streaming job

2008-08-08 Thread Andreas Kostyrka
On Friday 08 August 2008 11:43:50 Rong-en Fan wrote:
> After looking into streaming source, the answer is via environment
> variables. For example, mapred.task.timeout is in
> the mapred_task_timeout environment variable.

Well, another typical way to deal with that is to pass the parameters via 
cmdline.

I personally ended up stuffing all our configuration that is related to the 
environment into a config.ini file, that gets served via http, and I pass 
a -c http://host:port/config.ini parameter to all the jobs.

Configuration related to what I expect the job to do I still keep on the 
cmdline, e.g. the hadoop call looks something like this:

time $HADOOP_HOME/bin/hadoop jar 
$HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
   -mapper "/home/hadoop/bin/llfp --s3fetch -K $AWS_ACCESS_KEY_ID -S 
$AWS_SECRET_ACCESS_KEY --stderr -d vmregistry -d frontpage -d papi2 -d 
gen_dailysites -d fb_memberfind -c $CONFIGURL " \
   -reducer "/home/hadoop/bin/lrp --stderr -c $CONFIGURL" \
   -jobconf mapred.reduce.tasks=22 \
   -input /user/hadoop/run-$JOBNAME-input -output 
/user/hadoop/run-$JOBNAME-output || 
exit 1

In our case the separate .ini file makes sense because it describes the 
environment (e.g. http service urls, sql database connections, and so on) and 
is being used by other scripts that are not run inside hadoop.

Andreas

>
> On Fri, Aug 8, 2008 at 4:26 PM, Rong-en Fan <[EMAIL PROTECTED]> wrote:
> > I'm using streaming with a mapper written in perl. However, an
> > issue is that I want to pass some arguments via command line.
> > In regular Java mapper, I can access JobConf in Mapper.
> > Is there a way to do this?
> >
> > Thanks,
> > Rong-En Fan






How to set System property for my job

2008-08-08 Thread Tarandeep Singh
Hi,

While submitting a job to Hadoop, how can I set system properties that are
required by my code ?
Passing -Dmy.prop=myvalue to the hadoop job command is not going to work as
hadoop command will pass this to my program as command line argument.

Is there any way to achieve this ?

Thanks,
Taran



Re: extracting input to a task from a (streaming) job?

2008-08-08 Thread Yuri Pradkin
On Thursday 07 August 2008 16:43:10 John Heidemann wrote:
> On Thu, 07 Aug 2008 19:42:05 +0200, "Leon Mergen" wrote:
> >Hello John,
> >
> >On Thu, Aug 7, 2008 at 6:30 PM, John Heidemann <[EMAIL PROTECTED]> wrote:
> >> I have a large Hadoop streaming job that generally works fine,
> >> but a few (2-4) of the ~3000 maps and reduces have problems.
> >> To make matters worse, the problems are system-dependent (we run an a
> >> cluster with machines of slightly different OS versions).
> >> I'd of course like to debug these problems, but they are embedded in a
> >> large job.
> >>
> >> Is there a way to extract the input given to a reducer from a job, given
> >> the task identity?  (This would also be helpful for mappers.)
> >
> >I believe you should set "keep.failed.tasks.files" to true -- this way,
> > give a task id, you can see what input files it has in ~/
> >taskTracker/${taskid}/work (source:
> >http://hadoop.apache.org/core/docs/r0.17.0/mapred_tutorial.html#IsolationR
> >unner )

IsolationRunner does not work as described in the tutorial.  After the task 
hung, I failed it 
via the web interface.  Then I went to the node that was running this task

  $ cd ...local/taskTracker/jobcache/job_200808071645_0001/work
(this path is already different from the tutorial's)

  $ hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.hadoop.mapred.IsolationRunner.main(IsolationRunner.java:164)

Looking at IsolationRunner code, I see this:

164 File workDirName = new File(lDirAlloc.getLocalPathToRead(
165   TaskTracker.getJobCacheSubdir()
166   + Path.SEPARATOR + taskId.getJobID()
167   + Path.SEPARATOR + taskId
168   + Path.SEPARATOR + "work",
169   conf). toString());

I.e. it assumes there is supposed to be a taskID subdirectory under the job
dir, but:
 $ pwd
 ...mapred/local/taskTracker/jobcache/job_200808071645_0001
 $ ls
 jars  job.xml  work

-- it's not there.  Any suggestions?

Thanks,

  -Yuri




Re: fuse-dfs

2008-08-08 Thread Pete Wyckoff

Hi Sebastian.

Setting of times doesn't work, but ls, rm, rmdir, mkdir, cp, etc. should
work.

Things that are not currently supported include:

Touch, chown, chmod, permissions in general, and obviously random writes, for
which you would get an IO error.

This is what I get on 0.17 for df -h:

Filesystem            Size  Used Avail Use% Mounted on
fuse                  XXXT  YYYT  ZZZT  AA% /export/hdfs

and the #s are right.

There is no unit test for df though (doh!), so it's quite possible the
libhdfs API has changed and fuse-dfs needs to update its code to match the
API. I will check that.

To be honest, we run on 0.17.1, so other than unit tests, I never run on
0.19 :(

-- pete

Ps I created: https://issues.apache.org/jira/browse/HADOOP-3928 to track
this.






On 8/8/08 3:34 AM, "Sebastian Vieira" <[EMAIL PROTECTED]> wrote:

> Hi Pete,
> 
> From within the 0.19 source i did:
> 
> ant jar
> ant metrics.jar
> ant test-core
> 
> This resulted in 3 jar files within $HADOOP_HOME/build :
> 
> [EMAIL PROTECTED] hadoop-0.19]# ls -l build/*.jar
> -rw-r--r-- 1 root root 2201651 Aug  8 08:26 build/hadoop-0.19.0-dev-core.jar
> -rw-r--r-- 1 root root 1096699 Aug  8 08:29 build/hadoop-0.19.0-dev-test.jar
> -rw-r--r-- 1 root root   55695 Aug  8 08:26
> build/hadoop-metrics-0.19.0-dev.jar
> 
> I've added these to be included in the CLASSPATH within the wrapper script:
> 
> for f in `ls $HADOOP_HOME/build/*.jar`; do
> export CLASSPATH=$CLASSPATH:$f
> done
> 
> This still produced the same error, so (thanks to the more detailed error
> output your patch provided) i renamed hadoop-0.19.0-dev-core.jar to
> hadoop-core.jar to match the regexp.
> 
> Then i figured out that i can't use dfs://master:9000 becaus in
> hadoop-site.xml i specified that dfs should run on port 54310 (doh!). So i
> issued this command:
> 
> ./fuse_dfs_wrapper.sh dfs://master:54310 /mnt/hadoop -d
> 
> Succes! Even though the output from df -h is .. weird :
> 
> fuse  512M 0  512M   0% /mnt/hadoop
> 
> I added some data:
> 
> for x in `seq 1 25`;do
> dd if=/dev/zero of=/mnt/hadoop/test-$x.raw bs=1MB count=10
> done
> 
> And now the output from df -h is:
> 
> fuse  512M -3.4G  3.9G   -  /mnt/hadoop
> 
> Note that my HDFS setup now consists of 20 nodes, exporting 15G each, so df is
> a little confused. Hadoop's status page (dfshealth.jsp) correctly displays the
> output though, evenly dividing the blocks over all the nodes.
> 
> What i didn't understand however, is why there's no fuse-dfs in the
> downloadable tarballs. Am i looking in the wrong place perhaps?
> 
> Anyway, now that i got things mounted, i come upon the next problem. I can't
> do much else than dd :)
> 
> [EMAIL PROTECTED] fuse-dfs]# touch /mnt/hadoop/test.tst
> touch: setting times of `/mnt/hadoop/test.tst': Function not implemented
> 
> 
> regards,
> 
> Sebastian
> 




Re: Distributed Lucene - from hadoop contrib

2008-08-08 Thread Ning Li
> 1) Katta n Distributed Lucene are different projects though, right? Both
> being based on kind of the same paradigm (Distributed Index)?

The design of Katta and that of Distributed Lucene are quite different
last time I checked. I pointed out the Katta project because you can
find the code for Distributed Lucene there.

> 2) So, I should be able to use the hadoop.contrib.index with HDFS.
> Though, it would be much better if it is integrated with "Distributed
> Lucene" or the "Katta project" as these are designed keeping the
> structure and behavior of indexes in mind. Right?

As described in the README file, hadoop.contrib.index uses map/reduce
to build Lucene instances. It does not contain a component that serves
queries. If that's not sufficient for you, you can check out the
designs of Katta and Distributed Index and see which one suits your
use better.

Ning


Re: Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-08 Thread Lucas Nazário dos Santos
You are completely right. It's not safe at all. But this is what I have for
now: two computers distributed across the Internet. I would really appreciate
it if anyone could give me a pointer on how to configure the namenode's IP in
a datanode. As I can see in the log files, the datanode keeps trying to
connect to the IP 10.1.1.5, which is the internal IP of the namenode. I just
need a way to tell the datanode "Hey, could you instead connect to the IP
172.1.23.2"?

Lucas


On Fri, Aug 8, 2008 at 10:25 AM, Lukáš Vlček <[EMAIL PROTECTED]> wrote:

> HI,
>
> I am not an expert on Hadoop configuration but is this safe? As far as I
> understand the IP address is public and connection to the datanode port is
> not secured. Am I correct?
>
> Lukas
>
> On Fri, Aug 8, 2008 at 8:35 AM, Lucas Nazário dos Santos <
> [EMAIL PROTECTED]> wrote:
>
> > Hello again,
> >
> > In fact I can get the cluster up and running with two nodes in different
> > LANs. The problem appears when executing a job.
> >
> > As you can see in the piece of log bellow, the datanode tries to
> comunicate
> > with the namenode using the IP 10.1.1.5. The issue is that the datanode
> > should be using a valid IP, and not 10.1.1.5.
> >
> > Is there a way of manually configuring the datanode with the namenode's
> IP,
> > so I can change from 10.1.1.5 to, say 189.11.131.172?
> >
> > Thanks,
> > Lucas
> >
> >
> > 2008-08-08 02:34:23,335 INFO org.apache.hadoop.mapred.TaskTracker:
> > TaskTracker up at: localhost/127.0.0.1:60394
> > 2008-08-08 02:34:23,335 INFO org.apache.hadoop.mapred.TaskTracker:
> Starting
> > tracker tracker_localhost:localhost/127.0.0.1:60394
> > 2008-08-08 02:34:23,589 INFO org.apache.hadoop.mapred.TaskTracker:
> Starting
> > thread: Map-events fetcher for all reduce tasks on
> > tracker_localhost:localhost/127.0.0.1:60394
> > 2008-08-08 03:06:43,239 INFO org.apache.hadoop.mapred.TaskTracker:
> > LaunchTaskAction: task_200808080234_0001_m_00_0
> > 2008-08-08 03:07:43,989 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: /10.1.1.5:9000. Already tried 1 time(s).
> > 2008-08-08 03:08:44,999 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: /10.1.1.5:9000. Already tried 2 time(s).
> > 2008-08-08 03:09:45,999 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: /10.1.1.5:9000. Already tried 3 time(s).
> > 2008-08-08 03:10:47,009 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: /10.1.1.5:9000. Already tried 4 time(s).
> > 2008-08-08 03:11:48,009 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: /10.1.1.5:9000. Already tried 5 time(s).
> > 2008-08-08 03:12:49,026 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: /10.1.1.5:9000. Already tried 6 time(s).
> > 2008-08-08 03:13:50,036 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: /10.1.1.5:9000. Already tried 7 time(s).
> > 2008-08-08 03:14:51,046 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: /10.1.1.5:9000. Already tried 8 time(s).
> > 2008-08-08 03:15:52,056 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: /10.1.1.5:9000. Already tried 9 time(s).
> > 2008-08-08 03:16:53,066 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> > to server: /10.1.1.5:9000. Already tried 10 time(s).
> > 2008-08-08 03:17:54,077 WARN org.apache.hadoop.mapred.TaskTracker: Error
> > initializing task_200808080234_0001_m_00_0:
> > java.net.SocketTimeoutException
> >at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:109)
> >at
> > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:174)
> >at org.apache.hadoop.ipc.Client.getConnection(Client.java:623)
> >at org.apache.hadoop.ipc.Client.call(Client.java:546)
> >at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
> >at org.apache.hadoop.dfs.$Proxy5.getProtocolVersion(Unknown Source)
> >at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
> >at
> org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
>at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:178)
> >at
> >
> >
> org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:68)
> >at
> > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1280)
> >at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
> >at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291)
> >at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203)
> >at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:152)
> >at
> > org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:670)
> >at
> > org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1274)
> >at
> > org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:915)
> >at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1310)
> >at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.j

Re: Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-08 Thread Lukáš Vlček
HI,

I am not an expert on Hadoop configuration but is this safe? As far as I
understand the IP address is public and connection to the datanode port is
not secured. Am I correct?

Lukas

On Fri, Aug 8, 2008 at 8:35 AM, Lucas Nazário dos Santos <
[EMAIL PROTECTED]> wrote:

> Hello again,
>
> In fact I can get the cluster up and running with two nodes in different
> LANs. The problem appears when executing a job.
>
> As you can see in the piece of log bellow, the datanode tries to comunicate
> with the namenode using the IP 10.1.1.5. The issue is that the datanode
> should be using a valid IP, and not 10.1.1.5.
>
> Is there a way of manually configuring the datanode with the namenode's IP,
> so I can change from 10.1.1.5 to, say 189.11.131.172?
>
> Thanks,
> Lucas
>
>
> 2008-08-08 02:34:23,335 INFO org.apache.hadoop.mapred.TaskTracker:
> TaskTracker up at: localhost/127.0.0.1:60394
> 2008-08-08 02:34:23,335 INFO org.apache.hadoop.mapred.TaskTracker: Starting
> tracker tracker_localhost:localhost/127.0.0.1:60394
> 2008-08-08 02:34:23,589 INFO org.apache.hadoop.mapred.TaskTracker: Starting
> thread: Map-events fetcher for all reduce tasks on
> tracker_localhost:localhost/127.0.0.1:60394
> 2008-08-08 03:06:43,239 INFO org.apache.hadoop.mapred.TaskTracker:
> LaunchTaskAction: task_200808080234_0001_m_00_0
> 2008-08-08 03:07:43,989 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.1.1.5:9000. Already tried 1 time(s).
> 2008-08-08 03:08:44,999 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.1.1.5:9000. Already tried 2 time(s).
> 2008-08-08 03:09:45,999 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.1.1.5:9000. Already tried 3 time(s).
> 2008-08-08 03:10:47,009 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.1.1.5:9000. Already tried 4 time(s).
> 2008-08-08 03:11:48,009 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.1.1.5:9000. Already tried 5 time(s).
> 2008-08-08 03:12:49,026 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.1.1.5:9000. Already tried 6 time(s).
> 2008-08-08 03:13:50,036 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.1.1.5:9000. Already tried 7 time(s).
> 2008-08-08 03:14:51,046 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.1.1.5:9000. Already tried 8 time(s).
> 2008-08-08 03:15:52,056 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.1.1.5:9000. Already tried 9 time(s).
> 2008-08-08 03:16:53,066 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /10.1.1.5:9000. Already tried 10 time(s).
> 2008-08-08 03:17:54,077 WARN org.apache.hadoop.mapred.TaskTracker: Error
> initializing task_200808080234_0001_m_00_0:
> java.net.SocketTimeoutException
>at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:109)
>at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:174)
>at org.apache.hadoop.ipc.Client.getConnection(Client.java:623)
>at org.apache.hadoop.ipc.Client.call(Client.java:546)
>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
>at org.apache.hadoop.dfs.$Proxy5.getProtocolVersion(Unknown Source)
>at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
>at org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
>at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:178)
>at
>
> org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:68)
>at
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1280)
>at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
>at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291)
>at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203)
>at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:152)
>at
> org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:670)
>at
> org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1274)
>at
> org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:915)
>at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1310)
>at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2251)
>
>
>
> On Fri, Aug 8, 2008 at 12:16 AM, Lucas Nazário dos Santos <
> [EMAIL PROTECTED]> wrote:
>
> > Hello,
> >
> > Can someone point me out what are the extra tasks that need to be
> performed
> > in order to set up a cluster where nodes are spread over the Internet, in
> > different LANs?
> >
> > Do I need to free any datanode/namenode ports? How do I get the datanodes
> > to know the valid namenode IP, and not something like 10.1.1.1?
> >
> > Any help is appreciate.
> >
> > Lucas
> >
>



-- 
http://blog.lukas-vlcek.com/


Hadoop Pipes Job submission and JobId

2008-08-08 Thread Leon Mergen
Hello,

I was wondering what the correct way to submit a Job to hadoop using the
Pipes API is -- currently, I invoke a command similar to this:

/usr/local/hadoop/bin/hadoop pipes -conf
/usr/local/mapreduce/reports/reports.xml -input
/store/requests/archive/*/*/* -output out

However, this way of invoking the job has a few problems: it is a shell
command, and thus a bit awkward to embed this type of job submission in a
C++ program. Secondly, it would be awkward to retrieve the JobId from this
shell command since all its output would have to be properly parsed. And
last, it goes into a loop as long as the program is running, instead of
going to the background.

Now, in an ideal case, I would have some kind of HTTP URL or similar to which
I can submit jobs, and which in turn returns some data about the new job,
including the JobId.

I need the JobId to be able to match my system's task IDs with Hadoop
JobIds when the  is visited.

I was wondering whether these requirements can be met without having to
write a custom Java application, or whether the native Java API is the only
way to retrieve the JobId upon job submission.

Thanks in advance!

Regards,

Leon Mergen


what is the correct usage of hdfs metrics

2008-08-08 Thread Ivan Georgiev

Hi,

I have been unable to find any examples on how to use the MBeans 
provided from HDFS.

Could anyone that has any experience on the topic share some info.
What is the URL to use to connect to the MBeanServer ?
Is it done through rmi, or only through jvm ?

Any help is highly appreciated.

Please cc me as i am not a member of the list.

Regards:
Ivan


Re: java.io.IOException: Could not get block locations. Aborting...

2008-08-08 Thread Alexander Aristov
I came across the same issue, also with Hadoop 0.17.1.

It would be interesting if someone could say what the cause of the issue is.

Alex

2008/8/8 Steve Loughran <[EMAIL PROTECTED]>

> Piotr Kozikowski wrote:
>
>> Hi there:
>>
>> We would like to know what are the most likely causes of this sort of
>> error:
>>
>> Exception closing
>> file
>> /data1/hdfs/tmp/person_url_pipe_59984_3405334/_temporary/_task_200807311534_0055_m_22_0/part-00022
>> java.io.IOException: Could not get block locations. Aborting...
>>at
>> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2080)
>>at
>> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
>>at
>> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)
>>
>> Our map-reduce job does not fail completely but over 50% of the map tasks
>> fail with this same error.
>> We recently migrated our cluster from 0.16.4 to 0.17.1, previously we
>> didn't have this problem using the same input data in a similar map-reduce
>> job
>>
>> Thank you,
>>
>> Piotr
>>
>>
> When I see this, its because the filesystem isnt completely up: there are
> no locations for a specific file, meaning the client isn't getting back the
> names of any datanodes holding the data from the name nodes.
>
> I've got a patch in JIRA that prints out the name of the file in question,
> as that could be useful.
>



-- 
Best Regards
Alexander Aristov


Re: access jobconf in streaming job

2008-08-08 Thread Rong-en Fan
After looking into streaming source, the answer is via environment
variables. For example, mapred.task.timeout is in
the mapred_task_timeout environment variable.
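
A tiny sketch of reading one of those variables from a streaming mapper (the
script itself is hypothetical):

#!/usr/bin/env bash
# Hypothetical streaming mapper: jobconf values show up as environment
# variables with dots replaced by underscores.
echo "mapred.task.timeout=${mapred_task_timeout:-unset}" >&2   # to the task log
while read -r line; do
  printf '%s\t1\n' "$line"   # pass each input line through as key<TAB>1
done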

On Fri, Aug 8, 2008 at 4:26 PM, Rong-en Fan <[EMAIL PROTECTED]> wrote:
> I'm using streaming with a mapper written in perl. However, an
> issue is that I want to pass some arguments via command line.
> In regular Java mapper, I can access JobConf in Mapper.
> Is there a way to do this?
>
> Thanks,
> Rong-En Fan
>


Re: java.io.IOException: Could not get block locations. Aborting...

2008-08-08 Thread Steve Loughran

Piotr Kozikowski wrote:

Hi there:

We would like to know what are the most likely causes of this sort of
error:

Exception closing
file 
/data1/hdfs/tmp/person_url_pipe_59984_3405334/_temporary/_task_200807311534_0055_m_22_0/part-00022
java.io.IOException: Could not get block locations. Aborting...
at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2080)
at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)

Our map-reduce job does not fail completely but over 50% of the map tasks fail with this same error. 


We recently migrated our cluster from 0.16.4 to 0.17.1, previously we didn't 
have this problem using the same input data in a similar map-reduce job

Thank you,

Piotr



When I see this, it's because the filesystem isn't completely up: there 
are no locations for a specific file, meaning the client isn't getting 
back the names of any datanodes holding the data from the name nodes.


I've got a patch in JIRA that prints out the name of the file in 
question, as that could be useful.
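
Two quick ways to check whether the cluster is in that state (a sketch; both
commands exist in 0.17, though the output details vary, and the path is a
placeholder):

bin/hadoop dfsadmin -report                              # how many datanodes does the namenode see?
bin/hadoop fsck /path/in/dfs -files -blocks -locations   # do the files report block locations?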


access jobconf in streaming job

2008-08-08 Thread Rong-en Fan
I'm using streaming with a mapper written in Perl. However, an
issue is that I want to pass some arguments via the command line.
In a regular Java mapper, I can access the JobConf in the Mapper.
Is there a way to do this?

Thanks,
Rong-En Fan


Hadoop + Servlet Problems

2008-08-08 Thread Kylie McCormick
Hi!
I've gotten Hadoop to run a search as I want, but now I'm trying to
add a servlet component to it.

All of Hadoop works properly, but when I set components from the
servlet instead of setting them via the command-line, Hadoop only
produces temporary output files and doesn't complete.

I've looked at Nutch's NutchBean + Cached file for the servlet
information from Nutch, and there is nothing terribly enlightening
there in the code. Does anyone have any information on Hadoop +
Tomcat/Servlets?

Thanks,
Kylie

--
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost


Re: fuse-dfs

2008-08-08 Thread Sebastian Vieira
Hi Pete,

From within the 0.19 source I did:

ant jar
ant metrics.jar
ant test-core

This resulted in 3 jar files within $HADOOP_HOME/build :

[EMAIL PROTECTED] hadoop-0.19]# ls -l build/*.jar
-rw-r--r-- 1 root root 2201651 Aug  8 08:26 build/hadoop-0.19.0-dev-core.jar
-rw-r--r-- 1 root root 1096699 Aug  8 08:29 build/hadoop-0.19.0-dev-test.jar
-rw-r--r-- 1 root root   55695 Aug  8 08:26
build/hadoop-metrics-0.19.0-dev.jar

I've added these to be included in the CLASSPATH within the wrapper script:

for f in `ls $HADOOP_HOME/build/*.jar`; do
export CLASSPATH=$CLASSPATH:$f
done

This still produced the same error, so (thanks to the more detailed error
output your patch provided) i renamed hadoop-0.19.0-dev-core.jar to
hadoop-core.jar to match the regexp.

Then I figured out that I can't use dfs://master:9000 because in
hadoop-site.xml I specified that dfs should run on port 54310 (doh!). So I
issued this command:

./fuse_dfs_wrapper.sh dfs://master:54310 /mnt/hadoop -d

Success! Even though the output from df -h is .. weird :

fuse  512M 0  512M   0% /mnt/hadoop

I added some data:

for x in `seq 1 25`;do
dd if=/dev/zero of=/mnt/hadoop/test-$x.raw bs=1MB count=10
done

And now the output from df -h is:

fuse  512M -3.4G  3.9G   -  /mnt/hadoop

Note that my HDFS setup now consists of 20 nodes, exporting 15G each, so df
is a little confused. Hadoop's status page (dfshealth.jsp) correctly
displays the output though, evenly dividing the blocks over all the nodes.

What I didn't understand, however, is why there's no fuse-dfs in the
downloadable tarballs. Am I looking in the wrong place perhaps?

Anyway, now that I've got things mounted, I come upon the next problem. I can't
do much more than dd :)

[EMAIL PROTECTED] fuse-dfs]# touch /mnt/hadoop/test.tst
touch: setting times of `/mnt/hadoop/test.tst': Function not implemented


regards,

Sebastian