[Fwd: Re: runtime exceptions not killing job]

2008-03-17 Thread Matt Kent

Or maybe I can't use attachments, so here are the stack traces inline:

--task tracker

2008-03-17 21:58:30
Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.6.0_03-b05 mixed mode):

"Attach Listener" daemon prio=10 tid=0x2aab1205c400 nid=0x523d waiting on condition [0x..0x]
   java.lang.Thread.State: RUNNABLE

"IPC Client connection to bigmike.internal.persai.com/192.168.1.3:9001" daemon prio=10 tid=0x2aab14317000 nid=0x5230 in Object.wait() [0x41c44000..0x41c44ba0]
   java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x2aaaf304da08> (a org.apache.hadoop.ipc.Client$Connection)
   at java.lang.Object.wait(Object.java:485)
   at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
   - locked <0x2aaaf304da08> (a org.apache.hadoop.ipc.Client$Connection)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)

"process reaper" daemon prio=10 tid=0x2aab1205bc00 nid=0x51c6 runnable [0x41f47000..0x41f47da0]
   java.lang.Thread.State: RUNNABLE
   at java.lang.UNIXProcess.waitForProcessExit(Native Method)
   at java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
   at java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)

"Thread-408" prio=10 tid=0x2aab14316000 nid=0x51c5 in Object.wait() [0x41d45000..0x41d45a20]
   java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x2aaaf2cf0948> (a java.lang.UNIXProcess)
   at java.lang.Object.wait(Object.java:485)
   at java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
   - locked <0x2aaaf2cf0948> (a java.lang.UNIXProcess)
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:152)
   at org.apache.hadoop.util.Shell.run(Shell.java:100)
   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:252)
   at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:456)
   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:379)

"SocketListener0-0" prio=10 tid=0x2aab1205e400 nid=0x519d in Object.wait() [0x41038000..0x41038da0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x2aaaf2650c20> (a org.mortbay.util.ThreadPool$PoolThread)
   at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
   - locked <0x2aaaf2650c20> (a org.mortbay.util.ThreadPool$PoolThread)

"[EMAIL PROTECTED]" daemon prio=10 tid=0x2aab183a9000 nid=0x46f5 waiting on condition [0x41a42000..0x41a42aa0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
   at java.lang.Thread.run(Thread.java:619)

"[EMAIL PROTECTED]" daemon prio=10 tid=0x2aab183ce000 nid=0x46ef waiting on condition [0x4184..0x41840c20]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)
   at java.lang.Thread.run(Thread.java:619)

"Map-events fetcher for all reduce tasks on tracker_kentbox.internal.persai.com:localhost/127.0.0.1:43477" daemon prio=10 tid=0x2aab18438400 nid=0x4631 in Object.wait() [0x4173f000..0x4173fda0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x2aaab3f0ace0> (a java.lang.Object)
   at org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:534)
   - locked <0x2aaab3f0ace0> (a java.lang.Object)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10 tid=0x2aab18427400 nid=0x462f waiting on condition [0x4153d000..0x4153daa0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:423)

"IPC Server handler 1 on 43477" daemon prio=10 tid=0x2aab18476c00 nid=0x462e in Object.wait() [0x4143c000..0x4143cb20]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x2aaab41356b0> (a java.util.LinkedList)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:869)
   - locked <0x2aaab41356b0> (a java.util.LinkedList)

"IPC Server handler 0 on 43477" daemon prio=10 tid=0x2aab18389c00 nid=0x462d in Object.wait() [0x4133b000..0x4133bba0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x2aaab41356b0> (a java.util.LinkedList)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:869)
   - locked <0x2aaab41356b0> (a java.util.LinkedList)

Re: runtime exceptions not killing job

2008-03-17 Thread Matt Kent
It seems to happen only with reduce tasks, not map tasks. I reproduced 
it by having a dummy reduce task throw an NPE immediately. The error is 
shown on the reduce details page but the job does not register the task 
as failed. I've attached the task tracker stack trace, the child stack 
trace and a screenshot of the task list page.
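
For reference, the dummy reducer is nothing more elaborate than the sketch below (written from memory against the 0.16-era mapred API, so treat the exact signatures as approximate rather than the code from my actual job):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Dummy reducer that fails on the first record it sees.  Under 0.14.1 this
// killed the task (and soon the job); under 0.16.1 the task appears to hang
// instead of being reported as failed.
public class NpeReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    throw new NullPointerException("deliberate failure for testing");
  }
}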


Matt

Owen O'Malley wrote:


On Mar 17, 2008, at 3:14 PM, Matt Kent wrote:

I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in 
0.14.1, if a map or reduce task threw a runtime exception such as an 
NPE, the task, and ultimately the job, would fail in short order. I 
was running a job on my local 0.16.1 cluster today, and when the 
reduce tasks started throwing NPEs, the tasks just hung. Eventually 
they timed out and were killed, but is this expected behavior in 
0.16.1? I'd prefer the job to fail quickly if NPEs are being thrown.


This sounds like a bug. Tasks should certainly fail immediately if an 
exception is thrown. Do you know where the exception is being thrown? 
Can you get a stack trace of the task from jstack after the exception 
and before the task times out?


Thanks,
   Owen




--
Matt Kent
Co-Founder
Persai
1221 40th St #113
Emeryville, CA 94608
[EMAIL PROTECTED]

Re: Hadoop-Patch build is not progressing for 6 hours

2008-03-17 Thread Nigel Daley

org.apache.hadoop.streaming.TestGzipInput was stuck.  I killed it.

Nige

On Mar 17, 2008, at 8:17 PM, Konstantin Shvachko wrote:


Usually a build takes 2 hours or less.
This one is stuck and I don't see changes in the QUEUE OF PENDING  
PATCHES when I submit a patch.

I guess something is wrong with Hudson.
Could anybody please check?
--Konstantin




Re: runtime exceptions not killing job

2008-03-17 Thread Owen O'Malley


On Mar 17, 2008, at 3:14 PM, Matt Kent wrote:

I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in  
0.14.1, if a map or reduce task threw a runtime exception such as  
an NPE, the task, and ultimately the job, would fail in short  
order. I was running a job on my local 0.16.1 cluster today, and  
when the reduce tasks started throwing NPEs, the tasks just hung.  
Eventually they timed out and were killed, but is this expected  
behavior in 0.16.1? I'd prefer the job to fail quickly if NPEs are  
being thrown.


This sounds like a bug. Tasks should certainly fail immediately if an  
exception is thrown. Do you know where the exception is being thrown?  
Can you get a stack trace of the task from jstack after the exception  
and before the task times out?


Thanks,
   Owen


Problems with 0.16.1

2008-03-17 Thread Owen O'Malley
We believe that there has been a regression in release 0.16.1 with  
respect to the reliability of HDFS. In particular, tasks get stuck  
talking to the data nodes when they are under load. It seems to be  
HADOOP-3033. We are currently testing it further. In the meantime, I  
would suggest holding off on 0.16.1.


-- Owen


Hadoop-Patch build is not progressing for 6 hours

2008-03-17 Thread Konstantin Shvachko

Usually a build takes 2 hours or less.
This one is stuck and I don't see changes in the QUEUE OF PENDING PATCHES when 
I submit a patch.
I guess something is wrong with Hudson.
Could anybody please check?
--Konstantin


Re: runtime exceptions not killing job

2008-03-17 Thread Chris Dyer
I've noticed this behavior as well in 16.0 with RuntimeExceptions in general.

Chris

On Mon, Mar 17, 2008 at 6:14 PM, Matt Kent <[EMAIL PROTECTED]> wrote:
> I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in 0.14.1,
>  if a map or reduce task threw a runtime exception such as an NPE, the
>  task, and ultimately the job, would fail in short order. I was running
>  a job on my local 0.16.1 cluster today, and when the reduce tasks
>  started throwing NPEs, the tasks just hung. Eventually they timed out
>  and were killed, but is this expected behavior in 0.16.1? I'd prefer the
>  job to fail quickly if NPEs are being thrown.
>
>  Matt
>
>  --
>  Matt Kent
>  Co-Founder
>  Persai
>  1221 40th St #113
>  Emeryville, CA 94608
>  [EMAIL PROTECTED]
>
>


Reading from and writing to a particular block

2008-03-17 Thread Cagdas Gerede
I examined
org.apache.hadoop.dfs.DistributedFileSystem

For reads, I couldn't find a method to read a particular block or
chunk. I only see the open method, which opens an input stream for a
path.
How can I read only a specific block?
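
The closest thing I can see is opening the stream and seeking to the byte offset where the block starts, roughly like the sketch below (this is only my understanding of the API; the method names used here are assumptions on my part, not a confirmed answer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Read len bytes starting at a given byte offset of an HDFS file.  There is
// no public "read block N" call; seeking to the offset where that block
// begins and reading block-size bytes seems to be the nearest equivalent.
public class RangeReader {
  public static byte[] readRange(String file, long offset, int len) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(file);
    FileSystem fs = path.getFileSystem(conf);
    FSDataInputStream in = fs.open(path);
    try {
      in.seek(offset);        // position at the start of the desired block
      byte[] buf = new byte[len];
      in.readFully(buf);      // inherited from DataInputStream
      return buf;
    } finally {
      in.close();
    }
  }
}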

Also,
for writes, I cannot write to a particular block. I have to write the
entire file again to modify a single byte. Am I correct?


Thanks for your responses,
Best,

Cagdas


runtime exceptions not killing job

2008-03-17 Thread Matt Kent
I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in 0.14.1, 
if a map or reduce task threw a runtime exception such as an NPE, the 
task, and ultimately the job, would fail in short order. I was running 
a job on my local 0.16.1 cluster today, and when the reduce tasks 
started throwing NPEs, the tasks just hung. Eventually they timed out 
and were killed, but is this expected behavior in 0.16.1? I'd prefer the 
job to fail quickly if NPEs are being thrown.


Matt

--
Matt Kent
Co-Founder
Persai
1221 40th St #113
Emeryville, CA 94608
[EMAIL PROTECTED]



secondarynamenode

2008-03-17 Thread Jorgen Johnson
Hi,

Anyone have stats for the impact on system resources (disk/memory/cpu) of
running secondarynamenode on a host? Are people running multiple secondaries
for redundancy or just going the route of having the namenode (& secondary)
replicate edits & fsimage to multiple physical disks (and nfs when
possible)?  Basically, I'm trying to determine the possible impact on
performance of running secondarynamenode daemon on several machines, as well
as if there's any potential benefit of running more than 1.

What are other people doing for disaster recovery on namenode?

-jorgenj

-- 
"Liberties are not given, they are taken."
- Aldous Huxley


Re: single node Hbase

2008-03-17 Thread stack
Try our 'getting started': 
http://hadoop.apache.org/hbase/docs/current/api/index.html.

St.Ack


Peter W. wrote:

Hello,

Are there any Hadoop documentation resources showing
how to run the current version of Hbase on a single node?

Thanks,

Peter W.




single node Hbase

2008-03-17 Thread Peter W.

Hello,

Are there any Hadoop documentation resources showing
how to run the current version of Hbase on a single node?

Thanks,

Peter W.


Re: [some bugs] Re: file permission problem

2008-03-17 Thread s29752-hadoopuser
Hi Stefan,

> any magic we can do with hadoop.dfs.umask?
>
dfs.umask is similar to Unix umask.

> Or is there any other off switch for the file security?
>
If dfs.permissions is set to false, then permission checking will be turned off.  

For the two questions above, see 
http://hadoop.apache.org/core/docs/r0.16.1/hdfs_permissions_guide.html for more 
details
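
As a quick illustration (only a sketch: both keys are cluster-side settings that normally live in the namenode's hadoop-site.xml, and the dfs.umask default shown below is an assumption on my part), the effective values can be read back from a loaded Configuration:

import org.apache.hadoop.conf.Configuration;

// Print the effective values of the two permission-related keys.
// dfs.permissions=false turns permission checking off on the namenode;
// dfs.umask is applied to newly created files, like a Unix umask.
public class PermissionSettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();    // loads hadoop-default.xml / hadoop-site.xml
    boolean checking = conf.getBoolean("dfs.permissions", true);
    String umask = conf.get("dfs.umask", "022"); // default value here is an assumption
    System.out.println("dfs.permissions = " + checking);
    System.out.println("dfs.umask = " + umask);
  }
}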

> I definitely can reproduce the problem Johannes describes ...
>
I guess you are using the nightly builds which have the bug.  Please try the 
0.16.1 release or current trunk.

> Beside of that I had some interesting observations.
> If I have permissions to write to a folder A I can delete folder A and 
> file B that is inside of folder A even if I do have no permissions for B.
>
This is also true for POSIX or Unix, which Hadoop permissions are based on.

> Also I noticed following in my dfs
> [EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598
> Found 1 items
> /user/joa23/myApp-1205474968598/VOICE_CALL   2008-03-13 16:00   
> rwxr-xr-x   hadoop   supergroup
> [EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls 
> /user/joa23/myApp-1205474968598/VOICE_CALL
> Found 1 items
> /user/joa23/myApp-1205474968598/VOICE_CALL/part-027311   
> 2008-03-13 16:00   rw-r--r--   joa23   supergroup
>
> Do I miss something or was I able to write as user joa23 into a 
> folder owned by hadoop where I should have no permissions. :-O.
> Should I open some jira issues?
>
Suppose joa23 is not a superuser.  Then, no.

The output above only shows that a file owned by joa23 exists in a directory owned by 
hadoop.  This can definitely be done by a sequence of commands with chmod/chown.

Suppose joa23 is not a superuser.  If joa23 can create a file, say by "hadoop 
fs -put ...", under hadoop's directory with rwxr-xr-x, then it is a bug.  But I 
don't think we can do this.

Hope this helps.

Nicholas






Re: [some bugs] Re: file permission problem

2008-03-17 Thread s29752-hadoopuser
Hi,

Let me clarify the versions having this problem.
0.16.0 release, 0.16.1 release, current trunk: 
no problem

Nightly builds between 0.16.0 and 0.16.1 before HADOOP-2391 or after 
HADOOP-2915: 
no problem

Nightly builds between 0.16.0 and 0.16.1 after HADOOP-2391 and before 
HADOOP-2915: 
bug exists

Similarly, code downloaded from trunk before HADOOP-2391 or after 
HADOOP-2915: 
no problem

Code downloaded from trunk after HADOOP-2391 and before HADOOP-2915: 
bug exists

Sorry for the confusion.

Nicholas


- Original Message 
From: Stefan Groschupf <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Saturday, March 15, 2008 8:02:07 PM
Subject: Re: [some bugs] Re: file permission problem

Great - it is even already fixed in 16.1!
Thanks for the hint!
Stefan

On Mar 14, 2008, at 2:49 PM, Andy Li wrote:

> I think this is the same problem related to this mail thread.
>
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg02759.html
>
> A JIRA has been filed, please see HADOOP-2915.




Re: [core-user] Processing binary files Howto??

2008-03-17 Thread Ted Dunning

You can certainly do this, but you are simply repeating the work that hadoop
developers have already done.

Can you say what kind of satellite data you will be processing?  If it is
imagery, then I would imagine that Google's use of map-reduce to prepare
image tiles for Google maps would be an interesting example.


On 3/17/08 11:12 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
wrote:

> Another thing we want to consider is to make our simple grid aware
> of the data location in order to move the task to the node which
> contains the data. A way of getting the hostname where the
> filename-block is and then calling the dfs API from that node.



[core-user] Integration of dispersed clusters.

2008-03-17 Thread Alfonso Olias Sanz
Hi,

I haven't gone so far with the hadoop dfs documentation, but I would
like to know if it is possible to integrate dispersed clusters
(separated by firewalls).

We cannot open ports, so we have to traverse the firewall through
http/https (tunnelling). Apart from its performance drawbacks, is it
possible?

Regards
Alfonso


Re: [core-user] Processing binary files Howto??

2008-03-17 Thread Ted Dunning

Use the default block size of hadoop if you can (64 MB).

Then don't worry about splitting or replication.  Hadoop will do that.
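
If you do end up choosing a per-file block size rather than taking the default, it can be set when the file is written. A sketch, assuming the long-standing FileSystem.create overload that takes replication and block size (the numbers here are arbitrary placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write a file with an explicit block size instead of the cluster default,
// so one binary observation file occupies exactly one block.
public class BlockSizedWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path(args[0]);
    FileSystem fs = out.getFileSystem(conf);
    long blockSize = 128 * 1024 * 1024;           // e.g. 128 MB instead of the 64 MB default
    int buffer = conf.getInt("io.file.buffer.size", 4096);
    FSDataOutputStream stream =
        fs.create(out, true /* overwrite */, buffer, (short) 3 /* replication */, blockSize);
    stream.write(new byte[] { 0 });               // ... stream the binary payload here ...
    stream.close();
  }
}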


On 3/17/08 11:12 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
wrote:

> But we could also try to make every binary file as big as
> a block size (then we need to decide the block size). Or if they are
> bigger than a block, then split them before adding the data to the
> cluster.



Re: [core-user] Processing binary files Howto??

2008-03-17 Thread Alfonso Olias Sanz
Hi Ted

Thanks for the info :)

We do not know yet how the input data is going to be generated,
because the data is going to be observations from a satellite... still
under design.  But we could also try to make every binary file as big as
a block size (then we need to decide the block size). Or if they are
bigger than a block, then split them before adding the data to the
cluster. But the file should be large enough, as you said, so that its
computation lasts more than 10 seconds.

Then each task (map) will process the files that are locally stored in
a node (does the framework control this?) 

All this is fine. We already have a grid solution with agents on
every node polling for jobs. Each job sent to a node computes 1-n
files (could be zipped) of simulated data.

One solution is to move to map/reduce and let the framework do the
distribution of tasks and data.


Another thing we want to consider is to make our simple grid aware
of the data location in order to move the task to the node which
contains the data. A way of getting the hostname where the
filename-block is and then calling the dfs API from that node.

Cheers
Alfonso


On 17/03/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>
>  This sounds very different from your earlier questions.
>
>  If you have a moderate (10's to 1000's) number of binary files, then it is
>  very easy to write a special purpose InputFormat that tells hadoop that the
>  file is not splittable.  This allows you to add all of the files as inputs
>  to the map step and you will get the locality that you want.  The files
>  should be large enough so that you take at least 10 seconds or more
>  processing them to get good performance relative to startup costs.  If they
>  are not, then you may want to package them up in a form that can be read
>  sequentially.  This need not be splittable, but it would be nice if it were.
>
>  If you are producing a single file per hour, then this style works pretty
>  well.  In my own work, we have a few compressed and encrypted files each
>  hour that are map-reduced into a more congenial and splittable form each
>  hour.  Then subsequent steps are used to aggregate or process the data as
>  needed.
>
>  This gives you all of the locality that you were looking for.
>
>
>  On 3/17/08 6:19 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
>  wrote:
>
>
>  > Hi there.
>  >
>  > After reading a bit of the hadoop framework and trying the WordCount
>  > example. I have several doubts about how to use map /reduce with
>  > binary files.
>  >
>  > In my case binary files are generated in a time line basis. Let's say
>  > 1 file per hour. The size of each file is different (briefly we are
>  > getting pictures from space and the stars density is different between
>  > observations). The mappers, rather than receiving the file content.
>  > They have to receive the file name.  I read that if the input files
>  > are big (several blocks), they are split among several tasks in
>  > same/different node/s (block sizes?).  But we want each map task
>  > processes a file rather than a block (or a line of a file as in the
>  > WordCount sample).
>  >
>  > In a previous post I did to this forum. I was recommended to use an
>  > input file with all the file names, so the mappers would receive the
>  > file name. But there is a drawback related with data  location (also
>  > was mentioned this), because data then has to be moved from one node
>  > to another.   Data is not going to be replicated to all the nodes.  So
>  > a task taskA that has to process fileB on nodeN, it has to be executed
>  > on nodeN. How can we achieve that?  What if a task requires a file
>  > that is on another node? Does the framework move the logic to that
>  > node?  We need to define a URI file map in each node
>  > (hostname/path/filename) for all the files. Tasks would access the
>  > local URI file map in order to process the files.
>  >
>  > Another approach we have thought is to use the distributed file system
>  > to load balance the data among the nodes. And have our processes
>  > running on every node (without using the map/reduce framework). Then
>  > each process has to access to the local node to process the data,
>  > using the dfs API (or checking the local URI file map).  This approach
>  > would be more flexible to us, because depending on the machine
>  > (quad-core, dual-core) we know how many java threads we can run in order
>  > to get the maximum performance of the machine.  Using the framework we
>  > can only say a number of tasks to be executed on every node, but all
>  > the nodes have to be the same.
>  >
>  > URI file map.
>  > Once the files are copied to the distributed file system, then we need
>  > to create this table map. Or is it a way to access a  at
>  > the data node and retrieve the files it handles? rather than getting
>  > all the files in all the nodes in that   ie
>  >
>  > NodeA  /tmp/.../mytask/input/fileA-1
>  > /tmp/.../mytask/input

Re: Multiple Output Value Classes

2008-03-17 Thread Doug Cutting

Stu Hood wrote:

But I'm trying to _output_ multiple different value classes from a Mapper, and 
not having any luck.


You can wrap things in ObjectWritable.  When writing, this records the 
class name with each instance, then, when reading, constructs an 
appropriate instance and reads it.  It can wrap Writable, String, 
primitive types, and arrays of these.
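
A minimal sketch of the pattern (the key/value types below are placeholders and the mapred signatures are from memory, so treat it as illustrative rather than definitive):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emit two different value classes from one mapper by wrapping each in
// ObjectWritable; the wrapper records the concrete class with every instance.
public class MixedValueMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, ObjectWritable> {

  public void map(LongWritable key, Text line,
                  OutputCollector<Text, ObjectWritable> out,
                  Reporter reporter) throws IOException {
    out.collect(new Text("len"), new ObjectWritable(new IntWritable(line.getLength())));
    out.collect(new Text("raw"), new ObjectWritable(line));
  }
}

The job would then declare ObjectWritable as the map output value class (e.g. 
conf.setMapOutputValueClass(ObjectWritable.class)), and the reducer calls get() 
on each wrapper and checks the runtime type.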


Doug


Re: [core-user] Processing binary files Howto??

2008-03-17 Thread Ted Dunning


This sounds very different from your earlier questions.

If you have a moderate (10's to 1000's) number of binary files, then it is
very easy to write a special purpose InputFormat that tells hadoop that the
file is not splittable.  This allows you to add all of the files as inputs
to the map step and you will get the locality that you want.  The files
should be large enough so that you take at least 10 seconds or more
processing them to get good performance relative to startup costs.  If they
are not, then you may want to package them up in a form that can be read
sequentially.  This need not be splittable, but it would be nice if it were.
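
Something along these lines is the basic shape, assuming the classic mapred
FileInputFormat hierarchy and its protected isSplitable hook (the class name
is made up, and this is a sketch rather than a finished implementation):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Marks every input file as non-splittable, so each file goes to exactly one
// map task and the scheduler can place that task near the file's blocks.
// For genuinely binary payloads you would pair this override with a custom
// RecordReader that hands the mapper the whole file (or just its path).
public class WholeFileInputFormat extends TextInputFormat {
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}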

If you are producing a single file per hour, then this style works pretty
well.  In my own work, we have a few compressed and encrypted files each
hour that are map-reduced into a more congenial and splittable form each
hour.  Then subsequent steps are used to aggregate or process the data as
needed.

This gives you all of the locality that you were looking for.


On 3/17/08 6:19 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
wrote:

> Hi there.
> 
> After reading a bit of the hadoop framework and trying the WordCount
> example. I have several doubts about how to use map /reduce with
> binary files.
> 
> In my case binary files are generated in a time line basis. Let's say
> 1 file per hour. The size of each file is different (briefly we are
> getting pictures from space and the stars density is different between
> observations). The mappers, rather than receiving the file content.
> They have to receive the file name.  I read that if the input files
> are big (several blocks), they are split among several tasks in
> same/different node/s (block sizes?).  But we want each map task
> processes a file rather than a block (or a line of a file as in the
> WordCount sample).
> 
> In a previous post I did to this forum. I was recommended to use an
> input file with all the file names, so the mappers would receive the
> file name. But there is a drawback related with data  location (also
> was mentioned this), because data then has to be moved from one node
> to another.   Data is not going to be replicated to all the nodes.  So
> a task taskA that has to process fileB on nodeN, it has to be executed
> on nodeN. How can we achieve that?  What if a task requires a file
> that is on another node? Does the framework move the logic to that
> node?  We need to define a URI file map in each node
> (hostname/path/filename) for all the files. Tasks would access the
> local URI file map in order to process the files.
> 
> Another approach we have thought is to use the distributed file system
> to load balance the data among the nodes. And have our processes
> running on every node (without using the map/reduce framework). Then
> each process has to access to the local node to process the data,
> using the dfs API (or checking the local URI file map).  This approach
> would be more flexible to us, because depending on the machine
> (quad-core, dual-core) we know how many java threads we can run in order
> to get the maximum performance of the machine.  Using the framework we
> can only say a number of tasks to be executed on every node, but all
> the nodes have to be the same.
> 
> URI file map.
> Once the files are copied to the distributed file system, then we need
> to create this table map. Or is it a way to access a  at
> the data node and retrieve the files it handles? rather than getting
> all the files in all the nodes in that   ie
> 
> NodeA  /tmp/.../mytask/input/fileA-1
> /tmp/.../mytask/input/fileA-2
> 
> NodeB /tmp/.../mytask/input/fileB
> 
> A process at nodeB listing the /tmp/.../input directory, would get only fileB
> 
> Any ideas?
> Thanks
> Alfonso.



Re: [core-user] Move application to Map/Reduce architecture with Hadoop

2008-03-17 Thread Ted Dunning

Replication is vital in large or even medium-sized clusters for reliability.

Replication also helps distribution.


On 3/17/08 2:48 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
wrote:

> But what I wanted to say was that we need to set up a
> cluster in a way that the data is distributed among all the data
> nodes. Instead of replicated. 



[core-user] Processing binary files Howto??

2008-03-17 Thread Alfonso Olias Sanz
Hi there.

After reading a bit of the hadoop framework and trying the WordCount
example. I have several doubts about how to use map /reduce with
binary files.

In my case binary files are generated on a timeline basis. Let's say
1 file per hour. The size of each file is different (briefly, we are
getting pictures from space and the star density is different between
observations). The mappers, rather than receiving the file content,
have to receive the file name.  I read that if the input files
are big (several blocks), they are split among several tasks on
same/different node/s (block sizes?).  But we want each map task to
process a file rather than a block (or a line of a file as in the
WordCount sample).

In a previous post I made to this forum, I was recommended to use an
input file with all the file names, so the mappers would receive the
file name. But there is a drawback related to data location (this was
also mentioned), because data then has to be moved from one node
to another.   Data is not going to be replicated to all the nodes.  So
a task taskA that has to process fileB on nodeN has to be executed
on nodeN. How can we achieve that?  What if a task requires a file
that is on another node? Does the framework move the logic to that
node?  We need to define a URI file map in each node
(hostname/path/filename) for all the files. Tasks would access the
local URI file map in order to process the files.

Another approach we have thought of is to use the distributed file system
to load balance the data among the nodes, and have our processes
running on every node (without using the map/reduce framework). Then
each process has to access the local node to process the data,
using the dfs API (or checking the local URI file map).  This approach
would be more flexible for us, because depending on the machine
(quad-core, dual-core) we know how many java threads we can run in order
to get the maximum performance out of the machine.  Using the framework we
can only specify a number of tasks to be executed on every node, but all
the nodes have to be the same.

URI file map.
Once the files are copied to the distributed file system, then we need
to create this table map. Or is there a way to access a  at
the data node and retrieve the files it handles, rather than getting
all the files in all the nodes in that   ie

NodeA  /tmp/.../mytask/input/fileA-1
/tmp/.../mytask/input/fileA-2

NodeB /tmp/.../mytask/input/fileB

A process at nodeB listing the /tmp/.../input directory, would get only fileB

Any ideas?
Thanks
Alfonso.


Re: [core-user] Move application to Map/Reduce architecture with Hadoop

2008-03-17 Thread Alfonso Olias Sanz
Hi Stu

thanks for the link.

Well, I should have been more precise. I do not think we will have 1
million files. But what I wanted to say was that we need to set up a
cluster in a way that the data is distributed among all the data
nodes, instead of replicated.   That's the first step; the second is to
think about fault tolerance/replication of the data nodes.

Rui suggested having a look at "distcp".
I will also have a look at the number of mappers per node.

Thanks
Alfonso

On 15/03/2008, Stu Hood <[EMAIL PROTECTED]> wrote:
> Hello Alfonso,
>
>  A Hadoop namenode with a reasonable amount of memory (4-8GB) should be able 
> to handle a few million files (but I can't find a reference to back that 
> statement up: anyone?).
>
>  Unfortunately, the Hadoop jobtracker is not currently capable of handling 
> jobs with that many inputs. See 
> https://issues.apache.org/jira/browse/HADOOP-2119
>
>  The only approach I know of currently (without modifying your input files) 
> is the following, but it has the nasty side-effect of losing all input 
> locality for the tasks:
>
>  Do all of your processing in the Map tasks, and implement them a lot like 
> the distcp tool is implemented: the input for your Map task is a file 
> containing a list of your files to be processed, one per line. You then use 
> the Hadoop Filesystem API to read and process the files, and write your 
> outputs. You would then set the number of Reducers for the job to 0, since 
> you won't have any key->value output from the Mappers.
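>
>  A rough sketch of what I mean (class names, key/value types and the
>  configure hook below are just placeholders, not working code from
>  anyone's cluster):
>
>  import java.io.IOException;
>
>  import org.apache.hadoop.fs.FSDataInputStream;
>  import org.apache.hadoop.fs.FileSystem;
>  import org.apache.hadoop.fs.Path;
>  import org.apache.hadoop.io.LongWritable;
>  import org.apache.hadoop.io.Text;
>  import org.apache.hadoop.mapred.JobConf;
>  import org.apache.hadoop.mapred.MapReduceBase;
>  import org.apache.hadoop.mapred.Mapper;
>  import org.apache.hadoop.mapred.OutputCollector;
>  import org.apache.hadoop.mapred.Reporter;
>
>  // Each input record is one line of the listing file: the path of a file
>  // to process.  The mapper opens that path itself via the FileSystem API
>  // and does the real work; the job runs with conf.setNumReduceTasks(0).
>  public class FileListMapper extends MapReduceBase
>      implements Mapper<LongWritable, Text, Text, Text> {
>
>    private JobConf conf;
>
>    public void configure(JobConf job) {
>      this.conf = job;
>    }
>
>    public void map(LongWritable key, Text value,
>                    OutputCollector<Text, Text> output,
>                    Reporter reporter) throws IOException {
>      Path file = new Path(value.toString());
>      FileSystem fs = file.getFileSystem(conf);
>      FSDataInputStream in = fs.open(file);
>      try {
>        // ... read and process the file here ...
>        reporter.setStatus("processed " + file);
>        output.collect(value, new Text("done"));
>      } finally {
>        in.close();
>      }
>    }
>  }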
>
>  Thanks,
>  Stu
>
>
>
>  -Original Message-
>  From: Alfonso Olias Sanz <[EMAIL PROTECTED]>
>  Sent: Friday, March 14, 2008 8:05pm
>  To: core-user@hadoop.apache.org
>  Subject: [core-user] Move application to Map/Reduce architecture with Hadoop
>
>  Hi
>
>  I have just started using hadoop and HDFS.  I have done the WordCount
>  test application which gets some input files, process the files, and
>  generates and output file.
>
>
> I have a similar application, that has a million input files and has
>  to produce a million output files. The correlation between
>  input/output is 1:1.
>
>
> This process is suitable to run with a Map/Reduce approach.  But I
>  have several doubts I hope some body can answer me.
>
>  *** What should be the Reduce function??? Because there is no merge of
>  data of the running map processes.
>
>
> *** I set up a 2-node cluster for the WordCount test. They work as
>  master+slave, slave. Before  launching the process I copied the files
>  using $HADOOP_HOME/bin/hadoop dfs -copyFromLocal 
>  
>
>
> These files were replicated on both nodes. Is there any way to avoid
>  the files being replicated to all the nodes and instead have them
>  distributed among all the nodes, with no replication of files?
>
>
>
> *** During the WordCount test, 25 map jobs were launched!  For our
>  application this is overkill. We have done several performance tests
>  without using hadoop and we have seen  that we can launch 1
>  application per core.  So is there any way to configure the cluster
>  in order to launch a number of tasks per core? So depending on
>  dual-core or quad-core pcs, the number of running processes will be
>  different.
>
>
>
> Thanks in advance.
>  alf.
>
>
>


Re: [ANNOUNCE] Hadoop release 0.16.1 available

2008-03-17 Thread Erwan Arzur
No trouble at all. Just wanted to make sure I test my code against the
appropriate code base and would not hit bugs that were fixed in releases
in between.

Thanks

Erwan

On Sun, Mar 16, 2008 at 9:45 PM, stack <[EMAIL PROTECTED]> wrote:

> Erwan:
>
> I just posted a longer note to the hbase-user mailing list on this topic.
>
> In short, the most recent hbase release is that which is bundled as a
> contrib in hadoop 0.16.0.  The hbase that is in hadoop 0.16.1 is the
> same as is in the hadoop-0.16.0-hbase.jar.  Since the 0.16.0 release,
> hbase graduated out of contrib and became a subproject homed at a
> different location in svn.  We also got our own area in JIRA,
> mailing-lists, website, etc.  (See hbase.org).  All development since
> has gone on at the new svn location, not under hadoop at
> src/contrib/hbase.  The next hbase release will be 0.1.0 made from the
> hbase 0.1 branch.   Our 0.1 branch is intended to ride along atop the
> hadoop 0.16 branch.  0.1.0 will be the hbase that is in hadoop 0.16.0
> with fixes and some changes to accommodate our new standing as a project
> dependent, but apart from our parent hadoop.  We hope to put up a
> candidate later this week.
>
> Originally, to cut confusion, the idea was that hbase would quickly
> release a 0.1.0 from our new location, before hadoop 0.16.1 went out,
> and then hbase would be cleaned out of the hadoop 0.16 branch as it has
> been from hadoop TRUNK, but we've been laggards.
>
> We'll do up a better explanation of the transition at hbase.org over the
> next few days.  Sorry for any trouble caused meantime.
> St.Ack
>
>
>
> Erwan Arzur wrote:
> > Hello everybody,
> >
> > I am a bit confused by the release notes (and
> > http://hadoop.apache.org/hbase/releases.html) stating that hadoop and
> hbase
> > are going to be separate from now on, and :
> >
> > ~/hadoop-0.16.1$ ls -l contrib/hbase/
> > total 700
> > drwxr-xr-x 2 hadoop hadoop   4096 2008-03-15 08:56 bin
> > drwxr-xr-x 2 hadoop hadoop   4096 2008-03-15 08:55 conf
> > -rw-r--r-- 1 hadoop hadoop 693401 2008-03-09 05:43
> hadoop-0.16.1-hbase.jar
> > drwxr-xr-x 2 hadoop hadoop   4096 2008-03-15 08:55 lib
> > drwxr-xr-x 6 hadoop hadoop   4096 2008-03-09 05:43 webapps
> >
> > Is the release that shipped with hadoop 0.16.1 nevertheless the
> equivalent
> > to 0.1.0 ? Which hbase release should be stable enough for us to use ?
> >
> > Thanks in advance,
> >
> > Erwan
> >
> > On Fri, Mar 14, 2008 at 7:55 PM, Nigel Daley <[EMAIL PROTECTED]>
> wrote:
> >
> >
> >> Release 0.16.1 fixes critical bugs in 0.16.0.  Note that HBase
> >> releases are now maintained at http://hadoop.apache.org/hbase/
> >>
> >> For Hadoop release details and downloads, visit:
> >>
> >>http://hadoop.apache.org/core/releases.html
> >>
> >> Thanks to all who contributed to this release!
> >>
> >> Nigel
> >>
> >>
> >>
> >
> >
>
>