my questions

2008-03-18 Thread zr0243
On Solaris 10, it had some problems with the "bin/hadoop jar 
hadoop-0.16.1-examples.jar grep input output 'dfs[a-z.]+'"
08/03/18 13:12:29 INFO mapred.FileInputFormat: Total input paths to process : 11
08/03/18 13:12:30 INFO mapred.JobClient: Running job: job_200803181307_0001
08/03/18 13:12:31 INFO mapred.JobClient:  map 0% reduce 0%
08/03/18 13:12:32 INFO mapred.JobClient: Task Id : task_200803181307_0001_m_00_0, Status : FAILED
Error initializing task_200803181307_0001_m_00_0:
java.io.IOException: Login failed: Cannot run program "whoami": error=2, No such file or directory
    at org.apache.hadoop.dfs.DFSClient.createNamenode(DFSClient.java:124)
    at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:143)
    at org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:65)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1180)
    at org.apache.hadoop.fs.FileSystem.access$400(FileSystem.java:53)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1197)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148)
    at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:122)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:94)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:620)
    at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1282)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:923)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1318)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2210)
08/03/18 13:12:32 WARN mapred.JobClient: Error reading task outputhttp://zr:50060/tasklog?plaintext=true&taskid=task_200803181307_0001_m_00_0&filter=stdout
08/03/18 13:12:32 WARN mapred.JobClient: Error reading task outputhttp://zr:50060/tasklog?plaintext=true&taskid=task_200803181307_0001_m_00_0&filter=stderr
08/03/18 13:12:37 INFO mapred.JobClient: Task Id : task_200803181307_0001_m_01_0, Status : FAILED
Error initializing task_200803181307_0001_m_01_0:
java.io.IOException: Login failed: Cannot run program "whoami": error=2, No such file or directory
    at org.apache.hadoop.dfs.DFSClient.createNamenode(DFSClient.java:124)
    at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:143)
    at org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:65)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1180)
    at org.apache.hadoop.fs.FileSystem.access$400(FileSystem.java:53)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1197)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148)
    at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:122)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:94)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:620)
    at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1282)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:923)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1318)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2210)
08/03/18 13:12:37 WARN mapred.JobClient: Error reading task outputhttp://zr:50060/tasklog?plaintext=true&taskid=task_200803181307_0001_m_01_0&filter=stdout
08/03/18 13:12:37 WARN mapred.JobClient: Error reading task outputhttp://zr:50060/tasklog?plaintext=true&taskid=task_200803181307_0001_m_01_0&filter=stderr
08/03/18 13:12:37 INFO mapred.JobClient: Task Id : task_200803181307_0001_m_02_0, Status : FAILED
Error initializing task_200803181307_0001_m_02_0:
java.io.IOException: Login failed: Cannot run program "whoami": error=2, No such file or directory
    at org.apache.hadoop.dfs.DFSClient.createNamenode(DFSClient.java:124)
    at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:143)
    at org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:65)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1180)
    at org.apache.hadoop.fs.FileSystem.access$400(FileSystem.java:53)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1197)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148)
    at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:122)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:94)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:620)
    at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1282)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:923)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1318)
    at org.ap

Re: MapReduce failure

2008-03-18 Thread Ved Prakash
I increased the heap size as you suggested, and I can now run a MapReduce job
on the cluster.

thanks
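
For anyone else hitting the same OutOfMemoryError: the mapred.child.java.opts
setting Amar points to below can also be raised per job from the driver code.
A minimal sketch against the 0.16-era API (the class name and the -Xmx value
here are only illustrative, not a recommendation):

    import org.apache.hadoop.mapred.JobConf;

    public class HeapSizeExample {
        public static void main(String[] args) {
            // Per-job override of the options passed to the forked child JVMs.
            // Cluster-wide, the same key can be set in hadoop-site.xml instead.
            JobConf conf = new JobConf(HeapSizeExample.class);
            conf.set("mapred.child.java.opts", "-Xmx512m");
            // ... set mapper/reducer classes and input/output paths,
            // then submit with JobClient.runJob(conf)
        }
    }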

On Mon, Mar 10, 2008 at 10:58 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:

> What is the heap size you are using for your tasks? Check
> 'mapred.child.java.opts' in your hadoop-default.xml. Try increasing it.
> This will happen if you try running the random-writer + sort examples with
> default parameters. The maps are not able to spill the data to the disk.
> Btw what version of HADOOP are you using?
> Amar
> On Mon, 10 Mar 2008, Ved Prakash
> wrote:
>
> > Hi friends,
> >
> > I have made a cluster of 3 machines, one of them is master, and other 2
> > slaves. I executed a mapreduce job on master but after Map, the
> execution
> > terminates and Reduce doesn't happen. I have checked dfs and no output
> > folder gets created.
> >
> > this is the error I see
> >
> > 08/03/10 10:35:21 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_64_0, Status : FAILED
> > java.lang.OutOfMemoryError: Java heap space
> >    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
> >    at java.io.DataOutputStream.write(DataOutputStream.java:90)
> >    at org.apache.hadoop.io.Text.write(Text.java:243)
> >    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
> >    at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
> >    at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
> >    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> >    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)
> >
> > 08/03/10 10:35:22 INFO mapred.JobClient:  map 55% reduce 17%
> > 08/03/10 10:35:31 INFO mapred.JobClient:  map 56% reduce 17%
> > 08/03/10 10:35:51 INFO mapred.JobClient:  map 57% reduce 17%
> > 08/03/10 10:36:04 INFO mapred.JobClient:  map 58% reduce 17%
> > 08/03/10 10:36:07 INFO mapred.JobClient:  map 57% reduce 17%
> > 08/03/10 10:36:07 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_71_0, Status : FAILED
> > java.lang.OutOfMemoryError: Java heap space
> >    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
> >    at java.io.DataOutputStream.write(DataOutputStream.java:90)
> >    at org.apache.hadoop.io.Text.write(Text.java:243)
> >    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
> >    at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
> >    at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
> >    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> >    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)
> >
> > Hadoop tries to recover from these failures, but the MapReduce application
> > never creates any output. Can anyone tell me why this is happening?
> >
> > Thanks
> >
>


Re: [Fwd: Re: runtime exceptions not killing job]

2008-03-18 Thread Amareshwari Sriramadasu

Thanks, Matt, for the info.
I raised a Jira for this at 
https://issues.apache.org/jira/browse/HADOOP-3039


Thanks
Amareshwari
Matt Kent wrote:

Or maybe I can't use attachments, so here are the stack traces inline:

--task tracker

2008-03-17 21:58:30
Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.6.0_03-b05 mixed 
mode):


"Attach Listener" daemon prio=10 tid=0x2aab1205c400 nid=0x523d 
waiting on condition [0x..0x]

  java.lang.Thread.State: RUNNABLE

"IPC Client connection to 
bigmike.internal.persai.com/192.168.1.3:9001" daemon prio=10 
tid=0x2aab14317000 nid=0x5230 in Object.wait() 
[0x41c44000..0x41c44ba0]

  java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x2aaaf304da08> (a 
org.apache.hadoop.ipc.Client$Connection)

   at java.lang.Object.wait(Object.java:485)
   at 
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:234)
   - locked <0x2aaaf304da08> (a 
org.apache.hadoop.ipc.Client$Connection)

   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:273)

"process reaper" daemon prio=10 tid=0x2aab1205bc00 nid=0x51c6 
runnable [0x41f47000..0x41f47da0]

  java.lang.Thread.State: RUNNABLE
   at java.lang.UNIXProcess.waitForProcessExit(Native Method)
   at java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
   at java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)

"Thread-408" prio=10 tid=0x2aab14316000 nid=0x51c5 in 
Object.wait() [0x41d45000..0x41d45a20]

  java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x2aaaf2cf0948> (a java.lang.UNIXProcess)
   at java.lang.Object.wait(Object.java:485)
   at java.lang.UNIXProcess.waitFor(UNIXProcess.java:165)
   - locked <0x2aaaf2cf0948> (a java.lang.UNIXProcess)
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:152)
   at org.apache.hadoop.util.Shell.run(Shell.java:100)
   at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:252)

   at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:456)
   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:379)

"SocketListener0-0" prio=10 tid=0x2aab1205e400 nid=0x519d in 
Object.wait() [0x41038000..0x41038da0]

  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x2aaaf2650c20> (a 
org.mortbay.util.ThreadPool$PoolThread)

   at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
   - locked <0x2aaaf2650c20> (a 
org.mortbay.util.ThreadPool$PoolThread)


"[EMAIL PROTECTED]" daemon prio=10 
tid=0x2aab183a9000 nid=0x46f5 waiting on condition 
[0x41a42000..0x41a42aa0]

  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at 
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)

   at java.lang.Thread.run(Thread.java:619)

"[EMAIL PROTECTED]" daemon prio=10 
tid=0x2aab183ce000 nid=0x46ef waiting on condition 
[0x4184..0x41840c20]

  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at 
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:597)

   at java.lang.Thread.run(Thread.java:619)

"Map-events fetcher for all reduce tasks on 
tracker_kentbox.internal.persai.com:localhost/127.0.0.1:43477" daemon 
prio=10 tid=0x2aab18438400 nid=0x4631 in Object.wait() 
[0x4173f000..0x4173fda0]

  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x2aaab3f0ace0> (a java.lang.Object)
   at 
org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread.run(TaskTracker.java:534) 


   - locked <0x2aaab3f0ace0> (a java.lang.Object)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10 
tid=0x2aab18427400 nid=0x462f waiting on condition 
[0x4153d000..0x4153daa0]

  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:423)

"IPC Server handler 1 on 43477" daemon prio=10 tid=0x2aab18476c00 
nid=0x462e in Object.wait() [0x4143c000..0x4143cb20]

  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x2aaab41356b0> (a java.util.LinkedList)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:869)
   - locked <0x2aaab41356b0> (a java.util.LinkedList)

"IPC Server handler 0 on 43477" daemon prio=10 tid=0x2aab18389c00 
nid=0x462d in Object.wait() [0x4133b000..0x4133bba0]

  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waitin

Re: [core-user] Processing binary files Howto??

2008-03-18 Thread Alfonso Olias Sanz
On 17/03/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>  You can certainly do this, but you are simply repeating the work that hadoop
>  developers have already done.

Well, the fact is that we have dispersed clusters, so the node agents use HTTP
to talk to the grid manager. At present we are able to decide the number of
app instances + java threads we can run on each node based on its number of
cores and memory. But as far as I have seen this is not possible with Hadoop:
you can set the number of tasks a node may run, but the setting is the same
for every node.



>
>  Can you say what kind of satellite data you will be processing?  If it is
>  imagery, then I would imagine that Google's use of map-reduce to prepare
>  image tiles for Google maps would be an interesting example.
>
Not imagery, just raw data from the instruments.

>
>  On 3/17/08 11:12 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
>  wrote:
>
>
>  > Another thing we want to consider is to make our simple grid aware
>  > of the data location, in order to move the task to the node which
>  > contains the data: a way of getting the hostname where the
>  > filename-block is, and then calling the dfs API from that node.
>
>


Re: [core-user] Processing binary files Howto??

2008-03-18 Thread Enis Soztutar

Hi, please see below,

Ted Dunning wrote:

This sounds very different from your earlier questions.

If you have a moderate (10's to 1000's) number of binary files, then it is
very easy to write a special purpose InputFormat that tells hadoop that the
file is not splittable.  

@ Ted,
   actually we have MultiFileInputFormat and MultiFileSplit for exactly 
this :)


@ Alfonso,
   The core of Hadoop does not care about the source of the data (files,
databases, etc.). The map and reduce functions operate on records, which are
just key/value pairs. The job of the InputFormat/InputSplit/RecordReader
interfaces is to map the actual data source to records.

So, if each file contains a few records, no record is split across two files,
and the total number of files is on the order of ten thousand, you can extend
MultiFileInputFormat to return a RecordReader that extracts records from these
binary files.

If the above does not apply, you can concatenate all the files into a smaller
number of files and then use FileInputFormat. Your RecordReader implementation
is then responsible for finding the record boundaries and extracting the
records.
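
As a rough illustration of the InputFormat/RecordReader pattern in its
simplest form (one whole, unsplittable file per record), here is an untested
sketch against the 0.16-era org.apache.hadoop.mapred API. The class names are
made up, exact generic signatures may differ slightly between releases, and
the same idea can be built on MultiFileInputFormat as described above. Each
map call then receives the file name as the key and the raw file bytes as the
value (so each file is assumed to fit in memory):

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Sketch: treat every input file as a single record (filename -> contents).
    public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

        // Never split a file across map tasks.
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;
        }

        public RecordReader<Text, BytesWritable> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            return new WholeFileRecordReader((FileSplit) split, job);
        }

        static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
            private final FileSplit split;
            private final JobConf job;
            private boolean done = false;

            WholeFileRecordReader(FileSplit split, JobConf job) {
                this.split = split;
                this.job = job;
            }

            public boolean next(Text key, BytesWritable value) throws IOException {
                if (done) return false;
                Path path = split.getPath();
                FileSystem fs = path.getFileSystem(job);
                byte[] contents = new byte[(int) split.getLength()];
                FSDataInputStream in = fs.open(path);
                try {
                    in.readFully(0, contents);  // whole file in memory, by assumption
                } finally {
                    in.close();
                }
                key.set(path.toString());
                value.set(contents, 0, contents.length);
                done = true;
                return true;
            }

            public Text createKey() { return new Text(); }
            public BytesWritable createValue() { return new BytesWritable(); }
            public long getPos() { return done ? split.getLength() : 0; }
            public float getProgress() { return done ? 1.0f : 0.0f; }
            public void close() throws IOException { }
        }
    }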


In either option, storing the files in DFS and using map-red is a wise choice,
since mapred over dfs already has locality optimizations. But if you must, you
can distribute the files to the nodes manually and implement an ad hoc
Partitioner which ensures the map task is executed on the node that has the
relevant files.



This allows you to add all of the files as inputs
to the map step and you will get the locality that you want.  The files
should be large enough so that you take at least 10 seconds or more
processing them to get good performance relative to startup costs.  If they
are not, then you may want to package them up in a form that can be read
sequentially.  This need not be splittable, but it would be nice if it were.

If you are producing a single file per hour, then this style works pretty
well.  In my own work, we have a few compressed and encrypted files each
hour that are map-reduced into a more congenial and splittable form each
hour.  Then subsequent steps are used to aggregate or process the data as
needed.

This gives you all of the locality that you were looking for.


On 3/17/08 6:19 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
wrote:

  

Hi there.

After reading a bit about the hadoop framework and trying the WordCount
example, I have several doubts about how to use map/reduce with binary files.

In my case binary files are generated on a timeline basis, let's say 1 file
per hour. The size of each file varies (briefly: we are getting pictures from
space and the star density differs between observations). The mappers, rather
than receiving the file content, have to receive the file name. I read that if
the input files are big (several blocks), they are split among several tasks
on the same or different node(s) (block sizes?). But we want each map task to
process a whole file rather than a block (or a line of a file, as in the
WordCount sample).

In a previous post I made to this forum, I was recommended to use an input
file with all the file names, so the mappers would receive the file names.
But there is a drawback related to data location (this was also mentioned),
because data then has to be moved from one node to another. Data is not going
to be replicated to all the nodes. So if a task taskA has to process fileB on
nodeN, it has to be executed on nodeN. How can we achieve that? What if a task
requires a file that is on another node - does the framework move the logic to
that node? We would need to define a URI file map on each node
(hostname/path/filename) for all the files; tasks would consult the local URI
file map in order to process the files.

Another approach we have thought of is to use the distributed file system to
load-balance the data among the nodes, and to have our own processes running
on every node (without using the map/reduce framework). Each process would
then access the local node to process the data, using the dfs API (or checking
the local URI file map). This approach would be more flexible for us because,
depending on the machine (quad-core, dual-core), we know how many java threads
we can run to get the maximum performance out of it. Using the framework we
can only specify a number of tasks to be executed on every node, and it has to
be the same for all the nodes.

URI file map.
Once the files are copied to the distributed file system, we then need to
create this map table. Or is there a way to access a directory at a data node
and retrieve only the files that node handles, rather than getting all the
files on all the nodes in that directory? i.e.

NodeA  /tmp/.../mytask/input/fileA-1
/tmp/.../mytask/input/fileA-2

NodeB /tmp/.../mytask/input/fileB

A process at nodeB listing the /tmp/.../input directory, would get only fileB

Any ideas?
Thanks
Alfonso.




  


Issue with cluster over EC2 and different AMI types

2008-03-18 Thread Andrey Pankov

Hi all,

I'm trying to configure a Hadoop cluster on Amazon EC2, with one m1.small
instance for the master node and some m1.large instances for slaves. Both the
master's and the slaves' AMIs have the same version of Hadoop, 0.16.0.


I start the EC2 instances using ec2-run-instances with the same --group
parameter, but in two steps: one call to start the master, and a second call
to start the slaves.


It looks like EC2 instances with different AMI types start in different
networks; for example, the external and internal DNS names are:


  * ec2-67-202-59-12.compute-1.amazonaws.com
ip-10-251-74-181.ec2.internal - for small instance
  * ec2-67-202-3-191.compute-1.amazonaws.com
domU-12-31-38-00-5C-C1.compute-1.internal - for large

The trouble is that the slaves cannot contact the master. When I set the
fs.default.name parameter in hadoop-site.xml on a slave box to the full DNS
name of the master (either external or internal) and try to start a datanode
on it (bin/hadoop-daemon.sh ... start datanode), Hadoop replaces
fs.default.name with just 'ip-10-251-74-181' and puts this in the log:


2008-03-18 07:08:16,028 ERROR org.apache.hadoop.dfs.DataNode: 
java.net.UnknownHostException: unknown host: ip-10-251-74-181

...

So DataNode could not be started.

I tried putting the IP address of ip-10-251-74-181 into /etc/hosts on each
slave instance, and that helped to start the DataNode on the slaves. It also
became possible to store something in HDFS. But when I try to run a map-reduce
job (from a jar file), it doesn't work: the job keeps running but makes no
progress at all. Hadoop prints Map 0% Reduce 0% and just freezes.


I cannot find anything helpful in the logs, either on the master or on the
slave boxes.


I found that dfs.network.script could be used to specify a network location
for a machine, but I have no idea how to use it. Could racks help me here?


Thanks in advance.

---
Andrey Pankov




Re: my questions

2008-03-18 Thread Eddie C
Hadoop is dependent on having a whoami binary. I never really understood this;
it causes problems on Windows as well, and I am not sure whether you can
specify the user instead. I would suggest creating your own whoami shell
script and making its output match the Linux whoami output.


2008/3/18  <[EMAIL PROTECTED]>:
> On Solaris 10, it had some problems with the "bin/hadoop jar 
> hadoop-0.16.1-examples.jar grep input output 'dfs[a-z.]+'"
>  [rest of the error log snipped; it is identical to the log in the original
> "my questions" message at the top of this digest]

Re: Issue with cluster over EC2 and different AMI types

2008-03-18 Thread Andrey Pankov

Hi,

I apologize - it was my fault, I forgot to run the tasktracker on the slaves.
But anyway, can anyone share their experience with how to use racks?
Thanks.

Andrey Pankov wrote:

[original message quoted in full above - snipped]




---
Andrey Pankov


Re: User information in 0.16.1 problems

2008-03-18 Thread Erwan Arzur
Hey,

I was hit by the same issue. I have been able to run a MiniDFSCluster for my
own unit tests under Windows by setting:

conf.set ("hadoop.job.ugi", "hadoop,hadoop");

That forces UnixUserGroupInformation to use the specified user and group.

I think that UnixUserGroupInformation is not portable enough, even between
Unixes (as one user reports, whoami cannot be found under Solaris).

Maybe using System.getProperty ("user.name") in the UnixUserGroupInformation
class would solve that specific problem and be more portable?

That doesn't solve the groups listing which uses "bash -c groups" ... wow. I
know a few BOFHs who would try to bite anyone daring to install bash on
their boxes :-)

I am not too familiar with the intricacies of UnixUserGroupInformation and
hadoop's permissions, but maybe falling back to the group "nobody" when the
command fails would work?

Erwan

On Sun, Mar 16, 2008 at 1:30 PM, Naama Kraus <[EMAIL PROTECTED]> wrote:

> Hi All
>
> I've been upgrading from 0.15.3 to 0.16.1. I ran some tests in the local
> mode (no HDFS). The tests perform some MapReduce jobs. I ran into the
> following problems:
>
> On Windows it complained that it does not find the 'whoami' command. Once
> I
> fixed that, it did work. BUT, it didn't use to complain in 0.15.3.
> On Linux, it had some problems with the 'id' command - "Failed to get the
> current user's information: Login failed: id: cannot find name for group
> ID
> 1102921537" which it didn't use to in 0.15.3.
>
> Both look related to getting user information.
>
> My question is what has changed in 0.16.1 and whether there are new
> requirements now for being able to submit jobs .
>
> Thanks for any help,
> Naama
>
> --
> oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> 00 oo 00 oo
> "If you want your children to be intelligent, read them fairy tales. If
> you
> want them to be more intelligent, read them more fairy tales." (Albert
> Einstein)
>


Re: my questions

2008-03-18 Thread Erwan Arzur
conf.set ("hadoop.job.ugi", "hadoop,hadoop");

did the trick for me in some unit tests I am writing. I still have problems
running a MiniDFSCluster of my own.

I guess that, with 0.16.1, setting this on any Windows installation would be
mandatory.

This prevents UnixUserGroupInformation from trying to run 'whoami' and from
trying to get group membership information.

I think it would be more portable to use System.getProperty ("user.name")
than executing whoami.

The group problem remains ... exec'ing "bash -c groups" can hardly be called
portable :-)

Erwan

2008/3/18 Eddie C <[EMAIL PROTECTED]>:

> hadoop is dependant on having a whoami binary. I never really
> understood this it makes problems on windows as well I am not sure if
> you can specify the user. I would suggest creating your own whoami
> shell script and make it match the linux whoami output.
>
>
> 2008/3/18  <[EMAIL PROTECTED]>:
> > On Solaris 10, it had some problems with the "bin/hadoop jar
> hadoop-0.16.1-examples.jar grep input output 'dfs[a-z.]+'"
> >  [rest of the error log snipped; it is identical to the log in the original
> > "my questions" message at the top of this digest]

HDFS: how to append

2008-03-18 Thread Cagdas Gerede
The HDFS documentation says it is possible to append to an HDFS file.

In the org.apache.hadoop.dfs.DistributedFileSystem class, there is no method
to open an existing file for writing (there are methods for reading). The only
similar methods are the "create" methods, which return an FSDataOutputStream.
When I look at the FSDataOutputStream class, it seems there is no "append"
method, and all "write" methods overwrite an existing file or return an error
if such a file exists.

Does anybody know how to append to a file in HDFS?
I appreciate your help.
Thanks,

Cagdas


RE: HDFS: how to append

2008-03-18 Thread dhruba Borthakur
HDFS files, once created, cannot be modified in any way. Appends to HDFS
files will probably be supported in a future release in the next couple
of months.

Thanks,
dhruba
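
For reference, a minimal sketch of what the current client API does offer:
FileSystem.create() returns an FSDataOutputStream for a new file, so today
"appending" effectively means rewriting the file. The path and class name
below are illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteOnceExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path p = new Path("/tmp/example.txt");  // illustrative path

            // create() is the only way to obtain an output stream in this
            // release; there is no append(), so updating a file means
            // rewriting it (here with overwrite = true).
            FSDataOutputStream out = fs.create(p, true);
            out.writeBytes("hello\n");
            out.close();
        }
    }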

-Original Message-
From: Cagdas Gerede [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 18, 2008 9:53 AM
To: core-user@hadoop.apache.org
Subject: HDFS: how to append

The HDFS documentation says it is possible to append to an HDFS file.

In org.apache.hadoop.dfs.DistributedFileSystem class,
there is no method to open an existing file for writing (there are
methods for reading).
Only similar methods are "create" methods which return
FSDataOutputStream.
When I look at FSDataOutputStream class, it seems there is no "append"
method, and
all "write" methods overwrite an existing file or return an error if
such a file exists.

Does anybody know how to append to a file in HDFS?
I appreciate your help.
Thanks,

Cagdas


HDFS: Flash Application and Available APIs

2008-03-18 Thread Cagdas Gerede
I have two questions:

- I was wondering whether an HDFS client can be invoked from a Flash
application.
- What are the available APIs for HDFS? I read that there is a C/C++ API for
Hadoop Map/Reduce, but is there a C/C++ API for HDFS, or can it only be
invoked from a Java application?


Thanks for your help,
Cagdas


Re: libhdfs working for test program when run from ant but failing when run individually

2008-03-18 Thread Arun C Murthy


On Mar 14, 2008, at 11:48 PM, Raghavendra K wrote:


Hi,
  My apologies for bugging the forum again and again.
I am able to get the sample program for libhdfs working. I followed these
steps:

---> compiled using ant
---> modified test-libhdfs.sh to include CLASSPATH, HADOOP_HOME,
HADOOP_CONF_DIR, HADOOP_LOG_DIR, LIBHDFS_BUILD_DIR (since I ran
test-libhdfs.sh individually and didn't invoke it from ant)
---> the program ran successfully and was able to write, read and so on.

Now I copy the same program to a different directory, use the same Makefile
(used by ant) with the variables modified accordingly, and run 'make test'.
It compiles successfully. I then use the same test-libhdfs.sh to invoke
hdfs_test, but now it fails with a Segmentation Fault.
I don't know where it is going wrong.
Can't libhdfs be compiled without using ant? I want to test it and integrate
libhdfs with my program.
Please do reply and help me out as this is driving me crazy.


I can only assume there is something wrong with the values you are passing
for the requisite environment variables: OS_{NAME|OS_ARCH}, SHLIB_VERSION,
LIBHDFS_VERSION, HADOOP_{HOME|CONF_DIR|LOG_DIR}, since it works when you run
'make test'.

Sorry it isn't of much help... could you share the values you are using for
these?


Arun



Thanks in advance.

--
Regards,
Raghavendra K




Re: HadoopDfsReadWriteExample

2008-03-18 Thread Cagdas Gerede
For people who would have a similar problem:

I realized
org.apache.hadoop.fs.DF class documentation says
"Filesystem disk space usage statistics. Uses the unix 'df' program.
Tested on Linux, FreeBSD, Cygwin."

As a result, I ran the java command for my HDFS-accessing application from
Cygwin, and it worked fine.

Cagdas

On Thu, Mar 13, 2008 at 10:33 AM, Cagdas Gerede <[EMAIL PROTECTED]> wrote:
> I tried HadoopDfsReadWriteExample. I am getting the following error. I
>  appreciate any help. I provide more info at the end.
>
>
>  Error while copying file
>  Exception in thread "main" java.io.IOException: Cannot run program "df": CreateProcess error=2, The system cannot find the file specified
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>     at java.lang.Runtime.exec(Runtime.java:593)
>     at java.lang.Runtime.exec(Runtime.java:466)
>     at org.apache.hadoop.fs.ShellCommand.runCommand(ShellCommand.java:48)
>     at org.apache.hadoop.fs.ShellCommand.run(ShellCommand.java:42)
>     at org.apache.hadoop.fs.DF.getAvailable(DF.java:72)
>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:326)
>     at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:155)
>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.newBackupFile(DFSClient.java:1483)
>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.openBackupStream(DFSClient.java:1450)
>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:1592)
>     at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:140)
>     at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:122)
>     at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1728)
>     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
>     at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
>     at HadoopDFSFileReadWrite.main(HadoopDFSFileReadWrite.java:106)
>  Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
>     at java.lang.ProcessImpl.create(Native Method)
>     at java.lang.ProcessImpl.<init>(ProcessImpl.java:81)
>     at java.lang.ProcessImpl.start(ProcessImpl.java:30)
>     at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>     ... 17 more
>
>
>
>  Note: I am on a Windows machine. The namenode is running in the same
>  Windows machine. The way I initialized the configuration is:
>
> Configuration conf = new Configuration();
> conf.addResource(new
>  Path("C:\\cygwin\\hadoop-management\\hadoop-conf\\hadoop-site.xml"));
> FileSystem fs = FileSystem.get(conf);
>
>
>  Any suggestions?
>
>  Cagdas
>


Re: Hadoop Virtual Image: ipc.Client: Retrying connect to server:

2008-03-18 Thread Sushma Rao
Hi,

I'm using the Hadoop distribution 0.13.0 that comes with the VM image.

While trying to run the simple WordCount example, I get the following error:
ipc.Client: Retrying connect to server:x times

Can someone please tell what is wrong and how I could rectify the problem?

Thanks a lot


Hadoop Virtual Image: ipc.Client: Retrying connect to server:

2008-03-18 Thread Sushma Rao

Hi,

I'm using the Hadoop distribution 0.13.0 that comes with the VM image.
While trying to run the simple WordCount example, I get the following error:

ipc.Client: Retrying connect to server:x times

Can someone please tell what is wrong and how I could rectify the problem?

Thanks a lot


streaming problem

2008-03-18 Thread Andreas Kostyrka
Hi!

I'm trying to run a streaming job on Hadoop 0.16.0, and I've distributed the
scripts to be used to all the nodes:

time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper 
~/dist/workloadmf -reducer NONE -input testlogs/* -output testlogs-output

Now, this gives me:

java.io.IOException: log:null
R/W/S=1/0/0 in:0=1/2 [rec/s] out:0=0/2 [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=hadoop
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Tue Mar 18 21:06:13 GMT 2008
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:260)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)

at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:107)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)

Any ideas what my problems could be?

TIA,

Andreas


signature.asc
Description: Dies ist ein digital signierter Nachrichtenteil


Re: streaming problem

2008-03-18 Thread Andreas Kostyrka
Some additional details if it's helping, the HDFS is hosted on AWS S3,
and the input file set consists of 152 gzipped Apache log files.

Thanks,

Andreas

Am Dienstag, den 18.03.2008, 22:17 +0100 schrieb Andreas Kostyrka:
> Hi!
> 
> I'm trying to run a streaming job on Hadoop 1.16.0, I've distributed the
> scripts to be used to all nodes:
> 
> time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper 
> ~/dist/workloadmf -reducer NONE -input testlogs/* -output testlogs-output
> 
> Now, this gives me:
> 
> java.io.IOException: log:null
> R/W/S=1/0/0 in:0=1/2 [rec/s] out:0=0/2 [rec/s]
> minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
> HOST=null
> USER=hadoop
> HADOOP_USER=null
> last Hadoop input: |null|
> last tool output: |null|
> Date: Tue Mar 18 21:06:13 GMT 2008
> java.io.IOException: Broken pipe
>   at java.io.FileOutputStream.writeBytes(Native Method)
>   at java.io.FileOutputStream.write(FileOutputStream.java:260)
>   at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>   at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>   at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
>   at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>   at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
>   at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)
> 
> 
>   at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:107)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
>   at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)
> 
> Any ideas what my problems could be?
> 
> TIA,
> 
> Andreas


signature.asc
Description: Dies ist ein digital signierter Nachrichtenteil


Fastest way to do grep via hadoop streaming

2008-03-18 Thread Theodore Van Rooy
I've been benchmarking hadoop streaming against just regular old command
line grep.

I set the job to run 4 tasks at a time per box, with one box (with 4
processors). The file is a 54 GB file with <100 bytes per line (DFS block
size 128 MB). I grep for an item that shows up in about 2% of the lines in
the data set.

And then I set
-mapper "/bin/grep myregexp"
-numReduceTasks 0

MapReduce gives me a time to complete on average of about 45 minutes.

Command Line Unix gives me a time to complete of about 7 minutes.

Then I did the same with a much smaller file (1 GB) and still got MR = 3
minutes, Linux = 7 seconds.

Does anyone know of a better/faster way to do grep via streaming?

Is there a better, more optimized version written in Java or Python?

Last, why would the method I am using take so long?  I've determined that
some of the time is write time (output) from the mappers... but could it
really be that much overhead due to read time?

Thanks for your help!
-- 
Theodore Van Rooy
http://greentheo.scroggles.com
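
On the "is there a Java version" question: the examples jar used in the "my
questions" thread above already ships a Java grep job, built on a regex mapper
in the mapred lib package. A rough, untested sketch of driving it directly
(the configuration key name and some of the JobConf calls are quoted from
memory and the argument handling is illustrative, so double-check against your
release):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.RegexMapper;

    public class JavaGrep {
        public static void main(String[] args) throws Exception {
            // args: <input dir> <output dir> <regex>
            JobConf conf = new JobConf(JavaGrep.class);
            conf.setJobName("java-grep");
            conf.setMapperClass(RegexMapper.class);
            conf.set("mapred.mapper.regex", args[2]);  // pattern to match
            conf.setNumReduceTasks(0);                 // map-only, like the streaming job
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(LongWritable.class);
            conf.setInputPath(new Path(args[0]));
            conf.setOutputPath(new Path(args[1]));
            JobClient.runJob(conf);
        }
    }

This avoids the per-line pipe and flush overhead of streaming, though a
distributed job on a single box is still unlikely to beat one local grep.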


RE: Fastest way to do grep via hadoop streaming

2008-03-18 Thread Joydeep Sen Sarma
I hope this is not an error in setup - but many multiples worse is not
surprising (though not nice).

Just think about the number of times hadoop will copy/scan the data around (as
opposed to 'grep', which is probably ultra-optimized by this time):

- starting from getting bytes out of a file - they will first be buffered in a 
java buffered stream (copy #1)
- then the buffered stream will be scanned for lines worth of data and then 
copied into a Text (#2)
- the Text will then be written out to a buffered output stream (#3) to the 
streaming script.
- perhaps, someone will tell me why the buffered output stream is flushed every 
iteration by Streaming - but it is:
clientOut_.flush();
  in any case - that's likely a system call every single line of input data 
that copies into kernel space (#4)

Once the data comes out of grep we get another bunch of copies - but who
cares, it's 2% of the data.

I don't know the dfs stack well enough to count copies there, but we can
probably bet that there are quite a few there as well (for one, we will be
scanning the data at least once to do the crc check).

With 4 threads pounding the cpu and so much copying going around (and this is
not counting that java itself is reputedly memory intensive), we are probably
memory bound by this time (which shows up as cpu bound).

sigh.




-Original Message-
From: Theodore Van Rooy [mailto:[EMAIL PROTECTED]
Sent: Tue 3/18/2008 3:09 PM
To: core-user@hadoop.apache.org
Subject: Fastest way to do grep via hadoop streaming
 
I've been benchmarking hadoop streaming against just regular old command
line grep.

I set the job to run 4 tasks at a time per box, with one box (with 4
processors).  The file is a 54 Gb file with <100 bytes per line (DFS block
size 128 MB).  I grep an item that shows up in about 2% of the lines in the
data set.

And then I set
-mapper "/bin/grep myregexp"
-numReduceTasks 0

MapReduce gives me a time to complete on average of about 45 minutes.

Command Line Unix gives me a time to complete of about 7 minutes.

Then I did the same with a much smaller file (1 GB) and still got MR=3min,
Linux=7seconds)

Does anyone know of a better/faster way to do grep via streaming?

Is there a better, more optimized version written in Java or Python?

Last, why would the method I am using take so long?  I've determined that
some of the time is write time (output) from the mappers... but could it
really be that much overhead due to read time?

Thanks for your help!
-- 
Theodore Van Rooy
http://greentheo.scroggles.com



Partitioning reduce output by date

2008-03-18 Thread Otis Gospodnetic
Hi,

What is the best/right way to handle partitioning of the final job output (i.e. 
output of reduce tasks)?  In my case, I am processing logs whose entries 
include dates (e.g. "2008-03-01foobarbaz").  A single log file may 
contain a number of different dates, and I'd like to group reduce output by 
date so that, in the end, I have not a single part-x file but, say, 
2008-03-01.txt, 2008-03-02.txt, and so on, one file for each distinct date.

If it helps, the keys in my job include the dates from the input logs, so I 
could parse the dates out of the keys in the reduce phase, if that's the thing 
to do.

I'm looking at OutputFormat and RecordWriter, but I'm not sure if that's the 
direction I should pursue.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Limiting Total # of TaskTracker threads

2008-03-18 Thread Jimmy Wan

The properties mentioned here: http://wiki.apache.org/hadoop/FAQ#13

have been deprecated in favor of two separate properties:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum

I'd like to limit the total # of threads on a task tracker (think limited  
resources on a given compute node) to a given number, and there does not  
appear to be a way to do that anymore. Am I correct in my understanding  
that there is no capability to do this?


--
Jimmy


Re: Partitioning reduce output by date

2008-03-18 Thread Arun C Murthy


On Mar 18, 2008, at 4:35 PM, Otis Gospodnetic wrote:


Hi,

What is the best/right way to handle partitioning of the final job  
output (i.e. output of reduce tasks)?  In my case, I am processing  
logs whose entries include dates (e.g. "2008-03-01foobar 
baz").  A single log file may contain a number of different dates,  
and I'd like to group reduce output by date so that, in the end, I  
have not a single part-x file but, say, 2008-03-01.txt,  
2008-03-02.txt, and so on, one file for each distinct date.




You want a custom partitioner...
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Partitioner


Arun

If it helps, the keys in my job include the dates from the input  
logs, so I could parse the dates out of the keys in the reduce  
phase, if that's the thing to do.


I'm looking at OutputFormat and RecordWriter, but I'm not sure if  
that's the direction I should pursue.


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: Limiting Total # of TaskTracker threads

2008-03-18 Thread Arun C Murthy


On Mar 18, 2008, at 4:41 PM, Jimmy Wan wrote:


The properties mentioned here: http://wiki.apache.org/hadoop/FAQ#13

have been deprecated in favor of two separate properties:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum



I've updated the wiki to reflect those... sorry you got misled.

I'd like to limit the total # of threads on a task tracker (think  
limited resources on a given compute node) to a given number, and  
there does not appear to be a way to do that anymore. Am I correct  
in my understanding that there is no capability to do this?




The map/reduce tasks are not threads, they are run in separate JVMs  
which are forked by the tasktracker.


OTOH, there are other threads (RPC etc.) - are you looking at  
limiting those?


Arun


--
Jimmy




Re: Partitioning reduce output by date

2008-03-18 Thread Ted Dunning

I think that a custom partitioner is half of the answer.  The other half is
that the reducer can open and close output files as needed.  With the
partitioner, only one file need be kept open at a time.  It is good practice
to open the files relative to the task directory so that process failure is
handled correctly.

These files are called task side effect files and are documented here:

http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Task+Side-Effect+Files


On 3/18/08 5:17 PM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:

>> I have not a single part-x file but, say, 2008-03-01.txt,
>> 2008-03-02.txt, and so on, one file for each distinct date.
>> 
> 
> You want a custom partitioner...
> http://hadoop.apache.org/core/docs/current/
> mapred_tutorial.html#Partitioner



Re: Limiting Total # of TaskTracker threads

2008-03-18 Thread Ted Dunning

I think the original request was to limit the sum of maps and reduces rather
than limiting the two parameters independently.

Clearly, with a single job running at a time, this is a non-issue since
reducers don't do much until the maps are done.  With multiple jobs it is a
bit more of an issue.


On 3/18/08 5:26 PM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:

>> I'd like to limit the total # of threads on a task tracker (think
>> limited resources on a given compute node) to a given number, and
>> there does not appear to be a way to do that anymore. Am I correct
>> in my understanding that there is no capability to do this?
>> 
> 
> The map/reduce tasks are not threads, they are run in separate JVMs
> which are forked by the tasktracker.



Re: Partitioning reduce output by date

2008-03-18 Thread Otis Gospodnetic
Thanks for the pointer, Arun.  Earlier, I did look at Partitioner in the 
tutorial:

"Partitioner controls the partitioning of the keys of the   
intermediate map-outputs. The key (or a subset of the key) is used to   
derive the partition, typically by a hash function. The total   number 
of partitions is the same as the number of reduce tasks for the   job. 
Hence this controls which of the m reduce tasks the   intermediate key 
(and hence the record) is sent to for reduction."
 
This makes it sound like the Partitioner is only for intermediate map-outputs,
and not for the outputs of reduces. Also, it sounds like the number of
distinct partitions is tied to the number of reduces. But what if my job uses,
say, only 2 reduce tasks, and my input has 100 distinct dates, and as a result
I want to end up with 100 distinct output files?

Also, is there a way to specify the names of the final output files (so that
each of the 100 output files can have its own distinct name, in my case in the
yyyy-mm-dd format)?

If this is explained somewhere, please point me to it. If it's not, I can
document it once I have it working.

Thanks,
Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Arun C Murthy <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, March 18, 2008 8:17:32 PM
Subject: Re: Partitioning reduce output by date


On Mar 18, 2008, at 4:35 PM, Otis Gospodnetic wrote:

> Hi,
>
> What is the best/right way to handle partitioning of the final job  
> output (i.e. output of reduce tasks)?  In my case, I am processing  
> logs whose entries include dates (e.g. "2008-03-01foobar 
> baz").  A single log file may contain a number of different dates,  
> and I'd like to group reduce output by date so that, in the end, I  
> have not a single part-x file but, say, 2008-03-01.txt,  
> 2008-03-02.txt, and so on, one file for each distinct date.
>

You want a custom partitioner...
http://hadoop.apache.org/core/docs/current/ 
mapred_tutorial.html#Partitioner

Arun

> If it helps, the keys in my job include the dates from the input  
> logs, so I could parse the dates out of the keys in the reduce  
> phase, if that's the thing to do.
>
> I'm looking at OutputFormat and RecordWriter, but I'm not sure if  
> that's the direction I should pursue.
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>






Re: Partitioning reduce output by date

2008-03-18 Thread Martin Traverso
> This makes it sound like the Partitioner is only for intermediate
> map-outputs, and not outputs of reduces.  Also, it sounds like the number of
> distinct partitions is tied to the number of reduces.  But what if my job
> uses, say, only 2 reduce tasks, and my input has 100 distinct dates, and as
> the result, I want to end up with 100 distinct output files?
>

Check out https://issues.apache.org/jira/browse/HADOOP-2906

Martin


Re: Partitioning reduce output by date

2008-03-18 Thread Ted Dunning

Also see my comment about side effect files.

Basically, if you partition on date, then each set of values in the reduce
will have the same date.  Thus the reducer can open a file, write the
values, close the file (repeat).

This gives precisely the effect you were seeking.
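
A rough sketch of the two halves together, against the old mapred API (the
class name is made up, and the fixed-width date prefix of the key is an
assumption about the key layout): a partitioner that buckets records by the
date prefix, so each reducer sees its dates in contiguous sorted runs and can
open one side-effect file per date, closing it when the date changes:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Route records to reducers by the leading "yyyy-mm-dd" of the key, so a
    // given date never straddles two reducers.
    public class DatePartitioner implements Partitioner<Text, Text> {

        public void configure(JobConf job) { }  // nothing to configure

        public int getPartition(Text key, Text value, int numPartitions) {
            String date = key.toString().substring(0, 10);  // assumes the key starts with the date
            return (date.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

The job would register it with conf.setPartitionerClass(DatePartitioner.class);
since keys arrive at each reducer sorted, the reducer just watches for the
date to change and rolls over to a new file named after the new date.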


On 3/18/08 6:17 PM, "Martin Traverso" <[EMAIL PROTECTED]> wrote:

>> This makes it sound like the Partitioner is only for intermediate
>> map-outputs, and not outputs of reduces.  Also, it sounds like the number of
>> distinct partitions is tied to the number of reduces.  But what if my job
>> uses, say, only 2 reduce tasks, and my input has 100 distinct dates, and as
>> the result, I want to end up with 100 distinct output files?
>> 
> 
> Check out https://issues.apache.org/jira/browse/HADOOP-2906
> 
> Martin



Re: streaming problem

2008-03-18 Thread Amareshwari Sriramadasu

Hi Andreas,
Looks like your mapper is not available to the streaming jar. Where is your
mapper script? Did you use the distributed cache to distribute the mapper?
You can use -file <path to your script> to make it part of the job jar, or use
-cacheFile /dist/workloadmf#workloadmf to distribute the script. Distributing
it this way will add your script to the PATH.

So now your command will be:

time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper
workloadmf -reducer NONE -input testlogs/* -output testlogs-output -cacheFile
/dist/workloadmf#workloadmf

or

time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper
workloadmf -reducer NONE -input testlogs/* -output testlogs-output -file
<path to your script>

Thanks,
Amareshwari

Andreas Kostyrka wrote:

Some additional details if it's helping, the HDFS is hosted on AWS S3,
and the input file set consists of 152 gzipped Apache log files.

Thanks,

Andreas

Am Dienstag, den 18.03.2008, 22:17 +0100 schrieb Andreas Kostyrka:
  

Hi!

I'm trying to run a streaming job on Hadoop 1.16.0, I've distributed the
scripts to be used to all nodes:

time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper 
~/dist/workloadmf -reducer NONE -input testlogs/* -output testlogs-output

Now, this gives me:

java.io.IOException: log:null
R/W/S=1/0/0 in:0=1/2 [rec/s] out:0=0/2 [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=hadoop
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Tue Mar 18 21:06:13 GMT 2008
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:260)
at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)


at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:107)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)

Any ideas what my problems could be?

TIA,

Andreas