Re: How to configure RandomWriter to generate less amount of data

2008-06-30 Thread Amar Kamat

Heshan Lin wrote:

Hi,

I'm trying to configure RandomWriter to generate less data than the 
default configuration does. 
bin/hadoop jar hadoop-*-examples.jar randomwriter 
-Dtest.randomwrite.bytes_per_map=<bytes per map> 
-Dtest.randomwrite.total_bytes=<total bytes> 
-Dtest.randomwriter.maps_per_host=<maps per host> 
The number of maps that will be spawned in this case will be 
total_bytes/bytes_per_map.
Other parameters are test.randomwrite.min_key (size in bytes), 
test.randomwrite.max_key (size in bytes), test.randomwrite.min_value 
(size in bytes) and test.randomwrite.max_value (size in bytes).

Amar
I created a job configuration file job.xml and added the variables 
given at http://wiki.apache.org/hadoop/RandomWriter. I tried a couple of 
ways of running the program (below), but the settings in job.xml were 
not picked up by RandomWriter.


1) bin/hadoop jar hadoop-*-examples.jar randomwriter rand job.xml
2) bin/hadoop jar hadoop-*-examples.jar randomwriter rand --conf job.xml
3) bin/hadoop jar --conf job.xml hadoop-*-examples.jar randomwriter rand

Passing property values via the -D option didn't seem to work either. 
Can anybody advise on how to use the job configuration file properly?


Thanks,
Heshan




How to configure RandomWriter to generate less amount of data

2008-06-30 Thread Heshan Lin

Hi,

I'm trying to configure RandomWriter to generate less data than the 
default configuration does. I created a job configuration file job.xml 
and added the variables given at http://wiki.apache.org/hadoop/RandomWriter. 
I tried a couple of ways of running the program (below), but the settings 
in job.xml were not picked up by RandomWriter.


1) bin/hadoop jar hadoop-*-examples.jar randomwriter rand job.xml
2) bin/hadoop jar hadoop-*-examples.jar randomwriter rand --conf job.xml
3) bin/hadoop jar --conf job.xml hadoop-*-examples.jar randomwriter rand

Passing property values via the -D option didn't seem to work either.  
Can anybody advise on how to use the job configuration file properly?


Thanks,
Heshan


Should there be a way not maintaining the whole namespace structure in memory?

2008-06-30 Thread heyongqiang
In the current HDFS implementation, all INodeFile and INodeDirectory objects are 
loaded into memory; this happens when the in-memory namespace structure is set 
up at namenode startup, as the namenode reads the fsimage file and the edit log 
file. How can this be handled if there are millions of files or directories?

I ran an experiment that creates directories. Before the experiment:
[EMAIL PROTECTED] bin]$ ps -p 9122 -o rss,size,vsize,%mem
   RSS      SZ     VSZ %MEM
153648 1193868 1275340  3.7

After creating the directories, it became:
[EMAIL PROTECTED] bin]$ ps -p 9122 -o rss,size,vsize,%mem
   RSS      SZ     VSZ %MEM
169084 1193868 1275340  4.0

I am trying to improve the fsimage format so that the namenode can locate and 
load the needed information on demand; just like the Linux VFS, we would keep 
only an inode cache. This would avoid loading the whole namespace structure at 
startup.
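
A minimal sketch of the kind of bounded, on-demand inode cache described above, 
with a hypothetical INode stand-in; this is illustrative only, not actual 
namenode code:

import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical on-demand inode cache: keeps at most maxEntries inodes in memory. */
public class InodeCache {
  interface INode {}                       // stand-in for the real INode types

  private final int maxEntries;
  private final Map<String, INode> cache;

  public InodeCache(final int maxEntries) {
    this.maxEntries = maxEntries;
    // An access-ordered LinkedHashMap gives a simple LRU eviction policy.
    this.cache = new LinkedHashMap<String, INode>(16, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry<String, INode> eldest) {
        return size() > InodeCache.this.maxEntries;
      }
    };
  }

  public synchronized INode get(String path) {
    INode inode = cache.get(path);
    if (inode == null) {
      inode = loadFromImage(path);         // locate and read it from fsimage on demand
      cache.put(path, inode);
    }
    return inode;
  }

  private INode loadFromImage(String path) {
    // The interesting part of the proposal: an fsimage layout that supports
    // random access by path. Left unimplemented here.
    return new INode() {};
  }
}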




Best regards,
 
Yongqiang He
2008-07-01

Email: [EMAIL PROTECTED]
Tel:   86-10-62600966(O)
 
Research Center for Grid and Service Computing,
Institute of Computing Technology, 
Chinese Academy of Sciences
P.O.Box 2704, 100080, Beijing, China 


Re: Too many fetch failures AND Shuffle error

2008-06-30 Thread Amar Kamat

Tarandeep Singh wrote:

I am getting this error as well.
As Sayali mentioned in his mail, I updated the /etc/hosts file with the
slave machines IP addresses, but I am still getting this error.

Amar, which is the url that you were talking about in your mail -
"There will be a URL associated with a map that the reducer try to fetch
(check the reducer logs for this url)"

Please tell me where I should look for it... I will try to access it
manually to see if this error is due to a firewall.
  
One thing you can do is to see whether all the maps that fail while 
fetching are on a remote host. Look at the web UI to find out where each 
map task finished, and look at the reduce task logs to find out which 
map fetches failed.


I am not sure if the reduce task logs have it. Try this:
port = tasktracker.http.port (this is set through the conf)
tthost = the tasktracker hostname (the destination tasktracker from which the 
map output needs to be fetched)

jobid = the complete job id "job_"
mapid = the task attempt id "attempt_..." that has successfully completed 
the map
reduce-partition-id = the partition number of the reduce task; 
task_..._r_$i_$j will have reduce-partition-id as int-value($i).


url = 
http://'$tthost':'$port'/mapOutput?job='$jobid'&map='$mapid'&reduce='$reduce-partition-id'

'$var' is what you have to substitute.
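
For what it's worth, a small self-contained sketch that assembles such a URL 
from those pieces; every value below is a hypothetical placeholder (50060 
merely stands in for tasktracker.http.port -- read the real value from your conf):

public class MapOutputUrl {
  public static void main(String[] args) {
    // Hypothetical values -- substitute the ones from your job and logs.
    String ttHost = "slave-01";                               // tasktracker hosting the map output
    int ttHttpPort = 50060;                                   // tasktracker.http.port from the conf
    String jobId = "job_200806201106_0001";                   // complete job id
    String mapId = "attempt_200806201106_0001_m_000003_0";    // successful map attempt id
    int reducePartition = 2;                                  // the $i in task_..._r_$i_$j

    String url = "http://" + ttHost + ":" + ttHttpPort
        + "/mapOutput?job=" + jobId
        + "&map=" + mapId
        + "&reduce=" + reducePartition;
    // Fetch this URL (browser, wget, curl) from the reducer's node to test access.
    System.out.println(url);
  }
}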
Amar

Thanks,
Taran

On Thu, Jun 19, 2008 at 11:43 PM, Amar Kamat <[EMAIL PROTECTED]> wrote:

  

Yeah. With 2 nodes the reducers will go up to 16% because the reducers are
able to fetch maps from the same machine (locally) but fail to copy them from
the remote machine. A common reason in such cases is *restricted machine
access* (firewall etc.). The web server on a machine/node hosts map outputs
which the reducers on the other machine are not able to access. There will
be a URL associated with a map that the reducer tries to fetch (check the
reducer logs for this URL). Just try accessing it manually from the
reducer's machine/node. Most likely this experiment will also fail. Let us
know if this is not the case.
Amar

Sayali Kulkarni wrote:



Can you post the reducer logs. How many nodes are there in the cluster?
  


There are 6 nodes in the cluster - 1 master and 5 slaves
 I tried to reduce the number of nodes, and found that the problem is
solved only if there is a single node in the cluster. So I can deduce that
the problem is there in some configuration.

Configuration file:








<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://10.105.41.25:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>10.105.41.25:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.</description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1048M</value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>53</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".</description>
</property>

</configuration>






This is the output that I get when running the tasks with 2 nodes in the
cluster:

08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to
process : 1
08/06/20 11:07:45 INFO mapred.JobClient: Running job:
job_200806201106_0001
08/06/20 11:07:46 INFO mapred.JobClient:  map 0% reduce 0%
08/06/20 11:07:53 INFO mapred.JobClient:  map 8% reduce 0%
08/06/20 11:07:55 INFO mapred.JobClient:  map 17% reduce 0%
08/06/20 11:07:57 INFO mapred.JobClient:  map 26% reduce 0%
08/06/20 11:08:00 INFO mapred.JobClient:  map 34% reduce 0%
08/06/20 11:08:01 INFO mapred.JobClient:  map 43% reduce 0%
08/06/20 11:08:04 INFO mapred.JobClient:  map 47% reduce 0%
08/06/20 11:08:05 INFO mapred.JobClient:  map 52% reduce 0%
08/06/20 11:08:08 INFO mapred.JobClient:  map 60% reduce 0%
08/06/20 11:08:09 INFO mapred.JobClient:  map 69% reduce 0%
08/06/20 11:08:10 INFO mapred.JobClient:  map 73% reduce 0%
08/06/20 11:08:12 INFO mapred.JobClient:  map 78% reduce 0%
08/06/20 11:08:13 INFO mapred.JobClient:  map 82% reduce 0%
08/06/20 11:08:15 INFO mapred.JobClient:  map 91% reduce 1%
08/06/20 11:08:16 INFO mapred.JobClient:  map 95% reduce 1%
08/06/20 11:08:18 INFO mapred.JobClient:  

Re: Data-local tasks

2008-06-30 Thread heyongqiang
Hadoop does not implement a clever task scheduler: when a data node 
heartbeats with the namenode and the data node wants a job, it simply gets 
one.
The selection does not consider the task's input file at all.




  
Best regards,
 
Yongqiang He
2008-06-25



From: Saptarshi Guha
Sent: 2008-06-30 21:12:24
To: core-user@hadoop.apache.org
Cc: 
Subject: Data-local tasks

Hello, 
I recall asking this question before, but this is in addition to what I've asked.
Firstly, to recap my question and Arun's specific response:



-- On May 20, 2008, at 9:03 AM, Saptarshi Guha wrote: > Hello, > 
-- Does the "Data-local map tasks" counter mean the number of tasks that had 
the input data already present on the machine they are running on? 
-- i.e. there wasn't a need to ship the data to them. 


Response from Arun

-- Yes. Your understanding is correct. More specifically it means that the 
map-task got scheduled on a machine on which one of the 
-- replicas of it's input-split-block was present and was served by the 
datanode running on that machine. *smile* Arun




Now, is Hadoop designed to schedule a map task on a machine which has one of 
the replicas of its input split block?

Failing that, does it then assign the map task to a machine close to one that 
contains a replica of its input split block?

Are there any performance metrics for this?



Many thanks

Saptarshi





Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha


Re: Data-local tasks

2008-06-30 Thread Amar Kamat

Saptarshi Guha wrote:

Hello,
I recall asking this question before, but this is in addition to what I've asked.
Firstly, to recap my question and Arun's specific response:

-- On May 20, 2008, at 9:03 AM, Saptarshi Guha wrote: > Hello, > 
-- Does the "Data-local map tasks" counter mean the number of tasks 
that had the input data already present on the machine they 
are running on? 
-- i.e. there wasn't a need to ship the data to them. 


Response from Arun
-- Yes. Your understanding is correct. More specifically it means that 
the map-task got scheduled on a machine on which one of the 
-- replicas of it's input-split-block was present and was served by 
the datanode running on that machine. *smile* Arun



Now, is Hadoop designed to schedule a map task on a machine which has 
one of the replicas of its input split block?

Yes.
Failing that, does it then assign the map task to a machine close to one that 
contains a replica of its input split block?
The scheduling is tasktracker based rather than split based. By that I mean 
that the tasktracker asks for a task and the JT schedules a task to that 
tracker. If there is any split that is data-local to the tasktracker and not 
yet scheduled, it will be assigned to the tracker. If no such split can be 
found, the JT will assign a high-priority split to it. The priority amongst 
the splits is based on their ordering given by the jobclient; by default it's 
sorted on split size (decreasing order). Either the split is data-local (on 
the same machine), rack-local (within the same rack), or not local; there is 
no other measure of closeness. The scheduling problem is 'given a tasktracker, 
find out the best split' rather than 
'given a split, find out the best/closest tracker'.
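
To make that rule concrete, a rough Java sketch (illustrative only -- the types 
are hypothetical and rack-locality is omitted for brevity):

import java.util.List;

/** Illustrative only -- not JobTracker code. */
class SplitScheduler {
  static class PendingSplit {
    List<String> replicaHosts;   // hosts holding a replica of this split's block
    long length;                 // splits are assumed pre-sorted by decreasing length
  }

  /** "Given a tasktracker, find the best split": the tracker asks, the JT answers. */
  static PendingSplit chooseSplit(String trackerHost, List<PendingSplit> pending) {
    // 1. Prefer a not-yet-scheduled split that is data-local to the asking tracker.
    for (PendingSplit s : pending) {
      if (s.replicaHosts.contains(trackerHost)) {
        return s;
      }
    }
    // 2. Otherwise hand out the highest-priority remaining split
    //    (priority = the jobclient's ordering, by default decreasing split size).
    return pending.isEmpty() ? null : pending.get(0);
  }
}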

Are there any performance metrics for this?

Many thanks
Saptarshi


Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha







MapSide Join and left outer or right outer joins?

2008-06-30 Thread Jason Venner
It seems like only full outer or full inner joins are supported. I was 
hoping to just do a left outer join.


Is this supported or planned?

On the flip side, doing the outer join is about 8x faster than doing a 
map/reduce over our dataset.


Thanks
--
Jason Venner
Attributor - Program the Web 
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested


Re: Test Hadoop performance on EC2

2008-06-30 Thread 王志祥
Sorry for the previous post. I haven't finished. Please skip it.

Hi all,
I've made some experiments on Hadoop on Amazon EC2.
I would like to share the result and any feedback would be appreciated.

Environment:
-Xen VM (Amazon EC2 instance ami-ee53b687)
-1.7Ghz Xeon CPU, 1.75GB of RAM, 160GB of local disk, and 250Mb/s of network
bandwidth (small instance)
-Hadoop 0.17.0
-storage: HDFS
-Test example: wordcount

Experiment 1: (fixed # of instances (8), variant data size (2MB~512MB), # of
maps: 8, # of reduces: 8)
Data Size(MB) | Time(s)
512           | 124
256           | 70
128           | 41
...
8             | 22
4             | 17
2             | 21

The purpose is to observe the lowest framework overhead for wordcount.
As a result, when the data size is between 2 MB and 16 MB, the time is around
20 seconds.
May I conclude that the lowest framework overhead for wordcount is about 20 s?

Experiment 2: (variant # of instances (2~32), variant data size (128MB~2GB),
# of maps: (2-32), # of reduces: (2-32))
Data Size(MB) | Map | Reduce | Time(s)
2048          | 32  | 32     | 140
1024          | 16  | 16     | 120
512           | 8   | 8      | 124
256           | 4   | 4      | 127
128           | 2   | 2      | 119

The purpose is to observe whether, if each instance is allocated the same number
of blocks of data, the time will be similar.
As a result, when the data size is between 128 MB and 1024 MB, the time is
around 120 seconds.
The time is 140 s when the data size is 2048 MB. I think the reason is that more
data to process causes more overhead.

Experiment 3: (variant # of instances (2~16), fixed data size (128MB), # of
maps: (2-16), # of reduces: (2-16))
Data Size(MB) | Map | Reduce | Time(s)
128           | 16  | 16     | 31
128           | 8   | 8      | 41
128           | 4   | 4      | 69
128           | 2   | 2      | 119

The purpose is to observe, for a fixed data size, how the result changes as
more and more instances are added.
As a result, as the number of instances doubles, the time gets smaller, but is
not halved.
There is always the framework overhead, even given infinite instances.

In fact, I did more experiments, but I am only posting some of the results.
Interestingly, I discovered a formula for wordcount from my experiment results.
That is: Time(s) ~= 20 + ((DataSize - 8 MB) * 1.6 / (# of instances))
For example, 128 MB on 4 instances gives 20 + (128 - 8) * 1.6 / 4 = 68 s,
which is close to the measured 69 s.
I've checked the formula against all my experiment results and almost all of
them match.
Maybe it's coincidental, or I have something wrong.
Anyway, I just want to share my experience, and any feedback would be
appreciated.

-- 
Best Regards,
Shawn


Using S3 Block FileSystem as HDFS replacement

2008-06-30 Thread slitz
Hello,
I've been trying to set up Hadoop to use S3 as the filesystem. I read in the wiki
that it's possible to choose either the S3 native FileSystem or the S3 Block
FileSystem. I would like to use the S3 Block FileSystem to avoid the task of
"manually" transferring data from S3 to HDFS every time I want to run a job.

I'm still experimenting with the EC2 contrib scripts, and those seem to be
excellent.
What I can't understand is how it may be possible to use S3 with a public
hadoop AMI, since from my understanding hadoop-site.xml gets written on each
instance at startup with the options from hadoop-init, and it seems that the
public AMI (at least the 0.17.0 one) is not configured to use S3 at
all (which makes sense, because the bucket would need individual configuration
anyway).

So... to use the S3 Block FileSystem with EC2 I need to create a custom AMI with
a modified hadoop-init script, right? Or am I completely confused?
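
For reference, a minimal sketch of the properties the S3 block filesystem expects, 
set programmatically here purely for illustration (bucket name and credentials are 
placeholders); on EC2 these would normally be written into hadoop-site.xml, e.g. 
by a modified hadoop-init:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3BlockFsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder bucket and credentials -- substitute your own.
    conf.set("fs.default.name", "s3://my-bucket");
    conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
    conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");

    FileSystem fs = FileSystem.get(conf);   // should resolve to the S3 block filesystem
    System.out.println("default filesystem: " + fs.getUri());
    System.out.println("root exists: " + fs.exists(new Path("/")));
  }
}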


slitz


Test Hadoop performance on EC2

2008-06-30 Thread 王志祥
Hi all,
I've made some experiments on Hadoop on Amazon EC2.
I would like to share the result and any feedback would be appreciated.

Environment:
-Xen VM (Amazon EC2 instance ami-ee53b687)
-1.7Ghz Xeon CPU, 1.75GB of RAM, 160GB of local disk, and 250Mb/s of network
bandwidth (small instance)
-Hadoop 0.17.0
-storage: HDFS
-Test example: wordcount

Experiment 1: (fixed # of instances (8), variant data size (2MB~512MB), # of
maps: 8, # of reduces: 8)
Data Size(MB) | Time(s)
128
256
512

Experiment 2:

Experiment 3:

-- 
Best Regards,
Shawn


Re: Is it possible to access the HDFS using webservices?

2008-06-30 Thread heyongqiang
If you want to access HDFS metadata through web services, that is fine, but it is 
not a wise way to deal with the data itself.
Furthermore, the namenode daemon could even be implemented as a web service; it 
would just be another alternative form of RPC.
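
For what it's worth, a toy sketch of exposing a single metadata call (ls) over 
HTTP on top of FileSystem, assuming JDK 6's built-in HttpServer; the port, 
context path, and lack of error handling are hypothetical simplifications, not 
a recommended design:

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** A toy REST-style front end for HDFS metadata -- a sketch only, no error handling. */
public class HdfsLsService {
  public static void main(String[] args) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration());
    HttpServer server = HttpServer.create(new InetSocketAddress(8099), 0);
    // GET /ls?path=/some/dir returns one path per line.
    server.createContext("/ls", new HttpHandler() {
      public void handle(HttpExchange exchange) throws IOException {
        String query = exchange.getRequestURI().getQuery();   // e.g. "path=/user/foo"
        String path = query.substring("path=".length());
        StringBuilder body = new StringBuilder();
        for (FileStatus status : fs.listStatus(new Path(path))) {
          body.append(status.getPath().toString()).append('\n');
        }
        byte[] bytes = body.toString().getBytes("UTF-8");
        exchange.sendResponseHeaders(200, bytes.length);
        OutputStream out = exchange.getResponseBody();
        out.write(bytes);
        out.close();
      }
    });
    server.start();
  }
}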




Best regards,
 
Yongqiang He
2008-07-01

Email: [EMAIL PROTECTED]
Tel:   86-10-62600966(O)
 
Research Center for Grid and Service Computing,
Institute of Computing Technology, 
Chinese Academy of Sciences
P.O.Box 2704, 100080, Beijing, China 



From: [EMAIL PROTECTED]
Sent: 2008-07-01 06:19:30
To: [EMAIL PROTECTED]; core-user@hadoop.apache.org
Cc: 
Subject: Is it possible to access the HDFS using webservices?

Hi everybody, 

I'm trying to access the hdfs using web services. The idea is that the
web service client can access the HDFS using SOAP or REST and has to
support all the hdfs shell commands. 

Is there some work around this?

I really appreciate any feedback,

Xavier


Re: reducers hanging problem

2008-06-30 Thread Andreas Kostyrka
On Monday 30 June 2008 18:38:28 Runping Qi wrote:
> Looks like the reducer stuck at shuffling phase.
> What is the progression percentage do you see for the reducer from web
> GUI?
>
> It is known that 0.17 does not handle shuffling well.

I think it has been 87% (meaning that 19 of 22 reducer tasks were finished). 
On a smaller job size, it hangs at 93%.

That makes me curious: when will 0.18 be out? Or 0.17.1? Till now I have always 
managed to run into problems far enough behind the curve that there was almost 
always a cure in the form of an upgrade. I don't know whether running trunk is a 
good idea.

Andreas


signature.asc
Description: This is a digitally signed message part.


Re: reducers hanging problem

2008-06-30 Thread Andreas Kostyrka
On Monday 30 June 2008 17:49:38 Chris Anderson wrote:
> On Mon, Jun 30, 2008 at 8:30 AM, Andreas Kostyrka <[EMAIL PROTECTED]> 
wrote:
> >  Plus it seems to be deterministic, it always stop at 3 reduce parts
> > not finishing, although I haven't yet checked if they are always the same
> > errors or not.
>
> I've been struggling through getting my streaming tasks (map only, no
> reduce) to run across large Nutch crawls. I'd been having
> deterministic failures as well. It's worth checking your streaming job
> against input data (maybe on a local workstation) to see that it
> doesn't blow up on some of the input. It turns out my Ruby/Hpricot XML
> parsers were having a hard time swallowing large binary files
> (surprise) and as a result, some map tasks would always die in the
> same place.

I am not getting failures => failures would mean that my driver script retries 
it, ...

In my case it just stops. The external reducer program hangs while reading 
stdin, and that's it. For 5 hours or so.

I'll try something else, I won't kill these hung processes, instead I'll run a 
longer-term strace on it, and see if some data trickles into it or not.

And while my reducers output potentially long lines (in my case, it's a list 
of cookie values associated with a given site, that can get really long), 
this is on the output side of the reducer. The cookie lines come nice and 
short from the mappers.

> I got my data to test locally but running streaming jar with cat as
> it's mapper, and then copying the results to my workstation, and
> piping them into my script. I haven't tried using cat as a reducer,
> but it should yield output files suitable for running your streaming
> reducers over, in an instrumented environment.

The problem is most probably not in my reducer. Notice that hadoop, not my 
reducer, is having problems fetching map output files. My reducer just sleeps 
like an innocent babe waiting on data.

Andreas




signature.asc
Description: This is a digitally signed message part.


RE: Summit / Camp Hadoop at ApacheCon

2008-06-30 Thread Ajay Anand
At this point I am looking for proposals for talks or topics for panel
discussions - similar to the Summit we did a few months ago. The idea
would be to share with the community progress that's being made with
Hadoop related projects or discuss interesting applications /
deployments using Hadoop. 

-Original Message-
From: Ted Dunning [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 30, 2008 4:35 PM
To: core-user@hadoop.apache.org
Subject: Re: Summit / Camp Hadoop at ApacheCon

I would love to help, especially on the Mahout side of things.

What would you like to have?

On Mon, Jun 30, 2008 at 2:53 PM, Ajay Anand <[EMAIL PROTECTED]>
wrote:

> We are planning to host a mini-summit (aka "Camp Hadoop") in
conjunction
> with ApacheCon this year - Nov 6th and 7th - in New Orleans.
>
>
>
> We are working on putting together the agenda for this now, and would
> love to hear from you if you have suggestions for talks or panel
> discussions that we could include. Please send your suggestions to
> [EMAIL PROTECTED]
>
>
>
> Thanks!
>
> Ajay
>
>


-- 
ted


Re: Summit / Camp Hadoop at ApacheCon

2008-06-30 Thread Ted Dunning
I would love to help, especially on the Mahout side of things.

What would you like to have?

On Mon, Jun 30, 2008 at 2:53 PM, Ajay Anand <[EMAIL PROTECTED]> wrote:

> We are planning to host a mini-summit (aka "Camp Hadoop") in conjunction
> with ApacheCon this year - Nov 6th and 7th - in New Orleans.
>
>
>
> We are working on putting together the agenda for this now, and would
> love to hear from you if you have suggestions for talks or panel
> discussions that we could include. Please send your suggestions to
> [EMAIL PROTECTED]
>
>
>
> Thanks!
>
> Ajay
>
>


-- 
ted


Is it possible to access the HDFS using webservices?

2008-06-30 Thread xavier.quintuna
Hi everybody, 

I'm trying to access the hdfs using web services. The idea is that the
web service client can access the HDFS using SOAP or REST and has to
support all the hdfs shell commands. 

Is there some work around this?

I really appreciate any feedback,

Xavier




Summit / Camp Hadoop at ApacheCon

2008-06-30 Thread Ajay Anand
We are planning to host a mini-summit (aka "Camp Hadoop") in conjunction
with ApacheCon this year - Nov 6th and 7th - in New Orleans.

 

We are working on putting together the agenda for this now, and would
love to hear from you if you have suggestions for talks or panel
discussions that we could include. Please send your suggestions to
[EMAIL PROTECTED]

 

Thanks!

Ajay



Delete directory only if empty

2008-06-30 Thread Nathan Marz

Hi,

Is there any sort of hadoop command like the unix "rmdir" command? I 
need a command that will delete a directory only if it is empty. It is 
not sufficient to check whether the directory is empty first, because 
a file could be added to the directory in between my checking if it's 
empty and deleting it.
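
One approach worth trying (a sketch only; the behaviour should be verified against 
your release): FileSystem.delete(path, false) with recursive set to false, which is 
expected to refuse to remove a non-empty directory, so the emptiness check and the 
delete happen in a single filesystem call:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RmdirIfEmpty {
  /** Tries to remove a directory without recursing; returns false if it was not empty. */
  static boolean rmdir(FileSystem fs, Path dir) throws IOException {
    try {
      // recursive=false: the filesystem itself should reject a non-empty directory,
      // so there is no check-then-delete race in this client code.
      return fs.delete(dir, false);
    } catch (IOException nonEmptyOrOther) {
      // HDFS is expected to report a non-empty directory as an IOException here;
      // a real implementation would inspect the message/cause before swallowing it.
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    System.out.println(rmdir(fs, new Path(args[0])));
  }
}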


Thanks,
Nathan Marz


Re: Too many fetch failures AND Shuffle error

2008-06-30 Thread Tarandeep Singh
I am getting this error as well.
As Sayali mentioned in his mail, I updated the /etc/hosts file with the
slave machines IP addresses, but I am still getting this error.

Amar, which is the url that you were talking about in your mail -
"There will be a URL associated with a map that the reducer try to fetch
(check the reducer logs for this url)"

Please tell me where I should look for it... I will try to access it
manually to see if this error is due to a firewall.

Thanks,
Taran

On Thu, Jun 19, 2008 at 11:43 PM, Amar Kamat <[EMAIL PROTECTED]> wrote:

> Yeah. With 2 nodes the reducers will go up to 16% because the reducer are
> able to fetch maps from the same machine (locally) but fails to copy it from
> the remote machine. A common reason in such cases is the *restricted machine
> access* (firewall etc). The web-server on a machine/node hosts map outputs
> which the reducers on the other machine are not able to access. There will
> be a URL associated with a map that the reducer try to fetch (check the
> reducer logs for this url). Just try accessing it manually from the
> reducer's machine/node. Most likely this experiment should also fail. Let us
> know if this is not the case.
> Amar
>
> Sayali Kulkarni wrote:
>
>> Can you post the reducer logs. How many nodes are there in the cluster?
>>>
>>>
>> There are 6 nodes in the cluster - 1 master and 5 slaves
>>  I tried to reduce the number of nodes, and found that the problem is
>> solved only if there is a single node in the cluster. So I can deduce that
>> the problem is there in some configuration.
>>
>> Configuration file:
>> 
>> 
>>
>> 
>>
>> 
>>
>> 
>> <configuration>
>>
>> <property>
>>   <name>hadoop.tmp.dir</name>
>>   <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
>>   <description>A base for other temporary directories.</description>
>> </property>
>>
>> <property>
>>   <name>fs.default.name</name>
>>   <value>hdfs://10.105.41.25:54310</value>
>>   <description>The name of the default file system.  A URI whose
>>   scheme and authority determine the FileSystem implementation.  The
>>   uri's scheme determines the config property (fs.SCHEME.impl) naming
>>   the FileSystem implementation class.  The uri's authority is used to
>>   determine the host, port, etc. for a filesystem.</description>
>> </property>
>>
>> <property>
>>   <name>mapred.job.tracker</name>
>>   <value>10.105.41.25:54311</value>
>>   <description>The host and port that the MapReduce job tracker runs
>>   at.  If "local", then jobs are run in-process as a single map
>>   and reduce task.</description>
>> </property>
>>
>> <property>
>>   <name>dfs.replication</name>
>>   <value>2</value>
>>   <description>Default block replication.
>>   The actual number of replications can be specified when the file is created.
>>   The default is used if replication is not specified in create time.</description>
>> </property>
>>
>> <property>
>>   <name>mapred.child.java.opts</name>
>>   <value>-Xmx1048M</value>
>> </property>
>>
>> <property>
>>   <name>mapred.local.dir</name>
>>   <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
>> </property>
>>
>> <property>
>>   <name>mapred.map.tasks</name>
>>   <value>53</value>
>>   <description>The default number of map tasks per job.  Typically set
>>   to a prime several times greater than number of available hosts.
>>   Ignored when mapred.job.tracker is "local".</description>
>> </property>
>>
>> <property>
>>   <name>mapred.reduce.tasks</name>
>>   <value>7</value>
>>   <description>The default number of reduce tasks per job.  Typically set
>>   to a prime close to the number of available hosts.  Ignored when
>>   mapred.job.tracker is "local".</description>
>> </property>
>>
>> </configuration>
>> 
>>
>> 
>>
>>
>> 
>> This is the output that I get when running the tasks with 2 nodes in the
>> cluster:
>>
>> 08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to
>> process : 1
>> 08/06/20 11:07:45 INFO mapred.JobClient: Running job:
>> job_200806201106_0001
>> 08/06/20 11:07:46 INFO mapred.JobClient:  map 0% reduce 0%
>> 08/06/20 11:07:53 INFO mapred.JobClient:  map 8% reduce 0%
>> 08/06/20 11:07:55 INFO mapred.JobClient:  map 17% reduce 0%
>> 08/06/20 11:07:57 INFO mapred.JobClient:  map 26% reduce 0%
>> 08/06/20 11:08:00 INFO mapred.JobClient:  map 34% reduce 0%
>> 08/06/20 11:08:01 INFO mapred.JobClient:  map 43% reduce 0%
>> 08/06/20 11:08:04 INFO mapred.JobClient:  map 47% reduce 0%
>> 08/06/20 11:08:05 INFO mapred.JobClient:  map 52% reduce 0%
>> 08/06/20 11:08:08 INFO mapred.JobClient:  map 60% reduce 0%
>> 08/06/20 11:08:09 INFO mapred.JobClient:  map 69% reduce 0%
>> 08/06/20 11:08:10 INFO mapred.JobClient:  map 73% reduce 0%
>> 08/06/20 11:08:12 INFO mapred.JobClient:  map 78% reduce 0%
>> 08/06/20 11:08:13 INFO mapred.JobClient:  map 82% reduce 0%
>> 08/06/20 11:08:15 INFO mapred.JobClient:  map 91% reduce 1%
>> 08/06/20 11:08:16 INFO mapred.JobClient:  map 95% reduce 1%
>> 08/06/20 11:08:18 INFO mapred.JobClient:  map 99% reduce 3%
>> 08/06/20 11:08:23 INFO mapred.JobClient:  map 100% reduce 3%
>> 08/06/20 11:08:25 INFO mapred.JobClient:  map 100% reduce 7%
>> 08/06/20 11:08:28 INFO mapred.JobClient:  map 100% reduce 10%
>> 08/06/20 11:08:30 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/06/20 11:08:33 INFO mapred.JobClient:  map 100% reduce 12%
>> 08/06/20 11:08:35 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/06/20 11:08:38 INFO mapred.JobClient:  map 100% reduce 15%
>> 08/06/20 11:09:54 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:09:54 INFO mapred

Re: Hadoop Meetup @ Berlin

2008-06-30 Thread Isabel Drost
On Tuesday 17 June 2008, j.L wrote:
> where we can download pdf or ppt of this meetup?

Slides are online: 

Isabel

-- 
It's from Casablanca.  I've been waiting all my life to use that line.  
-- 
Woody Allen, "Play It Again, Sam"
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  


signature.asc
Description: This is a digitally signed message part.


RE: Hadoop - is it good for me and performance question

2008-06-30 Thread Haijun Cao
http://www.mail-archive.com/core-user@hadoop.apache.org/msg02906.html


-Original Message-
From: yair gotdanker [mailto:[EMAIL PROTECTED] 
Sent: Sunday, June 29, 2008 4:46 AM
To: core-user@hadoop.apache.org
Subject: Hadoop - is it good for me and performance question

Hello all,



I am a newbie to Hadoop. The technology seems very interesting, but I am not
sure it suits my needs. I would really appreciate your feedback.



The problem:

I have multiple log servers, each receiving 10-100 MB/minute. The received
data is processed to produce aggregated data.
The data processing time should take a few minutes at most (10 min).

In addition, I did a performance benchmark of the wordcount example
provided by the quickstart tutorial on my PC (pseudo-distributed, using the
quickstart configuration files), and it took about 40 seconds!
I must be missing something here; I must be doing something wrong, since
40 seconds is way too long!
The map/reduce functions should be very fast since there is almost no
processing done, so I guess most of the time is spent in the Hadoop framework.

I will appreciate any help in understanding this and in how I can increase
the performance.
By the way, does anyone know of a good behind-the-scenes tutorial that explains
more about how the jobtracker and tasktrackers communicate, and so on?


RE: Hadoop - is it good for me and performance question

2008-06-30 Thread Haijun Cao

Not sure if this will answer your question, but a similar thread
regarding hadoop performance:

http://www.mail-archive.com/core-user@hadoop.apache.org/msg02878.html

Hadoop is good for log processing if you have a lot of logs to process
and you don't need the result in real time (e.g. you can accumulate one
day's logs and process them in one batch, latency == 1 day). In other
words, it shines at large-data-set batch (long-latency) processing. It
is good at scalability (scaling out), not at increasing single
core/machine performance. If your data fits in one process, then using a
distributed framework will probably slow it down.

Haijun

-Original Message-
From: yair gotdanker [mailto:[EMAIL PROTECTED] 
Sent: Sunday, June 29, 2008 4:46 AM
To: core-user@hadoop.apache.org
Subject: Hadoop - is it good for me and performance question

Hello all,



I am a newbie to Hadoop. The technology seems very interesting, but I am not
sure it suits my needs. I would really appreciate your feedback.



The problem:

I have multiple log servers, each receiving 10-100 MB/minute. The received
data is processed to produce aggregated data.
The data processing time should take a few minutes at most (10 min).

In addition, I did a performance benchmark of the wordcount example
provided by the quickstart tutorial on my PC (pseudo-distributed, using the
quickstart configuration files), and it took about 40 seconds!
I must be missing something here; I must be doing something wrong, since
40 seconds is way too long!
The map/reduce functions should be very fast since there is almost no
processing done, so I guess most of the time is spent in the Hadoop framework.

I will appreciate any help in understanding this and in how I can increase
the performance.
By the way, does anyone know of a good behind-the-scenes tutorial that explains
more about how the jobtracker and tasktrackers communicate, and so on?


Underlying file system Block size

2008-06-30 Thread Naama Kraus
Hi All,

To my knowledge, the HDFS block size is 64 MB - fairly large. Is this a
requirement on a file system, if one wishes to implement Hadoop on top of
it? Or is there a way to get along with a file system supporting a smaller
block size, such as 1 MB or even less? What is the case for the existing
non-HDFS filesystem implementations used with Hadoop (such as S3 and KFS)?
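
For HDFS itself, 64 MB is only the default (dfs.block.size); a block size can also 
be requested per file at create time. A small sketch, with hypothetical path, 
replication, and size values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallBlockFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical example: ask for a 1 MB block size for this particular file.
    long blockSize = 1L * 1024 * 1024;
    FSDataOutputStream out = fs.create(new Path("/tmp/small-block-file"),
        true,                                        // overwrite
        conf.getInt("io.file.buffer.size", 4096),    // buffer size
        (short) 2,                                   // replication
        blockSize);
    out.write("hello".getBytes("UTF-8"));
    out.close();
  }
}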

Thanks for any input,
Naama

-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)


problem when many map tasks are used (since 0.17.1 was installed)

2008-06-30 Thread Ashish Venugopal
The crash below occurs when I run many mappers (-jobconf mapred.map.tasks=200).
It does not occur if I set mapred.map.tasks=1, even when I allocate
many machines (causing there to be many mappers). But when I set the
number of map tasks to 200, the error below happens. This just started
happening after the recent upgrade to 0.17.1 (previously I was using 0.16.4).

This is a streaming job. Any help is appreciated.


Ashish

Exception closing file
/user/ashishv/iwslt/syn_baseline/translation_dev/_temporary/_task_200806272233_0001_m_000174_0/part-00174
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not complete
write to file /user/ashishv/iwslt/syn_baseline/translation_dev/_tem
porary/_task_200806272233_0001_m_000174_0/part-00174 by
DFSClient_task_200806272233_0001_m_000174_0
at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:332)
at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

at org.apache.hadoop.ipc.Client.call(Client.java:557)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.dfs.$Proxy1.complete(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.dfs.$Proxy1.complete(Unknown Source)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:2655)
at
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:2576)
at org.apache.hadoop.dfs.DFSClient.close(DFSClient.java:221)


Parameterized InputFormats

2008-06-30 Thread Nathan Marz

Hello,

Are there any plans to change the JobConf API so that it takes an  
instance of an InputFormat rather than the InputFormat class? I am  
finding the inability to properly parameterize my InputFormats to be  
very restricting. What's the reasoning behind having the class as a  
parameter rather than an instance?
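
A sketch of the usual workaround given that constraint: since the framework 
instantiates the class itself, the parameters travel through the JobConf and are 
read back inside the InputFormat. The class and property names here are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class MyInputFormat extends TextInputFormat {
  public static final String CHUNK_SIZE_KEY = "my.inputformat.chunk.size"; // hypothetical

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    int chunkSize = job.getInt(CHUNK_SIZE_KEY, 1024); // the "constructor argument"
    reporter.setStatus("chunk size = " + chunkSize);  // placeholder use of the parameter
    return super.getRecordReader(split, job, reporter);
  }
}

// In the driver:
//   conf.setInputFormat(MyInputFormat.class);
//   conf.setInt(MyInputFormat.CHUNK_SIZE_KEY, 4096);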


-Nathan Marz


RE: RecordReader Functionality

2008-06-30 Thread Runping Qi

Your record reader must be able to find the beginning of the next record
beyond the start position of a given split. Your file format must enable
your record reader to detect the beginning of the next record beyond the
start pos of a split. It seems to me that is not possible based on the
info I saw so far.
Why not just use SequenceFile instead?

Runping


> -Original Message-
> From: Sean Arietta [mailto:[EMAIL PROTECTED]
> Sent: Monday, June 30, 2008 10:29 AM
> To: core-user@hadoop.apache.org
> Subject: Re: RecordReader Functionality
> 
> 
> I thought that the InStream buffer (called 'in' in this case) would
> maintain
> the stream position based on how many bytes I had 'read' via
in.read().
> Maybe this is not the case...
> 
> Would it then be proper to call:
> 
> in.seek(pos);
> 
> I believe I tried this at one point and I got an error. I will try
again
> to
> be sure though. Thanks for your reply!
> 
> Cheers,
> Sean
> 
> 
> Jorgen Johnson wrote:
> >
> > Hi Sean,
> >
> > Perhaps I'm missing something, but it doesn't appear to me that
you're
> > actually seeking to the filesplit start position in your
constructor...
> >
> > This would explain why all the mappers are getting the same records.
> >
> > -jorgenj
> >
> > On Mon, Jun 30, 2008 at 9:22 AM, Sean Arietta
<[EMAIL PROTECTED]>
> > wrote:
> >
> >>
> >> Hello all,
> >> I am having a problem writing my own RecordReader. The basic setup
I
> have
> >> is
> >> a large byte array that needs to be diced up into key value pairs
such
> >> that
> >> the key is the index into the array and the value is a byte array
> itself
> >> (albeit much smaller). Here is the code that I currently have
written
> to
> >> do
> >> just this:
> >>
> >>
> >> /* This method is just the constructor for my new RecordReader
> >> public TrainingRecordReader(Configuration job, FileSplit split)
throws
> >> IOException
> >>{
> >>start = split.getStart();
> >>end = start + split.getLength();
> >>final Path file = split.getPath();
> >>compressionCodecs = new
CompressionCodecFactory(job);
> >>final CompressionCodec codec =
> >> compressionCodecs.getCodec(file);
> >>
> >>// open the file and seek to the start of the split
> >>FileSystem fs = file.getFileSystem(job);
> >>FSDataInputStream fileIn = fs.open(split.getPath());
> >>in = new TrainingReader(fileIn, job);
> >>this.pos = start;
> >>}
> >>
> >> // This returns the key, value pair I was talking about
> >> public synchronized boolean next(LongWritable key, BytesWritable
value)
> >> throws IOException
> >>{
> >>if (pos >= end)
> >>return false;
> >>
> >>key.set(pos);   // key is position
> >>int newSize = in.readVector(value);
> >>if (newSize > 0)
> >>{
> >>pos += newSize;
> >>return true;
> >>}
> >>return false;
> >>}
> >>
> >> // This extracts that smaller byte array from the large input file
> >> public int readVector(BytesWritable value) throws IOException
> >>{
> >>int numBytes = in.read(buffer);
> >>value.set(buffer, 0, numBytes);
> >>return numBytes;
> >>}
> >>
> >> So all of this worked just fine when I set
> conf.set("mapred.job.tracker",
> >> "local"), but now that I am attempting to test in a fake
distributed
> >> setting
> >> (aka still one node, but I haven't set the above config param), I
do
> not
> >> get
> >> what I want. Instead of getting unique key value pairs, I get
repeated
> >> key
> >> value pairs based on the number of map tasks I have set. So, say
that
> my
> >> large file contained 49 entries, I would want a unique key value
pair
> for
> >> each of those, but if I set my numMapTasks to 7, I get 7 unique
ones
> that
> >> repeat every 7 key value pairs.
> >>
> >> So it seems that each MapTask which ultimately calls my
> >> TrainingReader.next() method from above is somehow pointing to the
same
> >> FileSplit. I know that in the LineRecordReader in the source there
is
> >> some
> >> small little routine that skips the first line of the data if you
> aren't
> >> at
> >> the beginning Is that related? Why isn't it the case that
> >> split.getStart() isn't returning the absolute pointer to the start
of
> the
> >> split? So many questions I don't know the answer to, haha.
> >>
> >> I would appreciate anyone's help in resolving this issue. Thanks
very
> >> much!
> >>
> >> Cheers,
> >> Sean M. Arietta
> >> --
> >> View this message in context:
> >> http://www.nabble.com/RecordReader-Functionality-
> tp18199187p18199187.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> > --
> > "Libe

Re: RecordReader Functionality

2008-06-30 Thread Sean Arietta

I thought that the InStream buffer (called 'in' in this case) would maintain
the stream position based on how many bytes I had 'read' via in.read().
Maybe this is not the case...

Would it then be proper to call:

in.seek(pos);

I believe I tried this at one point and I got an error. I will try again to
be sure though. Thanks for your reply!
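
For reference, a sketch of the change being discussed, applied to the constructor 
quoted further down; only the fileIn.seek(start) line is new, and it seeks the 
underlying FSDataInputStream to the split's start offset rather than calling seek 
on the wrapper:

// Fragment of TrainingRecordReader's constructor (as quoted below), one added line:
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
fileIn.seek(start);    // position the underlying stream at this split's offset
in = new TrainingReader(fileIn, job);
this.pos = start;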

Cheers,
Sean


Jorgen Johnson wrote:
> 
> Hi Sean,
> 
> Perhaps I'm missing something, but it doesn't appear to me that you're
> actually seeking to the filesplit start position in your constructor...
> 
> This would explain why all the mappers are getting the same records.
> 
> -jorgenj
> 
> On Mon, Jun 30, 2008 at 9:22 AM, Sean Arietta <[EMAIL PROTECTED]>
> wrote:
> 
>>
>> Hello all,
>> I am having a problem writing my own RecordReader. The basic setup I have
>> is
>> a large byte array that needs to be diced up into key value pairs such
>> that
>> the key is the index into the array and the value is a byte array itself
>> (albeit much smaller). Here is the code that I currently have written to
>> do
>> just this:
>>
>>
>> /* This method is just the constructor for my new RecordReader
>> public TrainingRecordReader(Configuration job, FileSplit split) throws
>> IOException
>>{
>>start = split.getStart();
>>end = start + split.getLength();
>>final Path file = split.getPath();
>>compressionCodecs = new CompressionCodecFactory(job);
>>final CompressionCodec codec =
>> compressionCodecs.getCodec(file);
>>
>>// open the file and seek to the start of the split
>>FileSystem fs = file.getFileSystem(job);
>>FSDataInputStream fileIn = fs.open(split.getPath());
>>in = new TrainingReader(fileIn, job);
>>this.pos = start;
>>}
>>
>> // This returns the key, value pair I was talking about
>> public synchronized boolean next(LongWritable key, BytesWritable value)
>> throws IOException
>>{
>>if (pos >= end)
>>return false;
>>
>>key.set(pos);   // key is position
>>int newSize = in.readVector(value);
>>if (newSize > 0)
>>{
>>pos += newSize;
>>return true;
>>}
>>return false;
>>}
>>
>> // This extracts that smaller byte array from the large input file
>> public int readVector(BytesWritable value) throws IOException
>>{
>>int numBytes = in.read(buffer);
>>value.set(buffer, 0, numBytes);
>>return numBytes;
>>}
>>
>> So all of this worked just fine when I set conf.set("mapred.job.tracker",
>> "local"), but now that I am attempting to test in a fake distributed
>> setting
>> (aka still one node, but I haven't set the above config param), I do not
>> get
>> what I want. Instead of getting unique key value pairs, I get repeated
>> key
>> value pairs based on the number of map tasks I have set. So, say that my
>> large file contained 49 entries, I would want a unique key value pair for
>> each of those, but if I set my numMapTasks to 7, I get 7 unique ones that
>> repeat every 7 key value pairs.
>>
>> So it seems that each MapTask which ultimately calls my
>> TrainingReader.next() method from above is somehow pointing to the same
>> FileSplit. I know that in the LineRecordReader in the source there is
>> some
>> small little routine that skips the first line of the data if you aren't
>> at
>> the beginning Is that related? Why isn't it the case that
>> split.getStart() isn't returning the absolute pointer to the start of the
>> split? So many questions I don't know the answer to, haha.
>>
>> I would appreciate anyone's help in resolving this issue. Thanks very
>> much!
>>
>> Cheers,
>> Sean M. Arietta
>> --
>> View this message in context:
>> http://www.nabble.com/RecordReader-Functionality-tp18199187p18199187.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> "Liberties are not given, they are taken."
> - Aldous Huxley
> 
> 

-- 
View this message in context: 
http://www.nabble.com/RecordReader-Functionality-tp18199187p18200404.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: joins in map reduce

2008-06-30 Thread Jason Venner

I have just started to try using the Join operators.

The join I am trying is this:
join is 
outer(tbl(org.apache.hadoop.mapred.SequenceFileInputFormat,"Input1"),tbl(org.apache.hadoop.mapred.SequenceFileInputFormat,"IndexedTry1"))


but I get an error
08/06/30 08:55:13 INFO mapred.FileInputFormat: Total input paths to 
process : 10
Exception in thread "main" java.io.IOException: No input paths specified 
in input
   at 
org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:115)

   at org.apache.hadoop.mapred.join.Parser$WNode.getSplits(Parser.java:304)
   at org.apache.hadoop.mapred.join.Parser$CNode.getSplits(Parser.java:375)
   at 
org.apache.hadoop.mapred.join.CompositeInputFormat.getSplits(CompositeInputFormat.java:131)

   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:544)

I am clearly missing something basic...

   conf.setInputFormat(CompositeInputFormat.class);
   conf.setOutputPath( outputDirectory );
   conf.setOutputKeyClass(Text.class);
   conf.setOutputValueClass(Text.class);
   conf.setOutputFormat(MapFileOutputFormat.class);
   conf.setMapperClass( LeftHandJoinMapper.class );
   conf.setReducerClass( IdentityReducer.class );
   conf.setNumReduceTasks(0);

   System.err.println( "join is " + 
CompositeInputFormat.compose("outer", SequenceFileInputFormat.class, 
allTables ) );
   conf.set("mapred.join.expr", 
CompositeInputFormat.compose("outer", SequenceFileInputFormat.class, 
allTables ));
  
   JobClient client = new JobClient();
  
   client.setConf( conf );


   RunningJob job = JobClient.runJob( conf );



Shirley Cohen wrote:

Hi,

How does one do a join operation in map reduce? Is there more than one 
way to do a join? Which way works better and why?


Thanks,

Shirley

--
Jason Venner
Attributor - Program the Web 
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested


RE: reducers hanging problem

2008-06-30 Thread Runping Qi

Looks like the reducer is stuck in the shuffle phase.
What progress percentage do you see for the reducer in the web
GUI?

It is known that 0.17 does not handle shuffling well.

Runping


> -Original Message-
> From: Andreas Kostyrka [mailto:[EMAIL PROTECTED]
> Sent: Monday, June 30, 2008 8:30 AM
> To: core-user@hadoop.apache.org
> Subject: reducers hanging problem
> 
> Hi!
> 
> I'm running streaming tasks on hadoop 0.17.0, and wondered, if anyone
has
> an
> approach to debugging the following situation:
> 
> -) map have all finished (100% in http display),
> -) some reducers are hanging, with the messages below.
> 
> Notice, that the task had 100 map tasks at allo, so 58 seems like an
> extraordinary high number of missing parts, long after map has
officially
> finished. Plus it seems to be deterministic, it always stop at 3
reduce
> parts
> not finishing, although I haven't yet checked if they are always the
same
> errors or not.
> 
> > 2008-06-30 15:25:41,953 INFO org.apache.hadoop.mapred.ReduceTask:
> > task_200806300847_0002_r_14_0 Need 58 map output(s) 2008-06-30
> > 15:25:41,953 INFO org.apache.hadoop.mapred.ReduceTask:
> > task_200806300847_0002_r_14_0: Got 0 new map-outputs & 0
obsolete
> > map-outputs from tasktracker and 0 map-outputs from previous
failures
> > 2008-06-30 15:25:41,954 INFO org.apache.hadoop.mapred.ReduceTask:
> > task_200806300847_0002_r_14_0 Got 0 known map output
location(s);
> > scheduling... 2008-06-30 15:25:41,954 INFO
> > org.apache.hadoop.mapred.ReduceTask:
task_200806300847_0002_r_14_0
> > Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
2008-06-30
> > 15:25:46,770 INFO org.apache.hadoop.streaming.PipeMapRed:
MRErrorThread
> > done 2008-06-30 15:25:46,963 INFO
org.apache.hadoop.mapred.ReduceTask:
> > task_200806300847_0002_r_14_0 Need 58 map output(s) 2008-06-30
> > 15:25:46,963 INFO org.apache.hadoop.mapred.ReduceTask:
> > task_200806300847_0002_r_14_0: Got 0 new map-outputs & 0
obsolete
> > map-outputs from tasktracker and 0 map-outputs from previous
failures
> > 2008-06-30 15:25:46,964 INFO org.apache.hadoop.mapred.ReduceTask:
> > task_200806300847_0002_r_14_0 Got 0 known map output
location(s);
> > scheduling... 2008-06-30 15:25:46,964 INFO
> > org.apache.hadoop.mapred.ReduceTask:
task_200806300847_0002_r_14_0
> > Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 
> TIA,
> 
> Andreas


Re: reducers hanging problem

2008-06-30 Thread Andreas Kostyrka
Another observation, the TaskTracker$Child was alive, and the reduce script 
has hung on read(0, ) :(

Andreas


signature.asc
Description: This is a digitally signed message part.


Re: RecordReader Functionality

2008-06-30 Thread Jorgen Johnson
Hi Sean,

Perhaps I'm missing something, but it doesn't appear to me that you're
actually seeking to the filesplit start position in your constructor...

This would explain why all the mappers are getting the same records.

-jorgenj

On Mon, Jun 30, 2008 at 9:22 AM, Sean Arietta <[EMAIL PROTECTED]> wrote:

>
> Hello all,
> I am having a problem writing my own RecordReader. The basic setup I have
> is
> a large byte array that needs to be diced up into key value pairs such that
> the key is the index into the array and the value is a byte array itself
> (albeit much smaller). Here is the code that I currently have written to do
> just this:
>
>
> /* This method is just the constructor for my new RecordReader
> public TrainingRecordReader(Configuration job, FileSplit split) throws
> IOException
>{
>start = split.getStart();
>end = start + split.getLength();
>final Path file = split.getPath();
>compressionCodecs = new CompressionCodecFactory(job);
>final CompressionCodec codec =
> compressionCodecs.getCodec(file);
>
>// open the file and seek to the start of the split
>FileSystem fs = file.getFileSystem(job);
>FSDataInputStream fileIn = fs.open(split.getPath());
>in = new TrainingReader(fileIn, job);
>this.pos = start;
>}
>
> // This returns the key, value pair I was talking about
> public synchronized boolean next(LongWritable key, BytesWritable value)
> throws IOException
>{
>if (pos >= end)
>return false;
>
>key.set(pos);   // key is position
>int newSize = in.readVector(value);
>if (newSize > 0)
>{
>pos += newSize;
>return true;
>}
>return false;
>}
>
> // This extracts that smaller byte array from the large input file
> public int readVector(BytesWritable value) throws IOException
>{
>int numBytes = in.read(buffer);
>value.set(buffer, 0, numBytes);
>return numBytes;
>}
>
> So all of this worked just fine when I set conf.set("mapred.job.tracker",
> "local"), but now that I am attempting to test in a fake distributed
> setting
> (aka still one node, but I haven't set the above config param), I do not
> get
> what I want. Instead of getting unique key value pairs, I get repeated key
> value pairs based on the number of map tasks I have set. So, say that my
> large file contained 49 entries, I would want a unique key value pair for
> each of those, but if I set my numMapTasks to 7, I get 7 unique ones that
> repeat every 7 key value pairs.
>
> So it seems that each MapTask which ultimately calls my
> TrainingReader.next() method from above is somehow pointing to the same
> FileSplit. I know that in the LineRecordReader in the source there is some
> small little routine that skips the first line of the data if you aren't at
> the beginning Is that related? Why isn't it the case that
> split.getStart() isn't returning the absolute pointer to the start of the
> split? So many questions I don't know the answer to, haha.
>
> I would appreciate anyone's help in resolving this issue. Thanks very much!
>
> Cheers,
> Sean M. Arietta
> --
> View this message in context:
> http://www.nabble.com/RecordReader-Functionality-tp18199187p18199187.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
"Liberties are not given, they are taken."
- Aldous Huxley


RecordReader Functionality

2008-06-30 Thread Sean Arietta

Hello all,
I am having a problem writing my own RecordReader. The basic setup I have is
a large byte array that needs to be diced up into key value pairs such that
the key is the index into the array and the value is a byte array itself
(albeit much smaller). Here is the code that I currently have written to do
just this:


/* This method is just the constructor for my new RecordReader
public TrainingRecordReader(Configuration job, FileSplit split) throws
IOException 
{
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
compressionCodecs = new CompressionCodecFactory(job);
final CompressionCodec codec = compressionCodecs.getCodec(file);

// open the file and seek to the start of the split
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
in = new TrainingReader(fileIn, job);
this.pos = start;
}

// This returns the key, value pair I was talking about
public synchronized boolean next(LongWritable key, BytesWritable value)
throws IOException 
{
if (pos >= end)
return false;

key.set(pos);   // key is position
int newSize = in.readVector(value);
if (newSize > 0) 
{
pos += newSize;
return true;
}
return false;
}

// This extracts that smaller byte array from the large input file
public int readVector(BytesWritable value) throws IOException 
{
int numBytes = in.read(buffer);
value.set(buffer, 0, numBytes);
return numBytes;
}

So all of this worked just fine when I set conf.set("mapred.job.tracker",
"local"), but now that I am attempting to test in a fake distributed setting
(aka still one node, but I haven't set the above config param), I do not get
what I want. Instead of getting unique key value pairs, I get repeated key
value pairs based on the number of map tasks I have set. So, say that my
large file contained 49 entries, I would want a unique key value pair for
each of those, but if I set my numMapTasks to 7, I get 7 unique ones that
repeat every 7 key value pairs.

So it seems that each MapTask which ultimately calls my
TrainingReader.next() method from above is somehow pointing to the same
FileSplit. I know that in the LineRecordReader in the source there is some
small little routine that skips the first line of the data if you aren't at
the beginning Is that related? Why isn't it the case that
split.getStart() isn't returning the absolute pointer to the start of the
split? So many questions I don't know the answer to, haha.

I would appreciate anyone's help in resolving this issue. Thanks very much!

Cheers,
Sean M. Arietta
-- 
View this message in context: 
http://www.nabble.com/RecordReader-Functionality-tp18199187p18199187.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: reducers hanging problem

2008-06-30 Thread Chris Anderson
On Mon, Jun 30, 2008 at 8:30 AM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:
>  Plus it seems to be deterministic, it always stop at 3 reduce parts
> not finishing, although I haven't yet checked if they are always the same
> errors or not.

I've been struggling through getting my streaming tasks (map only, no
reduce) to run across large Nutch crawls. I'd been having
deterministic failures as well. It's worth checking your streaming job
against input data (maybe on a local workstation) to see that it
doesn't blow up on some of the input. It turns out my Ruby/Hpricot XML
parsers were having a hard time swallowing large binary files
(surprise) and as a result, some map tasks would always die in the
same place.

I got my data to test locally by running the streaming jar with cat as
its mapper, and then copying the results to my workstation and
piping them into my script. I haven't tried using cat as a reducer,
but it should yield output files suitable for running your streaming
reducers over, in an instrumented environment.


-- 
Chris Anderson
http://jchris.mfdz.com


reducers hanging problem

2008-06-30 Thread Andreas Kostyrka
Hi!

I'm running streaming tasks on hadoop 0.17.0, and wondered, if anyone has an 
approach to debugging the following situation:

-) map have all finished (100% in http display),
-) some reducers are hanging, with the messages below.

Notice that the job had 100 map tasks in all, so 58 seems like an 
extraordinarily high number of missing parts, long after the maps have officially 
finished. Plus it seems to be deterministic: it always stops at 3 reduce parts 
not finishing, although I haven't yet checked whether they are always the same 
errors or not.

> 2008-06-30 15:25:41,953 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200806300847_0002_r_14_0 Need 58 map output(s) 2008-06-30
> 15:25:41,953 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200806300847_0002_r_14_0: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-06-30 15:25:41,954 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200806300847_0002_r_14_0 Got 0 known map output location(s);
> scheduling... 2008-06-30 15:25:41,954 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200806300847_0002_r_14_0
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-06-30
> 15:25:46,770 INFO org.apache.hadoop.streaming.PipeMapRed: MRErrorThread
> done 2008-06-30 15:25:46,963 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200806300847_0002_r_14_0 Need 58 map output(s) 2008-06-30
> 15:25:46,963 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200806300847_0002_r_14_0: Got 0 new map-outputs & 0 obsolete
> map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-06-30 15:25:46,964 INFO org.apache.hadoop.mapred.ReduceTask:
> task_200806300847_0002_r_14_0 Got 0 known map output location(s);
> scheduling... 2008-06-30 15:25:46,964 INFO
> org.apache.hadoop.mapred.ReduceTask: task_200806300847_0002_r_14_0
> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)

TIA,

Andreas


signature.asc
Description: This is a digitally signed message part.


Data-local tasks

2008-06-30 Thread Saptarshi Guha

Hello,
	I recall asking this question before, but this is in addition to what I've 
asked.

Firstly, to recap my question and Arun's specific response:

--  On May 20, 2008, at 9:03 AM, Saptarshi Guha wrote: > Hello, >
--	Does the "Data-local map tasks" counter mean the number of tasks 
that had the input data already present on the machine they 
are running on?

--  i.e. there wasn't a need to ship the data to them.

Response from Arun
--	Yes. Your understanding is correct. More specifically it means that  
the map-task got scheduled on a machine on which one of the
--	replicas of it's input-split-block was present and was served by  
the datanode running on that machine. *smile* Arun



	Now, is Hadoop designed to schedule a map task on a machine which has 
one of the replicas of its input split block?
	Failing that, does it then assign the map task to a machine close to one 
that contains a replica of its input split block?

Are there any performance metrics for this?

Many thanks
Saptarshi


Saptarshi Guha | [EMAIL PROTECTED] | http://www.stat.purdue.edu/~sguha



smime.p7s
Description: S/MIME cryptographic signature


DataXceiver: java.io.IOException: Connection reset by peer

2008-06-30 Thread Rong-en Fan
Hi,

I'm using Hadoop 0.17.1 with HBase trunk, and notice lots of exception
in hadoop's log (it's a 3-node hdfs):

2008-06-30 19:27:45,760 ERROR org.apache.hadoop.dfs.DataNode: 192.168.23.1:500
10:DataXceiver: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:104)
at sun.nio.ch.IOUtil.write(IOUtil.java:75)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
at 
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:53)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:140)
at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:144)
at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:105)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at 
org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1774)
at 
org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1813)
at 
org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1039)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:968)
at java.lang.Thread.run(Thread.java:619)

It seems to me that the datanode cannot handle the incoming traffic.
If so, what parameters on the hadoop side and/or in the OS (I'm using RHEL 4)
can I play with?

Thanks,
Rong-En Fan


namnode replication

2008-06-30 Thread Vibhooti Verma
I have set my  property  as follows.

<property>
  <name>dfs.name.dir</name>
  <value>/apollo/env/TVHadoopCluster/var/tmp/hadoop/dfs/name,/local/namenode</value>
  <description>Determines where on the local filesystem the DFS name node
  should store the name table.  If this is a comma-delimited list
  of directories then the name table is replicated in all of the
  directories, for redundancy.</description>
</property>




When I start my DFS after that, it does not find all of the directory
structure and hence can't start the namenode. Has anyone tried this before?
Please let me know if I have to create the entire structure manually.

Regards,
VIbhooti

-- 
cheers,
Vibhooti