Re: Unable to run Jar file in Hadoop.

2009-06-25 Thread Amareshwari Sriramadasu

Is your jar file in local file system or hdfs?
The jar file should be in local fs.
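For example (paths here are placeholders, not from the original thread), copy the jar out of HDFS to local disk and run it from there:

  bin/hadoop fs -get /user/hadoop/hadoop-0.18.0-examples.jar /tmp/hadoop-0.18.0-examples.jar
  bin/hadoop jar /tmp/hadoop-0.18.0-examples.jar wordcount <input> <output>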

Thanks
Amareshwari
Shravan Mahankali wrote:

I am having a similar problem as well... there is no solution yet!!!

Thank You,
Shravan Kumar. M 
Catalytic Software Ltd. [SEI-CMMI Level 5 Company]

-
This email and any files transmitted with it are confidential and intended
solely for the use of the individual or entity to whom they are addressed.
If you have received this email in error please notify the system
administrator - netopshelpd...@catalytic.com

-Original Message-
From: krishna prasanna [mailto:svk_prasa...@yahoo.com] 
Sent: Thursday, June 25, 2009 1:01 PM

To: core-user@hadoop.apache.org
Subject: Unable to run Jar file in Hadoop.

Hi, 
 
When I am trying to run a jar in Hadoop, it gives me the following error:
 
had...@krishna-dev:/usr/local/hadoop$ bin/hadoop jar
/user/hadoop/hadoop-0.18.0-examples.jar 
java.io.IOException: Error opening job jar:

/user/hadoop/hadoop-0.18.0-examples.jar
at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Caused by: java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:114)
at java.util.jar.JarFile.<init>(JarFile.java:133)
at java.util.jar.JarFile.<init>(JarFile.java:70)
at org.apache.hadoop.util.RunJar.main(RunJar.java:88)
... 4 more
 
Here is the file permissions,

 -rw-r--r--   2 hadoop supergroup  91176 2009-06-25 12:49
/user/hadoop/hadoop-0.18.0-examples.jar
 
Could somebody help me with this?
 
Thanks in advance,

Krishna.






Re: Using addCacheArchive

2009-06-25 Thread Amareshwari Sriramadasu

Hi Akhil,

DistributedCache.addCacheArchive takes path on hdfs. From your code, it looks 
like you are passing local path.
Also, if you want to create symlink, you should pass URI as hdfs://path#linkname, besides calling  
DistributedCache.createSymlink(conf);
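A minimal sketch of the intended usage (the namenode address and path are placeholders, assuming Config.zip has already been copied into HDFS):

  // add the archive from HDFS and ask for a symlink named "Config" in the task's working directory
  DistributedCache.addCacheArchive(new URI("hdfs://namenode:9000/user/akhil1988/Config.zip#Config"), conf);
  DistributedCache.createSymlink(conf);
  // in the map task, the unpacked archive is then reachable as ./Config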


Thanks
Amareshwari


akhil1988 wrote:

Please ask any questions if I am not clear above about the problem I am
facing.

Thanks,
Akhil

akhil1988 wrote:
  

Hi All!

I want a directory to be present in the local working directory of the
task for which I am using the following statements: 


DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"), conf);
DistributedCache.createSymlink(conf);



Here Config is a directory which I have zipped and put at the given
location in HDFS


I have zipped the directory because the API doc of DistributedCache
(http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that the
archive files are unzipped in the local cache directory :

DistributedCache can be used to distribute simple, read-only data/text
files and/or more complex types such as archives, jars etc. Archives (zip,
tar and tgz/tar.gz files) are un-archived at the slave nodes.

So, from my understanding of the API docs I expect that the Config.zip
file will be unzipped to Config directory and since I have SymLinked them
I can access the directory in the following manner from my map function:

FileInputStream fin = new FileInputStream("Config/file1.config");

But I get the FileNotFoundException on the execution of this statement.
Please let me know where I am going wrong.

Thanks,
Akhil




  




Re: Using addCacheArchive

2009-06-25 Thread Amareshwari Sriramadasu

Is your HDFS path /home/akhil1988/Config.zip? Usually an HDFS path is of the form
/user/akhil1988/Config.zip.
Just wondering if you are giving the wrong path in the URI!
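A quick way to check (a hedged sketch; the two paths are just the candidates discussed above) is to ask the FileSystem directly:

  FileSystem fs = FileSystem.get(conf);
  System.out.println(fs.exists(new Path("/home/akhil1988/Config.zip")));
  System.out.println(fs.exists(new Path("/user/akhil1988/Config.zip")));

Alternatively, bin/hadoop fs -ls /user/akhil1988 from the command line shows the same thing.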

Thanks
Amareshwari

akhil1988 wrote:

Thanks Amareshwari for your reply!

The file Config.zip is in HDFS; if it were not, the error would have been
reported by the jobtracker itself while executing the statement:
DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"), conf);

But I get error in the map function when I try to access the Config
directory. 

Now I am using the following statement but still getting the same error: 
DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip#Config"), conf);

Do you think there could be any problem in distributing a zipped
directory and having Hadoop unzip it recursively?

Thanks!
Akhil



Amareshwari Sriramadasu wrote:
  

Hi Akhil,

DistributedCache.addCacheArchive takes path on hdfs. From your code, it
looks like you are passing local path.
Also, if you want to create symlink, you should pass URI as
hdfs://path#linkname, besides calling  
DistributedCache.createSymlink(conf);


Thanks
Amareshwari


akhil1988 wrote:


Please ask any questions if I am not clear above about the problem I am
facing.

Thanks,
Akhil

akhil1988 wrote:
  
  

Hi All!

I want a directory to be present in the local working directory of the
task for which I am using the following statements: 


DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"), conf);
DistributedCache.createSymlink(conf);




Here Config is a directory which I have zipped and put at the given
location in HDFS



I have zipped the directory because the API doc of DistributedCache
(http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
the
archive files are unzipped in the local cache directory :

DistributedCache can be used to distribute simple, read-only data/text
files and/or more complex types such as archives, jars etc. Archives
(zip,
tar and tgz/tar.gz files) are un-archived at the slave nodes.

So, from my understanding of the API docs I expect that the Config.zip
file will be unzipped to Config directory and since I have SymLinked
them
I can access the directory in the following manner from my map function:

FileInputStream fin = new FileInputStream("Config/file1.config");

But I get the FileNotFoundException on the execution of this statement.
Please let me know where I am going wrong.

Thanks,
Akhil



  
  





  




Re: where is the addDependingJob?

2009-06-24 Thread Amareshwari Sriramadasu

HRoger wrote:

Hi
As you know, in org.apache.hadoop.mapred.jobcontrol.Job there is a
method called addDependingJob, but not in
org.apache.hadoop.mapreduce.Job. Is there some method that works like
addDependingJob in the mapreduce package?

  
org.apache.hadoop.mapred.jobcontrol.Job is moved to
org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob in 0.21. In
0.20, the corresponding class for
org.apache.hadoop.mapred.jobcontrol.Job with the new API is not present. So,
in 0.20, you should use org.apache.hadoop.mapred.jobcontrol.Job.
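A minimal sketch with the old (0.20) API, assuming jobConfA and jobConfB are two already-configured JobConf objects (the names here are illustrative only):

  import org.apache.hadoop.mapred.jobcontrol.Job;
  import org.apache.hadoop.mapred.jobcontrol.JobControl;

  Job jobA = new Job(jobConfA);
  Job jobB = new Job(jobConfB);
  jobB.addDependingJob(jobA);              // jobB starts only after jobA succeeds
  JobControl control = new JobControl("chained-jobs");
  control.addJob(jobA);
  control.addJob(jobB);
  new Thread(control).start();             // run() blocks, so drive it from a thread
  while (!control.allFinished()) {
      try { Thread.sleep(1000); } catch (InterruptedException e) { break; }
  }
  control.stop();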


Thanks
Amareshwari




Re: where is the addDependingJob?

2009-06-24 Thread Amareshwari Sriramadasu

You can use 0.21-dev.
If not, you can try using old api jobcontrol to create dependingJobs by 
getting the conf from

org.apache.hadoop.mapreduce.Job.getConfiguration().

Thanks
Amareshwari
HRoger wrote:

Thanks for your answer. I am using 0.20 and programming with the new API, so
how can I make one job run after the other job in one class with the new
API?

Amareshwari Sriramadasu wrote:
  

HRoger wrote:


Hi
As you know in the org.apache.hadoop.mapred.jobcontrol.Job there is a
method called addDependingJob but not in
org.apache.hadoop.mapreduce.Job.Is there some method works like
addDependingJob in mapreduce package?

  
  
org.apache.hadoop.mapred.jobcontrol.Job is moved to 
org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob in 0.21. In 
0.20, The corresponding class for 
org.apache.hadoop.mapred.jobcontrol.Job with new api is not present. So, 
in 0.20, you should use org.apache.hadoop.mapred.jobcontrol.Job.


Thanks
Amareshwari







  




Re: external jars in .20

2009-06-01 Thread Amareshwari Sriramadasu

Hi Lance,

Where are you passing the -libjars parameter? It is now a generic option;
it is no longer a parameter of the jar command.
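For example (the jar name, driver class and paths below are placeholders), the generic options go right after the driver class, and the driver must go through GenericOptionsParser or ToolRunner for them to take effect:

  bin/hadoop jar myjob.jar com.example.MyDriver \
      -libjars /path/to/dep1.jar,/path/to/dep2.jar \
      <input> <output>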


Thanks
Amareshwari

Lance Riedel wrote:

We are trying to upgrade to .20 from 19.1 due to several issues we are
having. Now our jobs are failing with class-not-found exceptions.

I am very confused about the final state for using external jars in .20.

-libjars no longer works

placing all dependent jars in the jar's /lib directory doesn't work

See:
https://issues.apache.org/jira/browse/HADOOP-4612

where the code:

//adding libjars to the classpath
Configuration conf = JobClient.getCommandLineConfig();
URL[] libJars = GenericOptionsParser.getLibJars(conf);
if(libJars!=null) {
  for(URL url : libJars){
classPath.add(url);
  }
}

Has been removed, and it says this is no longer needed. Where are the docs
for this change?


Thanks,
Lance

  




Re: intermediate files of killed tasks not purged

2009-04-28 Thread Amareshwari Sriramadasu

Hi Sandhya,

 Which version of HADOOP are you using? There could be attempt_id
directories in mapred/local, pre 0.17. Now, there should not be any such
directories.
From version 0.17 onwards, the attempt directories will be present only
at mapred/local/taskTracker/jobCache/<jobid>/<attemptid>. If you are
seeing the directories in any other location, then it seems like a bug.


HADOOP-4654 is to cleanup temporary data in DFS for failed tasks, it 
does not change local FileSystem files.


Thanks
Amareshwari
Edward J. Yoon wrote:

Hi,

It seems related with https://issues.apache.org/jira/browse/HADOOP-4654.

On Tue, Apr 28, 2009 at 4:01 PM, Sandhya E sandhyabhas...@gmail.com wrote:
  

Hi

Under hadoop-tmp-dir/mapred/local there are directories like
attempt_200904262046_0026_m_02_0
Each of these directories contains files of format: intermediate.1
intermediate.2  intermediate.3  intermediate.4  intermediate.5
There are many directories in this format. All these correspond to
killed task attempts. As they contain huge intermediate files, we
landed up in disk space issues.

They are cleaned up  when mapred cluster is restarted. But otherwise,
how can these be cleaned up without having to restart cluster.

Conf parameter keep.failed.task.files is set to false in our case.

Many Thanks
Sandhya






  




Re: intermediate files of killed tasks not purged

2009-04-28 Thread Amareshwari Sriramadasu
Again, where are you seeing the attemptid directories? Are they at
mapred/local/<attemptid> or at
mapred/local/taskTracker/jobCache/<jobid>/<attemptid>?
If you are seeing files at mapred/local/<attemptid>, then it is a bug.
Please raise a jira and attach tasktracker logs if possible.
If not, mapred/local/taskTracker/jobCache/<jobid>/<attemptid> directories
are cleaned up on a KillTaskAction and
mapred/local/taskTracker/jobCache/<jobid> directories are cleaned up on
KillJobAction. Can you verify from the TaskTracker logs whether the attemptid got a
KillTaskAction or the jobid got a KillJobAction? If not, this is fixed by
HADOOP-5247.


Thanks
Amareshwari

Sandhya E wrote:

Hi Amareshwari

We are on the 0.18 version. I verified from the jobtracker website that not
all killed tasks have leftovers in mapred/local. Also, some
tasks that were successful have left their tmp folders in mapred/local.

Can you please give some pointers on how to debug it further.

Regards
Sandhya

On Tue, Apr 28, 2009 at 2:02 PM, Amareshwari Sriramadasu
amar...@yahoo-inc.com wrote:
  

Hi Sandhya,

 Which version of HADOOP are you using? There could be attempt_id
directories in mapred/local, pre 0.17. Now, there should not be any such
directories.
From version 0.17 onwards, the attempt directories will be present only at
mapred/local/taskTracker/jobCache/jobid/attempid . If you are seeing the
directories in any other location, then it seems like a bug.

HADOOP-4654 is to cleanup temporary data in DFS for failed tasks, it does
not change local FileSystem files.

Thanks
Amareshwari
Edward J. Yoon wrote:


Hi,

It seems related with https://issues.apache.org/jira/browse/HADOOP-4654.

On Tue, Apr 28, 2009 at 4:01 PM, Sandhya E sandhyabhas...@gmail.com
wrote:

  

Hi

Under hadoop-tmp-dir/mapred/local there are directories like
attempt_200904262046_0026_m_02_0
Each of these directories contains files of format: intermediate.1
intermediate.2  intermediate.3  intermediate.4  intermediate.5
There are many directories in this format. All these correspond to
killed task attempts. As they contain huge intermediate files, we
landed up in disk space issues.

They are cleaned up  when mapred cluster is restarted. But otherwise,
how can these be cleaned up without having to restart cluster.

Conf parameter keep.failed.task.files is set to false in our case.

Many Thanks
Sandhya






  





Re: job status from command prompt

2009-04-05 Thread Amareshwari Sriramadasu

Elia Mazzawi wrote:
is there a command that I can run from the shell that says this job
passed / failed?


I found these, but they don't really say pass/fail; they only say what
is running and percent complete.


this shows what is running
./hadoop job -list

and this shows the completion
./hadoop job -status job_200903061521_0045

The following command lists all jobs in prep, running, completed:
./hadoop job -list all
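If you need a programmatic pass/fail answer, here is a hedged sketch with the old API (the job id is just the one from the example above):

  JobClient client = new JobClient(new JobConf());
  RunningJob job = client.getJob(JobID.forName("job_200903061521_0045"));
  if (job != null && job.isComplete()) {
      System.out.println(job.isSuccessful() ? "passed" : "failed");
  }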

-Amareshwari



Re: Hadoop streaming performance: elements vs. vectors

2009-04-05 Thread Amareshwari Sriramadasu
You can add your jar to the distributed cache and add it to the classpath by
passing it in the configuration property mapred.job.classpath.archives.
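A hedged command-line sketch (the jar name, HDFS path and remaining options are placeholders; the jar is assumed to have been copied to HDFS already):

  bin/hadoop jar contrib/streaming/hadoop-0.18.3-streaming.jar \
      -cacheArchive hdfs:///user/peter/combiner.jar#combinerjar \
      -jobconf mapred.job.classpath.archives=/user/peter/combiner.jar \
      -mapper mapper.py -combiner MyCombiner -reducer reducer.py \
      -input in -output out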


-Amareshwari
Peter Skomoroch wrote:

If I need to use a custom streaming combiner jar in Hadoop 18.3, is there a
way to add it to the classpath without the following patch?

https://issues.apache.org/jira/browse/HADOOP-3570

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200809.mbox/%3c48cf78e3.10...@yahoo-inc.com%3e

On Sat, Mar 28, 2009 at 2:28 PM, Peter Skomoroch
peter.skomor...@gmail.comwrote:

  

Paco,

Thanks, good ideas on the combiner.  I'm going to tweak things a bit as you
suggest and report back later...

-Pete


On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN cet...@gmail.com wrote:



hi peter,
thinking aloud on this -

trade-offs may depend on:

  * how much grouping would be possible (tracking a PDF would be
interesting for metrics)
  * locality of key/value pairs (distributed among mapper and reducer
tasks)

to that point, will there be much time spent in the shuffle?  if so,
it's probably cheaper to shuffle/sort the grouped row vectors than the
many small key,value pairs

in any case, when i had a similar situation on a large data set (2-3
Tb shuffle) a good pattern to follow was:

  * mapper emitted small key,value pairs
  * combiner grouped into row vectors

that combiner may get invoked both at the end of the map phase and at
the beginning of the reduce phase (more benefit)

also, using byte arrays if possible to represent values may be able to
save much shuffle time

best,
paco


On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch
peter.skomor...@gmail.com wrote:
  

Hadoop streaming question: If I am forming a matrix M by summing a number of
elements generated on different mappers, is it better to emit tons of lines
from the mappers with small key,value pairs for each element, or should I
group them into row vectors before sending to the reducers?

For example, say I'm summing frequency count matrices M for each user on a
different map task, and the reducer combines the resulting sparse user count
matrices for use in another calculation.

Should I emit the individual elements:

i (j, Mij) \n
3 (1, 3.4) \n
3 (2, 3.4) \n
3 (3, 3.4) \n
4 (1, 2.3) \n
4 (2, 5.2) \n

Or posting list style vectors?

3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
4 ((1, 2.3), (2, 5.2)) \n

Using vectors will at least save some message space, but are there any other
benefits to this approach in terms of Hadoop streaming overhead (sorts
etc.)?  I think buffering issues will not be a huge concern since the length
of the vectors have a reasonable upper bound and will be in a sparse
format...


--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch




--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch






  




Re: reduce task failing after 24 hours waiting

2009-03-25 Thread Amareshwari Sriramadasu
Set mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours
to a higher value. By default, their values are 24 hours. These might be
the reason for the failure, though I'm not sure.
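For example, in hadoop-site.xml on the jobtracker and tasktrackers (72 hours here is an arbitrary value, not something from this thread; the retire interval is in milliseconds):

  <property>
    <name>mapred.jobtracker.retirejob.interval</name>
    <value>259200000</value>
  </property>
  <property>
    <name>mapred.userlog.retain.hours</name>
    <value>72</value>
  </property>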


Thanks
Amareshwari

Billy Pearson wrote:
I am seeing on one of my long-running jobs (about 50-60 hours) that
after 24 hours all
active reduce tasks fail with the error message

java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

Is there something in the config that I can change to stop this?

Every time, within 1 min of the 24-hour mark, they all fail at the same time.
This wastes a lot of resources downloading the map outputs and merging them
again.


Billy






Re: Unable to access job details

2009-03-22 Thread Amareshwari Sriramadasu
Can you look for the exception from Jetty in the JobTracker logs and report it here? That
would tell us the cause of the ERROR 500.


Thanks
Amareshwari
Nathan Marz wrote:
Sometimes I am unable to access a job's details and instead only see the following.
I am seeing this on the 0.19.2 branch.


HTTP ERROR: 500

Internal Server Error

RequestURI=/jobdetails.jsp

Powered by Jetty://

Does anyone know the cause of this?




Re: Task Side Effect files and copying(getWorkOutputPath)

2009-03-16 Thread Amareshwari Sriramadasu

Saptarshi Guha wrote:

Hello,
I would like to produce side effect files which will be later copied
to the outputfolder.
I am using FileOuputFormat, and in the Map's close() method i copy
files (from the local tmp/ folder) to
FileOutputFormat.getWorkOutputPath(job);

  

FileOutputFormat.getWorkOutputPath(job) is the correct method to get the directory
for task side-effect files.

You should not use the close() method, because promotion to the output directory
happens before close(). You can use the configure() method.

See org.apache.hadoop.tools.HadoopArchives.

void close() {
    if (shouldcopy) {
        ArrayList<Path> lop = new ArrayList<Path>();
        for (String ff : tempdir.list()) {
            lop.add(new Path(temppfx + ff));
        }
        dstFS.moveFromLocalFile(lop.toArray(new Path[]{}), dstPath);
    }
}

However, this throws an error java.io.IOException:
`hdfs://X:54310/tmp/testseq/_temporary/_attempt_200903160945_0010_m_00_0':
specified destination directory doest not exist

I thought this is the right place to drop side-effect files. Prior
to this I was copying to the output folder, but many were not copied,
or in fact all were, but during the reduce output stage many were
deleted - I am not sure (I used NullOutputFormat and all the files were
present in the output folder). So I resorted to getWorkOutputPath,
which threw the above exception.

So if I'm using FileOutputFormat, and my maps and/or reduces produce
side effects files on the localFS
1) when should I copy them to the DFS (e.g. in the close method, or one at
a time in the map/reduce method)?
2) Where should I copy them to?

I am using Hadoop 0.19 and have set jobConf.setNumTasksToExecutePerJvm(-1);
Also, each side effect file produced has a unique name, i.e there is
no overwriting.
  
You need not set jobConf.setNumTasksToExecutePerJvm(-1); even otherwise,
each attempt will have a unique work output path.
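A hedged sketch of the pattern described above (field names and the local path are illustrative, not taken from the original code): grab the work output path once in configure() and write side-effect files straight into it, so the framework promotes them along with the regular output.

  private Path sideEffectDir;
  private FileSystem fs;

  public void configure(JobConf job) {
      try {
          sideEffectDir = FileOutputFormat.getWorkOutputPath(job);   // .../_temporary/_attempt_...
          fs = sideEffectDir.getFileSystem(job);
      } catch (IOException e) {
          throw new RuntimeException(e);
      }
  }

  // later, inside map() or reduce():
  // fs.copyFromLocalFile(new Path("/tmp/side-file"), new Path(sideEffectDir, "side-file"));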


Thanks
Amareshwari


Re: Reducers spawned when mapred.reduce.tasks=0

2009-03-15 Thread Amareshwari Sriramadasu
Instantiation of the Reducer is moved to the place where reduce() gets
called, in branch 0.19.1. See HADOOP-5002. Hope that solves your
issue with the configure() method.
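A minimal sketch of the workaround Chris describes below, guarding reduce-side setup when the attempt was really spawned for job/task cleanup with zero reduces:

  public void configure(JobConf job) {
      if (job.getNumReduceTasks() == 0) {
          return;               // cleanup attempt; reduce() will never be called
      }
      // ... initialize resources used by reduce() ...
  }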

Thanks
Amareshwari
Chris K Wensel wrote:

fwiw, we have released a workaround for this issue in Cascading 1.0.5.

http://www.cascading.org/
http://cascading.googlecode.com/files/cascading-1.0.5.tgz

In short, Hadoop 0.19.0 and .1 instantiate the users Reducer class and 
subsequently calls configure() when there is no intention to use the 
class (during job/task cleanup tasks).


This clearly can cause havoc for users who use configure() to 
initialize resources used by the reduce() method.


Testing for jobConf.getNumReduceTasks() is 0 inside the configure() 
method seems to work out well.


branch-0.19 looks like it won't instantiate the Reducer class during 
job/task cleanup tasks, so I expect will leak into future releases.


cheers,

ckw

On Mar 12, 2009, at 8:20 PM, Amareshwari Sriramadasu wrote:


Are you seeing reducers getting spawned from web ui? then, it is a bug.
If not, there won't be reducers spawned, it could be job-setup/ 
job-cleanup task that is running on a reduce slot. See HADOOP-3150 
and HADOOP-4261.

-Amareshwari
Chris K Wensel wrote:


May have found the answer, waiting on confirmation from users.

Turns out 0.19.0 and .1 instantiate the reducer class when the task 
is actually intended for job/task cleanup.


branch-0.19 looks like it resolves this issue by not instantiating 
the reducer class in this case.


I've got a workaround in the next maint release:
http://github.com/cwensel/cascading/tree/wip-1.0.5

ckw

On Mar 12, 2009, at 10:12 AM, Chris K Wensel wrote:


Hey all

Have some users reporting intermittent spawning of Reducers when 
the job.xml shows mapred.reduce.tasks=0 in 0.19.0 and .1.


This is also confirmed when jobConf is queried in the (supposedly 
ignored) Reducer implementation.


In general this issue would likely go unnoticed since the default 
reducer is IdentityReducer.


but since it should be ignored in the Mapper only case, we don't 
bother not setting the value, and subsequently comes to ones 
attention rather abruptly.


am happy to open a JIRA, but wanted to see if anyone else is 
experiencing this issue.


note the issue seems to manifest with or without spec exec.

ckw

--Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



--Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/





--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/





Re: streaming inputformat: class not found

2009-03-11 Thread Amareshwari Sriramadasu
Till 0.18.x, files are not added to the client-side classpath. Use 0.19,
and run the following command to use a custom input format:


bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar -mapper
mapper.pl -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input
test.data -output test-output -file pathToMapper.pl -inputformat MyFormatter
-libjars jar-containing-custom-input-format

Thanks
Amareshwari
t-alleyne wrote:

Hello,

I'm trying to run a mapreduce job on a data file in which the keys and values
alternate rows.  E.g.

key1
value1
key2
...

I've written my own InputFormat by extending FileInputFormat (the code for
this class is below.)  The problem is that when I run hadoop streaming with
the command

bin/hadoop jar contrib/streaming/hadoop-0.18.3-streaming.jar -mapper
mapper.pl -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input
test.data -output test-output -file pathToMapper.pl -inputformat
MyFormatter

I get the error

-inputformat : class not found : MyFormatter
java.lang.RuntimeException: -inputformat : class not found : MyFormatter
at org.apache.hadoop.streaming.StreamJob.fail(StreamJob.java:550)
...

I have tried putting .java, .class, and .jar file of MyFormatter in the job
jar using the -file parameter.  I have also tried putting them on the hdfs
using -copyFromLocal, but I still get the same error.  Can anyone give me
some hints as to what the problem might be?  Also, I tried to hack together
my formatter based on the hadoop examples, so does it seem like it should
properly process the input files I described above?

Trevis


imports omitted

public final class MyFormatter extends
        org.apache.hadoop.mapred.FileInputFormat<Text, Text>
{
    @Override
    public RecordReader<Text, Text> getRecordReader( InputSplit split,
            JobConf job, Reporter reporter ) throws IOException
    {
        return new MyRecordReader( job, (FileSplit) split );
    }

    static class MyRecordReader implements RecordReader<Text, Text>
    {
        private LineRecordReader _in   = null;
        private LongWritable     _junk = null;

        public MyRecordReader( JobConf job, FileSplit split ) throws IOException
        {
            _junk = new LongWritable();
            _in = new LineRecordReader( job, split );
        }

        @Override
        public void close() throws IOException
        {
            _in.close();
        }

        @Override
        public Text createKey()
        {
            return new Text();
        }

        @Override
        public Text createValue()
        {
            return new Text();
        }

        @Override
        public long getPos() throws IOException
        {
            return _in.getPos();
        }

        @Override
        public float getProgress() throws IOException
        {
            return _in.getProgress();
        }

        @Override
        public boolean next( Text key, Text value ) throws IOException
        {
            if ( _in.next( _junk, key ) )
            {
                if ( _in.next( _junk, value ) )
                {
                    return true;
                }
            }
            key.clear();
            value.clear();
            return false;
        }
    }
}
  




Re: Jobs stalling forever

2009-03-10 Thread Amareshwari Sriramadasu

This is due to HADOOP-5233. Got fixed in branch 0.19.2

-Amareshwari
Nathan Marz wrote:
Every now and then, I have jobs that stall forever with one map task 
remaining. The last map task remaining says it is at 100% and in the 
logs, it says it is in the process of committing. However, the task 
never times out, and the job just sits there forever. Has anyone else 
seen this? Is there a JIRA ticket open for it already?




Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Amareshwari Sriramadasu

Are you hitting HADOOP-2771?
-Amareshwari
Sandy wrote:

Hello all,

For the sake of benchmarking, I ran the standard hadoop wordcount example on
an input file using 2, 4, and 8 mappers and reducers for my job.
In other words,  I do:

time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 2 -r 2
sample.txt output
time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 4 -r 4
sample.txt output2
time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m 8 -r 8
sample.txt output3

Strangely enough, this increase in mappers and reducers results in
slower running times!
- On 2 mappers and reducers it ran for 40 seconds
- On 4 mappers and reducers it ran for 60 seconds
- On 8 mappers and reducers it ran for 90 seconds!

Please note that the sample.txt file is identical in each of these runs.

I have the following questions:
- Shouldn't wordcount get -faster- with additional mappers and reducers,
instead of slower?
- If it does get faster for other people, why does it become slower for me?
  I am running hadoop on psuedo-distributed mode on a single 64-bit Mac Pro
with 2 quad-core processors, 16 GB of RAM and 4 1TB HDs

I would greatly appreciate it if someone could explain this behavior to me,
and tell me if I'm running this wrong. How can I change my settings (if at
all) to get wordcount running faster when I increase the number of maps
and reduces?

Thanks,
-SM

  




Re: Throwing an IOException in Map, yet task does not fail

2009-03-05 Thread Amareshwari Sriramadasu

Is your job a streaming job?
If so, which version of hadoop are you using? What is the configured
value for stream.non.zero.exit.is.failure? Can you set
stream.non.zero.exit.is.failure to true and try again?
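For example (only the -jobconf line matters here; the other options are placeholders for your existing streaming invocation):

  bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
      -jobconf stream.non.zero.exit.is.failure=true \
      -mapper your-mapper.sh -input in -output out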

Thanks
Amareshwari
Saptarshi Guha wrote:

Hello,
I have given a case where my mapper should fail. That is, based on a
result it throws an exception
if(res==0) throw new IOException("Error in code!, see stderr/out");
When i go to the JobTracker website, e.g
http://tracker.com:50030/jobdetails.jsp?jobid=job_200903051709_0024refresh=30
and click on one of the running tasks, I see an IOException in the
errors column.
But on the jobtracker page for the job, it doesn't fail - it stays in
the running column , never moving to the failed/killed columns (not
even after 10 minutes)

Why so?
Regards


Saptarshi Guha
  




Re: Hadoop Streaming -file option

2009-02-24 Thread Amareshwari Sriramadasu

Arun C Murthy wrote:


On Feb 23, 2009, at 2:01 AM, Bing TANG wrote:


Hi, everyone,
Could someone tell me the principle of -file when using Hadoop
Streaming. I want to ship a big file to the slaves, so how does it work?

Does Hadoop use SCP to copy? How does Hadoop deal with the -file option?



No, -file just copies the file from the local filesystem to HDFS, and 
the DistributedCache copies it to the local filesystem of the node on 
which the map/reduce task runs.


The -file option does not use DistributedCache yet. HADOOP-2622 is still
open for the same. The -file option ships the files along with the streaming
jar (it unpacks the jar, copies the files in and packs the jar again). You
can use -files, -libjars and -archives to copy the files to the distributed
cache.
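A hedged sketch of shipping a big file through the distributed cache instead of -file (the file name and the remaining options are placeholders; -files is a generic option, so it should come before the streaming-specific options):

  bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar \
      -files /local/path/bigfile.dat \
      -mapper mapper.sh -input in -output out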

-Amareshwari

Arun





Re: FAILED_UNCLEAN?

2009-02-24 Thread Amareshwari Sriramadasu

Nathan Marz wrote:
I have a large job operating on over 2 TB of data, with about 5 
input splits. For some reason (as yet unknown), tasks started failing 
on two of the machines (which got blacklisted). 13 mappers failed in 
total. Of those 13, 8 of the tasks were able to execute on another 
machine without any issues. 5 of the tasks *did not* get re-executed 
on another machine, and their status is marked as FAILED_UNCLEAN. 
Anyone have any idea what's going on? Why isn't Hadoop running these 
tasks on other machines?


Has the job failed, been killed, or succeeded when you see this situation? Once
the job completes, the unclean attempts will not get scheduled.
If not, are there other jobs of higher priority running at the same time
preventing the cleanups from being launched?

What version of Hadoop are you using? latest trunk?

Thanks
Amareshwari

Thanks,
Nathan Marz






Re: How to use Hadoop API to submit job?

2009-02-20 Thread Amareshwari Sriramadasu

You should implement the Tool interface and submit jobs through ToolRunner.
For example, see org.apache.hadoop.examples.WordCount.
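A minimal hedged sketch of a Tool-based driver (the class and job names are placeholders; set your mapper/reducer the same way WordCount does):

  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class MyJobDriver extends Configured implements Tool {
      public int run(String[] args) throws Exception {
          JobConf conf = new JobConf(getConf(), MyJobDriver.class);
          conf.setJobName("my-job");
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          // set mapper/reducer/output classes here
          JobClient.runJob(conf);
          return 0;
      }

      public static void main(String[] args) throws Exception {
          System.exit(ToolRunner.run(new MyJobDriver(), args));
      }
  }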

-Amareshwari
Wu Wei wrote:

Hi,

I used to submit Hadoop jobs with the utility RunJar.main() on hadoop
0.18. On hadoop 0.19, because the commandLineConfig of JobClient was
null, I got a NullPointerException error when RunJar.main() calls
GenericOptionsParser to get libJars (0.18 didn't do this call). I also
tried the class JobShell to submit the job, but it catches all exceptions
and sends them to stderr so that I can't handle the exceptions myself.


I noticed that if I could call JobClient's setCommandLineConfig method,
everything would go easily. But this method has default package
accessibility, so I can't see the method outside the package
org.apache.hadoop.mapred.


Any advice on using the Java APIs to submit jobs?

Wei




Re: Persistent completed jobs status not showing in jobtracker UI

2009-02-18 Thread Amareshwari Sriramadasu

Bill Au wrote:

I have enabled persistent completed jobs status and can see them in HDFS.
However, they are not listed in the jobtracker's UI after the jobtracker is
restarted.  I thought that the jobtracker will automatically look in HDFS if it
does not find a job in its memory cache.  What am I missing?  How do I
retrieve the persistent completed job status?

Bill

  
The JobTracker web UI doesn't look at persistent storage after a restart.
You can access the old jobs from job history. The History link is accessible
from the web UI.

-Amareshwari



Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf

2009-02-18 Thread Amareshwari Sriramadasu

Yes. The configuration is read only when the taskTracker starts.
You can see more discussion on jira HADOOP-5170 
(http://issues.apache.org/jira/browse/HADOOP-5170) for making it per job.

-Amareshwari
jason hadoop wrote:

I certainly hope it changes but I am unaware that it is in the todo queue at
present.

2009/2/18 S D sd.codewarr...@gmail.com

  

Thanks Jason. That's useful information. Are you aware of plans to change
this so that the maximum values can be changed without restarting the
server?

John

2009/2/18 jason hadoop jason.had...@gmail.com



The .maximum values are only loaded by the Tasktrackers at server start
time
at present, and any changes you make will be ignored.


2009/2/18 S D sd.codewarr...@gmail.com

  

Thanks for your response Rasit. You may have missed a portion of my post.

On a different note, when I attempt to pass params via -D I get a usage
message; when I use -jobconf the command goes through (and works in the case
of mapred.reduce.tasks=0 for example) but I get a deprecation warning).

I'm using Hadoop 0.19.0 and -D is not working. Are you using version 0.19.0
as well?

John

On Wed, Feb 18, 2009 at 9:14 AM, Rasit OZDAS rasitoz...@gmail.com wrote:

John, did you try -D option instead of -jobconf,

I had -D option in my code, I changed it with -jobconf, this is what I get:

...
...
Options:
  -input <path>     DFS input file(s) for the Map step
  -output <path>    DFS output directory for the Reduce step
  -mapper <cmd|JavaClassName>      The streaming command to run
  -combiner <JavaClassName>  Combiner has to be a Java class
  -reducer <cmd|JavaClassName>     The streaming command to run
  -file <file>      File/dir to be shipped in the Job jar file
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName  Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner JavaClassName  Optional.
  -numReduceTasks <num>  Optional.
  -inputreader <spec>  Optional.
  -cmdenv   <n>=<v>    Optional. Pass env.var to streaming commands
  -mapdebug <path>  Optional. To run this script when a map task fails
  -reducedebug <path>  Optional. To run this script when a reduce task fails
  -verbose

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to
be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files
to include in the classpath.
-archives <comma separated list of archives>    specify comma separated
archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

For more details about these options:
Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info


I think -jobconf is not used in v.0.19.

2009/2/18 S D sd.codewarr...@gmail.com

I'm having trouble overriding the maximum number of map tasks that run on a
given machine in my cluster. The default value of
mapred.tasktracker.map.tasks.maximum is set to 2 in hadoop-default.xml. When
running my job I passed

-jobconf mapred.tasktracker.map.tasks.maximum=1

to limit map tasks to one per machine but each machine was still allocated 2
map tasks (simultaneously).  The only way I was able to guarantee a maximum
of one map task per machine was to change the value of the property in
hadoop-site.xml. This is unsatisfactory since I'll often be changing the
maximum on a per job basis. Any hints?

On a different note, when I attempt to pass params via -D I get a usage
message; when I use -jobconf the command goes through (and works in the case
of mapred.reduce.tasks=0 for example) but I get  a deprecation warning).

Thanks,
John


--
M. Raşit ÖZDAŞ




Re: Testing with Distributed Cache

2009-02-10 Thread Amareshwari Sriramadasu

Nathan Marz wrote:
I have some unit tests which run MapReduce jobs and test the 
inputs/outputs in standalone mode. I recently started using 
DistributedCache in one of these jobs, but now my tests fail with 
errors such as:


Caused by: java.io.IOException: Incomplete HDFS URI, no host: 
hdfs:///tmp/file.data
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:70) 

at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at 
org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:472) 

at 
org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:676) 




Does anyone know of a way to get DistributedCache working in a test 
environment?
You can look at the source code for 
org.apache.hadoop.mapred.TestMiniMRDFSCaching.
Also, DistributedCache does not work with LocalJobRunner; see
http://issues.apache.org/jira/browse/HADOOP-2914

-Amareshwari


Re: only one reducer running in a hadoop cluster

2009-02-08 Thread Amareshwari Sriramadasu

Nick Cen wrote:

Hi,

I have a hadoop cluster with 4 PCs. I want to integrate hadoop and
lucene together, so I copied some of the source code from nutch's Indexer
class, but when I run my job, I found that there is only 1 reducer running
on 1 PC, so the performance is not as good as expected.

  

what is the configuration of mapred.tasktracker.reduce.tasks.maximum ?


-Amareshwari


Re: Task tracker archive contains too many files

2009-02-04 Thread Amareshwari Sriramadasu

Andrew wrote:
I've noticed that task tracker moves all unpacked jars into 
${hadoop.tmp.dir}/mapred/local/taskTracker.


We are using a lot of external libraries that are deployed via the -libjars
option. The total number of files after unpacking is about 20 thousand.


After running a number of jobs, tasks start to be killed with timeout reason 
(Task attempt_200901281518_0011_m_000173_2 failed to report status for 601 
seconds. Killing!). All killed tasks are in initializing state. I've 
watched the tasktracker logs and found such messages:



Thread 20926 (Thread-10368):
  State: BLOCKED
  Blocked count: 3611
  Waited count: 24
  Blocked on java.lang.ref.reference$l...@e48ed6
  Blocked by 20882 (Thread-10341)
  Stack:
java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
java.lang.StringCoding.encode(StringCoding.java:272)
java.lang.String.getBytes(String.java:947)
java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:228)
java.io.File.isDirectory(File.java:754)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:427)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)
org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:433)


This is exactly as in HADOOP-4780.
As I understand, the patch brings code which stores a map of directories along
with their DUs, thus reducing the number of calls to DU. This must help, but
the process of deleting 2 files takes too long. I've manually deleted the
archive after 10 jobs had run and it took over 30 minutes on XFS. Three times
more than the default timeout for tasks!


Is there a way to prohibit unpacking of jars? Or at least not to keep the
archive? Or any other better way to solve this problem?


Hadoop version: 0.19.0.


  
Now, there is no way to stop DistributedCache from unpacking
jars. I think it should have an option (through configuration) whether to
unpack or not.

Can you raise a jira for the same?

Thanks
Amareshwari


Re: Hadoop Streaming Semantics

2009-02-02 Thread Amareshwari Sriramadasu

S D wrote:

Thanks for your response. I'm using version 0.19.0 of Hadoop.
I tried your suggestion. Here is the line I use to invoke Hadoop

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \
   -input /user/hadoop/hadoop-input/inputFile.txt \
   -output /user/hadoop/hadoop-output \
   -mapper map-script.sh \
   -file map-script.sh \
   -file additional-script.rb \  # Called by map-script.sh
   -file utils.rb \
   -file env.sh \
   -file aws-s3-credentials-file \  # For permissions to use AWS::S3
   -jobconf mapred.reduce.tasks=0 \
   -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat

Everything works fine if the -inputformat switch is not included but when I
include it I get the following message:
   ERROR streaming.StreamJob: Job not Successful!
and a Runtime exception shows up in the jobtracker log:
   PipeMapRed.waitOutputThreads(): subprocess failed with code 1

My map functions read each line of the input file and create a directory
(one for each line) on Hadoop (in our case S3 Native) in which corresponding
data is produced and stored. The name of the created directories are based
on the contents of the corresponding line. When I include the -inputformat
line above I've noticed that instead of the directories I'm expecting (named
after the data found in the input file), the directories are given seemingly
arbitrary numeric names; e.g., when the input file contained four lines of
data, the directories were named: 0, 273, 546 and 819.

  

LineRecordReader reads the line as the VALUE and the KEY is the offset in the file.
Looks like your directories are getting named with the KEY. But I don't see
any reason for that, because it is working fine with TextInputFormat
(both TextInputFormat and NLineInputFormat use LineRecordReader).


-Amareshwari

Any thoughts?

John

On Sun, Feb 1, 2009 at 11:00 PM, Amareshwari Sriramadasu 
amar...@yahoo-inc.com wrote:

  

Which version of hadoop are you using?

You can directly use -inputformat
org.apache.hadoop.mapred.lib.NLineInputFormat for your streaming job. You
need not include it in your streaming jar.
-Amareshwari


S D wrote:



Thanks for your response Amereshwari. I'm unclear on how to take advantage
of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the
streaming jar file (contrib/streaming/hadoop-version-streaming.jar) to
include the NLineInputFormat class and then pass a command line
configuration param to indicate that NLineInputFormat should be used? If
this is the proper approach, can you point me to an example of what kind
of
param should be specified? I appreciate your help.

Thanks,
SD

On Thu, Jan 29, 2009 at 10:49 PM, Amareshwari Sriramadasu 
amar...@yahoo-inc.com wrote:



  

You can use NLineInputFormat for this, which splits one line (N=1, by
default) as one split.
So, each map task processes one line.
See

http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html

-Amareshwari

S D wrote:





Hello,

I have a clarifying question about Hadoop streaming. I'm new to the list
and
didn't see anything posted that covers my questions - my apologies if I
overlooked a relevant post.

I have an input file consisting of a list of files (one per line) that
need
to be processed independently of each other. The duration for processing
each file is significant - perhaps an hour each. I'm using Hadoop
streaming
without a reduce function to process each file and save the results
(back
to
S3 native in my case). To handle to long processing time of each file
I've
set mapred.task.timeout=0 and I have a pretty straight forward Ruby
script
reading from STDIN:

STDIN.each_line do |line|
 # Get file from contents of line
 # Process file (long running)
end

Currently I'm using a cluster of 3 workers in which each worker can have
up
to 2 tasks running simultaneously. I've noticed that if I have a single
input file with many lines (more than 6 given my cluster), then not all
workers will be allocated tasks; I've noticed two workers being
allocated
one task each and the other worker sitting idly. If I split my input
file
into multiple files (at least 6) then all workers will be immediately
allocated the maximum number of tasks that they can handle.

My interpretation on this is fuzzy. It seems that Hadoop streaming will
take
separate input files and allocate a new task per file (up to the maximum
constraint) but if given a single input file it is unclear as to whether
a
new task is allocated per file or line. My understanding of Hadoop Java
is
that (unlike Hadoop streaming) when given a single input file, the file
will
be broken up into separate lines and the maximum number of map tasks
will
automagically be allocated to handle the lines of the file (assuming the
use
of TextInputFormat).

Can someone clarify this?

Thanks,
SD





  



  



  




Re: Hadoop Streaming Semantics

2009-02-01 Thread Amareshwari Sriramadasu

Which version of hadoop are you using?

You can directly use -inputformat 
org.apache.hadoop.mapred.lib.NLineInputFormat for your streaming job. 
You need not include it in your streaming jar.

-Amareshwari

S D wrote:

Thanks for your response Amereshwari. I'm unclear on how to take advantage
of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the
streaming jar file (contrib/streaming/hadoop-version-streaming.jar) to
include the NLineInputFormat class and then pass a command line
configuration param to indicate that NLineInputFormat should be used? If
this is the proper approach, can you point me to an example of what kind of
param should be specified? I appreciate your help.

Thanks,
SD

On Thu, Jan 29, 2009 at 10:49 PM, Amareshwari Sriramadasu 
amar...@yahoo-inc.com wrote:

  

You can use NLineInputFormat for this, which splits one line (N=1, by
default) as one split.
So, each map task processes one line.
See
http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html

-Amareshwari

S D wrote:



Hello,

I have a clarifying question about Hadoop streaming. I'm new to the list
and
didn't see anything posted that covers my questions - my apologies if I
overlooked a relevant post.

I have an input file consisting of a list of files (one per line) that
need
to be processed independently of each other. The duration for processing
each file is significant - perhaps an hour each. I'm using Hadoop
streaming
without a reduce function to process each file and save the results (back
to
S3 native in my case). To handle to long processing time of each file I've
set mapred.task.timeout=0 and I have a pretty straight forward Ruby script
reading from STDIN:

STDIN.each_line do |line|
  # Get file from contents of line
  # Process file (long running)
end

Currently I'm using a cluster of 3 workers in which each worker can have
up
to 2 tasks running simultaneously. I've noticed that if I have a single
input file with many lines (more than 6 given my cluster), then not all
workers will be allocated tasks; I've noticed two workers being allocated
one task each and the other worker sitting idly. If I split my input file
into multiple files (at least 6) then all workers will be immediately
allocated the maximum number of tasks that they can handle.

My interpretation on this is fuzzy. It seems that Hadoop streaming will
take
separate input files and allocate a new task per file (up to the maximum
constraint) but if given a single input file it is unclear as to whether a
new task is allocated per file or line. My understanding of Hadoop Java is
that (unlike Hadoop streaming) when given a single input file, the file
will
be broken up into separate lines and the maximum number of map tasks will
automagically be allocated to handle the lines of the file (assuming the
use
of TextInputFormat).

Can someone clarify this?

Thanks,
SD



  



  




Re: [ANNOUNCE] Hadoop release 0.18.3 available

2009-01-30 Thread Amareshwari Sriramadasu

Anum Ali wrote:

Hi,


Need some kind of guidance related to getting started with Hadoop installation and
system setup. I am a newbie regarding Hadoop. Our system OS is Fedora 8;
should I start from a stable release of Hadoop or get it from the svn development
version (from the contribute site)?



Thank You



  

Download a stable release from http://hadoop.apache.org/core/releases.html
For installation and setup, You can see 
http://hadoop.apache.org/core/docs/current/quickstart.html and 
http://hadoop.apache.org/core/docs/current/cluster_setup.html


-Amareshwari

On Thu, Jan 29, 2009 at 7:38 PM, Nigel Daley nda...@yahoo-inc.com wrote:

  

Release 0.18.3 fixes many critical bugs in 0.18.2.

For Hadoop release details and downloads, visit:
http://hadoop.apache.org/core/releases.html

Hadoop 0.18.3 Release Notes are at
http://hadoop.apache.org/core/docs/r0.18.3/releasenotes.html

Thanks to all who contributed to this release!

Nigel




  




Re: Hadoop Streaming Semantics

2009-01-29 Thread Amareshwari Sriramadasu
You can use NLineInputFormat for this, which splits one line (N=1, by 
default) as one split.

So, each map task processes one line.
See 
http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html


-Amareshwari
S D wrote:

Hello,

I have a clarifying question about Hadoop streaming. I'm new to the list and
didn't see anything posted that covers my questions - my apologies if I
overlooked a relevant post.

I have an input file consisting of a list of files (one per line) that need
to be processed independently of each other. The duration for processing
each file is significant - perhaps an hour each. I'm using Hadoop streaming
without a reduce function to process each file and save the results (back to
S3 native in my case). To handle to long processing time of each file I've
set mapred.task.timeout=0 and I have a pretty straight forward Ruby script
reading from STDIN:

STDIN.each_line do |line|
   # Get file from contents of line
   # Process file (long running)
end

Currently I'm using a cluster of 3 workers in which each worker can have up
to 2 tasks running simultaneously. I've noticed that if I have a single
input file with many lines (more than 6 given my cluster), then not all
workers will be allocated tasks; I've noticed two workers being allocated
one task each and the other worker sitting idly. If I split my input file
into multiple files (at least 6) then all workers will be immediately
allocated the maximum number of tasks that they can handle.

My interpretation on this is fuzzy. It seems that Hadoop streaming will take
separate input files and allocate a new task per file (up to the maximum
constraint) but if given a single input file it is unclear as to whether a
new task is allocated per file or line. My understanding of Hadoop Java is
that (unlike Hadoop streaming) when given a single input file, the file will
be broken up into separate lines and the maximum number of map tasks will
automagically be allocated to handle the lines of the file (assuming the use
of TextInputFormat).

Can someone clarify this?

Thanks,
SD

  




Re: Counters in Hadoop

2009-01-29 Thread Amareshwari Sriramadasu

Kris Jirapinyo wrote:

Hi all,
I am using counters in Hadoop via the reporter.  I can see this custom
counter fine after I run my job.  However, if somehow I restart the cluster,
then when I look into the Hadoop Job History, I can't seem to find the
information of my previous counter values anywhere.  Where is it stored (or
is it not)?  Also, I need to be able to write this counter value to either a
local file or even to a file in HDFS...is there a good way to do that?

  
Counters are available in JobHistory, but yes they are not shown on the 
web ui.
There is an open jira for the same: 
https://issues.apache.org/jira/browse/HADOOP-3200.

For now, you can open the history log file and see the counters.
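If you want the value programmatically, here is a hedged sketch (MyCounters is your own enum; writing the result out is left to FileSystem or plain java.io):

  RunningJob job = JobClient.runJob(conf);
  Counters counters = job.getCounters();
  long value = counters.getCounter(MyCounters.MY_COUNTER);
  // write 'value' to a local file or to HDFS as you prefer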


Thanks
Amareshwari


Re: Interrupting JobClient.runJob

2009-01-27 Thread Amareshwari Sriramadasu

Edwin wrote:

Hi

I am looking for a way to interrupt a thread that entered
JobClient.runJob(). The runJob() method keep polling the JobTracker until
the job is completed. After reading the source code, I know that the
InterruptException is caught in runJob(). Thus, I can't interrupt it using
Thread.interrupt() call. Is there anyway I can interrupt a polling thread
without terminating the job? If terminating the job is the only way to
escape, how can I terminate the current job?

Thank you very much.

Regards
Edwin

  

Yes, there is no way to stop the client from polling.
If you want to stop the client thread, use ctrl+c or kill the client
process itself.


You can kill a job using the command:
bin/hadoop job -kill jobid
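If you need finer control from Java, a hedged sketch: submit with JobClient.submitJob() instead of runJob(), so your own thread does the polling and can stop watching (or kill the job) whenever it likes:

  JobClient client = new JobClient(conf);
  RunningJob running = client.submitJob(conf);
  while (!running.isComplete()) {
      if (Thread.currentThread().isInterrupted()) {
          // stop watching without terminating the job;
          // call running.killJob() instead if you do want to terminate it
          break;
      }
      try {
          Thread.sleep(5000);
      } catch (InterruptedException e) {
          break;
      }
  }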

-Amareshwari


Re: Debugging in Hadoop

2009-01-26 Thread Amareshwari Sriramadasu

patektek wrote:

Hello list, I am trying to add some functionality to Hadoop-core and I am
having serious issues
debugging it. I have searched in the list archive and still have not been
able to resolve the issues.

Simple question:
If I want to insert LOG.info() statements in Hadoop code, is it not as
simple as modifying the
log4j.properties file to include the class which has the statements? For
example, if I want to
print out the LOG.info("I am here!") statements in the MapTask class,
I would add to the log4j.properties file the following line:


  
LOG.info statements in MapTask will be shown in the syslog of the task logs.
The directory is ${hadoop.log.dir}/userlogs/<attemptid>.

The same can be browsed on the web ui of the task.

-Amareshwari

# Custom Logging levels
.
.
.
log4j.logger.org.apache.hadoop.mapred.MapTask=INFO

This approach is clearly not working for me.
What am I missing?

Thank you,
patektek

  




Re: NLineInputFormat and very high number of maptasks

2009-01-20 Thread Amareshwari Sriramadasu

Saptarshi Guha wrote:
Sorry, I see - every line is now a map task - one split, one task (in
this case N=1 line per split).

Is that correct?
Saptarshi

You are right. NLineInputFormat splits N lines of input as one split and 
each split is given to a map task.
By default, N is 1. N can be configured through the property
mapred.line.input.format.linespermap.
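For example (a small hedged sketch; 100 is an arbitrary choice):

  conf.setInputFormat(NLineInputFormat.class);
  conf.setInt("mapred.line.input.format.linespermap", 100);  // 100 lines per split / map task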

On Jan 20, 2009, at 11:39 AM, Saptarshi Guha wrote:


Hello,
When I use NLIneInputFormat, when I output:
System.out.println("mapred.map.tasks:" + jobConf.get("mapred.map.tasks"));

Where are you printing this statement? Looks like the JobConf, that you 
are looking at, is not set with the correct value of number of map tasks 
yet.

I see 51, but on the jobtracker site, the number is 18114. Yet with
TextInputFormat it shows 51.
I'm using Hadoop - 0.19

Any ideas why?
Regards
Saptarshi

--Saptarshi Guha - saptarshi.g...@gmail.com


Saptarshi Guha | saptarshi.g...@gmail.com | 
http://www.stat.purdue.edu/~sguha

If the church put in half the time on covetousness that it does on lust,
this would be a better world.
-- Garrison Keillor, Lake Wobegon Days



-Amareshwari


Re: streaming question.

2009-01-18 Thread Amareshwari Sriramadasu
You can also have a look at NLineInputFormat. 
@http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html


Thanks
Amareshwari
Abdul Qadeer wrote:

Dmitry,

If you are talking about Text data, then the splits can be anywhere.  But
LineRecordReader will take care of this thing and your mapper code will
get the correct whole line.

Abdul Qadeer

On Sun, Jan 18, 2009 at 9:59 AM, Dmitry Pushkarev u...@stanford.edu wrote:

  

Dear hadoop users.



When I use streaming on one large file, that is being split in many map
tasks, can I be sure that splits won't fall in the middle of the line?

(i.e. if split size needs to be larger than  64Mb to fit end of the line it
will be increased?



Thanks.

---

Dmitry Pushkarev

+1-650-644-8988







  




Re: Calling a mapreduce job from inside another

2009-01-18 Thread Amareshwari Sriramadasu

You can use Job Control.
See
http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html#Job+Control
http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/jobcontrol/Job.html
and
http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html

Thanks
Amareshwari
Aditya Desai wrote:

Is it possible to call a mapreduce job from inside another? If yes, how?
And is it possible to disable the reducer completely, that is, suspend the job
immediately after the call to map has terminated?
I have tried -reducer NONE. I am using the streaming API to code in Python.

Regards,
Aditya Desai.

  




Re: How to debug a MapReduce application

2009-01-18 Thread Amareshwari Sriramadasu
From the exception you pasted, it looks like your io.serializations setting did
not set up the SerializationFactory properly. Do you see any logs on your
console about adding a serialization class?
Can you try running your app on pseudo distributed mode, instead of 
LocalJobRunner ?
You can find pseudo distributed setup  at 
http://hadoop.apache.org/core/docs/r0.19.0/quickstart.html#PseudoDistributed


Thanks
Amareshwari

Pedro Vivancos wrote:

Dear friends,

I am new at Hadoop and at MapReduce techniques. I've developed my first
map-reduce application using hadoop but I can't manage to make it work. I
get the following error at the very beginning of the execution:

java.lang.NullPointerException
at
org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:504)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:295)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
16-ene-2009 18:29:30 es.vocali.intro.tools.memo.MemoAnnotationMerging main
GRAVE: Se ha producido un error
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
at
es.vocali.intro.tools.memo.MemoAnnotationMerging.main(MemoAnnotationMerging.java:160)
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
at
es.vocali.intro.tools.memo.MemoAnnotationMerging.main(MemoAnnotationMerging.java:160)

Sorry if I don't give you more information but I don't know where to start
to find the error. My app is quite simple. It just gets some rows from a
postgresql database and try to see which ones can be deleted.

Here you have the configuration I am using:

MemoAnnotationMerging memo = new MemoAnnotationMerging();

Map<String, String> parametros = memo.checkParams(args);

memo.initDataStore(parametros.get(DATASTORE_URL));

JobConf conf = new JobConf(MemoAnnotationMerging.class);
conf.setJobName("memo - annotation merging");

conf.setMapperClass(MemoAnnotationMapper.class);
conf.setCombinerClass(MemoAnnotationReducer.class);
conf.setReducerClass(MemoAnnotationReducer.class);

DBConfiguration.configureDB(conf, DRIVER_CLASS,
parametros.get(DATASTORE_URL));

// ???
//conf.setInputFormat(DBInputFormat.class);
//conf.setOutputFormat(TextOutputFormat.class);

conf.setMapOutputKeyClass(LongWritable.class);
conf.setMapOutputValueClass(Annotation.class);


//conf.setOutputKeyClass(Annotation.class);
//conf.setOutputValueClass(BooleanWritable.class);

DBInputFormat.setInput(conf, MemoAnnotationDBWritable.class,
GET_ANNOTATIONS_QUERY, COUNT_ANNOTATIONS_QUERY);

FileOutputFormat.setOutputPath(conf, new Path("eliminar.txt"));

// ejecutamos el algoritmo map-reduce para mezclar anotaciones
try {
JobClient.runJob(conf);

} catch (IOException e) {
e.printStackTrace();
System.exit(-1);
}

Thanks in advance.

 Pedro Vivancos Vicente
Vócali Sistemas Inteligentes S.L. http://www.vocali.net
Edificio CEEIM, Campus de Espinardo
30100, Espinardo, Murcia, Spain
Tel. +34 902 929 644  http://www.vocali.net

  




Re: hadoop job -history

2009-01-15 Thread Amareshwari Sriramadasu

jobOutputDir is the location specified by the configuration property 
hadoop.job.history.user.location. If you don't specify anything for the property, 
the job history logs are created in the job's output directory. So, if you haven't 
specified any location, pass your job's output directory as jobOutputDir to view your history.
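For example, if the job wrote its output to /user/bill/wordcount-out (a made-up path), you would run: hadoop job -history /user/bill/wordcount-out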
Hope this helps.

Thanks
Amareshwari


Bill Au wrote:

I am having trouble getting the hadoop command job -hisotry to work.  What
am I suppose to use for jobOutputDir?  I can see the job history from the
JobTracker web ui.  I tried specifing the history directory on the
JobTracker but it didn't work:

$ hadoop job -history logs/history/
Exception in thread main java.io.IOException: Not able to initialize
History viewer
at
org.apache.hadoop.mapred.HistoryViewer.init(HistoryViewer.java:88)
at
org.apache.hadoop.mapred.JobClient.viewHistory(JobClient.java:1596)
at org.apache.hadoop.mapred.JobClient.run(JobClient.java:1560)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1727)
Caused by: java.io.IOException: History directory
logs/history/_logs/historydoes not exist
at
org.apache.hadoop.mapred.HistoryViewer.init(HistoryViewer.java:70)
... 5 more

  




Re: Problem loading hadoop-site.xml - dumping parameters

2008-12-29 Thread Amareshwari Sriramadasu

Saptarshi Guha wrote:

Hello,
I had previously emailed regarding heap size issue and have discovered
that the hadoop-site.xml is not loading completely, i.e
 Configuration defaults = new Configuration();
JobConf jobConf = new JobConf(defaults, XYZ.class);
System.out.println("1:" + jobConf.get("mapred.child.java.opts"));
System.out.println("2:" + jobConf.get("mapred.map.tasks"));
System.out.println("3:" + jobConf.get("mapred.reduce.tasks"));

System.out.println("3:" + jobConf.get("mapred.tasktracker.reduce.tasks.maximum"));

returns -Xmx200m, 2,1,2 respectively, even though the numbers in the
hadoop-site.xml are very different.

Is there a way for hadoop to dump the parameters read in from
hadoop-site.xml and hadoop-default.xml?

  

Is your hadoop-site.xml present in the conf (HADOOP_CONF_DIR) directory?
http://hadoop.apache.org/core/docs/r0.19.0/cluster_setup.html#Configuration
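
If you want to see exactly which values the job resolved, a minimal sketch (assuming your release's Configuration is iterable over its key/value entries):

for (java.util.Map.Entry<String, String> entry : jobConf) {
  System.out.println(entry.getKey() + " = " + entry.getValue());
}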

-Amareshwari


Re: OutofMemory Error, inspite of large amounts provided

2008-12-28 Thread Amareshwari Sriramadasu

Saptarshi Guha wrote:

Caught it in action.
Running  ps -e -o 'vsz pid ruser args' |sort -nr|head -5
on a machine where the map task was running
04812 16962 sguha/home/godhuli/custom/jdk1.6.0_11/jre/bin/java
-Djava.library.path=/home/godhuli/custom/hadoop/bin/../lib/native/Linux-amd64-64:/home/godhuli/custom/hdfs/mapred/local/taskTracker/jobcache/job_200812282102_0003/attempt_200812282102_0003_m_00_0/work
-Xmx200m 
-Djava.io.tmpdir=/home/godhuli/custom/hdfs/mapred/local/taskTracker/jobcache/job_200812282102_0003/attempt_200812282102_0003_m_00_0/work/tmp
-classpath /attempt_200812282102_0003_m_00_0/work
-Dhadoop.log.dir=/home/godhuli/custom/hadoop/bin/../logs
-Dhadoop.root.logger=INFO,TLA
-Dhadoop.tasklog.taskid=attempt_200812282102_0003_m_00_0
-Dhadoop.tasklog.totalLogFileSize=0 org.apache.hadoop.mapred.Child
127.0.0.1 40443 attempt_200812282102_0003_m_00_0 1525207782

Also, the reducer only used 540mb. I notice -Xmx200m was passed, how
to change it?
Regards
Saptarshi

  

You can set the configuration property mapred.child.java.opts to -Xmx540m.
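For example, in the job driver (where conf is your JobConf), or as the equivalent property element in hadoop-site.xml:

// give each map/reduce child JVM a 540MB heap instead of the default 200MB
conf.set("mapred.child.java.opts", "-Xmx540m");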

Thanks
Amareshwari

On Sun, Dec 28, 2008 at 10:19 PM, Saptarshi Guha
saptarshi.g...@gmail.com wrote:
  

On Sun, Dec 28, 2008 at 4:33 PM, Brian Bockelman bbock...@cse.unl.edu wrote:


Hey Saptarshi,

Watch the running child process while using ps, top, or Ganglia
monitoring.  Does the map task actually use 16GB of memory, or is the memory
not getting set properly?

Brian
  

I haven't figured out how to run ganglia, however, also the children
quit before i can see their memory usage. The trackers all use
16GB.(from the ps command). However, i noticed some use 512MB
only(when i manged to catch them in time)

Regards






  




Re: Does anyone have a working example for using MapFiles on the DistributedCache?

2008-12-28 Thread Amareshwari Sriramadasu

Sean Shanny wrote:

To all,

Version:  hadoop-0.17.2.1-core.jar

I have created a MapFile.

What I don't seem to be able to do is correctly place the MapFile in 
the DistributedCache and the make use of it in a map method.


I need the following info please:

1.How and where to place the MapFile directory so that it is 
visible to the hadoop job.
You have to place your files in DFS. If it is directory you can place an 
archive of it.

2.How to add the files to the DistributedCache.
You can use DistributedCache.addCacheFile or 
DistributedCache.addCacheArchive.
See more documentation @ 
http://hadoop.apache.org/core/docs/r0.17.2/api/org/apache/hadoop/filecache/DistributedCache.html

and
http://hadoop.apache.org/core/docs/r0.17.2/mapred_tutorial.html#DistributedCache

3.How to create a MapFile.Reader from files in the DistributedCache.

I didn't understand what you want to do here. Do you want to see the files 
in the MapFile directory, or do you want them on the classpath?
You can use DistributedCache.addFileToClassPath or 
DistributedCache.addArchiveToClassPath
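
As a rough, untested sketch of opening the reader in a mapper's configure() (the class name is made up, and you may need to append the MapFile directory name to the unpacked archive path, depending on how the archive was built):

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class LookupMapperBase extends MapReduceBase {
  protected MapFile.Reader reader;

  public void configure(JobConf conf) {
    try {
      // the archive added via addCacheArchive is unpacked on the local disk;
      // archives[0] points at the unpacked directory
      Path[] archives = DistributedCache.getLocalCacheArchives(conf);
      FileSystem localFs = FileSystem.getLocal(conf);
      reader = new MapFile.Reader(localFs, archives[0].toString(), conf);
    } catch (IOException e) {
      throw new RuntimeException("could not open cached MapFile", e);
    }
  }
}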


Hope this helps.

Thanks
Amareshwari
I can get this to work with a local file on a single node system 
outside of the DistributedCache but for the life of me cannot get it 
to work within a DistributedCache.


We are trying to load up key value mappings for a Data Warehouse ETL 
process.  The mapper will take an input record, lookup the keys based 
on values and emit the resulting key only record.


Happy to answer any questions to help me make this work.

Thanks.

--sean








Re: Reduce not completing

2008-12-23 Thread Amareshwari Sriramadasu
You can report status from a streaming job by emitting lines of the form 
reporter:status:<message> to stderr.
See documentation @ 
http://hadoop.apache.org/core/docs/r0.18.2/streaming.html#How+do+I+update+status+in+streaming+applications%3F


But from the exception trace, it doesn't look like a lack of status reports 
(timeout). The trace says that the reducer process exited with exit code 1, 
which is most likely a bug in the reducer code. What is the configuration 
value of the property stream.non.zero.exit.status.is.failure?


Thanks
Amareshwari
Rick Hangartner wrote:

Hi,

We seem to be seeing a runtime exception in the Reduce phase of a 
streaming Map-Reduce that has been mentioned before on this list.


http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200805.mbox/%3c482bf75c.9030...@iponweb.net%3e 



When I Google the exception, the only thing returned is to this one 
short thread on the mailing list.  Unfortunately, we don't quite 
understand the exception message in our current situation or the 
eventual explanation and resolution of that previous case.


We have tested that the Python script run in the Reduce phase runs 
without problems.  It returns the correct results when run from the 
command line fed from stdin by a file that is the output of the map 
phase for a small map-reduce job that fails in this way.


Here's the exception we are seeing from the jobtracker log:

2008-12-22 18:13:36,415 INFO org.apache.hadoop.mapred.JobInProgress: 
Task 'attempt_200812221742_0004_m_09_0' has completed 
task_200812221742_0004_m_09 successfully.
2008-12-22 18:13:50,607 INFO org.apache.hadoop.mapred.TaskInProgress: 
Error from attempt_200812221742_0004_r_00_0: 
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess 
failed with code 1
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301) 

at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:518) 

at 
org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)


2008-12-22 18:13:52,045 INFO org.apache.hadoop.mapred.JobTracker: 
Removed completed task 'attempt_200812221742_0004_r_00_0' from 
'tracker_hnode3.cor.mystrands.in:localhost/127.0.0.1:3'
2008-12-22 18:13:52,175 INFO org.apache.hadoop.mapred.JobTracker: 
Adding task 'attempt_200812221742_0004_r_00_1' to tip 
task_200812221742_0004_r_00, for tracker 
'tracker_hnode5.cor.mystrands.in:localhost/127.0.0.1:55254'


We typically see 4 repetitions of this exception in the log. And we 
may see 1-2 sets of those repetitions.


If someone could explain what this exception actually means, and 
perhaps what we might need to change in our configuration to fix it, 
we would be most appreciative.   Naively, it almost seems if a task is 
just taking slightly too long to complete and report that fact, 
perhaps because of other Hadoop or MR processes going on at the same 
time.  If we re-run this map-reduce, it does sometimes run to a 
successful completion without an exception.


We are just testing map-reduce as a candidate for a number of data 
reduction tasks right now.  We are running Hadoop 18.1 on a cluster of 
9 retired desktop machines that just have 100Mb networking and about 
2GB of RAM each, so that's why we are suspecting this could just be a 
problem that tasks are taking slightly too long to report back they 
have completed, rather than an actual bug.   (We will be upgrading 
this test  cluster to Hadoop 19.x and 1Gb networking very shortly.)


Thanks,
RDH

Begin forwarded message:


From: Rick Cox rick@gmail.com
Date: May 14, 2008 9:01:31 AM PDT
To: core-user@hadoop.apache.org, apan...@iponweb.net
Subject: Re: Streaming and subprocess error code
Reply-To: core-user@hadoop.apache.org

Does the syslog output from a should-have-failed task contain
something like this?

   java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
subprocess failed with code 1

(In particular, I'm curious if it mentions the RuntimeException.)

Tasks that consume all their input and then exit non-zero are
definitely supposed to be counted as failed, so there's either a
problem with the setup or a bug somewhere.

rick

On Wed, May 14, 2008 at 8:49 PM, Andrey Pankov apan...@iponweb.net 
wrote:

Hi,

I've tested this new option -jobconf
stream.non.zero.exit.status.is.failure=true. Seems working but 
still not

good for me. When mapper/reducer program have read all input data
successfully and fails after that, streaming still finishes 
successfully so

there are no chances to know about some data post-processing errors in
subprocesses :(



Andrey Pankov wrote:









Re: Reduce not completing

2008-12-23 Thread Amareshwari Sriramadasu
 complete: 
job_200812221742_0076


For the middle run, the job tracker long has the following exceptions 
reported, but for the first and last the log does not list any 
exceptions:


2008-12-23 19:04:27,816 INFO org.apache.hadoop.mapred.JobInProgress: 
Task 'attempt_200812221742_0075_m_00_0' has completed 
task_200812221742_0075_m_00 successfully.
2008-12-23 19:04:36,505 INFO org.apache.hadoop.mapred.TaskInProgress: 
Error from attempt_200812221742_0075_r_00_0: 
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess 
failed with code 1
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301) 

at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:518) 

at 
org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)


2008-12-23 19:04:39,050 INFO org.apache.hadoop.mapred.JobTracker: 
Removed completed task 'attempt_200812221742_0075_r_00_0' from 
'tracker_hnode2.cor.mystrands.in:localhost/127.0.0.1:36968'
2008-12-23 19:04:39,214 INFO org.apache.hadoop.mapred.JobTracker: 
Adding task 'attempt_200812221742_0075_r_00_1' to tip 
task_200812221742_0075_r_00, for tracker 
'tracker_hnode3.cor.mystrands.in:localhost/127.0.0.1:3'
2008-12-23 19:04:44,237 INFO org.apache.hadoop.mapred.TaskInProgress: 
Error from attempt_200812221742_0075_r_00_1: 
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess 
failed with code 1
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301) 

at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:518) 

at 
org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)


2008-12-23 19:04:47,845 INFO org.apache.hadoop.mapred.JobTracker: 
Removed completed task 'attempt_200812221742_0075_r_00_1' from 
'tracker_hnode3.cor.mystrands.in:localhost/127.0.0.1:3'
2008-12-23 19:04:47,856 INFO org.apache.hadoop.mapred.JobTracker: 
Adding task 'attempt_200812221742_0075_r_00_2' to tip 
task_200812221742_0075_r_00, for tracker 
'tracker_hnode1.cor.mystrands.in:localhost/127.0.0.1:37971'
2008-12-23 19:04:57,781 INFO org.apache.hadoop.mapred.TaskInProgress: 
Error from attempt_200812221742_0075_r_00_2: 
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess 
failed with code 1
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:301) 

at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:518) 

at 
org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:102)

at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)


2008-12-23 19:04:57,781 INFO org.apache.hadoop.mapred.JobTracker: 
Removed completed task 'attempt_200812221742_0075_r_00_2' from 
'tracker_hnode1.cor.mystrands.in:localhost/127.0.0.1:37971'


Thanks,
RDH

On Dec 23, 2008, at 1:00 AM, Amareshwari Sriramadasu wrote:

You can report status from a streaming job by emitting lines of the form 
reporter:status:<message> to stderr.
See documentation @ 
http://hadoop.apache.org/core/docs/r0.18.2/streaming.html#How+do+I+update+status+in+streaming+applications%3F 



But from the exception trace, it doesn't look like a lack of status reports 
(timeout). The trace says that the reducer process exited with exit code 1, 
which is most likely a bug in the reducer code. What is the configuration 
value of the property stream.non.zero.exit.status.is.failure?


Thanks
Amareshwari
Rick Hangartner wrote:

Hi,

We seem to be seeing a runtime exception in the Reduce phase of a 
streaming Map-Reduce that has been mentioned before on this list.


http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200805.mbox/%3c482bf75c.9030...@iponweb.net%3e 



When I Google the exception, the only thing returned is to this one 
short thread on the mailing list.  Unfortunately, we don't quite 
understand the exception message in our current situation or the 
eventual explanation and resolution of that previous case.


We have tested that the Python script run in the Reduce phase runs 
without problems.  It returns the correct results when run from the 
command line fed from stdin by a file that is the output of the map 
phase for a small map-reduce job that fails in this way.


Here's the exception we are seeing from the jobtracker log:

2008-12-22 18:13:36,415 INFO org.apache.hadoop.mapred.JobInProgress: 
Task 'attempt_200812221742_0004_m_09_0' has completed 
task_200812221742_0004_m_09 successfully.
2008-12-22 18:13:50,607 INFO 
org.apache.hadoop.mapred.TaskInProgress: Error from 
attempt_200812221742_0004_r_00_0

Re: Failed to start TaskTracker server

2008-12-22 Thread Amareshwari Sriramadasu
You can set the configuration property 
mapred.task.tracker.http.address to 0.0.0.0:0 . If the port is given 
as 0, then the server will start on a free port.


Thanks
Amareshwari

Sagar Naik wrote:


- check hadoop-default.xml
in here u will find all the ports used. Copy the xml-nodes from 
hadoop-default.xml to hadoop-site.xml. Change the port values in 
hadoop-site.xml

and deploy it on datanodes .


Rico wrote:
Well the machines are all servers that probably running many services 
but I have no permission to change or modify other users' programs or 
settings. Is there any way to change 50060 to other port?


Sagar Naik wrote:
Well u have some process which grabs this port and Hadoop is not 
able to bind the port
By the time u check, there is a chance that socket connection has 
died but was occupied when hadoop processes was attempting


Check all the processes running on the system
Do any of the processes acquire ports ?

-Sagar
ascend1 wrote:
I have made a Hadoop platform on 15 machines recently. NameNode - 
DataNodes work properly but when I use bin/start-mapred.sh to start 
MapReduce framework only 3 or 4 TaskTracker could be started 
properly. All those couldn't be started have the same error.

Here's the log:

2008-12-19 16:16:31,951 INFO org.apache.hadoop.mapred.TaskTracker: 
STARTUP_MSG: 
/

STARTUP_MSG: Starting TaskTracker
STARTUP_MSG: host = msra-5lcd05/172.23.213.80
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.19.0
STARTUP_MSG: build = 
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 
-r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008

/
2008-12-19 16:16:33,248 INFO org.mortbay.http.HttpServer: Version 
Jetty/5.1.4
2008-12-19 16:16:33,248 INFO org.mortbay.util.Credential: Checking 
Resource aliases
2008-12-19 16:16:33,608 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@e51b2c
2008-12-19 16:16:33,655 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/static,/static]
2008-12-19 16:16:33,811 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@edf389
2008-12-19 16:16:33,936 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/logs,/logs]
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
org.mortbay.jetty.servlet.webapplicationhand...@17b0998
2008-12-19 16:16:34,092 INFO org.mortbay.util.Container: Started 
WebApplicationContext[/,/]
2008-12-19 16:16:34,155 WARN org.mortbay.util.ThreadedServer: 
Failed to start: socketlisten...@0.0.0.0:50060
2008-12-19 16:16:34,155 ERROR org.apache.hadoop.mapred.TaskTracker: 
Can not start task tracker because java.net.BindException: Address 
already in use: JVM_Bind

at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:359)
at java.net.ServerSocket.bind(ServerSocket.java:319)
at java.net.ServerSocket.init(ServerSocket.java:185)
at 
org.mortbay.util.ThreadedServer.newServerSocket(ThreadedServer.java:391) 


at org.mortbay.util.ThreadedServer.open(ThreadedServer.java:477)
at org.mortbay.util.ThreadedServer.start(ThreadedServer.java:503)
at org.mortbay.http.SocketListener.start(SocketListener.java:203)
at org.mortbay.http.HttpServer.doStart(HttpServer.java:761)
at org.mortbay.util.Container.start(Container.java:72)
at org.apache.hadoop.http.HttpServer.start(HttpServer.java:321)
at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:894)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
2008-12-19 16:16:34,155 INFO org.apache.hadoop.mapred.TaskTracker: 
SHUTDOWN_MSG: 
/

SHUTDOWN_MSG: Shutting down TaskTracker at msra-5lcd05/172.23.213.80
/

Then I use netstat -an, but port 50060 isn't in the list. ps 
-af also show that no program using 50060. The strange point is 
that when I repeat bin/start-mapred.sh and bin/stop-mapred.sh 
several times, the machines list that could start TaskTracker seems 
randomly.


Could anybody help me solve this problem?












Re: Reducing Hadoop Logs

2008-12-09 Thread Amareshwari Sriramadasu

Arv Mistry wrote:
 
I'm using hadoop 0.17.0. Unfortunately I cant upgrade to 0.19.0 just

yet.

I'm trying to control the amount of extraneous files. I noticed there
are the following log files produced by hadoop;

On Slave
- userlogs (for each map/reduce job)
- stderr
- stdout
- syslog
- datanode .log file
- datanode .out file
- tasktracker .log file
- tasktracker .out file

On Master
- jobtracker .log file
- jobtracker .out file
- namenode   .log file
- namenode   .out file
- secondarynamenode .log file
- secondarynamenode .out file   
- job .xml file
- history
- xml file for job


Does any body know of how to configure hadoop so I don't have to delete
these files manually? Or just so that they don't get created at all.

For the history files, I set hadoop.job.history.user.location to none in
the hadoop-site.xml file but I still get the history files created.
  
Setting hadoop.job.history.user.location to none only affects the history 
location specified for the user. The JobTracker still keeps its own history 
location, and that history is cleaned up after a month.


Userlogs will be cleaned up after mapred.userlog.retain.hours, which 
defaults to 24 hours.


Thanks
Amareshwari

Also I set in the log4j.properties the hadoop.root.logger=WARN but I
still see INFO messages in datanode,jobtracker etc logs

Thanks, in advance

Cheers Arv
  




Re: Optimized way

2008-12-04 Thread Amareshwari Sriramadasu

Hi Aayush,
Do you want one map to run one command? You can give an input file 
consisting of lines of the form "file outputfile". Use NLineInputFormat, which 
splits N lines of input as one split, i.e. gives N lines to one map for 
processing. By default, N is one. Then your map can just run the shell 
command on its input line; a sketch is below. Would that meet your need?

More details @
http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
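A rough sketch of such a map method (old API; the command name and the two-token line format are assumptions, and the enclosing class would implement Mapper<LongWritable, Text, Text, Text>):

// value is one input line of the form "file outputfile"
public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
    throws IOException {
  String[] parts = value.toString().trim().split("\\s+");
  Process p = Runtime.getRuntime().exec(new String[] { "./run", parts[0], parts[1] });
  try {
    int exitCode = p.waitFor();
    output.collect(new Text(parts[0]), new Text("exit=" + exitCode));
  } catch (InterruptedException e) {
    throw new IOException("interrupted while waiting for ./run: " + e);
  }
}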
Thanks,
Amareshwari
Aayush Garg wrote:

Hi,

I am having a 5 node cluster for hadoop usage. All nodes are multi-core.
I am running a shell command in Map function of my program and this shell
command takes one file as an input. Many of such files are copied in the
HDFS.

So in summary map function will run a command like ./run file1
outputfile1

Could you please suggest the optimized way to do this..like if I can use
multi core processing of nodes and many of such maps in parallel.

Thanks,
Aayush

  




Re: Error with Sequence File in hadoop-18

2008-11-27 Thread Amareshwari Sriramadasu
The JIRA issue only removes the log message. If you are OK with the log 
message, you can continue using 0.18.2. If not, you can apply the patch 
available on the JIRA and rebuild, since 0.18.3 is not yet released.

-Amareshwari

Palleti, Pallavi wrote:

Hi Amareshwari,
 
 Thanks for the reply. We recently upgraded hadoop cluster to

hadoop-18.2. Can you please suggest a simple way of avoiding this issue.
I mean, do we need to do the full upgrade to hadoop-0.18.3 or is there a
simple way of taking the patch and adding it to the existing code
repository and rebuild? 


Thanks
Pallavi

-Original Message-
From: Amareshwari Sriramadasu [mailto:[EMAIL PROTECTED] 
Sent: Friday, November 28, 2008 10:56 AM

To: core-user@hadoop.apache.org
Subject: Re: Error with Sequence File in hadoop-18

It got fixed in 0.18.3 (HADOOP-4499).

-Amareshwari
Palleti, Pallavi wrote:
  

Hi,

I am getting "Check sum ok was sent" errors when I am using hadoop. Can
someone please let me know why this error is coming and how to avoid it.
It was running perfectly fine when I used hadoop-17. And, this error is
coming when I upgraded the system to hadoop-18.2.

The full stack trace is:

08/11/27 13:02:58 INFO fs.FSInputChecker: java.io.IOException: Checksum
ok was sent and should not be sent again
at org.apache.hadoop.dfs.DFSClient$BlockReader.read(DFSClient.java:863)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1392)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1428)
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1377)
at java.io.DataInputStream.readByte(DataInputStream.java:248)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:324)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:345)
at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1648)
at org.apache.hadoop.io.SequenceFile$Reader.readBlock(SequenceFile.java:1688)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1850)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
at org.apache.hadoop.io.MapFile$Reader.readIndex(MapFile.java:318)
at org.apache.hadoop.io.MapFile$Reader.seekInternal(MapFile.java:434)
at org.apache.hadoop.io.MapFile$Reader.seekInternal(MapFile.java:416)
at org.apache.hadoop.io.MapFile$Reader.seek(MapFile.java:403)
at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:522)

Thanks
Pallavi


  



  




Re: how can I decommission nodes on-the-fly?

2008-11-25 Thread Amareshwari Sriramadasu

Jeremy Chow wrote:

Hi list,

 I added a property dfs.hosts.exclude to my conf/hadoop-site.xml. Then
refreshed my cluster with command
 bin/hadoop dfsadmin -refreshNodes
It showed that it can only shut down the DataNode process but not included
the TaskTracker process on each slaver specified in the excludes file.
  

Presently, decommissioning TaskTracker on-the-fly is not available.

The jobtracker web still show that I hadnot shut down these nodes.
How can i totally decommission these slaver nodes on-the-fly? Is it can be
achieved only by operation on the master node?

  

I think one way to shutdown a TaskTracker is to kill it.

Thanks
Amareshwari

Thanks,
Jeremy

  




Re: Newbie: error=24, Too many open files

2008-11-23 Thread Amareshwari Sriramadasu

tim robertson wrote:

Hi all,

I am running MR which is scanning 130M records and then trying to
group them into around 64,000 files.

The Map does the grouping of the record by determining the key, and
then I use a MultipleTextOutputFormat to write the file based on the
key:
@Override
protected String generateFileNameForKeyValue(WritableComparable
key,Writable value, String name) {
return cell_ + key.toString();
}

This approach works for small input files, but for the 130M it fails with:

org.apache.hadoop.mapred.Merger$MergeQueue Down to the last
merge-pass, with 10 segments left of total size: 12291866391 bytes
org.apache.hadoop.mapred.LocalJobRunner$Job reduce  reduce
org.apache.hadoop.mapred.JobClient  map 100% reduce 66%
org.apache.hadoop.mapred.LocalJobRunner$Job reduce  reduce
...
org.apache.hadoop.mapred.LocalJobRunner$Job reduce  reduce
org.apache.hadoop.mapred.LocalJobRunner$Job job_local_0001
java.io.IOException: Cannot run program chmod: error=24, Too many open files
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:317)
at 
org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:532)
at 
org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:284)
at 
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:364)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:503)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:403)
at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:117)
at 
org.apache.hadoop.mapred.lib.MultipleTextOutputFormat.getBaseRecordWriter(MultipleTextOutputFormat.java:44)
at 
org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.write(MultipleOutputFormat.java:99)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:300)
at 
org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:201)
Caused by: java.io.IOException: error=24, Too many open files
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.init(UNIXProcess.java:53)
at java.lang.ProcessImpl.start(ProcessImpl.java:91)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
... 17 more
Exception in thread main java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1113)
at 
com.ibiodiversity.index.mapreduce.occurrence.geometry.OccurrenceByPolygonIntersection.splitOccurrenceDataIntoCells(OccurrenceByPolygonIntersection.java:95)
at 
com.ibiodiversity.index.mapreduce.occurrence.geometry.OccurrenceByPolygonIntersection.run(OccurrenceByPolygonIntersection.java:54)
at 
com.ibiodiversity.index.mapreduce.occurrence.geometry.OccurrenceByPolygonIntersection.main(OccurrenceByPolygonIntersection.java:190)


Is this a problem because I am working on my single machine at the
moment, that will go away when I run on the cluster of 25?

  
Yes. The problem could be because of the single machine and the LocalJobRunner. I 
think this should go away on a cluster.

-Amareshwari

I am configuring the job:
  conf.setNumMapTasks(10);
  conf.setNumReduceTasks(5);

Are there perhaps better parameters so it does not try to manage the
temp files all in one go?

Thanks for helping!

Tim
  




Re: NLine Input Format

2008-11-19 Thread Amareshwari Sriramadasu

Rahul Tenany wrote:

Hi Amareshwari,
It is in the ToolRunner.run() method that i am setting the 
FileInputFormat as NLineInputFormat and in the same function i am 
setting the mapred.line.input.format.linespermap property. Will that 
not work? How can i overload LineRecordReader, so that it returns the 
value as N Lines?


Setting the configuration in the run() method will also work. You have to extend 
LineRecordReader and override the next() method to return N lines as the value 
instead of 1 line.
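
A rough, untested sketch of that (the class name is made up; it extends the old-API LineRecordReader and concatenates N lines into one value):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.LineRecordReader;

public class NLinesRecordReader extends LineRecordReader {
  private final int linesPerValue;

  public NLinesRecordReader(Configuration conf, FileSplit split, int linesPerValue)
      throws IOException {
    super(conf, split);
    this.linesPerValue = linesPerValue;
  }

  @Override
  public boolean next(LongWritable key, Text value) throws IOException {
    // key ends up as the offset of the last line read; value holds up to N lines
    Text line = new Text();
    StringBuilder lines = new StringBuilder();
    int read = 0;
    while (read < linesPerValue && super.next(key, line)) {
      if (read > 0) {
        lines.append('\n');
      }
      lines.append(line.toString());
      read++;
    }
    if (read == 0) {
      return false;
    }
    value.set(lines.toString());
    return true;
  }
}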


Thanks
Amareshwari


Thanks
Rahul

On Mon, Nov 17, 2008 at 9:43 AM, Amareshwari Sriramadasu 
[EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote:


Hi Rahul,

How did you set the configuration
mapred.line.input.format.linespermap and your input format? You
have to set them in hadoop-site.xml or pass them through -D option
to the job.
NLineInputFormat will split N lines of input as one split. So,
each map gets N lines.
But the RecordReader is still LineRecordReader, which reads one
line at time, thereby Key is the offset in the file and Value is
the line.
If you want N lines as the value, you may have to override LineRecordReader.

Thanks
Amareshwari


Rahul Tenany wrote:

Hi,   I am writing a Binary Search Tree on Hadoop and for the
same i require
to use NLineInputFormat. I'll read n lines at a time, convert
the numbers in
each line from string to int and then insert them into the
binary tree. Once
the binary tree is made i'll search for elements in it. But
even if i set
that input format as NLineInputFormat and set the
mapred.line.input.format.linespermap
to 10, i am able to read only 1 line at the time. Any idea
where am i going
wrong? How can i find whether NLineInputFormat is working or not?

I want my program to work for any object that is comparable
and not just
integers, so in there any way i can read NObjects at a time?

I am completely stuck. Any help will be appreciated.

Thanks
Rahul

 








Re: NLine Input Format

2008-11-19 Thread Amareshwari Sriramadasu

Rahul Tenany wrote:

Hi Amareshwari,
It is in the ToolRunner.run() method that i am setting the 
FileInputFormat as NLineInputFormat and in the same function i am 
setting the mapred.line.input.format.linespermap property. Will that 
not work? How can i overload LineRecordReader, so that it returns the 
value as N Lines?


Thanks
Rahul

On Mon, Nov 17, 2008 at 9:43 AM, Amareshwari Sriramadasu 
[EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote:


Hi Rahul,

How did you set the configuration
mapred.line.input.format.linespermap and your input format? You
have to set them in hadoop-site.xml or pass them through -D option
to the job.
NLineInputFormat will split N lines of input as one split. So,
each map gets N lines.
But the RecordReader is still LineRecordReader, which reads one
line at time, thereby Key is the offset in the file and Value is
the line.
If you want N lines as the value, you may have to override LineRecordReader.

Thanks
Amareshwari


Rahul Tenany wrote:

Hi,   I am writing a Binary Search Tree on Hadoop and for the
same i require
to use NLineInputFormat. I'll read n lines at a time, convert
the numbers in
each line from string to int and then insert them into the
binary tree. Once
the binary tree is made i'll search for elements in it. But
even if i set
that input format as NLineInputFormat and set the
mapred.line.input.format.linespermap
to 10, i am able to read only 1 line at the time. Any idea
where am i going
wrong? How can i find whether NLineInputFormat is working or not?

I want my program to work for any object that is comparable
and not just
integers, so in there any way i can read NObjects at a time?

I am completely stuck. Any help will be appreciated.

Thanks
Rahul

 




One more thing: I don't think you need to use NLineInputFormat for your 
requirement. NLineInputFormat splits N lines as one split, thus each map 
processes N lines. In your application, you don't want each map to 
process just N lines; you want the value to be N lines, right? So, you 
should write a new input format extending FileInputFormat whose 
getRecordReader returns your new RecordReader implementation. Does 
this make sense?


Thanks
Amareshwari



Re: NLine Input Format

2008-11-16 Thread Amareshwari Sriramadasu

Hi Rahul,

How did you set the configuration mapred.line.input.format.linespermap 
and your input format? You have to set them in hadoop-site.xml or pass 
them through -D option to the job.
NLineInputFormat will split N lines of input as one split. So, each map 
gets N lines.
But the RecordReader is still LineRecordReader, which reads one line at 
time, thereby Key is the offset in the file and Value is the line.

If you want N lines as the value, you may have to override LineRecordReader.

Thanks
Amareshwari

Rahul Tenany wrote:

Hi,   I am writing a Binary Search Tree on Hadoop and for the same i require
to use NLineInputFormat. I'll read n lines at a time, convert the numbers in
each line from string to int and then insert them into the binary tree. Once
the binary tree is made i'll search for elements in it. But even if i set
that input format as NLineInputFormat and set the
mapred.line.input.format.linespermap
to 10, i am able to read only 1 line at the time. Any idea where am i going
wrong? How can i find whether NLineInputFormat is working or not?

I want my program to work for any object that is comparable and not just
integers, so in there any way i can read NObjects at a time?

I am completely stuck. Any help will be appreciated.

Thanks
Rahul

  




Re: distributed cache

2008-11-11 Thread Amareshwari Sriramadasu

Jeremy Pinkham wrote:

We are using the distributed cache in one of our jobs and have noticed
that the local copies on all of the task nodes never seem to get cleaned
up.  Is there a mechanism in the API to tell the framework that those
copies are no longer needed so they can be deleted.  I've tried using
releaseCache and deleting the source file from hdfs... but it still
remains in the local directories on each node
(HADOOP_ROOT/temp/hadoop-hadoop/mapred/local/taskTracker/archive

  
These files are shared by all the jobs, and DistributedCache does lazy 
deletion. So releaseCache doesn't delete the files; it just decrements 
the reference count.
The cache is cleaned up only when it runs out of space, which is controlled 
by the configuration property local.cache.size, i.e. if the total size of the 
archive directory exceeds the allowed size (local.cache.size, whose default 
is 10GB), files with a zero reference count get cleaned up. You 
can specify a lower value for local.cache.size if needed.


Thanks
Amareshwari

Am I doing something wrong? not seeing the right API method to do the
cleaning? or is it intentional that these files must be removed
manually?

Thanks

jeremy


The information transmitted in this email is intended only for the person(s) or 
entity to which it is addressed and may contain confidential and/or privileged 
material. Any review, retransmission, dissemination or other use of, or taking 
of any action in reliance upon, this information by persons or entities other 
than the intended recipient is prohibited. If you received this email in error, 
please contact the sender and permanently delete the email from any computer.


  




Re: reading input for a map function from 2 different files?

2008-11-09 Thread Amareshwari Sriramadasu

some speed wrote:

I was wondering if it was possible to read the input for a map function from
2 different files:

1st file --- user-input file from a particular location(path)
2nd file=--- A resultant file (has just one key,value pair) from a
previous MapReduce job. (I am implementing a chain MapReduce function)

Now, for every key,value pair in the user-input file, I would like to use
the same key,value pair from the 2nd file for some calculations.

  
I think you can use DistributedCache for distributing your second file 
among maps.
Please see more documentation at 
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache
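
A minimal sketch of the driver side (the path and symlink name are made up; new URI(...) throws URISyntaxException, so declare or catch it):

// ship the single key,value file produced by the previous job
DistributedCache.addCacheFile(
    new URI("hdfs://namenode:9000/user/me/prevjob/part-00000#prevresult"), conf);
DistributedCache.createSymlink(conf);

In the mapper's configure() you can then open the symlinked file "prevresult" from the working directory (e.g. with a BufferedReader), read the single key/value pair once, and keep it in a field for use in map().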


Thanks
Amareshwari

Is it possible for me to do so? Can someone guide me in the right direction
please?


Thanks!

  




Re: _temporary directories not deleted

2008-11-04 Thread Amareshwari Sriramadasu


Nathan Marz wrote:

Hello all,

Occasionally when running jobs, Hadoop fails to clean up the 
_temporary directories it has left behind. This only appears to 
happen when a task is killed (aka a speculative execution), and the 
data that task has outputted so far is not cleaned up. Is this a known 
issue in hadoop? 
Yes. It is possible that _temporary gets created by a speculative task after 
the cleanup, in some corner cases.
Is the data from that task guaranteed to be duplicate data of what was 
outputted by another task? Is it safe to just delete this directory 
without worrying about losing data?


Yes. You are right. It is duplicate data created by the speculative 
task. You can go ahead and delete it.

-Amareshwari

Thanks,
Nathan Marz
Rapleaf


Re: Debugging / Logging in Hadoop?

2008-10-31 Thread Amareshwari Sriramadasu

Some more links:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Other+Useful+Features
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Debugging

-Amareshwari

Arun C Murthy wrote:


On Oct 30, 2008, at 1:16 PM, Scott Whitecross wrote:

Is the presentation online as well?  (Hard to see some of the slides 
in the video)




http://wiki.apache.org/hadoop/HadoopPresentations

Arun


On Oct 30, 2008, at 1:34 PM, Alex Loddengaard wrote:


Arun gave a great talk about debugging and tuning at the Rapleaf event.
Take a look:
http://www.vimeo.com/2085477

Alex

On Thu, Oct 30, 2008 at 6:20 AM, Malcolm Matalka 
[EMAIL PROTECTED] wrote:

I'm not sure of the correct way, but when I need to log a job I 
have it

print out with some unique identifier and then just do:

for i in list of each box; do ssh $i 'grep -R PREFIX path/to/logs'; 
done

results


It works well in a pinch

-Original Message-
From: Scott Whitecross [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 29, 2008 22:14
To: core-user@hadoop.apache.org
Subject: Debugging / Logging in Hadoop?

I'm curious to what the best method for debugging and logging in
Hadoop?  I put together a small cluster today and a simple application
to process log files.  While it worked well, I had trouble trying to
get logging information out.  Is there any way to attach a debugger,
or get log4j to write a log file?  I tried setting up a Logger in the
class I used for the map/reduce, but I had no luck.

Thanks.











Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job?

2008-10-29 Thread Amareshwari Sriramadasu

Zhengguo 'Mike' SUN wrote:

Hi, Peeyush,

I guess I didn't make myself clear. I am trying to run a Hadoop pipes job with 
a combination of Java classes and C++ classes. So the command I am using is 
like:

hadoop pipes -conf myconf.xml -inputformat MyInputFormat.class -input in 
-output out

And it threw ClassNotFoundException for my InputSplit class.
As I understand hadoop jar is used to run a jar file, which is not my case. And there 
is a -jar option in hadoop pipes. But, unfortunately, it is not working for me. So the 
question I want to ask is how to include customized Java classes, such as MyInputSplit, in a pipes 
job?

  
You are right. The -jar option also doesn't add the jar file to the classpath on 
the client side. You can use the -libjars option with 0.19. Then the 
command looks like:

hadoop pipes -conf myconf.xml -libjars <jarfile> -inputformat 
MyInputFormat.class -input in -output out

I don't see a way to do this in 0.17.*; one workaround is to add the jar 
explicitly to the client-side classpath (for example via HADOOP_CLASSPATH), and 
also pass it through the -jar option for the job.

Thanks,
Amareshwari

Thanks,
Mike





From: Peeyush Bishnoi [EMAIL PROTECTED]
To: core-user@hadoop.apache.org; core-user@hadoop.apache.org
Sent: Wednesday, October 29, 2008 12:52:18 PM
Subject: RE: How do I include customized InputFormat, InputSplit and 
RecordReader in a C++ pipes job?

Hello Zhengguo ,

Yes , -libjars is the new feature in Hadoop. This feature has been available from Hadoop-0.17.x , but it is more stable from hadoop 0.18.x 


example to use -libjars...

hadoop jar -libjars <comma separated list of jars> ...


Thanks ,

---
Peeyush


-Original Message-
From: Zhengguo 'Mike' SUN [mailto:[EMAIL PROTECTED]
Sent: Wed 10/29/2008 9:22 AM
To: core-user@hadoop.apache.org
Subject: Re: How do I include customized InputFormat, InputSplit and 
RecordReader in a C++ pipes job?

Hi, Amareshwari,

Is -libjars a new option in Hadoop 0.19? I am using 0.17.2. The only option I 
see is -jar, which didn't work for me. And besides passing them as jar file, is 
there any other ways to do that?

Thanks
Mike



From: Amareshwari Sriramadasu [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Tuesday, October 28, 2008 11:58:33 PM
Subject: Re: How do I include customized InputFormat, InputSplit and 
RecordReader in a C++ pipes job?

Hi,

How are you passing your classes to the pipes job? If you are passing 
them as a jar file, you can use -libjars option. From branch 0.19, the 
libjar files are added to the client classpath also.


Thanks
Amareshwari
Zhengguo 'Mike' SUN wrote:
  

Hi,

I implemented customized classes for InputFormat, InputSplit and RecordReader 
in Java and was trying to use them in a C++ pipes job. The customized 
InputFormat class could be included using the -inputformat option, but it threw 
ClassNotFoundException for my customized InputSplit class. It seemed the 
classpath has not been correctly set. Is there any way that let me include my 
customized classes in a pipes job?



 
 




  
  




Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job?

2008-10-28 Thread Amareshwari Sriramadasu

Hi,

How are you passing your classes to the pipes job? If you are passing 
them as a jar file, you can use -libjars option. From branch 0.19, the 
libjar files are added to the client classpath also.


Thanks
Amareshwari
Zhengguo 'Mike' SUN wrote:

Hi,

I implemented customized classes for InputFormat, InputSplit and RecordReader 
in Java and was trying to use them in a C++ pipes job. The customized 
InputFormat class could be included using the -inputformat option, but it threw 
ClassNotFoundException for my customized InputSplit class. It seemed the 
classpath has not been correctly set. Is there any way that let me include my 
customized classes in a pipes job?



  
  




Re: Problems running the Hadoop Quickstart

2008-10-20 Thread Amareshwari Sriramadasu
Has your task-tracker started? I mean, do you see non-zero nodes on your 
job tracker UI?


-Amareshwari
John Babilon wrote:

Hello,

I've been trying to get Hadoop up and running on a Windows Desktop running 
Windows XP.  I've installed Cygwin and Hadoop.  I run the start-all.sh script, 
it starts the namenode, but does not seem to start the datanode.  I found that 
if I run hadoop datanode then, the datanode starts.  When I run the bin/hadoop 
jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' it seems to start 
doing something, I get status showing that 0% has completed, but the job 
tracker does show a job scheduled and waiting.  Any ideas as to where I should 
start looking to determine what might be wrong?  Thanks.

John B.
  




Re: Using different file systems for Map Reduce job input and output

2008-10-06 Thread Amareshwari Sriramadasu

Hi Naama,

Yes. It is possible to specify using the apis

FileInputFormat#setInputPaths(), FileOutputFormat#setOutputPath(). 


You can specify the FileSystem uri for the path.
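
For example (host names and ports below are placeholders):

// read from HDFS, write to another filesystem (KFS here) by using fully
// qualified URIs in the paths
FileInputFormat.setInputPaths(conf, new Path("hdfs://namenode-host:9000/data/input"));
FileOutputFormat.setOutputPath(conf, new Path("kfs://kfs-metaserver:20000/data/output"));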

Thanks,
Amareshwari
Naama Kraus wrote:

Hi,

I wanted to know if it is possible to use different file systems for Map
Reduce job input and output.
I.e. have a M/R job input reside on one file system and the M/R output be
written to another file system (e.g. input on HDFS, output on KFS. Input on
HDFS output on local file system, or anything else ...).

Is it possible to somehow specify that through
FileInputFormat#setInputPaths(), FileOutputFormat#setOutputPath() ?
Or by any other mechanism ?

Thanks, Naama

  




Re: Add jar file via -libjars - giving errors

2008-10-06 Thread Amareshwari Sriramadasu

Hi,

From 0.19, the jars added using -libjars are available on the client 
classpath also, fixed by HADOOP-3570.


Thanks
Amareshwari

Mahadev Konar wrote:

Hi Tarandeep,
 the libjars option does not add the jar on the client side. There is an
open JIRA for that (I don't remember which one)...

You have to add the jar to the

HADOOP_CLASSPATH on the client side so that it gets picked up on the client
side as well.


mahadev


On 10/6/08 2:30 PM, Tarandeep Singh [EMAIL PROTECTED] wrote:

  

Hi,

I want to add a jar file (that is required by mappers and reducers) to the
classpath. Initially I had copied the jar file to all the slave nodes in the
$HADOOP_HOME/lib directory and it was working fine.

However when I tried the libjars option to add jar files -

$HADOOP_HOME/bin/hadoop  jar myApp.jar -conf $MY_CONF_FILE -libjars jdom.jar


I got this error-

java.lang.NoClassDefFoundError: org/jdom/input/SAXBuilder

Can someone please tell me what needs to be fixed here ?

Thanks,
Taran



  




Re: streaming silently failing when executing binaries with unresolved dependencies

2008-10-02 Thread Amareshwari Sriramadasu
This is because a non-zero exit status of the streaming process was not 
treated as failure until 0.17. In 0.17, you can set the configuration 
property stream.non.zero.exit.is.failure to true to treat a non-zero 
exit as a failure. From 0.18, the default value for 
stream.non.zero.exit.is.failure is true.


Thanks
Amareshwari
Chris Dyer wrote:

Hi all-
I am using streaming with some c++ mappers and reducers.  One of the
binaries I attempted to run this evening had a dependency on a shared
library that did not exist on my cluster, so it failed during
execution.  However, the streaming framework didn't appear to
recognize this failure, and the job tracker indicated that the mapper
returned success, but did not produce any results.  Has anyone else
encountered this issue?  Should I open a JIRA issue about this?  I'm
using Hadoop-17.2
Thanks-
Chris
  




Re: LZO and native hadoop libraries

2008-09-30 Thread Amareshwari Sriramadasu

Are you seeing HADOOP-2009?

Thanks
Amareshwari
Nathan Marz wrote:
Unfortunately, setting those environment variables did not help my 
issue. It appears that the HADOOP_LZO_LIBRARY variable is not 
defined in both LzoCompressor.c and LzoDecompressor.c. Where is this 
variable supposed to be set?




On Sep 30, 2008, at 12:33 PM, Colin Evans wrote:


Hi Nathan,
You probably need to add the Java headers to your build path as well 
- I don't know why the Mac doesn't ship with this as a default setting:


export 
CPATH=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/include 

export 
CPPFLAGS=-I/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/include 






Nathan Marz wrote:
Thanks for the help. I was able to get past my previous issue, but 
the native build is still failing. Here is the end of the log output:


[exec] then mv -f .deps/LzoCompressor.Tpo 
.deps/LzoCompressor.Plo; else rm -f .deps/LzoCompressor.Tpo; 
exit 1; fi

[exec] mkdir .libs
[exec]  gcc -DHAVE_CONFIG_H -I. 
-I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo 
-I../../../../../../.. -I/Library/Java/Home//include 
-I/Users/nathan/Downloads/hadoop-0.18.1/src/native/src -g -Wall 
-fPIC -O2 -m32 -g -O2 -MT LzoCompressor.lo -MD -MP -MF 
.deps/LzoCompressor.Tpo -c 
/Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c  
-fno-common -DPIC -o .libs/LzoCompressor.o
[exec] 
/Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c: 
In function 
'Java_org_apache_hadoop_io_compress_lzo_LzoCompressor_initIDs':
[exec] 
/Users/nathan/Downloads/hadoop-0.18.1/src/native/src/org/apache/hadoop/io/compress/lzo/LzoCompressor.c:135: 
error: syntax error before ',' token

[exec] make[2]: *** [LzoCompressor.lo] Error 1
[exec] make[1]: *** [all-recursive] Error 1
[exec] make: *** [all] Error 2


Any ideas?



On Sep 30, 2008, at 11:53 AM, Colin Evans wrote:


There's a patch to get the native targets to build on Mac OS X:

http://issues.apache.org/jira/browse/HADOOP-3659

You probably will need to monkey with LDFLAGS as well to get it to 
work, but we've been able to build the native libs for the Mac 
without too much trouble.



Doug Cutting wrote:

Arun C Murthy wrote:
You need to add libhadoop.so to your java.library.patch. 
libhadoop.so is available in the corresponding release in the 
lib/native directory.


I think he needs to first build libhadoop.so, since he appears to 
be running on OS X and we only provide Linux builds of this in 
releases.


Doug












Re: streaming question

2008-09-16 Thread Amareshwari Sriramadasu

Looks like you have to wait for HADOOP-3570 and use -libjars for the same.

Thanks
Amareshwari
Christian Ulrik Søttrup wrote:
Ok i've tried what you suggested and all sorts of combinations with no 
luck.
Then I went through the source of the Streaming lib. It looks like it 
checks for the existence
of the combiner while it is building the jobconf i.e. before the job 
is sent to the nodes.
It calls class.forName() on the combiner in goodClassOrNull() from 
StreamUtil.java

called from setJobconf() in StreamJob.java.

Anybody have an idea how i can use a custom combiner? would I have to 
package it into the streaming jar?


cheers,
Christian

Dennis Kubes wrote:

If testlink is a package, it should be:

hadoop -jar streaming/hadoop-0.17.0-streaming.jar -input store 
-output cout -mapper MyProg -combiner testlink.combiner -reducer 
testlink.reduce -file /home/hadoop/MyProg -cacheFile 
/shared/part-0#in.cl -cacheArchive /related/MyJar.jar#testlink


if not a package, remove the testlink part.

Dennis

Christian Ulrik Søttrup wrote:
Ok, so I added the JAR to the cacheArchive option and my command 
looks like this:


hadoop jar streaming/hadoop-0.17.0-streaming.jar  -input /store/ 
-output /cout/ -mapper MyProg -combiner testlink/combiner.class 
-reducer testlink/reduce.class -file /home/hadoop/MyProg -cacheFile 
/shared/part-0#in.cl -cacheArchive /related/MyJar.jar#testlink


Now it fails because it cannot find the combiner.  The cacheArchive 
option creates a symlink in the local running directory, correct? 
Just like the cacheFile option? If not how can i then specify which 
class to use?


cheers,
Christian

Amareshwari Sriramadasu wrote:

Dennis Kubes wrote:
If I understand what you are asking you can use the -cacheArchive 
with the path to the jar to including the jar file in the 
classpath of your streaming job.


Dennis

You can also use the -cacheArchive option to include the jar file and 
symlink the unjarred directory from the cwd by providing the URI as 
hdfs://path#link. You then have to provide the -reducer and -combiner 
options as the appropriate paths inside the unjarred directory.


Thanks
Amareshwari

Christian Søttrup wrote:

Hi all,

I have an application that i use to run with the hadoop jar 
command.

I have now written an optimized version of the mapper in C.
I have run this using the streaming library and everything looks 
ok (using num.reducers=0).


Now i want to use this mapper together with the combiner and 
reducer from my old .jar file.
How do i do this? How can i distribute the jar and run the 
reducer and combiner from it?

While also running the c program as the mapper in streaming mode.

cheers,
Christian










Re: streaming question

2008-09-14 Thread Amareshwari Sriramadasu

Dennis Kubes wrote:
If I understand what you are asking you can use the -cacheArchive with 
the path to the jar to including the jar file in the classpath of your 
streaming job.


Dennis

You can also use the -cacheArchive option to include the jar file and symlink 
the unjarred directory from the cwd by providing the URI as 
hdfs://path#link. You then have to provide the -reducer and -combiner options 
as the appropriate paths inside the unjarred directory.


Thanks
Amareshwari

Christian Søttrup wrote:

Hi all,

I have an application that i use to run with the hadoop jar command.
I have now written an optimized version of the mapper in C.
I have run this using the streaming library and everything looks ok 
(using num.reducers=0).


Now i want to use this mapper together with the combiner and reducer 
from my old .jar file.
How do i do this? How can i distribute the jar and run the reducer 
and combiner from it?

While also running the c program as the mapper in streaming mode.

cheers,
Christian






Re: Logging best practices?

2008-09-08 Thread Amareshwari Sriramadasu

Per Jacobsson wrote:

Hi all.
I've got a beginner question: Are there any best practices for how to do
logging from a task? Essentially I want to log warning messages under
certain conditions in my map and reduce tasks, and be able to review them
later.

  
stdout, stderr, and the logs written via commons-logging from the task are
stored in the userlogs directory, i.e. ${hadoop.log.dir}/userlogs/<taskid>.
They are also available on the web UI.

Is good old commons-logging using the TaskLogAppender the best way to solve
this? 

I think using commons-logging is good.

I assume I'd have to configure it to log to StdErr to be able to see
the log messages in the jobtracker webapp. The Reporter would be useful to
track statistics but not for something like this. And the JobHistory class
and history logs are intended for internal use only?

  
The JobHistory class is for internal use only, but the history logs can be
viewed from the web UI and with HistoryViewer.

Thanks a lot,
Per

  


Thanks
Amareshwari
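
For illustration, a minimal sketch of logging from a task with commons-logging against the old mapred API (the class and messages below are invented, not from this thread); the output ends up in the task's logs under ${hadoop.log.dir}/userlogs/<taskid> and in the web UI:

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper that logs a warning for records it cannot use.
public class WarnOnBadRecordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final Log LOG = LogFactory.getLog(WarnOnBadRecordMapper.class);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    if (line.trim().length() == 0) {
      // Goes to the task's syslog via the task log appender.
      LOG.warn("Empty record at byte offset " + key.get());
      return;
    }
    output.collect(new Text(line), new LongWritable(1));
  }
}

The same pattern works in a reducer; anything written to stdout or stderr lands in the same userlogs directory.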


Re: input files

2008-08-20 Thread Amareshwari Sriramadasu
You can add more input paths using
FileInputFormat.addInputPath(JobConf, Path).
You can also specify comma-separated filenames as the input path using
FileInputFormat.setInputPaths(JobConf, String commaSeparatedPaths).
More details at
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html

You can also use a glob path to specify multiple paths at once.

Thanks
Amareshwari
Deepak Diwakar wrote:

Hadoop usually takes either a single file or a folder as an input parameter.
But is it possible to modify it so that it can take a list of files (not a
folder) as the input parameter?
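
As a rough sketch of the options described above (the paths here are invented; old mapred API assumed):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class MultiInputExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(MultiInputExample.class);

    // Option 1: add files (or directories) one at a time.
    FileInputFormat.addInputPath(conf, new Path("/logs/2008-08-01.log"));
    FileInputFormat.addInputPath(conf, new Path("/logs/2008-08-02.log"));

    // Option 2: a single comma-separated list of paths.
    FileInputFormat.setInputPaths(conf,
        "/logs/2008-08-01.log,/logs/2008-08-02.log");

    // Option 3: a glob matching many files at once.
    FileInputFormat.setInputPaths(conf, new Path("/logs/2008-08-*.log"));

    // ... then set mapper/reducer/output path and submit with JobClient.runJob(conf).
  }
}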


  




Re: Running mapred job from remote machine to a pseudo-distributed hadoop

2008-08-03 Thread Amareshwari Sriramadasu

Arv Mistry wrote:

I'll try again: can anyone tell me whether it should be possible to run Hadoop
in pseudo-distributed mode (i.e. everything on one machine) and then
submit a mapred job using the ToolRunner from another machine against that
Hadoop configuration?

Cheers Arv
 
  

Yes, it is possible. You can start a Hadoop cluster on a single node.
Documentation is available at
http://hadoop.apache.org/core/docs/current/quickstart.html#PseudoDistributed
Once the cluster is up, you can submit jobs from any client, but the
client configuration must know the NameNode and JobTracker addresses.
You can use the generic options *-fs* and *-jt* on the command line for this.


Thanks
Amareshwari
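
A hedged sketch of a client-side job driver for this setup (host names, ports, and paths are placeholders; they must match fs.default.name and mapred.job.tracker on the cluster, and the same values can be passed with -fs and -jt instead):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class RemoteSubmitExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(RemoteSubmitExample.class);

    // Tell the remote client where the NameNode and JobTracker live.
    conf.set("fs.default.name", "hdfs://pseudo-host:9000");
    conf.set("mapred.job.tracker", "pseudo-host:9001");

    FileInputFormat.setInputPaths(conf, new Path("/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/output"));

    // Mapper/reducer classes would be set here as usual.
    JobClient.runJob(conf);
  }
}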


-Original Message-
From: Arv Mistry [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 31, 2008 2:32 PM

To: core-user@hadoop.apache.org
Subject: Running mapred job from remote machine to a pseudo-distributed
hadoop

 
I have hadoop setup in a pseudo-distributed mode i.e. everything on one

machine, And I'm trying to submit a hadoop mapred job from another
machine to that hadoop setup.

At the point that I run the mapred job I get the following error. Any
ideas as to what I'm doing wrong?
Is this possible in a pseudo-distributed mode?

Cheers Arv

 INFO   | jvm 1| 2008/07/31 14:01:00 | 2008-07-31 14:01:00,547 ERROR
[HadoopJobTool] java.io.IOException:
/tmp/hadoop-root/mapred/system/job_200807310809_0006/job.xml: No such
file or directory
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:215)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:149)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1155)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1136)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:175)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1755)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.jav
a:39)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor
Impl.java:25)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
java.lang.reflect.Method.invoke(Method.java:597)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
INFO   | jvm 1| 2008/07/31 14:01:00 |
INFO   | jvm 1| 2008/07/31 14:01:00 |
org.apache.hadoop.ipc.RemoteException: java.io.IOException:
/tmp/hadoop-root/mapred/system/job_200807310809_0006/job.xml: No such
file or directory
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:215)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:149)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1155)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1136)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:175)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1755)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.jav
a:39)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor
Impl.java:25)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
java.lang.reflect.Method.invoke(Method.java:597)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
INFO   | jvm 1| 2008/07/31 14:01:00 |
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.ipc.Client.call(Client.java:557)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
$Proxy5.submitJob(Unknown Source)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO   | jvm 1| 2008/07/31 14:01:00 |   at

Re: Could not find any valid local directory for task

2008-08-03 Thread Amareshwari Sriramadasu
The error "Could not find any valid local directory for task" means that
the task could not find a local directory to write a file to, usually because
there is not enough space on any of the disks configured in mapred.local.dir.


Thanks
Amareshwari
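
A small hedged sketch of a diagnostic one could run on a tasktracker node (with the cluster configuration on the classpath) to check free space on the directories listed in mapred.local.dir; the default path below is only an assumption, and getUsableSpace needs Java 6:

import java.io.File;

import org.apache.hadoop.mapred.JobConf;

public class LocalDirSpaceCheck {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // mapred.local.dir is a comma-separated list of local directories
    // the tasktracker writes intermediate data (spills etc.) to.
    String dirs = conf.get("mapred.local.dir", "/tmp/hadoop/mapred/local");
    for (String dir : dirs.split(",")) {
      File f = new File(dir.trim());
      System.out.println(dir.trim() + ": exists=" + f.exists()
          + ", free=" + (f.getUsableSpace() / (1024 * 1024)) + " MB");
    }
  }
}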

Shirley Cohen wrote:

Hi,

Does anyone know what the following error means?

hadoop-0.16.4/logs/userlogs/task_200808021906_0002_m_14_2]$ cat 
syslog
2008-08-02 20:28:00,443 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=MAP, sessionId=
2008-08-02 20:28:00,684 INFO org.apache.hadoop.mapred.MapTask: 
numReduceTasks: 15
2008-08-02 20:30:08,594 WARN org.apache.hadoop.mapred.TaskTracker: 
Error running child

java.io.IOException
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:719)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:209)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: 
Could not find any valid local directory for 
task_200808021906_0002_m_14_2/spill4.out


Please let me know if you need more information about my setup.

Thanks in advance,

Shirley




Re: mapper input file name

2008-08-03 Thread Amareshwari Sriramadasu
You can get the name of the file currently being processed by the mapper
from the config property map.input.file.


Thanks
Amareshwari
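
A minimal sketch against the old mapred API (the class name is invented) of picking the property up in configure() and using it in map():

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper that remembers which input file its split came from.
public class FileAwareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String inputFile;

  public void configure(JobConf job) {
    // Set by the framework for file-based input formats.
    inputFile = job.get("map.input.file");
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Emit the source file name alongside each record.
    output.collect(new Text(inputFile), value);
  }
}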
Deyaa Adranale wrote:

Hi,

I need to know, inside my mapper, the name of the file that contains
the current record.
I saw that I can access the name of the input directories inside
mapper.config(), but my input contains different files and I need to
know the name of the current one.

Any hints?

thanks in advance,

Deyaa




Re: help,error ...failed to report status for xxx seconds...

2008-08-03 Thread Amareshwari Sriramadasu
The MapReduce framework kills map/reduce tasks if they don't report
status within 10 minutes. If your mapper/reducer needs more time, it
should report status using the Reporter:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/Reporter.html
More documentation is at
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Reporter
You can also increase the task timeout by setting mapred.task.timeout.

Thanks
Amareshwari
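
As a rough sketch (old mapred API, class name invented) of reporting progress from a slow mapper, plus the per-job way to raise the timeout:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper whose per-record work is slow; it keeps telling the
// framework it is alive so the task is not killed for inactivity.
public class SlowWorkMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] fields = value.toString().split("\t");
    for (int i = 0; i < fields.length; i++) {
      // ... expensive per-field work would go here ...
      reporter.progress();                      // heartbeat to the tasktracker
      reporter.setStatus("field " + (i + 1) + " of " + fields.length);
    }
    output.collect(new Text("processed"), value);
  }
}

// At submission time the timeout (in milliseconds) can also be raised per job:
//   conf.setLong("mapred.task.timeout", 30 * 60 * 1000);   // 30 minutes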
wangxu wrote:
 Hi all,
 I always get this kind of error when running a map job:

 Task task_200807130149_0067_m_00_0 failed to report status for 604
 seconds. Killing!


 I am using hadoop-0.16.4-core.jar, one namenode, one datanode.

 What does this error message suggest? Does it mean the functions in my
 mapper are too slow?
 I assume there is no network connection issue.
 What can I do about this error?



 Thanks,
 Xu




   



Re: Where can i download hadoop-0.17.1-examples.jar

2008-07-30 Thread Amareshwari Sriramadasu

Hi Srilatha,

You can download a Hadoop release tarball from
http://hadoop.apache.org/core/releases.html

You will find hadoop-*-examples.jar when you untar it.

Thanks,
Amareshwari

us latha wrote:

HI All,

Trying to run the wordcount example on a single-node Hadoop setup.
Could anyone please point me to the location from where I can download
hadoop-0.17.1-examples.jar?

Thankyou
Srilatha

  




Re: JobTracker History data+analysis

2008-07-28 Thread Amareshwari Sriramadasu
HistoryViewer is used in JobClient to view the history files in the 
directory provided on the command line. The command is
$ bin/hadoop job -history history-dir  #by default history is stored 
in output dir.
outputDir in the constructor of HistoryViewer is the directory passed on 
the command-line.


You can specify a location to store the history files of a particular 
job using hadoop.job.history.user.location. If nothing is specified, 
the logs are stored in the job's
output directory i.e. mapred.output.dir. The files are stored in 
_logs/history/ inside the directory.

Thanks
Amareshwari
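
A small hedged sketch of pointing a job's history at a fixed location (the path is invented):

import org.apache.hadoop.mapred.JobConf;

public class HistoryLocationExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(HistoryLocationExample.class);

    // Keep this job's history files in one place instead of
    // _logs/history/ under the job's output directory.
    conf.set("hadoop.job.history.user.location", "/user/hadoop/job-history");

    // Later, the stored history can be browsed with:
    //   bin/hadoop job -history /user/hadoop/job-history
  }
}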

Paco NATHAN wrote:

Thank you, Amareshwari -

That helps.  Hadn't noticed HistoryViewer before. It has no JavaDoc.

What is a typical usage?  In other words, what would be the
outputDir value in the context of ToolRunner, JobClient, etc. ?

Paco


On Sun, Jul 27, 2008 at 11:48 PM, Amareshwari Sriramadasu
[EMAIL PROTECTED] wrote:
  

Can you have a look at org.apache.hadoop.mapred.HistoryViewer and see if it
makes sense?

Thanks
Amareshwari

Paco NATHAN wrote:


We have a need to access data found in the JobTracker History link.
Specifically in the Analyse This Job analysis. Must be run in Java,
between jobs, in the same code which calls ToolRunner and JobClient.
In essence, we need to collect descriptive statistics about task
counts and times for map, shuffle, reduce.

After tracing the flow of the JSP in src/webapps/job...  Is there a
better way to get at this data, *not* from the web UI perspective but
from the code?

Tried to find any applicable patterns in JobTracker, ClusterStatus,
JobClient, etc., but no joy.

Thanks,
Paco

  





Re: JobTracker History data+analysis

2008-07-28 Thread Amareshwari Sriramadasu

Paco NATHAN wrote:

Thanks Amareshwari -

That could be quite useful to access summary analysis from within the code.

Currently this is not written as a public class, which makes it
difficult to use inside application code.

Are there plans to make it a public class?

  
I created a jira for this:
https://issues.apache.org/jira/browse/HADOOP-3850. You can give your
inputs there.


Thanks
Amareshwari

Paco


On Mon, Jul 28, 2008 at 1:42 AM, Amareshwari Sriramadasu
[EMAIL PROTECTED] wrote:
  

HistoryViewer is used in JobClient to view the history files in the
directory provided on the command line. The command is
$ bin/hadoop job -history history-dir  #by default history is stored in
output dir.
outputDir in the constructor of HistoryViewer is the directory passed on the
command-line.

You can specify a location to store the history files of a particular job
using hadoop.job.history.user.location. If nothing is specified, the logs
are stored in the job's
output directory i.e. mapred.output.dir. The files are stored in
_logs/history/ inside the directory.
Thanks
Amareshwari

Paco NATHAN wrote:


Thank you, Amareshwari -

That helps.  Hadn't noticed HistoryViewer before. It has no JavaDoc.

What is a typical usage?  In other words, what would be the
outputDir value in the context of ToolRunner, JobClient, etc. ?

Paco


On Sun, Jul 27, 2008 at 11:48 PM, Amareshwari Sriramadasu
[EMAIL PROTECTED] wrote:

  

Can you have a look at org.apache.hadoop.mapred.HistoryViewer and see if
it makes sense?

Thanks
Amareshwari

Paco NATHAN wrote:



We have a need to access data found in the JobTracker History link.
Specifically in the Analyse This Job analysis. Must be run in Java,
between jobs, in the same code which calls ToolRunner and JobClient.
In essence, we need to collect descriptive statistics about task
counts and times for map, shuffle, reduce.

After tracing the flow of the JSP in src/webapps/job...  Is there a
better way to get at this data, *not* from the web UI perspective but
from the code?

Tried to find any applicable patterns in JobTracker, ClusterStatus,
JobClient, etc., but no joy.

Thanks,
Paco


  






Re: JobTracker History data+analysis

2008-07-27 Thread Amareshwari Sriramadasu
Can you have a look at org.apache.hadoop.mapred.HistoryViewer and see if
it makes sense?


Thanks
Amareshwari

Paco NATHAN wrote:

We have a need to access data found in the JobTracker History link.
Specifically in the Analyse This Job analysis. Must be run in Java,
between jobs, in the same code which calls ToolRunner and JobClient.
In essence, we need to collect descriptive statistics about task
counts and times for map, shuffle, reduce.

After tracing the flow of the JSP in src/webapps/job...  Is there a
better way to get at this data, *not* from the web UI perspective but
from the code?

Tried to find any applicable patterns in JobTracker, ClusterStatus,
JobClient, etc., but no joy.

Thanks,
Paco
  




Re: Tasktrackers job cache directories not always cleaned up

2008-07-09 Thread Amareshwari Sriramadasu
The proposal on http://issues.apache.org/jira/browse/HADOOP-3386 takes 
care of this.


Thanks
Amareshwari
Amareshwari Sriramadasu wrote:
If the task tracker didn't receive a KillJobAction, it's true that the job
directory will not be removed.
And your observation is correct that some task trackers didn't receive
a KillJobAction for the job.
If a reduce task finished before job completion, that task will have been
sent a KillTaskAction.

Looks like there is a bug in sending KillJobAction to the task tracker.
Could you please file a jira for this?

Thanks
Amareshwari

The task subdirectories are being deleted, but the job directory and
its work subdirectory are not. This is causing a problem since disk
space is filling up over time, and restarting the cluster after a long
time is very slow as the tasktrackers clear out the jobcache
directories.

This doesn't happen for every task run by a tasktracker, but it is
happening to a significant number.

I think it has something to do with the KillJobAction not being called,
because if I grep the log for lines from the relevant job containing
"Kill" I see this:

2008-07-01 10:15:04,046 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0214:localhost/127.0.0.1:41484 - KillTaskAction:
task_200806300936_0279_r_00_0
2008-07-01 10:15:16,223 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0214:localhost/127.0.0.1:41484 - KillTaskAction:
task_200806300936_0279_r_01_0
2008-07-01 10:15:31,556 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0214:localhost/127.0.0.1:41484 - KillTaskAction:
task_200806300936_0279_r_03_0
2008-07-01 10:15:39,882 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0207:localhost/127.0.0.1:37241 - KillTaskAction:
task_200806300936_0279_r_02_0
2008-07-01 10:15:41,863 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0214:localhost/127.0.0.1:41484 - KillTaskAction:
task_200806300936_0279_r_04_0
2008-07-01 10:15:51,484 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0207:localhost/127.0.0.1:37241 - KillTaskAction:
task_200806300936_0279_r_06_0
2008-07-01 10:15:51,939 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0214:localhost/127.0.0.1:41484 - KillTaskAction:
task_200806300936_0279_r_07_0
2008-07-01 10:15:59,695 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0207:localhost/127.0.0.1:37241 - KillTaskAction:
task_200806300936_0279_r_08_0
2008-07-01 10:16:45,620 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0202:localhost/127.0.0.1:47183 - KillTaskAction:
task_200806300936_0279_r_05_0
2008-07-01 10:16:47,328 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0216:localhost/127.0.0.1:37282 - KillJobAction:
job_200806300936_0279
2008-07-01 10:16:47,334 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f020c:localhost/127.0.0.1:52033 - KillJobAction:
job_200806300936_0279
2008-07-01 10:16:47,453 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0210:localhost/127.0.0.1:35235 - KillJobAction:
job_200806300936_0279
2008-07-01 10:16:47,768 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f020d:localhost/127.0.0.1:41562 - KillJobAction:
job_200806300936_0279
2008-07-01 10:16:48,652 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0203:localhost/127.0.0.1:65277 - KillJobAction:
job_200806300936_0279
2008-07-01 10:16:49,005 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0205:localhost/127.0.0.1:48747 - KillJobAction:
job_200806300936_0279
2008-07-01 10:16:49,365 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0209:localhost/127.0.0.1:59538 - KillJobAction:
job_200806300936_0279
2008-07-01 10:16:49,563 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0214:localhost/127.0.0.1:41484 - KillJobAction:
job_200806300936_0279
2008-07-01 10:16:49,747 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f020a:localhost/127.0.0.1:40410 - KillJobAction:
job_200806300936_0279
2008-07-01 10:16:50,321 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0212:localhost/127.0.0.1:33514 - KillJobAction:
job_200806300936_0279
2008-07-01 10:16:50,352 DEBUG org.apache.hadoop.mapred.JobTracker:
tracker_m0f0207:localhost/127.0.0.1:37241 - KillJobAction:
job_200806300936_0279

Notice that tracker_m0f0202 receives a KillTaskAction (which removes
the task working directory), but not a KillJobAction (which would
remove the job directory). All the other trackers received
KillJobAction. I'm not sure what's happening here to cause this.

This is on 0.16.4.

Anyone else seen this?

Tom
  






Re: Why is there a seperate map and reduce task capacity?

2008-06-16 Thread Amareshwari Sriramadasu

Taeho Kang wrote:

Set mapred.tasktracker.tasks.maximum
and each node will be able to process up to N tasks, map and/or reduce.

Please note that once you set mapred.tasktracker.tasks.maximum,
the mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum settings will not take effect.



  
This is valid only up to 0.16.*, because the property
mapred.tasktracker.tasks.maximum was removed in 0.17.
So, from 0.17 onward, mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum should be used.

On Tue, Jun 17, 2008 at 1:46 PM, Amar Kamat [EMAIL PROTECTED] wrote:

  

Daniel Leffel wrote:



Why not just combine them? How do I do that?



  

Consider a cluster of n nodes configured to process just one task per node,
and let there be (n-1) reducers. Let's assume the map phase is complete and
the reducers are shuffling, so (n-1) nodes are running reducers. Now suppose
the only node without a reducer is lost. The cluster needs slots to re-run
the maps that were lost, since the reducers are waiting for those map
outputs, but with that node gone every remaining slot is occupied by a
reducer. In such a case the job gets stuck. To avoid such cases, there are
separate map and reduce task slots.
Amar

 The rationale is that our tasks are very balanced in load, but unbalanced
in timing. I've found that limiting the total number of threads is
the safest way to avoid overloading the dfs daemon. To date,
I've done that just through intelligent scheduling of jobs to stagger
maps and reduces, but have I missed a setting that simply
limits the total number of tasks?


  



  




Re: External Jar

2008-05-29 Thread Amareshwari Sriramadasu
You can put your external jar in the DistributedCache and symlink the
jar into the current working directory of the task by setting
mapred.create.symlink to true. More details can be found at
http://issues.apache.org/jira/browse/HADOOP-1660.

The jar can also be added to the classpath using the API
DistributedCache.addArchiveToClassPath().


Thanks
Amareshwari
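
A minimal sketch of both approaches (the jar path is invented; old mapred API assumed):

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class ExternalJarExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ExternalJarExample.class);

    // Approach 1: put the jar on the task classpath via the distributed cache.
    DistributedCache.addArchiveToClassPath(new Path("/libs/simmetrics.jar"), conf);

    // Approach 2: cache the jar and symlink it into the task's working directory.
    DistributedCache.addCacheArchive(new URI("/libs/simmetrics.jar#simmetrics.jar"), conf);
    DistributedCache.createSymlink(conf);   // i.e. mapred.create.symlink=true
  }
}

With the symlink approach the jar shows up in the task's working directory under the name after the '#'.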
Einar Vollset wrote:

Pretty please with sugar on top. ;-)


On Thu, May 29, 2008 at 3:34 PM, Brian Vargas [EMAIL PROTECTED] wrote:
  


I've got a Maven2 assembly file for creating a Hadoop-runnable JAR file
using the Maven assembly plugin.  I'd be happy to share it if you'd like.

Brian

Michael Bieniosek wrote:
| When you build your job jar, you can include other jars in the lib/
| directory inside the jar.
|
| -Michael
|
| On 5/29/08 10:37 AM, Tanton Gibbs [EMAIL PROTECTED] wrote:
|
| What is the right way to use a jar file within my map/reduce program?
| I want to use the simmetrics code for double metaphone, but I'm not
| sure how to include it so that my map/reduce code can see it.
|
| Any pointers?
|
| Tanton
|






  




Re: Newbie InputFormat Question

2008-05-08 Thread Amareshwari Sriramadasu

You can have a look at TextInputFormat, KeyValueTextInputFormat etc at
http://svn.apache.org/viewvc/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/ 



coneybeare wrote:

I want to alter the default <key, line> input format to be
<key, "line number: " + line> so that my mapper can have a reference to the
line number. It seems like this should be easy by overriding either the
InputFormat or the InputSplit... but after reading some of the docs, I am
still unsure of where to begin. Any help is much appreciated.

-Matt
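
One possible direction, sketched against the old mapred API (roughly 0.18-era signatures; class names invented, untested): wrap LineRecordReader so the value handed to the mapper carries a line number. Note the count restarts in every split, so it is a per-split line number unless the input is made non-splittable.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical input format that prefixes each line with its line number
// within the split.
public class LineNumberInputFormat extends TextInputFormat {

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new LineNumberRecordReader(job, (FileSplit) split);
  }

  static class LineNumberRecordReader implements RecordReader<LongWritable, Text> {
    private final LineRecordReader reader;
    private long lineNum = 0;

    LineNumberRecordReader(JobConf job, FileSplit split) throws IOException {
      reader = new LineRecordReader(job, split);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      if (!reader.next(key, value)) {
        return false;
      }
      lineNum++;
      // Rewrite the value so the mapper sees the line number too.
      value.set("line number: " + lineNum + "\t" + value.toString());
      return true;
    }

    public LongWritable createKey() { return reader.createKey(); }
    public Text createValue() { return reader.createValue(); }
    public long getPos() throws IOException { return reader.getPos(); }
    public float getProgress() throws IOException { return reader.getProgress(); }
    public void close() throws IOException { reader.close(); }
  }
}

The job would then select it with conf.setInputFormat(LineNumberInputFormat.class).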
  




Re: Question on how to view the counters of jobs in the job tracker history

2008-04-07 Thread Amareshwari Sriramadasu

Arun C Murthy wrote:


On Apr 3, 2008, at 5:36 PM, Jason Venner wrote:

For the first day or so, while the jobs are viewable via the main page
of the job tracker web interface, the job-specific counters are also
visible. Once a job is only visible in the history page, its
counters are not visible.


You are right. Counters are available in Job history but they are not 
visible. I created https://issues.apache.org/jira/browse/HADOOP-3200 for 
the same.

Is it possible to view the counters of the older jobs?



Which version of hadoop are you running? I believe counters were 
persisted into job-history starting with 0.16.0.


Arun


--Jason Venner
Attributor - Publish with Confidence http://www.attributor.com/
Attributor is hiring Hadoop Wranglers, contact if interested






Re: Hadoop streaming performance problem

2008-04-01 Thread Amareshwari Sriramadasu

LineRecordReader.readLine() was deprecated by HADOOP-2285
(http://issues.apache.org/jira/browse/HADOOP-2285) because it was slow,
but streaming still uses the method. HADOOP-2826
(http://issues.apache.org/jira/browse/HADOOP-2826) will remove that usage
from streaming.
This change should improve streaming performance: when I ran a simple cat
through streaming, it took 33 seconds with HADOOP-2826, whereas
with trunk it took 52 seconds.


Thanks
Amareshwari.

lin wrote:

Hi,

I am looking into using Hadoop streaming to parallelize some simple
programs. So far the performance has been pretty disappointing.

The cluster contains 5 nodes. Each node has two CPU cores. The task capacity
of each node is 2. The Hadoop version is 0.15.

Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
standalone (on a single CPU core). Program 2 runs for 5 minutes on the Hadoop
cluster and 4.5 minutes standalone. Both programs run as map-only jobs.

I understand that there is some overhead in starting up tasks and in reading
from and writing to the distributed file system. But that does not seem to
explain all the overhead. Most map tasks are data-local. I modified program
1 to output nothing and saw the same magnitude of overhead.

The output of top shows that the majority of the CPU time is consumed by
Hadoop java processes (e.g. org.apache.hadoop.mapred.TaskTracker$Child). So
I added a profile option (-agentlib:hprof=cpu=samples) to
mapred.child.java.opts.

The profile results show that most of the CPU time is spent in the following
methods:

   rank   self  accum   count trace method

   1 23.76% 23.76%1246 300472 java.lang.UNIXProcess.waitForProcessExit

   2 23.74% 47.50%1245 300474 java.io.FileInputStream.readBytes

   3 23.67% 71.17%1241 300479 java.io.FileInputStream.readBytes

   4 16.15% 87.32% 847 300478 java.io.FileOutputStream.writeBytes

And their stack traces show that these methods are for interacting with the
map program.


TRACE 300472:

java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknownline)

java.lang.UNIXProcess.access$900(UNIXProcess.java:20)

java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)

TRACE 300474:

java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)

java.io.FileInputStream.read(FileInputStream.java:199)

java.io.BufferedInputStream.read1(BufferedInputStream.java:256)

java.io.BufferedInputStream.read(BufferedInputStream.java:317)

java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

java.io.BufferedInputStream.read(BufferedInputStream.java:237)

java.io.FilterInputStream.read(FilterInputStream.java:66)

org.apache.hadoop.mapred.LineRecordReader.readLine(
LineRecordReader.java:136)

org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
UTF8ByteArrayUtils.java:157)

org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(
PipeMapRed.java:348)

TRACE 300479:

java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)

java.io.FileInputStream.read(FileInputStream.java:199)

java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

java.io.BufferedInputStream.read(BufferedInputStream.java:237)

java.io.FilterInputStream.read(FilterInputStream.java:66)

org.apache.hadoop.mapred.LineRecordReader.readLine(
LineRecordReader.java:136)

org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
UTF8ByteArrayUtils.java:157)

org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(
PipeMapRed.java:399)

TRACE 300478:

java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknownline)

java.io.FileOutputStream.write(FileOutputStream.java:260)

java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java
:65)

java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)

java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)

java.io.DataOutputStream.flush(DataOutputStream.java:106)

org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)

org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)

org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java
:1760)


I don't understand why Hadoop streaming needs so much CPU time to read from
and write to the map program. Note that it spends 23.67% of the time reading
from the standard error of the map program, while the program does not output
any errors at all!

Does anyone know any way to get rid of this seemingly unnecessary overhead
in Hadoop streaming?

Thanks,

Lin

  




Re: Hadoop streaming cacheArchive

2008-03-20 Thread Amareshwari Sriramadasu

Norbert Burger wrote:

I'm trying to use the cacheArchive command-line option with
hadoop-0.15.3-streaming.jar.  I'm using the option as follows:

-cacheArchive hdfs://host:50001/user/root/lib.jar#lib

Unfortunately, my Perl scripts fail with an error consistent with not being
able to find the 'lib' directory (which, as I understand it, should point back
to an extracted version of lib.jar).

  
Here, 'lib' is created as a symlink in the task's working directory. It will
have the jar file and the extracted version of the jar file.
Where are your Perl scripts searching for 'lib'? Is '.' included in
your classpath?
Otherwise, you can use the mapred.job.classpath.archives config item; this
adds the files to the classpath and also to the distributed cache.

you can use
  -jobconf 
mapred.job.classpath.archives=hdfs://host:50001/user/root/lib.jar#lib

I know that the original JAR exists in HDFS, but I don't see any evidence of
lib.jar or a link called 'lib' inside my job.jar.  
The link 'lib' will not be part of job.jar; it will be distributed to
all the nodes at task launch, and the task's current working directory
will have the link 'lib' pointing to the jar in the cache.

How can I troubleshoot
cacheArchive further?  Should the files/dirs specified via cacheArchive be
contained inside the job.jar?  If not, where should they be in HDFS?

  
They can be anywhere on HDFS. You need to give the complete path to add it
to the cache.

Thanks for any help.

Norbert

  




Re: streaming problem

2008-03-18 Thread Amareshwari Sriramadasu

Hi Andreas,
Looks like your mapper is not available to the streaming jar. Where is
your mapper script? Did you use the distributed cache to distribute the mapper?
You can use -file <mapper-script-path on the local fs> to make it part of the
job jar, or use -cacheFile /dist/workloadmf#workloadmf to distribute the
script. Distributing this way will add your script to the PATH.

So now your command will be:

time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper
workloadmf -reducer NONE -input testlogs/* -output testlogs-output -cacheFile
/dist/workloadmf#workloadmf

or

time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper workloadmf
-reducer NONE -input testlogs/* -output testlogs-output -file <path-on-local-fs>

Thanks,
Amareshwari

Andreas Kostyrka wrote:

Some additional details in case it helps: the HDFS is hosted on AWS S3,
and the input file set consists of 152 gzipped Apache log files.

Thanks,

Andreas

Am Dienstag, den 18.03.2008, 22:17 +0100 schrieb Andreas Kostyrka:
  

Hi!

I'm trying to run a streaming job on Hadoop 0.16.0. I've distributed the
scripts to be used to all nodes:

time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar -mapper
~/dist/workloadmf -reducer NONE -input testlogs/* -output testlogs-output

Now, this gives me:

java.io.IOException: log:null
R/W/S=1/0/0 in:0=1/2 [rec/s] out:0=0/2 [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=hadoop
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Tue Mar 18 21:06:13 GMT 2008
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:260)
at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)


at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:107)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)

Any ideas what my problems could be?

TIA,

Andreas





Re: Hadoop streaming question

2008-03-11 Thread Amareshwari Sriramadasu

Hi Andrey,

I think that is a classpath problem.
Can you try using the patch at
https://issues.apache.org/jira/browse/HADOOP-2622 and see if you still have
the problem?


Thanks
Amareshwari.

Andrey Pankov wrote:

Hi all,

I'm still new to Hadoop. I'd like to use Hadoop streaming in order to
combine a mapper written as a Java class with a reducer written as a C++
program. Currently I'm at the beginning of this task, and now I'm having
trouble with the Java class. It looks something like:



package org.company;
 ...
public class TestMapper extends MapReduceBase implements Mapper {
 ...
  public void map(WritableComparable key, Writable value,
OutputCollector output, Reporter reporter) throws IOException {
 ...


I created a jar file with my class and it is accessible via $CLASSPATH.
I'm running the streaming job using:

$HSTREAMING -mapper org.company.TestMapper -reducer "wc -l" -input
/data -output /out1


Hadoop cannot find the TestMapper class. I'm using hadoop-0.16.0. The
error is:


===
2008-03-07 18:58:07,734 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=MAP, sessionId=
2008-03-07 18:58:07,833 INFO org.apache.hadoop.mapred.MapTask: 
numReduceTasks: 1
2008-03-07 18:58:07,910 WARN org.apache.hadoop.mapred.TaskTracker: 
Error running child
java.lang.RuntimeException: java.lang.RuntimeException: 
java.lang.ClassNotFoundException: org.company.TestMapper
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:639)
at 
org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:728)
at 
org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:36)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82) 


at org.apache.hadoop.mapred.MapTask.run(MapTask.java:204)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)
Caused by: java.lang.RuntimeException: 
java.lang.ClassNotFoundException: org.company.TestMapper
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:607)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:631)

... 6 more
Caused by: java.lang.ClassNotFoundException: org.company.TestMapper
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:587) 

at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:605)

... 7 more
===

What is interesting to me: I put some debugging println() calls into
Hadoop streaming (StreamJob.java and StreamUtil.java).
Streaming can see TestMapper at the job configuration stage
(the StreamJob.setJobConf() routine) but not later. The following code
creates a new instance of TestMapper and calls the toString() defined in
TestMapper. It works.


if (mapCmd_ != null) {
  c = StreamUtil.goodClassOrNull(mapCmd_, defaultPackage);
  if (c != null) {
    System.out.println("###");
    try {
      System.out.println(c.newInstance().toString());
    } catch (Exception e) { }
    System.out.println("###");
    jobConf_.setMapperClass(c);
  } else {
    ...
  }
}


I tried to add the jar file with TestMapper using the option
-file test_mapper.jar. The result is the same.

Could anybody advise me? Thanks in advance,

---
Andrey Pankov.