Hadoop's datajoin

2010-07-10 Thread Denim Live
Hi,
I am trying to use Hadoop's datajoin for joining two relations. According to
the README file of datajoin, the syntax is:

$HADOOP_HOME/bin/hadoop jar hadoop-datajoin-examples.jar 
org.apache.hadoop.contrib.utils.join.DataJoinJob datajoin/input  
datajoin/output 
Text 1  org.apache.hadoop.contrib.utils.join.SampleDataJoinMapper  
org.apache.hadoop.contrib.utils.join.SampleDataJoinReducer  
org.apache.hadoop.contrib.utils.join.SampleTaggedMapOutput Text


But I cannot find hadoop-datajoin-examples.jar anywhere in my HADOOP_HOME. Can
anyone tell me how to build it or where to find it?

Thanks in advance.



  

Re: Terasort problem

2010-07-10 Thread Tonci Buljan
Thank you for your response, Owen. It is true, I hadn't done that; I figured
that out a few hours after posting here.

I'm having problems with understanding these variables:

mapred.tasktracker.reduce.tasks.maximum - Is this configured on every
datanode separately? What number shall I put here?

mapred.tasktracker.map.tasks.maximum - same question as
mapred.tasktracker.reduce.tasks.maximum

mapred.reduce.tasks - Is this configured ONLY on Namenode and what value
should it have for my 8 node cluster?

mapred.map.tasks - same question as mapred.reduce.tasks
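
For reference, here is a minimal sketch (old mapred API, placeholder numbers)
of how the two per-job values can be set from a driver; as far as I understand,
the two tasktracker maxima are per-node slot counts in each tasktracker's
mapred-site.xml and cannot be set per job:

// Minimal sketch with placeholder numbers, old mapred API.
// mapred.tasktracker.{map,reduce}.tasks.maximum are slot counts configured
// in each node's mapred-site.xml; the calls below set the per-job values.
import org.apache.hadoop.mapred.JobConf;

public class TeraSortTaskCounts {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // mapred.reduce.tasks: e.g. 8 worker nodes x 2 reduce slots each
        conf.setNumReduceTasks(16);
        // mapred.map.tasks is only a hint; the input splits decide the
        // actual number of map tasks
        conf.setNumMapTasks(64);
    }
}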


I've tried playing with these variables, but I keep getting the error "Too many
fetch-failures"...

Please, if anyone has any idea how to set this up the right way, let me know.

Thank you.

On 9 July 2010 15:33, Owen O'Malley omal...@apache.org wrote:

 I would guess that you didn't set the number of reducers for the job,
 and it defaulted to 2.

 -- Owen



java.lang.OutOfMemoryError: Java heap space

2010-07-10 Thread Shuja Rehman
Hi All

I am facing a hard problem. I am running a MapReduce job using streaming,
but it fails with the following error:

Caught: java.lang.OutOfMemoryError: Java heap space
at Nodemapper5.parseXML(Nodemapper5.groovy:25)

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
failed with code 1
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)

at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)



I have increased the heap size in hadoop-env.sh to 2000M. I also pass it to
the job manually with the following line:

-D mapred.child.java.opts=-Xmx2000M \

but it still gives the error. The same job runs fine if I run it on the shell
with a 1024M heap size, like:

cat file.xml | /root/Nodemapper5.groovy


Any clue?

Thanks in advance.


-- 
Regards
Shuja-ur-Rehman Baig
_
MS CS - School of Science and Engineering
Lahore University of Management Sciences (LUMS)
Sector U, DHA, Lahore, 54792, Pakistan
Cell: +92 3214207445



Re: java.lang.OutOfMemoryError: Java heap space

2010-07-10 Thread Alex Kozlov
Hi Shuja,

It looks like the OOM is happening in your code.  Are you running MapReduce
in a cluster?  If so, can you send the exact command line your code is
invoked with? You can get it with a 'ps -Af | grep Nodemapper5.groovy'
command on one of the nodes that is running the task.

Thanks,

Alex K

On Sat, Jul 10, 2010 at 10:40 AM, Shuja Rehman shujamug...@gmail.com wrote:

 Hi All

 I am facing a hard problem. I am running a map reduce job using streaming
 but it fails and it gives the following error.

 Caught: java.lang.OutOfMemoryError: Java heap space
at Nodemapper5.parseXML(Nodemapper5.groovy:25)

 java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
 failed with code 1
at
 org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at
 org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)

at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at
 org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)


 I have increased the heap size in hadoop-env.sh and make it 2000M. Also I
 tell the job manually by following line.

 -D mapred.child.java.opts=-Xmx2000M \

 but it still gives the error. The same job runs fine if i run on shell
 using
 1024M heap size like

 cat file.xml | /root/Nodemapper5.groovy


 Any clue?

 Thanks in advance.

 --
 Regards
 Shuja-ur-Rehman Baig
 _
 MS CS - School of Science and Engineering
 Lahore University of Management Sciences (LUMS)
 Sector U, DHA, Lahore, 54792, Pakistan
 Cell: +92 3214207445



Re: java.lang.OutOfMemoryError: Java heap space

2010-07-10 Thread Shuja Rehman
Hi Alex

Yeah, I am running the job on a cluster of 2 machines, using the Cloudera
distribution of Hadoop. Here is the output of this command:

root  5277  5238  3 12:51 pts/2 00:00:00 /usr/jdk1.6.0_03/bin/java
-Xmx1023m -Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs
-Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop-0.20
-Dhadoop.id.str= -Dhadoop.root.logger=INFO,console
-Dhadoop.policy.file=hadoop-policy.xml -classpath
/usr/lib/hadoop-0.20/conf:/usr/
jdk1.6.0_03/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2+320.jar:/usr/lib/hadoo
p-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.3.jar:/usr/lib/hadoop-0.20/lib/common
s-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1
.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.ja
r:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2+320.jar:/usr/l
ib/hadoop-0.20/lib/hadoop-scribe-log4j-0.20.2+320.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/h
adoop-0.20/lib/hsqldb.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.0.1.jar:/usr/lib/hadoop-0.20/lib/jackso
n-mapper-asl-1.0.1.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-ru
ntime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.14.jar:/usr/lib
/hadoop-0.20/lib/jetty-util-6.1.14.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.
2.2.jar:/usr/lib/hadoop-0.20/lib/libfb303.jar:/usr/lib/hadoop-0.20/lib/libthrift.jar:/usr/lib/hadoop-0.20/lib
/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/mysql-connector-jav
a-5.0.8-bin.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/u
sr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0
.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api
-2.1.jar org.apache.hadoop.util.RunJar
/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2+320.jar
-D mapred.child.java.opts=-Xmx2000M -inputformat StreamInputFormat
-inputreader StreamXmlRecordReader,begin=<mdc xmlns:HTML=
http://www.w3.org/TR/REC-xml>,end=</mdc> -input
/user/root/RNCDATA/MDFDORKUCRAR02/A20100531
.-0700-0015-0700_RNCCN-MDFDORKUCRAR02 -jobconf mapred.map.tasks=1
-jobconf mapred.reduce.tasks=0 -output  RNC11 -mapper
/home/ftpuser1/Nodemapper5.groovy -reducer
org.apache.hadoop.mapred.lib.IdentityReducer -file /
home/ftpuser1/Nodemapper5.groovy
root  5360  5074  0 12:51 pts/1 00:00:00 grep Nodemapper5.groovy


--
Also, what is meant by OOM? Thanks for helping.

Best Regards


On Sun, Jul 11, 2010 at 12:30 AM, Alex Kozlov ale...@cloudera.com wrote:

 Hi Shuja,

 It looks like the OOM is happening in your code.  Are you running MapReduce
 in a cluster?  If so, can you send the exact command line your code is
 invoked with -- you can get it with a 'ps -Af | grep Nodemapper5.groovy'
 command on one of the nodes which is running the task?

 Thanks,

 Alex K

 On Sat, Jul 10, 2010 at 10:40 AM, Shuja Rehman shujamug...@gmail.com
 wrote:

  Hi All
 
  I am facing a hard problem. I am running a map reduce job using streaming
  but it fails and it gives the following error.
 
  Caught: java.lang.OutOfMemoryError: Java heap space
 at Nodemapper5.parseXML(Nodemapper5.groovy:25)
 
  java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
  failed with code 1
 at
 
 org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
 at
 
 org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
 
 at
 org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
 at
  org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
 
 
  I have increased the heap size in hadoop-env.sh and make it 2000M. Also I
  tell the job manually by following line.
 
  -D mapred.child.java.opts=-Xmx2000M \
 
  but it still gives the error. The same job runs fine if i run on shell
  using
  1024M heap size like
 
  cat file.xml | /root/Nodemapper5.groovy
 
 
  Any clue?
 
  Thanks in advance.
 
  --
  Regards
  Shuja-ur-Rehman Baig
  _
  MS CS - School of Science and Engineering
  Lahore University of Management Sciences (LUMS)
  Sector U, DHA, Lahore, 54792, 

Re: reading distributed cache returns null pointer

2010-07-10 Thread abc xyz
Hi,

Thanks. OK,

Path[] ps = DistributedCache.getLocalCacheFiles(cnf);

retrieves the correct path for me in pseudo-distributed mode. But when I run my
program in fully-distributed mode with 5 nodes, I get a null pointer.
Theoretically, if it worked in pseudo-distributed mode, it should work in
fully-distributed mode as well. What could be the reasons for this behavior?

Cheers





From: Hemanth Yamijala yhema...@gmail.com
To: common-user@hadoop.apache.org
Sent: Fri, July 9, 2010 10:21:19 AM
Subject: Re: reading distributed cache returns null pointer

Hi,

 Thanks for the information. I got your point. What I specifically want to ask
 is this: if I use the following method to read my file now in each mapper:

FileSystem hdfs = FileSystem.get(conf);
URI[] uris = DistributedCache.getCacheFiles(conf);
Path my_path = new Path(uris[0].getPath());

if (hdfs.exists(my_path)) {
    FSDataInputStream fs = hdfs.open(my_path);
    while ((str = fs.readLine()) != null)
        System.out.println(str);
}
 would this method retrieve the file from HDFS, since I am using the Hadoop API
 and not the local file API?


It would be instructive to look at the test code in
src/test/mapred/org/apache/hadoop/mapred/TestMRWithDistributedCache.java.
This gives a fair idea of how to access the files of DistributedCache
from within the mapper. Specifically see how the LocalFileSystem is
used to access the files. You could look at the same class in the
branch-20 source code if you are using an older version of Hadoop.
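
For example, a minimal sketch of that pattern (class and variable names here
are illustrative, not taken from the test class) could look like this; it
resolves the local copies with getLocalCacheFiles() and opens them through the
local file system:

// A minimal sketch, assuming the old mapred API: read a DistributedCache file
// from the local file system inside the mapper's configure() method.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class CacheReadingMapper extends MapReduceBase {

    @Override
    public void configure(JobConf conf) {
        try {
            // Local paths the framework copied the cached files to on this node
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            if (cached == null || cached.length == 0) {
                throw new IOException("No files found in the distributed cache");
            }
            // Open the first cached file through the *local* file system
            FileSystem localFs = FileSystem.getLocal(conf);
            BufferedReader reader = new BufferedReader(
                new InputStreamReader(localFs.open(cached[0])));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // or build an in-memory lookup table
            }
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException("Failed to read the cached file", e);
        }
    }
}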


 I may be understanding something horribly wrong. The situation is that now
 my_path contains DCache/Orders.txt and if I am reading from here, this is the
 path of the file on HDFS as well. How does it know to pick the file from the
 local file system, not HDFS?

 Thanks again




 
 From: Rahul Jain rja...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Fri, July 9, 2010 12:19:44 AM
 Subject: Re: reading distributed cache returns null pointer

 Yes, distributed cache writes files to the local file system for each mapper
 / reducer. So you should be able to access the file(s) using local file
 system APIs.

 If the files were staying in HDFS there would be no point to using
 distributed cache since all mappers already have access to the global HDFS
 directories :).

 -Rahul

 On Thu, Jul 8, 2010 at 3:03 PM, abc xyz fabc_xyz...@yahoo.com wrote:

 Hi Rahul,
 Thanks. It worked. I was using getFileClassPaths() to get the paths to the
 files in the cache and then used this path to access the file. It should have
 worked, but I don't know why it doesn't produce the required result.

 I added the file HDFS file DCache/Orders.txt to my distributed cache. After
 calling DistributedCache.getCacheFiles(conf); in the configure method of
 the
 mapper node, if I read the file now from the returned path (which happens
 to be
 DCache/Orders.txt) using the Hadoop API , would the file be read from the
 local
 directory of the mapper node? More specifically I am doing this:


FileSystem hdfs = FileSystem.get(conf);
URI[] uris = DistributedCache.getCacheFiles(conf);
Path my_path = new Path(uris[0].getPath());

if (hdfs.exists(my_path)) {
    FSDataInputStream fs = hdfs.open(my_path);
    while ((str = fs.readLine()) != null)
        System.out.println(str);
}

 Thanks


 
 From: Rahul Jain rja...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Thu, July 8, 2010 8:15:58 PM
 Subject: Re: reading distributed cache returns null pointer

 I am not sure why you are using getFileClassPaths() API to access files...
 here is what works for us:

 Add the file(s) to distributed cache using:
 DistributedCache.addCacheFile(p.toUri(), conf);

 Read the files on the mapper using:

 URI[] uris = DistributedCache.getCacheFiles(conf);
 // access one of the files:
 paths[0] = new Path(uris[0].getPath());
 // now follow hadoop or local file APIs to access the file...


 Did you try the above and did it not work ?

 -Rahul

 On Thu, Jul 8, 2010 at 12:04 PM, abc xyz fabc_xyz...@yahoo.com wrote:

  Hello all,
 
  As a new user of hadoop, I am having some problems with understanding
 some
  things. I am writing a program to load a file to the distributed cache
 and
  read
  this file in each mapper. In my driver program, I have added the file to
 my
  distributed cache using:
 
 Path p = new
  Path("hdfs://localhost:9100/user/denimLive/denim/DCache/Orders.txt");
  DistributedCache.addCacheFile(p.toUri(), conf);
 
  In the configure method of the mapper, I am reading the file from cache
  using:
  Path[] cacheFiles = DistributedCache.getFileClassPaths(conf);
  BufferedReader 

Re: java.lang.OutOfMemoryError: Java heap space

2010-07-10 Thread Alex Kozlov
Hi Shuja,

First, thank you for using CDH3.  Can you also check what mapred.child.ulimit
you are using?  Try adding -D mapred.child.ulimit=3145728 to the command line.
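
For completeness, the same two properties can also be set programmatically in
a Java driver instead of via -D on the streaming command line; a minimal
sketch, using the values discussed in this thread (mapred.child.ulimit is in
kilobytes, so 3145728 KB is 3 GB):

// Minimal sketch: setting the child JVM heap and the child ulimit from a driver.
import org.apache.hadoop.mapred.JobConf;

public class ChildLimits {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.set("mapred.child.java.opts", "-Xmx2000m"); // child JVM heap
        conf.set("mapred.child.ulimit", "3145728");      // virtual memory limit, KB
    }
}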

I would also recommend upgrading Java to JDK 1.6 update 8 at a minimum, which
you can download from the Java SE Homepage
(http://java.sun.com/javase/downloads/index.jsp).

Let me know how it goes.

Alex K

On Sat, Jul 10, 2010 at 12:59 PM, Shuja Rehman shujamug...@gmail.com wrote:

 Hi Alex

 Yeah, I am running a job on cluster of 2 machines and using Cloudera
 distribution of hadoop. and here is the output of this command.

 root  5277  5238  3 12:51 pts/2 00:00:00 /usr/jdk1.6.0_03/bin/java
 -Xmx1023m -Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs
 -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop-0.20
 -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console
 -Dhadoop.policy.file=hadoop-policy.xml -classpath
 /usr/lib/hadoop-0.20/conf:/usr/

 jdk1.6.0_03/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2+320.jar:/usr/lib/hadoo

 p-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.3.jar:/usr/lib/hadoop-0.20/lib/common

 s-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1

 .0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.ja

 r:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2+320.jar:/usr/l

 ib/hadoop-0.20/lib/hadoop-scribe-log4j-0.20.2+320.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/h

 adoop-0.20/lib/hsqldb.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.0.1.jar:/usr/lib/hadoop-0.20/lib/jackso

 n-mapper-asl-1.0.1.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-ru

 ntime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.14.jar:/usr/lib

 /hadoop-0.20/lib/jetty-util-6.1.14.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.

 2.2.jar:/usr/lib/hadoop-0.20/lib/libfb303.jar:/usr/lib/hadoop-0.20/lib/libthrift.jar:/usr/lib/hadoop-0.20/lib

 /log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/mysql-connector-jav

 a-5.0.8-bin.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/u

 sr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0

 .20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api
 -2.1.jar org.apache.hadoop.util.RunJar
 /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2+320.jar
 -D mapred.child.java.opts=-Xmx2000M -inputformat StreamInputFormat
 -inputreader StreamXmlRecordReader,begin=<mdc xmlns:HTML=
 http://www.w3.org/TR/REC-xml>,end=</mdc> -input
 /user/root/RNCDATA/MDFDORKUCRAR02/A20100531
 .-0700-0015-0700_RNCCN-MDFDORKUCRAR02 -jobconf mapred.map.tasks=1
 -jobconf mapred.reduce.tasks=0 -output  RNC11 -mapper
 /home/ftpuser1/Nodemapper5.groovy -reducer
 org.apache.hadoop.mapred.lib.IdentityReducer -file /
 home/ftpuser1/Nodemapper5.groovy
 root  5360  5074  0 12:51 pts/1 00:00:00 grep Nodemapper5.groovy



 --
 and what is meant by OOM and thanks for helping,

 Best Regards


 On Sun, Jul 11, 2010 at 12:30 AM, Alex Kozlov ale...@cloudera.com wrote:

  Hi Shuja,
 
  It looks like the OOM is happening in your code.  Are you running
 MapReduce
  in a cluster?  If so, can you send the exact command line your code is
  invoked with -- you can get it with a 'ps -Af | grep Nodemapper5.groovy'
  command on one of the nodes which is running the task?
 
  Thanks,
 
  Alex K
 
  On Sat, Jul 10, 2010 at 10:40 AM, Shuja Rehman shujamug...@gmail.com
  wrote:
 
   Hi All
  
   I am facing a hard problem. I am running a map reduce job using
 streaming
   but it fails and it gives the following error.
  
   Caught: java.lang.OutOfMemoryError: Java heap space
  at Nodemapper5.parseXML(Nodemapper5.groovy:25)
  
   java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
   failed with code 1
  at
  
 
 org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
  at
  
 
 org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
  
  at
  org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
  at
   org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
  at
 org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
  
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
  at org.apache.hadoop.mapred.Child.main(Child.java:170)
  
  
   

Re: Next Release of Hadoop version number and Kerberos

2010-07-10 Thread Owen O'Malley
On Wed, Jul 7, 2010 at 8:54 AM, Todd Lipcon t...@cloudera.com wrote:
 On Wed, Jul 7, 2010 at 8:29 AM, Ananth Sarathy
 ananth.t.sara...@gmail.com wrote:

 The Security/Kerberos support is a huge project that has been in progress
 for several months, so the implementation spans tens (if not hundreds?) of
 patches. Manually adding these patches to a prior Apache release will take
 days if not weeks of work, is my guess.

Based on a quick check from Yahoo's github
(http://github.com/yahoo/hadoop-common):

Between yahoo 0.20.10 and yahoo 0.20.104.2:

421 commits
combined diff of 8.75 MB
12 person-years worth of work
consists almost exclusively of security work

For a single person who doesn't know the code, it will take months to
apply it to one of the Apache branches.

-- Owen