Re: Which branch for my patch?
Thanks, I'll get busy creating a new patch over the next few days.

Niels Basjes

On Wed, Nov 30, 2011 at 18:51, Eli Collins e...@cloudera.com wrote:

Hey Niels,

Thanks for contributing. The best place to contribute new features is trunk. It's currently an easy merge from trunk to branch 23 to get it into a 23.x release (you can set the jira's target version to 23.1 to indicate this). Your patch based on the old structure would be useful for backporting this feature from trunk to a release with the old structure (e.g. 1.x, 0.22). To request inclusion in a 1.x release, set the target version to 1.1.0 (and generate a patch against branch-1). To request inclusion in 0.22, set the target version to 0.22.0 (and generate a patch against branch-0.22).

Thanks,
Eli

On Wed, Nov 30, 2011 at 8:23 AM, Niels Basjes ni...@basjes.nl wrote:

Hi all,

A while ago I created a feature for Hadoop and submitted it for inclusion (HADOOP-7076). Around the same time MRv2 started happening and the entire source tree was restructured. I'm now prepared to update the patch I created earlier so I can submit it again for your consideration. Prompted by the email about the new branches (branch-1 and branch-1.0), I'm a bit puzzled about where to start. I see the mentioned branches and trunk as probable starting points. As far as I understand the repository structure, branch-1 is the basis for the old-style Hadoop and trunk is the basis for the YARN Hadoop. Against which branch of the source tree should I make my changes so you will reevaluate it for inclusion?

Thanks.

--
Best regards / Met vriendelijke groeten,

Niels Basjes

--
Best regards / Met vriendelijke groeten,

Niels Basjes
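For reference, a minimal sketch of what generating the backport patch Eli describes might look like (the svn URL follows the Apache repository layout of the time and should be treated as an assumption; the patch filename is illustrative):

    # Check out the branch the backport targets (URL layout assumed
    # from the Apache svn conventions at the time).
    svn checkout http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1 branch-1
    cd branch-1
    # ...apply the HADOOP-7076 changes here...
    # Generate the patch from the branch root, for attaching to the JIRA.
    svn diff > HADOOP-7076-branch-1.patch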
[jira] [Created] (HADOOP-7876) Allow access to BlockKey/DelegationKey encoded key for RPC over protobuf
Allow access to BlockKey/DelegationKey encoded key for RPC over protobuf
-------------------------------------------------------------------------
Key: HADOOP-7876
URL: https://issues.apache.org/jira/browse/HADOOP-7876
Project: Hadoop Common
Issue Type: New Feature
Components: ipc
Reporter: Suresh Srinivas
Assignee: Suresh Srinivas
Fix For: 0.24.0

In order to support RPC over protobuf, the BlockKey needs to provide access to its encoded key. The encoded key will be transported over protobuf as byte[], instead of as a SecretKey.
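A minimal sketch of the kind of accessor the issue asks for (the class shape and method names below are assumptions for illustration, not the committed API):

    import javax.crypto.SecretKey;

    /**
     * Illustrative sketch only: a key holder that exposes its raw encoded
     * bytes so they can be carried in a protobuf "bytes" field, rather than
     * shipping the SecretKey object itself. All names are hypothetical.
     */
    public class DelegationKeySketch {
      private final int keyId;
      private final long expiryDate;
      private final byte[] encodedKey;   // raw key material, protobuf-friendly

      public DelegationKeySketch(int keyId, long expiryDate, SecretKey key) {
        this.keyId = keyId;
        this.expiryDate = expiryDate;
        // SecretKey.getEncoded() yields the raw key material as byte[].
        this.encodedKey = key.getEncoded();
      }

      /** Accessor an RPC layer would use to fill a protobuf bytes field. */
      public byte[] getEncodedKey() {
        return encodedKey;
      }
    }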
Build failed in Jenkins: Hadoop-Common-0.23-Build #82
See https://builds.apache.org/job/Hadoop-Common-0.23-Build/82/changes

Changes:

[mahadev] MAPREDUCE-3452. fifoscheduler web ui page always shows 0% used for the queue. (Jonathan Eagles via mahadev) - Merging r1208999 from trunk
[mahadev] MAPREDUCE-3463. Second AM fails to recover properly when first AM is killed with java.lang.IllegalArgumentException causing lost job. (Siddharth Seth via mahadev) - Merging r1208994 from trunk
[jitendra] Merged r1208926 from trunk for HADOOP-7854.
[mahadev] MAPREDUCE-3488. Streaming jobs are failing because the main class isn't set in the pom files. (mahadev) - Merging r1208796 from trunk
[tucu] Merge -r 1208767:1208768 from trunk to branch. FIXES: MAPREDUCE-3477
[tucu] Merge -r 1208750:1208751 from trunk to branch. FIXES: HADOOP-7853

------------------------------------------
[...truncated 8043 lines...]
[INFO] Installing https://builds.apache.org/job/Hadoop-Common-0.23-Build/ws/trunk/hadoop-dist/target/hadoop-dist-0.23.1-SNAPSHOT.jar to /home/jenkins/.m2/repository/org/apache/hadoop/hadoop-dist/0.23.1-SNAPSHOT/hadoop-dist-0.23.1-SNAPSHOT.jar
[INFO] Installing https://builds.apache.org/job/Hadoop-Common-0.23-Build/ws/trunk/hadoop-dist/pom.xml to /home/jenkins/.m2/repository/org/apache/hadoop/hadoop-dist/0.23.1-SNAPSHOT/hadoop-dist-0.23.1-SNAPSHOT.pom
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Hadoop Main ................................ SUCCESS [1.228s]
[INFO] Apache Hadoop Project POM ......................... SUCCESS [0.383s]
[INFO] Apache Hadoop Annotations ......................... SUCCESS [1.293s]
[INFO] Apache Hadoop Project Dist POM .................... SUCCESS [0.361s]
[INFO] Apache Hadoop Assemblies .......................... SUCCESS [0.163s]
[INFO] Apache Hadoop Auth ................................ SUCCESS [2.040s]
[INFO] Apache Hadoop Auth Examples ....................... SUCCESS [1.007s]
[INFO] Apache Hadoop Common .............................. SUCCESS [24.999s]
[INFO] Apache Hadoop Common Project ...................... SUCCESS [0.028s]
[INFO] Apache Hadoop HDFS ................................ SUCCESS [20.155s]
[INFO] Apache Hadoop HDFS Project ........................ SUCCESS [0.030s]
[INFO] hadoop-yarn ....................................... SUCCESS [0.118s]
[INFO] hadoop-yarn-api ................................... SUCCESS [6.767s]
[INFO] hadoop-yarn-common ................................ SUCCESS [9.124s]
[INFO] hadoop-yarn-server ................................ SUCCESS [0.065s]
[INFO] hadoop-yarn-server-common ......................... SUCCESS [2.889s]
[INFO] hadoop-yarn-server-nodemanager .................... SUCCESS [5.512s]
[INFO] hadoop-yarn-server-web-proxy ...................... SUCCESS [2.508s]
[INFO] hadoop-yarn-server-resourcemanager ................ SUCCESS [6.705s]
[INFO] hadoop-yarn-server-tests .......................... SUCCESS [0.929s]
[INFO] hadoop-mapreduce-client ........................... SUCCESS [0.052s]
[INFO] hadoop-mapreduce-client-core ...................... SUCCESS [10.496s]
[INFO] hadoop-yarn-applications .......................... SUCCESS [0.059s]
[INFO] hadoop-yarn-applications-distributedshell ......... SUCCESS [1.883s]
[INFO] hadoop-yarn-site .................................. SUCCESS [0.102s]
[INFO] hadoop-mapreduce-client-common .................... SUCCESS [6.297s]
[INFO] hadoop-mapreduce-client-shuffle ................... SUCCESS [1.564s]
[INFO] hadoop-mapreduce-client-app ....................... SUCCESS [5.824s]
[INFO] hadoop-mapreduce-client-hs ........................ SUCCESS [2.345s]
[INFO] hadoop-mapreduce-client-jobclient ................. SUCCESS [2.595s]
[INFO] Apache Hadoop MapReduce Examples .................. SUCCESS [2.838s]
[INFO] hadoop-mapreduce .................................. SUCCESS [0.081s]
[INFO] Apache Hadoop MapReduce Streaming ................. SUCCESS [3.162s]
[INFO] Apache Hadoop Tools ............................... SUCCESS [0.050s]
[INFO] Apache Hadoop Distribution ........................ SUCCESS [0.092s]
[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 2:04.838s
[INFO] Finished at: Thu Dec 01 09:05:11 UTC 2011
[INFO] Final Memory: 151M/914M
[INFO]
+ cd hadoop-common-project
+ /home/jenkins/tools/maven/latest/bin/mvn clean verify checkstyle:checkstyle findbugs:findbugs -DskipTests -Pdist -Dtar -Psrc -Pnative -Pdocs
[INFO] Scanning for projects...
[INFO]
[INFO] Reactor Build Order:
[INFO]
[INFO] Apache Hadoop Annotations
[INFO] Apache Hadoop Auth
[INFO] Apache Hadoop Auth Examples
[INFO] Apache Hadoop Common
[INFO] Apache Hadoop Common Project
[INFO]
RE: Hadoop - non disk based sorting?
Hi Mingxi,

> So, why when map outputs are huge, will the reducer not be able to copy them?

The Reducer will copy the Map output into its in-memory buffer. When the Reducer JVM doesn't have enough memory to accommodate the Map output, it leads to an OutOfMemoryError.

> Can you please kindly explain the function of mapred.child.java.opts? How does it relate to the copy?

The Maps and Reducers are launched in separate child JVMs on the TaskTrackers. When the TaskTracker launches the Map or Reduce JVMs, it uses mapred.child.java.opts as the JVM arguments for the new child JVMs.

Regards,
Ravi Teja

From: Mingxi Wu [mingxi...@turn.com]
Sent: 01 December 2011 12:37:54
To: common-dev@hadoop.apache.org
Subject: RE: Hadoop - non disk based sorting?

Thanks Ravi.

So, why when map outputs are huge, will the reducer not be able to copy them?

Can you please kindly explain the function of mapred.child.java.opts? How does it relate to the copy?

Thank you,

Mingxi

-----Original Message-----
From: Ravi teja ch n v [mailto:raviteja.c...@huawei.com]
Sent: Tuesday, November 29, 2011 9:46 PM
To: common-dev@hadoop.apache.org
Subject: RE: Hadoop - non disk based sorting?

Hi Mingxi,

From your stack trace, I understand that the OutOfMemoryError actually occurred while copying the map outputs, not while sorting them. Since your map outputs are huge and your reducer doesn't have enough heap memory, you got the problem. When you raised the number of reducers to 200, your map outputs got partitioned among 200 reducers, so you didn't hit this problem.

By setting the max memory of your reducer with mapred.child.java.opts, you can get over this problem.

Regards,
Ravi Teja

From: Mingxi Wu [mingxi...@turn.com]
Sent: 30 November 2011 05:14:49
To: common-dev@hadoop.apache.org
Subject: Hadoop - non disk based sorting?

Hi,

I have a question regarding the shuffle phase of the reducer. It appears that when there is a large map output (in my case, 5 billion records), I get an out-of-memory error like the one below.

Error: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1592)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1452)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)

However, I thought the shuffle phase uses a disk-based sort, which is not constrained by memory. So why would a user run into this out-of-memory error? After I increased my number of reducers from 100 to 200, the problem went away.

Any input regarding this memory issue would be appreciated!

Thanks,

Mingxi
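For concreteness, this is roughly what raising the child-JVM heap via mapred.child.java.opts looks like in mapred-site.xml (the -Xmx value below is an arbitrary illustration, not a recommendation for any particular workload):

    <!-- mapred-site.xml: JVM options passed to every map/reduce child JVM.
         The heap size shown is an illustrative value only. -->
    <configuration>
      <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx1024m</value>
      </property>
    </configuration>

The same option can also be supplied per job on the command line (e.g. -Dmapred.child.java.opts=-Xmx1024m via the generic options), so a single memory-hungry job need not change the cluster-wide default.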
Re: how to check which scheduler is currently running on hadoop
Is there any other way we can check that another scheduler is actually running and not the default one, rather than checking the mapred-site.xml file?

On Thu, Dec 1, 2011 at 2:10 AM, Praveen Sripati praveensrip...@gmail.com wrote:

Hi,

Check the mapreduce.jobtracker.taskscheduler property in mapred-site.xml; if it's not set, then check what it defaults to.

Praveen

On Thu, Dec 1, 2011 at 5:07 AM, shivam tiwari shivam.tiwari2...@gmail.com wrote:

Hi,

Please tell me how I can check which scheduler is currently running on Hadoop.

--
Regards
Shivam Tiwari
Graduate student
CISE Department
University of Florida, Gainesville FL 32611
Email - shi...@cise.ufl.edu, shivam.tiwari2...@gmail.com

--
Regards
Shivam Tiwari
Graduate student
CISE Department
University of Florida, Gainesville FL 32611
Email - shi...@cise.ufl.edu, shivam.tiwari2...@gmail.com
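For reference, the check Praveen describes looks like this in mapred-site.xml (CapacityTaskScheduler is shown as an arbitrary example value; if the property is absent, the FIFO JobQueueTaskScheduler is used by default):

    <!-- mapred-site.xml: which TaskScheduler the JobTracker loads.
         The value below is just an example; when the property is
         missing, the default FIFO JobQueueTaskScheduler applies. -->
    <configuration>
      <property>
        <name>mapreduce.jobtracker.taskscheduler</name>
        <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
      </property>
    </configuration>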
Re: how to check which scheduler is currently running on hadoop
Shivam,

Visit JobTrackerHost/conf to see the taskScheduler config that's in effect at runtime. Visit JobTrackerHost/scheduler to see a scheduler web UI, if it's put one up (the default does not provide any; others may).

May I ask why you're looking to confirm? Is something not working the way you expect it to?

On 01-Dec-2011, at 5:33 PM, shivam tiwari wrote:

Is there any other way we can check that another scheduler is actually running and not the default one, rather than checking the mapred-site.xml file?

On Thu, Dec 1, 2011 at 2:10 AM, Praveen Sripati praveensrip...@gmail.com wrote:

Hi,

Check the mapreduce.jobtracker.taskscheduler property in mapred-site.xml; if it's not set, then check what it defaults to.

Praveen

On Thu, Dec 1, 2011 at 5:07 AM, shivam tiwari shivam.tiwari2...@gmail.com wrote:

Hi,

Please tell me how I can check which scheduler is currently running on Hadoop.

--
Regards
Shivam Tiwari
Graduate student
CISE Department
University of Florida, Gainesville FL 32611
Email - shi...@cise.ufl.edu, shivam.tiwari2...@gmail.com

--
Regards
Shivam Tiwari
Graduate student
CISE Department
University of Florida, Gainesville FL 32611
Email - shi...@cise.ufl.edu, shivam.tiwari2...@gmail.com
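The /conf check is easy to script as well; a quick sketch (hostname is a placeholder, and 50030 is the usual JobTracker web port):

    # Query the JobTracker's live configuration servlet and pull out
    # the scheduler setting actually in effect (host is a placeholder).
    curl -s http://jobtracker-host:50030/conf | grep -i -A 1 taskscheduler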
Re: how to check which scheduler is currently running on hadoop
On 01/12/11 12:03, shivam tiwari wrote:

> Is there any other way we can check that another scheduler is actually running and not the default one, rather than checking the mapred-site.xml file?

If you are really worried, you could kill -QUIT the JT process and look in the stack traces.
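A sketch of that approach (the log path is illustrative and varies by install; SIGQUIT makes the JVM print a full thread dump to its stdout log without exiting):

    # Find the JobTracker JVM and send SIGQUIT; the JVM dumps all thread
    # stacks (including the scheduler's threads) to its stdout log.
    kill -QUIT $(jps | awk '/JobTracker/ {print $1}')
    # Then look for the scheduler class name in the dump; the log path
    # below is illustrative only.
    grep -i scheduler /var/log/hadoop/*jobtracker*.out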
Re: Hadoop - non disk based sorting?
I've seen this issue in jobs with many, many map tasks and small reducer heaps. There is some heap space needed for the actual map completion events, etc., and that isn't accounted for when determining when to spill the fetched outputs to disk.

It would be a nice patch to add code that calculates the in-memory size of these objects during the fetch phase and subtracts it from the heap size before multiplying out the spill percentages, etc.

-Todd

On Thu, Dec 1, 2011 at 8:14 AM, Robert Evans ev...@yahoo-inc.com wrote:

Mingxi,

My understanding was that, just like with the maps, when a reducer's in-memory buffer fills up it too will spill to disk as part of the sort. In fact, I think it uses the exact same code for the sort as the map does. There may be an issue where your sort buffer is somehow too large for the amount of heap that you requested as part of mapred.child.java.opts. I have personally run a reduce that took in 300GB of data, which it successfully sorted, to test this very thing. And no, the box did not have 300GB of RAM.

--Bobby Evans

On 12/1/11 4:12 AM, Ravi teja ch n v raviteja.c...@huawei.com wrote:

Hi Mingxi,

> So, why when map outputs are huge, will the reducer not be able to copy them?

The Reducer will copy the Map output into its in-memory buffer. When the Reducer JVM doesn't have enough memory to accommodate the Map output, it leads to an OutOfMemoryError.

> Can you please kindly explain the function of mapred.child.java.opts? How does it relate to the copy?

The Maps and Reducers are launched in separate child JVMs on the TaskTrackers. When the TaskTracker launches the Map or Reduce JVMs, it uses mapred.child.java.opts as the JVM arguments for the new child JVMs.

Regards,
Ravi Teja

From: Mingxi Wu [mingxi...@turn.com]
Sent: 01 December 2011 12:37:54
To: common-dev@hadoop.apache.org
Subject: RE: Hadoop - non disk based sorting?

Thanks Ravi.

So, why when map outputs are huge, will the reducer not be able to copy them?

Can you please kindly explain the function of mapred.child.java.opts? How does it relate to the copy?

Thank you,

Mingxi

-----Original Message-----
From: Ravi teja ch n v [mailto:raviteja.c...@huawei.com]
Sent: Tuesday, November 29, 2011 9:46 PM
To: common-dev@hadoop.apache.org
Subject: RE: Hadoop - non disk based sorting?

Hi Mingxi,

From your stack trace, I understand that the OutOfMemoryError actually occurred while copying the map outputs, not while sorting them. Since your map outputs are huge and your reducer doesn't have enough heap memory, you got the problem. When you raised the number of reducers to 200, your map outputs got partitioned among 200 reducers, so you didn't hit this problem.

By setting the max memory of your reducer with mapred.child.java.opts, you can get over this problem.

Regards,
Ravi Teja

From: Mingxi Wu [mingxi...@turn.com]
Sent: 30 November 2011 05:14:49
To: common-dev@hadoop.apache.org
Subject: Hadoop - non disk based sorting?

Hi,

I have a question regarding the shuffle phase of the reducer. It appears that when there is a large map output (in my case, 5 billion records), I get an out-of-memory error like the one below.

Error: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1592)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1452)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)

However, I thought the shuffle phase uses a disk-based sort, which is not constrained by memory. So why would a user run into this out-of-memory error? After I increased my number of reducers from 100 to 200, the problem went away.

Any input regarding this memory issue would be appreciated!

Thanks,

Mingxi

--
Todd Lipcon
Software Engineer, Cloudera
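A rough sketch of the accounting Todd suggests — subtract bookkeeping overhead from the heap before sizing the in-memory shuffle budget. All names, and the per-event size estimate, are assumptions for illustration; MR1's actual shuffle code differs:

    /**
     * Illustrative only: budget the in-memory shuffle buffer from the heap
     * that remains after accounting for map completion events and similar
     * metadata. Field and method names are hypothetical.
     */
    public final class ShuffleBudgetSketch {
      /** Rough per-event footprint in bytes; a made-up illustrative constant. */
      private static final long BYTES_PER_COMPLETION_EVENT = 200;

      public static long inMemoryShuffleLimit(long maxHeapBytes,
                                              int numMapTasks,
                                              float shuffleInputBufferPercent) {
        // Estimate memory pinned by completion events and related metadata.
        long bookkeeping = (long) numMapTasks * BYTES_PER_COMPLETION_EVENT;
        // Budget the shuffle buffer from what is actually left over.
        long available = Math.max(0, maxHeapBytes - bookkeeping);
        return (long) (available * shuffleInputBufferPercent);
      }
    }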
[jira] [Resolved] (HADOOP-7877) Federation: update Balancer documentation
[ https://issues.apache.org/jira/browse/HADOOP-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE resolved HADOOP-7877.
--------------------------------------------
Resolution: Fixed
Fix Version/s: 0.23.1
               0.24.0
Hadoop Flags: Reviewed

I have committed this.

> Federation: update Balancer documentation
> ------------------------------------------
> Key: HADOOP-7877
> URL: https://issues.apache.org/jira/browse/HADOOP-7877
> Project: Hadoop Common
> Issue Type: Task
> Components: documentation
> Affects Versions: 0.23.0
> Reporter: Tsz Wo (Nicholas), SZE
> Assignee: Tsz Wo (Nicholas), SZE
> Fix For: 0.24.0, 0.23.1
> Attachments: h1685_20111201.patch, h1685_20111201b.patch, screenshot for the updated cli doc.jpg
>
> Update Balancer documentation for the new balancing policy and CLI.
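For context, the federation-era balancer invocation this documentation covers looks roughly like the sketch below (the option names are as I recall them from the HDFS-1685 work this issue tracks; verify against the committed docs before relying on them):

    # Balance across block pools (federation); the default policy
    # balances at the datanode level. Threshold is percent utilization.
    hdfs balancer -policy blockpool -threshold 10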
Re: how to check which scheduler is currently running on hadoop
Depending on the version of Hadoop you are using, you can go to http://jthost:50030/scheduler to check. This works from hadoop-0.20.203 onwards.

Arun

Sent from my iPhone

On Nov 30, 2011, at 5:38 PM, shivam tiwari shivam.tiwari2...@gmail.com wrote:

Hi,

Please tell me how I can check which scheduler is currently running on Hadoop.

--
Regards
Shivam Tiwari
Graduate student
CISE Department
University of Florida, Gainesville FL 32611
Email - shi...@cise.ufl.edu, shivam.tiwari2...@gmail.com
Re: Snow Leopard Compilation Help
Ron,

Hadoop native currently does not compile on Mac OS X. There have been some JIRAs filed to fix that, but nobody has taken them on.

Thanks.

Alejandro

On Thu, Dec 1, 2011 at 3:55 PM, Ronald Petty ronald.pe...@gmail.com wrote:

Hello,

I am new to Hadoop development and seem to be stuck on building with Snow Leopard. Here is what is going on:

1. svn checkout http://svn.apache.org/repos/asf/hadoop/common/trunk/ hadoop-trunk
2. wget http://protobuf.googlecode.com/files/protobuf-2.4.1.tar.gz
3. tar ... proto...gz; cd proto...
4. ./configure --prefix=/hadoop/contribute/protobuf/; make; make install
5. export PATH=/hadoop/contribute/protobuf/bin/:$PATH
6. cd hadoop-trunk
7. mvn clean
8. mvn install -Dmaven.test.skip.exec=true
9. mvn assembly:assembly -Pnative
10. Error:

[INFO] --- make-maven-plugin:1.0-beta-1:make-install (compile) @ hadoop-common ---
[INFO] /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I/Library/Java/Home/include -I/hadoop/contribute/hadoop-trunk/hadoop-common-project/hadoop-common/target/native/src -I/hadoop/contribute/hadoop-trunk/hadoop-common-project/hadoop-common/target/native/javah -I/usr/local/include -g -Wall -fPIC -O2 -m64 -g -O2 -MT ZlibCompressor.lo -MD -MP -MF .deps/ZlibCompressor.Tpo -c -o ZlibCompressor.lo `test -f 'src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c' || echo './'`src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c
[INFO] libtool: compile: gcc -DHAVE_CONFIG_H -I. -I/Library/Java/Home/include -I/hadoop/contribute/hadoop-trunk/hadoop-common-project/hadoop-common/target/native/src -I/hadoop/contribute/hadoop-trunk/hadoop-common-project/hadoop-common/target/native/javah -I/usr/local/include -g -Wall -fPIC -O2 -m64 -g -O2 -MT ZlibCompressor.lo -MD -MP -MF .deps/ZlibCompressor.Tpo -c src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c -fno-common -DPIC -o .libs/ZlibCompressor.o
[INFO] src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c: In function 'Java_org_apache_hadoop_io_compress_zlib_ZlibCompressor_initIDs':
[INFO] src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c:71: error: 'libnotfound' undeclared (first use in this function)
[INFO] src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c:71: error: (Each undeclared identifier is reported only once
[INFO] src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c:71: error: for each function it appears in.)
[INFO] make: *** [ZlibCompressor.lo] Error 1
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Hadoop Main ................................ FAILURE [46.914s]
[INFO] Apache Hadoop Project POM ......................... SKIPPED
[INFO] Apache Hadoop Annotations ......................... SKIPPED

I looked around and found this: http://wiki.apache.org/hadoop/UsingLzoCompression. I tried to work with lzo via MacPorts; it seems to be there, but I am not certain where to go from here. Also, how do you search the mail archives (http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/)?

Thanks for the help. Kindest regards.

Ron
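For background on the failing line: the native glue loads zlib at runtime via dlopen, with the library name baked in at configure time; when configure cannot determine that name (as on Mac OS X here), the generated fallback doesn't compile. A minimal sketch of the pattern, with hypothetical names, assuming the usual Mac library name:

    #include <dlfcn.h>
    #include <stdio.h>

    /* Illustrative sketch of the dynamic-loading pattern the native code
     * relies on. The macro stands in for the name configure detects; on a
     * platform configure doesn't handle, it ends up undefined, which is
     * the class of error seen in the log above. */
    #ifndef ZLIB_LIBRARY_SKETCH
    #define ZLIB_LIBRARY_SKETCH "libz.dylib"  /* assumed Mac OS X name */
    #endif

    int main(void) {
      void *libz = dlopen(ZLIB_LIBRARY_SKETCH, RTLD_LAZY | RTLD_GLOBAL);
      if (libz == NULL) {
        /* dlerror() reports why the load failed. */
        fprintf(stderr, "cannot load zlib: %s\n", dlerror());
        return 1;
      }
      dlclose(libz);
      return 0;
    }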
Re: Snow Leopard Compilation Help
Ronald,

Please take a look at https://issues.apache.org/jira/browse/HADOOP-7147 and https://issues.apache.org/jira/browse/HADOOP-7824.

- milind

On 12/1/11 5:31 PM, Ronald Petty ronald.pe...@gmail.com wrote:

Alejandro,

I suppose I will give it a go since that is the computer I have. I tried searching JIRA for Mac-related issues, but it's hard for me to tell which ones might be related. Should I just figure it out and email the list with my fix (if I find one)?

Ron

On Thu, Dec 1, 2011 at 7:23 PM, Alejandro Abdelnur t...@cloudera.com wrote:

Ron,

Hadoop native currently does not compile on Mac OS X. There have been some JIRAs filed to fix that, but nobody has taken them on.

Thanks.

Alejandro

On Thu, Dec 1, 2011 at 3:55 PM, Ronald Petty ronald.pe...@gmail.com wrote:

Hello,

I am new to Hadoop development and seem to be stuck on building with Snow Leopard. Here is what is going on:

1. svn checkout http://svn.apache.org/repos/asf/hadoop/common/trunk/ hadoop-trunk
2. wget http://protobuf.googlecode.com/files/protobuf-2.4.1.tar.gz
3. tar ... proto...gz; cd proto...
4. ./configure --prefix=/hadoop/contribute/protobuf/; make; make install
5. export PATH=/hadoop/contribute/protobuf/bin/:$PATH
6. cd hadoop-trunk
7. mvn clean
8. mvn install -Dmaven.test.skip.exec=true
9. mvn assembly:assembly -Pnative
10. Error:

[INFO] --- make-maven-plugin:1.0-beta-1:make-install (compile) @ hadoop-common ---
[INFO] /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I/Library/Java/Home/include -I/hadoop/contribute/hadoop-trunk/hadoop-common-project/hadoop-common/target/native/src -I/hadoop/contribute/hadoop-trunk/hadoop-common-project/hadoop-common/target/native/javah -I/usr/local/include -g -Wall -fPIC -O2 -m64 -g -O2 -MT ZlibCompressor.lo -MD -MP -MF .deps/ZlibCompressor.Tpo -c -o ZlibCompressor.lo `test -f 'src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c' || echo './'`src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c
[INFO] libtool: compile: gcc -DHAVE_CONFIG_H -I. -I/Library/Java/Home/include -I/hadoop/contribute/hadoop-trunk/hadoop-common-project/hadoop-common/target/native/src -I/hadoop/contribute/hadoop-trunk/hadoop-common-project/hadoop-common/target/native/javah -I/usr/local/include -g -Wall -fPIC -O2 -m64 -g -O2 -MT ZlibCompressor.lo -MD -MP -MF .deps/ZlibCompressor.Tpo -c src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c -fno-common -DPIC -o .libs/ZlibCompressor.o
[INFO] src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c: In function 'Java_org_apache_hadoop_io_compress_zlib_ZlibCompressor_initIDs':
[INFO] src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c:71: error: 'libnotfound' undeclared (first use in this function)
[INFO] src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c:71: error: (Each undeclared identifier is reported only once
[INFO] src/org/apache/hadoop/io/compress/zlib/ZlibCompressor.c:71: error: for each function it appears in.)
[INFO] make: *** [ZlibCompressor.lo] Error 1
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Hadoop Main ................................ FAILURE [46.914s]
[INFO] Apache Hadoop Project POM ......................... SKIPPED
[INFO] Apache Hadoop Annotations ......................... SKIPPED

I looked around and found this: http://wiki.apache.org/hadoop/UsingLzoCompression. I tried to work with lzo via MacPorts; it seems to be there, but I am not certain where to go from here. Also, how do you search the mail archives (http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/)?

Thanks for the help. Kindest regards.

Ron
RE: Hadoop - non disk based sorting?
Hi Bobby,

You are right that the map outputs, when copied, will be spilled to the disk in case the reducer cannot accommodate the copy in memory (shuffleInMemory and shuffleToDisk are chosen by the RamManager based on the in-memory size).

But according to the stack trace provided by Mingxi,

org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1592)

the problem occurred after the in-memory copy was chosen.

Regards,
Ravi Teja

From: Robert Evans [ev...@yahoo-inc.com]
Sent: 01 December 2011 21:44:50
To: common-dev@hadoop.apache.org
Subject: Re: Hadoop - non disk based sorting?

Mingxi,

My understanding was that, just like with the maps, when a reducer's in-memory buffer fills up it too will spill to disk as part of the sort. In fact, I think it uses the exact same code for the sort as the map does. There may be an issue where your sort buffer is somehow too large for the amount of heap that you requested as part of mapred.child.java.opts. I have personally run a reduce that took in 300GB of data, which it successfully sorted, to test this very thing. And no, the box did not have 300GB of RAM.

--Bobby Evans

On 12/1/11 4:12 AM, Ravi teja ch n v raviteja.c...@huawei.com wrote:

Hi Mingxi,

> So, why when map outputs are huge, will the reducer not be able to copy them?

The Reducer will copy the Map output into its in-memory buffer. When the Reducer JVM doesn't have enough memory to accommodate the Map output, it leads to an OutOfMemoryError.

> Can you please kindly explain the function of mapred.child.java.opts? How does it relate to the copy?

The Maps and Reducers are launched in separate child JVMs on the TaskTrackers. When the TaskTracker launches the Map or Reduce JVMs, it uses mapred.child.java.opts as the JVM arguments for the new child JVMs.

Regards,
Ravi Teja

From: Mingxi Wu [mingxi...@turn.com]
Sent: 01 December 2011 12:37:54
To: common-dev@hadoop.apache.org
Subject: RE: Hadoop - non disk based sorting?

Thanks Ravi.

So, why when map outputs are huge, will the reducer not be able to copy them?

Can you please kindly explain the function of mapred.child.java.opts? How does it relate to the copy?

Thank you,

Mingxi

-----Original Message-----
From: Ravi teja ch n v [mailto:raviteja.c...@huawei.com]
Sent: Tuesday, November 29, 2011 9:46 PM
To: common-dev@hadoop.apache.org
Subject: RE: Hadoop - non disk based sorting?

Hi Mingxi,

From your stack trace, I understand that the OutOfMemoryError actually occurred while copying the map outputs, not while sorting them. Since your map outputs are huge and your reducer doesn't have enough heap memory, you got the problem. When you raised the number of reducers to 200, your map outputs got partitioned among 200 reducers, so you didn't hit this problem.

By setting the max memory of your reducer with mapred.child.java.opts, you can get over this problem.

Regards,
Ravi Teja

From: Mingxi Wu [mingxi...@turn.com]
Sent: 30 November 2011 05:14:49
To: common-dev@hadoop.apache.org
Subject: Hadoop - non disk based sorting?

Hi,

I have a question regarding the shuffle phase of the reducer. It appears that when there is a large map output (in my case, 5 billion records), I get an out-of-memory error like the one below.

Error: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1592)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1452)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)

However, I thought the shuffle phase uses a disk-based sort, which is not constrained by memory. So why would a user run into this out-of-memory error? After I increased my number of reducers from 100 to 200, the problem went away.

Any input regarding this memory issue would be appreciated!

Thanks,

Mingxi
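A rough sketch of the in-memory-versus-disk decision Ravi describes (names loosely follow MR1's ReduceTask shuffle code, but treat the details as assumptions for illustration):

    /**
     * Illustrative sketch of the shuffle destination choice: small map
     * outputs are fetched into RAM, large ones go straight to disk.
     * Names and structure approximate MR1's behavior only.
     */
    public final class ShuffleDestinationSketch {
      private final long maxSingleShuffleLimit; // per-output in-memory cap

      public ShuffleDestinationSketch(long maxHeapBytes,
                                      float shuffleInputBufferPercent,
                                      float singleShuffleLimitPercent) {
        // Total in-memory shuffle budget, then a per-output cap within it.
        long memoryLimit = (long) (maxHeapBytes * shuffleInputBufferPercent);
        this.maxSingleShuffleLimit =
            (long) (memoryLimit * singleShuffleLimitPercent);
      }

      /** True if this map output should be shuffled into memory. */
      public boolean fitsInMemory(long mapOutputSize) {
        return mapOutputSize < maxSingleShuffleLimit;
      }
    }

Note the consequence for Mingxi's case: each individual output passed this check and went to shuffleInMemory, yet many concurrent in-memory copies together exhausted the heap, which matches the stack trace above.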