Re: About Hadoop pseudo distribution
Hi, the Hadoop pseudo-distributed mode runs five Java processes, listed below:
1. namenode
2. secondarynamenode
3. datanode
4. jobtracker
5. tasktracker
As you know, the namenode, secondarynamenode and datanode processes belong to HDFS, while the jobtracker and tasktracker belong to MR (MapReduce).

kvorion wrote:
Hi All, I have been trying to set up a Hadoop cluster on a number of machines, a few of which are multicore machines. I have been wondering whether the Hadoop pseudo distribution is something that can help me take advantage of the multiple cores on my machines. All the tutorials say that pseudo-distributed mode lets you start each daemon in a separate Java process. I have the following configuration settings in hadoop-site.xml:

  <property>
    <name>fs.default.name</name>
    <value>hdfs://athena:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>athena:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>

I am not sure if this is really running in pseudo-distributed mode. Are there any indicators or outputs that confirm which mode you are running in?
--
View this message in context: http://old.nabble.com/About-Hadoop-pseudo-distribution-tp26322382p26605201.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
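One quick way to confirm pseudo-distributed mode is to run `jps` on the machine: all five daemons should appear as separate JVMs. A sketch of what the output looks like (process IDs illustrative):

```
$ jps
4723 NameNode
4812 SecondaryNameNode
4901 DataNode
5034 JobTracker
5127 TaskTracker
5310 Jps
```

If the daemons are missing from this list, they are either not running or running inside a single JVM (local/standalone mode).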
ant test-patch does not work
We are creating a patch for fuse-dfs to support symbolic links. We get an error when testing the patch using the following command:

ant -Dpatch.file=../hadoop-hdfs-trunk/HDFS-468-ver1.patch -Dforrest.home=/usr/local/apache-forrest-0.8 -Dfindbugs.home=/usr/local/findbugs-1.3.9 -Djava5.home=/usr/java/latest test-patch
...
[exec] /home/tsz/apache-ant-1.7.1/bin/ant -Dversion=PATCH-a.patch -Djavac.args=-Xlint -Xmaxwarns 1000 -DHadoopPatchProcess= clean tar > /home/tsz/tmp/trunkJavacWarnings.txt 2>&1
[exec] Trunk compilation is broken?
...
The reason for the failure, from the log file ~/tmp/trunkJavacWarnings.txt:
...
[exec] /usr/local/apache-forrest-0.8/main/webapp/resources/schema/relaxng/sitemap-v06.rng:2053:29: error: datatype library "http://www.w3.org/2001/XMLSchema-datatypes" not recognized
[exec] /usr/local/apache-forrest-0.8/main/webapp/resources/schema/relaxng/sitemap-v06.rng:2087:29: error: datatype library "http://www.w3.org/2001/XMLSchema-datatypes" not recognized
[exec] /usr/local/apache-forrest-0.8/main/webapp/resources/schema/relaxng/sitemap-v06.rng:2097:30: error: datatype library "http://www.w3.org/2001/XMLSchema-datatypes" not recognized
[exec] /usr/local/apache-forrest-0.8/main/webapp/resources/schema/relaxng/sitemap-v06.rng:2107:29: error: datatype library "http://www.w3.org/2001/XMLSchema-datatypes" not recognized
...

Can you give any advice?
Re: Web Interface Not Working
Mark Vigeant wrote:
Todd, I followed your suggestion, shut down everything, restarted it, and the UI is still not there. Jps shows NN and JT working though.

The web UI is precompiled JSP served by Jetty; the rest of the system doesn't need it, and if the JSP JARs aren't on the classpath, Jetty won't behave.
* make sure that you have only one version of Jetty on your classpath
* make sure you only have one set of JSP JARs on the CP
* make sure the Jetty JARs are all consistent (not mixing versions)
* check that the various servlets are live (the TT and DNs have them). No servlets: Jetty is down.
I think you can tell Jetty to log in more detail; worth doing if you are trying to track down problems.
0.20 ConcurrentModificationException
Hi, I've recently upgraded Hadoop to 0.20 and am seeing this concurrent modification exception on startup, which I never got in 0.19. Is this a known bug in 0.20? I did see this JIRA report, though I'm not sure it's related: http://issues.apache.org/jira/browse/HADOOP-6269 Is there a workaround, or should I be getting the FS a different way in 0.20?

java.util.ConcurrentModificationException
    at java.util.AbstractList$Itr.checkForComodification(Unknown Source)
    at java.util.AbstractList$Itr.next(Unknown Source)
    at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1028)
    at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:979)
    at org.apache.hadoop.conf.Configuration.get(Configuration.java:435)
    at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
    at com.rialto.hadoop.HadoopFileWriter.init(HadoopFileWriter.java:66)

Cheers,
Arv
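For context, this exception is the generic failure mode of iterating a java.util collection while something modifies it mid-iteration; a minimal sketch (class and method names hypothetical, not Hadoop code) that reproduces the same AbstractList$Itr.checkForComodification failure seen in the trace above:

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {
    // Simulates the failure pattern: iterate a resource list (as
    // Configuration.loadResources does) while an element is appended
    // to the same list during the iteration.
    static boolean triggersCme() {
        List<String> resources = new ArrayList<>();
        resources.add("core-default.xml");
        resources.add("core-site.xml");
        try {
            for (String r : resources) {
                // A resource is added while loading is in progress;
                // the fail-fast iterator detects this on the next step.
                resources.add("hdfs-site.xml");
            }
        } catch (ConcurrentModificationException e) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(triggersCme()); // prints "true"
    }
}
```

The fix in HADOOP-6269 is on the library side; the sketch only shows why the stack trace ends in the fail-fast iterator check.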
Re: 0.20 ConcurrentModificationException
Certainly looks like HADOOP-6269 to me. Can you try Cloudera's distribution? This patch is included there.

-Todd

On Wed, Dec 2, 2009 at 4:23 AM, Arv Mistry a...@kindsight.net wrote:
> Hi, I've recently upgraded hadoop to 0.20 and am seeing this concurrent mod exception on startup which I never got in 0.19. Is this a known bug in 0.20? [...]
RE: 0.20 ConcurrentModificationException
Thanks Todd, I'm not sure what you mean by 'Cloudera's distribution'. Is that a separate build of Hadoop? If so, please send me the link and I will try it.

Cheers,
Arv

-----Original Message-----
From: Todd Lipcon [mailto:t...@cloudera.com]
Sent: December 2, 2009 11:39 AM
To: common-user@hadoop.apache.org
Subject: Re: 0.20 ConcurrentModificationException

> Certainly looks like HADOOP-6269 to me. Can you try Cloudera's distribution? This patch is included. [...]
Re: 0.20 ConcurrentModificationException
On Wed, Dec 2, 2009 at 8:46 AM, Arv Mistry a...@kindsight.net wrote:
> Thanks Todd, I'm not sure what you mean 'Cloudera's distribution?' Is that a separate build for hadoop? If so please send me the link and I will try it [...]

Yes - like Red Hat or Ubuntu provide distributions of Linux, we provide a distro of Hadoop. You can get it from http://archive.cloudera.com/. If you're used to the Apache tarballs, the CDH tarball should be a drop-in replacement. If you'd prefer to stick with Apache, you can manually apply the patch from that JIRA and rebuild Hadoop. If it turns out that the problem sticks around, please report back or file a JIRA.

Thanks,
-Todd
hadoop idle time on terasort
Hi, I am using hadoop-0.20.1 to run terasort and randsort benchmarking tests on a small 8-node Linux cluster. Most runs consist of usually low (<50%) core utilizations in the map and reduce phases, as well as heavy I/O phases. There is usually a large fraction of runtime for which cores are idling and disk I/O traffic is not heavy. On average, for the duration of a terasort run, I get 20-30% CPU utilization, 10-30% iowait times, and the rest (40-70%) is idle time. This is data collected with mpstat for the duration of the run across the cores of a specific node. This utilization behaviour is true and symmetric for all tasktracker/data nodes (the namenode cores and I/O are mostly idle, so there doesn't seem to be a bottleneck in the namenode).

I am looking for an explanation for the significant idle time in the runs. Could it have something to do with misconfigured network/RPC latency Hadoop parameters? For example, I have tried to increase mapred.heartbeats.in.second to 1000 from 100, but that didn't help. The network bandwidth (1 GigE card on each node) is not saturated during the runs, according to my netstat results.

Have other people noticed significant CPU idle times that can't be explained by I/O traffic? Is it reasonable to always expect decreasing idle times as the terasort dataset scales on the same cluster? I've only tried 2 small datasets of 40GB and 64GB each, but core utilizations didn't increase with the runs done so far.

Yahoo's paper on terasort (http://sortbenchmark.org/Yahoo2009.pdf) mentions several performance optimizations, some of which seem relevant to idle times. I am wondering which, if any, of the Yahoo patches are part of the hadoop-0.20.1 distribution. Would it be a good idea to try a development version of Hadoop to resolve this issue?

thanks,
- Vasilis
Re: RE: Using Hadoop in non-typical large scale user-driven environment
As far as replication goes, you should look at a project called Pastry. Apparently some people have used Hadoop MapReduce on top of it. You will need to be clever, however, in how you do your MapReduce, because you probably won't want the job to eat all of the user's CPU time.

On Dec 2, 2009 5:11 PM, Habermaas, William william.haberm...@fatwire.com wrote:
Hadoop isn't going to like losing its datanodes when people shut down their computers. More importantly, when the datanodes are running, your users will be impacted by data replication. Unlike SETI, Hadoop doesn't know when the user's screensaver is running, so it will start doing things when it feels like it. Can someone else comment on whether HOD (hadoop-on-demand) would fit this scenario?

Bill

-----Original Message-----
From: Maciej Trebacz [mailto: maciej.treb...@gmail.com]
Sent: Wednesday,...
Re: hadoop idle time on terasort
Hi Vasilis,

This is seen reasonably often, and could be partly due to missed configuration changes. A few things to check:
- Did you increase the number of tasks per node from the default? If you have a reasonable number of disks/cores, you're going to want to run a lot more than 2 map and 2 reduce tasks on each node.
- Have you tuned any other settings? If you google around, you can find some guides for configuration tuning that should help squeeze some performance out of your cluster.

There are several patches that aren't in 0.20.1 but will be in 0.21 that help performance. These aren't eligible for backport into 0.20, since point releases are for bug fixes only. Some are eligible for backporting into Cloudera's distro (or Yahoo's) and may show up in our next release (CDH3), which should be available first in January for those who like to live on the edge.

Thanks,
-Todd

On Wed, Dec 2, 2009 at 12:22 PM, Vasilis Liaskovitis vlias...@gmail.com wrote:
> Hi, I am using hadoop-0.20.1 to run terasort and randsort benchmarking tests on a small 8-node linux cluster. Most runs consist of usually low (50%) core utilizations in the map and reduce phase, as well as heavy I/O phases. [...]
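The per-node task slots mentioned in this thread are set in mapred-site.xml. A sketch with illustrative values (tune these to your own disk and core counts; there is no universally right number):

```xml
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```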
fair scheduler preemptions timeout difficulties
Greetings, Hadoop Fans:

I'm attempting to use the timeout feature of the Fair Scheduler (using Cloudera's most recently released distribution, 0.20.1+152-1), but without success. I'm using the following configs:

/etc/hadoop/conf/mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:8021</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>9</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
  </property>
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>
  <property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>/etc/hadoop/conf/pools.xml</value>
  </property>
  <property>
    <name>mapred.fairscheduler.assignmultiple</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.fairscheduler.poolnameproperty</name>
    <value>pool.name</value>
  </property>
  <property>
    <name>pool.name</name>
    <value>default</value>
  </property>
</configuration>

and /etc/hadoop/conf/pools.xml:

<?xml version="1.0"?>
<allocations>
  <pool name="realtime">
    <minMaps>4</minMaps>
    <minReduces>1</minReduces>
    <minSharePreemptionTimeout>180</minSharePreemptionTimeout>
    <weight>2.0</weight>
  </pool>
  <pool name="default">
    <minMaps>2</minMaps>
    <minReduces>2</minReduces>
    <maxRunningJobs>1</maxRunningJobs>
  </pool>
</allocations>

but a job in the realtime pool fails to interrupt a job running in the default queue (I waited for 15 minutes). Is there something wrong with my configs? Or is there anything in the logs that would be useful for debugging? (I've only found a "successfully configured fairscheduler" comment in the jobtracker log upon starting up the daemon.) Help would be extremely appreciated!

Thanks,
-James Warren
Re: hadoop idle time on terasort
Hi Todd, thanks for the reply.

> - Did you increase the number of tasks per node from the default? If you
> have a reasonable number of disks/cores, you're going to want to run a lot
> more than 2 map and 2 reduce tasks on each node.

For all tests so far, I have increased mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to the number of cores per tasktracker/node (12 cores per node). I've also set mapred.map.tasks and mapred.reduce.tasks to a prime close to the number of nodes, i.e. 8 (though the recommendation for mapred.map.tasks is a prime several times greater than the number of hosts).

> - Have you tuned any other settings? If you google around you can find
> some guides for configuration tuning that should help squeeze some
> performance out of your cluster.

I am reusing JVMs. I also enabled default codec compression (native zlib, I think) for intermediate map outputs. This decreased iowait times for some datasets, but idle time is still significant even with compression. I wonder if LZO compression would have better results - less overall execution time and perhaps less idle time? I also increased io.sort.mb (set to half the JVM heap size), though I am not sure how that affected performance yet. If other parameters could be significant here, let me know. Would increasing the number of I/O streams (io.sort.factor, I think) help, with a not-so-beefy disk system per node? If you can recommend a specific tutorial/guide/blog for performance tuning, feel free to share (though I suspect there may be many out there).

> There are several patches that aren't in 0.20.1 but will be in 0.21 that
> help performance. These aren't eligible for backport into 0.20 since point
> releases are for bug fixes only. Some are eligible for backporting into
> Cloudera's distro (or Yahoo's) and may show up in our next release (CDH3)
> which should be available first in January for those who like to live on
> the edge.

OK, thanks. I'll try to check out 0.21 or a Cloudera distro at some point. I wonder if there's a centralized svn/git somewhere if I want to build from source. Or do I need to somehow combine all the subprojects hadoop-common, hadoop-mapred and hadoop-hdfs?

thanks again,
- Vasilis

On Wed, Dec 2, 2009 at 12:22 PM, Vasilis Liaskovitis vlias...@gmail.com wrote:
> Hi, I am using hadoop-0.20.1 to run terasort and randsort benchmarking tests on a small 8-node linux cluster. [...]
Re: fair scheduler preemptions timeout difficulties
Todd from Cloudera solved this for me on their company's forum:

"What you're missing is the mapred.fairscheduler.preemption property in mapred-site.xml - without this on, the preemption settings in the allocations file are ignored... to turn it on, set that property's value to 'true'"

Thanks, Todd!

On Wed, Dec 2, 2009 at 4:26 PM, james warren ja...@rockyou.com wrote:
> Greetings, Hadoop Fans: I'm attempting to use the timeout feature of the Fair Scheduler (using Cloudera's most recently released distribution 0.20.1+152-1), but without success. [...]
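For anyone finding this thread later, the missing property described above looks like this in mapred-site.xml:

```xml
<property>
  <name>mapred.fairscheduler.preemption</name>
  <value>true</value>
</property>
```

Without it, the minSharePreemptionTimeout in the allocations file has no effect.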
Re: fair scheduler preemptions timeout difficulties
No problem :)

Also worth noting for anyone listening in that this feature is not in 0.20.1 - it's been backported into CDH. It will arrive in 0.21.

Thanks,
-Todd

On Wed, Dec 2, 2009 at 4:55 PM, james warren ja...@rockyou.com wrote:
> Todd from Cloudera solved this for me on their company's forum. What you're missing is the mapred.fairscheduler.preemption property in mapred-site.xml [...]
Hadoop XML parse error
When I try to retrieve Hadoop properties, I get the following error:

java.lang.NoSuchMethodError: javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(Z)V
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1053)
    at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1029)
    at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:979)
    at org.apache.hadoop.conf.Configuration.get(Configuration.java:435)
    at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)

I came across this post while searching, and it works when I invoke my class from the command line: http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c701549.92181...@web94705.mail.in2.yahoo.com%3e

But when I try to run my class from Tomcat, I get the above error. I invoke Tomcat with the following system property, as mentioned in the above post. I suspect this error happens because Tomcat runs in a separate JVM.

-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl

I also tried adding this system property override to the Hadoop Java tasks using the HADOOP_*_OPTS property, but it still does not work. Any ideas on how to solve this issue?

Thanks,
-Suma
Fair Scheduler config issues
I'm using Cloudera's distribution of 0.20.1, but this seems like a general question, so I'm posting it here.

I'm having some issues getting the Fair Scheduler set up. I followed the basic instructions from http://hadoop.apache.org/common/docs/current/fair_scheduler.html:

* Added to mapred-site.xml:

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/etc/hadoop/conf/fairscheduler.xml</value>
</property>

The fair scheduler jar was already in the installation's root lib/

* Added the basic fairscheduler.xml, based on the example in the docs:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>${pool.name}</value>
  <description>...</description>
</property>
<property>
  <name>pool.name</name>
  <value>${user.name}</value>
  <description>...</description>
</property>

Running a job (say, one of the examples, such as the pi estimator, word count, or sleep) and checking myhost:50030/scheduler, I see the job listed in the Pools table in the hadoop row, since that's the user. That makes sense. In the Running Jobs table, the dropdown in the Pool column sometimes shows hadoop and sometimes default when I reload the page, which is odd.

Then if I change the xml's pool.name entry's value to a hardcoded value, say foo, with a matching foo pool entry in the xml, and run a job (and restart the JobTracker to be safe), I do see a foo row in the Pools table, but it shows 0 Running Jobs, and default shows the one job. Also, the Pool listed in the dropdown in the Running Jobs table remains default, rather than foo (although foo is a choice, and I CAN select it to change the pool). I'd expect that if I set the pool.name in fairscheduler.xml, jobs would run, and appear, under that pool. Am I missing something in my setup or in my understanding of how this should work? Thanks for any insight.

What I'd like to be able to do is set the pool name on the command line when running a job, with an arg of -Dpool.name=bar.

Thanks,
Derek
Re: Fair Scheduler config issues
Hi Derek,

You should set poolnameproperty to pool.name, not ${pool.name}. That should fix your issues.

-Todd

On Wed, Dec 2, 2009 at 7:46 PM, Derek Brown de...@media6degrees.com wrote:
> I'm using Cloudera's distribution of 0.20.1, but this seems like a general question to I'm posting here. I'm having some issues getting the Fair Scheduler setup. [...]
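In config form, the fix above looks like this - the value is the literal property name pool.name, not a ${...} expansion:

```xml
<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>
```

With that in place, a job can be pointed at a pool from the command line, e.g. `hadoop jar hadoop-examples.jar pi -Dpool.name=bar 10 1000` (jar name and job arguments here are illustrative).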
Hadoop with Multiple Inputs and Outputs
I've been trying to figure out how to do a set difference in Hadoop. I would like to take two files and remove the values they have in common between them. Let's say I have two bags, 'students' and 'employees'. I want to find which students are just students, and which employees are just employees. So, an example:

Students: (Jane) (John) (Dave)
Employees: (Dave) (Sue) (Anne)

If I were to join these, I would get the students who are also employees, or: (Dave). However, what I want is the distinct values:

Only_Student: (Jane) (John)
Only_Employee: (Sue) (Anne)

I was able to do this in Pig, but I think I should be able to do it in one MapReduce pass (with Hadoop 0.20.1). I read from two files and attach the file names as the values (Students and Employees in this case; my actual problem is on DNA, bacteria and viruses). Then I output from the reducer if I only get one value for a given key. However, I've had some real trouble figuring out MultipleOutputs and the multiple inputs. I've attached my code.
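The reduce-side logic described above can be sketched outside of Hadoop (plain Java, with a hypothetical helper name - this is not the attached code): a key belongs to exactly one output when all of its tag values name the same source file.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SetDifferenceSketch {
    // Given the tag values collected for one reduce key ("b" or "v" in the
    // bacteria/virus job), return the single source the key belongs to,
    // or null if the key appeared in both inputs and should be dropped.
    static String soleSource(List<String> tags) {
        Set<String> distinct = new HashSet<>(tags);
        return distinct.size() == 1 ? distinct.iterator().next() : null;
    }

    public static void main(String[] args) {
        System.out.println(soleSource(List.of("b")));      // only in bacteria
        System.out.println(soleSource(List.of("b", "v"))); // in both sources
    }
}
```

Using distinct tags (rather than a raw count of values) also handles the case where the same key occurs more than once within a single input file.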
I'm getting this error, which is a total mystery to me:

09/12/02 22:33:52 INFO mapred.JobClient: Task Id : attempt_200911301448_0019_m_00_2, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:807)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:504)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

Thanks,
Jim

package org.myorg;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
//import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.lib.MultipleOutputs;
import org.apache.hadoop.mapred.JobConf;

public class DNAUnique {

    public static class DNAUniqueMapper extends Mapper<Object, Text, Text, Text> implements Configurable {

        private Text word = new Text();
        private Text location = new Text();
        private Configuration conf;
        private int kmerSize = 5;

        public Configuration getConf() {
            return conf;
        }

        public void setConf(Configuration inConf) {
            conf = inConf;
        }

        public void configure(Configuration conf) {
            System.out.println("in configure");
            kmerSize = conf.getInt("kmerSize", 6);
        }

        public void map(Object key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException, InterruptedException {
            configure(getConf());
            FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
            String fileName = fileSplit.getPath().getName();
            if (fileName.contains("bact")) {
                location.set("b");
            } else {
                location.set("v");
            }
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line.toLowerCase());
            word.set(itr.nextToken());
            output.collect(word, location);
        }
    }

    public static class DNAUniqueReducer extends Reducer<Text, Text, Text, Text> implements Configurable {

        private MultipleOutputs mos;
        private Configuration conf;

        public Configuration getConf() {
            return conf;
        }

        public void setConf(Configuration inConf) {
            conf = inConf;
        }

        public void configure(Configuration conf) {
            JobConf jconf = (JobConf) conf;
            mos = new MultipleOutputs(jconf);
        }

        private Text space = new Text(" "); // Just some crap

        public void reduce(Text key, Iterable<Text> values, OutputCollector output, Reporter reporter)
                throws IOException, InterruptedException {
            configure(getConf());
            int count = 0;
            boolean isBact = false;
            boolean isVirus = false;
            for (Text val : values) {
                String location = val.toString();
                if (location.equals("b")) {
                    isBact = true;
                } else if (location.equals("v")) {
                    isVirus = true;
                }
                ++count;
            }
            if (count == 1) {
                if (isBact) {