Re: About Hadoop pseudo distribution

2009-12-02 Thread Doss_IPH

Hi,
Hadoop pseudo-distributed mode runs five Java processes, which are given below:
  1. namenode
  2. secondarynamenode
  3. datanode
  4. jobtracker
  5. tasktracker

As you know, the namenode, secondarynamenode and datanode processes are for HDFS,
and the jobtracker and tasktracker are for MR (MapReduce).
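
A quick way to confirm that all five daemons are up as separate JVMs is the JDK's jps tool (a sketch; the output below is illustrative and the PIDs will differ):

    $ jps    # output is illustrative; PIDs will differ
    4825 NameNode
    4911 SecondaryNameNode
    4998 DataNode
    5120 JobTracker
    5231 TaskTracker
    5347 Jps

If any of the five class names is missing, that daemon failed to start; its log under the logs/ directory is the place to look.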


kvorion wrote:
 
 Hi All,
 
 I have been trying to set up a hadoop cluster on a number of machines, a
 few of which are multicore machines. I have been wondering whether the
 hadoop pseudo-distributed mode is something that can help me take advantage
 of the multiple cores on my machines. All the tutorials say that
 pseudo-distributed mode lets you start each daemon in a separate Java process. I
 have the following configuration settings for hadoop-site.xml:
 
 <property>
   <name>fs.default.name</name>
   <value>hdfs://athena:9000</value>
 </property>
 
 <property>
   <name>mapred.job.tracker</name>
   <value>athena:9001</value>
 </property>
 
 <property>
   <name>dfs.replication</name>
   <value>2</value>
 </property>
 
 I am not sure if this is really running in pseudo-distributed mode.
 Are there any indicators or outputs that confirm what mode you are running
 in?
 
 
 

-- 
View this message in context: 
http://old.nabble.com/About-Hadoop-pseudo-distribution-tp26322382p26605201.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



ant test-patch does not work

2009-12-02 Thread zhuweimin
We are creating a patch for fuse-dfs to support symbolic links.

We get an error when testing the patch using the following command:
ant -Dpatch.file=../hadoop-hdfs-trunk/HDFS-468-ver1.patch
-Dforrest.home=/usr/local/apache-forrest-0.8
-Dfindbugs.home=/usr/local/findbugs-1.3.9 -Djava5.home=/usr/java/latest
test-patch
 
...
 [exec] /home/tsz/apache-ant-1.7.1/bin/ant -Dversion=PATCH-a.patch -Djavac.args="-Xlint -Xmaxwarns 1000" -DHadoopPatchProcess= clean tar > /home/tsz/tmp/trunkJavacWarnings.txt 2>&1
 [exec] Trunk compilation is broken?
...

The reason for the failure, from the log file ~/tmp/trunkJavacWarnings.txt:
...
 [exec] /usr/local/apache-forrest-0.8/main/webapp/resources/schema/relaxng/sitemap-v06.rng:2053:29: error: datatype library "http://www.w3.org/2001/XMLSchema-datatypes" not recognized
 [exec] /usr/local/apache-forrest-0.8/main/webapp/resources/schema/relaxng/sitemap-v06.rng:2087:29: error: datatype library "http://www.w3.org/2001/XMLSchema-datatypes" not recognized
 [exec] /usr/local/apache-forrest-0.8/main/webapp/resources/schema/relaxng/sitemap-v06.rng:2097:30: error: datatype library "http://www.w3.org/2001/XMLSchema-datatypes" not recognized
 [exec] /usr/local/apache-forrest-0.8/main/webapp/resources/schema/relaxng/sitemap-v06.rng:2107:29: error: datatype library "http://www.w3.org/2001/XMLSchema-datatypes" not recognized
...

Can you give any advice?






Re: Web Interface Not Working

2009-12-02 Thread Steve Loughran

Mark Vigeant wrote:

Todd,

I followed your suggestion, shut down everything, restarted it, and the UI is 
still not there. Jps shows NN and JT working though.



The web UI is precompiled JSP running on Jetty; the rest of the system doesn't need 
it, and if the JSP JARs aren't on the classpath, Jetty won't behave.

 * make sure that you have only one version of Jetty on your classpath
 * make sure you only have one set of JSP JARs on the CP
 * make sure the jetty jars are all consistent (not mixing versions)
 * check that the various servlets are live (the TT and DNs have them). 
No servlets means Jetty is down.


I think you can tell Jetty to log in more detail; that's worth doing if you are 
trying to track down problems.
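
As a quick liveness probe, you can hit the daemon web ports with curl (a sketch; the ports are the 0.20 defaults and the hostnames are placeholders, so adjust both to your cluster):

    curl -I http://master:50070/    # NameNode web UI (default port)
    curl -I http://master:50030/    # JobTracker web UI
    curl -I http://worker1:50060/   # TaskTracker servlet
    curl -I http://worker1:50075/   # DataNode servlet

An HTTP response means Jetty is up inside that daemon; "connection refused" means the embedded server never started.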


0.20 ConcurrentModificationException

2009-12-02 Thread Arv Mistry
Hi,

I've recently upgraded hadoop to 0.20 and am seeing this concurrent mod
exception on startup which I never got in 0.19.
Is this a known bug in 0.20? I did see this JIRA report, but I'm not sure if it's
related: http://issues.apache.org/jira/browse/HADOOP-6269

Is there a workaround or should I be getting the FS a different way in
0.20? 

java.util.ConcurrentModificationException
        at java.util.AbstractList$Itr.checkForComodification(Unknown Source)
        at java.util.AbstractList$Itr.next(Unknown Source)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1028)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:979)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:435)
        at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
        at com.rialto.hadoop.HadoopFileWriter.init(HadoopFileWriter.java:66)

Cheers Arv




Re: 0.20 ConcurrentModificationException

2009-12-02 Thread Todd Lipcon
Certainly looks like HADOOP-6269 to me.

Can you try Cloudera's distribution? This patch is included.

-Todd

On Wed, Dec 2, 2009 at 4:23 AM, Arv Mistry a...@kindsight.net wrote:

 Hi,

 I've recently upgraded hadoop to 0.20 and am seeing this concurrent mod
 exception on startup which I never got in 0.19.
 Is this a known bug in 0.20? I did see this JIRA report, but I'm not sure if it's
 related: http://issues.apache.org/jira/browse/HADOOP-6269

 Is there a workaround or should I be getting the FS a different way in
 0.20?

 java.util.ConcurrentModificationException
        at java.util.AbstractList$Itr.checkForComodification(Unknown Source)
        at java.util.AbstractList$Itr.next(Unknown Source)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1028)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:979)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:435)
        at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
        at com.rialto.hadoop.HadoopFileWriter.init(HadoopFileWriter.java:66)

 Cheers Arv





RE: 0.20 ConcurrentModificationException

2009-12-02 Thread Arv Mistry
Thanks Todd, I'm not sure what you mean by 'Cloudera's distribution'.
Is that a separate build of Hadoop? If so, please send me the link and I
will try it.

Cheers Arv

-Original Message-
From: Todd Lipcon [mailto:t...@cloudera.com] 
Sent: December 2, 2009 11:39 AM
To: common-user@hadoop.apache.org
Subject: Re: 0.20 ConcurrentModificationException

Certainly looks like HADOOP-6269 to me.

Can you try Cloudera's distribution? This patch is included.

-Todd

On Wed, Dec 2, 2009 at 4:23 AM, Arv Mistry a...@kindsight.net wrote:

 Hi,

 I've recently upgraded hadoop to 0.20 and am seeing this concurrent mod
 exception on startup which I never got in 0.19.
 Is this a known bug in 0.20? I did see this JIRA report, but I'm not sure if it's
 related: http://issues.apache.org/jira/browse/HADOOP-6269

 Is there a workaround or should I be getting the FS a different way in
 0.20?

 java.util.ConcurrentModificationException
        at java.util.AbstractList$Itr.checkForComodification(Unknown Source)
        at java.util.AbstractList$Itr.next(Unknown Source)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1028)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:979)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:435)
        at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
        at com.rialto.hadoop.HadoopFileWriter.init(HadoopFileWriter.java:66)

 Cheers Arv





Re: 0.20 ConcurrentModificationException

2009-12-02 Thread Todd Lipcon
On Wed, Dec 2, 2009 at 8:46 AM, Arv Mistry a...@kindsight.net wrote:

 Thanks Todd, I'm not sure what you mean by 'Cloudera's distribution'.
 Is that a separate build of Hadoop? If so, please send me the link and I
 will try it.


Yes - like Red Hat or Ubuntu provide distributions of Linux, we provide a
distro of Hadoop. You can get it from http://archive.cloudera.com/. If
you're used to the Apache tarballs, the CDH tarball should be a drop-in
replacement.

If you'd prefer to stick with Apache, you can manually apply the patch from
that JIRA and rebuild Hadoop.
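
(For reference, applying a JIRA patch to an Apache source checkout usually looks something like the sketch below; the attachment URL and the -p level depend on the particular patch, so treat this as an assumption rather than a recipe:)

    # from the root of the hadoop source tree;
    # download the .patch attachment from the HADOOP-6269 JIRA page first
    patch -p0 < HADOOP-6269.patch   # some patches need -p1 instead
    ant clean jar                   # rebuild and redeploy the core jar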

If it turns out that the problem sticks around, please report back or file a
JIRA.

Thanks
-Todd



 Cheers Arv

 -Original Message-
 From: Todd Lipcon [mailto:t...@cloudera.com]
 Sent: December 2, 2009 11:39 AM
 To: common-user@hadoop.apache.org
 Subject: Re: 0.20 ConcurrentModificationException

 Certainly looks like HADOOP-6269 to me.

 Can you try Cloudera's distribution? This patch is included.

 -Todd

 On Wed, Dec 2, 2009 at 4:23 AM, Arv Mistry a...@kindsight.net wrote:

  Hi,
 
  I've recently upgraded hadoop to 0.20 and am seeing this concurrent mod
  exception on startup which I never got in 0.19.
  Is this a known bug in 0.20? I did see this JIRA report, but I'm not sure if it's
  related: http://issues.apache.org/jira/browse/HADOOP-6269
 
  Is there a workaround or should I be getting the FS a different way in
  0.20?
 
  java.util.ConcurrentModificationException
         at java.util.AbstractList$Itr.checkForComodification(Unknown Source)
         at java.util.AbstractList$Itr.next(Unknown Source)
         at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1028)
         at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:979)
         at org.apache.hadoop.conf.Configuration.get(Configuration.java:435)
         at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103)
         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
         at com.rialto.hadoop.HadoopFileWriter.init(HadoopFileWriter.java:66)
 
  Cheers Arv
 
 
 



hadoop idle time on terasort

2009-12-02 Thread Vasilis Liaskovitis
Hi,

I am using hadoop-0.20.1 to run terasort and randsort benchmarking
tests on a small 8-node linux cluster. Most runs show low (<50%) core
utilization in the map and reduce phases, as well as heavy I/O phases.
There is usually a large fraction of the runtime during which cores
are idling and disk I/O traffic is not heavy.

On average, for the duration of a terasort run I get 20-30% cpu
utilization, 10-30% iowait time, and the remaining 40-70% is idle time.
This is data collected with mpstat for the duration of the run across
the cores of a specific node. This utilization behaviour is consistent
across all tasktracker/data nodes. (The namenode cores and I/O are
mostly idle, so there doesn't seem to be a bottleneck in the namenode.)

I am looking for an explanation for the significant idle time in these
runs. Could it have something to do with misconfigured network/RPC
latency hadoop parameters? For example, I have tried to increase
mapred.heartbeats.in.second from 100 to 1000, but that didn't help. The
network bandwidth (1 GigE card on each node) is not saturated during
the runs, according to my netstat results.

Have other people noticed significant cpu idle times that can't be
explained by I/O traffic?

Is it reasonable to always expect decreasing idle times as the
terasort dataset scales on the same cluster? I've only tried 2 small
datasets of 40GB and 64GB each, but core utilizations didn't increase
with the runs done so far.

Yahoo's paper on terasort (http://sortbenchmark.org/Yahoo2009.pdf)
mentions several performance optimizations, some of which seem
relevant to idle times. I am wondering which, if any, of the Yahoo
patches are part of the hadoop-0.20.1 distribution.

Would it be a good idea to try a development version of hadoop to
resolve this issue?

thanks,

- Vasilis


Re: RE: Using Hadoop in non-typical large scale user-driven environment

2009-12-02 Thread Ed Kohlwey
As far as replication goes, you should look at a project called Pastry.
Apparently some people have used hadoop mapreduce on top of it. You will
need to be clever, however, in how you do your mapreduce, because you
probably won't want the job to eat all the users' cpu time.

On Dec 2, 2009 5:11 PM, Habermaas, William william.haberm...@fatwire.com
wrote:

Hadoop isn't going to like losing its datanodes when people shut down their
computers.
More importantly, when the datanodes are running, your users will be
impacted by data replication. Unlike SETI@home, Hadoop doesn't know when the
user's screensaver is running, so it will start doing things whenever it feels
like it.

Can someone else comment on whether HOD (hadoop-on-demand) would fit this
scenario?
Bill

-Original Message- From: Maciej Trebacz [mailto:
maciej.treb...@gmail.com] Sent: Wednesday,...


Re: hadoop idle time on terasort

2009-12-02 Thread Todd Lipcon
Hi Vasilis,

This is seen reasonably often, and could be partly due to missed
configuration changes. A few things to check:

- Did you increase the number of tasks per node from the default? If you
have a reasonable number of disks/cores, you're going to want to run a lot
more than 2 map and 2 reduce tasks on each node (see the sketch after this list).

- Have you tuned any other settings? If you google around you can find some
guides for configuration tuning that should help squeeze some performance
out of your cluster.

There are several patches that aren't in 0.20.1 but will be in 0.21 that
help performance. These aren't eligible for backport into 0.20 since point
releases are for bug fixes only. Some are eligible for backporting into
Cloudera's distro (or Yahoo's) and may show up in our next release (CDH3)
which should be available first in January for those who like to live on the
edge.

Thanks,
-Todd

On Wed, Dec 2, 2009 at 12:22 PM, Vasilis Liaskovitis vlias...@gmail.com wrote:

 Hi,

 I am using hadoop-0.20.1 to run terasort and randsort benchmarking
 tests on a small 8-node linux cluster. Most runs show low (<50%) core
 utilization in the map and reduce phases, as well as heavy I/O phases.
 There is usually a large fraction of the runtime during which cores
 are idling and disk I/O traffic is not heavy.

 On average, for the duration of a terasort run I get 20-30% cpu
 utilization, 10-30% iowait time, and the remaining 40-70% is idle time.
 This is data collected with mpstat for the duration of the run across
 the cores of a specific node. This utilization behaviour is consistent
 across all tasktracker/data nodes. (The namenode cores and I/O are
 mostly idle, so there doesn't seem to be a bottleneck in the namenode.)

 I am looking for an explanation for the significant idle time in these
 runs. Could it have something to do with misconfigured network/RPC
 latency hadoop parameters? For example, I have tried to increase
 mapred.heartbeats.in.second from 100 to 1000, but that didn't help. The
 network bandwidth (1 GigE card on each node) is not saturated during
 the runs, according to my netstat results.

 Have other people noticed significant cpu idle times that can't be
 explained by I/O traffic?

 Is it reasonable to always expect decreasing idle times as the
 terasort dataset scales on the same cluster? I've only tried 2 small
 datasets of 40GB and 64GB each, but core utilizations didn't increase
 with the runs done so far.

 Yahoo's paper on terasort (http://sortbenchmark.org/Yahoo2009.pdf)
 mentions several performance optimizations, some of which seem
 relevant to idle times. I am wondering which, if any, of the Yahoo
 patches are part of the hadoop-0.20.1 distribution.

 Would it be a good idea to try a development version of hadoop to
 resolve this issue?

 thanks,

 - Vasilis



fair scheduler preemptions timeout difficulties

2009-12-02 Thread james warren
Greetings, Hadoop Fans:

I'm attempting to use the timeout feature of the Fair Scheduler (using
Cloudera's most recently released distribution 0.20.1+152-1), but without
success.  I'm using the following configs:

/etc/hadoop/conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:8021</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>9</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
  </property>
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>
  <property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>/etc/hadoop/conf/pools.xml</value>
  </property>
  <property>
    <name>mapred.fairscheduler.assignmultiple</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.fairscheduler.poolnameproperty</name>
    <value>pool.name</value>
  </property>
  <property>
    <name>pool.name</name>
    <value>default</value>
  </property>
</configuration>

and /etc/hadoop/conf/pools.xml

<?xml version="1.0"?>
<allocations>
  <pool name="realtime">
    <minMaps>4</minMaps>
    <minReduces>1</minReduces>
    <minSharePreemptionTimeout>180</minSharePreemptionTimeout>
    <weight>2.0</weight>
  </pool>
  <pool name="default">
    <minMaps>2</minMaps>
    <minReduces>2</minReduces>
    <maxRunningJobs>1</maxRunningJobs>
  </pool>
</allocations>

but a job in the realtime pool fails to preempt a job running in the
default pool (I waited for > 15 minutes). Is there something wrong with my
configs? Or is there anything in the logs that would be useful for
debugging? (I've only found a "successfully configured fairscheduler"
message in the jobtracker log upon starting up the daemon.)

Help would be extremely appreciated!

Thanks,
-James Warren


Re: hadoop idle time on terasort

2009-12-02 Thread Vasilis Liaskovitis
Hi Todd,

thanks for the reply.


 This is seen reasonably often, and could be partly due to missed
 configuration changes. A few things to check:

 - Did you increase the number of tasks per node from the default? If you
 have a reasonable number of disks/cores, you're going to want to run a lot
 more than 2 map and 2 reduce tasks on each node.

For all tests so far, I have increased
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum to the number of cores per
tasktracker/node (12 cores per node).
I've also set mapred.map.tasks and mapred.reduce.tasks to a prime
close to the number of nodes, i.e. 8 (though the recommendation for
mapred.map.tasks is a prime several times greater than the number of
hosts).

 - Have you tuned any other settings? If you google around you can find some
 guides for configuration tuning that should help squeeze some performance
 out of your cluster.

I am reusing JVMs. I also enabled the default codec compression (native
zlib, I think) for intermediate map outputs. This decreased iowait
times for some datasets, but idle time is still significant even with
compression. I wonder if LZO compression would have better results -
less overall execution time and perhaps less idle time?
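
(For reference, a sketch of the map-output compression knobs in 0.20; the LZO codec class shown assumes the external hadoop-lzo library is installed, since LZO is not bundled with Apache Hadoop:)

    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <!-- assumes the external hadoop-lzo library is on the classpath -->
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>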

I also increased io.sort.mb (set to half the JVM heap size), though I am
not sure how that has affected performance yet. If other parameters could
be significant here, let me know. Would increasing the number of I/O
streams (io.sort.factor, I think) help with a not-so-beefy disk system
per node?

If you can recommend a specific tutorial/guide/blog for performance
tuning, feel free to share (though I suspect there may be many out
there).

 There are several patches that aren't in 0.20.1 but will be in 0.21 that
 help performance. These aren't eligible for backport into 0.20 since point
 releases are for bug fixes only. Some are eligible for backporting into
 Cloudera's distro (or Yahoo's) and may show up in our next release (CDH3)
 which should be available first in January for those who like to live on the
 edge.

ok, thanks. I'll try to check out 0.21 or a Cloudera distro at some
point. I wonder if there's a centralized svn/git somewhere if I want to
build from source, or do I need to somehow combine all the subprojects
(hadoop-common, hadoop-mapred and hadoop-hdfs)?

thanks again,

- Vasilis

 Thanks,
 -Todd

 On Wed, Dec 2, 2009 at 12:22 PM, Vasilis Liaskovitis
 vlias...@gmail.com wrote:

 Hi,

 I am using hadoop-0.20.1 to run terasort and randsort benchmarking
 tests on a small 8-node linux cluster. Most runs show low (<50%) core
 utilization in the map and reduce phases, as well as heavy I/O phases.
 There is usually a large fraction of the runtime during which cores
 are idling and disk I/O traffic is not heavy.

 On average, for the duration of a terasort run I get 20-30% cpu
 utilization, 10-30% iowait time, and the remaining 40-70% is idle time.
 This is data collected with mpstat for the duration of the run across
 the cores of a specific node. This utilization behaviour is consistent
 across all tasktracker/data nodes. (The namenode cores and I/O are
 mostly idle, so there doesn't seem to be a bottleneck in the namenode.)

 I am looking for an explanation for the significant idle time in these
 runs. Could it have something to do with misconfigured network/RPC
 latency hadoop parameters? For example, I have tried to increase
 mapred.heartbeats.in.second from 100 to 1000, but that didn't help. The
 network bandwidth (1 GigE card on each node) is not saturated during
 the runs, according to my netstat results.

 Have other people noticed significant cpu idle times that can't be
 explained by I/O traffic?

 Is it reasonable to always expect decreasing idle times as the
 terasort dataset scales on the same cluster? I've only tried 2 small
 datasets of 40GB and 64GB each, but core utilizations didn't increase
 with the runs done so far.

 Yahoo's paper on terasort (http://sortbenchmark.org/Yahoo2009.pdf)
 mentions several performance optimizations, some of which seem
 relevant to idle times. I am wondering which, if any, of the Yahoo
 patches are part of the hadoop-0.20.1 distribution.

 Would it be a good idea to try a development version of hadoop to
 resolve this issue?

 thanks,

 - Vasilis




Re: fair scheduler preemptions timeout difficulties

2009-12-02 Thread james warren
Todd from Cloudera solved this for me on their company's forum.

What you're missing is the mapred.fairscheduler.preemption property in
mapred-site.xml - without this on, the preemption settings in the
allocations file are ignored... to turn it on, set that property's value to
'true'

Thanks, Todd!
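
(Concretely, the missing stanza in mapred-site.xml looks like this:)

    <property>
      <name>mapred.fairscheduler.preemption</name>
      <value>true</value>
    </property>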

On Wed, Dec 2, 2009 at 4:26 PM, james warren ja...@rockyou.com wrote:

 Greetings, Hadoop Fans:

 I'm attempting to use the timeout feature of the Fair Scheduler (using
 Cloudera's most recently released distribution 0.20.1+152-1), but without
 success.  I'm using the following configs:

 /etc/hadoop/conf/mapred-site.xml

 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 <configuration>
   <property>
     <name>mapred.job.tracker</name>
     <value>hadoop-master:8021</value>
   </property>
   <property>
     <name>mapred.tasktracker.map.tasks.maximum</name>
     <value>9</value>
   </property>
   <property>
     <name>mapred.tasktracker.reduce.tasks.maximum</name>
     <value>3</value>
   </property>
   <property>
     <name>mapred.jobtracker.taskScheduler</name>
     <value>org.apache.hadoop.mapred.FairScheduler</value>
   </property>
   <property>
     <name>mapred.fairscheduler.allocation.file</name>
     <value>/etc/hadoop/conf/pools.xml</value>
   </property>
   <property>
     <name>mapred.fairscheduler.assignmultiple</name>
     <value>true</value>
   </property>
   <property>
     <name>mapred.fairscheduler.poolnameproperty</name>
     <value>pool.name</value>
   </property>
   <property>
     <name>pool.name</name>
     <value>default</value>
   </property>
 </configuration>

 and /etc/hadoop/conf/pools.xml

 <?xml version="1.0"?>
 <allocations>
   <pool name="realtime">
     <minMaps>4</minMaps>
     <minReduces>1</minReduces>
     <minSharePreemptionTimeout>180</minSharePreemptionTimeout>
     <weight>2.0</weight>
   </pool>
   <pool name="default">
     <minMaps>2</minMaps>
     <minReduces>2</minReduces>
     <maxRunningJobs>1</maxRunningJobs>
   </pool>
 </allocations>

 but a job in the realtime pool fails to preempt a job running in the
 default pool (I waited for > 15 minutes). Is there something wrong with my
 configs? Or is there anything in the logs that would be useful for
 debugging? (I've only found a "successfully configured fairscheduler"
 message in the jobtracker log upon starting up the daemon.)

 Help would be extremely appreciated!

 Thanks,
 -James Warren




Re: fair scheduler preemptions timeout difficulties

2009-12-02 Thread Todd Lipcon
No problem :) Also worth noting for anyone listening in that this feature is
not in 0.20.1 - it's been backported into CDH. It will arrive in 0.21.

Thanks
-Todd

On Wed, Dec 2, 2009 at 4:55 PM, james warren ja...@rockyou.com wrote:

 Todd from Cloudera solved this for me on their company's forum.

 What you're missing is the mapred.fairscheduler.preemption property in
 mapred-site.xml - without this on, the preemption settings in the
 allocations file are ignored... to turn it on, set that property's value to
 'true'

 Thanks, Todd!

 On Wed, Dec 2, 2009 at 4:26 PM, james warren ja...@rockyou.com wrote:

  Greetings, Hadoop Fans:
 
  I'm attempting to use the timeout feature of the Fair Scheduler (using
  Cloudera's most recently released distribution 0.20.1+152-1), but without
  success.  I'm using the following configs:
 
  /etc/hadoop/conf/mapred-site.xml
 
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>hadoop-master:8021</value>
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>9</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>3</value>
    </property>
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/etc/hadoop/conf/pools.xml</value>
    </property>
    <property>
      <name>mapred.fairscheduler.assignmultiple</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.fairscheduler.poolnameproperty</name>
      <value>pool.name</value>
    </property>
    <property>
      <name>pool.name</name>
      <value>default</value>
    </property>
  </configuration>

  and /etc/hadoop/conf/pools.xml

  <?xml version="1.0"?>
  <allocations>
    <pool name="realtime">
      <minMaps>4</minMaps>
      <minReduces>1</minReduces>
      <minSharePreemptionTimeout>180</minSharePreemptionTimeout>
      <weight>2.0</weight>
    </pool>
    <pool name="default">
      <minMaps>2</minMaps>
      <minReduces>2</minReduces>
      <maxRunningJobs>1</maxRunningJobs>
    </pool>
  </allocations>
 
  but a job in the realtime pool fails to preempt a job running in the
  default pool (I waited for > 15 minutes). Is there something wrong with my
  configs? Or is there anything in the logs that would be useful for
  debugging? (I've only found a "successfully configured fairscheduler"
  message in the jobtracker log upon starting up the daemon.)
 
  Help would be extremely appreciated!
 
  Thanks,
  -James Warren
 
 



Hadoop XML parse error

2009-12-02 Thread sumap

When I try to retrieve hadoop properties, I get the following error:

java.lang.NoSuchMethodError: javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(Z)V
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1053)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1029)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:979)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:435)
        at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)

I came across this post while searching, and its suggestion works when I invoke
my class from the command line:
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c701549.92181...@web94705.mail.in2.yahoo.com%3e

But when I try to run my class from Tomcat, I get the above error. I invoke
Tomcat with the following system property, as mentioned in the above post. I
suspect this error happens because Tomcat runs in a separate JVM.

-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl

I also tried adding this system property override to the hadoop java tasks using
the HADOOP_*_OPTS variables, but it still does not work. Any ideas on how to
solve this issue?
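
(For reference, a sketch of passing the override to Tomcat's JVM via CATALINA_OPTS, assuming the stock catalina.sh startup script:)

    # assumes the stock catalina.sh startup script
    export CATALINA_OPTS="-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl"
    $CATALINA_HOME/bin/catalina.sh start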

Thanks,
-Suma



-- 
View this message in context: 
http://old.nabble.com/Hadoop-XML-parse-error-tp26619754p26619754.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Fair Scheduler config issues

2009-12-02 Thread Derek Brown
I'm using Cloudera's distribution of 0.20.1, but this seems like a general
question, so I'm posting here.

I'm having some issues getting the Fair Scheduler set up. I followed the
basic instructions, from
http://hadoop.apache.org/common/docs/current/fair_scheduler.html:

* Added to mapred-site.xml:

  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>

  <property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>/etc/hadoop/conf/fairscheduler.xml</value>
  </property>

The fair scheduler jar was already in the installation's root lib/

* Added the basic fairscheduler.xml, based on the example in the docs.

  <property>
    <name>mapred.fairscheduler.poolnameproperty</name>
    <value>${pool.name}</value>
    <description>...</description>
  </property>

  <property>
    <name>pool.name</name>
    <value>${user.name}</value>
    <description>...</description>
  </property>

Running a job (say, one of the examples, such as the pi estimator, word
count, or sleep) and checking myhost:50030/scheduler, I see the job listed in
the Pools table in the "hadoop" row, since that's the user. That makes
sense. In the Running Jobs table, the dropdown in the Pool column sometimes
shows "hadoop" and sometimes "default" when I reload the page, which is odd.

Then if I change the xml's pool.name entry's value to a hardcoded value, say
"foo", with a matching "foo" pool entry in the xml, and run a job (and
restart the JobTracker to be safe), I do see a "foo" row in the Pools table,
but it shows 0 Running Jobs, and "default" shows the one job. Also, the Pool
listed in the dropdown in the Running Jobs table remains "default", rather
than "foo" (although "foo" is a choice, and I CAN select it to change the
pool).

I'd expect that if I set pool.name in fairscheduler.xml, jobs would
run, and appear, under that pool. Am I missing something in my setup or in
my understanding of how this should work? Thanks for any insight. What I'd
like to be able to do is set the pool name on the command line when running
a job, with an arg of -Dpool.name=bar.

Thanks,
Derek


Re: Fair Scheduler config issues

2009-12-02 Thread Todd Lipcon
Hi Derek,

You should set poolnameproperty to pool.name, not ${pool.name}.

That should fix your issues.
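
(That is, a sketch of the corrected stanza in mapred-site.xml - the value is the literal property name, not a variable reference:)

    <property>
      <name>mapred.fairscheduler.poolnameproperty</name>
      <value>pool.name</value>
    </property>

With that in place, passing -Dpool.name=bar on the job command line should land the job in the "bar" pool, as you wanted.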

-Todd

On Wed, Dec 2, 2009 at 7:46 PM, Derek Brown de...@media6degrees.com wrote:

 I'm using Cloudera's distribution of 0.20.1, but this seems like a general
 question, so I'm posting here.

 I'm having some issues getting the Fair Scheduler set up. I followed the
 basic instructions, from
 http://hadoop.apache.org/common/docs/current/fair_scheduler.html:

 * Added to mapred-site.xml:

   <property>
     <name>mapred.jobtracker.taskScheduler</name>
     <value>org.apache.hadoop.mapred.FairScheduler</value>
   </property>

   <property>
     <name>mapred.fairscheduler.allocation.file</name>
     <value>/etc/hadoop/conf/fairscheduler.xml</value>
   </property>

 The fair scheduler jar was already in the installation's root lib/

 * Added the basic fairscheduler.xml, based on the example in the docs.

   <property>
     <name>mapred.fairscheduler.poolnameproperty</name>
     <value>${pool.name}</value>
     <description>...</description>
   </property>

   <property>
     <name>pool.name</name>
     <value>${user.name}</value>
     <description>...</description>
   </property>

 Running a job (say, one of the examples, such as the pi estimator, word
 count, or sleep) and checking myhost:50030/scheduler, I see the job listed in
 the Pools table in the "hadoop" row, since that's the user. That makes
 sense. In the Running Jobs table, the dropdown in the Pool column sometimes
 shows "hadoop" and sometimes "default" when I reload the page, which is odd.

 Then if I change the xml's pool.name entry's value to a hardcoded value, say
 "foo", with a matching "foo" pool entry in the xml, and run a job (and
 restart the JobTracker to be safe), I do see a "foo" row in the Pools table,
 but it shows 0 Running Jobs, and "default" shows the one job. Also, the Pool
 listed in the dropdown in the Running Jobs table remains "default", rather
 than "foo" (although "foo" is a choice, and I CAN select it to change the
 pool).

 I'd expect that if I set pool.name in fairscheduler.xml, jobs would
 run, and appear, under that pool. Am I missing something in my setup or in
 my understanding of how this should work? Thanks for any insight. What I'd
 like to be able to do is set the pool name on the command line when running
 a job, with an arg of -Dpool.name=bar.

 Thanks,
 Derek



Hadoop with Multiple Inpus and Outputs

2009-12-02 Thread James R. Leek
I've been trying to figure out how to do a set difference in hadoop. I 
would like to take 2 files and remove the values they have in common 
between them. Let's say I have two bags, 'students' and 'employees'. I 
want to find which students are just students, and which employees are 
just employees. So, an example:


Students:
(Jane)
(John)
(Dave)

Employees:
(Dave)
(Sue)
(Anne)

If I were to join these, I would get the students who are also 
employees, or: (Dave).


However, what I want is the distinct values:

Only_Student:
(Jane)
(John)

Only_Employee:
(Sue)
(Anne)


I was able to do this in Pig, but I think I should be able to do it in 
one MapReduce pass (with hadoop 0.20.1). I read from two files and 
attach the file names as the values (Students and Employees in this 
example; my actual problem is on DNA - bacteria and viruses in this 
case). Then I output from the reducer if I only get one value for a 
given key. However, I've had some real trouble figuring out 
MultipleOutputs and the multiple inputs. I've attached my code. I'm 
getting this error, which is a total mystery to me:

09/12/02 22:33:52 INFO mapred.JobClient: Task Id : attempt_200911301448_0019_m_00_2, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:807)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:504)
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)


Thanks,
Jim
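
(A note on the trace above: the frame org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124) shows the framework's built-in identity map running, which passes the LongWritable file offset through as the map output key - hence the type mismatch. In the new org.apache.hadoop.mapreduce API, map() must take a Context parameter; a map() declared with OutputCollector/Reporter parameters, as in the attached code below, overrides nothing, so the identity map runs even if the mapper class is registered. A hedged sketch of the signature the mapper would need, with the body adapted from the attached code:)

    // Illustrative sketch, not the attached code: the new-API map() signature.
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // tag each record with the input file it came from
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        location.set(fileName.contains("bact") ? "b" : "v");

        // emit the first token of the line, tagged with its source
        StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
        word.set(itr.nextToken());
        context.write(word, location);
    }
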
package org.myorg;


import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
//import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.lib.MultipleOutputs;
import org.apache.hadoop.mapred.JobConf;

public class DNAUnique {

    public static class DNAUniqueMapper
            extends Mapper<Object, Text, Text, Text>
            implements Configurable {

        private Text word = new Text();
        private Text location = new Text();

        private Configuration conf;
        private int kmerSize = 5;

        public Configuration getConf() {
            return conf;
        }

        public void setConf(Configuration inConf) {
            conf = inConf;
        }

        public void configure(Configuration conf) {
            System.out.println("in configure");
            kmerSize = conf.getInt("kmerSize", 6);
        }

        public void map(Object key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException, InterruptedException {
            configure(getConf());

            FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
            String fileName = fileSplit.getPath().getName();
            if (fileName.contains("bact")) {
                location.set("b");
            } else {
                location.set("v");
            }

            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line.toLowerCase());

            word.set(itr.nextToken());
            output.collect(word, location);
        }
    }

    public static class DNAUniqueReducer
            extends Reducer<Text, Text, Text, Text> implements Configurable {
        private MultipleOutputs mos;
        private Configuration conf;

        public Configuration getConf() {
            return conf;
        }

        public void setConf(Configuration inConf) {
            conf = inConf;
        }

        public void configure(Configuration conf) {
            JobConf jconf = (JobConf) conf;
            mos = new MultipleOutputs(jconf);
        }

        private Text space = new Text(" "); // Just some crap

        public void reduce(Text key, Iterable<Text> values,
                           OutputCollector output, Reporter reporter) throws IOException, InterruptedException {
            configure(getConf());
            int count = 0;
            boolean isBact = false;
            boolean isVirus = false;
            for (Text val : values) {
                String location = val.toString();
                if (location.equals("b")) {
                    isBact = true;
                } else if (location.equals("v")) {
                    isVirus = true;
                }
                ++count;
            }

            if (count == 1) {
                if (isBact) {