Re: Container size

2012-01-17 Thread Arun C Murthy
The AM gets this information from the RM via the return value of
AMRMProtocol.registerApplicationMaster.
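
For concreteness, a minimal sketch of reading that value (against the 0.23-era
API; class and method names may differ slightly in other versions, and the
1024 MB figure is just a placeholder, not a recommendation):

import org.apache.hadoop.yarn.api.AMRMProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterRequest;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.Records;

public class ContainerSizing {
  // Assumes amRm is an already-constructed proxy to the ResourceManager and
  // req is a populated RegisterApplicationMasterRequest.
  static Resource chooseCapability(AMRMProtocol amRm,
                                   RegisterApplicationMasterRequest req)
      throws Exception {
    RegisterApplicationMasterResponse resp = amRm.registerApplicationMaster(req);
    Resource min = resp.getMinimumResourceCapability();  // cluster-wide minimum
    Resource max = resp.getMaximumResourceCapability();  // cluster-wide maximum
    Resource capability = Records.newRecord(Resource.class);
    // Request at least the minimum; the RM typically normalizes requests up to
    // a multiple of it.
    capability.setMemory(Math.min(max.getMemory(),
                                  Math.max(min.getMemory(), 1024)));
    return capability;
  }
}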

On Jan 17, 2012, at 11:24 PM, raghavendhra rahul wrote:

> Hi,
> 
> What is the minimum size of a container in Hadoop YARN?
> capability.setMemory(xx);



Container size

2012-01-17 Thread raghavendhra rahul
Hi,

 What is the minimum size of a container in Hadoop YARN?
capability.setMemory(xx);


Re: Using S3 instead of HDFS

2012-01-17 Thread Harsh J
Ah, sorry about missing that. The settings go in core-site.xml (hdfs-site.xml
will no longer be relevant once you switch to using S3).
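
For reference, a minimal sketch of what that looks like (the bucket name and
credentials are placeholders; the fs.s3n.* property names apply to s3n:// URIs,
while the s3:// block store uses the fs.s3.* names mentioned below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3Check {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // These three properties would normally live in core-site.xml.
    conf.set("fs.default.name", "s3n://my-bucket");
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");
    FileSystem fs = FileSystem.get(conf);
    // List the bucket root to confirm the URI and credentials work.
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath());
    }
  }
}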

On 18-Jan-2012, at 12:36 PM, Mark Kerzner wrote:

> That wiki page mentions hadoop-site.xml, but that is old; now you have
> core-site.xml and hdfs-site.xml, so which one do you put it in?
> 
> Thank you (and good night Central Time:)
> 
> mark
> 
> On Wed, Jan 18, 2012 at 12:52 AM, Harsh J  wrote:
> 
>> When using S3 you do not need to run any component of HDFS at all. It
>> is meant to be an alternate FS choice. You need to run only MR.
>> 
>> The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions how
>> to go about specifying your auth details to S3, either directly
>> via the fs.default.name URI or via the additional properties
>> fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work
>> for you?
>> 
>> On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner 
>> wrote:
>>> Well, here is my error message
>>> 
>>> Starting Hadoop namenode daemon: starting namenode, logging to
>>> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out
>>> ERROR. Could not start Hadoop namenode daemon
>>> Starting Hadoop secondarynamenode daemon: starting secondarynamenode,
>>> logging to
>>> 
>> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out
>>> Exception in thread "main" java.lang.IllegalArgumentException: Invalid
>> URI
>>> for NameNode address (check fs.default.name): s3n://myname.testdata is
>> not
>>> of scheme 'hdfs'.
>>>   at
>>> 
>> org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224)
>>>   at
>>> 
>> org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209)
>>>   at
>>> 
>> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182)
>>>   at
>>> 
>> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:150)
>>>   at
>>> 
>> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624)
>>> ERROR. Could not start Hadoop secondarynamenode daemon
>>> 
>>> And, if I don't need to start the NameNode, then where do I give the S3
>>> credentials?
>>> 
>>> Thank you,
>>> Mark
>>> 
>>> 
>>> On Wed, Jan 18, 2012 at 12:36 AM, Harsh J  wrote:
>>> 
>>>> Hey Mark,
>>>>
>>>> What is the exact trouble you run into? What do the error messages
>>>> indicate?
>>>>
>>>> This should be definitive enough I think:
>>>> http://wiki.apache.org/hadoop/AmazonS3
>>>>
>>>> On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner <
>>>> mark.kerz...@shmsoft.com>
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> whatever I do, I can't make it work, that is, I cannot use
>>>>>
>>>>> s3://host
>>>>>
>>>>> or s3n://host
>>>>>
>>>>> as a replacement for HDFS while running an EC2 cluster. I change the
>>>>> settings
>>>>> in core-site.xml and in hdfs-site.xml, and start hadoop services,
>>>>> and it
>>>>> fails with error messages.
>>>>>
>>>>> Is there a place where this is clearly described?
>>>>>
>>>>> Thank you so much.
>>>>>
>>>>> Mark
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>> Customer Ops. Engineer, Cloudera
>>>>
>> 
>> 
>> 
>> --
>> Harsh J
>> Customer Ops. Engineer, Cloudera
>> 

--
Harsh J
Customer Ops. Engineer, Cloudera



Re: Is it possible to set how many map slots to use on each job submission?

2012-01-17 Thread Harsh J
Edward,

You need to invest in configuring a non-FIFO scheduler. FairScheduler may be 
what you are looking for. Take a look at 
http://hadoop.apache.org/common/docs/current/fair_scheduler.html for the docs.
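
For Ed's situation, the usual pattern is to define a small pool for urgent work
in the scheduler's allocations file and tag urgent jobs with that pool at
submission time. A minimal job-side sketch (the "urgent" pool and the
"pool.name" property are assumptions, not defaults; they must match the
cluster's fair scheduler configuration):

import org.apache.hadoop.mapred.JobConf;

public class UrgentJob {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Assumes mapred.fairscheduler.poolnameproperty is set to "pool.name" and
    // an "urgent" pool (with its own share, or with maxMaps/maxReduces caps on
    // the big pool, depending on version) exists in the allocations file.
    conf.set("pool.name", "urgent");
    // ... set mapper, reducer, input and output paths as usual, then submit.
  }
}

The allocation-file elements (shares, maxMaps/maxReduces and so on) vary by
version; the docs linked above describe the exact syntax.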

On 18-Jan-2012, at 12:27 PM, edward choi wrote:

> Hi,
> 
> I often run into situations like this:
> I am running a very heavy job (let's say job 1) on a Hadoop cluster, and it
> takes many hours. Then something comes up that needs to be done very
> quickly (let's say job 2).
> Job 2 only takes a couple of hours when executed on Hadoop, but it would
> take tens of hours if run on a single machine.
> So I'd definitely want to use Hadoop for job 2. But since job 1 is already
> running on Hadoop and hogging all the map slots, I can't run job 2 on
> Hadoop (it would only be queued).
> 
> So I was wondering:
> Is there a way to set a specific number of map slots (or the number of slave
> nodes) to use when submitting each job?
> I read that setNumMapTasks() is not an enforcing configuration.
> I would like to leave a couple of map slots free for occasions like above.
> 
> Ed

--
Harsh J
Customer Ops. Engineer, Cloudera



Re: Using S3 instead of HDFS

2012-01-17 Thread Mark Kerzner
That wiki page mentions hadoop-site.xml, but that is old; now you have
core-site.xml and hdfs-site.xml, so which one do you put it in?

Thank you (and good night Central Time:)

mark

On Wed, Jan 18, 2012 at 12:52 AM, Harsh J  wrote:

> When using S3 you do not need to run any component of HDFS at all. It
> is meant to be an alternate FS choice. You need to run only MR.
>
> The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions how
> to go about specifying your auth details to S3, either directly
> via the fs.default.name URI or via the additional properties
> fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work
> for you?
>
> On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner 
> wrote:
> > Well, here is my error message
> >
> > Starting Hadoop namenode daemon: starting namenode, logging to
> > /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out
> > ERROR. Could not start Hadoop namenode daemon
> > Starting Hadoop secondarynamenode daemon: starting secondarynamenode,
> > logging to
> >
> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out
> > Exception in thread "main" java.lang.IllegalArgumentException: Invalid
> URI
> > for NameNode address (check fs.default.name): s3n://myname.testdata is
> not
> > of scheme 'hdfs'.
> >at
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224)
> >at
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209)
> >at
> >
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182)
> >at
> >
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:150)
> >at
> >
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624)
> > ERROR. Could not start Hadoop secondarynamenode daemon
> >
> > And, if I don't need to start the NameNode, then where do I give the S3
> > credentials?
> >
> > Thank you,
> > Mark
> >
> >
> > On Wed, Jan 18, 2012 at 12:36 AM, Harsh J  wrote:
> >
> >> Hey Mark,
> >>
> >> What is the exact trouble you run into? What do the error messages
> >> indicate?
> >>
> >> This should be definitive enough I think:
> >> http://wiki.apache.org/hadoop/AmazonS3
> >>
> >> On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner <
> mark.kerz...@shmsoft.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > whatever I do, I can't make it work, that is, I cannot use
> >> >
> >> > s3://host
> >> >
> >> > or s3n://host
> >> >
> >> > as a replacement for HDFS while running an EC2 cluster. I change the
> >> settings
> >> > in core-site.xml and in hdfs-site.xml, and start hadoop services,
> and it
> >> > fails with error messages.
> >> >
> >> > Is there a place where this is clearly described?
> >> >
> >> > Thank you so much.
> >> >
> >> > Mark
> >>
> >>
> >>
> >> --
> >> Harsh J
> >> Customer Ops. Engineer, Cloudera
> >>
>
>
>
> --
> Harsh J
> Customer Ops. Engineer, Cloudera
>


Is it possible to set how many map slots to use on each job submission?

2012-01-17 Thread edward choi
Hi,

I often run into situations like this:
I am running a very heavy job (let's say job 1) on a Hadoop cluster, and it
takes many hours. Then something comes up that needs to be done very
quickly (let's say job 2).
Job 2 only takes a couple of hours when executed on Hadoop, but it would
take tens of hours if run on a single machine.
So I'd definitely want to use Hadoop for job 2. But since job 1 is already
running on Hadoop and hogging all the map slots, I can't run job 2 on
Hadoop (it would only be queued).

So I was wondering:
Is there a way to set a specific number of map slots (or the number of slave
nodes) to use when submitting each job?
I read that setNumMapTasks() is not an enforcing configuration.
I would like to leave a couple of map slots free for occasions like above.

Ed


Re: Using S3 instead of HDFS

2012-01-17 Thread Harsh J
When using S3 you do not need to run any component of HDFS at all. It
is meant to be an alternate FS choice. You need to run only MR.

The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions how
to go about specifying your auth details to S3, either directly
via the fs.default.name URI or via the additional properties
fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work
for you?

On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner  wrote:
> Well, here is my error message
>
> Starting Hadoop namenode daemon: starting namenode, logging to
> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out
> ERROR. Could not start Hadoop namenode daemon
> Starting Hadoop secondarynamenode daemon: starting secondarynamenode,
> logging to
> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out
> Exception in thread "main" java.lang.IllegalArgumentException: Invalid URI
> for NameNode address (check fs.default.name): s3n://myname.testdata is not
> of scheme 'hdfs'.
>        at
> org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224)
>        at
> org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209)
>        at
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182)
>        at
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:150)
>        at
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624)
> ERROR. Could not start Hadoop secondarynamenode daemon
>
> And, if I don't need to start the NameNode, then where do I give the S3
> credentials?
>
> Thank you,
> Mark
>
>
> On Wed, Jan 18, 2012 at 12:36 AM, Harsh J  wrote:
>
>> Hey Mark,
>>
>> What is the exact trouble you run into? What do the error messages
>> indicate?
>>
>> This should be definitive enough I think:
>> http://wiki.apache.org/hadoop/AmazonS3
>>
>> On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner 
>> wrote:
>> > Hi,
>> >
>> > whatever I do, I can't make it work, that is, I cannot use
>> >
>> > s3://host
>> >
>> > or s3n://host
>> >
>> > as a replacement for HDFS while running an EC2 cluster. I change the
>> settings
>> > in core-site.xml and in hdfs-site.xml, and start hadoop services, and it
>> > fails with error messages.
>> >
>> > Is there a place where this is clearly described?
>> >
>> > Thank you so much.
>> >
>> > Mark
>>
>>
>>
>> --
>> Harsh J
>> Customer Ops. Engineer, Cloudera
>>



-- 
Harsh J
Customer Ops. Engineer, Cloudera


Re: Using S3 instead of HDFS

2012-01-17 Thread Mark Kerzner
Well, here is my error message

Starting Hadoop namenode daemon: starting namenode, logging to
/usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out
ERROR. Could not start Hadoop namenode daemon
Starting Hadoop secondarynamenode daemon: starting secondarynamenode,
logging to
/usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out
Exception in thread "main" java.lang.IllegalArgumentException: Invalid URI
for NameNode address (check fs.default.name): s3n://myname.testdata is not
of scheme 'hdfs'.
at
org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:150)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624)
ERROR. Could not start Hadoop secondarynamenode daemon

And, if I don't need to start the NameNode, then where do I give the S3
credentials?

Thank you,
Mark


On Wed, Jan 18, 2012 at 12:36 AM, Harsh J  wrote:

> Hey Mark,
>
> What is the exact trouble you run into? What do the error messages
> indicate?
>
> This should be definitive enough I think:
> http://wiki.apache.org/hadoop/AmazonS3
>
> On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner 
> wrote:
> > Hi,
> >
> > whatever I do, I can't make it work, that is, I cannot use
> >
> > s3://host
> >
> > or s3n://host
> >
> > as a replacement for HDFS while running an EC2 cluster. I change the
> settings
> > in core-site.xml and in hdfs-site.xml, and start hadoop services, and it
> > fails with error messages.
> >
> > Is there a place where this is clearly described?
> >
> > Thank you so much.
> >
> > Mark
>
>
>
> --
> Harsh J
> Customer Ops. Engineer, Cloudera
>


Re: Using S3 instead of HDFS

2012-01-17 Thread Harsh J
Hey Mark,

What is the exact trouble you run into? What do the error messages indicate?

This should be definitive enough I think: http://wiki.apache.org/hadoop/AmazonS3

On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner  wrote:
> Hi,
>
> whatever I do, I can't make it work, that is, I cannot use
>
> s3://host
>
> or s3n://host
>
> as a replacement for HDFS while running an EC2 cluster. I change the settings
> in core-site.xml and in hdfs-site.xml, and start hadoop services, and it
> fails with error messages.
>
> Is there a place where this is clearly described?
>
> Thank you so much.
>
> Mark



-- 
Harsh J
Customer Ops. Engineer, Cloudera


Using S3 instead of HDFS

2012-01-17 Thread Mark Kerzner
Hi,

whatever I do, I can't make it work, that is, I cannot use

s3://host

or s3n://host

as a replacement for HDFS while running an EC2 cluster. I change the settings
in core-site.xml and in hdfs-site.xml, and start hadoop services, and it
fails with error messages.

Is there a place where this is clearly described?

Thank you so much.

Mark


RE: How to find out whether a node is Overloaded from Cpu utilization ?

2012-01-17 Thread ArunKumar


Guys!

So can I say that if memory usage is more than, say, 90%, the node is
overloaded?
If so, what should that threshold value be, or how can we find it?



Arun



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-find-out-whether-a-node-is-Overloaded-from-Cpu-utilization-tp3665289p3668167.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: race condition in hadoop 0.20.2 (cdh3u1)

2012-01-17 Thread Stan Rosenberg
On Tue, Jan 17, 2012 at 6:38 PM, Brock Noland  wrote:
> This class is invalid. A single thread will be executing your mapper
> or reducer but there will be multiple threads (background threads such
> as the SpillThread) creating MyKey instances which is exactly what you
> are seeing. This is by design.
>

Could you please refer me to where this design decision/assumption
is/was documented?  Imho, this assumption clashes with the overall
object re-use methodology. I would have at least considered
making 'readFields' and 'write' synchronized, even if only to
indicate that there are multiple threads executing
serialization/de-serialization.  (As only a few threads are competing
in this case, the performance penalty would have been negligible.)

Thanks,

stan


Re: race condition in hadoop 0.20.2 (cdh3u1)

2012-01-17 Thread Brock Noland
Hi,

tl;dr DUMMY should not be static.
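
For concreteness, a minimal sketch of that non-static variant (keeping the
names from Stan's snippet quoted below; compareTo is a placeholder that orders
on ip only):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

class MyKey implements WritableComparable<MyKey> {
  private String ip;                 // first part of the key
  // Per-instance scratch object instead of a static one: the SpillThread and
  // the main thread operate on different MyKey instances, so write() and
  // readFields() no longer share mutable state.
  private final Text dummy = new Text();

  public void write(DataOutput out) throws IOException {
    dummy.set(ip);
    dummy.write(out);
    // ... serialize the rest of the key
  }

  public void readFields(DataInput in) throws IOException {
    dummy.readFields(in);
    ip = dummy.toString();
    // ... de-serialize the rest of the key
  }

  @Override
  public int compareTo(MyKey other) {
    return ip.compareTo(other.ip);
  }
}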

On Tue, Jan 17, 2012 at 3:21 PM, Stan Rosenberg
 wrote:
>
>
> class MyKey implements WritableComparable {
>  private String ip; // first part of the key
>  private final static Text DUMMY = new Text();
>  ...
>
>  public void write(DataOutput out) throws IOException {
>     // serialize the first part of the key
>     DUMMY.set(ip);
>     DUMMY.write(out);
>     ...
>  }
>
>  public void readFields(DataInput in) throws IOException {
>    // de-serialize the first part of the key
>    DUMMY.readFields(in); ip = DUMMY.toString();
>    
>  }
> }

This class is invalid. A single thread will be executing your mapper
or reducer but there will be multiple threads (background threads such
as the SpillThread) creating MyKey instances which is exactly what you
are seeing. This is by design.

Brock


Re: NameNode per-block memory usage?

2012-01-17 Thread M. C. Srivas
Konstantin's paper
http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf

mentions that on average a file consumes about 600 bytes of memory in the
name-node (1 file object + 2 block objects).

To quote from his paper (see page 9)

".. in order to store 100 million files (referencing 200 million blocks) a
name-node should have at least 60GB of RAM. This matches observations on
deployed clusters".



On Tue, Jan 17, 2012 at 7:08 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hello,
>
> How much memory/JVM heap does NameNode use for each block?
>
> I've tried locating this in the FAQ and on search-hadoop.com, but
> couldn't find a ton of concrete numbers, just these two:
>
> http://search-hadoop.com/m/RmxWMVyVvK1 - 150 bytes/block?
> http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object?
>
> Thanks,
> Otis
>


Error Using Hadoop .20.2/Mahout.4 on Solr 3.4

2012-01-17 Thread Peyman Mohajerian
Hi Guys,

I'm running Clojure code inside Solr 3.4 that makes calls to Mahout
0.4 for a text clustering job. Due to some issues with Clojure I had
to put all the jar files in the Solr war file ('WEB-INF/lib'). I also
made sure to put the hadoop core and mapreduce config xml files in the
same location, with a value of ('file:/// or
hdfs://localhosthost:9000..) for 'fs.default.name'.
However, I get the following stack trace when running the code. I
noticed a bug fix in Hadoop YARN related to running unit tests with the
same trace. However, I know for sure my code has worked for others
using Hadoop 0.20.2. Any ideas what could be wrong?

Thanks,
Peyman


SEVERE: java.lang.IllegalStateException: Variable substitution depth
too large: 20 ${fs.default.name}
   at 
org.apache.hadoop.conf.Configuration.substituteVars(Configuration.java:366)
   at org.apache.hadoop.conf.Configuration.get(Configuration.java:436)
   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
   at lsa4solr.mahout_matrix$distributed_matrix.invoke(mahout_matrix.clj:71)
   at 
lsa4solr.clustering_protocol$get_frequency_matrix.invoke(clustering_protocol.clj:92)
   at 
lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:105)
   at 
lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:123)
   at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:57)
   at lsa4solr.cluster$_cluster.invoke(cluster.clj:85)
   at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
   at 
org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
   at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
   at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
   at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
   at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
   at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
   at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
   at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
   at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
   at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
   at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
   at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
   at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
   at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
   at org.mortbay.jetty.Server.handle(Server.java:326)
   at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
   at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
   at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
   at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

Jan 16, 2012 11:42:17 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/lsa4solr
params={nclusters=2&k=10&q=Summary:.*&rows=100&algorithm=kmeans}
hits=0 status=500 QTime=37
Jan 16, 2012 11:42:17 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.IllegalStateException: Variable substitution depth
too large: 20 ${fs.default.name}
   at 
org.apache.hadoop.conf.Configuration.substituteVars(Configuration.java:366)
   at org.apache.hadoop.conf.Configuration.get(Configuration.java:436)
   at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
   at lsa4solr.mahout_matrix$distributed_matrix.invoke(mahout_matrix.clj:71)
   at 
lsa4solr.clustering_protocol$get_frequency_matrix.invoke(clustering_protocol.clj:92)
   at 
lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:105)
   at 
lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:123)
   at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:57)
   at lsa4solr.cluster$_cluster.invoke(cluster.clj:85)
   at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
   at 
org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
   at 
org.apache.solr.handler.co

race condition in hadoop 0.20.2 (cdh3u1)

2012-01-17 Thread Stan Rosenberg
Hi,

This posting is essentially about a bug, but it is also related to a
programmatic idiom endemic to hadoop.  Thus, I am posting to
'common-user' as opposed to 'common-dev'; if the latter is more
appropriate, please
let me know.  Also, I checked jira and was unable to find a bug match.

Synopsis
=

It appears that there is a race condition between the spill thread and
the main thread.  The race condition results in data corruption,
specifically ArrayIndexOutOfBoundsException in
org.apache.hadoop.mapred.MapTask
on line 1134, when running hadoop 0.20.2 (cdh3u1).   The race occurs
when the spill thread is executing 'readFields' on key(s) while
concurrently the main thread is executing 'write' in order to
serialize mapper's output.
Although this is cloudera's distribution, I believe this also affects
apache's distribution.

Details
==

Now let's delve into the details.  In order to avoid object
allocation, we are led to rely on the singleton pattern.  So, consider
this code snippet which illustrates a custom key implementation which
for the sake of performance re-uses an instance of Text:

class MyKey implements WritableComparable {
  private String ip; // first part of the key
  private final static Text DUMMY = new Text();
  ...

  public void write(DataOutput out) throws IOException {
 // serialize the first part of the key
 DUMMY.set(ip);
 DUMMY.write(out);
 ...
  }

  public void readFields(DataInput in) throws IOException {
// de-serialize the first part of the key
DUMMY.readFields(in); ip = DUMMY.toString();

  }
}

The problem with the above is that DUMMY is a shared variable which
shouldn't matter in single-threaded environments.  However, the spill
thread and the main thread are running concurrently, and as exhibited
by the stack traces below, the data race ensues.  Note that the actual
exception is java.lang.ArrayIndexOutOfBoundsException in
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1134).
In order to produce the stack traces, I caught the actual exception
inside 'write' of my custom key class and dumped the stack traces of
all running threads.


Thread[main,5,main]
java.lang.Thread.dumpThreads(Native Method)
java.lang.Thread.getAllStackTraces(Unknown Source)
com.proclivitysystems.etl.psguid.PSguidOrTid.write(PSguidOrTid.java:206)

org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)

org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)

org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:918)

org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574)

org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)

com.proclivitysystems.etl.psguid.WECPreprocessor.map(WECPreprocessor.java:142)

com.proclivitysystems.etl.psguid.WECPreprocessor.map(WECPreprocessor.java:25)
org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
org.apache.hadoop.mapred.Child$4.run(Child.java:270)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Unknown Source)

org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
org.apache.hadoop.mapred.Child.main(Child.java:264)

Thread[SpillThread,5,main]
java.lang.Thread.dumpThreads(Native Method)
java.lang.Thread.getAllStackTraces(Unknown Source)

com.proclivitysystems.etl.psguid.PSguidOrTid.readFields(PSguidOrTid.java:240)

org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:125)

org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:968)
org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:101)
org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)

org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1254)

org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:712)

org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1199)


java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException
at 
com.proclivitysystems.etl.psguid.PSguidOrTid.write(PSguidOrTid.java:214)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:918)
at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574

org.apache.hadoop.mapred.Merger merge bug

2012-01-17 Thread Bai Shen
I think I've found a bug in the Merger code for Hadoop.

When the Map job runs, it creates spill files based on io.sort.mb.  It then
sorts io.sort.factor files at a time in order to create an output file
that's passed to the reduce job.  The higher these two settings are
configured, the more memory is used.

However, as far as I can tell, the memory used is the same no matter what
the io.sort parameters are set to.

For example, with io.sort.mb of 256, io.sort factor of 10 and 10 spill
files, we get the following scenario.  The merger merges all 10 of those
spill files into one output file, using roughly 2.5GB of memory.

If we change the io.sort.factor to 4, then the merger will merge 4 of the
10 spill files and output the result as a temp file on the hard drive.  It
then adds the resulting file back into the merge queue.  It repeats this
action with the next 4 spill files.

Now we have 2 spill files remaining and the 2 temp files which are each 4
spill files combined.  So on the third pass of the merger, we're back to
merging everything into one output file, using roughly 2.5GB of memory.

No matter what you set your io.sort.factor to, you will eventually end up
using the same amount of memory.  It's just that lower factors will take
longer due to the intermediate steps.  As such, if you only have 2GB of
memory available for the Map job, you will get an OutOfMemoryException
every time you attempt to run the job.

Can anyone confirm what I'm seeing or point out any flaws in my reasoning?

Thanks.


Re: effect on data after topology change

2012-01-17 Thread rk vishu
Thank you very much, Todd. I hope future versions of the hadoop rebalancer will
include this check.

I have one more question.

If we are in the process of setting up additional nodes incrementally in a
different rack (say rack-2), and rack-2's size is only 25% of rack-1's, how
would data be balanced (with the default implementation)?
i.e. Will hadoop prefer balancing across all nodes, or will it try to obey
the topology first, which could fill up rack-2 quickly?  I am positive that
it will try to balance across all nodes, but want to be sure.

Thanks and Regards
Ravi
On Tue, Jan 17, 2012 at 10:41 AM, Todd Lipcon  wrote:

> Hi Ravi,
>
> You'll probably need to up the replication level of the affected files
> and then drop it back down to the desired level. Current versions of
> HDFS do not automatically repair rack policy violations if they're
> introduced in this manner.
>
> -Todd
>
> On Mon, Jan 16, 2012 at 3:53 PM, rk vishu  wrote:
> > Hello All,
> >
> > If i change the rackid for some nodes and restart namenode, will data be
> > rearranged accordingly? Do i need to run rebalancer?
> >
> > Any information on this would be appreciated.
> >
> > Thanks and Regards
> > Ravi
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>


Re: effect on data after topology change

2012-01-17 Thread Todd Lipcon
Hi Ravi,

You'll probably need to up the replication level of the affected files
and then drop it back down to the desired level. Current versions of
HDFS do not automatically repair rack policy violations if they're
introduced in this manner.
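
A hedged sketch of that bump-and-restore using the FileSystem API (the path
and replication levels are placeholders; replication is set per file, so in
practice you would walk the affected files, or use "hadoop fs -setrep -R"
from the shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RepairRackPlacement {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path affected = new Path("/data/affected/part-00000");  // placeholder file
    // Step 1: raise replication so new replicas get placed under the current
    // topology. In a real run you would wait for re-replication to finish
    // (watch fsck or the NameNode UI) before the next step.
    fs.setReplication(affected, (short) 4);
    // Step 2: drop back to the desired level; excess replicas are removed.
    fs.setReplication(affected, (short) 3);
  }
}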

-Todd

On Mon, Jan 16, 2012 at 3:53 PM, rk vishu  wrote:
> Hello All,
>
> If i change the rackid for some nodes and restart namenode, will data be
> rearranged accordingly? Do i need to run rebalancer?
>
> Any information on this would be appreciated.
>
> Thanks and Regards
> Ravi



-- 
Todd Lipcon
Software Engineer, Cloudera


RE: How to find out whether a node is Overloaded from Cpu utilization ?

2012-01-17 Thread Bill Brune

Hi,

The significant factor in cluster loading is memory, not CPU.  Hadoop views the
cluster only with respect to memory and does not care about CPU utilization or disk
saturation.  If you run too many TaskTrackers, you risk memory overcommit, where
the Linux OOM killer will come out of the closet and randomly kill processes, which
will ultimately take out the box.  Disk saturation is another issue: it will
contribute to datanode timeouts, which lead to a downward spiral where the
namenode will think the datanode is down and start
replicating its blocks, adding to the overall I/O load and contributing to other
datanode timeouts.  If your datanode CPUs are pegged then this contributes as
well, but not as much.
Generally, if you carve up your cluster properly, you won't have CPU
over-utilization issues; quite the opposite.

-Bill

-Original Message-
From: Amandeep Khurana [mailto:ama...@gmail.com]
Sent: Mon 1/16/2012 10:21 PM
To: common-user@hadoop.apache.org
Cc: hadoop-u...@lucene.apache.org
Subject: Re: How to find out whether a node is Overloaded from Cpu utilization ?
 
Arun,

I don't think you'll hear a fixed number. Having said that, I have seen CPU
being pegged at 95% during jobs and the cluster working perfectly fine. On
the slaves, if you have nothing else going on, Hadoop only has TaskTrackers
and DataNodes. Those two daemons are relatively light weight in terms of
CPU for the most part. So, you can afford to let your tasks take up a high
%.

Hope that helps.

-Amandeep

On Tue, Jan 17, 2012 at 2:16 PM, ArunKumar  wrote:

> Hi  Guys !
>
> When we get the CPU utilization value of a node in a hadoop cluster, what
> percent value can be considered overloaded?
> Say, for example:
>
>    CPU utilization    Node Status
>    85%                Overloaded
>    20%                Normal
>
>
> Arun
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-find-out-whether-a-node-is-Overloaded-from-Cpu-utilization-tp3665289p3665289.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
>



Re: NameNode per-block memory usage?

2012-01-17 Thread Edward Capriolo
On Tue, Jan 17, 2012 at 10:08 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hello,
>
> How much memory/JVM heap does NameNode use for each block?
>
> I've tried locating this in the FAQ and on search-hadoop.com, but
> couldn't find a ton of concrete numbers, just these two:
>
> http://search-hadoop.com/m/RmxWMVyVvK1 - 150 bytes/block?
> http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object?
>
> Thanks,
> Otis
>

Some real-world statistics from the NN web interface (replication factor = 2):

Cluster Summary
22,061,605 files and directories, 22,151,870 blocks = 44,213,475 total.
Heap Size is 10.85 GB / 16.58 GB (65%)

CompressedOops is enabled.


Re: NameNode per-block memory usage?

2012-01-17 Thread Joey Echeverria
> How much memory/JVM heap does NameNode use for each block?

I don't remember the exact number; it also depends on which version of
Hadoop you're using.

> http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object?

It's 1 GB for every *million* objects (files, blocks, etc.). This is a
good rule of thumb, at least for the 0.20.x/1.0.0 series.

Is there a reason you need a more exact estimate?

-Joey

-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


NameNode per-block memory usage?

2012-01-17 Thread Otis Gospodnetic
Hello,

How much memory/JVM heap does NameNode use for each block?

I've tried locating this in the FAQ and on search-hadoop.com, but couldn't find 
a ton of concrete numbers, just these two:

http://search-hadoop.com/m/RmxWMVyVvK1 - 150 bytes/block?
http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object?

Thanks,
Otis


Re: hadoop filesystem cache

2012-01-17 Thread Rita
My intention isn't to make it a mandatory feature, just an option.
Keeping data locally on a filesystem as a kind of Lx cache is far better
than getting it from the network, and the cost of the fs buffer cache is much
cheaper than an RPC call.

On Mon, Jan 16, 2012 at 1:07 PM, Edward Capriolo wrote:

> The challenge with this design is that accessing the same data over and
> over again is the uncommon use case for hadoop. Hadoop's bread and butter is
> all about streaming through large datasets that do not fit in memory. Also,
> your shuffle-sort-spill is going to play havoc on any filesystem-based
> cache. The distributed cache roughly fits this role, except that it does not
> persist after a job.
>
> Replicating content to N nodes also is not a hard problem to tackle (you
> can hack up a content delivery system with ssh+rsync) and get similar
> results. The approach often taken has been to keep data that is accessed
> repeatedly and fits in memory in some other system
> (hbase/cassandra/mysql/whatever).
>
> Edward
>
>
> On Mon, Jan 16, 2012 at 11:33 AM, Rita  wrote:
>
> > Thanks. I believe this is a good feature to have for clients especially
> if
> > you are reading the same large file over and over.
> >
> >
> > On Sun, Jan 15, 2012 at 7:33 PM, Todd Lipcon  wrote:
> >
> > > There is some work being done in this area by some folks over at UC
> > > Berkeley's AMP Lab in coordination with Facebook. I don't believe it
> > > has been published quite yet, but the title of the project is "PACMan"
> > > -- I expect it will be published soon.
> > >
> > > -Todd
> > >
> > > On Sat, Jan 14, 2012 at 5:30 PM, Rita  wrote:
> > > > After reading this article,
> > > > http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I was
> > > > wondering if there was a filesystem cache for hdfs. For example, if a large
> > > > file (10 gigabytes) keeps getting accessed on the cluster, instead of
> > > > fetching it from the network each time, why not store the content of the file
> > > > locally on the client itself.  A use case on the client would be like this:
> > > >
> > > >
> > > >
> > > > <property>
> > > >   <name>dfs.client.cachedirectory</name>
> > > >   <value>/var/cache/hdfs</value>
> > > > </property>
> > > >
> > > >
> > > > <property>
> > > >   <name>dfs.client.cachesize</name>
> > > >   <description>in megabytes</description>
> > > >   <value>10</value>
> > > > </property>
> > > >
> > > >
> > > > Any thoughts of a feature like this?
> > > >
> > > >
> > > > --
> > > > --- Get your facts first, then you can distort them as you please.--
> > >
> > >
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> > >
> >
> >
> >
> > --
> > --- Get your facts first, then you can distort them as you please.--
> >
>



-- 
--- Get your facts first, then you can distort them as you please.--