Re: Does hadoop support append option?

2011-10-17 Thread Uma Maheswara Rao G 72686
- Original Message -
From: kartheek muthyala 
Date: Tuesday, October 18, 2011 11:54 am
Subject: Re: Does hadoop support append option?
To: common-user@hadoop.apache.org

> I am just concerned about the use case of appends in Hadoop. I know that
> they have provided support for appends in Hadoop. But how frequently do
> the files actually get appended?
 In the normal case, a file's block details are not persisted in the edit log 
before the file is closed; that happens only as part of close. If the NN 
restarts before the file is closed, we lose this data.

 Consider a case where we have a very big file and the data is also very 
important. In this case we should have an option to persist the block details 
into the edit log frequently, in order to avoid data loss if the NN restarts. 
For this, DFS exposes an API called sync, which persists the edit log entries 
to disk. To reopen the stream again later, we use the append API.

In trunk, this support has been refactored cleanly and many corner cases have 
been handled. The API is also provided as hflush.
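
A rough sketch of that flow from the client side (untested, and assuming a 
release where append is enabled, e.g. dfs.support.append set to true on the 
cluster; the path is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncAndAppendSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/important.log");   // example path

    // Write some records and make them durable without closing the file.
    FSDataOutputStream out = fs.create(file);
    out.writeBytes("first batch of records\n");
    out.sync();   // persists what was written so far (exposed as hflush in trunk)
    out.close();

    // Later, reopen the same file and continue writing at the end.
    FSDataOutputStream appendOut = fs.append(file);
    appendOut.writeBytes("records added after reopening\n");
    appendOut.sync();
    appendOut.close();
  }
}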

> There is also a version concept maintained in the block report; my guess
> is that this version number is maintained to make sure that if a datanode
> gets disconnected and comes back with an old copy of the data, read
> requests to that datanode are discarded. But if the files are not appended
> frequently, does the version number remain the same? Any typical use case
> you guys can point to?
> 
I am not sure what your exact question is here. Can you please clarify it a bit 
more?

> ~Kartheek
> 
> On Mon, Oct 17, 2011 at 12:53 PM, Uma Maheswara Rao G 72686 <
> mahesw...@huawei.com> wrote:
> 
> > AFAIK, the append option is there in the 20Append branch. It mainly
> > supports sync, but there are some issues with that.
> >
> > Same has been merged to the 20.205 branch and will be released soon (rc2
> > available). Many bugs have also been fixed in this branch. As per our basic
> > testing it is pretty good as of now. Need to wait for the official release.
> >
> > Regards,
> > Uma
> >
> > - Original Message -
> > From: bourne1900 
> > Date: Monday, October 17, 2011 12:37 pm
> > Subject: Does hadoop support append option?
> > To: common-user 
> >
> > > I know that hadoop0.19.0 supports append option, but not stable.
> > > Does the latest version support append option? Is it stable?
> > > Thanks for help.
> > >
> > >
> > >
> > >
> > > bourne
> >
> 

Regards,
Uma


Re: Does hadoop support append option?

2011-10-17 Thread kartheek muthyala
I am just concerned about the use case of appends in Hadoop. I know that
support for appends has been provided in Hadoop, but how frequently do files
actually get appended? There is also a version concept maintained in the block
report; my guess is that this version number is maintained to make sure that if
a datanode gets disconnected and comes back with an old copy of the data, read
requests to that datanode are discarded. But if the files are not appended
frequently, does the version number remain the same? Any typical use case you
guys can point to?

~Kartheek

On Mon, Oct 17, 2011 at 12:53 PM, Uma Maheswara Rao G 72686 <
mahesw...@huawei.com> wrote:

> AFAIK, the append option is there in the 20Append branch. It mainly supports
> sync, but there are some issues with that.
>
> Same has been merged to the 20.205 branch and will be released soon (rc2
> available). Many bugs have also been fixed in this branch. As per our basic
> testing it is pretty good as of now. Need to wait for the official release.
>
> Regards,
> Uma
>
> - Original Message -
> From: bourne1900 
> Date: Monday, October 17, 2011 12:37 pm
> Subject: Does hadoop support append option?
> To: common-user 
>
> > I know that hadoop0.19.0 supports append option, but not stable.
> > Does the latest version support append option? Is it stable?
> > Thanks for help.
> >
> >
> >
> >
> > bourne
>


Re: How do I connect Java Visual VM to a remote task?

2011-10-17 Thread Harsh J
Hello,

(Inline)

On Tue, Oct 18, 2011 at 12:04 AM, W.P. McNeill  wrote:

> 1. *Turn on JMX remote for the tasks*...I added the following options to
> mapred.child.java.opts:
> com.sun.management.jmxremote,
> com.sun.management.jmxremote.port=8004,com.sun.management.jmxremote.authenticate=false,com.sun.management.jmxremote.ssl=false.
>
> This does not work because there is contention for the JMX remote port when
> multiple tasks run on the same node. All but the first task fail at JVM
> initialization time, causing the job to fail before I can see the repro.

For profiling this way, you are probably interested in just one task. So
reduce your slots to 1 and that's an easy way out - one mapper at a time,
reusing the port as it goes.
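
As a rough sketch (just an assumption of how I'd wire it up; adapt names and
ports to your setup), the child JVM flags could be set like this on the job,
with the slot count lowered on the TaskTracker side:

import org.apache.hadoop.mapred.JobConf;

public class JmxChildOptsSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Pass the JMX flags as -D JVM options to the child tasks. Port 8004 is
    // only safe if a single task runs per node, i.e. also set
    // mapred.tasktracker.map.tasks.maximum=1 on the TaskTracker(s).
    conf.set("mapred.child.java.opts",
        "-Xmx512m"
        + " -Dcom.sun.management.jmxremote"
        + " -Dcom.sun.management.jmxremote.port=8004"
        + " -Dcom.sun.management.jmxremote.authenticate=false"
        + " -Dcom.sun.management.jmxremote.ssl=false");
  }
}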

> 2. *Use jstatd*...I tried running jstatd in the background on my cluster
> nodes. It launches and runs, but when I try to connect using Visual VM,
> nothing happens.

While I find it odd that jstatd doesn't seem to expose the host's JVM
metrics for you, I don't think jstatd would let you do memory
profiling AFAIK. You need JMX for that, right? You can observe heap
charts with jstatd running though, I think.

> I am going to try adding -XX:-HeapDumpOnOutOfMemoryError, which will at
> least give me post-mortem information. Does anyone know where the heap dump
> file will  be written?

Set keep.failed.task.files to true for your job, then hunt down the
attempt directory in the mapred.local.dir of the TaskTracker that ran it.
An easier way is to also log the child's working directory from your Java
code, so you can see which disk it is on when you check the logs. Under
the attempt dir, you should be able to locate your heap dump.
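
Something like this (just a sketch with made-up class names) is what I mean
by logging the working directory:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In the driver: job.getConfiguration().setBoolean("keep.failed.task.files", true);
public class PwdLoggingMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Goes to the task's stderr log, so you can see which local disk the
    // attempt directory (and any heap dump inside it) landed on.
    System.err.println("task working dir = " + System.getProperty("user.dir"));
  }
}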

> Has anyone debugged a similar setup? What tools did you use?

I think you'll find some (possibly odd looking) ways described on
https://issues.apache.org/jira/browse/MAPREDUCE-2637, similar to your
approach.

-- 
Harsh J


Re: Building and adding new Datanode

2011-10-17 Thread Harsh J
Hey Alexander,

Just install the DataNode packages on the machine and configure it
(copy over a config from an existing DN, perhaps, but make sure to check
the dfs.data.dir properties before you start), and start it up.

It will join the cluster as long as the configuration points to the
right fs.default.name address. Same for TaskTracker as long as it has
the right mapred.job.tracker address.
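
If you want to double-check what the copied config resolves to before
starting the daemons, a small sketch like this (assuming the conf directory
is on the classpath) prints the relevant addresses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class CheckClusterConfSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
    System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
    System.out.println("dfs.data.dir       = " + conf.get("dfs.data.dir"));
    // If this connects, the new node is pointed at the right NameNode.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("connected to       = " + fs.getUri());
  }
}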

If you are using SCM, and this question is along the lines of using
SCM to do such a thing, please direct your question to
scm-u...@cloudera.org
(https://groups.google.com/a/cloudera.org/group/scm-user)

On Tue, Oct 18, 2011 at 6:11 AM, Gauthier, Alexander
 wrote:
> Hi guys, noob questions;
>
>
>
> What do I need to install a new node soon to be added to a cluster and
> how do I add it? I'm using CDH3 distribution.
>
>
>
> Thank you!!
>
>
>
> Alex Gauthier
>
> Engineering Manager
> Teradata Corp.
> Mobile: 510-427-5447
> Office: 858-485-2144
> fax: 858-485-2581
> www.teradata.com 
>
>
>
>



-- 
Harsh J


Re: Hadoop node disk failure - reinstall question

2011-10-17 Thread Uma Maheswara Rao G 72686
- Original Message -
From: Mayuran Yogarajah 
Date: Tuesday, October 18, 2011 4:24 am
Subject: Hadoop node disk failure - reinstall question
To: "common-user@hadoop.apache.org" 

> One of our nodes died today, it looks like the disk containing the 
> OS 
> expired.  I will need to reinstall the machine.
> Are there any known issues with using the same hostname / IP again, 
> or 
> is it better to give it a new IP / host name ?
> 
> The second disk on the machine is still operational and contains 
> HDFS 
> data so I plan on mounting it.  Is this ill-advised? Should I just 
> wipe 
> that disk too ?
Copying that data to the new machine would be a good option. It again depends 
on the replication: if you have enough replicas in your cluster, then 
replication to other nodes will happen automatically, and in that case you need 
not even worry about the old data. 
> 
> thanks,
> M
> 


Re: SimpleKMeansCLustering - "Failed to set permissions of path to 0700"

2011-10-17 Thread Raj Vishwanathan
Can you run any map/reduce jobs, such as word count?
Raj

Sent from my iPad
Please excuse the typos. 

On Oct 17, 2011, at 5:18 PM, robpd  wrote:

> Hi
> 
> I am new to Mahout and Hadoop. I'm currently trying to get the
> SimpleKMeansClustering example from the Maout in Action book to work. I am
> running the whole thing from under cygwin from a sh script (in which I
> explicitly add the necessary jars to the classpath).
> 
> Unfortunately I get...
> 
> Exception in thread "main" java.io.IOException: Failed to set permissions of
> path: file:/tmp/hadoop-Rob/mapred/staging/Rob1823346078/.staging to 0700
> at org.apache.hadoop.fs.RawLocalFileSystem.checkReturnValue(RawLocalFileSystem.java:525)
> at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:499)
> at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:318)
> at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:183)
> at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:797)
> at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Unknown Source)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:791)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:494)
> at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:362)
> at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:310)
> at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:237)
> at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:152)
> at kmeans.SimpleKMeansClustering.main(Unknown Source)
> 
> This is presumably because, although I am running under cygwin, Windows will
> not allow a change of privilege like this? I have done the following to
> address the problem, but without any success...
> 
> a) Ensure I started Hadoop prior to running the program (start-all.sh)
> 
> b) Edited the Hadoop conf file hdfs-site.xml to switch off the file
> permissions in HDFS:
>
> <property>
>   <name>dfs.permissions</name>
>   <value>false</value>
> </property>
> 
> c) I issued a hadoop fs -chmod +rwx -R /tmp to ensure that everyone was
> allowed to write to anything under tmp
> 
> I'd be very grateful for some help here if you have some ideas. Sorry if I am
> being green about things. It does seem like there's lots to learn.
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SimpleKMeansCLustering-Failed-to-set-permissions-of-path-to-0700-tp3429867p3429867.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: Is there a good way to see how full hdfs is

2011-10-17 Thread Rajiv Chittajallu
If you are running > 0.20.204, you can query the NameNode's JMX servlet, e.g.:
http://phanpy-nn1.hadoop.apache.org:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo
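
If you want it programmatically, a minimal sketch that pulls the JSON from the
/jmx servlet could look like this (the host name below is just a placeholder
for your NameNode):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class NameNodeJmxProbe {
  public static void main(String[] args) throws Exception {
    // NameNodeInfo exposes capacity/used/free figures as JSON (0.20.204+).
    String url = "http://namenode.example.com:50070/jmx"
        + "?qry=Hadoop:service=NameNode,name=NameNodeInfo";
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new URL(url).openStream(), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);   // parse with your favourite JSON library
    }
    in.close();
  }
}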


ivan.nov...@emc.com wrote on 10/17/11 at 09:18:20 -0700:
>Hi Harsh,
>
>I need access to the data programatically for system automation, and hence
>I do not want a monitoring tool but access to the raw data.
>
>I am more than happy to use an exposed function or client program and not
>an internal API.
>
>So i am still a bit confused... What is the simplest way to get at this
>raw disk usage data programmatically?  Is there a HDFS equivalent of du
>and df, or are you suggesting to just run that on the linux OS (which is
>perfectly doable).
>
>Cheers,
>Ivan
>
>
>On 10/17/11 9:05 AM, "Harsh J"  wrote:
>
>>Uma/Ivan,
>>
>>The DistributedFileSystem class explicitly is _not_ meant for public
>>consumption, it is an internal one. Additionally, that method has been
>>deprecated.
>>
>>What you need is FileSystem#getStatus() if you want the summarized
>>report via code.
>>
>>A job, that possibly runs "du" or "df", is a good idea if you
>>guarantee perfect homogeneity of path names in your cluster.
>>
>>But I wonder, why won't using a general monitoring tool (such as
>>nagios) for this purpose cut it? What's the end goal here?
>>
>>P.s. I'd moved this conversation to hdfs-user@ earlier on, but now I
>>see it being cross posted into mr-user, common-user, and common-dev --
>>Why?
>>
>>On Mon, Oct 17, 2011 at 9:25 PM, Uma Maheswara Rao G 72686
>> wrote:
>>> We can write the simple program and you can call this API.
>>>
>>> Make sure Hadoop jars presents in your class path.
>>> Just for more clarification, DN will send their stats as parts of
>>>hertbeats, So, NN will maintain all the statistics about the diskspace
>>>usage for the complete filesystem and etc... This api will give you that
>>>stats.
>>>
>>> Regards,
>>> Uma
>>>
>>> - Original Message -
>>> From: ivan.nov...@emc.com
>>> Date: Monday, October 17, 2011 9:07 pm
>>> Subject: Re: Is there a good way to see how full hdfs is
>>> To: common-user@hadoop.apache.org, mapreduce-u...@hadoop.apache.org
>>> Cc: common-...@hadoop.apache.org
>>>
 So is there a client program to call this?

 Can one write their own simple client to call this method from all
 disks on the cluster?

 How about a map reduce job to collect from all disks on the cluster?

 On 10/15/11 4:51 AM, "Uma Maheswara Rao G 72686"
 wrote:

 >/** Return the disk usage of the filesystem, including total capacity,
 >   * used space, and remaining space */
 >  public DiskStatus getDiskStatus() throws IOException {
 >return dfs.getDiskStatus();
 >  }
 >
 >DistributedFileSystem has the above API from java API side.
 >
 >Regards,
 >Uma
 >
 >- Original Message -
 >From: wd 
 >Date: Saturday, October 15, 2011 4:16 pm
 >Subject: Re: Is there a good way to see how full hdfs is
 >To: mapreduce-u...@hadoop.apache.org
 >
 >> hadoop dfsadmin -report
 >>
 >> On Sat, Oct 15, 2011 at 8:16 AM, Steve Lewis
 >>  wrote:
 >> > We have a small cluster with HDFS running on only 8 nodes - I
 >> believe that
 >> > the partition assigned to hdfs might be getting full and
 >> > wonder if the web tools or java api have a way to look at free
 >> space on
 >> > hdfs
 >> >
 >> > --
 >> > Steven M. Lewis PhD
 >> > 4221 105th Ave NE
 >> > Kirkland, WA 98033
 >> > 206-384-1340 (cell)
 >> > Skype lordjoe_com
 >> >
 >> >
 >> >
 >>
 >


>>>
>>
>>
>>
>>-- 
>>Harsh J
>>
>




SimpleKMeansCLustering - "Failed to set permissions of path to 0700"

2011-10-17 Thread robpd
Hi

I am new to Mahout and Hadoop. I'm currently trying to get the
SimpleKMeansClustering example from the Maout in Action book to work. I am
running the whole thing from under cygwin from a sh script (in which I
explicitly add the necessary jars to the classpath).

Unfortunately I get...

Exception in thread "main" java.io.IOException: Failed to set permissions of
path: file:/tmp/hadoop-Rob/mapred/staging/Rob1823346078/.staging to 0700
at org.apache.hadoop.fs.RawLocalFileSystem.checkReturnValue(RawLocalFileSystem.java:525)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:499)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:318)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:183)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:797)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:791)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:494)
at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:362)
at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClustersMR(KMeansDriver.java:310)
at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:237)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:152)
at kmeans.SimpleKMeansClustering.main(Unknown Source)

This is presumably because, although I am running under cygwin, Windows will
not allow a change of privilege like this? I have done the following to
address the problem, but without any success...

a) Ensure I started Hadoop prior to running the program (start-all.sh)

b) Edited the Hadoop conf file hdfs-site.xml to switch off the file
permissions in HDFS:

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

c) I issued a hadoop fs -chmod +rwx -R /tmp to ensure that everyone was
allowed to write to anything under tmp

I'd be very grateful for some help here if you have some ideas. Sorry if I am
being green about things. It does seem like there's lots to learn.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SimpleKMeansCLustering-Failed-to-set-permissions-of-path-to-0700-tp3429867p3429867.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Is there a way to get the version of a remote hadoop instance

2011-10-17 Thread Tao.Zhang2
Hi, 

Is there a way to get the version of a remote hadoop instance through java
API?

Suppose there are two machines: A and B. I deploy the hadoop instance on
machine A, while my application is deployed on machine B. Before starting
my application, I want to check whether the hadoop instance version is
compatible with my application, so I want to get the version of the hadoop
instance on machine A from B.

I went through all the APIs and only found a VersionInfo class, which gives
you the local Hadoop version, not the remote instance's. Do you have any
ideas about this?
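
For reference, the local check I can do looks roughly like this (it only
reports the version of the Hadoop jars on machine B's classpath, which is
exactly the limitation):

import org.apache.hadoop.util.VersionInfo;

public class LocalHadoopVersion {
  public static void main(String[] args) {
    // Version of the client-side jars only, not of the remote instance on A.
    System.out.println("version  = " + VersionInfo.getVersion());
    System.out.println("revision = " + VersionInfo.getRevision());
  }
}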

Thanks and regards!
Tao



Building and adding new Datanode

2011-10-17 Thread Gauthier, Alexander
Hi guys, noob questions; 

 

What do I need to install for a new node that will soon be added to a cluster,
and how do I add it? I'm using the CDH3 distribution. 

 

Thank you!!

 

Alex Gauthier

Engineering Manager 
Teradata Corp.
Mobile: 510-427-5447
Office: 858-485-2144 
fax: 858-485-2581
www.teradata.com   

 



Re: Hadoop node disk failure - reinstall question

2011-10-17 Thread patrick sang
This is what I would think:

> Are there any known issues with using the same hostname / IP again, or is
it better to give it a new IP / host name ?
As far as I understand, it is actually good to keep the same hostname/IP;
otherwise the TTL in DNS or in the client library would bite you.

> The second disk on the machine is still operational and contains HDFS data
so I plan on mounting it.
I think it should be fine, because when you start the datanode it will
report to the namenode which blocks it has.



On Mon, Oct 17, 2011 at 3:53 PM, Mayuran Yogarajah <
mayuran.yogara...@casalemedia.com> wrote:

> One of our nodes died today, it looks like the disk containing the OS
> expired.  I will need to reinstall the machine.
> Are there any known issues with using the same hostname / IP again, or is
> it better to give it a new IP / host name ?
>
> The second disk on the machine is still operational and contains HDFS data
> so I plan on mounting it.  Is this ill-advised? Should I just wipe that disk
> too ?
>
> thanks,
> M
>


Hadoop node disk failure - reinstall question

2011-10-17 Thread Mayuran Yogarajah
One of our nodes died today, it looks like the disk containing the OS 
expired.  I will need to reinstall the machine.
Are there any known issues with using the same hostname / IP again, or 
is it better to give it a new IP / host name ?


The second disk on the machine is still operational and contains HDFS 
data so I plan on mounting it.  Is this ill-advised? Should I just wipe 
that disk too ?


thanks,
M


Re: Jira Assignment

2011-10-17 Thread lcrawfordmills
Please remove me from this email list.
Thank you
--Original Message--
From: Arun C Murthy
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: Re: Jira Assignment
Sent: Oct 17, 2011 3:08 PM

Done.

On Oct 16, 2011, at 10:00 AM, Mahadev Konar wrote:

> Arun,
> This was fixed a week ago or so. Here's the infra ticket.
> 
> https://issues.apache.org/jira/browse/INFRA-3960
> 
> You should be able to add new contributors now.
> 
> thanks
> mahadev
> 
> On Sun, Oct 16, 2011 at 9:36 AM, Arun C Murthy  wrote:
>> I've tried, and failed, many times recently to add 'contributors' to the 
>> Hadoop projects - something to do with the new UI they rolled out.
>> 
>> Let me try and follow-up with the ASF folks, thanks for being patient!
>> 
>> Arun
>> 
>> On Oct 16, 2011, at 9:32 AM, Jon Allen wrote:
>> 
>>> I've been doing some work on a Jira and want to assign it to myself but
>>> there doesn't seem to be an option to do this.  I believe I need to be
>>> assigned the contributor role before I can have issues assigned to me.  Is
>>> this correct and if so how do I get this role?
>>> 
>>> Thanks,
>>> Jon
>> 
>> 



Sent on the Sprint® Now Network from my BlackBerry®

Re: Jira Assignment

2011-10-17 Thread Arun C Murthy
Done.

On Oct 16, 2011, at 10:00 AM, Mahadev Konar wrote:

> Arun,
> This was fixed a week ago or so. Here's the infra ticket.
> 
> https://issues.apache.org/jira/browse/INFRA-3960
> 
> You should be able to add new contributors now.
> 
> thanks
> mahadev
> 
> On Sun, Oct 16, 2011 at 9:36 AM, Arun C Murthy  wrote:
>> I've tried, and failed, many times recently to add 'contributors' to the 
>> Hadoop projects - something to do with the new UI they rolled out.
>> 
>> Let me try and follow-up with the ASF folks, thanks for being patient!
>> 
>> Arun
>> 
>> On Oct 16, 2011, at 9:32 AM, Jon Allen wrote:
>> 
>>> I've been doing some work on a Jira and want to assign it to myself but
>>> there doesn't seem to be an option to do this.  I believe I need to be
>>> assigned the contributor role before I can have issues assigned to me.  Is
>>> this correct and if so how do I get this role?
>>> 
>>> Thanks,
>>> Jon
>> 
>> 



Re: How do I connect Java Visual VM to a remote task?

2011-10-17 Thread Rahul Jain
In our experience, the easy way to debug such problems is to use 'jmap' to
take a few snapshots of one of the child tasks spawned by a TaskTracker and
analyze them under a profiler tool such as JProfiler, YourKit, etc. This
should give you a pretty good indication of the objects that are using up
most of the heap memory.

You can add JVM options to suspend child tasks on startup and attach a
debugger, etc., but that is more painful in a distributed environment.


-Rahul

On Mon, Oct 17, 2011 at 11:34 AM, W.P. McNeill  wrote:

> I'm investigating a bug where my mapper and reducer tasks run out of
> memory.
> It only reproduces when I run on large data sets, so the best way to dig in
> is to launch my job with sufficiently large inputs on the cluster and
> monitor the memory characteristics of the failing JVMs remotely. Java
> Visual
> VM looks like the tool I want to use. Specifically I want to use it to do
> heap dumps on my tasks. I can't figure out how to set up the listening end
> on the cluster nodes, however.
>
> Here is what I have tried:
>
> 1. *Turn on JMX remote for the tasks*...I added the following options to
> mapred.child.java.opts:
> com.sun.management.jmxremote,
>
> com.sun.management.jmxremote.port=8004,com.sun.management.jmxremote.authenticate=false,com.sun.management.jmxremote.ssl=false.
>
> This does not work because there is contention for the JMX remote port when
> multiple tasks run on the same node. All but the first task fail at JVM
> initialization time, causing the job to fail before I can see the repro.
>
> 2. *Use jstatd*...I tried running jstatd in the background on my cluster
> nodes. It launches and runs, but when I try to connect using Visual VM,
> nothing happens.
>
> I am going to try adding -XX:-HeapDumpOnOutOfMemoryError, which will at
> least give me post-mortem information. Does anyone know where the heap dump
> file will  be written?
>
> Has anyone debugged a similar setup? What tools did you use?
>


How do I connect Java Visual VM to a remote task?

2011-10-17 Thread W.P. McNeill
I'm investigating a bug where my mapper and reducer tasks run out of memory.
It only reproduces when I run on large data sets, so the best way to dig in
is to launch my job with sufficiently large inputs on the cluster and
monitor the memory characteristics of the failing JVMs remotely. Java Visual
VM looks like the tool I want to use. Specifically I want to use it to do
heap dumps on my tasks. I can't figure out how to set up the listening end
on the cluster nodes, however.

Here is what I have tried:

1. *Turn on JMX remote for the tasks*...I added the following options to
mapred.child.java.opts:
com.sun.management.jmxremote,
com.sun.management.jmxremote.port=8004,com.sun.management.jmxremote.authenticate=false,com.sun.management.jmxremote.ssl=false.

This does not work because there is contention for the JMX remote port when
multiple tasks run on the same node. All but the first task fail at JVM
initialization time, causing the job to fail before I can see the repro.

2. *Use jstatd*...I tried running jstatd in the background on my cluster
nodes. It launches and runs, but when I try to connect using Visual VM,
nothing happens.

I am going to try adding -XX:-HeapDumpOnOutOfMemoryError, which will at
least give me post-mortem information. Does anyone know where the heap dump
file will  be written?

Has anyone debugged a similar setup? What tools did you use?


Re: Is there a good way to see how full hdfs is

2011-10-17 Thread Uma Maheswara Rao G 72686
Yes, that was deprecated in trunk.

If you want to use it programmatically, this would be the better option:
  /** {@inheritDoc} */
  @Override
  public FsStatus getStatus(Path p) throws IOException {
    statistics.incrementReadOps(1);
    return dfs.getDiskStatus();
  }

This should work for you.

It will give you an FsStatus object with the APIs
getCapacity, getUsed, and getRemaining.

I would suggest you look through the available FileSystem APIs once; I think 
that will give you a clear understanding of how to use them.
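
A minimal sketch of the whole thing, assuming a release where
FileSystem#getStatus() is available (it is in trunk, as discussed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class HdfsUsageSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Summarized figures for the whole filesystem, in bytes.
    FsStatus status = fs.getStatus();
    System.out.println("capacity  = " + status.getCapacity());
    System.out.println("used      = " + status.getUsed());
    System.out.println("remaining = " + status.getRemaining());
  }
}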

Regards,
Uma


- Original Message -
From: ivan.nov...@emc.com
Date: Monday, October 17, 2011 9:48 pm
Subject: Re: Is there a good way to see how full hdfs is
To: common-user@hadoop.apache.org

> Hi Harsh,
> 
> I need access to the data programatically for system automation, 
> and hence
> I do not want a monitoring tool but access to the raw data.
> 
> I am more than happy to use an exposed function or client program 
> and not
> an internal API.
> 
> So I am still a bit confused... What is the simplest way to get at this
> raw disk usage data programmatically?  Is there an HDFS equivalent of du
> and df, or are you suggesting to just run that on the linux OS (which is
> perfectly doable).
> 
> Cheers,
> Ivan
> 
> 
> On 10/17/11 9:05 AM, "Harsh J"  wrote:
> 
> >Uma/Ivan,
> >
> >The DistributedFileSystem class explicitly is _not_ meant for public
> >consumption, it is an internal one. Additionally, that method has been
> >deprecated.
> >
> >What you need is FileSystem#getStatus() if you want the summarized
> >report via code.
> >
> >A job, that possibly runs "du" or "df", is a good idea if you
> >guarantee perfect homogeneity of path names in your cluster.
> >
> >But I wonder, why won't using a general monitoring tool (such as
> >nagios) for this purpose cut it? What's the end goal here?
> >
> >P.s. I'd moved this conversation to hdfs-user@ earlier on, but now I
> >see it being cross posted into mr-user, common-user, and common-dev --
> >Why?
> >
> >On Mon, Oct 17, 2011 at 9:25 PM, Uma Maheswara Rao G 72686
> > wrote:
> >> We can write the simple program and you can call this API.
> >>
> >> Make sure Hadoop jars presents in your class path.
> >> Just for more clarification, DN will send their stats as parts of
> >>heartbeats, so the NN will maintain all the statistics about the diskspace
> >>usage for the complete filesystem, etc... This api will give you that
> >>stats.
> >>
> >> Regards,
> >> Uma
> >>
> >> - Original Message -
> >> From: ivan.nov...@emc.com
> >> Date: Monday, October 17, 2011 9:07 pm
> >> Subject: Re: Is there a good way to see how full hdfs is
> >> To: common-user@hadoop.apache.org, mapreduce-u...@hadoop.apache.org
> >> Cc: common-...@hadoop.apache.org
> >>
> >>> So is there a client program to call this?
> >>>
> >>> Can one write their own simple client to call this method from all
> >>> disks on the cluster?
> >>>
> >>> How about a map reduce job to collect from all disks on the cluster?
> >>>
> >>> On 10/15/11 4:51 AM, "Uma Maheswara Rao G 72686"
> >>> wrote:
> >>>
> >>> >/** Return the disk usage of the filesystem, including total capacity,
> >>> >   * used space, and remaining space */
> >>> >  public DiskStatus getDiskStatus() throws IOException {
> >>> >return dfs.getDiskStatus();
> >>> >  }
> >>> >
> >>> >DistributedFileSystem has the above API from java API side.
> >>> >
> >>> >Regards,
> >>> >Uma
> >>> >
> >>> >- Original Message -
> >>> >From: wd 
> >>> >Date: Saturday, October 15, 2011 4:16 pm
> >>> >Subject: Re: Is there a good way to see how full hdfs is
> >>> >To: mapreduce-u...@hadoop.apache.org
> >>> >
> >>> >> hadoop dfsadmin -report
> >>> >>
> >>> >> On Sat, Oct 15, 2011 at 8:16 AM, Steve Lewis
> >>> >>  wrote:
> >>> >> > We have a small cluster with HDFS running on only 8 nodes -
> I
> >>> >> believe that
> >>> >> > the partition assigned to hdfs might be getting full and
 >>> >> > wonder if the web tools or java api have a way to look at free
 >>> >> space on
> >>> >> > hdfs
> >>> >> >
> >>> >> > --
> >>> >> > Steven M. Lewis PhD
> >>> >> > 4221 105th Ave NE
> >>> >> > Kirkland, WA 98033
> >>> >> > 206-384-1340 (cell)
> >>> >> > Skype lordjoe_com
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >>
> >>> >
> >>>
> >>>
> >>
> >
> >
> >
> >-- 
> >Harsh J
> >
> 
> 


Re: Is there a good way to see how full hdfs is

2011-10-17 Thread Ivan.Novick
Hi Harsh,

I need access to the data programmatically for system automation, and hence
I do not want a monitoring tool but access to the raw data.

I am more than happy to use an exposed function or client program and not
an internal API.

So I am still a bit confused... What is the simplest way to get at this
raw disk usage data programmatically?  Is there an HDFS equivalent of du
and df, or are you suggesting to just run that on the linux OS (which is
perfectly doable)?

Cheers,
Ivan


On 10/17/11 9:05 AM, "Harsh J"  wrote:

>Uma/Ivan,
>
>The DistributedFileSystem class explicitly is _not_ meant for public
>consumption, it is an internal one. Additionally, that method has been
>deprecated.
>
>What you need is FileSystem#getStatus() if you want the summarized
>report via code.
>
>A job, that possibly runs "du" or "df", is a good idea if you
>guarantee perfect homogeneity of path names in your cluster.
>
>But I wonder, why won't using a general monitoring tool (such as
>nagios) for this purpose cut it? What's the end goal here?
>
>P.s. I'd moved this conversation to hdfs-user@ earlier on, but now I
>see it being cross posted into mr-user, common-user, and common-dev --
>Why?
>
>On Mon, Oct 17, 2011 at 9:25 PM, Uma Maheswara Rao G 72686
> wrote:
>> We can write the simple program and you can call this API.
>>
>> Make sure Hadoop jars presents in your class path.
>> Just for more clarification, DN will send their stats as parts of
>>hertbeats, So, NN will maintain all the statistics about the diskspace
>>usage for the complete filesystem and etc... This api will give you that
>>stats.
>>
>> Regards,
>> Uma
>>
>> - Original Message -
>> From: ivan.nov...@emc.com
>> Date: Monday, October 17, 2011 9:07 pm
>> Subject: Re: Is there a good way to see how full hdfs is
>> To: common-user@hadoop.apache.org, mapreduce-u...@hadoop.apache.org
>> Cc: common-...@hadoop.apache.org
>>
>>> So is there a client program to call this?
>>>
>>> Can one write their own simple client to call this method from all
>>> disks on the cluster?
>>>
>>> How about a map reduce job to collect from all disks on the cluster?
>>>
>>> On 10/15/11 4:51 AM, "Uma Maheswara Rao G 72686"
>>> wrote:
>>>
>>> >/** Return the disk usage of the filesystem, including total capacity,
>>> >   * used space, and remaining space */
>>> >  public DiskStatus getDiskStatus() throws IOException {
>>> >return dfs.getDiskStatus();
>>> >  }
>>> >
>>> >DistributedFileSystem has the above API from java API side.
>>> >
>>> >Regards,
>>> >Uma
>>> >
>>> >- Original Message -
>>> >From: wd 
>>> >Date: Saturday, October 15, 2011 4:16 pm
>>> >Subject: Re: Is there a good way to see how full hdfs is
>>> >To: mapreduce-u...@hadoop.apache.org
>>> >
>>> >> hadoop dfsadmin -report
>>> >>
>>> >> On Sat, Oct 15, 2011 at 8:16 AM, Steve Lewis
>>> >>  wrote:
>>> >> > We have a small cluster with HDFS running on only 8 nodes - I
>>> >> believe that
>>> >> > the partition assigned to hdfs might be getting full and
>>> >> > wonder if the web tools or java api have a way to look at free
>>> >> space on
>>> >> > hdfs
>>> >> >
>>> >> > --
>>> >> > Steven M. Lewis PhD
>>> >> > 4221 105th Ave NE
>>> >> > Kirkland, WA 98033
>>> >> > 206-384-1340 (cell)
>>> >> > Skype lordjoe_com
>>> >> >
>>> >> >
>>> >> >
>>> >>
>>> >
>>>
>>>
>>
>
>
>
>-- 
>Harsh J
>



Re: Is there a good way to see how full hdfs is

2011-10-17 Thread Harsh J
Uma/Ivan,

The DistributedFileSystem class explicitly is _not_ meant for public
consumption, it is an internal one. Additionally, that method has been
deprecated.

What you need is FileSystem#getStatus() if you want the summarized
report via code.

A job, that possibly runs "du" or "df", is a good idea if you
guarantee perfect homogeneity of path names in your cluster.

But I wonder, why won't using a general monitoring tool (such as
nagios) for this purpose cut it? What's the end goal here?

P.s. I'd moved this conversation to hdfs-user@ earlier on, but now I
see it being cross posted into mr-user, common-user, and common-dev --
Why?

On Mon, Oct 17, 2011 at 9:25 PM, Uma Maheswara Rao G 72686
 wrote:
> We can write the simple program and you can call this API.
>
> Make sure Hadoop jars presents in your class path.
> Just for more clarification, DN will send their stats as parts of hertbeats, 
> So, NN will maintain all the statistics about the diskspace usage for the 
> complete filesystem and etc... This api will give you that stats.
>
> Regards,
> Uma
>
> - Original Message -
> From: ivan.nov...@emc.com
> Date: Monday, October 17, 2011 9:07 pm
> Subject: Re: Is there a good way to see how full hdfs is
> To: common-user@hadoop.apache.org, mapreduce-u...@hadoop.apache.org
> Cc: common-...@hadoop.apache.org
>
>> So is there a client program to call this?
>>
>> Can one write their own simple client to call this method from all
>> disks on the cluster?
>>
>> How about a map reduce job to collect from all disks on the cluster?
>>
>> On 10/15/11 4:51 AM, "Uma Maheswara Rao G 72686"
>> wrote:
>>
>> >/** Return the disk usage of the filesystem, including total capacity,
>> >   * used space, and remaining space */
>> >  public DiskStatus getDiskStatus() throws IOException {
>> >    return dfs.getDiskStatus();
>> >  }
>> >
>> >DistributedFileSystem has the above API from java API side.
>> >
>> >Regards,
>> >Uma
>> >
>> >- Original Message -
>> >From: wd 
>> >Date: Saturday, October 15, 2011 4:16 pm
>> >Subject: Re: Is there a good way to see how full hdfs is
>> >To: mapreduce-u...@hadoop.apache.org
>> >
>> >> hadoop dfsadmin -report
>> >>
>> >> On Sat, Oct 15, 2011 at 8:16 AM, Steve Lewis
>> >>  wrote:
>> >> > We have a small cluster with HDFS running on only 8 nodes - I
>> >> believe that
>> >> > the partition assigned to hdfs might be getting full and
>> >> > wonder if the web tools or java api have a way to look at free
>> >> space on
>> >> > hdfs
>> >> >
>> >> > --
>> >> > Steven M. Lewis PhD
>> >> > 4221 105th Ave NE
>> >> > Kirkland, WA 98033
>> >> > 206-384-1340 (cell)
>> >> > Skype lordjoe_com
>> >> >
>> >> >
>> >> >
>> >>
>> >
>>
>>
>



-- 
Harsh J


Re: Is there a good way to see how full hdfs is

2011-10-17 Thread Uma Maheswara Rao G 72686
We can write a simple program and you can call this API.

Make sure the Hadoop jars are present in your classpath.
Just for more clarification: DNs will send their stats as part of heartbeats, 
so the NN will maintain all the statistics about the diskspace usage for the 
complete filesystem, etc. This API will give you those stats.

Regards,
Uma

- Original Message -
From: ivan.nov...@emc.com
Date: Monday, October 17, 2011 9:07 pm
Subject: Re: Is there a good way to see how full hdfs is
To: common-user@hadoop.apache.org, mapreduce-u...@hadoop.apache.org
Cc: common-...@hadoop.apache.org

> So is there a client program to call this?
> 
> Can one write their own simple client to call this method from all 
> disks on the cluster?
> 
> How about a map reduce job to collect from all disks on the cluster?
> 
> On 10/15/11 4:51 AM, "Uma Maheswara Rao G 72686" 
> wrote:
> 
> >/** Return the disk usage of the filesystem, including total capacity,
> >   * used space, and remaining space */
> >  public DiskStatus getDiskStatus() throws IOException {
> >return dfs.getDiskStatus();
> >  }
> >
> >DistributedFileSystem has the above API from java API side.
> >
> >Regards,
> >Uma
> >
> >- Original Message -
> >From: wd 
> >Date: Saturday, October 15, 2011 4:16 pm
> >Subject: Re: Is there a good way to see how full hdfs is
> >To: mapreduce-u...@hadoop.apache.org
> >
> >> hadoop dfsadmin -report
> >> 
> >> On Sat, Oct 15, 2011 at 8:16 AM, Steve Lewis
> >>  wrote:
> >> > We have a small cluster with HDFS running on only 8 nodes - I
> >> believe that
> >> > the partition assigned to hdfs might be getting full and
> >> > wonder if the web tools or java api have a way to look at free
> >> space on
> >> > hdfs
> >> >
> >> > --
> >> > Steven M. Lewis PhD
> >> > 4221 105th Ave NE
> >> > Kirkland, WA 98033
> >> > 206-384-1340 (cell)
> >> > Skype lordjoe_com
> >> >
> >> >
> >> >
> >> 
> >
> 
> 


Re: Is there a good way to see how full hdfs is

2011-10-17 Thread Ivan.Novick
So is there a client program to call this?

Can one write their own simple client to call this method from all disks
on the cluster?  

How about a map reduce job to collect from all disks on the cluster?

On 10/15/11 4:51 AM, "Uma Maheswara Rao G 72686" 
wrote:

>/** Return the disk usage of the filesystem, including total capacity,
>   * used space, and remaining space */
>  public DiskStatus getDiskStatus() throws IOException {
>return dfs.getDiskStatus();
>  }
>
>DistributedFileSystem has the above API from java API side.
>
>Regards,
>Uma
>
>- Original Message -
>From: wd 
>Date: Saturday, October 15, 2011 4:16 pm
>Subject: Re: Is there a good way to see how full hdfs is
>To: mapreduce-u...@hadoop.apache.org
>
>> hadoop dfsadmin -report
>> 
>> On Sat, Oct 15, 2011 at 8:16 AM, Steve Lewis
>>  wrote:
>> > We have a small cluster with HDFS running on only 8 nodes - I
>> believe that
>> > the partition assigned to hdfs might be getting full and
>> > wonder if the web tools or java api have a way to look at free
>> space on
>> > hdfs
>> >
>> > --
>> > Steven M. Lewis PhD
>> > 4221 105th Ave NE
>> > Kirkland, WA 98033
>> > 206-384-1340 (cell)
>> > Skype lordjoe_com
>> >
>> >
>> >
>> 
>



Hadoop archive

2011-10-17 Thread Jonas Hartwig
Hi, I'm new to the community.

I'd like to create an archive but I get the error: "Exception in archives
null".

I'm using Hadoop 0.204.0. The issue was tracked under MAPREDUCE-1399
and solved. How do I combine my Hadoop version with a new map/reduce release?
And how do I get the release using Firefox? I saw something like JIRA, but the
Firefox plugin is not working with 7.x.

 

regards



Re: Does hadoop support append option?

2011-10-17 Thread Uma Maheswara Rao G 72686
AFAIK, the append option is there in the 20Append branch. It mainly supports 
sync, but there are some issues with that.

Same has been merged to the 20.205 branch and will be released soon (rc2 
available). Many bugs have also been fixed in this branch. As per our basic 
testing it is pretty good as of now. Need to wait for the official release.

Regards,
Uma

- Original Message -
From: bourne1900 
Date: Monday, October 17, 2011 12:37 pm
Subject: Does hadoop support append option?
To: common-user 

> I know that hadoop0.19.0 supports append option, but not stable.
> Does the latest version support append option? Is it stable?
> Thanks for help.
> 
> 
> 
> 
> bourne