How can I get the actual time for one write operation in HDFS?

2009-05-12 Thread Xie, Tao

DFSOutputStream.writeChunk() enqueues packets into the data queue and then
returns, so the write is asynchronous.

I want to know the total actual time HDFS takes to execute the write operation
(from writeChunk() until each replica is written to disk). How can I get that
time?
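
One crude approximation is to time create()-to-close() from a small client program.
This is only a sketch (the path and sizes below are made up); close() blocks until
the datanode pipeline has acknowledged the last packet, so it approximates the
end-to-end write as seen by the client rather than the per-replica disk time:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteTiming {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    byte[] buf = new byte[64 * 1024];                // 64 KB per write() call
    long start = System.nanoTime();
    FSDataOutputStream out = fs.create(new Path("/tmp/write-timing"), true);
    for (int i = 0; i < 1024; i++) {
      out.write(buf);                                // packets are queued asynchronously
    }
    out.close();                                     // blocks until the pipeline acks
    System.out.println("create+write+close took "
        + (System.nanoTime() - start) / 1000000 + " ms");
  }
}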

Thanks.

-- 
View this message in context: 
http://www.nabble.com/How-can-I-get-the-actual-time-for-one-write-operation-in-HDFS--tp23516363p23516363.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: how to improve the Hadoop's capability of dealing with small files

2009-05-12 Thread Rasit OZDAS
I have a similar situation with very small files.
I have never tried HBase (I want to), but you can also group them
and write, let's say, 20-30 of them into one big file, so that every small file
becomes a key in that big file.

There are methods in the API with which you can write an object as a file into
HDFS and read it back to get the original object. Keeping a list of items in
that object can solve this problem.
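
One way to do that grouping, as a rough sketch (the class name and paths below are
assumed, not from this thread): pack the small files into a SequenceFile, using each
original file name as the key and its raw bytes as the value, and read them back by
key later.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path packed = new Path("/data/packed.seq");                      // assumed output file
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, packed, Text.class, BytesWritable.class);
    for (FileStatus stat : fs.listStatus(new Path("/data/small"))) { // assumed input dir
      byte[] buf = new byte[(int) stat.getLen()];
      FSDataInputStream in = fs.open(stat.getPath());
      in.readFully(buf);                                             // read the whole small file
      in.close();
      writer.append(new Text(stat.getPath().getName()), new BytesWritable(buf));
    }
    writer.close();
  }
}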


RE: public IP for datanode on EC2

2009-05-12 Thread Joydeep Sen Sarma
(raking up real old thread)

After struggling with this issue for some time now, it seems that accessing
HDFS on EC2 from outside EC2 is not possible.

This is primarily because of https://issues.apache.org/jira/browse/HADOOP-985.
Even if datanode ports are authorized in EC2 and we set the public hostname via
slave.host.name, the namenode uses the internal IP addresses of the datanodes
for block locations. DFS clients outside EC2 cannot reach these addresses and
report failures reading/writing data blocks.

HDFS/EC2 gurus - would it be reasonable to ask for an option to not use IP
addresses (and to use datanode host names instead, as pre-HADOOP-985)?

I really like the idea of being able to use an external node (my personal
workstation) to do job submission (which typically requires interacting with
HDFS in order to push files into the jobcache etc.). This way I don't need
custom AMIs - I can use stock Hadoop AMIs (all the custom software is on the
external node). Without the above option, this is currently not possible.

 

-Original Message-
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] 
Sent: Tuesday, September 09, 2008 7:04 AM
To: core-user@hadoop.apache.org
Subject: Re: public IP for datanode on EC2

> I think most people try to avoid allowing remote access for security 
> reasons. If you can add a file, I can mount your filesystem too, maybe 
> even delete things. Whereas with EC2-only filesystems, your files are 
> *only* exposed to everyone else that knows or can scan for your IPAddr and 
> ports.
>

I imagine that the access to the ports used by HDFS could be restricted to 
specific IPs using the EC2 group (ec2-authorize) or any other firewall 
mechanism if necessary.

Could anyone confirm that there is no conf parameter I could use to force the 
address of my DataNodes?

Thanks

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com


Re: how to connect to remote hadoop dfs by eclipse plugin?

2009-05-12 Thread Rasit OZDAS
Either your Hadoop isn't running at all, or it isn't listening on the specified port.
- Try the stop-all.sh command on the namenode. If it says "no namenode to stop",
then take a look at the namenode logs and paste them here if anything seems strange.
- If the namenode logs are OK (filled with INFO messages), then take a look at
all the logs.
- In the Eclipse plugin, the left field is the Map/Reduce port and the right field
is the namenode port; make sure both match your configuration in the XML files.
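
Those two fields correspond to mapred.job.tracker and fs.default.name. A quick way
to see what your client-side configuration actually resolves to (a sketch; the
printed values in the comments are only examples):

import org.apache.hadoop.conf.Configuration;

public class ShowConf {
  public static void main(String[] args) {
    Configuration conf = new Configuration();   // reads hadoop-site.xml from the classpath
    // e.g. hdfs://namenode-host:9100 -- must match the plugin's DFS master host/port
    System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
    // e.g. namenode-host:9101 -- must match the plugin's Map/Reduce master host/port
    System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
  }
}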

2009/5/12 andy2005cst 

>
> When I use the Eclipse plugin hadoop-0.18.3-eclipse-plugin.jar and try to
> connect to a remote Hadoop DFS, I get an IOException. If I run a map/reduce
> program it outputs:
> 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
> /**.**.**.**:9100. Already tried 0 time(s).
> 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
> /**.**.**.**:9100. Already tried 1 time(s).
> 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
> /**.**.**.**:9100. Already tried 2 time(s).
> 
> Exception in thread "main" java.io.IOException: Call to /**.**.**.**:9100
> failed on local exception: java.net.SocketException: Connection refused:
> connect
>
> Looking forward to your help. Thanks a lot.
> --
> View this message in context:
> http://www.nabble.com/how-to-connect-to-remote-hadoop-dfs-by-eclipse-plugin--tp23498736p23498736.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
M. Raşit ÖZDAŞ


Re: Winning a sixty second dash with a yellow elephant

2009-05-12 Thread Ian jonhson
Interesting... So, where can I download the benchmark and the related
test code?


On Tue, May 12, 2009 at 8:38 AM, Arun C Murthy  wrote:
> ... oh, and getting it to run a marathon too!
>
> http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html
>
> Owen & Arun
>


hadoop streaming reducer values

2009-05-12 Thread Alan Drew

Hi,

I have a question about the <key, value> pairs that the reducer gets in Hadoop
Streaming.

I wrote a simple mapper.sh, reducer.sh script files:

mapper.sh : 

#!/bin/bash

while read data
do
  #tokenize the data and output the values 
  echo $data | awk '{token=0; while(++token<=NF) print $token"\t1"}'
done

reducer.sh :

#!/bin/bash

while read data
do
  echo -e $data
done

The mapper tokenizes a line of input and outputs <word, 1> pairs to standard
output.  The reducer just outputs what it gets from standard input.

I have a simple input file:

cat in the hat
ate my mat the

I was expecting the final output to be something like:

the 1 1 1 
cat 1

etc.

but instead each word has its own line, which makes me think that
<key, value> is being given to the reducer and not <key, list of values>, which
is the default for normal Hadoop (in Java), right?

the 1
the 1
the 1
cat 1

Is there any way to get <key, list of values> for the reducer and not a bunch of
<key, value> pairs?  I looked into the -reducer aggregate option, but there
doesn't seem to be a way to customize what the reducer does with the values
other than the max and min functions.

Thanks.
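
For context: Streaming hands the reducer one sorted key<TAB>value line at a time, so
the grouping has to happen inside the reducer by watching for key changes. A minimal
sketch of that grouping logic, written here as a stand-alone stdin filter in Java
(the same idea works in a shell or Python script passed to -reducer):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class GroupingReducer {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String prevKey = null;
    StringBuilder values = new StringBuilder();
    String line;
    while ((line = in.readLine()) != null) {
      int tab = line.indexOf('\t');
      String key = tab >= 0 ? line.substring(0, tab) : line;
      String value = tab >= 0 ? line.substring(tab + 1) : "";
      if (prevKey != null && !key.equals(prevKey)) {
        System.out.println(prevKey + "\t" + values);   // e.g. "the  1 1 1"
        values.setLength(0);
      }
      prevKey = key;
      if (values.length() > 0) values.append(' ');
      values.append(value);
    }
    if (prevKey != null) {
      System.out.println(prevKey + "\t" + values);     // flush the last key
    }
  }
}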
-- 
View this message in context: 
http://www.nabble.com/hadoop-streaming-reducer-values-tp23514523p23514523.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: No reduce tasks running, yet 1 is pending

2009-05-12 Thread Saptarshi Guha
Interestingly, when I started other jobs, this one finished.
I have no idea why.

Saptarshi Guha



On Tue, May 12, 2009 at 10:36 PM, Saptarshi Guha
 wrote:
> Hello,
> I mentioned this issue before for the case of map tasks. I have 43
> reduce tasks, 42 completed, 1 pending and 0 running.
> This has been the case for the last 30 minutes. A picture (TIFF) of the job
> tracker can be found here: http://www.stat.purdue.edu/~sguha/mr.tiff
> Since I haven't canceled the job, I can send logs and output if
> anyone requires them.
>
>
> Regards
> Saptarshi
>


No reduce tasks running, yet 1 is pending

2009-05-12 Thread Saptarshi Guha
Hello,
I mentioned this issue before for the case of map tasks. I have 43
reduce tasks, 42 completed, 1 pending and 0 running.
This has been the case for the last 30 minutes. A picture (TIFF) of the job
tracker can be found here: http://www.stat.purdue.edu/~sguha/mr.tiff
Since I haven't canceled the job, I can send logs and output if
anyone requires them.


Regards
Saptarshi


Re: Suggestions for making writing faster? DFSClient waiting while writing chunk

2009-05-12 Thread stack
On Mon, May 11, 2009 at 9:43 PM, Raghu Angadi  wrote:

> stack wrote:
>
>> Thanks Raghu:
>>
>> Here is where it gets stuck:  [...]
>>
>
> Is that where it normally gets stuck? That implies it is spending an unusually
> long time at the end of writing a block, which should not be the case.


I studied the datanode as you suggested.  This sent me back to the client
application and indeed, we were spending time finalizing blocks because the
block size had been set way down in the application.  The write rate is
reasonable again.
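
For anyone hitting the same thing: the per-file block size comes from dfs.block.size,
or can be passed explicitly to FileSystem.create(), so it is worth checking what the
client application actually sets. A small sketch with assumed values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // What block size will new files get by default? (64 MB if unset)
    System.out.println("dfs.block.size = "
        + conf.getLong("dfs.block.size", 64L * 1024 * 1024));

    // Or set the block size explicitly for a single file (values here are assumed):
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/tmp/blocksize-test"), true,
        conf.getInt("io.file.buffer.size", 4096),  // io buffer size
        (short) 3,                                 // replication factor
        64L * 1024 * 1024);                        // block size in bytes
    out.close();
  }
}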

Thanks for the pointers Raghu,
St.Ack


Hadoop-on-Demand question: key/value pairs in child opts

2009-05-12 Thread Jiaqi Tan
Hi,

I'd like to do this in my hodrc file:

client-params = ...,,mapred.child.java.opts="-Dkey=value",...

but HoD doesn't like it:
error: 1 problem found.
Check your command line options and/or your configuration file /hodrc

Any ideas how to specify "nested equal"s? Has anyone ever tried this,
or is there any other way to pass shell options to child TaskTrackers?

Thanks,
Jiaqi


RE: Hadoop Summit 2009 - Open for registration

2009-05-12 Thread Ajay Anand
You can register at http://hadoopsummit09.eventbrite.com/ 

Ajay

-Original Message-
From: Amandeep Khurana [mailto:ama...@gmail.com] 
Sent: Tuesday, May 12, 2009 9:55 AM
To: hbase-u...@hadoop.apache.org; core-user@hadoop.apache.org
Subject: Re: Hadoop Summit 2009 - Open for registration

It shows sold out on the website. Any chances of more seats opening up?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Tue, May 5, 2009 at 2:10 PM, Ajay Anand  wrote:

> This year's Hadoop Summit
> (http://developer.yahoo.com/events/hadoopsummit09/) is confirmed for
> June 10th at the Santa Clara Marriott, and is now open for registration.
>
>
>
> We have a packed agenda, with three tracks - for developers,
> administrators, and one focused on new and innovative applications using
> Hadoop. The presentations include talks from Amazon, IBM, Sun, Cloudera,
> Facebook, HP, Microsoft, and the Yahoo! team, as well as leading
> universities including UC Berkeley, CMU, Cornell, U of Maryland, U of
> Nebraska and SUNY.
>
>
>
> From our experience last year with the rush for seats, I would encourage
> people to register early at http://hadoopsummit09.eventbrite.com/
>
>
>
> Looking forward to seeing you at the summit!
>
>
>
> Ajay
>
>


Re: Hadoop Summit 2009 - Open for registration

2009-05-12 Thread Amandeep Khurana
It shows sold out on the website. Any chances of more seats opening up?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Tue, May 5, 2009 at 2:10 PM, Ajay Anand  wrote:

> This year's Hadoop Summit
> (http://developer.yahoo.com/events/hadoopsummit09/) is confirmed for
> June 10th at the Santa Clara Marriott, and is now open for registration.
>
>
>
> We have a packed agenda, with three tracks - for developers,
> administrators, and one focused on new and innovative applications using
> Hadoop. The presentations include talks from Amazon, IBM, Sun, Cloudera,
> Facebook, HP, Microsoft, and the Yahoo! team, as well as leading
> universities including UC Berkeley, CMU, Cornell, U of Maryland, U of
> Nebraska and SUNY.
>
>
>
> From our experience last year with the rush for seats, I would encourage
> people to register early at http://hadoopsummit09.eventbrite.com/
>
>
>
> Looking forward to seeing you at the summit!
>
>
>
> Ajay
>
>


Re: HDFS to S3 copy problems

2009-05-12 Thread Tom White
Ian - Thanks for the detailed analysis. It was these issues that led
me to create a temporary file in NativeS3FileSystem in the first
place. I think we can get NativeS3FileSystem to report progress,
though; see https://issues.apache.org/jira/browse/HADOOP-5814.
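
As a very rough, hypothetical sketch of the idea (not the actual HADOOP-5814 patch,
and echoing the ExtendedClosable proposal quoted below): a helper thread keeps the
Progressable ticking while the buffered file is uploaded during close().

import java.io.IOException;
import org.apache.hadoop.util.Progressable;

// Hypothetical interface and stream shape, just to make the idea concrete.
interface ExtendedClosable {
  void close(Progressable progress) throws IOException;
}

abstract class S3OutputStreamSketch implements ExtendedClosable {

  // Performs the blocking PUT of the buffered temp file to S3 (not shown here).
  protected abstract void uploadBufferedFile() throws IOException;

  public void close(final Progressable progress) throws IOException {
    Thread reporter = new Thread(new Runnable() {
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          progress.progress();                 // keep the task from timing out
          try { Thread.sleep(10 * 1000); } catch (InterruptedException e) { return; }
        }
      }
    });
    reporter.setDaemon(true);
    reporter.start();
    try {
      uploadBufferedFile();                    // the long-running part of close()
    } finally {
      reporter.interrupt();
    }
  }
}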

Ken - I can't see why you would be getting that error. Does it work
with hadoop fs, but not hadoop distcp?

Cheers,
Tom

On Sat, May 9, 2009 at 6:48 AM, Nowland, Ian  wrote:
> Hi Tom,
>
> Not creating a temp file is the ideal, as it saves you from having to "waste"
> the local hard disk by writing an output file just before uploading it to
> Amazon S3. There are a few problems though:
>
> 1) Amazon S3 PUTs need the file length up front. You could use a chunked
> POST, but then you have the disadvantage of having to Base64 encode all your
> data, increasing bandwidth usage, and you still have the following problems;
>
> 2) You would still want to have MD5 checking. In Amazon S3 both PUT and POST 
> require the MD5 to be supplied before the contents. To work around this then 
> you would have to upload the object without MD5, then check its metadata to 
> make sure the MD5 is correct, then delete it if it is not. This is all 
> possible, but would be difficult to make bulletproof, whereas in the current 
> version, if the MD5 is different the PUT fails atomically and you can easily 
> just retry.
>
> 3) Finally, you would have to be careful in reducers that output only very 
> rarely. If there is too big a gap between data being uploaded through the 
> socket, then S3 may determine the connection has timed out, closing the 
> connection and meaning your task has to rerun (perhaps just to hit the same 
> problem again).
>
> All of this means that the current solution may be best for now as far as
> general upload goes. The best I think we can do is fix the fact that the task is
> not progressed in close(). The best way I can see to do this is introducing a
> new interface, say called ExtendedClosable, which defines a close(Progressable
> p) method. Then, have the various clients of FileSystem output streams (e.g.
> DistCp, TextOutputFormat) test whether their DataOutputStream supports the
> interface, and if so call this in preference to the default. In the case of
> NativeS3FileSystem, this method then spins up a thread to keep the
> Progressable updated as the upload progresses.
>
> As an additional optimization to DistCp, where the source file already exists
> we could have some extended interface, say ExtendedWriteFileSystem, that has a
> create() method taking the MD5 and the file size, and then test for this
> interface in the DistCp mapper and call the extended method. The catch here
> is that the checksum HDFS stores is not the MD5 needed by S3, so two (perhaps
> distributed) reads would be needed; the trade-off is those two distributed
> reads vs. one distributed read plus a local write and a local read.
>
> What do you think?
>
> Cheers,
> Ian Nowland
> Amazon.com
>
> -Original Message-
> From: Tom White [mailto:t...@cloudera.com]
> Sent: Friday, May 08, 2009 1:36 AM
> To: core-user@hadoop.apache.org
> Subject: Re: HDFS to S3 copy problems
>
> Perhaps we should revisit the implementation of NativeS3FileSystem so
> that it doesn't always buffer the file on the client. We could have an
> option to make it write directly to S3. Thoughts?
>
> Regarding the problem with HADOOP-3733, you can work around it by
> setting fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey in your
> hadoop-site.xml.
>
> Cheers,
> Tom
>
> On Fri, May 8, 2009 at 1:17 AM, Andrew Hitchcock  wrote:
>> Hi Ken,
>>
>> S3N doesn't work that well with large files. When uploading a file to
>> S3, S3N saves it to local disk during write() and then uploads to S3
>> during the close(). Close can take a long time for large files and it
>> doesn't report progress, so the call can time out.
>>
>> As a work around, I'd recommend either increasing the timeout or
>> uploading the files by hand. Since you only have a few large files,
>> you might want to copy the files to local disk and then use something
>> like s3cmd to upload them to S3.
>>
>> Regards,
>> Andrew
>>
>> On Thu, May 7, 2009 at 4:42 PM, Ken Krugler  
>> wrote:
>>> Hi all,
>>>
>>> I have a few large files (4 that are 1.8GB+) I'm trying to copy from HDFS to
>>> S3. My micro EC2 cluster is running Hadoop 0.19.1, and has one master/two
>>> slaves.
>>>
>>> I first tried using the hadoop fs -cp command, as in:
>>>
>>> hadoop fs -cp output// s3n:
>>>
>>> This seemed to be working, as I could watch the network traffic spike, and
>>> temp files were being created in S3 (as seen with CyberDuck).
>>>
>>> But then it seemed to hang. Nothing happened for 30 minutes, so I killed the
>>> command.
>>>
>>> Then I tried using the hadoop distcp command, as in:
>>>
>>> hadoop distcp hdfs://:50001/// s3://:>> key>@//
>>>
>>> This failed, because my secret key has a '/' in it
>>> (http://issues.apache.org/jira/brow

Re: large files vs many files

2009-05-12 Thread Sasha Dolgy
2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:28)
hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:49)
hdfs.HdfsQueueConsumer: Re-using existing stream
2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:63)
hdfs.HdfsQueueConsumer: Flushing stream, size = 1986
2009-05-12 12:42:17,472 DEBUG [Thread-7] (DFSClient.java:3013)
hdfs.DFSClient: DFSClient flush() : saveOffset 1613 bytesCurBlock 1986
lastFlushOffset 1731
2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:66)
hdfs.HdfsQueueConsumer: Flushed stream, size = 1986
2009-05-12 12:42:19,586 DEBUG [Thread-7] (HdfsQueueConsumer.java:39)
hdfs.HdfsQueueConsumer: Consumer writing event
2009-05-12 12:42:19,587 DEBUG [Thread-7] (FSStreamManager.java:28)
hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
2009-05-12 12:42:19,588 DEBUG [Thread-7] (FSStreamManager.java:49)
hdfs.HdfsQueueConsumer: Re-using existing stream
2009-05-12 12:42:19,589 DEBUG [Thread-7] (FSStreamManager.java:63)
hdfs.HdfsQueueConsumer: Flushing stream, size = 2235
2009-05-12 12:42:19,589 DEBUG [Thread-7] (DFSClient.java:3013)
hdfs.DFSClient: DFSClient flush() : saveOffset 2125 bytesCurBlock 2235
lastFlushOffset 1986
2009-05-12 12:42:19,590 DEBUG [Thread-7] (FSStreamManager.java:66)
hdfs.HdfsQueueConsumer: Flushed stream, size = 2235

So although the offset is changing as expected, the output stream isn't
being flushed or cleared out, and the data isn't being written to the file...
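
Until append()/sync() is really there, one common workaround is to roll the output
file: close the current stream every so often, since data only becomes reliably
visible to readers after close(), and start a new, uniquely named file. A rough
sketch assuming a single writer thread (class and path names below are made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingHdfsWriter {
  private static final long ROLL_BYTES = 64L * 1024 * 1024;  // roll after ~64 MB
  private final FileSystem fs;
  private FSDataOutputStream out;
  private long written;

  public RollingHdfsWriter(Configuration conf) throws IOException {
    this.fs = FileSystem.get(conf);
  }

  public synchronized void write(String record) throws IOException {
    if (out == null) {
      // one file per roll, named by timestamp (hypothetical layout)
      out = fs.create(new Path("/events/" + System.currentTimeMillis() + ".log"));
      written = 0;
    }
    out.writeBytes(record);
    written += record.length();
    if (written >= ROLL_BYTES) {
      out.close();          // close() flushes and makes the data visible to readers
      out = null;
    }
  }

  public synchronized void close() throws IOException {
    if (out != null) {
      out.close();
      out = null;
    }
  }
}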

On Tue, May 12, 2009 at 5:26 PM, Sasha Dolgy  wrote:

> Right now data is received in parallel and is written to a queue, then a
> single thread reads the queue and writes those messages to an
> FSDataOutputStream, which is kept open, but the messages never get flushed.
> I tried flush() and sync() with no joy.
> 1.
> outputStream.writeBytes(rawMessage.toString());
>
> 2.
>
> log.debug("Flushing stream, size = " + s.getOutputStream().size());
>  s.getOutputStream().sync();
> log.debug("Flushed stream, size = " + s.getOutputStream().size());
>
> or
>
> log.debug("Flushing stream, size = " + s.getOutputStream().size());
> s.getOutputStream().flush();
>  log.debug("Flushed stream, size = " + s.getOutputStream().size());
>
> I just see size() remain the same after performing this action.
>
> This is using hadoop-0.20.0.
>
> -sd
>
>
> On Sun, May 10, 2009 at 4:45 PM, Stefan Podkowinski wrote:
>
>> You just can't have many distributed jobs write into the same file
>> without locking/synchronizing these writes. Even with append(), it's
>> no different from using a regular file from multiple processes in
>> this respect.
>> Maybe you need to collect your data up front before processing it in
>> Hadoop?
>> Have a look at Chukwa, http://wiki.apache.org/hadoop/Chukwa
>>
>>
>> On Sat, May 9, 2009 at 9:44 AM, Sasha Dolgy  wrote:
>> > Would WritableFactories not allow me to open one outputstream and
>> continue
>> > to write() and sync() ?
>> >
>> > Maybe I'm reading into that wrong.  Although UUID would be nice, it would
>> > still leave me with the problem of having lots of little files instead of
>> > a few large files.
>> >
>> > -sd
>> >
>> > On Sat, May 9, 2009 at 8:37 AM, jason hadoop 
>> wrote:
>> >
>> >> You must create unique file names; I don't believe (but I do not know)
>> >> that the append code will allow multiple writers.
>> >>
>> >> Are you writing from within a task, or as an external application
>> >> writing into Hadoop?
>> >>
>> >> You may try using UUID,
>> >> http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part
>> of
>> >> your
>> >> filename.
>> >> Without knowing more about your goals, environment and constraints it
>> is
>> >> hard to offer any more detailed suggestions.
>> >> You could also have an application aggregate the streams and write out
>> >> chunks, with one or more writers, one per output file.
>>
>


-- 
Sasha Dolgy
sasha.do...@gmail.com


Re: large files vs many files

2009-05-12 Thread Sasha Dolgy
Right now data is received in parallel and is written to a queue, then a
single thread reads the queue and writes those messages to an
FSDataOutputStream, which is kept open, but the messages never get flushed.
I tried flush() and sync() with no joy.
1.
outputStream.writeBytes(rawMessage.toString());

2.

log.debug("Flushing stream, size = " + s.getOutputStream().size());
s.getOutputStream().sync();
log.debug("Flushed stream, size = " + s.getOutputStream().size());

or

log.debug("Flushing stream, size = " + s.getOutputStream().size());
s.getOutputStream().flush();
log.debug("Flushed stream, size = " + s.getOutputStream().size());

I just see size() remain the same after performing this action.

This is using hadoop-0.20.0.

-sd

On Sun, May 10, 2009 at 4:45 PM, Stefan Podkowinski wrote:

> You just can't have many distributed jobs write into the same file
> without locking/synchronizing these writes. Even with append(), it's
> no different from using a regular file from multiple processes in
> this respect.
> Maybe you need to collect your data up front before processing it in
> Hadoop?
> Have a look at Chukwa, http://wiki.apache.org/hadoop/Chukwa
>
>
> On Sat, May 9, 2009 at 9:44 AM, Sasha Dolgy  wrote:
> > Would WritableFactories not allow me to open one outputstream and
> continue
> > to write() and sync() ?
> >
> > Maybe I'm reading into that wrong.  Although UUID would be nice, it would
> > still leave me with the problem of having lots of little files instead of a
> > few large files.
> >
> > -sd
> >
> > On Sat, May 9, 2009 at 8:37 AM, jason hadoop 
> wrote:
> >
> >> You must create unique file names; I don't believe (but I do not know)
> >> that the append code will allow multiple writers.
> >>
> >> Are you writing from within a task, or as an external application
> >> writing into Hadoop?
> >>
> >> You may try using UUID,
> >> http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part of
> >> your
> >> filename.
> >> Without knowing more about your goals, environment and constraints it is
> >> hard to offer any more detailed suggestions.
> >> You could also have an application aggregate the streams and write out
> >> chunks, with one or more writers, one per output file.
>


Re: How to do load control of MapReduce

2009-05-12 Thread Steve Loughran

Stefan Will wrote:

Yes, I think the JVM uses way more memory than just its heap. Now some of it
might be just reserved memory, but not actually used (not sure how to tell
the difference). There are also things like thread stacks, jit compiler
cache, direct nio byte buffers etc. that take up process space outside of
the Java heap. But none of that should imho add up to Gigabytes...


good article on this
http://www.ibm.com/developerworks/linux/library/j-nativememory-linux/



Re: How to do load control of MapReduce

2009-05-12 Thread Stefan Will
Yes, I think the JVM uses way more memory than just its heap. Now some of it
might be just reserved memory, but not actually used (not sure how to tell
the difference). There are also things like thread stacks, jit compiler
cache, direct nio byte buffers etc. that take up process space outside of
the Java heap. But none of that should imho add up to Gigabytes...

-- Stefan 


> From: zsongbo 
> Reply-To: 
> Date: Tue, 12 May 2009 20:06:37 +0800
> To: 
> Subject: Re: How to do load control of MapReduce
> 
> Yes, I also found that the TaskTracker should not use so much memory.
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>
>
> 32480 schubert  35  10 1411m 172m 9212 S    0  2.2   8:54.78 java
>
> The previous 1GB is the default value; I just changed the heap of the TT to
> 384MB an hour ago.
>
> I also found that the DataNode does not need too much memory.
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>
> 32399 schubert  25   0 1638m 372m 9208 S    2  4.7  32:46.28 java
>
>
> In fact, I define -Xmx512m in the child opts for the MapReduce tasks, but I found
> the child tasks use more memory than that:
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>
>
> 10577 schubert  30  10  942m 572m 9092 S   46  7.2  51:02.21 java
>
> 10507 schubert  29  10  878m 570m 9092 S   48  7.1  50:49.52 java
> 
> Schubert
> 
> On Tue, May 12, 2009 at 6:53 PM, Steve Loughran  wrote:
> 
>> zsongbo wrote:
>> 
>>> Hi Stefan,
>>> Yes, the 'nice' cannot resolve this problem.
>>> 
>>> Now, in my cluster, there are 8GB of RAM. My java heap configuration is:
>>> 
>>> HDFS DataNode : 1GB
>>> HBase-RegionServer: 1.5GB
>>> MR-TaskTracker: 1GB
>>> MR-child: 512MB   (max child task is 6, 4 map task + 2 reduce task)
>>> 
>>> But the memory usage is still tight.
>>> 
>> 
>> does TT need to be so big if you are running all your work in external VMs?
>> 




Re: How to do load control of MapReduce

2009-05-12 Thread zsongbo
Yes, I also found that the TaskTracker should not use so much memory.
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND


32480 schubert  35  10 1411m 172m 9212 S    0  2.2   8:54.78 java

The previous 1GB is the default value; I just changed the heap of the TT to
384MB an hour ago.

I also found that the DataNode does not need too much memory.
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

32399 schubert  25   0 1638m 372m 9208 S    2  4.7  32:46.28 java


In fact, I define -Xmx512m in the child opts for the MapReduce tasks, but I found
the child tasks use more memory than that:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND


10577 schubert  30  10  942m 572m 9092 S   46  7.2  51:02.21 java

10507 schubert  29  10  878m 570m 9092 S   48  7.1  50:49.52 java

Schubert

On Tue, May 12, 2009 at 6:53 PM, Steve Loughran  wrote:

> zsongbo wrote:
>
>> Hi Stefan,
>> Yes, the 'nice' cannot resolve this problem.
>>
>> Now, in my cluster, there are 8GB of RAM. My java heap configuration is:
>>
>> HDFS DataNode : 1GB
>> HBase-RegionServer: 1.5GB
>> MR-TaskTracker: 1GB
>> MR-child: 512MB   (max child task is 6, 4 map task + 2 reduce task)
>>
>> But the memory usage is still tight.
>>
>
> does TT need to be so big if you are running all your work in external VMs?
>


append() production support

2009-05-12 Thread Sasha Dolgy
Does anyone have any vague ideas when append() may be available for
production usage?
Thanks in advance
-sasha

-- 
Sasha Dolgy
sasha.do...@gmail.com


Re: How to do load control of MapReduce

2009-05-12 Thread Steve Loughran

zsongbo wrote:

Hi Stefan,
Yes, the 'nice' cannot resolve this problem.

Now, in my cluster, there are 8GB of RAM. My java heap configuration is:

HDFS DataNode : 1GB
HBase-RegionServer: 1.5GB
MR-TaskTracker: 1GB
MR-child: 512MB   (max child task is 6, 4 map task + 2 reduce task)

But the memory usage is still tight.


does TT need to be so big if you are running all your work in external VMs?


Re: Huge DataNode Virtual Memory Usage

2009-05-12 Thread Steve Loughran

Stefan Will wrote:

Raghu,

I don't actually have exact numbers from jmap, although I do remember that
jmap -histo reported something less than 256MB for this process (before I
restarted it).

I just looked at another DFS process that is currently running and has a VM
size of 1.5GB (~600 resident). Here jmap reports a total object heap usage
of 120MB. The memory block list reported by jmap  doesn't actually seem
to contain the heap at all since the largest block in that list is 10MB in
size (/usr/java/jdk1.6.0_10/jre/lib/amd64/server/libjvm.so). However, pmap
reports a total usage of 1.56GB.

-- Stefan


You know, if you could get the TaskTracker to include stats on real and
virtual memory use, I'm sure that others would welcome those reports -
knowing that the job was slower and its VM was 2x physical would give you
a good hint as to the root cause.


Re: Winning a sixty second dash with a yellow elephant

2009-05-12 Thread Steve Loughran

Arun C Murthy wrote:

... oh, and getting it to run a marathon too!

http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html 



Owen & Arun


Lovely. I will now stick up the pic of you getting the first results in 
on your laptop at apachecon


how to connect to remote hadoop dfs by eclipse plugin?

2009-05-12 Thread andy2005cst

When I use the Eclipse plugin hadoop-0.18.3-eclipse-plugin.jar and try to connect
to a remote Hadoop DFS, I get an IOException. If I run a map/reduce program it
outputs:
09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
/**.**.**.**:9100. Already tried 0 time(s).
09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
/**.**.**.**:9100. Already tried 1 time(s).
09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
/**.**.**.**:9100. Already tried 2 time(s).

Exception in thread "main" java.io.IOException: Call to /**.**.**.**:9100
failed on local exception: java.net.SocketException: Connection refused:
connect

Looking forward to your help. Thanks a lot.
-- 
View this message in context: 
http://www.nabble.com/how-to-connect-to-remote-hadoop-dfs-by-eclipse-plugin--tp23498736p23498736.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: How to do load control of MapReduce

2009-05-12 Thread zsongbo
Hi Stefan,
Yes, the 'nice' cannot resolve this problem.

Now, in my cluster, there are 8GB of RAM. My java heap configuration is:

HDFS DataNode : 1GB
HBase-RegionServer: 1.5GB
MR-TaskTracker: 1GB
MR-child: 512MB   (max child task is 6, 4 map task + 2 reduce task)

But the memory usage is still tight.

Schubert

On Tue, May 12, 2009 at 11:39 AM, Stefan Will  wrote:

> I'm having similar performance issues and have been running my Hadoop
> processes using a nice level of 10 for a while, and haven't noticed any
> improvement.
>
> In my case, I believe what's happening is that the peak combined RAM usage
> of all the Hadoop task processes and the service processes exceeds the
> amount of RAM on my machines. This in turn causes part of the server
> processes to get paged out to disk while the nightly Hadoop batch processes
> are running. Since the swap space is typically on the same physical disks
> as
> the DFS and MapReduce working directory, I'm heavily IO bound and real time
> queries pretty much slow down to a crawl.
>
> I think the key is to make absolutely sure that all of your processes fit
> in
> your available RAM at all times. I'm actually having a hard time achieving
> this since the virtual memory usage of the JVM is usually way higher than
> the maximum heap size (see my other thread).
>
> -- Stefan
>
>
> > From: zsongbo 
> > Reply-To: 
> > Date: Tue, 12 May 2009 10:58:49 +0800
> > To: 
> > Subject: Re: How to do load control of MapReduce
> >
> > Thanks Billy, I am trying 'nice' and will report the result later.
> >
> > On Tue, May 12, 2009 at 3:42 AM, Billy Pearson
> > wrote:
> >
> >> You might try setting the TaskTracker's Linux nice level to, say, 5 or 10,
> >> leaving the DFS and HBase settings at 0
> >>
> >> Billy
> >> "zsongbo"  wrote in message
> >> news:fa03480d0905110549j7f09be13qd434ca41c9f84...@mail.gmail.com...
> >>
> >>  Hi all,
> >>> Now, suppose we have a large dataset to process with MapReduce. The MapReduce
> >>> job will take as many machine resources as possible.
> >>>
> >>> So when one such big MapReduce job is running, the cluster becomes very busy
> >>> and almost cannot do anything else.
> >>>
> >>> For example, we have an HDFS + MapReduce + HBase cluster.
> >>> There is a large dataset in HDFS to be processed by MapReduce periodically,
> >>> and the workload is CPU and I/O heavy. The cluster also provides other
> >>> services for queries (querying HBase and reading files in HDFS). So, when
> >>> the job is running, the query latency becomes very long.
> >>>
> >>> Since the MapReduce job is not time sensitive, I want to control the load
> >>> of MapReduce. Do you have any advice?
> >>>
> >>> Thanks in advance.
> >>> Schubert
> >>>
> >>>
> >>
> >>
>
>
>