log files not found

2010-03-31 Thread Raghava Mutharaju
Hi all,

   I am running a series of jobs one after another. While executing the
4th job, it fails in the reducer: the progress percentage sits at map 100%,
reduce 99%, and it gives the following message:

10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
attempt_201003240138_0110_r_18_1, Status : FAILED
Task attempt_201003240138_0110_r_18_1 failed to report status for 602
seconds. Killing!

It makes several more attempts to execute it but fails with a similar
message. I couldn't get anything from this error message and wanted to look
at the logs (located in the default dir, ${HADOOP_HOME}/logs). But I don't
find any files which match the timestamp of the job. I also did not find
history and userlogs in the logs folder. Should I look at some other place
for the logs? What could be the possible causes of the above error?

   I am using Hadoop 0.20.2 and I am running it on a cluster with 14
nodes.

Thank you.

Regards,
Raghava.


Re: log

2010-03-31 Thread Amareshwari Sri Ramadasu
Along with the JobTracker maintaining history in ${hadoop.log.dir}/logs/history, in 
branch 0.20 the job history is also available in a user location. The user 
location can be specified via the configuration property "hadoop.job.history.user.location". 
By default, if nothing is specified for the configuration, the history will be 
created in the output directory of the job. The user history can be disabled by 
specifying the value "none" for the configuration.
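
For example, a minimal sketch of setting this per job from the client side (the property name and the "none" value are as above; the class name and HDFS path are only illustrative):

import org.apache.hadoop.mapred.JobConf;

public class HistoryLocationExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Keep the per-user job history in an explicit HDFS directory (illustrative path)
    conf.set("hadoop.job.history.user.location", "/user/gang/job-history");
    // ...or disable the user-location history entirely:
    // conf.set("hadoop.job.history.user.location", "none");
  }
}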

Gang, if you are not seeing the history for some of your jobs, there could be a 
couple of reasons.

 1.  Your job does not have any output directory. You can specify a different 
location for the user history.
 2.  Job history got disabled due to some problem with the job's configuration. You 
can check the JobTracker logs and verify whether the history got disabled.

Thanks
Amareshwari

On 4/1/10 12:13 AM, "Gang Luo"  wrote:

Thanks Abhishek.
but I observe that some of my job output has no such _log directory. Actually, 
I run a script which launch 100+ jobs. I didn't find the log for any of the 
output. Any ideas?

Thanks,
-Gang


- Original Message - 
From: abhishek sharma 
To: common-user@hadoop.apache.org
Sent: 2010/3/31 (Wed) 1:15:48 PM
Subject: Re: log

Gang,

In the log/history directory, two files are created for each job--one
xml file that records the configuration and the other file has log
entries. These log entries have all the information about the
individual map and reduce tasks related to a job--which nodes they ran
on, duration, input size, etc.

A single log/history directory is created by Hadoop and files related
to all the jobs executed are stored there.

Abhishek

On Tue, Mar 30, 2010 at 8:50 PM, Gang Luo  wrote:
> Hi all,
> I find there is a directory "_log/history/..." under the output directory of 
> a mapreduce job. Is the file in that directory a log file? Is the information 
> there sufficient to allow me to figure out what nodes the job runs on? 
> Besides, not every job has such a directory. Is there such settings 
> controlling this? Or is there other ways to get the nodes my job runs on?
>
> Thanks,
> -Gang
>
>
>
>







hadoop communication problem with server

2010-03-31 Thread himanshu chandola
Hi Everyone,
My hadoop install has been throwing such errors quite regularly now. After 
running for a while I get this error:
10/03/31 21:04:17 INFO mapred.JobClient: Communication problem with server: 
java.io.IOException: Call to hill.local/10.42.1.1:8021 failed on local 
exception: java.io.EOFException

This has been happening frequently on my cluster setup, and after this error 
things only get fixed by a restart. During this time, the web page for 
monitoring hadoop doesn't open and hadoop job -list hangs.

I would really appreciate any ideas.

Thanks
H

 Morpheus: Do you believe in fate, Neo?
Neo: No.
Morpheus: Why Not?
Neo: Because I don't like the idea that I'm not in control of my life.



  


Re: How to Recommission?

2010-03-31 Thread Allen Wittenauer



On 3/31/10 8:12 PM, "Zhanlei Ma"  wrote:

> But how to Recommission? Wish your help.

Take them out of your dfs.exclude file and run hadoop dfsadmin -refreshNodes again.




Re: Hadoop DFS IO Performance measurement

2010-03-31 Thread Jason Venner
I completely forgot that the RAID controller can be a bottleneck, as
can the disk connection strategy.
I don't remember what the PERCs top out at, and I don't recall the
aggregate actual bandwidth available for the disks.
I have a simple SSD that I get a steady 100MB/sec out of with SATA 1; I
would guess that SATA 2 tops out at about 150MB/sec in the real world.

Exercise each of your components in isolation,

i.e.: dd if=/dev/MY_RAID0 of=/dev/null bs=64k count=10
to get an idea of what the disk subsystem can deliver.

The dfs client passes everything through the socket layer, which adds
additional copying and latency.
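
To exercise the DFS client path in the same spirit, a rough sketch that just times a large sequential write through the FileSystem API (the class name, path and sizes are only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsWriteTimer {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    byte[] buf = new byte[64 * 1024];        // 64k writes, like the dd example
    long total = 1024L * 1024L * 1024L;      // 1 GB in total
    long start = System.currentTimeMillis();
    FSDataOutputStream out = fs.create(new Path("/tmp/dfsio-probe"));
    for (long written = 0; written < total; written += buf.length) {
      out.write(buf);
    }
    out.close();  // include close(): it waits for the write pipeline to finish
    double secs = (System.currentTimeMillis() - start) / 1000.0;
    System.out.println("MB/sec: " + ((total / (1024.0 * 1024.0)) / secs));
  }
}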

On Wed, Mar 31, 2010 at 6:31 PM, Jason Venner  wrote:
> Unless you are getting all local IO, and or you have better than GigE
> nic interfaces
> 100MB/sec is your cap.
>
> For local IO the bound is going to be your storage subsystem.
> Decent drives in a raid 0 interface are going to cap out on those
> machines about 400MB/sec, which is the buffer cache bandwidth on those
> processors/memory.
> realistically you are going to see a quite a bit less, but 200 should be 
> doable.
>
>
> On Wed, Mar 31, 2010 at 2:55 PM, Sagar Naik  wrote:
>> Hi Edson,
>>
>> usual commodity machine :
>> 8GB Ram, 2.5GHZ Intel Xeon,
>>
>> 6 Disks : 2 drives RAID 1 , 4 RAID 0 using PERC 6
>>
>> Centos , ext4
>>
>> Datanodes configured to use 4 RAID 0 drives
>> OS and hadoop installation on RAID 1
>>
>> -Sagar
>> On Mar 30, 2010, at 4:21 PM, Edson Ramiro wrote:
>>
>>> Hi Sagar,
>>>
>>> What hardware did you run it on ?
>>>
>>> Edson Ramiro
>>>
>>>
>>> On 30 March 2010 19:41, sagar naik  wrote:
>>>
 Hi All,

 I am trying to get DFS IO performance.
 I used TestDFSIO from hadoop jars.
 The results were abt 100Mbps read and write .
 I think it should be more than this

 Pl share some stats to compare

 Either I am missing something like  config params or something else


 -Sagar

>>
>>
>
>
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.amazon.com/dp/1430219424?tag=jewlerymall
> www.prohadoopbook.com a community for Hadoop Professionals
>



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Hadoop DFS IO Performance measurement

2010-03-31 Thread Jason Venner
Unless you are getting all local IO, and/or you have better than GigE
NIC interfaces, 100MB/sec is your cap.

For local IO the bound is going to be your storage subsystem.
Decent drives in a RAID 0 configuration are going to cap out on those
machines at about 400MB/sec, which is the buffer cache bandwidth on those
processors/memory.
Realistically you are going to see quite a bit less, but 200 should be doable.


On Wed, Mar 31, 2010 at 2:55 PM, Sagar Naik  wrote:
> Hi Edson,
>
> usual commodity machine :
> 8GB Ram, 2.5GHZ Intel Xeon,
>
> 6 Disks : 2 drives RAID 1 , 4 RAID 0 using PERC 6
>
> Centos , ext4
>
> Datanodes configured to use 4 RAID 0 drives
> OS and hadoop installation on RAID 1
>
> -Sagar
> On Mar 30, 2010, at 4:21 PM, Edson Ramiro wrote:
>
>> Hi Sagar,
>>
>> What hardware did you run it on ?
>>
>> Edson Ramiro
>>
>>
>> On 30 March 2010 19:41, sagar naik  wrote:
>>
>>> Hi All,
>>>
>>> I am trying to get DFS IO performance.
>>> I used TestDFSIO from hadoop jars.
>>> The results were abt 100Mbps read and write .
>>> I think it should be more than this
>>>
>>> Pl share some stats to compare
>>>
>>> Either I am missing something like  config params or something else
>>>
>>>
>>> -Sagar
>>>
>
>



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: swapping on hadoop

2010-03-31 Thread Scott Carey
On Linux, check out the 'swappiness' OS tunable -- you can turn this down from 
the default to reduce swapping at the expense of some system file cache.  
However, you want a decent chunk of RAM left for the system to cache files -- 
if it is all allocated and used by Hadoop there will be extra I/O.

For Java GC, if your -Xmx is above 600MB or so, try either changing 
-XX:NewRatio to a smaller number (default is 2 for Sun JDK 6) or setting the 
-XX:MaxNewSize parameter to around 150MB to 250MB.

An example of Hadoop memory use scaling as -Xmx grows:

Let's say you have a Hadoop job with a 100MB map-side join, and 250MB of hadoop 
sort space.

Both of these chunks of data will eventually get pushed to the tenured 
generation.   So, the actual heap required will end up close to:

(Size of young generation) + 100MB + 250MB + misc.  
The default size of the young generation is 1/3 of the heap.  So, at -Xmx750M 
this job will probably use a minimum of 600MB of java heap, plus about 50MB 
non-heap if this is a pure java job.

Now, perhaps due to some other jobs you want to set -Xmx1200M.  The above job 
will end up using about 150MB more now, because the new space has grown, 
although the footprint is the same.   A larger new space can improve 
performance, but with most typical hadoop jobs it won't.   Making sure it does 
not grow larger just because -Xmx is larger can help save a lot of memory.  
Additionally, a job that would have failed with an OOME at -Xmx1200M might pass 
at -Xmx1000M if the young generation takes 150MB instead of 400MB of the space.

If you are using a 64 bit JRE, you can also save space with the 
-XX:+UseCompressedOops option -- sometimes quite a bit of space.
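
For instance, a minimal sketch of wiring such flags into the task JVMs through the job configuration (the class name and exact sizes are illustrative and depend on your job):

import org.apache.hadoop.mapred.JobConf;

public class ChildJvmOptsExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Cap the young generation and enable compressed oops on a 64-bit JRE
    conf.set("mapred.child.java.opts",
        "-Xmx1000m -XX:MaxNewSize=200m -XX:+UseCompressedOops");
  }
}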

On Mar 30, 2010, at 10:15 AM, Vasilis Liaskovitis wrote:

> Hi all,
> 
> I 've noticed swapping for a single terasort job on a small 8-node
> cluster using hadoop-0.20.1. The swapping doesn't happen repeatably; I
> can have back to back runs of the same job from the same hdfs input
> data and get swapping only on 1 out of 4 identical runs. I 've noticed
> this swapping behaviour on both terasort jobs and hive query jobs.
> 
> - Focusing on a single job config, Is there a rule of thumb about how
> much node memory should be left for use outside of Child JVMs?
> I make sure that per Node, there is free memory:
> (#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) *
> JVMHeapSize < PhysicalMemoryonNode
> The total JVM heap size per node per job from the above equation
> currently account 65%-75% of the node's memory. (I 've tried
> allocating a riskier 90% of the node's memory, with similar swapping
> observations).
> 
> - Could there be an issue with HDFS data or metadata taking up memory?
> I am not cleaning output or intermediate outputs from HDFS between
> runs. Is this possible?
> 
> - Do people use any specific java flags (particularly garbage
> collection flags) for production environments where one job runs (or
> possibly more jobs run simultaneously) ?
> 
> - What are the memory requirements for the jobtracker,namenode and
> tasktracker,datanode JVMs?
> 
> - I am setting io.sort.mb to about half of the JVM heap size (half of
> -Xmx in javaopts). Should this be set to a different ratio? (this
> setting doesn't sound like it should be causing swapping in the first
> place).
> 
> - The buffer cache is cleaned before each run (flush and echo 3 >
> /proc/sys/vm/drop_caches)
> 
> any empirical advice and suggestions  to solve this are appreciated.
> thanks,
> 
> - Vasilis



Re: Hadoop DFS IO Performance measurement

2010-03-31 Thread Sagar Naik
Hi Edson, 

usual commodity machine :
8GB Ram, 2.5GHZ Intel Xeon, 

6 Disks : 2 drives RAID 1 , 4 RAID 0 using PERC 6

Centos , ext4

Datanodes configured to use 4 RAID 0 drives
OS and hadoop installation on RAID 1

-Sagar 
On Mar 30, 2010, at 4:21 PM, Edson Ramiro wrote:

> Hi Sagar,
> 
> What hardware did you run it on ?
> 
> Edson Ramiro
> 
> 
> On 30 March 2010 19:41, sagar naik  wrote:
> 
>> Hi All,
>> 
>> I am trying to get DFS IO performance.
>> I used TestDFSIO from hadoop jars.
>> The results were abt 100Mbps read and write .
>> I think it should be more than this
>> 
>> Pl share some stats to compare
>> 
>> Either I am missing something like  config params or something else
>> 
>> 
>> -Sagar
>> 



RE: is there any way we can limit Hadoop Datanode's disk usage?

2010-03-31 Thread Michael Segel



> From: awittena...@linkedin.com
> To: common-user@hadoop.apache.org
> Subject: Re: is there any way we can limit Hadoop Datanode's disk usage?
> Date: Wed, 31 Mar 2010 18:09:04 +
> 
> On 3/30/10 8:12 PM, "steven zhuang"  wrote:
> 
> > hi, guys,
> >we have some machine with 1T disk, some with 100GB disk,
> >I have this question that is there any means we can limit the
> > disk usage of datanodes on those machines with smaller disk?
> >thanks!
> 
> 
> You can use dfs.datanode.du.reserved, but be aware that are *no* limits on
> mapreduce's usage, other than what you can create with file system quotas.
> 
>  I've started recommended creating file system partitions in order to work
> around Hadoop's crazy space reservation ideas.
> 
Hmmm.

Our sysadmins decided to put each of the JBOD disks into their own volume 
group.
Kind of makes sense if you want to limit any impact that Hadoop could cause. 
(Assuming someone forgot to set up the dfs.datanode.du.reserved)

But I do agree that at a minimum, the file space used by hadoop should be a 
partition and not on the '/' (root) disk space.
  

OutOfMemoryError: Cannot create GC thread. Out of system resources

2010-03-31 Thread Edson Ramiro
Hi all,

When I run the pi Hadoop sample I get this error:

10/03/31 15:46:13 WARN mapred.JobClient: Error reading task outputhttp://
h04.ctinfra.ufpr.br:50060/tasklog?plaintext=true&taskid=attempt_201003311545_0001_r_02_0&filter=stdout
10/03/31 15:46:13 WARN mapred.JobClient: Error reading task outputhttp://
h04.ctinfra.ufpr.br:50060/tasklog?plaintext=true&taskid=attempt_201003311545_0001_r_02_0&filter=stderr
10/03/31 15:46:20 INFO mapred.JobClient: Task Id :
attempt_201003311545_0001_m_06_1, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 134.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

Maybe it's because the datanode can't create more threads.

ram...@lcpad:~/hadoop-0.20.2$ cat
logs/userlogs/attempt_201003311457_0001_r_01_2/stdout
#
# A fatal error has been detected by the Java Runtime Environment:
#
# java.lang.OutOfMemoryError: Cannot create GC thread. Out of system
resources.
#
#  Internal Error (gcTaskThread.cpp:38), pid=28840, tid=140010745776400
#  Error: Cannot create GC thread. Out of system resources.
#
# JRE version: 6.0_17-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (14.3-b01 mixed mode
linux-amd64 )
# An error report file with more information is saved as:
#
/var-host/tmp/hadoop-ramiro/mapred/local/taskTracker/jobcache/job_201003311457_0001/attempt_201003311457_0001_r_01_2/work/hs_err_pid28840.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

I configured the limits below, but I'm still getting the same error.

  
<property>
  <name>fs.inmemory.size.mb</name>
  <value>100</value>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx128M</value>
</property>

Do you know which limit I should configure to fix it?

Thanks in Advance

Edson Ramiro


Re: log

2010-03-31 Thread Gang Luo
Thanks Abhishek, 
but I observe that some of my job outputs have no such _log directory. Actually, 
I run a script which launches 100+ jobs, and I didn't find the log for any of the 
outputs. Any ideas?

Thanks,
-Gang


- Original Message - 
From: abhishek sharma 
To: common-user@hadoop.apache.org
Sent: 2010/3/31 (Wed) 1:15:48 PM
Subject: Re: log

Gang,

In the log/history directory, two files are created for each job--one
xml file that records the configuration and the other file has log
entries. These log entries have all the information about the
individual map and reduce tasks related to a job--which nodes they ran
on, duration, input size, etc.

A single log/history directory is created by Hadoop and files related
to all the jobs executed are stored there.

Abhishek

On Tue, Mar 30, 2010 at 8:50 PM, Gang Luo  wrote:
> Hi all,
> I find there is a directory "_log/history/..." under the output directory of 
> a mapreduce job. Is the file in that directory a log file? Is the information 
> there sufficient to allow me to figure out what nodes the job runs on? 
> Besides, not every job has such a directory. Is there such settings 
> controlling this? Or is there other ways to get the nodes my job runs on?
>
> Thanks,
> -Gang
>
>
>
>






Re: Redirecting hadoop log messages to a log file at client side

2010-03-31 Thread Alex Kozlov
Just for information, you can call the same code from the `hadoop` command;
the benefit is that a lot of java config parameters are set up for you (for
good or for bad):

hadoop <your.main.Class> <args>

and set HADOOP_CLASSPATH to include your custom jar (no need to include
hadoop jars this way).

It looks like there is a conflict between reading the resources in your
case...

But it looks like you know what you are doing.
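
As a concrete sketch of the redirect being discussed, one option is to attach a file appender to the log4j root logger so that the DFSClient output (routed through commons-logging) ends up in the client's own log file. The class name, file name and pattern below are made up:

import org.apache.log4j.FileAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class RedirectHadoopLogs {
  public static void main(String[] args) throws Exception {
    // Attach a file appender to the root logger so that log4j output from
    // DFSClient (via commons-logging) is appended to the client's log file.
    FileAppender appender = new FileAppender(
        new PatternLayout("%d{ISO8601} %p %c - %m%n"), "client.log", true);
    Logger.getRootLogger().addAppender(appender);
    Logger.getRootLogger().setLevel(Level.INFO);
    // ... client code that uses DFSClient goes here ...
  }
}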

Alex K

On Wed, Mar 31, 2010 at 4:24 AM, Pallavi Palleti <
pallavi.pall...@corp.aol.com> wrote:

> Hi Alex,
>
> I created a jar including my client code (specified in manifest) and needed
> jar files like hadoop-20.jar, log4j.jar, commons-logging.jar and ran the
> application as
> java -cp  -jar above_jar
> needed-parameters-for-client-code.
>
> I will explore using commons-logging in my client code.
>
> Thanks
> Pallavi
>
>
> On 03/30/2010 10:14 PM, Alex Kozlov wrote:
>
>> Hi Pallavi,
>>
>> DFSClient uses log4j.properties for configuration.  What is your
>> classpath?
>>  I need to know how exactly you invoke your program (java, hadoop script,
>> etc.).  The log level and appender is driven by the hadoop.root.logger
>> config variable.
>>
>> I would also recommend to use one logging system in the code, which will
>> be
>> commons-logging in this case.
>>
>> Alex K
>>
>> On Tue, Mar 30, 2010 at 12:12 AM, Pallavi Palleti<
>> pallavi.pall...@corp.aol.com>  wrote:
>>
>>
>>
>>> Hi Alex,
>>>
>>> Thanks for the reply. I have already created a logger (from
>>> log4j.logger)and configured the same to log it to a file and it is
>>> logging
>>> for all the log statements that I have in my client code. However, the
>>> error/info logs of DFSClient are going to stdout.  The DFSClient code is
>>> using log from commons-logging.jar. I am wondering how to redirect those
>>> logs (which are right now going to stdout) to append to the existing
>>> logger
>>> in client code.
>>>
>>> Thanks
>>> Pallavi
>>>
>>>
>>>
>>> On 03/30/2010 12:06 PM, Alex Kozlov wrote:
>>>
>>>
>>>
 Hi Pallavi,

 It depends what logging configuration you are using.  If it's log4j, you
 need to modify (or create) log4j.properties file and point you code (via
 classpath) to it.

 A sample log4j.properties is in the conf directory (either apache or CDH
 distributions).

 Alex K

 On Mon, Mar 29, 2010 at 11:25 PM, Pallavi Palleti<
 pallavi.pall...@corp.aol.com>   wrote:





> Hi,
>
> I am copying certain data from a client machine (which is not part of
> the
> cluster) using DFSClient to HDFS. During this process, I am
> encountering
> some issues and the error/info logs are going to stdout. Is there a
> way,
> I
> can configure the property at client side so that the error/info logs
> are
> appended to existing log file (being created using logger at client
> code)
> rather writing to stdout.
>
> Thanks
> Pallavi
>
>
>
>
>



>>>
>>>
>>
>>
>


Re: C++ pipes on full (nonpseudo) cluster

2010-03-31 Thread Allen Wittenauer
On 3/31/10 9:53 AM, "Keith Wiley"  wrote:
> I'm still trying to find a way to build the binary on my Mac instead of
> logging into some remote Linux machine. I ought to be able to build a binary
> on the Mac that runs on a Linux cluster if I use the right flags and link
> against the right libraries, but so far I've had no success on that front.

You'll need either a cross compiler or a virtual machine.

You're probably better off logging into that remote box.



Re: swapping on hadoop

2010-03-31 Thread Allen Wittenauer



On 3/30/10 10:15 AM, "Vasilis Liaskovitis"  wrote:
> - Focusing on a single job config, Is there a rule of thumb about how
> much node memory should be left for use outside of Child JVMs?
> I make sure that per Node, there is free memory:
> (#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) *
> JVMHeapSize < PhysicalMemoryonNode
> The total JVM heap size per node per job from the above equation
> currently account 65%-75% of the node's memory. (I 've tried
> allocating a riskier 90% of the node's memory, with similar swapping
> observations).

Java takes more RAM than just the heap size.  Sometimes 2-3x as much.

My general rule of thumb for general purpose grids is to plan on having
3-4gb of free VM (swap+physical) space for the OS, monitoring, datanode, and
task tracker processes.  After that, you can carve it up however you want.
If you are running on machines with less than that much, you likely are
going to face at least a little bit of swapping or have minimum monitoring
or whatever.




Re: is there any way we can limit Hadoop Datanode's disk usage?

2010-03-31 Thread Allen Wittenauer
On 3/30/10 8:12 PM, "steven zhuang"  wrote:

> hi, guys,
>we have some machine with 1T disk, some with 100GB disk,
>I have this question that is there any means we can limit the
> disk usage of datanodes on those machines with smaller disk?
>thanks!


You can use dfs.datanode.du.reserved, but be aware that there are *no* limits on
mapreduce's usage, other than what you can create with file system quotas.

 I've started recommending creating file system partitions in order to work
around Hadoop's crazy space reservation ideas.



Re: Query over DFSClient

2010-03-31 Thread Hairong Kuang
> However, I couldn't figure out where this exception is thrown.
After lastException is set, the exception is thrown when next
FSOutputStream.write or close is called.

Hairong

On 3/31/10 6:01 AM, "Ankur C. Goel"  wrote:

> Pallavi,
>  If a DFSClient is not able to write a block to any of the
> datanodes given by namenode it retries N times before aborting.
> See - https://issues.apache.org/jira/browse/HDFS-167
> This should be handled by the application as this indicates that something is
> seriously wrong with your cluster
> 
> Hope this helps
> -...@nkur
> 
> On 3/31/10 4:59 PM, "Pallavi Palleti"  wrote:
> 
> Hi,
> 
> I am looking into hadoop-20 source code for below issue. From DFSClient,
> I could see that once the datanodes given by namenode are not reachable,
> it is setting "lastException" variable to error message saying "recovery
> from primary datanode is failed N times, aborting.."(line No:2546 in
> processDataNodeError).  However, I couldn't figure out where this
> exception is thrown. I could see the throw statement in isClosed() but
> not finding the exact sequence after Streamer exits with lastException
> set to isClosed() method call. It would be great if some one could shed
> some light on this. I am essentially looking whether DFSClient
> approaches namenode in the case of failure of all datanodes that
> namenode has given for a given data block previously.
> 
> Thanks
> Pallavi
> 
> 
> On 03/30/2010 05:01 PM, Pallavi Palleti wrote:
>> Hi,
>> 
>> Could some one kindly let me know if the DFSClient takes care of
>> datanode failures and attempt to write to another datanode if primary
>> datanode (and replicated datanodes) fail. I looked into the souce code
>> of DFSClient and figured out that it attempts to write to one of the
>> datanodes in pipeline and fails if it failed to write to at least one
>> of them. However, I am not sure as I haven't explored fully. If so, is
>> there a way of querying namenode to provide different datanodes in the
>> case of failure. I am sure the Mapper would be doing similar
>> thing(attempting to fetch different datanode from namenode)  if it
>> fails to write to datanodes. Kindly let me know.
>> 
>> Thanks
>> Pallavi
>> 
> 



Re: log

2010-03-31 Thread abhishek sharma
Gang,

In the log/history directory, two files are created for each job--one
xml file that records the configuration and the other file has log
entries. These log entries have all the information about the
individual map and reduce tasks related to a job--which nodes they ran
on, duration, input size, etc.

A single log/history directory is created by Hadoop and files related
to all the jobs executed are stored there.
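
As a small illustration, a sketch that lists the per-job files under such a history directory via the FileSystem API (the class name is made up and the directory path is passed in as an argument):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListJobHistory {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path history = new Path(args[0]);   // e.g. the _log/history dir under a job's output
    FileStatus[] files = fs.listStatus(history);
    if (files == null) {
      return;  // the directory does not exist
    }
    for (FileStatus stat : files) {
      // expect one *_conf.xml plus one event log per job
      System.out.println(stat.getPath().getName() + "\t" + stat.getLen());
    }
  }
}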

Abhishek

On Tue, Mar 30, 2010 at 8:50 PM, Gang Luo  wrote:
> Hi all,
> I find there is a directory "_log/history/..." under the output directory of 
> a mapreduce job. Is the file in that directory a log file? Is the information 
> there sufficient to allow me to figure out what nodes the job runs on? 
> Besides, not every job has such a directory. Is there such settings 
> controlling this? Or is there other ways to get the nodes my job runs on?
>
> Thanks,
> -Gang
>
>
>
>


Re: C++ pipes on full (nonpseudo) cluster

2010-03-31 Thread Keith Wiley
I have solved the problem...I guess.  If one looks back through this thread, 
one will observe that one of the initial errors I was receiving was the 
following:

stderr logs
bash: 
/data/disk2/hadoop/mapred/local/taskTracker/archive/mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/c++_bin/Mosaic/Mosaic:
 cannot execute binary file

I had misinterpreted this as an indication that the file did not have 
executable permissions.  This misinterpretation was exacerbated by the 
observation that I could not set executable permissions on an HDFS file so 
files in HDFS really truly are nonexecutable.  However, it turns out that this 
is okay because the Pipes library enables executable permissions before running 
a binary.  The symptom nevertheless suggested my initial interpretation, that 
this was a file permissions problem.

The error I was receiving above was not a file permissions problem, it was an 
architecture mismatch between the binary and the cluster nodes.  A tremendously 
labored quest has converged on a delicate build configuration which, for the 
time being, seems to work.

I'm still trying to find a way to build the binary on my Mac instead of logging 
into some remote Linux machine. I ought to be able to build a binary on the Mac 
that runs on a Linux cluster if I use the right flags and link against the 
right libraries, but so far I've had no success on that front.

Thanks for your help.

Cheers!


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
  -- Keith Wiley







Re: DeDuplication Techniques

2010-03-31 Thread Joseph Stein
Thanks everyone.

I think we are going to go with HBase and use the hash map structures there
to keep uniqueness (table.exists(value)) on our values, and see how it
goes.
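
For what it's worth, a minimal sketch of that check against a 0.20-era HBase client API (the class, table and column names are made up):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupCheck {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "seen_hashes");
    byte[] hash = Bytes.toBytes(args[0]);
    if (table.exists(new Get(hash))) {
      System.out.println("duplicate, skip it");
    } else {
      Put put = new Put(hash);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("seen"), Bytes.toBytes(1L));
      table.put(put);   // remember the hash for future runs
      System.out.println("new value, aggregate it");
    }
  }
}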

I appreciate the insights. Being a developer, I do like the extended
sort and join ideas in MR and will likely use them for other things.
It just seems like a lot of work to get to the point of executing our
business logic (time/resources, etc.).

As the old saying goes "sometimes you only need chopsticks to catch a fly" =8^)

Thanks again!!!

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

On Fri, Mar 26, 2010 at 3:26 PM, Jeyendran Balakrishnan
 wrote:
> Joe,
>
> This is what I use for a related problem, using pure HDFS [no HBase]:
>
> 1. Run a one-time map-reduce job where you input your current historical file 
> of hashes [say it is of the format  in some kind of 
> flat file] using IdentityMapper and the output of your custom reducer is 
>  is  or maybe even  
> to save space. The important thing is to use MapFileOutputFormat for the 
> reducer output instead of the typical SequenceFileOutputFormat. Now you have 
> a single look-up table which you use for efficient lookup using your hash 
> keys.
> Note down the HDFS path of where you stored this mapfile, call it 
> dedupMapFile.
>
> 2. In your incremental data update job, pass the HDFS path of dedupMapFile to 
> your conf, then open the mapfile in your reducer's configure(), store the 
> reference to the mapfile in the class, and close it in close().
> Inside your reduce(), use the mapfile reference to look up your hash key; if 
> there is a hit, it is a dup (a lookup sketch follows after this list).
>
> 3. Also, for your reducer in 2. above, you can use a multiple output format 
> custom format, in which one of the outputs is your current output, and the 
> other is a new dedup output sequencefile which is in the same key-value 
> format as the dedupMapFile. So in the reduce() if the current key value is a 
> dup, discard it, else output to both your regular output, and the new dedup 
> output.
>
> 4. After each incremental update job, run a new map reduce job 
> [IdentityMapper and IdentityReducer] to merge the new dedup file with your 
> old dedupMapFile, resulting in the updated dedupMapFile.
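
(For step 2 above, a bare-bones sketch of the mapfile lookup, assuming the dedupMapFile was written with Text hash keys and Text values; the class name and path are illustrative:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class DedupMapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // dedupMapFile directory produced by the step-1 job (illustrative path)
    MapFile.Reader reader = new MapFile.Reader(fs, "/user/joe/dedupMapFile/part-00000", conf);
    Text value = new Text();
    // get() returns null when the key is absent, i.e. the record is not a dup
    boolean isDup = reader.get(new Text(args[0]), value) != null;
    System.out.println(args[0] + (isDup ? " is a dup" : " is new"));
    reader.close();
  }
}
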
>
> Some comments:
> * I didn't read your approach too closely, so I suspect you might be doing 
> something essentially like this already.
> * All this stuff is basically what HBase does for free, where your 
> dedupMapFile is now a HBase table, and you don't have to run Step 4, since 
> you can just write new [non-duplicate] hash-keys to the HBase table in Step 
> 3, and in Step 2, you just use table.exists(hash-key) to check if it is a 
> dup. You still need Step 1 to populate the table with your historical data.
>
> Hope this helps
>
> Cheers,
> jp
>
>
> -Original Message-
> From: Joseph Stein [mailto:crypt...@gmail.com]
> Sent: Thursday, March 25, 2010 11:35 AM
> To: common-user@hadoop.apache.org
> Subject: Re: DeDuplication Techniques
>
> The thing is I have to check historic data (meaning data I have
> already aggregated against) so I basically need to hold and read from
> a file of hashes.
>
> So within the current data set yes this would work but I then have to
> open a file, loop through the value, see it is not there.
>
> If it is there then throw it out, if not there add it to the end.
>
> To me this opening a file checking for dups is a map/reduce task in itself.
>
> What I was thinking is having my mapper take the data I wasn to
> validate as unique.  I then loop through the files filters.  each data
> point has a key that then allows me to get the file that has it's
> data. e.g. a part of the data partions the hash of the data so each
> file holds.  So my map job takes the data and breaks it into the
> key/value pair (the key allows me to look up my filter file).
>
> When it gets to the reducer... the key is the file I open up, I then
> open the file... loop through it... if it is there throw the data
> away.  if it is not there then add the hash of my data to the filter
> file and then output (as the reduce output) the value of the unique.
>
> This output of the unique is then the data I aggregate on which also
> updated my historic filter so the next job (5 minutes later) see it,
> etc.
>
> On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner  wrote:
>> Joe,
>>
>> what about this approach:
>>
>> using hashmap values as your keys in MR maps. Since they are sorted by keys,
>> in reducer you will get all duplicates together, so that you can loop
>> through them. As the simplest solution, you just take the first one.
>>
>> Sincerely,
>> Mark
>>
>> On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein  wrote:
>>
>>> I have been researching ways to handle de-dupping data while running a
>>> map/reduce program (so as to not re-calculate/re-aggregate data that
>>> we have seen before[possibly months before]).
>>>
>>> The data sets we have are littered with repeats of data from mobile
>>> devices which continue to come in over time (so 

Re: C++ pipes on full (nonpseudo) cluster

2010-03-31 Thread Gianluigi Zanetti
Could you please state exactly the steps you do in setting up the pipes
run?

The two critical things to watch are:

a/ where did you load the executable on hdfs e.g.,
$ ls pipes-bin
genreads
pair-reads
seqal

$ hadoop dfs -put pipes-bin pipes-bin

$ hadoop dfs -ls hdfs://host:53897/user/zag/pipes-bin
Found 3 items
-rw-r--r--   3 zag supergroup480 2010-03-17 12:28 
/user/zag/pipes-bin/genreads
-rw-r--r--   3 zag supergroup692 2010-03-17 15:02 
/user/zag/pipes-bin/pair_reads
-rw-r--r--   3 zag supergroup477 2010-03-17 12:33 
/user/zag/pipes-bin/seqal

b/ how you started the program

$ hadoop pipes -D hadoop.pipes.executable=pipes-bin/genreads -D 
hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true 
-input input  -output output


hope this is clear enough. 


--gianluigi

On Wed, 2010-03-31 at 06:57 -0700, Keith Wiley wrote:
> On 2010, Mar 31, at 4:25 AM, Gianluigi Zanetti wrote:
> 
> > What happens if you try this:
> >
> > $ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D  
> > hadoop.pipes.executable=EXECUTABLE -D  
> > hadoop.pipes.java.recordreader=true -D  
> > hadoop.pipes.java.recordwriter=true -input HDFSPATH/input -output  
> > HDFSPATH/output
> 
> 
> Not good news.  This is what I got:
> 
> $ hadoop pipes -D hadoop.pipes.executable=/Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input /uwphysics/kwiley/mosaic/input -output /uwphysics/kwiley/mosaic/output
> Exception in thread "main" java.io.FileNotFoundException: File does not exist: /Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>   at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
>   at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:681)
>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:802)
>   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:771)
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1290)
>   at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
>   at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
>   at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
> 
> Incidentally, just in case you're wondering:
> $ ls -l /Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/
> total 800
> 368 -rwxr-xr-x  1 keithwiley  keithwiley  185184 Mar 29 19:08 Mosaic*
> ...other files...
> 
> The path is obviously correct on my local machine.  The only  
> explanation is that Hadoop is looking for it on HDFS under that path.
> 
> I'm desperate.  I don't understand why I'm the only person who can get  
> this working.  Could you please describe to me the set of commands you  
> use to run a pipes program on a fully distributed cluster?
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com 
> music.keithwiley.com
> 
> "The easy confidence with which I know another man's religion is folly  
> teaches
> me to suspect that my own is also."
> --  Mark Twain
> 
> 


Re: C++ pipes on full (nonpseudo) cluster

2010-03-31 Thread Keith Wiley

On 2010, Mar 31, at 4:25 AM, Gianluigi Zanetti wrote:


What happens if you try this:

$ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D  
hadoop.pipes.executable=EXECUTABLE -D  
hadoop.pipes.java.recordreader=true -D  
hadoop.pipes.java.recordwriter=true -input HDFSPATH/input -output  
HDFSPATH/output



Not good news.  This is what I got:

$ hadoop pipes -D hadoop.pipes.executable=/Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input /uwphysics/kwiley/mosaic/input -output /uwphysics/kwiley/mosaic/output
Exception in thread "main" java.io.FileNotFoundException: File does not exist: /Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
    at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
    at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:681)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:802)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:771)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1290)
    at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
    at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
    at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)

Incidentally, just in case you're wondering:
$ ls -l /Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/
total 800
368 -rwxr-xr-x  1 keithwiley  keithwiley  185184 Mar 29 19:08 Mosaic*
...other files...

The path is obviously correct on my local machine.  The only  
explanation is that Hadoop is looking for it on HDFS under that path.


I'm desperate.  I don't understand why I'm the only person who can get  
this working.  Could you please describe to me the set of commands you  
use to run a pipes program on a fully distributed cluster?



Keith Wiley kwi...@keithwiley.com keithwiley.com 
music.keithwiley.com


"The easy confidence with which I know another man's religion is folly  
teaches

me to suspect that my own is also."
   --  Mark Twain




Re: Query over DFSClient

2010-03-31 Thread Ankur C. Goel
Pallavi,
 If a DFSClient is not able to write a block to any of the 
datanodes given by namenode it retries N times before aborting.
See - https://issues.apache.org/jira/browse/HDFS-167
This should be handled by the application as this indicates that something is 
seriously wrong with your cluster
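
A rough sketch of what that application-side handling could look like, simply catching the IOException that surfaces on write()/close() and retrying the whole file (the class name, retry count and path are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RetryingWriter {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path target = new Path("/user/pallavi/upload.dat");   // illustrative
    byte[] data = "payload".getBytes();
    for (int attempt = 1; attempt <= 3; attempt++) {
      try {
        FSDataOutputStream out = fs.create(target, true);  // overwrite on retry
        out.write(data);
        out.close();   // the aborted-pipeline exception can surface here too
        break;
      } catch (IOException e) {
        System.err.println("write attempt " + attempt + " failed: " + e);
        if (attempt == 3) {
          throw e;   // give up; the cluster is likely unhealthy
        }
      }
    }
  }
}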

Hope this helps
-...@nkur

On 3/31/10 4:59 PM, "Pallavi Palleti"  wrote:

Hi,

I am looking into hadoop-20 source code for below issue. From DFSClient,
I could see that once the datanodes given by namenode are not reachable,
it is setting "lastException" variable to error message saying "recovery
from primary datanode is failed N times, aborting.."(line No:2546 in
processDataNodeError).  However, I couldn't figure out where this
exception is thrown. I could see the throw statement in isClosed() but
not finding the exact sequence after Streamer exits with lastException
set to isClosed() method call. It would be great if some one could shed
some light on this. I am essentially looking whether DFSClient
approaches namenode in the case of failure of all datanodes that
namenode has given for a given data block previously.

Thanks
Pallavi


On 03/30/2010 05:01 PM, Pallavi Palleti wrote:
> Hi,
>
> Could some one kindly let me know if the DFSClient takes care of
> datanode failures and attempt to write to another datanode if primary
> datanode (and replicated datanodes) fail. I looked into the souce code
> of DFSClient and figured out that it attempts to write to one of the
> datanodes in pipeline and fails if it failed to write to at least one
> of them. However, I am not sure as I haven't explored fully. If so, is
> there a way of querying namenode to provide different datanodes in the
> case of failure. I am sure the Mapper would be doing similar
> thing(attempting to fetch different datanode from namenode)  if it
> fails to write to datanodes. Kindly let me know.
>
> Thanks
> Pallavi
>



Re: Query over DFSClient

2010-03-31 Thread Pallavi Palleti

Hi,

I am looking into the hadoop-20 source code for the issue below. From DFSClient, 
I could see that once the datanodes given by the namenode are not reachable, 
it is setting the "lastException" variable to an error message saying "recovery 
from primary datanode is failed N times, aborting.." (line No: 2546 in 
processDataNodeError).  However, I couldn't figure out where this 
exception is thrown. I could see the throw statement in isClosed(), but 
I am not finding the exact sequence from the Streamer exiting with lastException 
set to the isClosed() method call. It would be great if someone could shed 
some light on this. I am essentially looking at whether DFSClient 
approaches the namenode in the case of failure of all the datanodes that 
the namenode has given for a given data block previously.


Thanks
Pallavi


On 03/30/2010 05:01 PM, Pallavi Palleti wrote:

Hi,

Could some one kindly let me know if the DFSClient takes care of 
datanode failures and attempt to write to another datanode if primary 
datanode (and replicated datanodes) fail. I looked into the souce code 
of DFSClient and figured out that it attempts to write to one of the 
datanodes in pipeline and fails if it failed to write to at least one 
of them. However, I am not sure as I haven't explored fully. If so, is 
there a way of querying namenode to provide different datanodes in the 
case of failure. I am sure the Mapper would be doing similar 
thing(attempting to fetch different datanode from namenode)  if it 
fails to write to datanodes. Kindly let me know.


Thanks
Pallavi



Re: C++ pipes on full (nonpseudo) cluster

2010-03-31 Thread Gianluigi Zanetti
What happens if you try this:

$ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D 
hadoop.pipes.executable=EXECUTABLE -D hadoop.pipes.java.recordreader=true -D 
hadoop.pipes.java.recordwriter=true -input HDFSPATH/input -output 
HDFSPATH/output
> Deleted hdfs://mainclusternn.hipods.ihost.com/HDFSPATH/output

On Tue, 2010-03-30 at 15:05 -0700, Keith Wiley wrote:
> $ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D 
> hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true 
> -input HDFSPATH/input -output HDFSPATH/output -program HDFSPATH/EXECUTABLE
> Deleted hdfs://mainclusternn.hipods.ihost.com/HDFSPATH/output
> 10/03/30 14:56:55 WARN mapred.JobClient: No job jar file set.  User classes 
> may not be found. See JobConf(Class) or JobConf#setJar(String).
> 10/03/30 14:56:55 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 10/03/30 14:57:05 INFO mapred.JobClient: Running job: job_201003241650_1076
> 10/03/30 14:57:06 INFO mapred.JobClient:  map 0% reduce 0%
> ^C
> $
> 
> At that point the terminal hung, so I eventually ctrl-Ced to break it.  Now 
> if I investigate the Hadoop task logs for the mapper, I see this:
> 
> stderr logs
> bash: 
> /data/disk2/hadoop/mapred/local/taskTracker/archive/mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/c++_bin/Mosaic/Mosaic:
>  cannot execute binary file
> 
> ...which makes perfect sense in light of the following:
> 
> $ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
> Found 1 items
> -rw-r--r--   1 kwiley uwphysics 211808 2010-03-30 10:26 
> /uwphysics/kwiley/mosaic/c++_bin/Mosaic
> $ hd fs -chmod 755 /uwphysics/kwiley/mosaic/c++_bin/Mosaic
> $ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
> Found 1 items
> -rw-r--r--   1 kwiley uwphysics 211808 2010-03-30 10:26 
> /uwphysics/kwiley/mosaic/c++_bin/Mosaic
> $
> 
> Note that this is all in attempt to run an executable that was uploaded to 
> HDFS in advance.  In this example I am not attempting to run an executable 
> stored on my local machine.  Any attempt to do that results in a file not 
> found error:
> 
> $ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D 
> hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true 
> -input HDFSPATH/input -output HDFSPATH/output -program LOCALPATH/EXECUTABLE
> Deleted hdfs://mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/output
> Exception in thread "main" java.io.FileNotFoundException: File does not 
> exist: /Users/kwiley/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>   at 
> org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
>   at 
> org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:681)
>   at 
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:802)
>   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:771)
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1290)
>   at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
>   at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
>   at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
> $
> 
> It's clearly looking or the executable in HDFS, not on the local system, thus 
> the file not found error.
> 
> 
> Keith Wiley   kwi...@keithwiley.com   
> www.keithwiley.com
> 
> "What I primarily learned in grad school is how much I *don't* know.
> Consequently, I left grad school with a higher ignorance to knowledge ratio 
> than
> when I entered."
>   -- Keith Wiley
> 
> 
> 
> 
> 


Re: Redirecting hadoop log messages to a log file at client side

2010-03-31 Thread Pallavi Palleti

Hi Alex,

I created a jar including my client code (specified in manifest) and 
needed jar files like hadoop-20.jar, log4j.jar, commons-logging.jar and 
ran the application as
java -cp  -jar above_jar 
needed-parameters-for-client-code.


I will explore using commons-logging in my client code.

Thanks
Pallavi

On 03/30/2010 10:14 PM, Alex Kozlov wrote:

Hi Pallavi,

DFSClient uses log4j.properties for configuration.  What is your classpath?
  I need to know how exactly you invoke your program (java, hadoop script,
etc.).  The log level and appender is driven by the hadoop.root.logger
config variable.

I would also recommend to use one logging system in the code, which will be
commons-logging in this case.

Alex K

On Tue, Mar 30, 2010 at 12:12 AM, Pallavi Palleti<
pallavi.pall...@corp.aol.com>  wrote:

   

Hi Alex,

Thanks for the reply. I have already created a logger (from
log4j.logger)and configured the same to log it to a file and it is logging
for all the log statements that I have in my client code. However, the
error/info logs of DFSClient are going to stdout.  The DFSClient code is
using log from commons-logging.jar. I am wondering how to redirect those
logs (which are right now going to stdout) to append to the existing logger
in client code.

Thanks
Pallavi



On 03/30/2010 12:06 PM, Alex Kozlov wrote:

 

Hi Pallavi,

It depends what logging configuration you are using.  If it's log4j, you
need to modify (or create) log4j.properties file and point you code (via
classpath) to it.

A sample log4j.properties is in the conf directory (either apache or CDH
distributions).

Alex K

On Mon, Mar 29, 2010 at 11:25 PM, Pallavi Palleti<
pallavi.pall...@corp.aol.com>   wrote:



   

Hi,

I am copying certain data from a client machine (which is not part of the
cluster) using DFSClient to HDFS. During this process, I am encountering
some issues and the error/info logs are going to stdout. Is there a way,
I
can configure the property at client side so that the error/info logs are
appended to existing log file (being created using logger at client code)
rather writing to stdout.

Thanks
Pallavi



 


   
 
   


Re: swapping on hadoop

2010-03-31 Thread Amogh Vasekar
Hi,
>>(#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) * JVMHeapSize < 
>>PhysicalMemoryonNode
The tasktracker and datanode daemons also take up memory, 1GB each by default I 
think. Is that accounted for?

>> Could there be an issue with HDFS data or metadata taking up memory?
Is the namenode a separate machine or participates in compute nodes too?

>>What are the memory requirements for the jobtracker,namenode and 
>>tasktracker,datanode JVMs?
See above, there was a thread running on this on the forum sometime back, to 
manipulate these values for TT and DN.

>>this setting doesn't sound like it should be causing swapping in the first 
>>place ( io.sort.mb)
I think so too :)

Just yesterday I read a tweet on machine configs for Hadoop, hope it helps you
http://bit.ly/cphF7R

Amogh


On 3/30/10 10:45 PM, "Vasilis Liaskovitis"  wrote:

Hi all,

I 've noticed swapping for a single terasort job on a small 8-node
cluster using hadoop-0.20.1. The swapping doesn't happen repeatably; I
can have back to back runs of the same job from the same hdfs input
data and get swapping only on 1 out of 4 identical runs. I 've noticed
this swapping behaviour on both terasort jobs and hive query jobs.

- Focusing on a single job config, Is there a rule of thumb about how
much node memory should be left for use outside of Child JVMs?
I make sure that per Node, there is free memory:
(#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) *
JVMHeapSize < PhysicalMemoryonNode
The total JVM heap size per node per job from the above equation
currently account 65%-75% of the node's memory. (I 've tried
allocating a riskier 90% of the node's memory, with similar swapping
observations).

- Could there be an issue with HDFS data or metadata taking up memory?
I am not cleaning output or intermediate outputs from HDFS between
runs. Is this possible?

- Do people use any specific java flags (particularly garbage
collection flags) for production environments where one job runs (or
possibly more jobs run simultaneously) ?

- What are the memory requirements for the jobtracker,namenode and
tasktracker,datanode JVMs?

- I am setting io.sort.mb to about half of the JVM heap size (half of
-Xmx in javaopts). Should this be set to a different ratio? (this
setting doesn't sound like it should be causing swapping in the first
place).

- The buffer cache is cleaned before each run (flush and echo 3 >
/proc/sys/vm/drop_caches)

any empirical advice and suggestions  to solve this are appreciated.
thanks,

- Vasilis



Re: java.io.IOException: Function not implemented

2010-03-31 Thread Steve Loughran

Edson Ramiro wrote:

May be it's a bug.

I'm not the admin. : (

so, I'll talk to him and may be he install a 2.6.32.9 in another node to
test  : )

Thanks

Edson Ramiro


On 30 March 2010 20:00, Todd Lipcon  wrote:


Hi Edson,

I noticed that only the h01 nodes are running 2.6.32.9, the other broken
DNs
are 2.6.32.10.

Is there some reason you are running a kernel that is literally 2 weeks
old?
I wouldn't be at all surprised if there were a bug here, or some issue with
your Debian "unstable" distribution...



If you are running the SCM trunk of the OS, you are part of the dev 
team. They will be grateful for the bugs you find and fix, but you get 
to find and fix them. In Ant, one bug report was that <touch> stopped setting 
dates in the past; it turned out that on the Debian nightly builds, you 
couldn't touch any file into the past...


-steve





[ANN] Eclipse GIT plugin beta version released

2010-03-31 Thread Thomas Koch
GIT is one of the most popular distributed version control systems. 
In the hope that more Java developers may want to explore the world of easy 
branching, merging and patch management, I'd like to inform you that a beta 
version of the upcoming Eclipse GIT plugin is available:

http://www.infoq.com/news/2010/03/egit-released
http://aniszczyk.org/2010/03/22/the-start-of-an-adventure-egitjgit-0-7-1/

Maybe, one day, some apache / hadoop projects will use GIT... :-)

(Yes, I know git.apache.org.)

Best regards,

Thomas Koch, http://www.koch.ro