Redirecting hadoop log messages to a log file at client side

2010-03-30 Thread Pallavi Palleti

Hi,

I am copying certain data from a client machine (which is not part of 
the cluster) to HDFS using DFSClient. During this process, I am 
encountering some issues and the error/info logs are going to stdout. Is 
there a way I can configure a property on the client side so that the 
error/info logs are appended to the existing log file (created by the 
logger in the client code) rather than written to stdout?


Thanks
Pallavi


Re: hadoop.log.dir

2010-03-30 Thread Alex Kozlov
HADOOP_LOG_DIR is used to set hadoop.log.dir (see bin/hadoop).  It is passed
to the JVM via the -D java flag (or you can set it in the log4j.properties file).

The best way for you would be to set this variable in conf/hadoop-env.sh
(essentially, uncomment the prepared stub).
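
For example (just a sketch; the directory is a placeholder):

# conf/hadoop-env.sh
export HADOOP_LOG_DIR=/var/log/hadoop

bin/hadoop then passes it to the JVM as -Dhadoop.log.dir=$HADOOP_LOG_DIR.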

Alex K

On Mon, Mar 29, 2010 at 10:55 PM, Amareshwari Sri Ramadasu 
amar...@yahoo-inc.com wrote:

 hadoop.log.dir is not a config parameter; it is a system property.
 You can specify the log directory in the environment variable
 HADOOP_LOG_DIR.

 Thanks
 Amareshwari

 On 3/30/10 11:17 AM, Vasilis Liaskovitis vlias...@gmail.com wrote:

 Hi all,

 is there a config option that controls the placement of all hadoop logs?
 I'd like to put all hadoop logs under a specific directory, e.g. /tmp,
 on the namenode and all datanodes.

 Is hadoop.log.dir the right config? Can I change this in the
 log4j.properties file, or pass it e.g. in the JVM opts as
 -Dhadoop.log.dir=/tmp ?
 I am using hadoop-0.20.1 or hadoop-0.20.2.

 thanks,

 - Vasilis




Re: Redirecting hadoop log messages to a log file at client side

2010-03-30 Thread Alex Kozlov
Hi Pallavi,

It depends on what logging configuration you are using.  If it's log4j, you
need to modify (or create) a log4j.properties file and point your code to it
(via the classpath).

A sample log4j.properties is in the conf directory (either apache or CDH
distributions).
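
If it helps, a minimal log4j.properties along those lines could look like this
(the file path and appender name are just placeholders):

log4j.rootLogger=INFO, RFA
log4j.appender.RFA=org.apache.log4j.RollingFileAppender
log4j.appender.RFA.File=/var/log/myclient/client.log
log4j.appender.RFA.MaxFileSize=10MB
log4j.appender.RFA.MaxBackupIndex=5
log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n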

Alex K

On Mon, Mar 29, 2010 at 11:25 PM, Pallavi Palleti 
pallavi.pall...@corp.aol.com wrote:

 Hi,

 I am copying certain data from a client machine (which is not part of the
 cluster) using DFSClient to HDFS. During this process, I am encountering
 some issues and the error/info logs are going to stdout. Is there a way, I
 can configure the property at client side so that the error/info logs are
 appended to existing log file (being created using logger at client code)
 rather writing to stdout.

 Thanks
 Pallavi



Re: Redirecting hadoop log messages to a log file at client side

2010-03-30 Thread Pallavi Palleti

Hi Alex,

Thanks for the reply. I have already created a logger (a log4j Logger), 
configured it to log to a file, and it is logging all the log statements 
that I have in my client code. However, the error/info logs of DFSClient 
are going to stdout. The DFSClient code uses a Log from 
commons-logging.jar. I am wondering how to redirect those logs (which are 
right now going to stdout) so that they are appended to the existing log 
file used by the client code.


Thanks
Pallavi


On 03/30/2010 12:06 PM, Alex Kozlov wrote:

Hi Pallavi,

It depends what logging configuration you are using.  If it's log4j, you
need to modify (or create) log4j.properties file and point you code (via
classpath) to it.

A sample log4j.properties is in the conf directory (either apache or CDH
distributions).

Alex K

On Mon, Mar 29, 2010 at 11:25 PM, Pallavi Palleti
pallavi.pall...@corp.aol.com  wrote:

   

Hi,

I am copying certain data from a client machine (which is not part of the
cluster) using DFSClient to HDFS. During this process, I am encountering
some issues and the error/info logs are going to stdout. Is there a way, I
can configure the property at client side so that the error/info logs are
appended to existing log file (being created using logger at client code)
rather writing to stdout.

Thanks
Pallavi

 
   


Re: Single datanode setup

2010-03-30 Thread Ankur C. Goel

M/R performance is known to be better when using just a bunch of disks (JBOD) 
instead of RAID.

From your setup it looks like your single datanode must be running hot on I/O 
activity.

The parameter dfs.datanode.handler.count only controls the number of datanode 
threads serving IPC requests.
These are NOT used for actual block transfer. Try upping 
dfs.datanode.max.xcievers.
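
A sketch of the hdfs-site.xml entry (the value is only a common starting point,
not a tuned recommendation):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>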

You can then run the I/O benchmarks to measure the I/O throughput:
hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
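
The matching read and cleanup runs would be along the lines of:

hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean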

-...@nkur

On 3/30/10 12:46 PM, Ed Mazur ma...@cs.umass.edu wrote:

Hi,

I have a 12 node cluster where instead of running a DN on each compute
node, I'm running just one DN backed by a large RAID (with a
dfs.replication of 1). The compute node storage is limited, so the
idea behind this was to free up more space for intermediate job data.
So the cluster has that one node with the DN, a master node with the
JT/NN, and 10 compute nodes each with a TT. I am running 0.20.1+169.68
from Cloudera.

The problem is that MR job performance is now worse than when using a
traditional HDFS setup. A job that took 76 minutes before now takes
169 minutes. I've used this single DN setup before on a
similarly-sized cluster without any problems, so what can I do to find
the bottleneck?

-Loading data into HDFS was fast, under 30 minutes to load ~240GB, so
I'm thinking this is a DN-to-map-task communication problem.

-With a traditional HDFS setup, map tasks were taking 10-30 seconds,
but they now take 45-90 seconds or more.

-I grep'd the DN logs to find how long the size 67633152 HDFS reads
(map inputs) were taking. With the central DN, the reads were an order
of magnitude slower than with traditional HDFS (e.g. 82008147000 vs.
8238455000).

-I tried increasing dfs.datanode.handler.count to 10, but this didn't
seem to have any effect.

-Could low memory be an issue? The machine the DN is running on only
has 2GB and there is less than 100MB free without the DN running. I
haven't observed any swapping going on though.

-I looked at netstat during a job. I wasn't too sure what to look for,
but I didn't see any substantial send/receive buffering.

I've tried everything I can think of, so I'd really appreciate any tips. Thanks.

Ed



Re: java.io.IOException: Function not implemented

2010-03-30 Thread Steve Loughran

Edson Ramiro wrote:

I'm not involved with Debian community :(


I think you are now...


Re: why does 'jps' lose track of hadoop processes ?

2010-03-30 Thread Steve Loughran

Marcos Medrado Rubinelli wrote:
jps gets its information from the files stored under /tmp/hsperfdata_*, 
so when a cron job clears your /tmp directory, it also erases these 
files. You can submit jobs as long as your jobtracker and namenode are 
responding to requests over TCP, though.


I never knew that.

ps -ef | grep java works quite well; jps has fairly steep startup costs 
and if a JVM is playing up, jps can hang too




Listing subdirectories in Hadoop

2010-03-30 Thread Santiago Pérez

Hej

I've been checking the API and searching online, but I have not found any method for
listing the subdirectories of a given directory in HDFS. 

Can anybody show me how to get the list of subdirectories or even how to
implement the method? (I guess that it should be possible and not very
hard).

Thanks in advance ;)
-- 
View this message in context: 
http://old.nabble.com/Listing-subdirectories-in-Hadoop-tp28084164p28084164.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



a question about automatic restart of the NameNode

2010-03-30 Thread 毛宏
Hi all,  
 Is automatic restart and failover of the NameNode software to
another machine available in Hadoop 0.20.2?  



Re: Listing subdirectories in Hadoop

2010-03-30 Thread Ted Yu
Does this get what you want ?
hadoop dfs -ls path | grep drwx

On Tue, Mar 30, 2010 at 8:24 AM, Santiago Pérez elara...@gmail.com wrote:


 Hej

 I've checking the API and on internet but I have not found any method for
 listing the subdirectories of a given directory in the HDFS.

 Can anybody show me how to get the list of subdirectories or even how to
 implement the method? (I guess that it should be possible and not very
 hard).

 Thanks in advance ;)
 --
 View this message in context:
 http://old.nabble.com/Listing-subdirectories-in-Hadoop-tp28084164p28084164.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: a question about automatic restart of the NameNode

2010-03-30 Thread Ted Yu
Please refer to highavailability contrib of 0.20.2:
HDFS-976
http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html

On Tue, Mar 30, 2010 at 8:51 AM, 毛宏 maohong1...@gmail.com wrote:

 Hi all,
 Does automatic restart and failover of the NameNode software to
 another machine available in hadoop 0.20.2?




Re: Redirecting hadoop log messages to a log file at client side

2010-03-30 Thread Alex Kozlov
Hi Pallavi,

DFSClient uses log4j.properties for configuration.  What is your classpath?
 I need to know exactly how you invoke your program (java, hadoop script,
etc.).  The log level and appender are driven by the hadoop.root.logger
config variable.

I would also recommend using one logging system in the code, which would be
commons-logging in this case.
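
As a sketch (class name and paths are placeholders), if the stock
conf/log4j.properties is first on your classpath you can route everything to a
file with the system properties it already understands:

java -cp /path/to/hadoop/conf:<hadoop and commons-logging jars>:myclient.jar \
     -Dhadoop.root.logger=INFO,DRFA \
     -Dhadoop.log.dir=/var/log/myclient \
     -Dhadoop.log.file=client.log \
     com.example.MyClient

This assumes the stock file, which defines the DRFA appender writing to
${hadoop.log.dir}/${hadoop.log.file}.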

Alex K

On Tue, Mar 30, 2010 at 12:12 AM, Pallavi Palleti 
pallavi.pall...@corp.aol.com wrote:

 Hi Alex,

 Thanks for the reply. I have already created a logger (from
 log4j.logger)and configured the same to log it to a file and it is logging
 for all the log statements that I have in my client code. However, the
 error/info logs of DFSClient are going to stdout.  The DFSClient code is
 using log from commons-logging.jar. I am wondering how to redirect those
 logs (which are right now going to stdout) to append to the existing logger
 in client code.

 Thanks
 Pallavi



 On 03/30/2010 12:06 PM, Alex Kozlov wrote:

 Hi Pallavi,

 It depends what logging configuration you are using.  If it's log4j, you
 need to modify (or create) log4j.properties file and point you code (via
 classpath) to it.

 A sample log4j.properties is in the conf directory (either apache or CDH
 distributions).

 Alex K

 On Mon, Mar 29, 2010 at 11:25 PM, Pallavi Palleti
 pallavi.pall...@corp.aol.com  wrote:



 Hi,

 I am copying certain data from a client machine (which is not part of the
 cluster) using DFSClient to HDFS. During this process, I am encountering
 some issues and the error/info logs are going to stdout. Is there a way,
 I
 can configure the property at client side so that the error/info logs are
 appended to existing log file (being created using logger at client code)
 rather writing to stdout.

 Thanks
 Pallavi








Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
Please disregard this thread.  I started another thread which is more specific 
and pertinent to my problem...but if you have any helpful information, please 
respond to the other thread.  I need to get this figured out.

Thank you.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!
  -- Homer Simpson






swapping on hadoop

2010-03-30 Thread Vasilis Liaskovitis
Hi all,

I've noticed swapping for a single terasort job on a small 8-node
cluster using hadoop-0.20.1. The swapping doesn't happen repeatably; I
can have back-to-back runs of the same job from the same hdfs input
data and get swapping on only 1 out of 4 identical runs. I've noticed
this swapping behaviour on both terasort jobs and hive query jobs.

- Focusing on a single job config, is there a rule of thumb about how
much node memory should be left for use outside of Child JVMs?
I make sure that per node there is free memory:
(#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) *
JVMHeapSize < PhysicalMemoryonNode
The total JVM heap size per node per job from the above equation
currently accounts for 65%-75% of the node's memory. (I've tried
allocating a riskier 90% of the node's memory, with similar swapping
observations).

- Could there be an issue with HDFS data or metadata taking up memory?
I am not cleaning output or intermediate outputs from HDFS between
runs. Is this possible?

- Do people use any specific java flags (particularly garbage
collection flags) for production environments where one job runs (or
possibly more jobs run simultaneously) ?

- What are the memory requirements for the jobtracker,namenode and
tasktracker,datanode JVMs?

- I am setting io.sort.mb to about half of the JVM heap size (half of
-Xmx in javaopts). Should this be set to a different ratio? (this
setting doesn't sound like it should be causing swapping in the first
place).

- The buffer cache is cleaned before each run (flush and echo 3 >
/proc/sys/vm/drop_caches)

any empirical advice and suggestions  to solve this are appreciated.
thanks,

- Vasilis


Re: Listing subdirectories in Hadoop

2010-03-30 Thread A Levine
If you were talking about looking at directories within a Java
program, here is what has worked for me.

FileSystem fs;
FileStatus[] fileStat;
Path[] fileList;
SequenceFile.Reader reader = null;
try{
 // connect to the file system
 fs = FileSystem.get(conf);

 // get the stat on all files in the source directory
 fileStat = fs.listStatus(sourceDir);

 // get paths to the files in the source directory
 fileList = FileUtil.stat2Paths(fileStat);

// then you can do something like
for(int x = 0; x < fileList.length; x++){
 System.out.println(x + " " + fileList[x]);
}
} catch(IOException ioe){
// do something
}
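
Since the original question was specifically about subdirectories, a small
variation on the above (same assumptions about fs and sourceDir) keeps only
the directories:

// keep only the subdirectories of sourceDir
for (FileStatus stat : fs.listStatus(sourceDir)) {
  if (stat.isDir()) {
    System.out.println(stat.getPath());
  }
}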

Hope this helps.

andrew

--

On Tue, Mar 30, 2010 at 11:54 AM, Ted Yu yuzhih...@gmail.com wrote:
 Does this get what you want ?
 hadoop dfs -ls path | grep drwx

 On Tue, Mar 30, 2010 at 8:24 AM, Santiago Pérez elara...@gmail.com wrote:


 Hej

 I've checking the API and on internet but I have not found any method for
 listing the subdirectories of a given directory in the HDFS.

 Can anybody show me how to get the list of subdirectories or even how to
 implement the method? (I guess that it should be possible and not very
 hard).

 Thanks in advance ;)
 --
 View this message in context:
 http://old.nabble.com/Listing-subdirectories-in-Hadoop-tp28084164p28084164.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.





CfP with Extended Deadline 5th Workshop on Virtualization in High-Performance Cloud Computing (VHPC'10)

2010-03-30 Thread Michael Alexander
Apologies if you received multiple copies of this message.


=

CALL FOR PAPERS

5th Workshop on

Virtualization in High-Performance Cloud Computing

VHPC'10

as part of Euro-Par 2010, Island of Ischia-Naples, Italy

=

Date: August 31, 2010

Euro-Par 2010: http://www.europar2010.org/

Workshop URL: http://vhpc.org

SUBMISSION DEADLINE:

Abstracts: April 4, 2010 (extended)
Full Paper: June 19, 2010 (extended) 


Scope:

Virtualization has become a common abstraction layer in modern data
centers, enabling resource owners to manage complex infrastructure
independently of their applications. Conjointly virtualization is
becoming a driving technology for a manifold of industry grade IT
services. Piloted by the Amazon Elastic Compute Cloud services, the
cloud concept includes the notion of a separation between resource
owners and users, adding services such as hosted application
frameworks and queuing. Utilizing the same infrastructure, clouds
carry significant potential for use in high-performance scientific
computing. The ability of clouds to provide for requests and releases
of vast computing resource dynamically and close to the marginal cost
of providing the services is unprecedented in the history of
scientific and commercial computing.

Distributed computing concepts that leverage federated resource access
are popular within the grid community, but have not seen previously
desired deployed levels so far. Also, many of the scientific
datacenters have not adopted virtualization or cloud concepts yet.

This workshop aims to bring together industrial providers with the
scientific community in order to foster discussion, collaboration and
mutual exchange of knowledge and experience.

The workshop will be one day in length, composed of 20 min paper
presentations, each followed by 10 min discussion sections.
Presentations may be accompanied by interactive demonstrations. It
concludes with a 30 min panel discussion by presenters.

TOPICS

Topics include, but are not limited to, the following subjects:

- Virtualization in cloud, cluster and grid HPC environments
- VM cloud, cluster load distribution algorithms
- Cloud, cluster and grid filesystems
- QoS and service level guarantees
- Cloud programming models, APIs and databases
- Software as a service (SaaS)
- Cloud provisioning
- Virtualized I/O
- VMMs and storage virtualization
- MPI, PVM on virtual machines
- High-performance network virtualization
- High-speed interconnects
- Hypervisor extensions
- Tools for cluster and grid computing
- Xen/other VMM cloud/cluster/grid tools
- Raw device access from VMs
- Cloud reliability, fault-tolerance, and security
- Cloud load balancing
- VMs - power efficiency
- Network architectures for VM-based environments
- VMMs/Hypervisors
- Hardware support for virtualization
- Fault tolerant VM environments
- Workload characterizations for VM-based environments
- Bottleneck management
- Metering
- VM-based cloud performance modeling
- Cloud security, access control and data integrity
- Performance management and tuning hosts and guest VMs
- VMM performance tuning on various load types
- Research and education use cases
- Cloud use cases
- Management of VM environments and clouds
- Deployment of VM-based environments



PAPER SUBMISSION

Papers submitted to the workshop will be reviewed by at least two
members of the program committee and external reviewers. Submissions
should include abstract, key words, the e-mail address of the
corresponding author, and must not exceed 10 pages, including tables
and figures at a main font size no smaller than 11 point. Submission
of a paper should be regarded as a commitment that, should the paper
be accepted, at least one of the authors will register and attend the
conference to present the work.

Accepted papers will be published in the Springer LNCS series - the
format must be according to the Springer LNCS Style. Initial
submissions are in PDF; accepted papers will be requested to provide
source files.

Format Guidelines: http://www.springer.de/comp/lncs/authors.html

Submission Link: http://edas.info/newPaper.php?c=8553


IMPORTANT DATES

April 4 - Abstract submission due (extended)
May 19 - Full paper submission (extended)
July 14 - Acceptance notification
August 3 - Camera-ready version due
August 31 - September 3 - conference


CHAIR

Michael Alexander (chair), scaledinfra technologies GmbH, Austria
Gianluigi Zanetti (co-chair), CRS4, Italy


PROGRAM COMMITTEE

Padmashree Apparao, Intel Corp., USA
Volker Buege, University of Karlsruhe, Germany
Roberto Canonico, University of Napoli Federico II, Italy
Tommaso Cucinotta, Scuola Superiore Sant'Anna, Italy
Werner Fischer, Thomas Krenn AG, Germany
William Gardner, University of Guelph, Canada
Wolfgang Gentzsch, DEISA. Max Planck Gesellschaft, Germany
Derek Groen, UVA, The Netherlands
Marcus Hardt, 

Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
No responses yet, although I admit it's only been a few hours.

As a follow-up, permit me to pose the following question:

Is it, in fact, impossible to run C++ pipes on a fully-distributed system (as 
opposed to a pseudo-distributed system)?  I haven't found any definitive 
clarification on this topic one way or the other.  The only statement that I 
found in the least bit illuminating is in the O'Reilly book (not official 
Hadoop documentation mind you), p.38, which states:

"To run a Pipes job, we need to run Hadoop in pseudo-distributed mode...Pipes 
doesn't run in standalone (local) mode, since it relies on Hadoop's distributed 
cache mechanism, which works only when HDFS is running."

The phrasing of those statements is a little unclear in that the distinction 
being made appears to be between standalone and pseudo-distributed mode, 
without any specific reference to fully-distributed mode.  Namely, the section 
that qualifies the need for pseudo-distributed mode (the need for HDFS) would 
obviously also apply to full distributed mode despite the lack of mention of 
fully distributed mode in the quoted section.  So can pipes run in fully 
distributed mode or not?

Bottom line, I can't get C++ pipes to work on a fully distributed cluster yet 
and I don't know if I am wasting my time, if this is a truly impossible effort 
or if it can be done and I simply haven't figured out how to do it yet.

Thanks for any help.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also.
  -- Mark Twain






Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Gianluigi Zanetti
Hello.
Did you try following the tutorial in 
http://wiki.apache.org/hadoop/C++WordCount ?

We use C++ pipes in production on a large cluster, and it works.

--gianluigi


On Tue, 2010-03-30 at 13:28 -0700, Keith Wiley wrote:
 No responses yet, although I admit it's only been a few hours.
 
 As a follow-up, permit me to pose the following question:
 
 Is it, in fact, impossible to run C++ pipes on a fully-distributed system (as 
 opposed to a pseudo-distributed system)?  I haven't found any definitive 
 clarification on this topic one way or the other.  The only statement that I 
 found in the least bit illuminating is in the O'Reilly book (not official 
 Hadoop documentation mind you), p.38, which states:
 
 To run a Pipes job, we need to run Hadoop in pseudo-distributed mode...Pipes 
 doesn't run in standalone (local) mode, since it relies on Hadoop's 
 distributed cache mechanism, which works only when HDFS is running.
 
 The phrasing of those statements is a little unclear in that the distinction 
 being made appears to be between standalone and pseudo-distributed mode, 
 without any specific reference to fully-distributed mode.  Namely, the 
 section that qualifies the need for pseudo-distributed mode (the need for 
 HDFS) would obviously also apply to full distributed mode despite the lack of 
 mention of fully distributed mode in the quoted section.  So can pipes run in 
 fully distributed mode or not?
 
 Bottom line, I can't get C++ pipes to work on a fully distributed cluster yet 
 and I don't know if I am wasting my time, if this is a truly impossible 
 effort or if it can be done and I simply haven't figured out how to do it yet.
 
 Thanks for any help.
 
 
 Keith Wiley   kwi...@keithwiley.com   
 www.keithwiley.com
 
 The easy confidence with which I know another man's religion is folly teaches
 me to suspect that my own is also.
   -- Mark Twain
 
 
 
 


Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
Yep, tried and tried and tried it.  Works perfectly on a pseudo-distributed 
cluster which is why I didn't think the example or the code was the problem, 
but rather that the cluster was the problem.

I have only just (in the last two minutes) heard back from the administrator of 
our cluster and he says the pipes package is not installed on the cluster...so 
that kinda explains it, although I'm still unclear what the symptoms would be 
for various kinds of problems.  In other words, I'm not sure if the errors I 
got were the result of the lack of a pipes package on the cluster or if I still 
wasn't doing it right.

At any rate, it sounds like pipes is an additional extraneous add-on during 
cluster configuration and that our cluster didn't add it.

Does that make sense to you?...that pipes needs to be enabled on the cluster, 
not merely run properly by the user?

Thanks.

Cheers!

On Mar 30, 2010, at 13:43 , Gianluigi Zanetti wrote:

 Hello.
 Did you try following the tutorial in 
 http://wiki.apache.org/hadoop/C++WordCount ?
 
 We use C++ pipes in production on a large cluster, and it works.
 
 --gianluigi



Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy.
  -- Edwin A. Abbott, Flatland






Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
My cluster admin noticed that there is some additional pipes package he could 
add to the cluster configuration, but he admits to knowing very little about 
how the C++ pipes component of Hadoop works.

Can you offer any insight into this cluster configuration package?  What 
exactly does it do that makes a cluster capable of running pipes programs (and 
what symptom should its absence present from a user's point of view)?

On Mar 30, 2010, at 13:43 , Gianluigi Zanetti wrote:

 Hello.
 Did you try following the tutorial in 
 http://wiki.apache.org/hadoop/C++WordCount ?
 
 We use C++ pipes in production on a large cluster, and it works.
 
 --gianluigi



Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use.
  -- Galileo Galilei






Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Gianluigi Zanetti
What are the symptoms? 
Pipes should run out of the box in a standard installation.
BTW what version of bash are you using? Is it bash 4.0 by any chance?
See https://issues.apache.org/jira/browse/HADOOP-6388

--gianluigi


On Tue, 2010-03-30 at 14:13 -0700, Keith Wiley wrote:
 My cluster admin noticed that there is some additional pipes package he could 
 add to the cluster configuration, but he admits to knowing very little about 
 how the C++ pipes component of Hadoop works.
 
 Can you offer any insight into this cluster configuration package?  What 
 exactly does it do that makes a cluster capable of running pipes programs 
 (and what symptom should its absence present from a user's point of view)?
 
 On Mar 30, 2010, at 13:43 , Gianluigi Zanetti wrote:
 
  Hello.
  Did you try following the tutorial in 
  http://wiki.apache.org/hadoop/C++WordCount ?
  
  We use C++ pipes in production on a large cluster, and it works.
  
  --gianluigi
 
 
 
 Keith Wiley   kwi...@keithwiley.com   
 www.keithwiley.com
 
 I do not feel obliged to believe that the same God who has endowed us with
 sense, reason, and intellect has intended us to forgo their use.
   -- Galileo Galilei
 
 
 
 


Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
The closest I've gotten so far is for the job to basically try to start up but 
to get an error complaining about the permissions on the executable 
binary...which makes perfect sense since the permissions are not executable.  
Problem is, the hdfs chmod command ignores the execute bits.  For example, 
"hd fs -chmod 755 somefile" yields -rw-r--r--.  The x is simply dropped from 
the mode.  This makes sense to me in light of documentation (O'Reilly p.47) 
that indicates HDFS doesn't support executable file permissions, but it leaves 
me perplexed how any file could ever be executable under HDFS or Hadoop in 
general.

Using slightly different attempts at the pipes command I usually get errors 
that the executable is not found.  This occurs when I point to a local file for 
the executable instead of one uploaded to HDFS.  In other words, I haven't 
found any way to run pipes such that the executable starts out on the local 
machine and is automatically distributed to the cluster as a component of the 
pipes command.  Rather, it seems that the executable must already reside in 
HDFS and be indicated during the pipes command (ala -program or 
hadoop.pipes.executable of course).  I have even tried adding the -files 
option to pipes, but so far to no positive effect.

I'll send another post with some specific transcripts of what I'm seeing.

One could ask, w.r.t. the -program flag for pipes, should that indicate a 
local path, an hdfs path, or are both options possible?

As to bash, I'm running on a 10.6.2 Mac, thus:

$ bash --version
bash --version
GNU bash, version 3.2.48(1)-release (x86_64-apple-darwin10.0)
Copyright (C) 2007 Free Software Foundation, Inc.

...so not v4.0 as you asked.

On Mar 30, 2010, at 14:29 , Gianluigi Zanetti wrote:

 What are the symptoms? 
 Pipes should run out of the box in a standard installation.
 BTW what version of bash are you using? Is it bash 4.0 by any chance?
 See https://issues.apache.org/jira/browse/HADOOP-6388
 
 --gianluigi
 
 
 On Tue, 2010-03-30 at 14:13 -0700, Keith Wiley wrote:
 My cluster admin noticed that there is some additional pipes package he 
 could add to the cluster configuration, but he admits to knowing very little 
 about how the C++ pipes component of Hadoop works.
 
 Can you offer any insight into this cluster configuration package?  What 
 exactly does it do that makes a cluster capable of running pipes programs 
 (and what symptom should its absence present from a user's point of view)?
 
 On Mar 30, 2010, at 13:43 , Gianluigi Zanetti wrote:
 
 Hello.
 Did you try following the tutorial in 
 http://wiki.apache.org/hadoop/C++WordCount ?
 
 We use C++ pipes in production on a large cluster, and it works.
 
 --gianluigi
 
 
 
 Keith Wiley   kwi...@keithwiley.com   
 www.keithwiley.com
 
 I do not feel obliged to believe that the same God who has endowed us with
 sense, reason, and intellect has intended us to forgo their use.
  -- Galileo Galilei
 
 
 
 



Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me.
  -- Abe (Grandpa) Simpson






Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
$ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D 
hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true 
-input HDFSPATH/input -output HDFSPATH/output -program HDFSPATH/EXECUTABLE
Deleted hdfs://mainclusternn.hipods.ihost.com/HDFSPATH/output
10/03/30 14:56:55 WARN mapred.JobClient: No job jar file set.  User classes may 
not be found. See JobConf(Class) or JobConf#setJar(String).
10/03/30 14:56:55 INFO mapred.FileInputFormat: Total input paths to process : 1
10/03/30 14:57:05 INFO mapred.JobClient: Running job: job_201003241650_1076
10/03/30 14:57:06 INFO mapred.JobClient:  map 0% reduce 0%
^C
$

At that point the terminal hung, so I eventually ctrl-Ced to break it.  Now if 
I investigate the Hadoop task logs for the mapper, I see this:

stderr logs
bash: 
/data/disk2/hadoop/mapred/local/taskTracker/archive/mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/c++_bin/Mosaic/Mosaic:
 cannot execute binary file

...which makes perfect sense in light of the following:

$ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
Found 1 items
-rw-r--r--   1 kwiley uwphysics 211808 2010-03-30 10:26 
/uwphysics/kwiley/mosaic/c++_bin/Mosaic
$ hd fs -chmod 755 /uwphysics/kwiley/mosaic/c++_bin/Mosaic
$ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
Found 1 items
-rw-r--r--   1 kwiley uwphysics 211808 2010-03-30 10:26 
/uwphysics/kwiley/mosaic/c++_bin/Mosaic
$

Note that this is all in attempt to run an executable that was uploaded to HDFS 
in advance.  In this example I am not attempting to run an executable stored on 
my local machine.  Any attempt to do that results in a file not found error:

$ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D 
hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true 
-input HDFSPATH/input -output HDFSPATH/output -program LOCALPATH/EXECUTABLE
Deleted hdfs://mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/output
Exception in thread "main" java.io.FileNotFoundException: File does not exist: 
/Users/kwiley/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
at 
org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
at 
org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:681)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:802)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:771)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1290)
at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
$

It's clearly looking for the executable in HDFS, not on the local system, thus 
the file not found error.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered.
  -- Keith Wiley







Hadoop DFS IO Performance measurement

2010-03-30 Thread sagar naik
Hi All,

I am trying to get DFS IO performance.
I used TestDFSIO from hadoop jars.
The results were about 100 Mbps read and write.
I think it should be more than this.

Please share some stats to compare.

Either I am missing something (like config params) or something else is going on.


-Sagar


Re: java.io.IOException: Function not implemented

2010-03-30 Thread Edson Ramiro
Hi all,

Thanks for help Todd and Steve,

I configured Hadoop (0.20.2) again and I'm getting the same error (Function
not implemented).

Do you think it's a Hadoop bug?

This is the situation:

I've 28 nodes where just four are running the datanode.

On all other nodes the tasktracker is running ok.

The NN and JT are running ok.

The configuration of the machines is the same; it's an NFS-shared home.

On all machines the Java version is 1.6.0_17.

This is the kernel version of each node. Note that there are two kernel versions
among the failing nodes and
the datanode doesn't work on either; it works just on the h0* machines.

ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh uname -a  | sort
a01: Linux a01 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a02: Linux a02 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a03: Linux a03 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a04: Linux a04 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a05: Linux a05 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a06: Linux a06 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a07: Linux a07 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a09: Linux a09 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a10: Linux a10 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
ag06: Linux ag06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
ag07: Linux ag07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
bl02: Linux bl02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
bl03: Linux bl03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
bl04: Linux bl04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
bl06: Linux bl06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
bl07: Linux bl07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
ct02: Linux ct02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
ct03: Linux ct03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
ct04: Linux ct04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
ct06: Linux ct06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
h01: Linux h01 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
h02: Linux h02 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
h03: Linux h03 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
h04: Linux h04 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
sd02: Linux sd02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
sd05: Linux sd05 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
sd06: Linux sd06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
sd07: Linux sd07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux


These are the java processes running on each client.
Just the h0* machines are running ok.

ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh pgrep -lc java | sort
a01: 1
a02: 1
a03: 1
a04: 1
a05: 1
a06: 1
a07: 1
a09: 1
a10: 1
ag06: 1
ag07: 1
bl02: 1
bl03: 1
bl04: 1
bl06: 1
bl07: 1
ct02: 1
ct03: 1
ct04: 1
ct06: 1
h01: 2
h02: 2
h03: 2
h04: 2
sd02: 1
sd05: 1
sd06: 1
sd07: 1

This is my configuration:

ram...@lcpad:~/hadoop-0.20.2$ cat conf/*site*
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://lcpad:9000</value>
  </property>
</configuration>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>lcpad:9001</value>
  </property>
</configuration>

Thanks in Advance,

Edson Ramiro


On 30 March 2010 05:58, Steve Loughran ste...@apache.org wrote:

 Edson Ramiro wrote:

 I'm not involved with Debian community :(


 I think you are now...



Re: java.io.IOException: Function not implemented

2010-03-30 Thread Todd Lipcon
Hi Edson,

I noticed that only the h01 nodes are running 2.6.32.9, the other broken DNs
are 2.6.32.10.

Is there some reason you are running a kernel that is literally 2 weeks old?
I wouldn't be at all surprised if there were a bug here, or some issue with
your Debian unstable distribution...

-Todd

On Tue, Mar 30, 2010 at 3:54 PM, Edson Ramiro erlfi...@gmail.com wrote:

 Hi all,

 Thanks for help Todd and Steve,

 I configured Hadoop (0.20.2) again and I'm getting the same error (Function
 not implemented).

 Do you think it's a Hadoop bug?

 This is the situation:

 I've 28 nodes where just four are running the datanode.

 In all other nodes the tasktracker in running ok.

 The NN and JT are running ok.

 The configuration of the machines is the same, its a nfs shared home.

 In all machines the Java version is 1.6.0_17.

 This is the kernel version of the nodes, note that are two versions and in
 both the
 datanode doesn't work. Just in the h0* machines.

 ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh uname -a  | sort
 a01: Linux a01 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
 a02: Linux a02 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
 a03: Linux a03 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
 a04: Linux a04 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
 a05: Linux a05 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
 a06: Linux a06 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
 a07: Linux a07 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
 a09: Linux a09 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
 a10: Linux a10 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
 ag06: Linux ag06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 ag07: Linux ag07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 bl02: Linux bl02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 bl03: Linux bl03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 bl04: Linux bl04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 bl06: Linux bl06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 bl07: Linux bl07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 ct02: Linux ct02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 ct03: Linux ct03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 ct04: Linux ct04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 ct06: Linux ct06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 h01: Linux h01 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
 h02: Linux h02 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
 h03: Linux h03 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
 h04: Linux h04 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
 sd02: Linux sd02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 sd05: Linux sd05 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 sd06: Linux sd06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux
 sd07: Linux sd07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
 GNU/Linux


 These are the java processes running on each clients.
 Jjust the h0* machines are running ok.

 ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh pgrep -lc java | sort
 a01: 1
 a02: 1
 a03: 1
 a04: 1
 a05: 1
 a06: 1
 a07: 1
 a09: 1
 a10: 1
 ag06: 1
 ag07: 1
 bl02: 1
 bl03: 1
 bl04: 1
 bl06: 1
 bl07: 1
 ct02: 1
 ct03: 1
 ct04: 1
 ct06: 1
 h01: 2
 h02: 2
 h03: 2
 h04: 2
 sd02: 1
 sd05: 1
 sd06: 1
 sd07: 1

 This is my configuration:

 ram...@lcpad:~/hadoop-0.20.2$ cat conf/*site*
 ?xml version=1.0?
 ?xml-stylesheet type=text/xsl href=configuration.xsl?

 !-- Put site-specific property overrides in this file. --

 configuration
 property
 namefs.default.name/name
 valuehdfs://lcpad:9000/value
 /property
 /configuration
 ?xml version=1.0?
 ?xml-stylesheet type=text/xsl href=configuration.xsl?

 !-- Put site-specific property overrides in this file. --

 configuration
 property
 namedfs.replication/name
 value1/value
 /property
 /configuration
 ?xml version=1.0?
 ?xml-stylesheet type=text/xsl href=configuration.xsl?

 !-- Put site-specific property overrides in this file. --

 configuration
  property
namemapred.job.tracker/name
valuelcpad:9001/value
  /property
 /configuration

 Thanks in Advance,

 Edson Ramiro


 On 30 March 2010 05:58, Steve Loughran ste...@apache.org wrote:

  Edson Ramiro wrote:
 
  I'm not involved with Debian community :(
 
 
  I think you are now...
 




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: java.io.IOException: Function not implemented

2010-03-30 Thread Edson Ramiro
Maybe it's a bug.

I'm not the admin. :(

So I'll talk to him and maybe he'll install 2.6.32.9 on another node to
test. :)

Thanks

Edson Ramiro


On 30 March 2010 20:00, Todd Lipcon t...@cloudera.com wrote:

 Hi Edson,

 I noticed that only the h01 nodes are running 2.6.32.9, the other broken
 DNs
 are 2.6.32.10.

 Is there some reason you are running a kernel that is literally 2 weeks
 old?
 I wouldn't be at all surprised if there were a bug here, or some issue with
 your Debian unstable distribution...

 -Todd

 On Tue, Mar 30, 2010 at 3:54 PM, Edson Ramiro erlfi...@gmail.com wrote:

  Hi all,
 
  Thanks for help Todd and Steve,
 
  I configured Hadoop (0.20.2) again and I'm getting the same error
 (Function
  not implemented).
 
  Do you think it's a Hadoop bug?
 
  This is the situation:
 
  I've 28 nodes where just four are running the datanode.
 
  In all other nodes the tasktracker in running ok.
 
  The NN and JT are running ok.
 
  The configuration of the machines is the same, its a nfs shared home.
 
  In all machines the Java version is 1.6.0_17.
 
  This is the kernel version of the nodes, note that are two versions and
 in
  both the
  datanode doesn't work. Just in the h0* machines.
 
  ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh uname -a  | sort
  a01: Linux a01 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
 GNU/Linux
  a02: Linux a02 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
 GNU/Linux
  a03: Linux a03 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
 GNU/Linux
  a04: Linux a04 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
 GNU/Linux
  a05: Linux a05 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
 GNU/Linux
  a06: Linux a06 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
 GNU/Linux
  a07: Linux a07 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
 GNU/Linux
  a09: Linux a09 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
 GNU/Linux
  a10: Linux a10 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
 GNU/Linux
  ag06: Linux ag06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  ag07: Linux ag07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  bl02: Linux bl02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  bl03: Linux bl03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  bl04: Linux bl04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  bl06: Linux bl06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  bl07: Linux bl07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  ct02: Linux ct02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  ct03: Linux ct03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  ct04: Linux ct04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  ct06: Linux ct06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  h01: Linux h01 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64
 GNU/Linux
  h02: Linux h02 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64
 GNU/Linux
  h03: Linux h03 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64
 GNU/Linux
  h04: Linux h04 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64
 GNU/Linux
  sd02: Linux sd02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  sd05: Linux sd05 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  sd06: Linux sd06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
  sd07: Linux sd07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
  GNU/Linux
 
 
  These are the java processes running on each clients.
  Jjust the h0* machines are running ok.
 
  ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh pgrep -lc java | sort
  a01: 1
  a02: 1
  a03: 1
  a04: 1
  a05: 1
  a06: 1
  a07: 1
  a09: 1
  a10: 1
  ag06: 1
  ag07: 1
  bl02: 1
  bl03: 1
  bl04: 1
  bl06: 1
  bl07: 1
  ct02: 1
  ct03: 1
  ct04: 1
  ct06: 1
  h01: 2
  h02: 2
  h03: 2
  h04: 2
  sd02: 1
  sd05: 1
  sd06: 1
  sd07: 1
 
  This is my configuration:
 
  ram...@lcpad:~/hadoop-0.20.2$ cat conf/*site*
  ?xml version=1.0?
  ?xml-stylesheet type=text/xsl href=configuration.xsl?
 
  !-- Put site-specific property overrides in this file. --
 
  configuration
  property
  namefs.default.name/name
  valuehdfs://lcpad:9000/value
  /property
  /configuration
  ?xml version=1.0?
  ?xml-stylesheet type=text/xsl href=configuration.xsl?
 
  !-- Put site-specific property overrides in this file. --
 
  configuration
  property
  namedfs.replication/name
  value1/value
  /property
  /configuration
  ?xml version=1.0?
  ?xml-stylesheet type=text/xsl href=configuration.xsl?
 
  !-- Put site-specific property overrides in this file. --
 
  configuration
   property
 namemapred.job.tracker/name
 valuelcpad:9001/value
   /property
  /configuration
 
  Thanks in Advance,
 
  Edson Ramiro
 
 
  On 30 March 2010 05:58, Steve Loughran ste...@apache.org wrote:
 
   Edson Ramiro wrote:
  
   I'm not involved 

Re: Hadoop DFS IO Performance measurement

2010-03-30 Thread Edson Ramiro
Hi Sagar,

What hardware did you run it on ?

Edson Ramiro


On 30 March 2010 19:41, sagar naik sn...@attributor.com wrote:

 Hi All,

 I am trying to get DFS IO performance.
 I used TestDFSIO from hadoop jars.
 The results were abt 100Mbps read and write .
 I think it should be more than this

 Pl share some stats to compare

 Either I am missing something like  config params or something else


 -Sagar



question on shuffle and sort

2010-03-30 Thread Cui tony
Hi,
  Will all key-value pairs of the map output which have the same key
be sent to the same reducer task node?


Re: question on shuffle and sort

2010-03-30 Thread Ed Mazur
On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
  Did all key-value pairs of the map output, which have the same key, will
 be sent to the same reducer tasknode?

Yes, this is at the core of the MapReduce model. There is one call to
the user reduce function per unique map output key. This grouping is
achieved by sorting which means you see keys in increasing order.
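
The routing itself is done by the partitioner; the default HashPartitioner
boils down to something like this (paraphrased sketch, not the exact source):

// hash the key, mask to keep it non-negative, then modulo the reducer count
public int getPartition(K key, V value, int numReduceTasks) {
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}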

Ed


Re: question on shuffle and sort

2010-03-30 Thread 毛宏
Yes, indeed.

On Wed, 2010-03-31 at 09:56 +0800, Cui tony wrote:
 Hi,
   Did all key-value pairs of the map output, which have the same key, will
 be sent to the same reducer tasknode?




Re: question on shuffle and sort

2010-03-30 Thread Jones, Nick
Something to keep in mind though, sorting is appropriate to the key type. Text 
will be sorted lexicographically.

Nick Jones


- Original Message -
From: Ed Mazur ma...@cs.umass.edu
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Sent: Tue Mar 30 21:07:29 2010
Subject: Re: question on shuffle and sort

On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
  Did all key-value pairs of the map output, which have the same key, will
 be sent to the same reducer tasknode?

Yes, this is at the core of the MapReduce model. There is one call to
the user reduce function per unique map output key. This grouping is
achieved by sorting which means you see keys in increasing order.

Ed




Re: question on shuffle and sort

2010-03-30 Thread Cui tony
Consider this extreme situation:
The input data is very large, and so is the map output. 90% of the map output
has the same key, so all of it will be sent to one reducer task node.
Then 90% of the work of the reduce phase has to be done on a single node, not the
cluster. That is very inefficient and not very scalable.


2010/3/31 Jones, Nick nick.jo...@amd.com

 Something to keep in mind though, sorting is appropriate to the key type.
 Text will be sorted lexicographically.

 Nick Jones


 - Original Message -
 From: Ed Mazur ma...@cs.umass.edu
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Tue Mar 30 21:07:29 2010
 Subject: Re: question on shuffle and sort

 On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
   Did all key-value pairs of the map output, which have the same key, will
  be sent to the same reducer tasknode?

 Yes, this is at the core of the MapReduce model. There is one call to
 the user reduce function per unique map output key. This grouping is
 achieved by sorting which means you see keys in increasing order.

 Ed





Re: question on shuffle and sort

2010-03-30 Thread Jones, Nick
I ran into an issue where lots of data was passing from mappers to a single 
reducer. Enabling a combiner saved quite a bit of processing time by reducing 
mapper disk writes and data movements to the reducer.
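
With the old (0.20 mapred) API, wiring in a combiner is one line; a sketch with
placeholder classes:

JobConf conf = new JobConf(MyJob.class);
conf.setMapperClass(MyMapper.class);
conf.setCombinerClass(MyReducer.class);  // run the reduce logic map-side to shrink the shuffle
conf.setReducerClass(MyReducer.class);

Reusing the reducer as the combiner only works when the reduce function is
associative and commutative (e.g. sums and counts).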

Nick Jones


- Original Message -
From: Cui tony tony.cui1...@gmail.com
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Sent: Tue Mar 30 21:24:18 2010
Subject: Re: question on shuffle and sort

Consider this extreme situation:
The input data is very large, and also the map result. 90% of map result
have the same key, then all of them will be sent to one reducer tasknode.
So 90% of work of reduce phase have to been done on a single node, not the
cluster. That is very ineffective and less scalable.


2010/3/31 Jones, Nick nick.jo...@amd.com

 Something to keep in mind though, sorting is appropriate to the key type.
 Text will be sorted lexicographically.

 Nick Jones


 - Original Message -
 From: Ed Mazur ma...@cs.umass.edu
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Tue Mar 30 21:07:29 2010
 Subject: Re: question on shuffle and sort

 On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
   Did all key-value pairs of the map output, which have the same key, will
  be sent to the same reducer tasknode?

 Yes, this is at the core of the MapReduce model. There is one call to
 the user reduce function per unique map output key. This grouping is
 achieved by sorting which means you see keys in increasing order.

 Ed






Re: question on shuffle and sort

2010-03-30 Thread Cui tony
Hi, Jones
 As you have met the situation I am worried about, I have my answer now.
Maybe redesigning the map function or adding a combiner is the only way to deal
with this kind of input data.

2010/3/31 Jones, Nick nick.jo...@amd.com

 I ran into an issue where lots of data was passing from mappers to a single
 reducer. Enabling a combiner saved quite a bit of processing time by
 reducing mapper disk writes and data movements to the reducer.

 Nick Jones


 - Original Message -
 From: Cui tony tony.cui1...@gmail.com
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Tue Mar 30 21:24:18 2010
 Subject: Re: question on shuffle and sort

 Consider this extreme situation:
 The input data is very large, and also the map result. 90% of map result
 have the same key, then all of them will be sent to one reducer tasknode.
 So 90% of work of reduce phase have to been done on a single node, not the
 cluster. That is very ineffective and less scalable.


 2010/3/31 Jones, Nick nick.jo...@amd.com

  Something to keep in mind though, sorting is appropriate to the key type.
  Text will be sorted lexicographically.
 
  Nick Jones
 
 
  - Original Message -
  From: Ed Mazur ma...@cs.umass.edu
  To: common-user@hadoop.apache.org common-user@hadoop.apache.org
  Sent: Tue Mar 30 21:07:29 2010
  Subject: Re: question on shuffle and sort
 
  On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
Did all key-value pairs of the map output, which have the same key,
 will
   be sent to the same reducer tasknode?
 
  Yes, this is at the core of the MapReduce model. There is one call to
  the user reduce function per unique map output key. This grouping is
  achieved by sorting which means you see keys in increasing order.
 
  Ed
 
 
 




is there any way we can limit Hadoop Datanode's disk usage?

2010-03-30 Thread steven zhuang
hi, guys,
   we have some machines with 1 TB disks, and some with 100 GB disks.
   My question is: is there any way we can limit the
disk usage of the datanodes on those machines with the smaller disks?
   thanks!


Re: is there any way we can limit Hadoop Data node's disk usage?

2010-03-30 Thread Ravi Phulari
Hello Steven ,
You can use the dfs.datanode.du.reserved configuration value in 
$HADOOP_HOME/conf/hdfs-site.xml to limit disk usage.

<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- cluster variant -->
  <value>182400</value>
  <description>Reserved space in bytes per volume. Always leave this much
  space free for non dfs use.
  </description>
</property>
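
The value is in bytes per volume, so reserving roughly 20 GB on the smaller
machines would look like this (the number is only an example):

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>21474836480</value>  <!-- 20 * 1024^3 bytes kept free for non-DFS use -->
</property>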

Ravi
Hadoop @ Yahoo!

On 3/30/10 8:12 PM, steven zhuang steven.zhuang.1...@gmail.com wrote:

hi, guys,
   we have some machine with 1T disk, some with 100GB disk,
   I have this question that is there any means we can limit the
disk usage of datanodes on those machines with smaller disk?
   thanks!


Ravi
--



log

2010-03-30 Thread Gang Luo
Hi all,
I find there is a directory _logs/history/... under the output directory of a 
mapreduce job. Is the file in that directory a log file? Is the information 
there sufficient to allow me to figure out which nodes the job ran on? Besides, 
not every job has such a directory. Is there a setting controlling this? Or 
are there other ways to get the nodes my job runs on?

Thanks,
-Gang