Re: Redirecting hadoop log messages to a log file at client side

2010-03-30 Thread Pallavi Palleti

Hi Alex,

Thanks for the reply. I have already created a logger (from log4j's
Logger) and configured it to log to a file, and it is logging all the
log statements that I have in my client code. However, the error/info
logs of DFSClient are going to stdout. The DFSClient code uses a log
from commons-logging.jar. I am wondering how to redirect those logs
(which are right now going to stdout) so that they are appended to the
existing log file used in the client code.


Thanks
Pallavi


On 03/30/2010 12:06 PM, Alex Kozlov wrote:

Hi Pallavi,

It depends on what logging configuration you are using.  If it's log4j, you
need to modify (or create) a log4j.properties file and point your code (via
the classpath) to it.

A sample log4j.properties is in the conf directory (either apache or CDH
distributions).
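
A minimal client-side log4j.properties along those lines might look like the
sketch below (the appender name and file path are just illustrative). Since
commons-logging delegates to log4j when log4j is on the classpath, the
DFSClient messages follow the same root logger and end up in the same file:

log4j.rootLogger=INFO, clientlog
log4j.appender.clientlog=org.apache.log4j.FileAppender
log4j.appender.clientlog.File=/var/log/myapp/client.log
log4j.appender.clientlog.layout=org.apache.log4j.PatternLayout
log4j.appender.clientlog.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n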

Alex K

On Mon, Mar 29, 2010 at 11:25 PM, Pallavi Palleti<
pallavi.pall...@corp.aol.com>  wrote:

   

Hi,

I am copying certain data from a client machine (which is not part of the
cluster) using DFSClient to HDFS. During this process, I am encountering
some issues and the error/info logs are going to stdout. Is there a way I
can configure the property at the client side so that the error/info logs are
appended to the existing log file (created by the logger in the client code)
rather than written to stdout?

Thanks
Pallavi

 
   


Single datanode setup

2010-03-30 Thread Ed Mazur
Hi,

I have a 12 node cluster where instead of running a DN on each compute
node, I'm running just one DN backed by a large RAID (with a
dfs.replication of 1). The compute node storage is limited, so the
idea behind this was to free up more space for intermediate job data.
So the cluster has that one node with the DN, a master node with the
JT/NN, and 10 compute nodes each with a TT. I am running 0.20.1+169.68
from Cloudera.

The problem is that MR job performance is now worse than when using a
traditional HDFS setup. A job that took 76 minutes before now takes
169 minutes. I've used this single DN setup before on a
similarly-sized cluster without any problems, so what can I do to find
the bottleneck?

-Loading data into HDFS was fast, under 30 minutes to load ~240GB, so
I'm thinking this is a DN <-> map task communication problem.

-With a traditional HDFS setup, map tasks were taking 10-30 seconds,
but they now take 45-90 seconds or more.

-I grep'd the DN logs to find how long the size 67633152 HDFS reads
(map inputs) were taking. With the central DN, the reads were an order
of magnitude slower than with traditional HDFS (e.g. 82008147000 vs.
8238455000).

-I tried increasing dfs.datanode.handler.count to 10, but this didn't
seem to have any effect.

-Could low memory be an issue? The machine the DN is running on only
has 2GB and there is less than 100MB free without the DN running. I
haven't observed any swapping going on though.

-I looked at netstat during a job. I wasn't too sure what to look for,
but I didn't see any substantial send/receive buffering.

I've tried everything I can think of, so I'd really appreciate any tips. Thanks.

Ed


Re: Single datanode setup

2010-03-30 Thread Ankur C. Goel

M/R performance is known to be better when using just a bunch of disks (JBOD)
instead of RAID.

From your setup it looks like your single datanode must be running hot on I/O
activity.

The parameter dfs.datanode.handler.count only controls the number of datanode
threads serving IPC requests.
These are NOT used for actual block transfers. Try upping
dfs.datanode.max.xcievers.
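
That setting goes in hdfs-site.xml on the datanode; a sketch (the value is only
an illustrative starting point):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>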

You can then run the I/O benchmarks to measure the I/O throughput:
hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
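
The matching read pass and cleanup use the same driver; a sketch with the same
options (results are appended to TestDFSIO_results.log in the working directory
by default):

hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean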

-...@nkur

On 3/30/10 12:46 PM, "Ed Mazur"  wrote:

Hi,

I have a 12 node cluster where instead of running a DN on each compute
node, I'm running just one DN backed by a large RAID (with a
dfs.replication of 1). The compute node storage is limited, so the
idea behind this was to free up more space for intermediate job data.
So the cluster has that one node with the DN, a master node with the
JT/NN, and 10 compute nodes each with a TT. I am running 0.20.1+169.68
from Cloudera.

The problem is that MR job performance is now worse than when using a
traditional HDFS setup. A job that took 76 minutes before now takes
169 minutes. I've used this single DN setup before on a
similarly-sized cluster without any problems, so what can I do to find
the bottleneck?

-Loading data into HDFS was fast, under 30 minutes to load ~240GB, so
I'm thinking this is a DN <-> map task communication problem.

-With a traditional HDFS setup, map tasks were taking 10-30 seconds,
but they now take 45-90 seconds or more.

-I grep'd the DN logs to find how long the size 67633152 HDFS reads
(map inputs) were taking. With the central DN, the reads were an order
of magnitude slower than with traditional HDFS (e.g. 82008147000 vs.
8238455000).

-I tried increasing dfs.datanode.handler.count to 10, but this didn't
seem to have any effect.

-Could low memory be an issue? The machine the DN is running on only
has 2GB and there is less than 100MB free without the DN running. I
haven't observed any swapping going on though.

-I looked at netstat during a job. I wasn't too sure what to look for,
but I didn't see any substantial send/receive buffering.

I've tried everything I can think of, so I'd really appreciate any tips. Thanks.

Ed



Re: java.io.IOException: Function not implemented

2010-03-30 Thread Steve Loughran

Edson Ramiro wrote:

I'm not involved with Debian community :(


I think you are now...


Re: why does 'jps' lose track of hadoop processes ?

2010-03-30 Thread Steve Loughran

Marcos Medrado Rubinelli wrote:
jps gets its information from the files stored under /tmp/hsperfdata_*, 
so when a cron job clears your /tmp directory, it also erases these 
files. You can submit jobs as long as your jobtracker and namenode are 
responding to requests over TCP, though.


I never knew that.

ps -ef | grep java works quite well; jps has fairly steep startup costs 
and if a JVM is playing up, jps can hang too
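
If the /tmp cleanup is done by something like tmpwatch, another option is to
exclude the hsperfdata directories so jps keeps working; a sketch (the retention
period and user name are made up, and this assumes tmpwatch's --exclude option):

tmpwatch --exclude=/tmp/hsperfdata_hadoop 240 /tmp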




Re: Single datanode setup

2010-03-30 Thread Ed Mazur
I set dfs.datanode.max.xcievers to 4096, but this didn't seem to have
any effect on performance.

Here are some benchmarks (not sure what typical values are):

----- TestDFSIO ----- : write
           Date & time: Tue Mar 30 04:53:18 EDT 2010
       Number of files: 10
Total MBytes processed: 1
     Throughput mb/sec: 23.41355598064167
Average IO rate mb/sec: 25.179018020629883
 IO rate std deviation: 7.022948102609891
    Test exec time sec: 74.437

----- TestDFSIO ----- : read
           Date & time: Tue Mar 30 05:02:01 EDT 2010
       Number of files: 10
Total MBytes processed: 1
     Throughput mb/sec: 10.735545929349373
Average IO rate mb/sec: 10.741226196289062
 IO rate std deviation: 0.24872891783558398
    Test exec time sec: 119.561

----- TestDFSIO ----- : write
           Date & time: Tue Mar 30 05:09:59 EDT 2010
       Number of files: 40
Total MBytes processed: 4
     Throughput mb/sec: 3.3887489806219473
Average IO rate mb/sec: 5.173769950866699
 IO rate std deviation: 6.293246618896401
    Test exec time sec: 360.765

----- TestDFSIO ----- : read
           Date & time: Tue Mar 30 05:18:20 EDT 2010
       Number of files: 40
Total MBytes processed: 4
     Throughput mb/sec: 2.345990558443698
Average IO rate mb/sec: 2.3469674587249756
 IO rate std deviation: 0.04731737036312141
    Test exec time sec: 477.568

I also used 40 files in the benchmarks because I have 10 compute nodes
with mapred.tasktracker.map.tasks.maximum set to 4. It looks like
performance degrades quite a bit when going from 10 files to 40.

I set mapred.tasktracker.map.tasks.maximum to 1 and ran a MR job. This
got map completion times back down to the expected 15-30 seconds, but
did not change the overall running time.

Does this just mean that the RAID isn't able to keep up with 10*4=40
parallel requests, but it is able to keep up with 10*1=10 parallel
requests? And if so, is there anything I can do to change this? I know
this isn't how HDFS is meant to be used, but this single DN/RAID setup
has worked for me in the past on a similarly-sized cluster.

Ed

On Tue, Mar 30, 2010 at 4:29 AM, Ankur C. Goel  wrote:
>
> M/R is performance is known to be better when using just a bunch of disks 
> (BOD) instead of RAID.
>
> From your setup it looks like your single datanode must be running hot on I/O 
> activity.
>
> The parameter- dfs.datanode.handler.count only control the number of datanode 
> threads serving IPC request.
> These are NOT used for actual block transfer. Try upping - 
> dfs.datanode.max.xcievers.
>
> You can then run the I/O  benchmarks to measure the I/O throughput -
> jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 
> 1000
>
> -...@nkur
>
> On 3/30/10 12:46 PM, "Ed Mazur"  wrote:
>
> Hi,
>
> I have a 12 node cluster where instead of running a DN on each compute
> node, I'm running just one DN backed by a large RAID (with a
> dfs.replication of 1). The compute node storage is limited, so the
> idea behind this was to free up more space for intermediate job data.
> So the cluster has that one node with the DN, a master node with the
> JT/NN, and 10 compute nodes each with a TT. I am running 0.20.1+169.68
> from Cloudera.
>
> The problem is that MR job performance is now worse than when using a
> traditional HDFS setup. A job that took 76 minutes before now takes
> 169 minutes. I've used this single DN setup before on a
> similarly-sized cluster without any problems, so what can I do to find
> the bottleneck?
>
> -Loading data into HDFS was fast, under 30 minutes to load ~240GB, so
> I'm thinking this is a DN <-> map task communication problem.
>
> -With a traditional HDFS setup, map tasks were taking 10-30 seconds,
> but they now take 45-90 seconds or more.
>
> -I grep'd the DN logs to find how long the size 67633152 HDFS reads
> (map inputs) were taking. With the central DN, the reads were an order
> of magnitude slower than with traditional HDFS (e.g. 82008147000 vs.
> 8238455000).
>
> -I tried increasing dfs.datanode.handler.count to 10, but this didn't
> seem to have any effect.
>
> -Could low memory be an issue? The machine the DN is running on only
> has 2GB and there is less than 100MB free without the DN running. I
> haven't observed any swapping going on though.
>
> -I looked at netstat during a job. I wasn't too sure what to look for,
> but I didn't see any substantial send/receive buffering.
>
> I've tried everything I can think of, so I'd really appreciate any tips. 
> Thanks.
>
> Ed
>
>


Re: Single datanode setup

2010-03-30 Thread Steve Loughran

Ed Mazur wrote:

Hi,

I have a 12 node cluster where instead of running a DN on each compute
node, I'm running just one DN backed by a large RAID (with a
dfs.replication of 1). The compute node storage is limited, so the
idea behind this was to free up more space for intermediate job data.
So the cluster has that one node with the DN, a master node with the
JT/NN, and 10 compute nodes each with a TT. I am running 0.20.1+169.68
from Cloudera.

The problem is that MR job performance is now worse than when using a
traditional HDFS setup. A job that took 76 minutes before now takes
169 minutes. I've used this single DN setup before on a
similarly-sized cluster without any problems, so what can I do to find
the bottleneck?


I wouldn't use hdfs in this situation. Your network will be the 
bottleneck. If you have a SAN, high end filesystem and/or fast network, 
just use file:// URLs and let the underlying OS/network handle it. I 
know people who use alternate filesystems this way. Side benefit: the NN 
is no longer an SPOF. Just your storage array. But they never fail, right?


Having a single DN and NN is a waste of effort here. There's no 
locality, no replication, so no need for the replication and locality 
features of HDFS. Try mounting the filestore everywhere with NFS (or 
other protocol of choice), and skip HDFS entirely.
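
A rough sketch of what that looks like, assuming an NFS mount at /mnt/shared on
every node (the mount point is made up): set the default filesystem to the
local one in core-site.xml and hand jobs plain paths on the shared mount.

<property>
  <name>fs.default.name</name>
  <value>file:///</value>
</property>

Job input and output paths such as /mnt/shared/input then resolve on the shared
mount on every TaskTracker.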


-Steve



Query over DFSClient

2010-03-30 Thread Pallavi Palleti

Hi,

Could someone kindly let me know whether the DFSClient takes care of 
datanode failures and attempts to write to another datanode if the primary 
datanode (and the replica datanodes) fail? I looked into the source code 
of DFSClient and figured out that it attempts to write to one of the 
datanodes in the pipeline and fails if it cannot write to at least one of 
them. However, I am not sure, as I haven't explored it fully. If so, is 
there a way of querying the namenode to provide different datanodes in 
case of failure? I am sure the Mapper would be doing a similar 
thing (attempting to fetch different datanodes from the namenode) if it fails 
to write to datanodes. Kindly let me know.


Thanks
Pallavi



Listing subdirectories in Hadoop

2010-03-30 Thread Santiago Pérez

Hej

I've been checking the API and the internet but I have not found any method for
listing the subdirectories of a given directory in HDFS. 

Can anybody show me how to get the list of subdirectories or even how to
implement the method? (I guess that it should be possible and not very
hard).

Thanks in advance ;)
-- 
View this message in context: 
http://old.nabble.com/Listing-subdirectories-in-Hadoop-tp28084164p28084164.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



a question about automatic restart of the NameNode

2010-03-30 Thread 毛宏
Hi all,  
 Is automatic restart and failover of the NameNode software to
another machine available in Hadoop 0.20.2?  



Re: Listing subdirectories in Hadoop

2010-03-30 Thread Ted Yu
Does this get what you want ?
hadoop dfs -ls  | grep drwx

On Tue, Mar 30, 2010 at 8:24 AM, Santiago Pérez  wrote:

>
> Hej
>
> I've checking the API and on internet but I have not found any method for
> listing the subdirectories of a given directory in the HDFS.
>
> Can anybody show me how to get the list of subdirectories or even how to
> implement the method? (I guess that it should be possible and not very
> hard).
>
> Thanks in advance ;)
> --
> View this message in context:
> http://old.nabble.com/Listing-subdirectories-in-Hadoop-tp28084164p28084164.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
I'm confused as to how to run a C++ pipes program on a full HDFS system.  First 
off, I have everything working in pseudo-distributed mode so that's a good 
start...but full HDFS has no concept of an executable file (to the best of my 
understanding, O'Reilly/White, p.47).  I haven't even been successful in 
setting executable permission on a file in HDFS (the 'x' never appears after 
the corresponding chmod command), which means that when I copy my compiled C++ 
program to the cluster (as in the C++ pseudo-cluster usage example, p.38) and 
try to run the job I obviously get a permissions error because the file isn't 
executable and the job fails.

I feel like I'm missing something obvious here.

Any ideas?


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
  -- Homer Simpson






Re: a question about automatic restart of the NameNode

2010-03-30 Thread Ted Yu
Please refer to highavailability contrib of 0.20.2:
HDFS-976
http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html

On Tue, Mar 30, 2010 at 8:51 AM, 毛宏  wrote:

> Hi all,
> Does automatic restart and failover of the NameNode software to
> another machine available in hadoop 0.20.2?
>
>


Re: Redirecting hadoop log messages to a log file at client side

2010-03-30 Thread Alex Kozlov
Hi Pallavi,

DFSClient uses log4j.properties for configuration.  What is your classpath?
 I need to know how exactly you invoke your program (java, hadoop script,
etc.).  The log level and appender are driven by the hadoop.root.logger
config variable.

I would also recommend using one logging system in the code, which would be
commons-logging in this case.

Alex K

On Tue, Mar 30, 2010 at 12:12 AM, Pallavi Palleti <
pallavi.pall...@corp.aol.com> wrote:

> Hi Alex,
>
> Thanks for the reply. I have already created a logger (from
> log4j.logger)and configured the same to log it to a file and it is logging
> for all the log statements that I have in my client code. However, the
> error/info logs of DFSClient are going to stdout.  The DFSClient code is
> using log from commons-logging.jar. I am wondering how to redirect those
> logs (which are right now going to stdout) to append to the existing logger
> in client code.
>
> Thanks
> Pallavi
>
>
>
> On 03/30/2010 12:06 PM, Alex Kozlov wrote:
>
>> Hi Pallavi,
>>
>> It depends what logging configuration you are using.  If it's log4j, you
>> need to modify (or create) log4j.properties file and point you code (via
>> classpath) to it.
>>
>> A sample log4j.properties is in the conf directory (either apache or CDH
>> distributions).
>>
>> Alex K
>>
>> On Mon, Mar 29, 2010 at 11:25 PM, Pallavi Palleti<
>> pallavi.pall...@corp.aol.com>  wrote:
>>
>>
>>
>>> Hi,
>>>
>>> I am copying certain data from a client machine (which is not part of the
>>> cluster) using DFSClient to HDFS. During this process, I am encountering
>>> some issues and the error/info logs are going to stdout. Is there a way,
>>> I
>>> can configure the property at client side so that the error/info logs are
>>> appended to existing log file (being created using logger at client code)
>>> rather writing to stdout.
>>>
>>> Thanks
>>> Pallavi
>>>
>>>
>>>
>>
>>
>


C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
I'm confused as to how to run a C++ pipes program on a full HDFS system.  I 
have everything working in pseudo-distributed mode so that's a good start...but 
I can't figure out the full cluster mode.

As I see it, there are two basic approaches: upload the executable directly to 
HDFS or specify it when you run pipes and have it distributed to the cluster at 
the time the job is run.

In the former case, which mirrors the documentation for the pseudo-distributed 
example, I am totally perplexed because HDFS doesn't support executable 
permissions on any files.  In other words, the word count example for the 
pseudo-distributed case absolutely will not carry over to the fully distributed 
case since that example consists of first transferring the file to the cluster. 
 When I do that and run pipes I get a permissions error on the file because it 
isn't executable (and chmod refuses to enable 'x' on HDFS).

So that leaves the latter case.  I specify the executable to pipes using the 
-program option, but then it never gets found.  I get file not found errors for 
the executable.

I've tried the following and a few variants to no avail:

% hadoop pipes -files LOCALPATH/EXE EXE -input HDFSPATH/input -output 
HDFSPATH/output -program LOCALPATH/EXE

% hadoop pipes -input HDFSPATH/input -output HDFSPATH/output -program 
LOCALPATH/EXE

Does anyone know how to get this working?

Thanks.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"Luminous beings are we, not this crude matter."
  -- Yoda






Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
Please disregard this thread.  I started another thread which is more specific 
and pertinent to my problem...but if you have any helpful information, please 
respond to the other thread.  I need to get this figured out.

Thank you.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
  -- Homer Simpson






swapping on hadoop

2010-03-30 Thread Vasilis Liaskovitis
Hi all,

I've noticed swapping for a single terasort job on a small 8-node
cluster using hadoop-0.20.1. The swapping doesn't happen repeatably; I
can have back-to-back runs of the same job from the same hdfs input
data and get swapping on only 1 out of 4 identical runs. I've noticed
this swapping behaviour on both terasort jobs and hive query jobs.

- Focusing on a single job config, is there a rule of thumb about how
much node memory should be left for use outside of the child JVMs?
I make sure that per node there is enough free memory:
(max map tasks per TaskTracker + max reduce tasks per TaskTracker) * JVM heap size < physical memory on the node
The total JVM heap size per node per job from the above equation
currently accounts for 65%-75% of the node's memory. (I've tried
allocating a riskier 90% of the node's memory, with similar swapping
observations.) A worked example of this arithmetic appears right after
this list.

- Could there be an issue with HDFS data or metadata taking up memory?
I am not cleaning output or intermediate outputs from HDFS between
runs. Is this possible?

- Do people use any specific java flags (particularly garbage
collection flags) for production environments where one job runs (or
possibly more jobs run simultaneously) ?

- What are the memory requirements for the jobtracker,namenode and
tasktracker,datanode JVMs?

- I am setting io.sort.mb to about half of the JVM heap size (half of
-Xmx in javaopts). Should this be set to a different ratio? (this
setting doesn't sound like it should be causing swapping in the first
place).

- The buffer cache is cleaned before each run (flush and echo 3 >
/proc/sys/vm/drop_caches)
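
As a worked instance of the rule of thumb above (the numbers are made up, not
from this post): with 6 map slots and 2 reduce slots per TaskTracker and
-Xmx1024m child JVMs,

(6 + 2) * 1024 MB = 8192 MB

is committed to child JVMs alone, before the TaskTracker/DataNode daemons and
the OS page cache are accounted for, so a node with 10 GB of RAM is already at
roughly the 80% mark.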

any empirical advice and suggestions  to solve this are appreciated.
thanks,

- Vasilis


Re: Listing subdirectories in Hadoop

2010-03-30 Thread A Levine
If you were talking about looking at directories within a Java
program, here is what has worked for me (conf is a Configuration and
sourceDir is a Path defined elsewhere in my code).

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

FileSystem fs;
FileStatus[] fileStat;
Path[] fileList;
try {
  // connect to the file system
  fs = FileSystem.get(conf);

  // get the stat on all entries in the source directory
  fileStat = fs.listStatus(sourceDir);

  // get paths to the entries in the source directory
  fileList = FileUtil.stat2Paths(fileStat);

  // then you can do something like
  for (int x = 0; x < fileList.length; x++) {
    System.out.println(x + " " + fileList[x]);
  }
} catch (IOException ioe) {
  // do something
}
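
Since the original question was specifically about subdirectories, the same
listing can be narrowed down with FileStatus.isDir(); a small sketch under the
same assumptions (fs and sourceDir as above):

for (FileStatus stat : fs.listStatus(sourceDir)) {
  if (stat.isDir()) {
    // print only the subdirectories
    System.out.println(stat.getPath());
  }
}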

Hope this helps.

andrew

--

On Tue, Mar 30, 2010 at 11:54 AM, Ted Yu  wrote:
> Does this get what you want ?
> hadoop dfs -ls  | grep drwx
>
> On Tue, Mar 30, 2010 at 8:24 AM, Santiago Pérez  wrote:
>
>>
>> Hej
>>
>> I've checking the API and on internet but I have not found any method for
>> listing the subdirectories of a given directory in the HDFS.
>>
>> Can anybody show me how to get the list of subdirectories or even how to
>> implement the method? (I guess that it should be possible and not very
>> hard).
>>
>> Thanks in advance ;)
>> --
>> View this message in context:
>> http://old.nabble.com/Listing-subdirectories-in-Hadoop-tp28084164p28084164.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
>


CfP with Extended Deadline 5th Workshop on Virtualization in High-Performance Cloud Computing (VHPC'10)

2010-03-30 Thread Michael Alexander
Apologies if you received multiple copies of this message.


=

CALL FOR PAPERS

5th Workshop on

Virtualization in High-Performance Cloud Computing

VHPC'10

as part of Euro-Par 2010, Island of Ischia-Naples, Italy

=

Date: August 31, 2010

Euro-Par 2010: http://www.europar2010.org/

Workshop URL: http://vhpc.org

SUBMISSION DEADLINE:

Abstracts: April 4, 2010 (extended)
Full Paper: June 19, 2010 (extended) 


Scope:

Virtualization has become a common abstraction layer in modern data
centers, enabling resource owners to manage complex infrastructure
independently of their applications. Conjointly virtualization is
becoming a driving technology for a manifold of industry grade IT
services. Piloted by the Amazon Elastic Computing Cloud services, the
cloud concept includes the notion of a separation between resource
owners and users, adding services such as hosted application
frameworks and queuing. Utilizing the same infrastructure, clouds
carry significant potential for use in high-performance scientific
computing. The ability of clouds to provide for requests and releases
of vast computing resource dynamically and close to the marginal cost
of providing the services is unprecedented in the history of
scientific and commercial computing.

Distributed computing concepts that leverage federated resource access
are popular within the grid community, but have not yet reached the
deployment levels previously hoped for. Also, many of the scientific
datacenters have not adopted virtualization or cloud concepts yet.

This workshop aims to bring together industrial providers with the
scientific community in order to foster discussion, collaboration and
mutual exchange of knowledge and experience.

The workshop will be one day in length, composed of 20 min paper
presentations, each followed by 10 min discussion sections.
Presentations may be accompanied by interactive demonstrations. It
concludes with a 30 min panel discussion by presenters.

TOPICS

Topics include, but are not limited to, the following subjects:

- Virtualization in cloud, cluster and grid HPC environments
- VM cloud, cluster load distribution algorithms
- Cloud, cluster and grid filesystems
- QoS and service level guarantees
- Cloud programming models, APIs and databases
- Software as a service (SaaS)
- Cloud provisioning
- Virtualized I/O
- VMMs and storage virtualization
- MPI, PVM on virtual machines
- High-performance network virtualization
- High-speed interconnects
- Hypervisor extensions
- Tools for cluster and grid computing
- Xen/other VMM cloud/cluster/grid tools
- Raw device access from VMs
- Cloud reliability, fault-tolerance, and security
- Cloud load balancing
- VMs - power efficiency
- Network architectures for VM-based environments
- VMMs/Hypervisors
- Hardware support for virtualization
- Fault tolerant VM environments
- Workload characterizations for VM-based environments
- Bottleneck management
- Metering
- VM-based cloud performance modeling
- Cloud security, access control and data integrity
- Performance management and tuning hosts and guest VMs
- VMM performance tuning on various load types
- Research and education use cases
- Cloud use cases
- Management of VM environments and clouds
- Deployment of VM-based environments



PAPER SUBMISSION

Papers submitted to the workshop will be reviewed by at least two
members of the program committee and external reviewers. Submissions
should include abstract, key words, the e-mail address of the
corresponding author, and must not exceed 10 pages, including tables
and figures at a main font size no smaller than 11 point. Submission
of a paper should be regarded as a commitment that, should the paper
be accepted, at least one of the authors will register and attend the
conference to present the work.

Accepted papers will be published in the Springer LNCS series - the
format must be according to the Springer LNCS Style. Initial
submissions are in PDF; accepted papers will be requested to provide
source files.

Format Guidelines: http://www.springer.de/comp/lncs/authors.html

Submission Link: http://edas.info/newPaper.php?c=8553


IMPORTANT DATES

April 4 - Abstract submission due (extended)
May 19 - Full paper submission (extended)
July 14 - Acceptance notification
August 3 - Camera-ready version due
August 31 - September 3 - conference


CHAIR

Michael Alexander (chair), scaledinfra technologies GmbH, Austria
Gianluigi Zanetti (co-chair), CRS4, Italy


PROGRAM COMMITTEE

Padmashree Apparao, Intel Corp., USA
Volker Buege, University of Karlsruhe, Germany
Roberto Canonico, University of Napoli Federico II, Italy
Tommaso Cucinotta, Scuola Superiore Sant'Anna, Italy
Werner Fischer, Thomas Krenn AG, Germany
William Gardner, University of Guelph, Canada
Wolfgang Gentzsch, DEISA. Max Planck Gesellschaft, Germany
Derek Groen, UVA, The Netherlands
Marcus Hardt, Forschun

Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
No responses yet, although I admit it's only been a few hours.

As a follow-up, permit me to pose the following question:

Is it, in fact, impossible to run C++ pipes on a fully-distributed system (as 
opposed to a pseudo-distributed system)?  I haven't found any definitive 
clarification on this topic one way or the other.  The only statement that I 
found in the least bit illuminating is in the O'Reilly book (not official 
Hadoop documentation mind you), p.38, which states:

"To run a Pipes job, we need to run Hadoop in pseudo-distributed mode...Pipes 
doesn't run in standalone (local) mode, since it relies on Hadoop's distributed 
cache mechanism, which works only when HDFS is running."

The phrasing of those statements is a little unclear in that the distinction 
being made appears to be between standalone and pseudo-distributed mode, 
without any specific reference to fully-distributed mode.  Namely, the section 
that qualifies the need for pseudo-distributed mode (the need for HDFS) would 
obviously also apply to full distributed mode despite the lack of mention of 
fully distributed mode in the quoted section.  So can pipes run in fully 
distributed mode or not?

Bottom line, I can't get C++ pipes to work on a fully distributed cluster yet 
and I don't know if I am wasting my time, if this is a truly impossible effort 
or if it can be done and I simply haven't figured out how to do it yet.

Thanks for any help.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
  -- Mark Twain






Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Gianluigi Zanetti
Hello.
Did you try following the tutorial in 
http://wiki.apache.org/hadoop/C++WordCount ?

We use C++ pipes in production on a large cluster, and it works.

--gianluigi


On Tue, 2010-03-30 at 13:28 -0700, Keith Wiley wrote:
> No responses yet, although I admit it's only been a few hours.
> 
> As a follow-up, permit me to pose the following question:
> 
> Is it, in fact, impossible to run C++ pipes on a fully-distributed system (as 
> opposed to a pseudo-distributed system)?  I haven't found any definitive 
> clarification on this topic one way or the other.  The only statement that I 
> found in the least bit illuminating is in the O'Reilly book (not official 
> Hadoop documentation mind you), p.38, which states:
> 
> "To run a Pipes job, we need to run Hadoop in pseudo-distributed mode...Pipes 
> doesn't run in standalone (local) mode, since it relies on Hadoop's 
> distributed cache mechanism, which works only when HDFS is running."
> 
> The phrasing of those statements is a little unclear in that the distinction 
> being made appears to be between standalone and pseudo-distributed mode, 
> without any specific reference to fully-distributed mode.  Namely, the 
> section that qualifies the need for pseudo-distributed mode (the need for 
> HDFS) would obviously also apply to full distributed mode despite the lack of 
> mention of fully distributed mode in the quoted section.  So can pipes run in 
> fully distributed mode or not?
> 
> Bottom line, I can't get C++ pipes to work on a fully distributed cluster yet 
> and I don't know if I am wasting my time, if this is a truly impossible 
> effort or if it can be done and I simply haven't figured out how to do it yet.
> 
> Thanks for any help.
> 
> 
> Keith Wiley   kwi...@keithwiley.com   
> www.keithwiley.com
> 
> "The easy confidence with which I know another man's religion is folly teaches
> me to suspect that my own is also."
>   -- Mark Twain
> 
> 
> 
> 


Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
Yep, tried and tried and tried it.  Works perfectly on a pseudo-distributed 
cluster which is why I didn't think the example or the code was the problem, 
but rather that the cluster was the problem.

I have only just (in the last two minutes) heard back from the administrator of 
our cluster and he says the pipes package is not installed on the cluster...so 
that kinda explains it, although I'm still unclear what the symptoms would be 
for various kinds of problems.  In other words, I'm not sure if the errors I 
got were the result of the lack of a pipes package on the cluster or if I still 
wasn't "doing it right".

At any rate, it sounds like pipes is an additional extraneous add-on during 
cluster configuration and that our cluster didn't add it.

Does that make sense to you?...that pipes needs to be enabled on the cluster, 
not merely "run properly by the user"?

Thanks.

Cheers!

On Mar 30, 2010, at 13:43 , Gianluigi Zanetti wrote:

> Hello.
> Did you try following the tutorial in 
> http://wiki.apache.org/hadoop/C++WordCount ?
> 
> We use C++ pipes in production on a large cluster, and it works.
> 
> --gianluigi



Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
  -- Edwin A. Abbott, Flatland






Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
My cluster admin noticed that there is some additional pipes package he could 
add to the cluster configuration, but he admits to knowing very little about 
how the C++ pipes component of Hadoop works.

Can you offer any insight into this cluster configuration package?  What 
exactly does it do that makes a cluster capable of running pipes programs (and 
what symptom should its absence present from a user's point of view)?

On Mar 30, 2010, at 13:43 , Gianluigi Zanetti wrote:

> Hello.
> Did you try following the tutorial in 
> http://wiki.apache.org/hadoop/C++WordCount ?
> 
> We use C++ pipes in production on a large cluster, and it works.
> 
> --gianluigi



Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
  -- Galileo Galilei






Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Gianluigi Zanetti
What are the symptoms? 
Pipes should run out of the box in a standard installation.
BTW what version of bash are you using? Is it bash 4.0 by any chance?
See https://issues.apache.org/jira/browse/HADOOP-6388

--gianluigi


On Tue, 2010-03-30 at 14:13 -0700, Keith Wiley wrote:
> My cluster admin noticed that there is some additional pipes package he could 
> add to the cluster configuration, but he admits to knowing very little about 
> how the C++ pipes component of Hadoop works.
> 
> Can you offer any insight into this cluster configuration package?  What 
> exactly does it do that makes a cluster capable of running pipes programs 
> (and what symptom should its absence present from a user's point of view)?
> 
> On Mar 30, 2010, at 13:43 , Gianluigi Zanetti wrote:
> 
> > Hello.
> > Did you try following the tutorial in 
> > http://wiki.apache.org/hadoop/C++WordCount ?
> > 
> > We use C++ pipes in production on a large cluster, and it works.
> > 
> > --gianluigi
> 
> 
> 
> Keith Wiley   kwi...@keithwiley.com   
> www.keithwiley.com
> 
> "I do not feel obliged to believe that the same God who has endowed us with
> sense, reason, and intellect has intended us to forgo their use."
>   -- Galileo Galilei
> 
> 
> 
> 


Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
The closest I've gotten so far is for the job to basically try to start up but 
to get an error complaining about the permissions on the executable 
binary...which makes perfect sense since the permissions are not "executable".  
Problem is, the hdfs chmod command ignores the executable bits.  For example, 
"hd fs -chmod 755 somefile" yields -rw-r--r--.  The x is simply dropped from 
the mode.  This makes sense to me in light of documentation (O'Reilly p.47) 
that indicates HDFS doesn't support executable file permissions, but it leaves 
me perplexed how any file could ever be executable under HDFS or Hadoop in 
general.

Using slightly different attempts at the pipes command I usually get errors 
that the executable is not found.  This occurs when I point to a local file for 
the executable instead of one uploaded to HDFS.  In other words, I haven't 
found any way to run pipes such that the executable starts out on the local 
machine and is automatically distributed to the cluster as a component of the 
pipes command.  Rather, it seems that the executable must already reside in 
HDFS and be indicated during the pipes command (ala -program or 
hadoop.pipes.executable of course).  I have even tried adding the "-files" 
option to pipes, but so far to no positive effect.

I'll send another post with some specific transcripts of what I'm seeing.

One could ask, w.r.t. the "-program" flag for pipes, should that indicate a 
local path, an hdfs path, or are both options possible?
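
(For reference, the wiki word-count recipe drives this with an HDFS path: the
binary is copied into HDFS first and -program points at that HDFS location; the
pipes runtime then ships it to the task nodes through the distributed cache and
should take care of making the localized copy executable there, so the missing
'x' bit inside HDFS itself shouldn't matter. A sketch with placeholder paths:)

hadoop fs -put LOCALPATH/EXE HDFSPATH/bin/EXE
hadoop pipes -D hadoop.pipes.java.recordreader=true \
  -D hadoop.pipes.java.recordwriter=true \
  -input HDFSPATH/input -output HDFSPATH/output \
  -program HDFSPATH/bin/EXE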

As to bash, I'm running on a 10.6.2 Mac, thus:

$ bash --version
bash --version
GNU bash, version 3.2.48(1)-release (x86_64-apple-darwin10.0)
Copyright (C) 2007 Free Software Foundation, Inc.

...so not v4.0 as you asked.

On Mar 30, 2010, at 14:29 , Gianluigi Zanetti wrote:

> What are the symptoms? 
> Pipes should run out of the box in a standard installation.
> BTW what version of bash are you using? Is it bash 4.0 by any chance?
> See https://issues.apache.org/jira/browse/HADOOP-6388
> 
> --gianluigi
> 
> 
> On Tue, 2010-03-30 at 14:13 -0700, Keith Wiley wrote:
>> My cluster admin noticed that there is some additional pipes package he 
>> could add to the cluster configuration, but he admits to knowing very little 
>> about how the C++ pipes component of Hadoop works.
>> 
>> Can you offer any insight into this cluster configuration package?  What 
>> exactly does it do that makes a cluster capable of running pipes programs 
>> (and what symptom should its absence present from a user's point of view)?
>> 
>> On Mar 30, 2010, at 13:43 , Gianluigi Zanetti wrote:
>> 
>>> Hello.
>>> Did you try following the tutorial in 
>>> http://wiki.apache.org/hadoop/C++WordCount ?
>>> 
>>> We use C++ pipes in production on a large cluster, and it works.
>>> 
>>> --gianluigi
>> 
>> 
>> 
>> Keith Wiley   kwi...@keithwiley.com   
>> www.keithwiley.com
>> 
>> "I do not feel obliged to believe that the same God who has endowed us with
>> sense, reason, and intellect has intended us to forgo their use."
>>  -- Galileo Galilei
>> 
>> 
>> 
>> 



Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
  -- Abe (Grandpa) Simpson






Re: C++ pipes on full (nonpseudo) cluster

2010-03-30 Thread Keith Wiley
$ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D 
hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true 
-input HDFSPATH/input -output HDFSPATH/output -program HDFSPATH/EXECUTABLE
Deleted hdfs://mainclusternn.hipods.ihost.com/HDFSPATH/output
10/03/30 14:56:55 WARN mapred.JobClient: No job jar file set.  User classes may 
not be found. See JobConf(Class) or JobConf#setJar(String).
10/03/30 14:56:55 INFO mapred.FileInputFormat: Total input paths to process : 1
10/03/30 14:57:05 INFO mapred.JobClient: Running job: job_201003241650_1076
10/03/30 14:57:06 INFO mapred.JobClient:  map 0% reduce 0%
^C
$

At that point the terminal hung, so I eventually ctrl-Ced to break it.  Now if 
I investigate the Hadoop task logs for the mapper, I see this:

stderr logs
bash: 
/data/disk2/hadoop/mapred/local/taskTracker/archive/mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/c++_bin/Mosaic/Mosaic:
 cannot execute binary file

...which makes perfect sense in light of the following:

$ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
Found 1 items
-rw-r--r--   1 kwiley uwphysics 211808 2010-03-30 10:26 
/uwphysics/kwiley/mosaic/c++_bin/Mosaic
$ hd fs -chmod 755 /uwphysics/kwiley/mosaic/c++_bin/Mosaic
$ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
Found 1 items
-rw-r--r--   1 kwiley uwphysics 211808 2010-03-30 10:26 
/uwphysics/kwiley/mosaic/c++_bin/Mosaic
$

Note that this is all in attempt to run an executable that was uploaded to HDFS 
in advance.  In this example I am not attempting to run an executable stored on 
my local machine.  Any attempt to do that results in a file not found error:

$ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D 
hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true 
-input HDFSPATH/input -output HDFSPATH/output -program LOCALPATH/EXECUTABLE
Deleted hdfs://mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/output
Exception in thread "main" java.io.FileNotFoundException: File does not exist: 
/Users/kwiley/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
at 
org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
at 
org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:681)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:802)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:771)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1290)
at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
$

It's clearly looking for the executable in HDFS, not on the local system, hence 
the file not found error.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
  -- Keith Wiley







Hadoop DFS IO Performance measurement

2010-03-30 Thread sagar naik
Hi All,

I am trying to measure DFS I/O performance.
I used TestDFSIO from the hadoop jars.
The results were about 100 Mbps read and write.
I think it should be more than this.

Please share some stats to compare.

Either I am missing something like config params, or something else is going on.


-Sagar


Re: java.io.IOException: Function not implemented

2010-03-30 Thread Edson Ramiro
Hi all,

Thanks for help Todd and Steve,

I configured Hadoop (0.20.2) again and I'm getting the same error (Function
not implemented).

Do you think it's a Hadoop bug?

This is the situation:

I've 28 nodes where just four are running the datanode.

On all other nodes the tasktracker is running ok.

The NN and JT are running ok.

The configuration of the machines is the same; it's an NFS shared home.

On all machines the Java version is "1.6.0_17".

This is the kernel version of the nodes. Note that there are two kernel
versions among them and the datanode doesn't work on either; it works
only on the h0* machines.

ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh uname -a  | sort
a01: Linux a01 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a02: Linux a02 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a03: Linux a03 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a04: Linux a04 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a05: Linux a05 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a06: Linux a06 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a07: Linux a07 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a09: Linux a09 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
a10: Linux a10 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
ag06: Linux ag06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
ag07: Linux ag07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
bl02: Linux bl02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
bl03: Linux bl03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
bl04: Linux bl04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
bl06: Linux bl06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
bl07: Linux bl07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
ct02: Linux ct02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
ct03: Linux ct03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
ct04: Linux ct04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
ct06: Linux ct06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
h01: Linux h01 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
h02: Linux h02 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
h03: Linux h03 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
h04: Linux h04 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
sd02: Linux sd02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
sd05: Linux sd05 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
sd06: Linux sd06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux
sd07: Linux sd07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
GNU/Linux


These are the java processes running on each client.
Just the h0* machines are running ok.

ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh pgrep -lc java | sort
a01: 1
a02: 1
a03: 1
a04: 1
a05: 1
a06: 1
a07: 1
a09: 1
a10: 1
ag06: 1
ag07: 1
bl02: 1
bl03: 1
bl04: 1
bl06: 1
bl07: 1
ct02: 1
ct03: 1
ct04: 1
ct06: 1
h01: 2
h02: 2
h03: 2
h04: 2
sd02: 1
sd05: 1
sd06: 1
sd07: 1

This is my configuration:

ram...@lcpad:~/hadoop-0.20.2$ cat conf/*site*
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://lcpad:9000</value>
  </property>
</configuration>

<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>lcpad:9001</value>
  </property>
</configuration>


Thanks in Advance,

Edson Ramiro


On 30 March 2010 05:58, Steve Loughran  wrote:

> Edson Ramiro wrote:
>
>> I'm not involved with Debian community :(
>>
>
> I think you are now...
>


Re: java.io.IOException: Function not implemented

2010-03-30 Thread Todd Lipcon
Hi Edson,

I noticed that only the h01 nodes are running 2.6.32.9, the other broken DNs
are 2.6.32.10.

Is there some reason you are running a kernel that is literally 2 weeks old?
I wouldn't be at all surprised if there were a bug here, or some issue with
your Debian "unstable" distribution...

-Todd

On Tue, Mar 30, 2010 at 3:54 PM, Edson Ramiro  wrote:

> Hi all,
>
> Thanks for help Todd and Steve,
>
> I configured Hadoop (0.20.2) again and I'm getting the same error (Function
> not implemented).
>
> Do you think it's a Hadoop bug?
>
> This is the situation:
>
> I've 28 nodes where just four are running the datanode.
>
> In all other nodes the tasktracker in running ok.
>
> The NN and JT are running ok.
>
> The configuration of the machines is the same, its a nfs shared home.
>
> In all machines the Java version is "1.6.0_17".
>
> This is the kernel version of the nodes, note that are two versions and in
> both the
> datanode doesn't work. Just in the h0* machines.
>
> ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh uname -a  | sort
> a01: Linux a01 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
> a02: Linux a02 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
> a03: Linux a03 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
> a04: Linux a04 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
> a05: Linux a05 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
> a06: Linux a06 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
> a07: Linux a07 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
> a09: Linux a09 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
> a10: Linux a10 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64 GNU/Linux
> ag06: Linux ag06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> ag07: Linux ag07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> bl02: Linux bl02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> bl03: Linux bl03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> bl04: Linux bl04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> bl06: Linux bl06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> bl07: Linux bl07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> ct02: Linux ct02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> ct03: Linux ct03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> ct04: Linux ct04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> ct06: Linux ct06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> h01: Linux h01 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
> h02: Linux h02 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
> h03: Linux h03 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
> h04: Linux h04 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64 GNU/Linux
> sd02: Linux sd02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> sd05: Linux sd05 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> sd06: Linux sd06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
> sd07: Linux sd07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> GNU/Linux
>
>
> These are the java processes running on each clients.
> Jjust the h0* machines are running ok.
>
> ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh pgrep -lc java | sort
> a01: 1
> a02: 1
> a03: 1
> a04: 1
> a05: 1
> a06: 1
> a07: 1
> a09: 1
> a10: 1
> ag06: 1
> ag07: 1
> bl02: 1
> bl03: 1
> bl04: 1
> bl06: 1
> bl07: 1
> ct02: 1
> ct03: 1
> ct04: 1
> ct06: 1
> h01: 2
> h02: 2
> h03: 2
> h04: 2
> sd02: 1
> sd05: 1
> sd06: 1
> sd07: 1
>
> This is my configuration:
>
> ram...@lcpad:~/hadoop-0.20.2$ cat conf/*site*
> 
> 
>
> 
>
> 
> 
> fs.default.name
> hdfs://lcpad:9000
> 
> 
> 
> 
>
> 
>
> 
> 
> dfs.replication
> 1
> 
> 
> 
> 
>
> 
>
> 
>  
>mapred.job.tracker
>lcpad:9001
>  
> 
>
> Thanks in Advance,
>
> Edson Ramiro
>
>
> On 30 March 2010 05:58, Steve Loughran  wrote:
>
> > Edson Ramiro wrote:
> >
> >> I'm not involved with Debian community :(
> >>
> >
> > I think you are now...
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: java.io.IOException: Function not implemented

2010-03-30 Thread Edson Ramiro
Maybe it's a bug.

I'm not the admin :(

so I'll talk to him and maybe he will install 2.6.32.9 on another node to
test :)

Thanks

Edson Ramiro


On 30 March 2010 20:00, Todd Lipcon  wrote:

> Hi Edson,
>
> I noticed that only the h01 nodes are running 2.6.32.9, the other broken
> DNs
> are 2.6.32.10.
>
> Is there some reason you are running a kernel that is literally 2 weeks
> old?
> I wouldn't be at all surprised if there were a bug here, or some issue with
> your Debian "unstable" distribution...
>
> -Todd
>
> On Tue, Mar 30, 2010 at 3:54 PM, Edson Ramiro  wrote:
>
> > Hi all,
> >
> > Thanks for help Todd and Steve,
> >
> > I configured Hadoop (0.20.2) again and I'm getting the same error
> (Function
> > not implemented).
> >
> > Do you think it's a Hadoop bug?
> >
> > This is the situation:
> >
> > I've 28 nodes where just four are running the datanode.
> >
> > In all other nodes the tasktracker in running ok.
> >
> > The NN and JT are running ok.
> >
> > The configuration of the machines is the same, its a nfs shared home.
> >
> > In all machines the Java version is "1.6.0_17".
> >
> > This is the kernel version of the nodes, note that are two versions and
> in
> > both the
> > datanode doesn't work. Just in the h0* machines.
> >
> > ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh uname -a  | sort
> > a01: Linux a01 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
> GNU/Linux
> > a02: Linux a02 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
> GNU/Linux
> > a03: Linux a03 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
> GNU/Linux
> > a04: Linux a04 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
> GNU/Linux
> > a05: Linux a05 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
> GNU/Linux
> > a06: Linux a06 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
> GNU/Linux
> > a07: Linux a07 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
> GNU/Linux
> > a09: Linux a09 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
> GNU/Linux
> > a10: Linux a10 2.6.27.11 #4 Fri Jan 16 22:32:46 BRST 2009 x86_64
> GNU/Linux
> > ag06: Linux ag06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > ag07: Linux ag07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > bl02: Linux bl02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > bl03: Linux bl03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > bl04: Linux bl04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > bl06: Linux bl06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > bl07: Linux bl07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > ct02: Linux ct02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > ct03: Linux ct03 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > ct04: Linux ct04 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > ct06: Linux ct06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > h01: Linux h01 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64
> GNU/Linux
> > h02: Linux h02 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64
> GNU/Linux
> > h03: Linux h03 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64
> GNU/Linux
> > h04: Linux h04 2.6.32.9 #2 SMP Sat Mar 6 19:09:13 BRT 2010 x86_64
> GNU/Linux
> > sd02: Linux sd02 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > sd05: Linux sd05 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > sd06: Linux sd06 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> > sd07: Linux sd07 2.6.32.10 #1 SMP Tue Mar 16 10:17:30 BRT 2010 x86_64
> > GNU/Linux
> >
> >
> > These are the java processes running on each clients.
> > Jjust the h0* machines are running ok.
> >
> > ram...@lcpad:~/hadoop-0.20.2$ ./bin/slaves.sh pgrep -lc java | sort
> > a01: 1
> > a02: 1
> > a03: 1
> > a04: 1
> > a05: 1
> > a06: 1
> > a07: 1
> > a09: 1
> > a10: 1
> > ag06: 1
> > ag07: 1
> > bl02: 1
> > bl03: 1
> > bl04: 1
> > bl06: 1
> > bl07: 1
> > ct02: 1
> > ct03: 1
> > ct04: 1
> > ct06: 1
> > h01: 2
> > h02: 2
> > h03: 2
> > h04: 2
> > sd02: 1
> > sd05: 1
> > sd06: 1
> > sd07: 1
> >
> > This is my configuration:
> >
> > ram...@lcpad:~/hadoop-0.20.2$ cat conf/*site*
> > 
> > 
> >
> > 
> >
> > 
> > 
> > fs.default.name
> > hdfs://lcpad:9000
> > 
> > 
> > 
> > 
> >
> > 
> >
> > 
> > 
> > dfs.replication
> > 1
> > 
> > 
> > 
> > 
> >
> > 
> >
> > 
> >  
> >mapred.job.tracker
> >lcpad:9001
> >  
> > 
> >
> > Thanks in Advance,
> >
> > Edson Ramiro
> >
> >
> > On 30 March 2010 05:58, Steve Loughran  wrote:
> >
> > > Edson Ramiro wrote:
> > >
> > >> I'm not involved with Debian community :(
> > >>
> > >
> > > I think you are now...
> > >
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>


Re: Hadoop DFS IO Performance measurement

2010-03-30 Thread Edson Ramiro
Hi Sagar,

What hardware did you run it on ?

Edson Ramiro


On 30 March 2010 19:41, sagar naik  wrote:

> Hi All,
>
> I am trying to get DFS IO performance.
> I used TestDFSIO from hadoop jars.
> The results were abt 100Mbps read and write .
> I think it should be more than this
>
> Pl share some stats to compare
>
> Either I am missing something like  config params or something else
>
>
> -Sagar
>


question on shuffle and sort

2010-03-30 Thread Cui tony
Hi,
  Will all key-value pairs of the map output which have the same key
be sent to the same reducer task node?


Re: question on shuffle and sort

2010-03-30 Thread Ed Mazur
On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
>  Did all key-value pairs of the map output, which have the same key, will
> be sent to the same reducer tasknode?

Yes, this is at the core of the MapReduce model. There is one call to
the user reduce function per unique map output key. This grouping is
achieved by sorting which means you see keys in increasing order.
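
The routing itself is done by the partitioner; the default HashPartitioner in
the 0.20 mapred API boils down to the following, which is why equal keys always
land on the same reduce task:

public int getPartition(K2 key, V2 value, int numReduceTasks) {
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}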

Ed


Re: question on shuffle and sort

2010-03-30 Thread 毛宏
Yes, indeed.

On Wed, 2010-03-31 at 09:56 +0800, Cui tony wrote:
> Hi,
>   Did all key-value pairs of the map output, which have the same key, will
> be sent to the same reducer tasknode?




Re: question on shuffle and sort

2010-03-30 Thread Jones, Nick
Something to keep in mind though, sorting is appropriate to the key type. Text 
will be sorted lexicographically.

Nick Jones


- Original Message -
From: Ed Mazur 
To: common-user@hadoop.apache.org 
Sent: Tue Mar 30 21:07:29 2010
Subject: Re: question on shuffle and sort

On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
>  Did all key-value pairs of the map output, which have the same key, will
> be sent to the same reducer tasknode?

Yes, this is at the core of the MapReduce model. There is one call to
the user reduce function per unique map output key. This grouping is
achieved by sorting which means you see keys in increasing order.

Ed




Re: question on shuffle and sort

2010-03-30 Thread Cui tony
Consider this extreme situation:
the input data is very large, and so is the map output. 90% of the map output
records have the same key, so all of them will be sent to one reducer task node.
Then 90% of the work of the reduce phase has to be done on a single node, not
spread across the cluster. That is very inefficient and not scalable.


2010/3/31 Jones, Nick 

> Something to keep in mind though, sorting is appropriate to the key type.
> Text will be sorted lexicographically.
>
> Nick Jones
>
>
> - Original Message -
> From: Ed Mazur 
> To: common-user@hadoop.apache.org 
> Sent: Tue Mar 30 21:07:29 2010
> Subject: Re: question on shuffle and sort
>
> On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
> >  Did all key-value pairs of the map output, which have the same key, will
> > be sent to the same reducer tasknode?
>
> Yes, this is at the core of the MapReduce model. There is one call to
> the user reduce function per unique map output key. This grouping is
> achieved by sorting which means you see keys in increasing order.
>
> Ed
>
>
>


Re: question on shuffle and sort

2010-03-30 Thread Jones, Nick
I ran into an issue where lots of data was passing from mappers to a single 
reducer. Enabling a combiner saved quite a bit of processing time by reducing 
mapper disk writes and data movements to the reducer.
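
With the 0.20 JobConf API that is a one-line change at job setup time, assuming
the reduce logic can safely be applied to partial groups of values (the class
names here are placeholders):

JobConf conf = new JobConf(MyJob.class);
conf.setCombinerClass(MyReducer.class);  // pre-aggregates map output locally before the shuffle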

Nick Jones


- Original Message -
From: Cui tony 
To: common-user@hadoop.apache.org 
Sent: Tue Mar 30 21:24:18 2010
Subject: Re: question on shuffle and sort

Consider this extreme situation:
The input data is very large, and also the map result. 90% of map result
have the same key, then all of them will be sent to one reducer tasknode.
So 90% of work of reduce phase have to been done on a single node, not the
cluster. That is very ineffective and less scalable.


2010/3/31 Jones, Nick 

> Something to keep in mind though, sorting is appropriate to the key type.
> Text will be sorted lexicographically.
>
> Nick Jones
>
>
> - Original Message -
> From: Ed Mazur 
> To: common-user@hadoop.apache.org 
> Sent: Tue Mar 30 21:07:29 2010
> Subject: Re: question on shuffle and sort
>
> On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
> >  Did all key-value pairs of the map output, which have the same key, will
> > be sent to the same reducer tasknode?
>
> Yes, this is at the core of the MapReduce model. There is one call to
> the user reduce function per unique map output key. This grouping is
> achieved by sorting which means you see keys in increasing order.
>
> Ed
>
>
>



Re: question on shuffle and sort

2010-03-30 Thread Cui tony
Hi, Jones
 As you have met the situation I was worried about, I have my answer now.
Maybe redesigning the map function or adding a combiner is the only way to deal
with this kind of input data.

2010/3/31 Jones, Nick 

> I ran into an issue where lots of data was passing from mappers to a single
> reducer. Enabling a combiner saved quite a bit of processing time by
> reducing mapper disk writes and data movements to the reducer.
>
> Nick Jones
>
>
> - Original Message -
> From: Cui tony 
> To: common-user@hadoop.apache.org 
> Sent: Tue Mar 30 21:24:18 2010
> Subject: Re: question on shuffle and sort
>
> Consider this extreme situation:
> The input data is very large, and also the map result. 90% of map result
> have the same key, then all of them will be sent to one reducer tasknode.
> So 90% of work of reduce phase have to been done on a single node, not the
> cluster. That is very ineffective and less scalable.
>
>
> 2010/3/31 Jones, Nick 
>
> > Something to keep in mind though, sorting is appropriate to the key type.
> > Text will be sorted lexicographically.
> >
> > Nick Jones
> >
> >
> > - Original Message -
> > From: Ed Mazur 
> > To: common-user@hadoop.apache.org 
> > Sent: Tue Mar 30 21:07:29 2010
> > Subject: Re: question on shuffle and sort
> >
> > On Tue, Mar 30, 2010 at 9:56 PM, Cui tony wrote:
> > >  Did all key-value pairs of the map output, which have the same key,
> will
> > > be sent to the same reducer tasknode?
> >
> > Yes, this is at the core of the MapReduce model. There is one call to
> > the user reduce function per unique map output key. This grouping is
> > achieved by sorting which means you see keys in increasing order.
> >
> > Ed
> >
> >
> >
>
>


is there any way we can limit Hadoop Datanode's disk usage?

2010-03-30 Thread steven zhuang
hi, guys,
   we have some machines with 1 TB disks and some with 100 GB disks.
   I have this question: is there any means by which we can limit the
disk usage of the datanodes on the machines with smaller disks?
   thanks!


Re: is there any way we can limit Hadoop Data node's disk usage?

2010-03-30 Thread Ravi Phulari
Hello Steven ,
You can use  dfs.datanode.du.reserved configuration value in 
$HADOOP_HOME/conf/hdfs-site.xml for limiting disk usage.


<property>
  <name>dfs.datanode.du.reserved</name>
  <value>182400</value>
  <description>Reserved space in bytes per volume. Always leave this much
  space free for non dfs use.</description>
</property>

Ravi
Hadoop @ Yahoo!

On 3/30/10 8:12 PM, "steven zhuang"  wrote:

hi, guys,
   we have some machine with 1T disk, some with 100GB disk,
   I have this question that is there any means we can limit the
disk usage of datanodes on those machines with smaller disk?
   thanks!


Ravi
--



log

2010-03-30 Thread Gang Luo
Hi all,
I find there is a directory "_log/history/..." under the output directory of a 
mapreduce job. Are the files in that directory log files? Is the information 
there sufficient to allow me to figure out what nodes the job ran on? Besides, 
not every job has such a directory. Is there a setting controlling this? Or are 
there other ways to get the nodes my job runs on?
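
(A possibly relevant pointer, assuming 0.20.x: those history files are what the
job client's history viewer reads, and with the "all" option it prints each
task attempt together with the host it ran on. If I remember right, whether the
per-job copy is written under the output directory at all is controlled by
hadoop.job.history.user.location; setting it to "none" disables it.)

hadoop job -history JOB_OUTPUT_DIR
hadoop job -history all JOB_OUTPUT_DIR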

Thanks,
-Gang