Problem with using -libjars

2011-09-16 Thread Virajith Jalaparti
Hi, I was trying to run the DumpWikipediaToPlainText job of the Cloud9 library for Hadoop (http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/wikipedia.html). It requires two extra .jar libraries, bliki-core-3.0.16.jar and commons-lang-2.6.jar (available from http://cod
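
(For reference, a minimal sketch of how the extra jars are usually passed via -libjars, assuming the Cloud9 driver goes through ToolRunner/GenericOptionsParser; the jar name, class path, and input/output paths below are illustrative placeholders, not taken from the thread:

  hadoop jar cloud9.jar edu.umd.cloud9.collection.wikipedia.DumpWikipediaToPlainText \
      -libjars bliki-core-3.0.16.jar,commons-lang-2.6.jar \
      /path/to/enwiki-pages-articles.xml /path/to/output

-libjars is only honored when it appears after the main class and the job parses its arguments with GenericOptionsParser; a common workaround otherwise is to also add the jars to HADOOP_CLASSPATH for the client-side classpath.)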

Re: Lack of data locality in Hadoop-0.20.2

2011-07-13 Thread Virajith Jalaparti
measures this counter. > > Matei > > On Jul 12, 2011, at 1:27 PM, Virajith Jalaparti wrote: > > I agree that the scheduler has less leeway when the replication factor is 1. However, I would still expect the number of data-local tasks to be more than 10% even when the r

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
at 7:20 PM, Allen Wittenauer wrote: > > On Jul 12, 2011, at 10:27 AM, Virajith Jalaparti wrote: > > > I agree that the scheduler has less leeway when the replication factor is 1. However, I would still expect the number of data-local tasks to be more t

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
On 7/12/2011 7:20 PM, Allen Wittenauer wrote: On Jul 12, 2011, at 10:27 AM, Virajith Jalaparti wrote: I agree that the scheduler has less leeway when the replication factor is 1. However, I would still expect the number of data-local tasks to be more than 10% even when the replication factor

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
stem since intra-rack b/w is good enough for most installs of Hadoop. > > Arun > > On Jul 12, 2011, at 7:36 AM, Virajith Jalaparti wrote: > > I am using a replication factor of 1 since I don't want to incur the overhead of replication and I am not much worried about reliability.

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
I am attaching the config files I was using for these runs with this email. I am not sure if something in them is causing this lack of data locality in Hadoop. Thanks, Virajith On Tue, Jul 12, 2011 at 3:36 PM, Virajith Jalaparti wrote: > I am using a replication factor of 1 since I don't want to in

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
in most cases, is good enough. > > Arun > > On Jul 12, 2011, at 5:45 AM, Virajith Jalaparti wrote: > > Hi, > > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input data using a 20-node cluster. HDFS is configured to use 128MB block

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
ounters in the job itself. > > On Tue, Jul 12, 2011 at 6:36 PM, Virajith Jalaparti wrote: > > How do I find the number of data-local map tasks that are launched? I checked the log files but didn't see any information about this. All the map tasks are rack l
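
(A hedged pointer for the question above: in 0.20.x the locality breakdown is exposed as job-level counters, e.g. "Data-local map tasks" and "Rack-local map tasks" on the job's page in the JobTracker web UI. A sketch of reading one from the command line, assuming the 0.20-era counter group name and a placeholder job id:

  hadoop job -counter job_201107121200_0001 \
      'org.apache.hadoop.mapred.JobInProgress$Counter' DATA_LOCAL_MAPS
)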

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
Each node is configured to run 8 map tasks. I am using machines with 2.4 GHz 64-bit Quad Core Xeon processors. -Virajith On Tue, Jul 12, 2011 at 2:05 PM, Sudharsan Sampath wrote: > what's the map task capacity of each node ? > > On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti
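
(For reference, the per-node map slot count referred to above is set in mapred-site.xml; a minimal sketch matching the 8 slots mentioned, using the 0.20.x property name:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
)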

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
n Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti wrote: > > Hi, > > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input data using a 20-node cluster. HDFS is configured to use 128MB block size (so 1600 maps are crea

Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
Hi, I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input data using a 20-node cluster. HDFS is configured to use a 128MB block size (so 1600 maps are created) and a replication factor of 1. All 20 nodes are also HDFS datanodes. I was using a bandwidth v
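
(For reference, a minimal hdfs-site.xml sketch of the 128MB block size and replication factor 1 described above, using the 0.20-era property names:

  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>  <!-- 128MB -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
)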

Re: Using hadoop over machines with multiple interfaces

2011-07-07 Thread Virajith Jalaparti
On Thu, Jul 7, 2011 at 1:39 PM, Virajith Jalaparti wrote: > Hi, > > I am trying to set up a Hadoop cluster (using hadoop-0.20.2) using a bunch of machines, each of which has 2 interfaces, a control and an internal interface. I want only the internal interface to be used for

Using hadoop over machines with multiple interfaces

2011-07-07 Thread Virajith Jalaparti
Hi, I am trying to set up a Hadoop cluster (using hadoop-0.20.2) using a bunch of machines, each of which has 2 interfaces: a control interface and an internal interface. I want only the internal interface to be used for running hadoop (all hadoop control and data traffic is to be sent only using the intern
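
(One common approach, sketched here under the assumption that the internal NICs are eth1 and that internal hostnames resolve on that network; the hostnames below are placeholders. Point the master addresses at internal hostnames in core-site.xml and mapred-site.xml, list internal hostnames in conf/slaves, and tell the slave daemons which interface to report:

  <property>
    <name>fs.default.name</name>
    <value>hdfs://master-internal:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master-internal:9001</value>
  </property>
  <property>
    <name>dfs.datanode.dns.interface</name>
    <value>eth1</value>
  </property>
  <property>
    <name>mapred.tasktracker.dns.interface</name>
    <value>eth1</value>
  </property>
)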

Re: How does a ReduceTask determine which MapTask output to read?

2011-06-29 Thread Virajith Jalaparti
partitions that have been created for it by the various map tasks. Now, how does the ReduceTask decide which partition to read first? Thanks, Virajith On 6/29/2011 11:37 PM, David Rosenstrauch wrote: On 06/29/2011 05:28 PM, Virajith Jalaparti wrote: Hi, I was wondering what scheduling algorithm is

How does a ReduceTask determine which MapTask output to read?

2011-06-29 Thread Virajith Jalaparti
Hi, I was wondering what scheduling algorithm is used in Hadoop (version 0.20.2 in particular) for a ReduceTask to determine in what order it is supposed to read the map outputs from the various mappers that have been run. In particular, suppose we have 10 maps called map1, map2, ..., map10.

Re: Intermediate data size of Sort example

2011-06-29 Thread Virajith Jalaparti
so counts all the reads of spilled records done during sorting of the various outputs between the MR phases. > > On Wed, Jun 29, 2011 at 6:30 PM, Virajith Jalaparti wrote: > > I would like to clarify my earlier question: I found that each reducer reports F
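
(To put rough numbers on the explanation above, which depend on io.sort.factor and how many merge passes actually run: if each reducer shuffles about 25GB but the merge cannot finish in a single pass, the intermediate merged files are written to and re-read from local disk on each extra pass, so 25GB plus roughly two further passes over the same data is about 75GB of local reads, in the ballpark of the 78GB FILE_BYTES_READ reported in the question below.)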

Re: Intermediate data size of Sort example

2011-06-29 Thread Virajith Jalaparti
I would like to clarify my earlier question: I found that each reducer reports FILE_BYTES_READ as around 78GB, HDFS_BYTES_WRITTEN as 25GB, and REDUCE_SHUFFLE_BYTES as 25GB. So, why is FILE_BYTES_READ 78GB and not just 25GB? Thanks, Virajith On Wed, Jun 29, 2011 at 10:29 AM, Virajith

Intermediate data size of Sort example

2011-06-29 Thread Virajith Jalaparti
Hi, I was running the Sort example in Hadoop 0.20.2 (hadoop-0.20.2-examples.jar) over an input data size of 100GB (generated using randomwriter) with 800 mappers (I was using a 128MB HDFS block size) and 4 reducers over a 3-machine cluster with 2 slave nodes. While the input and output were 100GB,

Re: what is mapred.reduce.parallel.copies?

2011-06-28 Thread Virajith Jalaparti
concurrent connections to a single node would be made. I am not familiar with newer versions of hadoop. On Tue, Jun 28, 2011 at 11:31 AM, Virajith Jalaparti wrote: Hi, I have a question about the "mapred.reduce.parallel.copies"

what is mapred.reduce.parallel.copies?

2011-06-28 Thread Virajith Jalaparti
Hi, I have a question about the "mapred.reduce.parallel.copies" configuration parameter in Hadoop. The mapred-default.xml file says it is "The default number of parallel transfers run by reduce during the copy(shuffle) phase." Is this the number of slave nodes from which a reduce task reads in p
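
(For reference, a minimal mapred-site.xml sketch for overriding this parameter; 5 is the shipped default in 0.20.x:

  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>5</value>
  </property>
)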

Re: "No space left on device" and "Could not find any valid local directory for taskTracker/jobcache/"

2011-06-23 Thread Virajith Jalaparti
In case it is required, I was trying to run this using 400 mappers (my DFS block size is 128MB) and 4 reducers. Each of my machines has a 2.4 GHz 64-bit Quad Core Xeon E5530 "Nehalem" processor and I am using 32-bit Ubuntu 10.04. -Virajith On Thu, Jun 23, 2011 at 3:09 PM, Virajith Jalap

"No space left on device" and "Could not find any valid local directory for taskTracker/jobcache/"

2011-06-23 Thread Virajith Jalaparti
Hi, I am trying to run a sort job (from hadoop-0.20.2-examples.jar) on 50GB of data (generated using randomwriter). I am using hadoop-0.20.2 on a cluster of 3 machines, with one machine serving as the master and the other two as slaves. I get the following errors for various task attempts:
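
(Errors of this kind usually mean the tasktrackers' local scratch space has filled up; a common mitigation, sketched here with placeholder mount points, is to spread mapred.local.dir over several volumes in mapred-site.xml:

  <property>
    <name>mapred.local.dir</name>
    <value>/disk1/mapred/local,/disk2/mapred/local</value>
  </property>
)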

Re: hdfs reformat confirmation message

2011-06-23 Thread Virajith Jalaparti
Cool...yeah "echo Y | hadoop namenode -format" works just fine. Thanks, Virajith On Wed, Jun 22, 2011 at 10:35 PM, Joey Echeverria wrote: > You could pipe 'yes' to the hadoop command: > > yes | hadoop namenode -format > > -Joey > > On Wed, Jun 22, 201
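
(Building on the echo/yes trick above, a small automation sketch; the paths and start/stop steps are illustrative, not from the thread:

  # reformat HDFS non-interactively between experiment runs
  bin/stop-all.sh
  echo Y | bin/hadoop namenode -format
  bin/start-all.sh

On 0.20.x the datanode data directories typically also need to be cleared after a namenode reformat, otherwise the datanodes refuse to register because of a namespace ID mismatch.)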

hdfs reformat confirmation message

2011-06-22 Thread Virajith Jalaparti
Hi, When I try to reformat HDFS (I have to do this multiple times for some experiments I need to run), it asks for a Y/N confirmation. Is there a way to disable this in HDFS/hadoop? I am trying to automate my process, and pressing Y every time I do this is just a lot of manual work. Thanks, Virajith

Tasktracker denied communication with jobtracker

2011-06-21 Thread Virajith Jalaparti
Hi, I am trying to set up a hadoop cluster with 7 nodes, with the master node also functioning as a slave node (i.e., it runs a datanode and a tasktracker along with the namenode and jobtracker daemons). I am able to get HDFS working. However, when I try starting the tasktrackers (bin/start-mapred.sh), I

Re: How is reduce completion % calculated?

2011-06-08 Thread Virajith Jalaparti
Also, if I set the mapred.reduce.slowstart.completed.maps value to 1, will the reduce tasks start only after all the Mappers have finished? Thanks, Virajith On Wed, Jun 8, 2011 at 3:31 PM, Virajith Jalaparti wrote: > Sean, can you point me to the file where the exact calculation
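
(For reference, a minimal mapred-site.xml sketch of the setting being asked about; the shipped default in 0.20.x is 0.05, and with 1.0 reduce tasks are not scheduled until every map has completed:

  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>1.0</value>
  </property>
)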

Re: How is reduce completion % calculated?

2011-06-08 Thread Virajith Jalaparti
reducer completion can only be 0, 0.33, 0.67, 1.0 -- of course it makes progress through a copy, sort, shuffle, reduce by chunk, by records, so can report much smaller quanta of progress than that. > > On Wed, Jun 8, 2011 at 3:19 PM, John Armstrong wrote: >> On Wed,
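
(To make the quanta mentioned above concrete: in 0.20.x a reduce task's reported progress is roughly the average of its three phases,

  progress = (copy_fraction + sort_fraction + reduce_fraction) / 3

so a reducer that has finished copying but not yet started reducing shows about 33%, which is also why the reduce % can be non-zero while maps are still running.)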

How is reduce completion % calculated?

2011-06-08 Thread Virajith Jalaparti
Hi, I am trying to figure out how the reduce progress for a job is calculated. I was looking at the syslog generated by my job run and it looks like the reducers start before the mappers complete. I figured this was the case because even when the Map had <100% completion, the reduce completion % w