Hi,
I was trying to run the DumpWikipediaToPlainText job of the Cloud9
library for Hadoop
(http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/wikipedia.html). It
requires two extra libraries, the .jar files bliki-core-3.0.16.jar and
commons-lang-2.6.jar (available from
http://cod
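A minimal sketch of how such a job can be launched with the extra jars; the
Cloud9 jar name, the driver class path, and the positional input/output
arguments are assumptions, not confirmed above:

export HADOOP_CLASSPATH=bliki-core-3.0.16.jar:commons-lang-2.6.jar
hadoop jar cloud9.jar edu.umd.cloud9.collection.wikipedia.DumpWikipediaToPlainText \
    -libjars bliki-core-3.0.16.jar,commons-lang-2.6.jar \
    /wiki/enwiki-pages-articles.xml /wiki/plaintext

HADOOP_CLASSPATH makes the jars visible to the client JVM, while -libjars
ships them to the task JVMs on the cluster.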
measures this counter.
>
> Matei
>
> On Jul 12, 2011, at 1:27 PM, Virajith Jalaparti wrote:
>
> I agree that the scheduler has less leeway when the replication factor is
> 1. However, I would still expect the number of data-local tasks to be more
> than 10% even when the replication factor is 1.
at 7:20 PM, Allen Wittenauer wrote:
>
> On Jul 12, 2011, at 10:27 AM, Virajith Jalaparti wrote:
>
> > I agree that the scheduler has less leeway when the replication factor is
> > 1. However, I would still expect the number of data-local tasks to be more
> > than 10% even when the replication factor is 1.
On 7/12/2011 7:20 PM, Allen Wittenauer wrote:
On Jul 12, 2011, at 10:27 AM, Virajith Jalaparti wrote:
I agree that the scheduler has less leeway when the replication factor is
1. However, I would still expect the number of data-local tasks to be more
than 10% even when the replication factor is 1.
stem since intra-rack b/w is good enough for most
> installs of Hadoop.
>
> Arun
>
> On Jul 12, 2011, at 7:36 AM, Virajith Jalaparti wrote:
>
> I am using a replication factor of 1 since I don't want to incur the overhead
> of replication and I am not much worried about reliability.
I am attaching the config files I was using for these runs with this email.
I am not sure if something in them is causing this lack of data locality in
Hadoop.
Thanks,
Virajith
On Tue, Jul 12, 2011 at 3:36 PM, Virajith Jalaparti wrote:
> I am using a replication factor of 1 since I don't want to incur
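For reference, a replication factor of 1 would normally be pinned in
conf/hdfs-site.xml; a minimal sketch (it overwrites the file, so this assumes
no other HDFS settings are needed):

cat > conf/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- one copy per block: no replication overhead, but no failure tolerance -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF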
in most cases, is good enough.
>
> Arun
>
> On Jul 12, 2011, at 5:45 AM, Virajith Jalaparti wrote:
>
> > Hi,
> >
> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input
> > data using a 20-node cluster. HDFS is configured to use a 128MB block
counters in the job itself.
>
> On Tue, Jul 12, 2011 at 6:36 PM, Virajith Jalaparti wrote:
> > How do I find the number of data-local map tasks that are launched? I
> > checked the log files but didn't see any information about this. All the
> > map tasks are rack local.
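For anyone else looking: in 0.20 the per-job counters, including "Data-local
map tasks" and "Rack-local map tasks", show up on the job's page in the
JobTracker web UI, and something like the following should print them from
the shell (the job id below is made up):

hadoop job -status job_201107121027_0001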
Each node is configured to run 8 map tasks. I am using machines with
2.4 GHz 64-bit quad-core Xeon processors.
-Virajith
On Tue, Jul 12, 2011 at 2:05 PM, Sudharsan Sampath wrote:
> what's the map task capacity of each node ?
>
> On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti
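The capacity in question is set per tasktracker; a sketch of the relevant
setting, assuming it lives in conf/mapred-site.xml and nothing else is needed
in that file:

cat > conf/mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- 8 map slots per node, matching the 8 hardware threads of a quad-core Nehalem -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
</configuration>
EOF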
On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti wrote:
> > Hi,
> >
> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input
> > data using a 20-node cluster. HDFS is configured to use a 128MB block
> > size (so 1600 maps are created)
Hi,
I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input
data using a 20-node cluster. HDFS is configured to use a 128MB block
size (so 1600 maps are created) and a replication factor of 1 is being used.
All the 20 nodes are also HDFS datanodes. I was using a bandwidth v
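For context, the run being described comes out to 200GB / 128MB = 1600
blocks, hence the 1600 map tasks. A sketch of the commands (paths assumed;
randomwriter's default output is roughly 10GB per node, so 20 nodes gives
about the 200GB in question):

hadoop jar hadoop-0.20.2-examples.jar randomwriter /sort-in
hadoop jar hadoop-0.20.2-examples.jar sort /sort-in /sort-out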
On Thu, Jul 7, 2011 at 1:39 PM, Virajith Jalaparti wrote:
> Hi,
>
> I am trying to set up a Hadoop cluster (using hadoop-0.20.2) on a bunch
> of machines, each of which has two interfaces: a control interface and an
> internal interface. I want only the internal interface to be used for
Hi,
I am trying to set up a Hadoop cluster (using hadoop-0.20.2) on a bunch
of machines, each of which has two interfaces: a control interface and an
internal interface. I want only the internal interface to be used for running
Hadoop (all Hadoop control and data traffic is to be sent only using the internal
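A sketch of the usual approach, with the interface name (eth1) and hostnames
assumed rather than taken from the setup above: point the master addresses at
internal hostnames, and tell the slave daemons which NIC their reported
hostname should be resolved from:

cat > conf/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- datanodes report the hostname resolved from the internal NIC -->
    <name>dfs.datanode.dns.interface</name>
    <value>eth1</value>
  </property>
</configuration>
EOF

# likewise set mapred.tasktracker.dns.interface in mapred-site.xml, and make
# fs.default.name / mapred.job.tracker use the master's internal hostname,
# e.g. hdfs://master-internal:9000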
partitions that have been created for it by the various map tasks. Now,
how does the ReduceTask decide which partition to read first?
Thanks,
Virajith
On 6/29/2011 11:37 PM, David Rosenstrauch wrote:
On 06/29/2011 05:28 PM, Virajith Jalaparti wrote:
Hi,
I was wondering what scheduling algorithm is
Hi,
I was wondering what scheduling algorithm is used in Hadoop (version
0.20.2 in particular) for a ReduceTask to determine in what order it is
supposed to read the map outputs from the various mappers that have been
run. In particular, suppose we have 10 maps called map1, map2, ...,
map10.
so counts all the reads of spilled records done
> during sorting of the various outputs between the MR phases.
>
> On Wed, Jun 29, 2011 at 6:30 PM, Virajith Jalaparti wrote:
> > I would like to clarify my earlier question: I found that each reducer
> > reports FILE_BYTES_READ
I would like to clarify my earlier question: I found that each reducer
reports FILE_BYTES_READ as around 78GB and HDFS_BYTES_WRITTEN as 25GB and
REDUCE_SHUFFLE_BYTES as 25GB. So, why is the FILE_BYTES_READ 78GB and not
just 25GB?
Thanks,
Virajith
On Wed, Jun 29, 2011 at 10:29 AM, Virajith Jalaparti wrote:
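One plausible accounting, assuming the reducers had to do a multi-pass
on-disk merge (an inference, not something the counters state directly):

REDUCE_SHUFFLE_BYTES  ~ 25 GB   (map output fetched over the network)
FILE_BYTES_READ       ~ 78 GB   ~ 3 x 25 GB

i.e. each spilled record was re-read from local disk roughly three times
while the shuffled segments were merge-sorted; io.sort.factor bounds how many
segments can be merged per pass, so a small value forces extra passes.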
Hi,
I was running the Sort example in Hadoop 0.20.2 (hadoop-0.20.2-examples.jar)
over an input data size of 100GB (generated using randomwriter) with
800 mappers (I was using a 128MB HDFS block size) and 4 reducers on a
3-machine cluster with 2 slave nodes. While the input and output were 100GB,
concurrent connections to a
single node would be made.
I am not familiar with newer versions of hadoop.
On Tue, Jun 28, 2011 at 11:31 AM, Virajith Jalaparti wrote:
Hi,
I have a question about the "mapred.reduce.parallel.copies"
Hi,
I have a question about the "mapred.reduce.parallel.copies" configuration
parameter in Hadoop. The mapred-default.xml file says it is "The default
number of parallel transfers run by reduce
during the copy(shuffle) phase."
Is this the number of slave nodes from which a reduce task reads in
parallel?
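For what it's worth, the value can also be overridden per job through the
generic options, since the example jobs go through ToolRunner (the value 10
and the paths here are arbitrary):

# each reduce task will run up to 10 parallel fetches during the shuffle
hadoop jar hadoop-0.20.2-examples.jar sort \
    -D mapred.reduce.parallel.copies=10 /sort-in /sort-out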
In case it is required: I was trying to run this using 400 mappers (my DFS
block size is 128MB) and 4 reducers. Each of my machines has a 2.4 GHz 64-bit
quad-core Xeon E5530 "Nehalem" processor and I am running 32-bit Ubuntu
10.04.
-Virajith
On Thu, Jun 23, 2011 at 3:09 PM, Virajith Jalaparti wrote:
Hi,
I am trying to run a sort job (from hadoop-0.20.2-examples.jar) on 50GB of
data (generated using randomwriter). I am using hadoop-0.20.2 on a cluster
of 3 machines, with one machine serving as the master and the other two as
slaves.
I get the following errors for the various task attempts:
Cool...yeah "echo Y | hadoop namenode -format" works just fine.
Thanks,
Virajith
On Wed, Jun 22, 2011 at 10:35 PM, Joey Echeverria wrote:
> You could pipe 'yes' to the hadoop command:
>
> yes | hadoop namenode -format
>
> -Joey
>
> On Wed, Jun 22, 2011
Hi,
When I try to reformat HDFS (I have to do this multiple times for some
experiments I need to run), it asks for a Y/N confirmation. Is there a
way to disable this in HDFS/Hadoop? I am trying to automate my process,
and pressing Y every time I do this is just a lot of manual work.
Thanks,
Virajith
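A sketch of the kind of loop this enables (paths assumed; note the capital Y,
since the format prompt is case-sensitive in at least some versions):

for run in 1 2 3; do
    bin/stop-all.sh
    echo Y | bin/hadoop namenode -format   # reformat without the prompt
    bin/start-all.sh
    # ... run one experiment here ...
done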
Hi,
I am trying to set up a Hadoop cluster of 7 nodes, with the master node also
functioning as a slave node (i.e. it runs a datanode and a tasktracker along
with the namenode and jobtracker daemons). I am able to get HDFS working.
However, when I try starting the tasktrackers (bin/start-mapred.sh), I
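For reference, the usual way to make the master double as a slave is simply
to list its hostname in conf/slaves alongside the workers, so that
start-dfs.sh/start-mapred.sh launch a datanode and tasktracker there too
(hostnames below are assumptions):

$ cat conf/slaves
master
slave1
slave2
slave3
slave4
slave5
slave6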
Also, if I set the mapred.reduce.slowstart.completed.maps value to 1, will
the reduce tasks start only after all the Mappers have finished?
Thanks,
Virajith
On Wed, Jun 8, 2011 at 3:31 PM, Virajith Jalaparti wrote:
> Sean, can you point me to the file where the exact calculation
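The parameter is the fraction of maps that must complete before any reduce
task is scheduled, so at 1.0 the reducers do wait for every map to finish. A
per-job sketch (paths arbitrary; the example jobs accept generic options):

# reduces are not scheduled until 100% of the maps have completed
hadoop jar hadoop-0.20.2-examples.jar sort \
    -D mapred.reduce.slowstart.completed.maps=1.00 /sort-in /sort-out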
reducer completion can only be 0, 0.33, 0.67, 1.0
> -- of course it makes progress through a copy, sort, shuffle, reduce by
> chunk, by records, so can report much smaller quanta of progress than that.
>
>
> On Wed, Jun 8, 2011 at 3:19 PM, John Armstrong wrote:
>
>> On Wed,
Hi,
I am trying to figure out how the reduce progress for a job is calculated. I
was looking at the syslog generated by my job run, and it looks like the
reducers start before the mappers complete. I figured this was the case
because even when the Map had <100% completion, the reduce completion % was
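For reference, a reduce task's reported progress is the average of its three
phases:

reduce progress = (copy + sort + reduce) / 3

so a reducer that has merely finished copying already reports about 33%, and
since the copy phase starts while maps are still running (by default once 5%
of maps are done, per mapred.reduce.slowstart.completed.maps), non-zero
reduce progress before 100% map completion is expected. This lines up with
the 0 / 0.33 / 0.67 / 1.0 steps mentioned above.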