Yes, my doubt is about how the location of the reducer is selected. Is it
selected arbitrarily, or is it selected on a particular machine which
already holds more of the values (corresponding to the key of that reducer), which
reduces the cost of transferring data across the network (because already
many val
Yes, but the copy phase starts with the initialization for a reducer, after
which it would keep polling for completed map tasks to fetch the respective
outputs.
-Original Message-
From: bharath vissapragada [mailto:bharathvissapragada1...@gmail.com]
Sent: Friday, August 21, 2009 12:00 P
Arun,
I am not talking about the map phase. I am talking about the reduce phase, which
starts after the map gets finished.
The key "K" I am referring to in my example is one of the distinct keys which
map outputs, and its corresponding values may be on any system depending on
where the map phase gets exec
Amogh,
I think the reduce phase starts only when all the map phases are completed,
because it needs all the values corresponding to a particular key!
2009/8/21 Amogh Vasekar
> I'm not sure that is the case with Hadoop. I think it's assigning a reduce
> task to an available tasktracker at any instant;
On Aug 20, 2009, at 9:20 PM, bharath vissapragada wrote:
OK, I'll be a bit more specific.
Suppose map outputs 100 different keys.
Consider a key "K" whose corresponding values may be on N diff
datanodes.
Consider a datanode "D" which has the maximum number of values. So
instead of
moving t
I'm not sure that is the case with Hadoop. I think it's assigning a reduce task to
an available tasktracker at any instant, since a reducer polls the JT for completed
maps. And if it were the case as you said, a reducer won't be initialized until
all maps have completed, after which the copy phase would st
Hi,
GenericOptionsParser is customized only for Hadoop-specific params:
* GenericOptionsParser recognizes several standard command
* line arguments, enabling applications to easily specify a namenode, a
* jobtracker, additional configuration resources etc.
Ideally, all params must be passe
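For what it's worth, a minimal sketch of the usual pattern (ArgsDemo is a
hypothetical class): GenericOptionsParser consumes the Hadoop-specific flags and
hands the rest back to the application, which can then parse its own options.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class ArgsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Consumes Hadoop-specific flags such as -fs, -jt, -conf, -D key=value.
    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    // Whatever the parser did not recognize is left for the application.
    String[] remaining = parser.getRemainingArgs();
    for (String arg : remaining) {
      System.out.println("application arg: " + arg);
    }
  }
}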
Hello,
I got these exceptions when I started the cluster, any suggestions?
I used Hadoop 0.15.2.
2009-08-21 12:12:53,463 ERROR org.apache.hadoop.dfs.NameNode:
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.hadoop.dfs.FSIm
Thanks for the quick reply.
I looked at it, but still could not figure out how to use HDFS to store
input data (binary) and call an executable.
Please note that I cannot modify the executable.
Maybe I am asking a dumb question, but could you please explain a bit of
how to handle the scenario I
OK, I'll be a bit more specific.
Suppose map outputs 100 different keys.
Consider a key "K" whose corresponding values may be on N diff datanodes.
Consider a datanode "D" which has the maximum number of values. So instead of
moving the values on "D"
to other systems, it is useful to bring in the v
Add some details:
1. #map is determined by the block size and the InputFormat (whether you
want to split or not).
2. The default scheduler for Hadoop is FIFO; the Fair Scheduler and the
Capacity Scheduler are the other two options as far as I know. The JobTracker has
the scheduler.
3. Once the map task i
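As a small illustration of point 1, a minimal sketch (assuming the old
org.apache.hadoop.mapred API): declaring input files non-splittable forces one
map task per file instead of one per block/split.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Force one map task per input file; otherwise #maps follows the splits.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}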
Arvind,
You can use this API to get the amount of file system space used:
FileSystem.getUsed();
But I do not find an API for calculating the remaining space. You could write
some code to compute it yourself:
remaining disk space = total disk space - operating system space -
FileSystem.getUsed()
-
You can use the jobtracker Web UI to view the disk usage.
-Original Message-
From: Arvind Sharma [mailto:arvind...@yahoo.com]
Sent: August 20, 2009 15:57
To: common-user@hadoop.apache.org
Subject: Cluster Disk Usage
Is there a way to find out how much disk space - overall or per Datanode
bas
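For reference, a minimal sketch of the getUsed() call mentioned above (assuming
the default file system is HDFS; DfsUsage is a hypothetical class):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class DfsUsage {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Total bytes consumed by files stored in this file system.
    System.out.println("HDFS bytes used: " + fs.getUsed());
  }
}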
Hi,
I am trying to run a simple map-reduce job that writes the result from the
reducer to a MySQL db.
I keep getting
09/08/20 15:44:59 INFO mapred.JobClient: Task Id :
attempt_200908201210_0013_r_00_0, Status : FAILED
java.io.IOException: com.mysql.jdbc.Driver
at
org.apache.hadoop.mapre
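That IOException usually means the MySQL JDBC driver class could not be loaded
on the task nodes. A hedged sketch, assuming the 0.19+
org.apache.hadoop.mapred.lib.db API and hypothetical host/db/credentials: ship
the connector jar with the job (e.g. via -libjars or the job jar's lib/
directory) and register the driver.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;

public class DbJobSetup {
  public static JobConf configure() {
    JobConf conf = new JobConf(DbJobSetup.class);
    // Register the JDBC driver class and connection string used by
    // DBOutputFormat when the reducer writes rows.
    DBConfiguration.configureDB(conf,
        "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost:3306/mydb",  // hypothetical host and database
        "user", "password");              // hypothetical credentials
    return conf;
  }
}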
Sorry, I also sent a direct e-mail to one response.
There I asked one question - what is the cost of these APIs? Are they too
expensive calls? Does the API only go to the NN, which stores this data?
Thanks!
Arvind
From: Arvind Sharma
To: common-us
Using hadoop-0.19.2
From: Arvind Sharma
To: common-user@hadoop.apache.org
Sent: Thursday, August 20, 2009 3:56:53 PM
Subject: Cluster Disk Usage
Is there a way to find out how much disk space - overall or per Datanode basis
- is available before creating a fi
Is there a way to find out how much disk space - overall or per Datanode basis
- is available before creating a file ?
I am trying to address an issue where the disk got full (config error) and the
client was not able to create a file on the HDFS.
I want to be able to check if there is space left
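A hedged sketch of one way to check free space up front, assuming a
0.19/0.20-era DistributedFileSystem (package org.apache.hadoop.hdfs there;
older releases used org.apache.hadoop.dfs) with getRawCapacity()/getRawUsed():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class FreeSpaceCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      long capacity = dfs.getRawCapacity(); // total raw bytes in the cluster
      long used = dfs.getRawUsed();         // raw bytes already consumed
      System.out.println("Raw bytes free: " + (capacity - used));
    }
  }
}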
On 8/20/09 3:40 AM, "Steve Loughran" wrote:
>
>
> does anyone have any up-to-date data on the memory consumption per
> block/file on the NN on a 64-bit JVM with compressed pointers?
>
> The best documentation on consumption is
> http://issues.apache.org/jira/browse/HADOOP-1687 -I'm just wond
Compressed OOPs are available now in 1.6.0u14:
https://jdk6.dev.java.net/6uNea.html
- Aaron
On Thu, Aug 20, 2009 at 10:51 AM, Raghu Angadi wrote:
>
> Suresh had made a spreadsheet for memory consumption.. will check.
>
> A large portion of NN memory is taken by references. I would expect memory
Look into "typed bytes":
http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/
On Thu, Aug 20, 2009 at 10:29 AM, Jaliya Ekanayake wrote:
> Hi Stefan,
>
>
>
> I am sorry for the late reply. Somehow the response email slipped past my
> eyes.
>
> Could you explain a bit about how to use Hadoop s
I got it working! Fantastic. One thing that hung me up for a while was how
picky the log4j.properties files are about syntax. For future reference to
others, I used this in log4j.properties:
# Define the root logger to the system property "hadoop.root.logger".
log4j.rootLogger=${hadoop.root.logger}
Hi,
I am looking for an easy way to pass the job arguments through a config file.
GenericOptionsParser seems to parse only the Hadoop options.
Normally I use JSAP, but that would not co-exist with GenericOptionsParser.
thanks
ishwar
Mithila,
It depends on which version of Hadoop you want to work on.
If you want to work on Hadoop 0.20, then you should check out the Hadoop 0.20
source code.
If you want to work on trunk, then check out the Hadoop mapreduce source.
svn checkout http://svn.apache.org/repos/asf/hadoop/mapreduce/tru
On Aug 20, 2009, at 9:00 AM, bharath vissapragada wrote:
Hi all,
Can anyone tell me how the MR scheduler schedules the MR jobs?
How does it decide where to create MAP tasks and how many to create?
Once the MAP tasks are over, how does it decide how to move the keys to the
reducer efficiently (minimizi
Hello Ted,
I know that Hadoop tries to exploit data locality, and it is pretty high.
However, data locality cannot be exploited when
'mapred.min.split.size' is set much higher than the DFS block size, because
consecutive blocks are not stored on a single machine.
I have found out that the
If you go to
http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/contrib/fairscheduler/src/java/org/apache/hadoop/mapred/AllocationConfigurationException.java?view=log
it
shows many revisions for the source
file AllocationConfigurationException.java, so I was wondering which can be
used to make
Suresh had made a spreadsheet for memory consumption.. will check.
A large portion of NN memory is taken by references. I would expect
memory savings to be very substantial (same as going from 64bit to
32bit), could be on the order of 40%.
The last I heard from Sun was that compressed point
Ananth T. Sarathy wrote:
it's on S3, and it always happens.
I have no experience with S3. You might want to check out the S3 forums. It
can't be normal for S3 either.. there must be something missing
(configuration, ACLs...).
Raghu.
Ananth T Sarathy
On Wed, Aug 19, 2009 at 4:35 PM, Raghu A
Uhh, Hadoop already goes to considerable lengths to make sure that
computation is local. In my experience it is common for 90% of the map
invocations to be working from local data. Hadoop doesn't know about record
boundaries so a little bit of slop into a non-local block is possible to
finish
Hi Stefan,
I am sorry for the late reply. Somehow the response email slipped past my
eyes.
Could you explain a bit about how to use Hadoop streaming with binary data
formats?
I can see explanations on using it with text data formats, but not for
binary files.
Thank you,
Jaliya
Stefan Podkow
On 8/20/09 9:48 AM, "Ananth T. Sarathy" wrote:
> ok.. it seems that's the case. That seems kind of self-defeating though.
>
> Ananth T Sarathy
Then something is wrong with S3. It may be misconfigured, or just poor
performance. I have no experience with S3 but 20 seconds to connect
(authentic
Probably unrelated to your problem, but one extreme case I've seen:
a user's job with large gzip inputs (non-splittable),
20 mappers and 800 reducers. Each map output was like 20G.
Too many reducers were hitting a single node as soon as a mapper finished.
I think we tried something like
mapred.reduce.
ok.. it seems that's the case. That seems kind of self-defeating though.
Ananth T Sarathy
On Thu, Aug 20, 2009 at 12:31 PM, Scott Carey wrote:
> If it always takes a very long time to start transferring data, get a few
> stack dumps (jstack or kill -e) during this period to see what it is doing
If it always takes a very long time to start transferring data, get a few
stack dumps (jstack or kill -e) during this period to see what it is doing
during this time.
Most likely, the client is doing nothing but waiting on the remote side.
On 8/20/09 8:02 AM, "Ananth T. Sarathy" wrote:
> it's
Hi,
When I try to execute *hadoop-ec2 launch-cluster test-cluster 2*, it
executes but keeps waiting at "Waiting for instance to start"; find below
the exact display as it shows on my screen:
$ bin/hadoop-ec2 launch-cluster test-cluster 2
Testing for existing master in group: test-cluster
Creating g
On Wednesday, August 19, 2009 11:21
Jakob Homan wrote:
> George-
> You can certainly submit jobs asynchronously via the
> JobClient.submitJob() method
> (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobClient.html).
>
> This will return a handle (a Runn
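As a rough sketch of the pattern Jakob describes (assuming the classic
JobClient API; AsyncSubmit is a hypothetical class): submitJob() returns
immediately with a RunningJob handle that can be polled.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class AsyncSubmit {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(AsyncSubmit.class);
    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf); // returns without blocking
    while (!job.isComplete()) {
      Thread.sleep(5000);  // free to do other work, or submit more jobs
    }
    System.out.println(job.isSuccessful() ? "done" : "failed");
  }
}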
Hi all,
Can anyone tell me how the MR scheduler schedules the MR jobs?
How does it decide where to create MAP tasks and how many to create?
Once the MAP tasks are over, how does it decide how to move the keys to the
reducer efficiently (minimizing the data movement across the network)?
Is there any doc av
On Thu, Aug 20, 2009 at 10:49 AM, mike anderson wrote:
> Yeah, that is interesting Edward. I don't need syslog-ng for any particular
> reason, other than that I'm familiar with it. If there were another way to
> get all my logs collated into one log file that would be great.
> mike
>
> On Thu, Aug
it's not really 1 mbps so much as that it takes 2 minutes to start doing the
reads.
Ananth T Sarathy
On Wed, Aug 19, 2009 at 4:30 PM, Scott Carey wrote:
>
> On 8/19/09 10:58 AM, "Raghu Angadi" wrote:
>
> > Edward Capriolo wrote:
> >>> On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo
> >>> wrote:
it's on S3, and it always happens.
Ananth T Sarathy
On Wed, Aug 19, 2009 at 4:35 PM, Raghu Angadi wrote:
> Ananth T. Sarathy wrote:
>
>> Also, I just want to be clear... the delay seems to be at the initial
>>
>> (read = in.read(buf))
>>
>
> Is the file on HDFS (over S3) or S3?
>
> Does it always hap
Yeah, that is interesting Edward. I don't need syslog-ng for any particular
reason, other than that I'm familiar with it. If there were another way to
get all my logs collated into one log file that would be great.
mike
On Thu, Aug 20, 2009 at 10:44 AM, Edward Capriolo wrote:
> On Wed, Aug 19, 20
On Wed, Aug 19, 2009 at 11:50 PM, Brian Bockelman wrote:
> Hey Mike,
>
> Yup. We find the stock log4j needs two things:
>
> 1) Set the rootLogger manually. The way 0.19.x has the root logger set up
> breaks when adding new appenders. I.e., do:
>
> log4j.rootLogger=INFO,SYSLOG,console,DRFA,EventC
Thanks Tom,
I will have a look at it.
Cheers,
Roman
On Thu, Aug 20, 2009 at 3:02 PM, Tom White wrote:
> Hi Roman,
>
> Have a look at CombineFileInputFormat - it might be related to what
> you are trying to do.
>
> Cheers,
> Tom
>
> On Thu, Aug 20, 2009 at 10:59 AM, roman kolcun
> wrote:
> > On
Hi Roman,
Have a look at CombineFileInputFormat - it might be related to what
you are trying to do.
Cheers,
Tom
On Thu, Aug 20, 2009 at 10:59 AM, roman kolcun wrote:
> On Thu, Aug 20, 2009 at 10:30 AM, Harish Mallipeddi <
> harish.mallipe...@gmail.com> wrote:
>
>> On Thu, Aug 20, 2009 at 2:39 PM
AFAIK,
hadoop.tmp.dir: used by the NN and DN for directory listings and metadata (I
don't have much info on this).
java.opts & ulimit: ulimit defines the maximum limit of virtual memory for a
launched task; java.opts is the amount of memory reserved for a task.
When setting these, you need to account for memo
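A hedged sketch of how those two usually get set per job (the keys below are
the 0.18-0.20 names mapred.child.java.opts and mapred.child.ulimit; the values
are illustrative only):

import org.apache.hadoop.mapred.JobConf;

public class TaskMemorySetup {
  public static JobConf configure() {
    JobConf conf = new JobConf();
    // Heap handed to each forked task JVM.
    conf.set("mapred.child.java.opts", "-Xmx512m");
    // Virtual-memory ceiling for a task, in kilobytes; leave headroom
    // above the heap for JVM and native overhead.
    conf.set("mapred.child.ulimit", "1048576");
    return conf;
  }
}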
does anyone have any up-to-date data on the memory consumption per
block/file on the NN on a 64-bit JVM with compressed pointers?
The best documentation on consumption is
http://issues.apache.org/jira/browse/HADOOP-1687 -I'm just wondering if
anyone has looked at the memory footprint on the
On Thu, Aug 20, 2009 at 10:30 AM, Harish Mallipeddi <
harish.mallipe...@gmail.com> wrote:
> On Thu, Aug 20, 2009 at 2:39 PM, roman kolcun
> wrote:
>
> >
> > Hello Harish,
> >
> > I know that TaskTracker creates separate threads (up to
> > mapred.tasktracker.map.tasks.maximum) which execute the ma
Hi folks,
Sorry to cut across this discussion but I'm experiencing some similar
confusion about where to change some parameters.
In particular, I'm not entirely clear on how the following should be
used - clarification welcome (I'm happy to pull some of this together on
a blog once I get som
On Thu, Aug 20, 2009 at 2:39 PM, roman kolcun wrote:
>
> Hello Harish,
>
> I know that TaskTracker creates separate threads (up to
> mapred.tasktracker.map.tasks.maximum) which execute the map() function.
> However, I haven't found the piece of code which associates a FileSplit with
> the given map
On Thu, Aug 20, 2009 at 6:49 AM, Harish Mallipeddi <
harish.mallipe...@gmail.com> wrote:
> On Thu, Aug 20, 2009 at 7:25 AM, roman kolcun
> wrote:
>
> > Hello everyone,
> > could anyone please tell me in which class and which method does Hadoop
> > download the file chunk from HDFS and associate i
Thank you very much! I'm clear about it now.
2009/8/20 Aaron Kimball
> On Wed, Aug 19, 2009 at 8:39 PM, yang song
> wrote:
>
> >Thank you, Aaron. I've benefited a lot. "per-node" means some settings
> > associated with the node. e.g., "fs.default.name", "mapred.job.tracker",
> > etc. "per-j
What do you mean?
- Aaron
On Wed, Aug 19, 2009 at 8:35 PM, Mithila Nagendra wrote:
> Thanks! But how do I know which version to work with?
> Mithila
>
>
> On Thu, Aug 20, 2009 at 2:30 AM, Ravi Phulari wrote:
>
> > Currently Fairscheduler source is in
> > hadoop-mapreduce/src/cont
There is a config "socket.read.timeout" or "socket.timeout" set to 60000
(60s). 69000 is based on that.
Mayuran Yogarajah wrote:
Hello, we were importing several TB of data overnight and it seemed one
of the loads
failed. We're running Hadoop 0.18.3, and there are 6 nodes in the
cluster, al
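If the timeout needs raising, a hedged sketch (the exact key varies by
release; "dfs.socket.timeout" is the 0.18-era name mentioned in DFSClient, but
verify against your version before relying on it):

import org.apache.hadoop.conf.Configuration;

public class TimeoutSetup {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // Raise the DFS socket read timeout from the 60s default (milliseconds).
    conf.setInt("dfs.socket.timeout", 120000);
    return conf;
  }
}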
On Wed, Aug 19, 2009 at 8:39 PM, yang song wrote:
>Thank you, Aaron. I've benefited a lot. "per-node" means some settings
> associated with the node. e.g., "fs.default.name", "mapred.job.tracker",
> etc. "per-job" means some settings associated with the jobs which are
> submited from the node