Re: MR job scheduler

2009-08-20 Thread bharath vissapragada
Yes, my doubt is about how the location of the reducer is selected. Is it selected arbitrarily, or is it selected on a particular machine which already has more of the values (corresponding to the key of that reducer), which reduces the cost of transferring data across the network (because already many val

RE: MR job scheduler

2009-08-20 Thread Amogh Vasekar
Yes, but the copy phase starts with the initialization of a reducer, after which it would keep polling for completed map tasks to fetch the respective outputs. -Original Message- From: bharath vissapragada [mailto:bharathvissapragada1...@gmail.com] Sent: Friday, August 21, 2009 12:00 P
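
When reducers start, relative to map completion, is itself tunable. A minimal sketch, assuming a 0.19/0.20-era release where the mapred.reduce.slowstart.completed.maps knob exists (MyJob is a hypothetical driver class):

    import org.apache.hadoop.mapred.JobConf;

    JobConf job = new JobConf(MyJob.class);  // MyJob: hypothetical driver
    // Fraction of maps that must finish before reducers (and their copy
    // phase) are scheduled; raising it delays the shuffle.
    job.set("mapred.reduce.slowstart.completed.maps", "0.80");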

Re: MR job scheduler

2009-08-20 Thread bharath vissapragada
Arun, I am not talking about the map phase. I am talking about the reduce phase, which starts after the map gets finished. The key "K" I am referring to in my example is one of the distinct keys which the map outputs, and its corresponding values may be on any system depending on where the map phase gets exec

Re: MR job scheduler

2009-08-20 Thread bharath vissapragada
Amogh, I think the reduce phase starts only when all the map phases are completed, because it needs all the values corresponding to a particular key! 2009/8/21 Amogh Vasekar > I'm not sure that is the case with Hadoop. I think it's assigning the reduce > task to an available tasktracker at any instant;

Re: MR job scheduler

2009-08-20 Thread Arun C Murthy
On Aug 20, 2009, at 9:20 PM, bharath vissapragada wrote: OK, I'll be a bit more specific. Suppose the map outputs 100 different keys. Consider a key "K" whose corresponding values may be on N diff datanodes. Consider a datanode "D" which has the maximum number of values. So instead of moving t

RE: MR job scheduler

2009-08-20 Thread Amogh Vasekar
I'm not sure that is the case with Hadoop. I think it's assigning the reduce task to an available tasktracker at any instant, since a reducer polls the JT for completed maps. And if it were the case as you said, a reducer won't be initialized until all maps have completed, after which the copy phase would st

RE: passing job arguments as an xml file

2009-08-20 Thread Amogh Vasekar
Hi, GenericOptionsParser is customized only for Hadoop-specific params: * GenericOptionsParser recognizes several standard command * line arguments, enabling applications to easily specify a namenode, a * jobtracker, additional configuration resources etc. Ideally, all params must be passe
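
A minimal sketch of the split described above - Hadoop-specific args are consumed by GenericOptionsParser, everything else comes back from getRemainingArgs() (0.20-era class names; MyDriver and my.custom.param are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class MyDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Consumes -fs, -jt, -D, -conf etc.; any -conf file is merged into conf
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        String[] appArgs = parser.getRemainingArgs();  // your app's own args
        String custom = conf.get("my.custom.param");   // set via -D or -conf
      }
    }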

Exception when starting namenode

2009-08-20 Thread Zheng Lv
Hello, I got these exceptions when I started the cluster, any suggestions? I used hadoop 0.15.2. 2009-08-21 12:12:53,463 ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.dfs.FSIm

RE: Using Hadoop with executables and binary data

2009-08-20 Thread Jaliya Ekanayake
Thanks for the quick reply. I looked at it, but still could not figure out how to use HDFS to store input data (binary) and call an executable. Please note that I cannot modify the executable. Maybe I am asking a dumb question, but could you please explain a bit of how to handle the scenario I

Re: MR job scheduler

2009-08-20 Thread bharath vissapragada
OK, I'll be a bit more specific. Suppose the map outputs 100 different keys. Consider a key "K" whose corresponding values may be on N diff datanodes. Consider a datanode "D" which has the maximum number of values. So instead of moving the values on "D" to other systems, it is useful to bring in the v

RE: MR job scheduler

2009-08-20 Thread zjffdu
Add some details: 1. #maps is determined by the block size and the InputFormat (whether you want to split or not). 2. The default scheduler for Hadoop is FIFO; the Fair Scheduler and the Capacity Scheduler are the other two options as far as I know. The JobTracker has the scheduler. 3. Once the map task i
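
On point 2: the scheduler is chosen by a JobTracker-side property. A sketch of the hadoop-site.xml entry, assuming the Fair Scheduler contrib jar is on the JobTracker's classpath:

    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>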

RE: Cluster Disk Usage

2009-08-20 Thread zjffdu
Arvind, You can use this API to get the size of the file system used: FileSystem.getUsed(); But I do not find an API for calculating the remaining space. You can write some code to create an API. The remaining disk space = total disk space - operating system space - FileSystem.getUsed() -
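
A rough sketch along those lines (0.19-era API; the raw-capacity methods below are what dfsadmin -report draws on, if memory serves - treat their names as an assumption to verify against your release):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Used by DFS: " + fs.getUsed() + " bytes");
    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // Raw totals across all datanodes, as reported by the NN
      long remaining = dfs.getRawCapacity() - dfs.getRawUsed();
      System.out.println("Raw remaining: " + remaining + " bytes");
    }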

RE: Cluster Disk Usage

2009-08-20 Thread zjffdu
You can use the jobtracker Web UI to view the disk usage. -Original Message- From: Arvind Sharma [mailto:arvind...@yahoo.com] Sent: August 20, 2009 15:57 To: common-user@hadoop.apache.org Subject: Cluster Disk Usage Is there a way to find out how much disk space - overall or per Datanode bas

Writing to a db with DBOutputFormat spits out IOException Error

2009-08-20 Thread ishwar ramani
Hi, I am trying to run a simple map reduce that writes the result from the reducer to a mysql db. I keep getting 09/08/20 15:44:59 INFO mapred.JobClient: Task Id : attempt_200908201210_0013_r_00_0, Status : FAILED java.io.IOException: com.mysql.jdbc.Driver at org.apache.hadoop.mapre
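
An IOException whose message is just the driver class name usually wraps a ClassNotFoundException: the MySQL connector jar is not on the task classpath. A sketch of the usual setup, with hypothetical host, database, table and column names - ship the connector via -libjars or the job jar's lib/ directory:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBOutputFormat;

    JobConf job = new JobConf(MyJob.class);   // MyJob: placeholder driver
    job.setOutputFormat(DBOutputFormat.class);
    DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost:3306/mydb", "user", "password");
    // Table "results"; columns must match the order the record writes them
    DBOutputFormat.setOutput(job, "results", "name", "count");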

Re: Cluster Disk Usage

2009-08-20 Thread Arvind Sharma
Sorry, I also sent a direct e-mail in reply to one response there. I asked one question - what is the cost of these APIs? Are they expensive calls? Is the API only going to the NN which stores this data? Thanks! Arvind From: Arvind Sharma To: common-us

Re: Cluster Disk Usage

2009-08-20 Thread Arvind Sharma
Using hadoop-0.19.2 From: Arvind Sharma To: common-user@hadoop.apache.org Sent: Thursday, August 20, 2009 3:56:53 PM Subject: Cluster Disk Usage Is there a way to find out how much disk space - overall or per Datanode basis - is available before creating a fi

Cluster Disk Usage

2009-08-20 Thread Arvind Sharma
Is there a way to find out how much disk space - overall or per Datanode basis - is available before creating a file? I am trying to address an issue where the disk got full (config error) and the client was not able to create a file on the HDFS. I want to be able to check if there is space left

Re: NN memory consumption on 0.20/0.21 with compressed pointers/

2009-08-20 Thread Scott Carey
On 8/20/09 3:40 AM, "Steve Loughran" wrote: > > > does anyone have any up to date data on the memory consumption per > block/file on the NN on a 64-bit JVM with compressed pointers? > > The best documentation on consumption is > http://issues.apache.org/jira/browse/HADOOP-1687 -I'm just wond

Re: NN memory consumption on 0.20/0.21 with compressed pointers/

2009-08-20 Thread Aaron Kimball
Compressed OOPs are available now in 1.6.0u14: https://jdk6.dev.java.net/6uNea.html - Aaron On Thu, Aug 20, 2009 at 10:51 AM, Raghu Angadi wrote: > > Suresh had made a spreadsheet for memory consumption.. will check. > > A large portion of NN memory is taken by references. I would expect memory
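
One possible hadoop-env.sh tweak to try this on the NN (assuming JAVA_HOME points at a 1.6.0u14+ JVM; the flag is simply ignored on 32-bit JVMs):

    # hadoop-env.sh - pass the flag only to the NameNode daemon
    export HADOOP_NAMENODE_OPTS="-XX:+UseCompressedOops $HADOOP_NAMENODE_OPTS"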

Re: Using Hadoop with executables and binary data

2009-08-20 Thread Aaron Kimball
Look into "typed bytes": http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/ On Thu, Aug 20, 2009 at 10:29 AM, Jaliya Ekanayake wrote: > Hi Stefan, > > > > I am sorry, for the late reply. Somehow the response email has slipped my > eyes. > > Could you explain a bit on how to use Hadoop s

Re: syslog-ng and hadoop

2009-08-20 Thread mike anderson
I got it working! fantastic. One thing that hung me up for a while was how picky the log4j.properties files are about syntax. For future reference to others, I used this in log4j.properties: # Define the root logger to the system property "hadoop.root.logger". log4j.rootLogger=${hadoop.root.logger}
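
For others wiring this up, a sketch of the appender half of that log4j.properties (the host and facility are placeholders that must match your syslog-ng.conf):

    log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
    log4j.appender.SYSLOG.SyslogHost=loghost.example.com
    log4j.appender.SYSLOG.Facility=LOCAL1
    log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
    log4j.appender.SYSLOG.layout.ConversionPattern=%p %c: %m%n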

passing job arguments as an xml file

2009-08-20 Thread ishwar ramani
Hi, I am looking for an easy way to pass the job arguments through a config file. The GenericOptionsParser seems to parse only the Hadoop options. Normally I use JSAP, but that would not co-exist with GenericOptionsParser. Thanks, ishwar
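
One option that needs no extra parser: GenericOptionsParser's -conf flag loads an arbitrary Hadoop-style XML file into the job Configuration (jar, class, file and property names below are made up):

    hadoop jar myjob.jar MyDriver -conf jobparams.xml in out

    <!-- jobparams.xml -->
    <configuration>
      <property>
        <name>my.custom.param</name>
        <value>42</value>
      </property>
    </configuration>

The driver then reads the value with conf.get("my.custom.param") once the parser has run.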

Re: Location of the source code for the fair scheduler

2009-08-20 Thread Ravi Phulari
Mithila, It depends on which version of Hadoop you want to work on. If you want to work on Hadoop 0.20 then you should check out the Hadoop 0.20 source code. If you want to work on trunk then check out the Hadoop mapreduce source. svn checkout http://svn.apache.org/repos/asf/hadoop/mapreduce/tru

Re: MR job scheduler

2009-08-20 Thread Arun C Murthy
On Aug 20, 2009, at 9:00 AM, bharath vissapragada wrote: Hi all, Can anyone tell me how the MR scheduler schedules the MR jobs? How does it decide where to create MAP tasks and how many to create? Once the MAP tasks are over, how does it decide to move the keys to the reducer efficiently (minimizi

Re: File Chunk to Map Thread Association

2009-08-20 Thread roman kolcun
Hello Ted, I know that Hadoop tries to exploit data locality, and the locality rate is pretty high. However, data locality cannot be exploited when 'mapred.min.split.size' is set much higher than the DFS block size - because consecutive blocks are not stored on a single machine. I have found out that the

Re: Location of the source code for the fair scheduler

2009-08-20 Thread Mithila Nagendra
If you go to http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/contrib/fairscheduler/src/java/org/apache/hadoop/mapred/AllocationConfigurationException.java?view=log it shows many revisions for the source file AllocationConfigurationException.java, so I was wondering which can be used to make

Re: NN memory consumption on 0.20/0.21 with compressed pointers/

2009-08-20 Thread Raghu Angadi
Suresh had made a spreadsheet for memory consumption.. will check. A large portion of NN memory is taken by references. I would expect memory savings to be very substantial (same as going from 64-bit to 32-bit), could be on the order of 40%. The last I heard from Sun was that compressed point

Re: Faster alternative to FSDataInputStream

2009-08-20 Thread Raghu Angadi
Ananth T. Sarathy wrote: it's on s3. and it always happens. I have no experience with S3. You might want to check out S3 forums. It can't be normal for S3 either.. there must be something missing (configuration, ACLs... ). Raghu. Ananth T Sarathy On Wed, Aug 19, 2009 at 4:35 PM, Raghu A

Re: File Chunk to Map Thread Association

2009-08-20 Thread Ted Dunning
Uhh, Hadoop already goes to considerable lengths to make sure that computation is local. In my experience it is common for 90% of the map invocations to be working from local data. Hadoop doesn't know about record boundaries, so a little bit of slop into a non-local block is possible to finish

Re: Using Hadoop with executables and binary data

2009-08-20 Thread Jaliya Ekanayake
Hi Stefan, I am sorry for the late reply. Somehow the response email slipped my eyes. Could you explain a bit on how to use Hadoop streaming with binary data formats? I can see explanations on using it with text data formats, but not for binary files. Thank you, Jaliya Stefan Podkow

Re: Faster alternative to FSDataInputStream

2009-08-20 Thread Scott Carey
On 8/20/09 9:48 AM, "Ananth T. Sarathy" wrote: > OK... it seems that's the case. That seems kind of self-defeating though. > > Ananth T Sarathy Then something is wrong with S3. It may be misconfigured, or just poor performance. I have no experience with S3 but 20 seconds to connect (authentic

Re: How to deal with "too many fetch failures"?

2009-08-20 Thread Koji Noguchi
Probably unrelated to your problem, but one extreme case I've seen: a user's job with large gzip inputs (non-splittable), 20 mappers and 800 reducers. Each map outputted something like 20G. Too many reducers were hitting a single node as soon as a mapper finished. I think we tried something like mapred.reduce.

Re: Faster alternative to FSDataInputStream

2009-08-20 Thread Ananth T. Sarathy
OK... it seems that's the case. That seems kind of self-defeating though. Ananth T Sarathy On Thu, Aug 20, 2009 at 12:31 PM, Scott Carey wrote: > If it always takes a very long time to start transferring data, get a few > stack dumps (jstack or kill -QUIT) during this period to see what it is doing

Re: Faster alternative to FSDataInputStream

2009-08-20 Thread Scott Carey
If it always takes a very long time to start transferring data, get a few stack dumps (jstack or kill -QUIT) during this period to see what it is doing during this time. Most likely, the client is doing nothing but waiting on the remote side. On 8/20/09 8:02 AM, "Ananth T. Sarathy" wrote: > it's

Invalid argument for option USER_DATA_FILE

2009-08-20 Thread Harshit Kumar
Hi, When I try to execute *hadoop-ec2 launch-cluster test-cluster 2*, it executes but keeps waiting at "Waiting for instance to start". Find below the exact display as it shows on my screen: $ bin/hadoop-ec2 launch-cluster test-cluster 2 Testing for existing master in group: test-cluster Creating g

Re: submitting multiple small jobs simultaneously

2009-08-20 Thread George Jahad
On Wednesday, August 19, 2009 11:21 Jakob Homan wrote: > George- > You can certainly submit jobs asynchronously via the > JobClient.submitJob() method > (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobClient.html). > > This will return a handle (a Runn
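
A minimal sketch of the non-blocking pattern described here, against the old mapred API (poll interval is arbitrary; MyJob is a placeholder):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class AsyncSubmit {
      public static void main(String[] args) throws Exception {
        JobConf jobConf = new JobConf(MyJob.class);
        JobClient client = new JobClient(jobConf);
        RunningJob handle = client.submitJob(jobConf); // returns immediately
        while (!handle.isComplete()) {  // unlike runJob(), we poll ourselves
          Thread.sleep(5000);
        }
        System.out.println("succeeded: " + handle.isSuccessful());
      }
    }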

MR job scheduler

2009-08-20 Thread bharath vissapragada
Hi all, Can anyone tell me how the MR scheduler schedules the MR jobs? How does it decide where to create MAP tasks and how many to create? Once the MAP tasks are over, how does it decide to move the keys to the reducer efficiently (minimizing the data movement across the network)? Is there any doc av

Re: syslog-ng and hadoop

2009-08-20 Thread Edward Capriolo
On Thu, Aug 20, 2009 at 10:49 AM, mike anderson wrote: > Yeah, that is interesting Edward. I don't need syslog-ng for any particular > reason, other than that I'm familiar with it. If there were another way to > get all my logs collated into one log file that would be great. > mike > > On Thu, Aug

Re: Faster alternative to FSDataInputStream

2009-08-20 Thread Ananth T. Sarathy
It's not really the 1 mbps so much as that it takes 2 minutes to start doing the reads. Ananth T Sarathy On Wed, Aug 19, 2009 at 4:30 PM, Scott Carey wrote: > > On 8/19/09 10:58 AM, "Raghu Angadi" wrote: > > > Edward Capriolo wrote: > >>> On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo > >>> wrote:

Re: Faster alternative to FSDataInputStream

2009-08-20 Thread Ananth T. Sarathy
it's on s3. and it always happens. Ananth T Sarathy On Wed, Aug 19, 2009 at 4:35 PM, Raghu Angadi wrote: > Ananth T. Sarathy wrote: > >> Also, I just want to be clear... the delay seems to be at the initial >> >> (read = in.read(buf)) >> > > Is the file on HDFS (over S3) or S3? > > Does it always hap

Re: syslog-ng and hadoop

2009-08-20 Thread mike anderson
Yeah, that is interesting Edward. I don't need syslog-ng for any particular reason, other than that I'm familiar with it. If there were another way to get all my logs collated into one log file that would be great. mike On Thu, Aug 20, 2009 at 10:44 AM, Edward Capriolo wrote: > On Wed, Aug 19, 20

Re: syslog-ng and hadoop

2009-08-20 Thread Edward Capriolo
On Wed, Aug 19, 2009 at 11:50 PM, Brian Bockelman wrote: > Hey Mike, > > Yup.  We find the stock log4j needs two things: > > 1) Set the rootLogger manually.  The way 0.19.x has the root logger set up > breaks when adding new appenders.  I.e., do: > > log4j.rootLogger=INFO,SYSLOG,console,DRFA,EventC

Re: File Chunk to Map Thread Association

2009-08-20 Thread roman kolcun
Thanks Tom, I will have a look at it. Cheers, Roman On Thu, Aug 20, 2009 at 3:02 PM, Tom White wrote: > Hi Roman, > > Have a look at CombineFileInputFormat - it might be related to what > you are trying to do. > > Cheers, > Tom > > On Thu, Aug 20, 2009 at 10:59 AM, roman kolcun > wrote: > > On

Re: File Chunk to Map Thread Association

2009-08-20 Thread Tom White
Hi Roman, Have a look at CombineFileInputFormat - it might be related to what you are trying to do. Cheers, Tom On Thu, Aug 20, 2009 at 10:59 AM, roman kolcun wrote: > On Thu, Aug 20, 2009 at 10:30 AM, Harish Mallipeddi < > harish.mallipe...@gmail.com> wrote: > >> On Thu, Aug 20, 2009 at 2:39 PM

RE: Running hadoop jobs from a client and tuning (was Re: How does hadoop deal with hadoop-site.xml?)

2009-08-20 Thread Amogh Vasekar
AFAIK, hadoop.tmp.dir: used by the NN and DN for directory listings and metadata (don't have much info on this). java.opts & ulimit: ulimit defines the maximum limit of virtual memory for a launched task; java.opts is the amount of memory reserved for a task. When setting these you need to account for memo
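
A sketch of those two knobs as they'd appear in hadoop-site.xml (values are illustrative; mapred.child.ulimit is in KB and should leave headroom above the heap for JVM overhead):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx512m</value>    <!-- heap per task JVM -->
    </property>
    <property>
      <name>mapred.child.ulimit</name>
      <value>1048576</value>     <!-- ~1 GB virtual memory cap, in KB -->
    </property>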

NN memory consumption on 0.20/0.21 with compressed pointers/

2009-08-20 Thread Steve Loughran
does anyone have any up to date data on the memory consumption per block/file on the NN on a 64-bit JVM with compressed pointers? The best documentation on consumption is http://issues.apache.org/jira/browse/HADOOP-1687 -I'm just wondering if anyone has looked at the memory footprint on the

Re: File Chunk to Map Thread Association

2009-08-20 Thread roman kolcun
On Thu, Aug 20, 2009 at 10:30 AM, Harish Mallipeddi < harish.mallipe...@gmail.com> wrote: > On Thu, Aug 20, 2009 at 2:39 PM, roman kolcun > wrote: > > > > > Hello Harish, > > > > I know that TaskTracker creates separate threads (up to > > mapred.tasktracker.map.tasks.maximum) which execute the ma

Running hadoop jobs from a client and tuning (was Re: How does hadoop deal with hadoop-site.xml?)

2009-08-20 Thread stephen mulcahy
Hi folks, Sorry to cut across this discussion but I'm experiencing some similar confusion about where to change some parameters. In particular, I'm not entirely clear on how the following should be used - clarification welcome (I'm happy to pull some of this together on a blog once I get som

Re: File Chunk to Map Thread Association

2009-08-20 Thread Harish Mallipeddi
On Thu, Aug 20, 2009 at 2:39 PM, roman kolcun wrote: > > Hello Harish, > > I know that TaskTracker creates separate threads (up to > mapred.tasktracker.map.tasks.maximum) which execute the map() function. > However, I haven't found the piece of code which associate FileSplit with > the given map

Re: File Chunk to Map Thread Association

2009-08-20 Thread roman kolcun
On Thu, Aug 20, 2009 at 6:49 AM, Harish Mallipeddi < harish.mallipe...@gmail.com> wrote: > On Thu, Aug 20, 2009 at 7:25 AM, roman kolcun > wrote: > > > Hello everyone, > > could anyone please tell me in which class and which method does Hadoop > > download the file chunk from HDFS and associate i

Re: How does hadoop deal with hadoop-site.xml?

2009-08-20 Thread yang song
Thank you very much! I'm clear about it now. 2009/8/20 Aaron Kimball > On Wed, Aug 19, 2009 at 8:39 PM, yang song > wrote: > > >Thank you, Aaron. I've benefited a lot. "per-node" means some settings > > associated with the node. e.g., "fs.default.name", "mapred.job.tracker", > > etc. "per-j

Re: Location of the source code for the fair scheduler

2009-08-20 Thread Aaron Kimball
What do you mean? - Aaron On Wed, Aug 19, 2009 at 8:35 PM, Mithila Nagendra wrote: > Thanks! But How do I know which version to work with? > Mithila > > > On Thu, Aug 20, 2009 at 2:30 AM, Ravi Phulari >wrote: > > > Currently Fairscheduler source is in > > hadoop-mapreduce/src/cont

Re: Loading data failed with timeout

2009-08-20 Thread Raghu Angadi
There is a config "socket.read.timeout" or "socket.timeout" set to 60000 (60s); the 69000 is based on that. Mayuran Yogarajah wrote: Hello, we were importing several TB of data overnight and it seemed one of the loads failed. We're running Hadoop 0.18.3, and there are 6 nodes in the cluster, al
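
If the slow reads are legitimate (e.g. heavily loaded datanodes), the base timeout can be raised. A sketch, assuming the 0.18-era dfs.socket.timeout property (milliseconds):

    <property>
      <name>dfs.socket.timeout</name>
      <value>180000</value>   <!-- 3 minutes instead of the 60s default -->
    </property>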

Re: How does hadoop deal with hadoop-site.xml?

2009-08-20 Thread Aaron Kimball
On Wed, Aug 19, 2009 at 8:39 PM, yang song wrote: >Thank you, Aaron. I've benefited a lot. "per-node" means some settings > associated with the node. e.g., "fs.default.name", "mapred.job.tracker", > etc. "per-job" means some settings associated with the jobs which are > submited from the node