RE: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Aaron Baff
The number of bytes read can exceed the block size somewhat because each block rarely starts/ends on a record (e.g. line) boundary. So it usually needs to read a bit before and/or after the actual block boundary in order to correctly read in all of the records it is supposed to. If you look, it's not
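
A self-contained sketch (plain Java, not Hadoop source) of the record-reader behavior described above: the reader for a block keeps consuming whole records until it passes the block boundary, so its byte count overshoots the block size. The toy data and names are illustrative only.

    import java.util.Arrays;

    public class BoundaryDemo {
        public static void main(String[] args) {
            byte[] data = "rec1\nrec2\nrec3-spans-boundary\nrec4\n".getBytes();
            int blockSize = 16; // pretend the block boundary falls mid-record

            int pos = 0;
            int bytesRead = 0;
            // Start a new record only if it begins inside this block, but
            // always finish the record, even if it runs past the boundary.
            while (pos < blockSize && pos < data.length) {
                int eol = pos;
                while (eol < data.length && data[eol] != '\n') {
                    eol++;
                }
                System.out.println("read record: "
                    + new String(Arrays.copyOfRange(data, pos, eol)));
                bytesRead += (eol + 1) - pos;
                pos = eol + 1;
            }
            // Prints 30 bytes read against a 16-byte "block".
            System.out.println("block size = " + blockSize
                + ", bytes read = " + bytesRead);
        }
    }

The reader for the next block applies the mirror rule, skipping the partial record at its start, so every record is read exactly once.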

RE: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Aaron Baff
Well, if you think about it, you'll have more/better locality if more nodes hold the same blocks. That gives the scheduler more leeway to find a node that has a block that hasn't been processed yet. Have you tried it with a replication of 2 or 3 and seen what that does? --Aaron ---
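
A minimal sketch of raising the replication factor on an existing input file so the scheduler has more candidate nodes for local reads; the path is a placeholder and error handling is omitted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // With replication 3, three DataNodes hold each block, so the
            // scheduler has three chances to place a map task data-local.
            fs.setReplication(new Path("/data/input/part-00000"), (short) 3);
        }
    }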

RE: Reduce method called same key twice

2011-06-29 Thread Aaron Baff
record, so only 1 that is different per set) and it is my understanding that they would be grouped together (without the primary key) if I didn't do anything different. -Trevor On Wed, Jun 29, 2011 at 2:07 PM, Aaron Baff wrote: You probably need to implement a custom comparator that you use a

RE: Reduce method called same key twice

2011-06-29 Thread Aaron Baff
You probably need to implement a custom comparator, used as the grouping comparator, that compares the primary key and then, if those are the same, compares the int part of the key. --Aaron - From: Trevor Adams [mai
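
A hedged sketch of such a comparator, assuming a composite key serialized as Text in the form "<primaryKey>:<intPart>"; a real implementation would compare the key's typed fields directly. It would be registered via JobConf.setOutputValueGroupingComparator() (old API) or Job.setGroupingComparatorClass() (new API).

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class CompositeKeyComparator extends WritableComparator {
        protected CompositeKeyComparator() {
            super(Text.class, true); // create Text instances to compare
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String[] left = a.toString().split(":", 2);
            String[] right = b.toString().split(":", 2);
            // Compare the primary key first...
            int byPrimary = left[0].compareTo(right[0]);
            if (byPrimary != 0) {
                return byPrimary;
            }
            // ...and only fall back to the int part on a tie.
            return Integer.compare(Integer.parseInt(left[1]),
                                   Integer.parseInt(right[1]));
        }
    }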

RE: bin/hadoop job -history doesn't show all job information

2011-06-08 Thread Aaron Baff
I believe that this data is removed by the JobTracker approximately an hour after the Job completes. That's the default timeout; it can be changed, but the parameter name escapes me at the moment. --Aaron -Original Message- From: Pedro Costa [mailto:psdc1...@gmail.com] Sent: Wednesday,

RE: Printing the job status on the client side

2011-05-23 Thread Aaron Baff
You need to use the RunningJob (old API) or Job (new API) object, and use those to get the Mapper & Reducer progress. They return it as a fraction from 0.0 to 1.0. --Aaron From: praveen.pe...@nokia.com [mailto:praveen.pe...@nokia.com] Sent: Monday, May 23, 2011
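
A minimal old-API sketch of polling that progress; the job ID string is a placeholder.

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class ProgressPoll {
        public static void main(String[] args) throws Exception {
            JobClient client = new JobClient(new JobConf());
            RunningJob job = client.getJob(JobID.forName("job_201105230000_0001"));
            if (job == null) {
                System.err.println("JobTracker no longer knows this job");
                return;
            }
            while (!job.isComplete()) {
                // mapProgress()/reduceProgress() return a fraction in [0.0, 1.0].
                System.out.printf("map %3.0f%%  reduce %3.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
                Thread.sleep(5000);
            }
        }
    }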

RE: Running M/R jobs from java code

2011-05-19 Thread Aaron Baff
mapred.jar",[FILE]), but I couldn't find a file format that works? Lior On Wed, May 18, 2011 at 8:18 PM, Aaron Baff wrote: It's not terribly hard to submit MR Job's. Create a hadoop Configuration object, and set it's fs.default.name and fs.defaultFS to the Namenode URI, an

RE: Running M/R jobs from java code

2011-05-18 Thread Aaron Baff
Didn't know one could do this, thanks. I'll give it a try. On 18 May 2011 10:18, Aaron Baff wrote: It's not terribly hard to submit MR Jobs. Create a Hadoop Configuration object, and set its fs.default.name and fs.defaultFS to the NameNode URI, and mapreduce.jobtracker.

RE: Running M/R jobs from java code

2011-05-18 Thread Aaron Baff
It's not terribly hard to submit MR Jobs. Create a Hadoop Configuration object, and set its fs.default.name and fs.defaultFS to the NameNode URI, and mapreduce.jobtracker.address and mapred.job.tracker to the JobTracker URI. You can then easily set up and use a Job object (new API), or JobConf
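
A hedged sketch of the approach described above, using the old API; host names, ports, and paths are placeholders, and the identity mapper/reducer stand in for real job classes.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class SubmitFromJava {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SubmitFromJava.class);
            // Point the client at the cluster; both the old and new key
            // names are set, as the message above suggests.
            conf.set("fs.default.name", "hdfs://namenode:8020");
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            conf.set("mapred.job.tracker", "jobtracker:8021");
            conf.set("mapreduce.jobtracker.address", "jobtracker:8021");

            // Identity job purely for illustration; plug in your own classes.
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(conf, new Path("/data/in"));
            FileOutputFormat.setOutputPath(conf, new Path("/data/out"));

            JobClient.runJob(conf); // submits and waits for completion
        }
    }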

RE: Getting (or setting) a job ID

2011-05-10 Thread Aaron Baff
Once the job is submitted, grab the JobID from the returned object and print it out on STDOUT or to a file, and have your startup script(s) parse it out from there. --Aaron -Original Message- From: Adam Phelps [mailto:a...@opendns.com] Sent: Tuesday, May 10, 2011 3:45 PM
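
A minimal sketch of that idea with the old API: submit without blocking, then emit the ID on STDOUT for a wrapper script to capture. The JobConf is assumed to be configured elsewhere.

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class PrintJobId {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(); // configure as in your driver
            // submitJob() returns immediately, unlike JobClient.runJob().
            RunningJob job = new JobClient(conf).submitJob(conf);
            // A startup script can parse this line from stdout.
            System.out.println("JOB_ID=" + job.getID());
        }
    }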

NPE during RunningJob.getCounters()

2011-05-03 Thread Aaron Baff
Cross post from common-users. I'm using v0.21.0, with the Old API, and I have a daemon that runs and monitors MR Jobs, allows us to fetch data from the JobTracker about the MR Jobs, etc. We're using Thrift as the API (so we can do PHP->Java). We're having an issue where some requests for MR Jo
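
Not from the thread, but a defensive sketch for a monitoring daemon like the one described: guard the getCounters() result before dereferencing it, since it can come back null (for example once the JobTracker has retired the job).

    import java.io.IOException;
    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.RunningJob;

    public class CounterUtil {
        // Returns the named counter's value, or 0 when counters are
        // unavailable (getCounters() can return null for retired jobs).
        public static long counterOrZero(RunningJob job, String group,
                                         String name) throws IOException {
            Counters counters = job.getCounters();
            if (counters == null) {
                return 0L;
            }
            return counters.getGroup(group).getCounter(name);
        }
    }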