DistributedCache staleness
I have been having problems with changes to DistributedCache files on HDFS not being reflected in subsequently run jobs. I can change the filename to work around this, but I would prefer a way to invalidate the cache when necessary. Is there a way to lower the timeout or flush the cache? Cheers, Anthony
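For reference, a minimal sketch of the filename-changing workaround mentioned above: embed a version marker (here a timestamp, purely illustrative) in the HDFS path before registering it, so each job sees a distinct cache entry. The path and helper names are assumptions, not DistributedCache features.

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheVersioning {
  // Copy the local file to a freshly named HDFS path and register that path,
  // so stale copies from earlier jobs are never reused.
  public static void addVersionedCacheFile(JobConf conf, Path localFile) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path hdfsPath = new Path("/cache/" + localFile.getName() + "." + System.currentTimeMillis());
    fs.copyFromLocalFile(localFile, hdfsPath);
    DistributedCache.addCacheFile(new URI(hdfsPath.toString()), conf);
  }
}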
Anyone have a Lucene index InputFormat for Hadoop?
Anyone have a Lucene index InputFormat already implemented? Failing that, how about a Writable for the Lucene Document class? Cheers, Anthony
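For the Writable part of the question, here is a hypothetical sketch (not an existing Hadoop or Lucene class), assuming the Lucene 2.x API of the time and that only stored string fields need to survive the round trip; the indexing options used on deserialization are a guess, since the original analysis settings cannot be recovered from the serialized form.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DocumentWritable implements Writable {
  private Document doc = new Document();

  public Document get() { return doc; }
  public void set(Document doc) { this.doc = doc; }

  // Serialize each field as a (name, value) string pair.
  public void write(DataOutput out) throws IOException {
    java.util.List fields = doc.getFields();
    out.writeInt(fields.size());
    for (Object o : fields) {
      Field f = (Field) o;
      Text.writeString(out, f.name());
      Text.writeString(out, f.stringValue() == null ? "" : f.stringValue());
    }
  }

  // Rebuild the Document; Field.Store/Index choices here are placeholders.
  public void readFields(DataInput in) throws IOException {
    doc = new Document();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      String name = Text.readString(in);
      String value = Text.readString(in);
      doc.add(new Field(name, value, Field.Store.YES, Field.Index.UN_TOKENIZED));
    }
  }
}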
Re: Writer class for writing tab separated Text key, value pairs
If you are doing this in MapReduce, just set the output format to TextOutputFormat and collect Texts in your reducer. Otherwise, just open a file on HDFS and println(key + "\t" + value) for each tuple.

On Sun, Sep 28, 2008 at 9:50 PM, Palleti, Pallavi <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Can anyone please tell me which class I should use for writing data in
> plain text output format. I can use KeyValueLineRecordReader for reading
> a line, but there is no KeyValueLineRecordWriter.
>
> The only possibility that I found is to call getRecordWriter() of
> TextOutputFormat. But the problem there is that I need to implement the
> Progressable interface in order to use this. Is there a simpler way to do
> this?
>
> Thanks
> Pallavi
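A small sketch of both options using the old JobConf-based API; the output path and key/value types are placeholders.

import java.io.PrintWriter;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

public class TabSeparatedOutput {
  // Option 1: let the framework emit "key \t value" lines.
  public static void configure(JobConf conf) {
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
  }

  // Option 2: write the file yourself, outside of map/reduce.
  public static void writePair(JobConf conf, String key, String value) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    PrintWriter out = new PrintWriter(fs.create(new Path("/tmp/pairs.txt")));
    out.println(key + "\t" + value);
    out.close();
  }
}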
Outputting multiple key classes from the same job
Short of implementing a new output format, is there a way to output multiple key classes from a job to different sequence files based on the class of the key and value? Cheers, Anthony
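One possible workaround, sketched under the assumption that side files written directly from the reducer are acceptable: keep a SequenceFile.Writer per key class and route each record by its runtime class. The output directory layout and class name here are illustrative only.

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;

public class PerClassWriters {
  private final Map<Class<?>, SequenceFile.Writer> writers = new HashMap<Class<?>, SequenceFile.Writer>();
  private final FileSystem fs;
  private final Configuration conf;
  private final Path dir;

  public PerClassWriters(Configuration conf, Path dir) throws Exception {
    this.conf = conf;
    this.fs = FileSystem.get(conf);
    this.dir = dir;
  }

  // Lazily open one sequence file per key class, named after the class.
  public void write(Writable key, Writable value) throws Exception {
    SequenceFile.Writer w = writers.get(key.getClass());
    if (w == null) {
      w = SequenceFile.createWriter(fs, conf, new Path(dir, key.getClass().getSimpleName()),
                                    key.getClass(), value.getClass());
      writers.put(key.getClass(), w);
    }
    w.append(key, value);
  }
}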
Re: Stopping two reducer tasks on two machines from working on the same keys?
That's got to be it.

On Mon, Aug 11, 2008 at 9:55 PM, lohit <[EMAIL PROTECTED]> wrote:
>> redoing each other's work and stomping on each other's output files.
> I am assuming your tasks (reducers) are generating these files and these are
> not the output files like part-0.
>
> Looks like you have speculative execution turned on.
> Hadoop tries to execute parallel attempts of map/reduce tasks if it finds
> that one of them is falling behind. All those task attempts are appended with
> a number, as you can see with _0 and _1.
> If you have tasks which write to common files, then you hit this problem.
> There are two ways out of this:
> 1. turn off speculative execution by setting mapred.speculative.execution to
>    false
> 2. if you are generating outputs, use the taskID to make each attempt's
>    output unique.
>
>> I've attached the JSP output that indicates this; let me know if you
>> need any other details.
> No attachment.

I guess the listserv must have eaten it, as the one in my sent folder has it. It looks like this:

Task Attempt                   Machine                    Status   Progress  Start Time            Shuffle Finished                           Sort Finished                Finish Time
task_200808062237_0031_r_00_0  snark-0002.liveoffice.com  RUNNING  88.01%    11-Aug-2008 14:11:13  11-Aug-2008 16:21:00 (2hrs, 9mins, 47sec)  11-Aug-2008 16:21:00 (0sec)  -
task_200808062237_0031_r_00_1  snark-0005.liveoffice.com  RUNNING  88.01%    11-Aug-2008 16:21:03  11-Aug-2008 16:21:04 (0sec)                11-Aug-2008 16:21:04 (0sec)  -
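A minimal sketch of the two suggestions above, using the JobConf API of that era; the "mapred.task.id" lookup for building unique side-file names is an assumption about how one might apply suggestion 2, and the file name is a placeholder.

import org.apache.hadoop.mapred.JobConf;

public class SpeculationConfig {
  // Suggestion 1: turn speculative execution off for the job.
  public static void disableSpeculation(JobConf conf) {
    conf.setSpeculativeExecution(false);
  }

  // Suggestion 2: fold the task attempt id into side-file names so competing
  // attempts never collide ("mapred.task.id" is set by the framework per attempt).
  public static String uniqueSideFileName(JobConf conf, String base) {
    return base + "-" + conf.get("mapred.task.id");
  }
}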
Stopping two reducer tasks on two machines from working on the same keys?
I have a Hadoop 0.16.4 cluster that effectively has no HDFS. It's running a job analyzing data stored on a NAS-type system mounted on each tasktracker. Unfortunately, the reducers task_200808062237_0031_r_00_0 and task_200808062237_0031_r_00_1 are running simultaneously on the same keys, redoing each other's work and stomping on each other's output files. I've attached the JSP output that indicates this; let me know if you need any other details. Is this a configuration error, or is it a bug in Hadoop? Cheers, Anthony