DistributedCache staleness

2008-12-10 Thread Anthony Urso
I have been having problems with changes to DistributedCache files on
HDFS not being reflected in subsequently run jobs.  I can change the
filename to work around this, but I would prefer a way to invalidate
the cache when necessary.

Is there a way to lower the timeout or flush the Cache?
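
For concreteness, the filename workaround looks roughly like this (a
sketch; the paths and the timestamp versioning scheme are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FreshCacheFile {
        // Publish new cache data under a versioned name so a reused
        // path never serves a stale local copy.
        public static void addFreshCopy(Configuration conf, Path latest)
                throws Exception {
            FileSystem fs = FileSystem.get(conf);
            Path versioned = new Path(latest.getParent(),
                    latest.getName() + "." + System.currentTimeMillis());
            fs.rename(latest, versioned); // or FileUtil.copy to keep the original
            DistributedCache.addCacheFile(versioned.toUri(), conf);
        }
    }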

Cheers,
Anthony


Anyone have a Lucene index InputFormat for Hadoop?

2008-11-11 Thread Anthony Urso
Anyone have a Lucene index InputFormat already implemented?  Failing
that, how about a Writable for the Lucene Document class?
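
Failing a ready-made one, a minimal Writable for Document might look
like the sketch below (written against the Lucene 2.x API; it
round-trips only stored, string-valued fields and drops everything
else, including binary values and index/boost settings):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.io.Writable;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DocumentWritable implements Writable {
        private Document doc = new Document();

        public Document get() { return doc; }
        public void set(Document doc) { this.doc = doc; }

        public void write(DataOutput out) throws IOException {
            List fields = doc.getFields();
            // Count the fields we can round-trip (stored string values).
            int n = 0;
            for (Object o : fields) {
                if (((Field) o).stringValue() != null) n++;
            }
            out.writeInt(n);
            for (Object o : fields) {
                Field f = (Field) o;
                if (f.stringValue() == null) continue; // e.g. binary fields
                out.writeUTF(f.name());
                out.writeUTF(f.stringValue());
            }
        }

        public void readFields(DataInput in) throws IOException {
            doc = new Document();
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                String name = in.readUTF();
                String value = in.readUTF();
                // Re-add as a stored, untokenized field; adjust to taste.
                doc.add(new Field(name, value, Field.Store.YES,
                        Field.Index.UN_TOKENIZED));
            }
        }
    }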

Cheers,
Anthony


Re: Writer class for writing tab separated Text key, value pairs

2008-09-29 Thread Anthony Urso
If you are doing this in MapReduce, just set the output format to
TextOutputFormat and collect Texts in your reducer.

Otherwise, just open a file on HDFS and println(key + "\t" +
value) for each tuple.
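
Both routes in sketch form (paths and class names are illustrative):

    import java.io.PrintWriter;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class TabSeparatedOutput {
        // Route 1: TextOutputFormat already writes key<TAB>value lines,
        // so the job just needs to declare it.
        public static void configure(JobConf job) {
            job.setOutputFormat(TextOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
        }

        // Route 2: outside MapReduce, open an HDFS file directly and
        // print one key<TAB>value line per tuple.
        public static void writeDirectly(Configuration conf) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            PrintWriter out =
                    new PrintWriter(fs.create(new Path("/user/pallavi/out.txt")));
            out.println("some-key" + "\t" + "some-value");
            out.close();
        }
    }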

On Sun, Sep 28, 2008 at 9:50 PM, Palleti, Pallavi
<[EMAIL PROTECTED]> wrote:
> Hi,
>
>  Can anyone please tell me which class I should use for writing data
> in plain text output format. I can use KeyValueLineRecordReader for
> reading a line. But there is no KeyValueLineRecordWriter.
>
> The only possibility that I found is to call getRecordWriter() of
> TextOutputFormat. But, the problem here is, I need to implement
> Progressable interface in order to use this. Is there a simple way to do
> this?
>
>
>
> Thanks
>
> Pallavi
>
>


Outputting multiple key classes from the same job

2008-09-29 Thread Anthony Urso
Short of implementing a new output format, is there a way to output
multiple key classes from a job to different sequence files based on
the class of the key and value?
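
One workaround short of a fully custom format, sketched below on the
assumption that your version ships org.apache.hadoop.io.GenericWritable
and org.apache.hadoop.mapred.lib.MultipleSequenceFileOutputFormat: wrap
the keys in a GenericWritable subclass so the job has one declared
output key class, then route records to per-class files. The wrapped
key types (Text, LongWritable) are illustrative, and note the files
still store the wrapper as their key class:

    import org.apache.hadoop.io.GenericWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.lib.MultipleSequenceFileOutputFormat;

    // One declared key class for the whole job.
    class KeyWrapper extends GenericWritable implements WritableComparable {
        private static Class[] TYPES = { Text.class, LongWritable.class };
        protected Class[] getTypes() { return TYPES; }
        public int compareTo(Object o) {
            // Order by wrapped class, then string form; just enough to
            // satisfy the WritableComparable bound on output keys.
            KeyWrapper other = (KeyWrapper) o;
            int c = get().getClass().getName()
                    .compareTo(other.get().getClass().getName());
            return c != 0 ? c
                    : get().toString().compareTo(other.get().toString());
        }
    }

    // Routes each record to a sequence file named after the wrapped
    // key's class, e.g. Text-part-00000 and LongWritable-part-00000.
    class ByKeyClassOutputFormat
            extends MultipleSequenceFileOutputFormat<KeyWrapper, Writable> {
        protected String generateFileNameForKeyValue(KeyWrapper key,
                Writable value, String name) {
            return key.get().getClass().getSimpleName() + "-" + name;
        }
    }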

Cheers,
Anthony


Re: Stopping two reducer tasks on two machines from working on the same keys?

2008-08-11 Thread Anthony Urso
That's got to be it.

On Mon, Aug 11, 2008 at 9:55 PM, lohit <[EMAIL PROTECTED]> wrote:
>>redoing each other's work and stomping on each other's output files.
> I am assuming your tasks (reducers) are generating these files
> themselves, and that these are not the framework's output files like
> part-0.
>
> It looks like you have speculative execution turned on.
> Hadoop launches parallel attempts of a map/reduce task if it finds
> that one of them is falling behind. Each attempt's ID is suffixed
> with an attempt number, which is the _0 and _1 you are seeing.
> If your tasks write to shared file names, you hit this problem.
> There are two ways out of it:
> 1. turn off speculative execution by setting
> mapred.speculative.execution to false
> 2. if you generate output files yourself, include the task ID in each
> file name so that every attempt writes to a unique file.
>
>>I've attached the JSP output that indicates this; let me know if you
>>need any other details.
> No attachment.

I guess the listserv must have eaten it, as the one in my sent folder
has it.  It looks like this:

Task Attempt: task_200808062237_0031_r_00_0
  Machine:           snark-0002.liveoffice.com
  Status:            RUNNING
  Progress:          88.01%
  Start Time:        11-Aug-2008 14:11:13
  Shuffle Finished:  11-Aug-2008 16:21:00 (2hrs, 9mins, 47sec)
  Sort Finished:     11-Aug-2008 16:21:00 (0sec)
  Finish Time:       -

Task Attempt: task_200808062237_0031_r_00_1
  Machine:           snark-0005.liveoffice.com
  Status:            RUNNING
  Progress:          88.01%
  Start Time:        11-Aug-2008 16:21:03
  Shuffle Finished:  11-Aug-2008 16:21:04 (0sec)
  Sort Finished:     11-Aug-2008 16:21:04 (0sec)
  Finish Time:       -
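
In code, lohit's two remedies look roughly like this (a sketch; the
phase-split property names belong to later releases, and the side-file
path is illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class SpeculationRemedies {
        // Remedy 1: turn speculative execution off for the job.
        public static void disableSpeculation(JobConf job) {
            job.setBoolean("mapred.speculative.execution", false);
            // Later releases split the setting by phase:
            // job.setBoolean("mapred.map.tasks.speculative.execution", false);
            // job.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        }

        // Remedy 2: name side files after the task attempt ID so that
        // parallel attempts never collide on the shared storage.
        public static Path sideFileFor(JobConf job) {
            String attemptId = job.get("mapred.task.id"); // set by the framework
            return new Path("/mnt/nas/output/side-" + attemptId);
        }
    }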




Stopping two reducer tasks on two machines from working on the same keys?

2008-08-11 Thread Anthony Urso
I have a Hadoop 0.16.4 cluster that effectively has no HDFS.  It's
running a job analyzing data stored on a NAS-type system mounted on
each tasktracker.

Unfortunately, the reducers task_200808062237_0031_r_00_0 and
task_200808062237_0031_r_00_1 are running simultaneously on the
same keys, redoing each other's work and stomping on each other's
output files.

I've attached the JSP output that indicates this; let me know if you
need any other details.

Is this a configuration error, or is it a bug in Hadoop?

Cheers,
Anthony