Oh dear, I feel such a fool. However, in the spirit of knowledge-sharing I 
thought I’d pass back my results (I hate it
when I find a thread where somebody has exactly the same problem I’m having and 
they then just close it by saying
they’ve fixed it, without saying *how*).

It seems that my problems were down to threading issues in my mappers, pretty 
much as I’d surmised. I’d confused
myself by thinking that it was down to some devious subtlety of Hadoop, when in 
fact it was just good old-fashioned
threading and non-thread-safe classes. I fixed it by creating a new instance of
the offending class for each mapper. I have an outstanding problem wrt reducers,
but I think I can sort that myself, based on all of the Hadoop research I’ve done
in the past few weeks :-) Clouds and silver linings, eh?
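
For the record, the shape of the fix looked roughly like the sketch below. (The
class names - SolverMapper, ThirdPartySolver - are just placeholders for my real
ones, and the key/value types are only an example; the point is that the
non-thread-safe object is created per mapper instance in configure() rather than
being shared.)

    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapred.*;

    public class SolverMapper extends MapReduceBase
            implements Mapper<LongWritable, BytesWritable, LongWritable, DoubleWritable> {

        // Previously a shared/static instance of a non-thread-safe library class;
        // now each mapper gets its own copy.
        private ThirdPartySolver solver;   // placeholder for the real library class

        @Override
        public void configure(JobConf conf) {
            solver = new ThirdPartySolver();   // fresh instance per mapper
        }

        public void map(LongWritable key, BytesWritable value,
                        OutputCollector<LongWritable, DoubleWritable> output,
                        Reporter reporter) throws IOException {
            output.collect(key, new DoubleWritable(solver.solve(value.getBytes())));
        }
    }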

Thanks to everyone who helped me on this though - hopefully one day I’ll be 
able to return the favour.

Cheers,

        Andy D

On 15 Nov 2011, at 18:15, Mathijs Homminga wrote:

> (see below)
> 
> Mathijs Homminga
> 
> On Nov 15, 2011, at 18:51, Andy Doddington <a...@doddington.net> wrote:
> 
>> Unfortunately, changing the data to LongWritable and Text would change the 
>> nature of the problem to such an extent that
>> any results would be meaningless :-(
> 
> Yes, I understand. Is it possible for you to post some code? The job setup 
> lines for example, and some configuration?
> 
>> I did try changing everything to use SequenceFileAsBinary, but apart from causing
>> more aggravation in having to convert back and forth between BytesWritable and my
>> own key/value classes, it made no difference to my problem - i.e. it still failed
>> in exactly the same way as before.
>> Incidentally, I’m assuming your second paragraph should read “...you should 
>> *not* worry about splits…”.
> 
> Yes, sorry about that.
> 
>> I’m coming to the conclusion that the problem must be due to multi-threading 
>> issues in some of the third-party libraries that
>> I am using, so my next plan of attack is to look at what threading options I 
>> can configure in Hadoop. As always, any pointers
>> in this direction would be really appreciated :-)
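>> 
>> (For my own notes, and in case anyone lands here later: as far as I can tell, the
>> only place where Hadoop itself drives a single mapper instance from several
>> threads is the multithreaded map runner, so the checks below should rule that in
>> or out. Untested; "MyJob" is just a placeholder for my driver class, and I believe
>> the thread-count property is the one read by MultithreadedMapRunner.)
>> 
>>     JobConf conf = new JobConf(MyJob.class);   // MyJob: placeholder driver class
>>     // Default runner: a single thread calling one Mapper instance per task.
>>     conf.setMapRunnerClass(org.apache.hadoop.mapred.MapRunner.class);
>>     // If MultithreadedMapRunner is in use, one shared Mapper instance is called
>>     // concurrently and must be thread-safe; forcing one thread is a quick test:
>>     conf.setInt("mapred.map.multithreadedrunner.threads", 1);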
> 
> Any insights or luck when you try a different number of mappers or input 
> files? With just one? 
> Just playing trial-and-error here, let me know if you've done this all 
> before..
> 
>> Cheers,
>> 
>>   Andy D
>> 
>> ———————————————
>> 
>> On 15 Nov 2011, at 14:07, Mathijs Homminga wrote:
>> 
>>> Can you reproduce this behavior with simpler SequenceFiles, which contain,
>>> for example, <LongWritable, Text> pairs?
>>> (I know, you have to adjust your mapper and reducer).
>>> 
>>> In general: when your InputFormat (and RecordReader) are properly 
>>> configured/written, you should worry about splits in the middle of a binary 
>>> object.
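>>> 
>>> Something along these lines (untested; the path and record count are just
>>> examples) would do for generating such a test file:
>>> 
>>>     // imports: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.{FileSystem, Path},
>>>     //          org.apache.hadoop.io.{SequenceFile, LongWritable, Text}
>>>     Configuration conf = new Configuration();
>>>     FileSystem fs = FileSystem.get(conf);
>>>     Path path = new Path("/tmp/test-input/part0");   // example path
>>>     SequenceFile.Writer writer = SequenceFile.createWriter(
>>>             fs, conf, path, LongWritable.class, Text.class);
>>>     try {
>>>         for (long i = 0; i < 1000; i++) {
>>>             writer.append(new LongWritable(i), new Text("record-" + i));
>>>         }
>>>     } finally {
>>>         writer.close();
>>>     }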
>>> 
>>> Mathijs
>>> 
>>> 
>>> On Nov 15, 2011, at 14:35 , Andy Doddington wrote:
>>> 
>>>> Sigh… still no success and I’m tearing my hair out :-(
>>>> 
>>>> One thought I’ve had is whether I’d be advised to use the 
>>>> SequenceFileAsBinary input and output classes? I’m not entirely
>>>> clear on how these differ from the ‘normal’ SequenceFile classes, since 
>>>> these already claim to be able to support binary data,
>>>> but at the moment I’m clutching at straws.
>>>> 
>>>> I did try changing to these but then got loads of exceptions claiming that 
>>>> I was trying to cast LongWritable (my key class)
>>>> and my value class to BytesWritable, which is what the 
>>>> SequenceFileAsBinary classes use, I believe.
>>>> 
>>>> If possible, could somebody indicate whether this change is worthwhile 
>>>> and, if so, how I can migrate my code to use BytesWritable
>>>> instead of LongWritable etc?
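>>>> 
>>>> (My current understanding - which may well be wrong - is that the
>>>> SequenceFileAsBinary classes hand each record to the mapper as raw
>>>> BytesWritable key/value pairs, so the mapper itself would have to rebuild the
>>>> original Writables from those bytes, something like:)
>>>> 
>>>>     // mapper declared as Mapper<BytesWritable, BytesWritable, ...>
>>>>     LongWritable key = new LongWritable();
>>>>     DataInputBuffer buf = new DataInputBuffer();          // org.apache.hadoop.io.DataInputBuffer
>>>>     buf.reset(rawKey.getBytes(), rawKey.getLength());     // rawKey: the BytesWritable passed to map()
>>>>     key.readFields(buf);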
>>>> 
>>>> Thanks in anticipation,
>>>> 
>>>>   Andy Doddington
>>>> 
>>>> On 10 Nov 2011, at 16:33, Andy Doddington wrote:
>>>> 
>>>>> Thanks for your kind words - it still feels like pulling teeth at times 
>>>>> :-(
>>>>> 
>>>>> Following on from your comments, here are a few more questions - hope you 
>>>>> don’t find them too dumb…
>>>>> 
>>>>> 1) How does each mapper ‘know’ which file name to associate itself with?
>>>>> 2) Is it important that I name my files part<n> or will any unique name 
>>>>> suffice?
>>>>> 3) I’m using binary serialisation with Sequence files - are these ‘split’ 
>>>>> across multiple mappers? What happens if the split occurs in the middle 
>>>>> of a binary object?
>>>>> 
>>>>> Current state of play is that the mappers are being called the correct
>>>>> number of times and are generating the correct result for roughly the first
>>>>> half of the mappers (e.g. ~50 out of 100 mappers, running a small test), but
>>>>> are then generating bad results after that. The reducer is then correctly
>>>>> selecting the minimum - it just happens to be a bad value due to the mapper
>>>>> problem. Ho hum…
>>>>> 
>>>>> Regards,
>>>>> 
>>>>>   Andy D
>>>>> 
>>>>> ——————————————
>>>>> 
>>>>> On 10 Nov 2011, at 15:17, Harsh J wrote:
>>>>> 
>>>>>> Hey Andy,
>>>>>> 
>>>>>> You seem to be making good progress already! Some comments inline.
>>>>>> 
>>>>>> On 10-Nov-2011, at 7:28 PM, Andy Doddington wrote:
>>>>>> 
>>>>>>> Unfortunately my employer blocks any attempt to transfer data outside 
>>>>>>> of the company - I realise this makes me look pretty
>>>>>>> foolish/uncooperative, but I hope you understand there’s little I can 
>>>>>>> do about it :-(
>>>>>>> 
>>>>>>> On a more positive note, I've found a few issues which have moved me 
>>>>>>> forward a bit:
>>>>>>> 
>>>>>>> 1) I first noticed that the PiEstimator used files named part<n> to
>>>>>>> transfer data to each of the Mappers - I had changed this name to be
>>>>>>> something more meaningful to my app. I am aware that Hadoop uses some
>>>>>>> similarly named files of its own, and hoped that this might be the cause.
>>>>>>> Sadly, this fix made no difference.
>>>>>>> 
>>>>>>> 2) While looking at this area of the code, I realised that although I was
>>>>>>> writing data to these files, I was failing to close them! This fix did
>>>>>>> make a difference, in that the mappers now actually appear to be getting
>>>>>>> called. However, the final result from the reduce was still incorrect.
>>>>>>> What seemed to be happening (based on the mapper logs) was that the
>>>>>>> reducer was getting called once for each mapper - which is not exactly
>>>>>>> optimal in my case.
>>>>>>> 
>>>>>>> 3) I therefore removed the jobConf call which I had made to set my reducer
>>>>>>> to also be the combiner - and suddenly the results started looking a lot
>>>>>>> healthier, although they are still not 100% correct. I had naively assumed
>>>>>>> that the minimum of a set of minimums of a series of subsets of the data
>>>>>>> would be the same as the minimum of the entire set, but I’ve clearly
>>>>>>> misunderstood how combiners work. Will investigate the doc’n on this a bit
>>>>>>> more. Maybe some subtle interaction wrt combiners and partitioners?
>>>>>> 
>>>>>> Combiners would work on sorted map outputs. That is, after they are 
>>>>>> already partitioned out.
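>>>>>> 
>>>>>> For a minimum, the combiner can usually be the same class as the reducer, but
>>>>>> only if it emits exactly the same key/value types as the map output (the
>>>>>> combiner's output is fed back into the shuffle). A rough sketch with assumed
>>>>>> types (LongWritable key, DoubleWritable value), old mapred API:
>>>>>> 
>>>>>>     // imports: java.io.IOException, java.util.Iterator,
>>>>>>     //          org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
>>>>>>     public static class MinReducer extends MapReduceBase
>>>>>>             implements Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
>>>>>>         public void reduce(LongWritable key, Iterator<DoubleWritable> values,
>>>>>>                            OutputCollector<LongWritable, DoubleWritable> out,
>>>>>>                            Reporter reporter) throws IOException {
>>>>>>             double min = Double.MAX_VALUE;
>>>>>>             while (values.hasNext()) {
>>>>>>                 min = Math.min(min, values.next().get());
>>>>>>             }
>>>>>>             out.collect(key, new DoubleWritable(min));
>>>>>>         }
>>>>>>     }
>>>>>> 
>>>>>>     conf.setCombinerClass(MinReducer.class);   // optional; safe because min is associative
>>>>>>     conf.setReducerClass(MinReducer.class);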
>>>>>> 
>>>>>>> I’m still confused as to how the mappers get passed the data that I put 
>>>>>>> into the part<n> files, but I *think* I’m now heading in the right 
>>>>>>> direction. If you can see the cause of my problems (despite lack of log 
>>>>>>> output)  then I’d be more than happy to hear from you :-)
>>>>>> 
>>>>>> Describing it naively: one file would go to one mapper. Each mapper 
>>>>>> invocation (map IDs -- 0,1,2,…) would have a file name to associate 
>>>>>> itself with, and it would begin reading that file off the DFS using a 
>>>>>> record-reader and begin calling map() on each record read (lines, in the 
>>>>>> most common case).
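>>>>>> 
>>>>>> As a sketch (names and types are only illustrative; fs, conf and numMaps come
>>>>>> from the driver), a PiEstimator-style job writes one small SequenceFile per
>>>>>> map task and points the input format at that directory, so each file becomes
>>>>>> one split and hence one mapper:
>>>>>> 
>>>>>>     Path inDir = new Path("myjob/in");                 // illustrative path
>>>>>>     for (int i = 0; i < numMaps; i++) {
>>>>>>         Path file = new Path(inDir, "part" + i);       // one input file per map task
>>>>>>         SequenceFile.Writer writer = SequenceFile.createWriter(
>>>>>>                 fs, conf, file, LongWritable.class, BytesWritable.class);
>>>>>>         try {
>>>>>>             // dataForMap(i) is a stand-in for whatever bytes this mapper needs
>>>>>>             writer.append(new LongWritable(i), new BytesWritable(dataForMap(i)));
>>>>>>         } finally {
>>>>>>             writer.close();
>>>>>>         }
>>>>>>     }
>>>>>>     conf.setInputFormat(SequenceFileInputFormat.class);
>>>>>>     FileInputFormat.setInputPaths(conf, inDir);
>>>>>>     // each file is far smaller than an HDFS block, so it maps to exactly one split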
>>>>>> 
>>>>>> To add to the complexity: HDFS stores files as blocks, and 
>>>>>> hence you may have multiple mappers for a single file, working on 
>>>>>> different offsets (0-mid, mid-len for a simple 2-block split, say). This 
>>>>>> is configurable though -- you can choose not to have data input splits 
>>>>>> at the expense of losing some data locality.
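>>>>>> 
>>>>>> (If you ever do need to keep whole files together, one way - there may be
>>>>>> neater ones - is an input format that refuses to split, e.g. with the old
>>>>>> mapred API; the key/value type parameters here are just examples:)
>>>>>> 
>>>>>>     // imports: org.apache.hadoop.fs.{FileSystem, Path}, org.apache.hadoop.io.*,
>>>>>>     //          org.apache.hadoop.mapred.SequenceFileInputFormat
>>>>>>     public class WholeFileSequenceInputFormat
>>>>>>             extends SequenceFileInputFormat<LongWritable, BytesWritable> {
>>>>>>         @Override
>>>>>>         protected boolean isSplitable(FileSystem fs, Path filename) {
>>>>>>             return false;   // one mapper per file, however many blocks it spans
>>>>>>         }
>>>>>>     }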
>>>>>> 
>>>>>> Is this the explanation you were looking for?
>>>>>> 
>>>>>> And yes, looks like you're headed in the right direction already :)
>>>>>> 
>>>>>>> Regards,
>>>>>>> 
>>>>>>>   Andy D
>>>>>>> On 10 Nov 2011, at 11:52, Harsh J wrote:
>>>>>>> 
>>>>>>>> Hey Andy,
>>>>>>>> 
>>>>>>>> Can you pastebin the whole runlog of your job after you invoke it via 
>>>>>>>> 'hadoop jar'/etc.?
>>>>>>>> 
>>>>>>>> On 10-Nov-2011, at 4:25 PM, Andy Doddington wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I have written a fairly straightforward Hadoop program, modelled 
>>>>>>>>> after the PiEstimator example which is shipped with the distro.
>>>>>>>>> 
>>>>>>>>> 1) I write a series of files to HDFS, each containing the input for a 
>>>>>>>>> single map task. This amounts to around 20Mb per task.
>>>>>>>>> 2) Each of my map tasks reads the input and generates a pair of 
>>>>>>>>> floating point values.
>>>>>>>>> 3) My reduce task scans the list of floating point values produced by 
>>>>>>>>> the maps and returns the minimum.
>>>>>>>>> 
>>>>>>>>> Unfortunately, this is not working, but is exhibiting the following 
>>>>>>>>> symptoms:
>>>>>>>>> 
>>>>>>>>> 1) Based on log output, I have no evidence that the mappers are actually
>>>>>>>>> being called, although the ‘percentage complete’ output seems to go down
>>>>>>>>> slowly as might be expected if they were being called.
>>>>>>>>> 2) I only ever get a single part-00000 file created, regardless of how
>>>>>>>>> many maps I specify.
>>>>>>>>> 3) In the case of my reducer, although its constructor, ‘setConf’ and
>>>>>>>>> ‘close’ methods are called (based on log output), its reduce method
>>>>>>>>> never gets called.
>>>>>>>>> 
>>>>>>>>> I have checked the visibility of all classes and confirmed that the
>>>>>>>>> method signatures are correct (as confirmed by Eclipse and use of the
>>>>>>>>> @Override annotation), and I’m at my wits’ end. To further add to my
>>>>>>>>> suffering, the log outputs do not show any errors :-(
>>>>>>>>> 
>>>>>>>>> I am using the Cloudera CDH3u1 distribution.
>>>>>>>>> 
>>>>>>>>> As a final query, could somebody explain how it is that the multiple 
>>>>>>>>> files I create get associated with the various map tasks? This part 
>>>>>>>>> is a mystery to me (and might even be the underlying source of my 
>>>>>>>>> problems).
>>>>>>>>> 
>>>>>>>>> Thanks in anticipation,
>>>>>>>>> 
>>>>>>>>>   Andy Doddington
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
