I suppose I should have been clearer. There's no problem out of box if people stick to the libraries we offer :)
Yes the LRW was marked synchronized at some point over 8 years ago [1] in support for multi-threaded maps, but the framework has changed much since then. The MultithreadedMapper/etc. API we offer now automatically shields the devs away from having to think of output thread safety [2]. I can imagine there can only be a problem if a user writes their own unsafe multi threaded task. I suppose we could document that in the Mapper/MapRunner and Reducer APIs. [1] - http://svn.apache.org/viewvc?view=revision&revision=171186 - Commit added a synchronized to the write call. [2] - MultiThreadedMapper/etc. synchronize over the collector - http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/map/MultithreadedMapper.java?view=markup On Thu, Aug 8, 2013 at 7:52 PM, Azuryy Yu <azury...@gmail.com> wrote: > sequence writer is also synchronized, I dont think this is bad. > > if you call HDFS api to write concurrently, then its necessary. > > On Aug 8, 2013 7:53 PM, "Jay Vyas" <jayunit...@gmail.com> wrote: >> >> Then is this a bug? Synchronization in absence of any race condition is >> normally considered "bad". >> >> In any case id like to know why this writer is synchronized whereas the >> other one are not.. That is, I think, then point at issue: either other >> writers should be synchronized or else this one shouldn't be - consistency >> across the write implementations is probably desirable so that changes to >> output formats or record writers don't lead to bugs in multithreaded >> environments . >> >> On Aug 8, 2013, at 6:50 AM, Harsh J <ha...@cloudera.com> wrote: >> >> While we don't fork by default, we do provide a MultithreadedMapper >> implementation that would require such synchronization. But if you are >> asking is it necessary, then perhaps the answer is no. >> >> On Aug 8, 2013 3:43 PM, "Azuryy Yu" <azury...@gmail.com> wrote: >>> >>> its not hadoop forked threads, we may create a line record writer, then >>> call this writer concurrently. >>> >>> On Aug 8, 2013 4:00 PM, "Sathwik B P" <sathwik...@gmail.com> wrote: >>>> >>>> Hi, >>>> Thanks for your reply. >>>> May I know where does hadoop fork multiple threads to use a single >>>> RecordWriter. >>>> >>>> regards, >>>> sathwik >>>> >>>> On Thu, Aug 8, 2013 at 7:06 AM, Azuryy Yu <azury...@gmail.com> wrote: >>>>> >>>>> because we may use multi-threads to write a single file. >>>>> >>>>> On Aug 8, 2013 2:54 PM, "Sathwik B P" <sath...@apache.org> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> LineRecordWriter.write(..) is synchronized. I did not find any other >>>>>> RecordWriter implementations define the write as synchronized. >>>>>> Any specific reason for this. >>>>>> >>>>>> regards, >>>>>> sathwik >>>> >>>> > -- Harsh J