Hi Saurabh,
This is an interesting topic.

>> So, here is the question: is it possible to make sure that when one of
the mapper tasks is writing to a file, the others wait until the first
one is finished? I read that mapper tasks don't interact with each other

A simple way to do this is to use the HDFS namespace itself: create the
file with "public FSDataOutputStream create(Path f, boolean overwrite)"
and pass overwrite=false. The create is atomic on the NameNode, so only
one mapper can successfully create the file; the others get an exception
and know they lost the race. (That is also what your stack trace shows:
HDFS enforces a single writer per file, so the second client's append is
rejected while the first client still holds the lease.)

After the write completes, the winning mapper creates a flag file, e.g.
"completed", in the same folder. The other mappers can poll until the
"completed" file exists before they read.
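
Roughly (continuing the sketch above; the flag name and sleep interval
are arbitrary):

// Losing tasks wait until the winner signals completion. Be careful
// with long waits inside a mapper: a task that neither finishes nor
// reports progress can be killed by the framework (mapreduce.task.timeout).
Path flag = new Path("/user/cloudera/lob/master/completed");
while (!fs.exists(flag)) {
  try {
    Thread.sleep(1000L);
  } catch (InterruptedException ie) {
    Thread.currentThread().interrupt();
    break;
  }
}
// The master properties file is now complete and safe to read.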

>> Is there any way to have synchronization between two independent map
reduce jobs?

I think ZooKeeper can handle the more complex synchronization here, e.g.
a distributed mutex, leader election, etc.
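
For example, with Apache Curator (a ZooKeeper client library) an
inter-process mutex looks roughly like this; the connect string and lock
path are placeholders:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class PropertiesFileLock {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();
    // Every job that agrees on this znode path shares the same mutex.
    InterProcessMutex lock = new InterProcessMutex(client, "/locks/bank-properties");
    lock.acquire();  // blocks until this client owns the lock
    try {
      // ... read/update the shared properties file ...
    } finally {
      lock.release();
    }
    client.close();
  }
}

Curator also ships recipes for leader election (LeaderLatch /
LeaderSelector) if you would rather have a single coordinator.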

Hope this helps,

Wangda Tan

On Tue, Aug 12, 2014 at 10:43 AM, saurabh jain <sauravma...@gmail.com>
wrote:

> Hi Folks,
>
> I have been writing a map-reduce application that takes an input file
> of records, where every field in a record is separated by a delimiter.
>
> In addition to this, the user provides a list of columns to look up in
> a master properties file (stored in HDFS). If a column value (let's
> call it a key) is present in the master properties file, the code gets
> the corresponding value and replaces the key with that value in the
> record. If the key is not present in the master properties file, the
> code creates a new value for this key, writes it to the properties
> file, and also updates the record.
>
> I have written and tested this application, and everything worked fine
> until now.
>
> *e.g.:* *I/P Record:* This | is | the | test | record
>
> *Columns:* 2,4 (that means the code will look up only the fields *"is"*
> and *"test"* in the master properties file.)
>
> Here, I have a question.
>
> *Q 1:* When my input file is huge and gets split across multiple
> mappers, I was getting the exception below, where all the other mapper
> tasks fail. *Also, initially, when I started the job, my master
> properties file was empty.* My code checks whether this (master
> properties) file exists and creates a new empty file before submitting
> the job itself.
>
> e.g.: If I have 4 splits of data, then 3 map tasks fail. But after
> this, all the failed map tasks restart, and finally the job becomes
> successful.
>
> So, *here is the question: is it possible to make sure that when one of
> the mapper tasks is writing to a file, the others wait until the first
> one is finished?* I read that mapper tasks don't interact with each
> other.
>
> Also, what will happen in the scenario where I start multiple parallel
> map-reduce jobs and all of them work on the same properties file? *Is
> there any way to have synchronization between two independent map-reduce
> jobs?*
>
> I have also read that ZooKeeper can be used in such scenarios. Is that
> correct?
>
>
> Error: com.techidiocy.hadoop.filesystem.api.exceptions.HDFSFileSystemException:
> IOException - failed while appending data to the file -> Failed to create file
> [/user/cloudera/lob/master/bank.properties] for
> [DFSClient_attempt_1407778869492_0032_m_000002_0_1618418105_1] on client
> [10.X.X.17], because this file is already being created by
> [DFSClient_attempt_1407778869492_0032_m_000005_0_-949968337_1] on [10.X.X.17]
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2548)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2377)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2612)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2575)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:522)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
>
>
