Hi folks, I have been writing a MapReduce application whose input file contains records, with every field in a record separated by a delimiter.
In addition, the user provides a list of columns to look up in a master properties file (stored in HDFS). If a column's value (let's call it a key) is present in the master properties file, the code fetches the corresponding value and replaces the key with it in the record; if the key is not present, the code creates a new value for that key, writes the pair to the properties file, and also updates the record. I have written and tested this application, and everything has worked fine so far.

*e.g.:*
*I/P Record:* This | is | the | test | record
*Columns:* 2,4 (that means the code will look up only the fields *"is"* and *"test"* in the master properties file.)

Here I have a question.

*Q1:* When my input file is huge and is split across multiple mappers, I get the exception below and all the other mapper tasks fail. *Note that the master properties file is empty when I first start the job.* In my code I check whether this file (master properties) exists and, if it doesn't, create a new empty file before submitting the job itself. E.g., if I have 4 splits of data, then 3 map tasks fail. However, all the failed map tasks are then retried, and the job eventually succeeds.

So, *here is the question: is it possible to make sure that while one mapper task is writing to a file, the others wait until the first one is finished?* I have read that mapper tasks do not interact with each other. Also, what happens in the scenario where I start multiple MapReduce jobs in parallel and all of them work on the same properties file? *Is there any way to synchronize two independent MapReduce jobs?* I have also read that ZooKeeper can be used in such scenarios; is that correct?
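To make the lookup concrete, here is a minimal, self-contained sketch of the per-record logic described above. The class name `ColumnLookup` and the `VAL_` prefix for newly generated values are my own assumptions, and the HDFS read/append plumbing is omitted; only the in-memory `Properties` handling is shown:

```java
import java.util.Properties;

public class ColumnLookup {

    // Hypothetical sketch: fields are "|"-delimited; for each requested
    // column (1-based), replace the field with its value from the master
    // properties, generating and storing a new value for unseen keys.
    public static String updateRecord(String record, int[] columns, Properties master) {
        String[] fields = record.split("\\|", -1);
        for (int col : columns) {
            String key = fields[col - 1].trim();
            String value = master.getProperty(key);
            if (value == null) {
                // Key missing: invent a new value (naming scheme assumed)
                // and remember it so later records see the same mapping.
                value = "VAL_" + master.size();
                master.setProperty(key, value);
            }
            fields[col - 1] = value;
        }
        return String.join("|", fields);
    }

    public static void main(String[] args) {
        Properties master = new Properties();
        master.setProperty("is", "IS_CODE");
        // Column 2 ("is") is found; column 4 ("test") is new.
        System.out.println(updateRecord("This|is|the|test|record", new int[]{2, 4}, master));
    }
}
```

In the real job, the updated `Properties` object would additionally have to be persisted back to the HDFS file, which is exactly the step that collides across mappers.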
Error: com.techidiocy.hadoop.filesystem.api.exceptions.HDFSFileSystemException: IOException - failed while appending data to the file -> Failed to create file [/user/cloudera/lob/master/bank.properties] for [DFSClient_attempt_1407778869492_0032_m_000002_0_1618418105_1] on client [10.X.X.17], because this file is already being created by [DFSClient_attempt_1407778869492_0032_m_000005_0_-949968337_1] on [10.X.X.17]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2548)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2377)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2612)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2575)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:522)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
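For context on the trace: HDFS enforces single-writer semantics, with the NameNode granting a write lease on a file to one client at a time, so concurrent mappers creating/appending to the same file are rejected rather than queued. The failure mode can be reproduced locally with java.nio's exclusive-create option; this is only an analogy on the local filesystem, not HDFS, and the class and file names are hypothetical:

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SingleWriterDemo {

    // Returns true when the second exclusive create is rejected, mirroring
    // the "already being created by" error from the HDFS NameNode.
    public static boolean demo() throws IOException {
        Path props = Files.createTempDirectory("demo").resolve("bank.properties");

        // First "mapper" creates the file exclusively -- succeeds.
        Files.write(props, "key=value\n".getBytes(), StandardOpenOption.CREATE_NEW);
        try {
            // Second "mapper" attempts the same exclusive create -- rejected
            // immediately, analogous to HDFS's single-writer lease.
            Files.write(props, "key2=value2\n".getBytes(), StandardOpenOption.CREATE_NEW);
            return false;
        } catch (FileAlreadyExistsException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("second writer rejected: " + demo());
    }
}
```

Because HDFS offers no built-in blocking or wait semantics for this case, making writers wait on each other, within one job or across independent jobs, requires an external coordinator; a ZooKeeper-based distributed lock (for example via Apache Curator) is indeed a common approach here.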