Hi Saurabh,

>> am not sure making overwrite=false , will solve the problem. As per java
>> doc by making overwrite=false , it will throw an exception if the file
>> already exists. So, for all the remaining mappers it will throw an
>> exception.

You can catch the exception and wait.
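[Editor's note: the catch-and-wait pattern suggested above can be sketched as follows. HDFS isn't available in a snippet, so `java.nio.Files` with `CREATE_NEW` stands in for `FileSystem.create(path, /* overwrite = */ false)`; both fail atomically when the path already exists, so exactly one concurrent caller succeeds. The method name `tryBecomeWriter` and the file name are illustrative.]

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of the "create with overwrite=false" mutual-exclusion pattern.
// java.nio.Files stands in for HDFS FileSystem.create(path, false):
// both throw FileAlreadyExistsException if the path already exists,
// so exactly one concurrent caller succeeds.
public class ExclusiveCreateDemo {

    /** Try to become the single writer; returns true for exactly one caller. */
    public static boolean tryBecomeWriter(Path file) throws IOException {
        try {
            // CREATE_NEW is an atomic create-if-absent, like overwrite=false.
            Files.newOutputStream(file, StandardOpenOption.CREATE_NEW).close();
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // another mapper created it first: catch and wait
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempDirectory("demo").resolve("bank.properties");
        System.out.println(tryBecomeWriter(file)); // true  (this task writes)
        System.out.println(tryBecomeWriter(file)); // false (this task waits)
    }
}
```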
>> Can you please refer to me some source or link on ZK , that can help me
>> in solving the problem.

You can check this: http://zookeeper.apache.org/doc/r3.4.6/recipes.html

Thanks,
Wangda

On Wed, Aug 13, 2014 at 9:34 AM, saurabh jain <sauravma...@gmail.com> wrote:

> Hi Wangda ,
>
> I am not sure making overwrite=false , will solve the problem. As per java
> doc by making overwrite=false , it will throw an exception if the file
> already exists. So, for all the remaining mappers it will throw an
> exception.
>
> Also I am very new to ZK and have very basic knowledge of it , I am not
> sure if it can solve the problem and if yes how. I am still going through
> available sources on the ZK.
>
> Can you please refer to me some source or link on ZK , that can help me in
> solving the problem.
>
> Best
> Saurabh
>
> On Tue, Aug 12, 2014 at 3:08 AM, Wangda Tan <wheele...@gmail.com> wrote:
>
>> Hi Saurabh,
>> It's an interesting topic,
>>
>> >> So , here is the question , is it possible to make sure that when one
>> >> of the mapper tasks is writing to a file , other should wait until the
>> >> first one is finished ? I read that all the mapper tasks don't
>> >> interact with each other
>>
>> A simple way to do this is using the HDFS namespace:
>> Create the file using "public FSDataOutputStream create(Path f, boolean
>> overwrite)" with overwrite=false. Only one mapper can successfully create
>> the file.
>>
>> After the write is completed, that mapper will create a flag file like
>> "completed" in the same folder. The other mappers can wait for the
>> "completed" file to be created.
>>
>> >> Is there any way to have synchronization between two independent map
>> >> reduce jobs?
>> I think ZK can do some complex synchronization here. Like mutex, master
>> election, etc.
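[Editor's note: the "completed" flag-file wait described above can be sketched as a simple polling loop. Again `java.nio` stands in for the HDFS `FileSystem` API, and the timeout and poll intervals are illustrative.]

```java
import java.nio.file.*;

// Sketch of waiting on a "completed" flag file, as suggested above.
// The mapper that won the exclusive create writes its output and then
// creates the flag; the other mappers poll until the flag appears
// (or they give up at the deadline).
public class FlagFileWait {

    /** Poll until the flag file exists; false if the deadline passes first. */
    public static boolean awaitFlag(Path flag, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!Files.exists(flag)) {
            if (System.currentTimeMillis() >= deadline) return false;
            Thread.sleep(pollMs);
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("flagdemo");
        Path flag = dir.resolve("completed");
        System.out.println(awaitFlag(flag, 50, 10)); // false: writer not done
        Files.createFile(flag);                      // the writer finishes
        System.out.println(awaitFlag(flag, 50, 10)); // true: flag present
    }
}
```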
>>
>> Hope this helps,
>>
>> Wangda Tan
>>
>> On Tue, Aug 12, 2014 at 10:43 AM, saurabh jain <sauravma...@gmail.com>
>> wrote:
>>
>> > Hi Folks ,
>> >
>> > I have been writing a map-reduce application where I am having an input
>> > file containing records and every field in the record is separated by
>> > some delimiter.
>> >
>> > In addition to this the user will also provide a list of columns that he
>> > wants to look up in a master properties file (stored in HDFS). If this
>> > column (lets say it a key) is present in the master properties file then
>> > get the corresponding value and update the key with this value , and if
>> > the key is not present in the master properties file then it will create
>> > a new value for this key , write it to the property file and also update
>> > it in the record.
>> >
>> > I have written this application , tested it and everything worked fine
>> > till now.
>> >
>> > *e.g :* *I/P Record :* This | is | the | test | record
>> >
>> > *Columns :* 2,4 (that means the code will look up only the fields *"is"*
>> > and *"test"* in the master properties file.)
>> >
>> > Here , I have a question.
>> >
>> > *Q 1:* In the case when my input file is huge and it is split across
>> > multiple mappers , I was getting the below mentioned exception where all
>> > the other mapper tasks were failing. *Also initially when I started the
>> > job my master properties file is empty.* In my code I have a check : if
>> > this file (master properties) doesn't exist , create a new empty file
>> > before submitting the job itself.
>> >
>> > e.g : If I have 4 splits of data , then 3 map tasks are failing. But
>> > after this all the failed map tasks restart and finally the job becomes
>> > successful.
>> >
>> > So , *here is the question , is it possible to make sure that when one
>> > of the mapper tasks is writing to a file , the others should wait until
>> > the first one is finished ?* I read that all the mapper tasks don't
>> > interact with each other.
>> >
>> > Also what will happen in the scenario when I start multiple parallel
>> > map-reduce jobs and all of them are working on the same properties file.
>> > *Is there any way to have synchronization between two independent map
>> > reduce jobs*?
>> >
>> > I have also read that ZooKeeper can be used in such scenarios , Is that
>> > correct ?
>> >
>> > Error:
>> > com.techidiocy.hadoop.filesystem.api.exceptions.HDFSFileSystemException:
>> > IOException - failed while appending data to the file -> Failed to
>> > create file [/user/cloudera/lob/master/bank.properties] for
>> > [DFSClient_attempt_1407778869492_0032_m_000002_0_1618418105_1] on client
>> > [10.X.X.17], because this file is already being created by
>> > [DFSClient_attempt_1407778869492_0032_m_000005_0_-949968337_1] on
>> > [10.X.X.17]
>> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2548)
>> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2377)
>> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2612)
>> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2575)
>> >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:522)
>> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
>> >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>> >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
>> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
>> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
>> >     at java.security.AccessController.doPrivileged(Native Method)
>> >     at javax.security.auth.Subject.doAs(Subject.java:415)
>> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
>> >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
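[Editor's note: stepping back to the lookup logic described in the original question (split the record on the delimiter, look the selected 1-based columns up in the master properties, mint a value for unseen keys, and update the record), a minimal sketch follows. The value format `"VAL_" + key` and the method name `resolve` are made up for illustration; the real application writes the master properties back to HDFS, which is the shared write this whole thread is about.]

```java
import java.util.*;

// Sketch of the per-record lookup from the original question:
// split the record on "|", and for each requested (1-based) column,
// replace the field with the master-properties value, creating a new
// value when the key is not yet present.
public class ColumnLookupDemo {

    public static List<String> resolve(String record, int[] columns,
                                       Map<String, String> master) {
        String[] fields = record.split("\\|");
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        for (int c : columns) {
            String key = fields[c - 1];
            // computeIfAbsent both reads and, for new keys, updates the
            // master map; in the real job that update goes back to the
            // shared HDFS file, hence the need for synchronization.
            fields[c - 1] = master.computeIfAbsent(key, k -> "VAL_" + k);
        }
        return Arrays.asList(fields);
    }

    public static void main(String[] args) {
        Map<String, String> master = new HashMap<>();
        master.put("is", "0042"); // "is" already known; "test" is new
        System.out.println(resolve("This | is | the | test | record",
                                   new int[]{2, 4}, master));
    }
}
```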