Hi Folks,

I have been writing a MapReduce application whose input file contains
records, where every field in a record is separated by a delimiter.

In addition, the user provides a list of columns to look up in a master
properties file (stored in HDFS). If a column's value (let's call it the
key) is present in the master properties file, the code fetches the
corresponding value and replaces the key with it in the record. If the
key is not present in the master properties file, the code creates a new
value for the key, writes the new entry to the properties file, and also
updates the record.

I have written and tested this application, and everything worked fine
until now.

*e.g.:* *I/P Record:* This | is | the | test | record

*Columns:* 2,4 (that means the code will look up only the fields *"is"* and
*"test"* in the master properties file.)
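
For reference, here is a simplified sketch of my per-record logic (not the
actual code; generateNewValue is a hypothetical stand-in for however new
values really get created):

import java.util.Properties;

public class RecordUpdater {

    // 'props' is the master properties file already loaded from HDFS;
    // 'lookupColumns' comes from the user (1-based, e.g. {2, 4}).
    public static String updateRecord(String record, int[] lookupColumns,
                                      Properties props) {
        String[] fields = record.split("\\s*\\|\\s*"); // '|' delimiter, trim spaces
        for (int col : lookupColumns) {
            String key = fields[col - 1];              // columns are 1-based
            String value = props.getProperty(key);
            if (value == null) {
                value = generateNewValue(key);         // hypothetical helper
                props.setProperty(key, value);         // flushed back to HDFS later
            }
            fields[col - 1] = value;                   // update the record in place
        }
        return String.join(" | ", fields);
    }

    // Hypothetical stand-in for the real value-generation logic.
    private static String generateNewValue(String key) {
        return "VAL_" + Integer.toHexString(key.hashCode());
    }
}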

Here, I have a question.

*Q 1:* When my input file is huge and is split across multiple mappers, I
get the exception below, and all the other mapper tasks fail. *Also, when I
start the job, my master properties file is initially empty.* My code
checks whether the master properties file exists and, if not, creates a
new empty file before the job is submitted.
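
Roughly, that driver-side check looks like this (simplified sketch; the
path is the same one that appears in the exception below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EnsureMasterFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path master = new Path("/user/cloudera/lob/master/bank.properties");
        if (!fs.exists(master)) {
            // Create an empty file and close it immediately so the
            // lease is released before the job starts.
            fs.create(master).close();
        }
    }
}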

e.g.: If I have 4 splits of data, then 3 map tasks fail. But after this,
all the failed map tasks are retried, and the job finally succeeds.

So, *here is the question: is it possible to make sure that when one
mapper task is writing to a file, the others wait until the first one has
finished?* I have read that mapper tasks don't interact with each other.

Also, what will happen in the scenario where I start multiple parallel
MapReduce jobs that all work on the same properties file? *Is there any
way to have synchronization between two independent MapReduce jobs?*

I have also read that ZooKeeper can be used in such scenarios. Is that
correct?
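
From my reading, something like Apache Curator's InterProcessMutex on top
of ZooKeeper seems to provide this kind of mutual exclusion. Is the sketch
below the right direction? (Minimal sketch only; the connection string and
lock path are made up.)

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class PropertiesFileLock {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper connection string and lock path.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zkhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        InterProcessMutex lock =
                new InterProcessMutex(client, "/locks/bank-properties");
        lock.acquire();          // blocks until no other process holds the lock
        try {
            // ... append to the properties file in HDFS ...
        } finally {
            lock.release();
            client.close();
        }
    }
}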


Error: com.techidiocy.hadoop.filesystem.api.exceptions.HDFSFileSystemException:
IOException - failed while appending data to the file -> Failed to
create file [/user/cloudera/lob/master/bank.properties] for
[DFSClient_attempt_1407778869492_0032_m_000002_0_1618418105_1] on
client [10.X.X.17], because this file is already being created by
[DFSClient_attempt_1407778869492_0032_m_000005_0_-949968337_1] on
[10.X.X.17]
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2548)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2377)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2612)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2575)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:522)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
