storeCheckpoint() in worker can be slow when a slow worker is present
Hi all, I have a question based on my recent experience using Giraph. In the Giraph worker code, I see a line like this:

getServerData().getCurrentMessageStore().writePartition(verticesOutputStream, partition.getId());

To the best of my knowledge, while executing this line a worker writes some of its partition-map entries to an output stream. However, when a slow worker is present, executing this line in another worker also gets slower. I see that execution speeds up again after the slow worker has finished its storeCheckpoint process.

From what I know, each worker uses its own message store and writes to a different output stream. So why can a slow storeCheckpoint process in one worker affect another worker's checkpoint process?

Thanks
Re: Issue in Aggregator
Hi Puneet, It's unclear to me what you're wanting in terms of aggregator behavior. Are you saying you want an aggregator whose final output is the aggregated value for just one particular worker? With an aggregator you should at least make sure the operations you're performing are commutative; that is, the order in which items are aggregated should not matter unless that order is explicitly dealt with somehow. Otherwise you'll get unpredictable results.

Best,
Matthew Saltz

On 08/11/2014 15:05, "Puneet Agarwal" wrote:
> Hi All,
> In my algo, I use an Aggregator that takes a Text value. I have written
> my custom aggregator class for this, as given below.
>
> public class MyAgg extends BasicAggregator<Text> {
>     ...
> }
>
> This works fine when running on my laptop with one worker.
> However, when running it on the cluster, it sometimes does not return the
> correctly aggregated value. It seems to be returning the locally aggregated
> value of one of the workers, while it should have used my logic to decide
> which of the aggregated values sent by the various workers should be chosen
> as the final aggregated value. (In fact I have not written such code
> anywhere, so it is doing the best it can.)
>
> Following is my analysis of this issue.
> a. I guess every worker aggregates the values locally.
> b. Then there is a global aggregation step, which simply compares the
> values sent by the various aggregators.
> c. For global aggregation it uses the Text.compareTo() method. This is the
> default Hadoop implementation and does not include the logic of my program.
> d. It seems that, because of the above, the value returned by my
> aggregator on the cluster is actually not globally aggregated; instead the
> locally aggregated value of one of the workers gets taken.
>
> If the above analysis is correct, here is how I think I can solve this:
> I should write my own class that implements the Writable interface. In this
> class I would also write a compareTo method, and as a result things will
> start working fine.
>
> If it used the MyAgg class itself to decide which of the values returned
> by the various workers should be taken as the globally aggregated value,
> this problem would not have occurred.
>
> I seek your guidance on whether my analysis is correct.
>
> - Puneet
> IIT Delhi, India
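Matthew's commutativity point can be illustrated with a small, Giraph-free sketch (the class and value names below are hypothetical, and a real Giraph aggregator would extend BasicAggregator&lt;Text&gt; instead of using a static helper):

```java
// Hypothetical sketch of a commutative combine rule for a Text-style aggregator.
// max() under lexicographic order satisfies max(a, b) == max(b, a), so the
// worker-local partial aggregates can be merged in any order with the same result.
public class LexMax {
    // Combine two partial aggregates; in Giraph this logic would live in
    // BasicAggregator.aggregate().
    public static String combine(String a, String b) {
        return a.compareTo(b) >= 0 ? a : b;
    }

    public static void main(String[] args) {
        // Each worker first aggregates its own values locally...
        String w1 = combine("apple", "pear");   // "pear"
        String w2 = combine("zebra", "fig");    // "zebra"
        // ...and the partials can then be merged in either order.
        System.out.println(combine(w1, w2));    // prints "zebra"
        System.out.println(combine(w2, w1));    // prints "zebra"
    }
}
```

If the combine operation were not commutative (for example, "always keep the first value seen"), the final result would depend on which worker's partial arrived first, which matches the unpredictable behavior described in the thread.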
Re: Help with Giraph on Yarn
Hi Tripti, I was finally able to run the test successfully. It was a permissions issue, since I was running as ale rather than as yarn. Let me say that I'm now able to run Giraph examples on YARN 2.5.1. This is the final result:

14/11/08 16:24:00 INFO yarn.GiraphYarnClient: Completed Giraph: org.apache.giraph.examples.SimpleShortestPathsComputation: SUCCEEDED, total running time: 0 minutes, 21 seconds.

Many thanks for your support,
Alessandro

On 06 Nov 2014, at 15:16, Tripti Singh wrote:
> I don't know if you have access to this node. But if you do, you could check
> whether the file is indeed there and you have access to it.
>
> Sent from my iPhone
>
> On 06-Nov-2014, at 6:12 pm, "Alessandro Negro" wrote:
>
>> You are right, it works, but now I receive the following error:
>>
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in [jar:file:/private/tmp/hadoop-yarn/nm-local-dir/usercache/ale/appcache/application_1415264041937_0009/filecache/10/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.5.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/opt/yarn/hadoop-2.5.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> 2014-11-06 13:15:37.120 java[10158:1803] Unable to load realm info from SCDynamicStore
>> Exception in thread "pool-4-thread-1" java.lang.IllegalStateException: Could not configure the containerlaunch context for GiraphYarnTasks.
>> at org.apache.giraph.yarn.GiraphApplicationMaster.getTaskResourceMap(GiraphApplicationMaster.java:391)
>> at org.apache.giraph.yarn.GiraphApplicationMaster.access$500(GiraphApplicationMaster.java:78)
>> at org.apache.giraph.yarn.GiraphApplicationMaster$LaunchContainerRunnable.buildContainerLaunchContext(GiraphApplicationMaster.java:522)
>> at org.apache.giraph.yarn.GiraphApplicationMaster$LaunchContainerRunnable.run(GiraphApplicationMaster.java:479)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>> Caused by: java.io.FileNotFoundException: File does not exist: hdfs://hadoop-master:9000/user/yarn/giraph_yarn_jar_cache/application_1415264041937_0009/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.5.1-jar-with-dependencies.jar
>> at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1072)
>> at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1064)
>> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1064)
>> at org.apache.giraph.yarn.YarnUtils.addFileToResourceMap(YarnUtils.java:153)
>> at org.apache.giraph.yarn.YarnUtils.addFsResourcesToMap(YarnUtils.java:77)
>> at org.apache.giraph.yarn.GiraphApplicationMaster.getTaskResourceMap(GiraphApplicationMaster.java:387)
>> ... 6 more
>>
>> That explains the other error I receive in the task: Could not find or load main class org.apache.giraph.yarn.GiraphYarnTask
>>
>> Thanks,
>>
>> On 06 Nov 2014, at 13:07, Tripti Singh wrote:
>>
>>> Why are you adding two jars?
>>> The example jar ideally contains the core library, so everything should be
>>> available with just the one examples jar included.
>>>
>>> Sent from my iPhone
>>>
>>> On 06-Nov-2014, at 4:33 pm, "Alessandro Negro" wrote:

Hi, now it seems better. I needed to add:

giraph-1.1.0-SNAPSHOT-for-hadoop-2.5.1-jar-with-dependencies.jar,giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.5.1-jar-with-dependencies.jar

Now it seems that after a lot of cycles it fails with this error: Could not find or load main class org.apache.giraph.yarn.GiraphYarnTask. But in this case the error appears in task-3-stderr.log, not in gam-stderr.log, where there is the following error:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/private/tmp/hadoop-yarn/nm-local-dir/usercache/ale/appcache/application_1415264041937_0006/filecache/12/giraph-1.1.0-SNAPSHOT-for-hadoop-2.5.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/private/tmp/hadoop-yarn/nm-local-dir/usercache/ale/appcache/application_1415264041937_0006/filecache/10/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.5.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/yarn/hadoop-2.5.1/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
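To illustrate Tripti's single-jar point, a submission along these lines passes only the examples jar, both as the launch jar and as the YARN resource. This is a command-line sketch from memory, not taken from this thread: the computation and format class names are real Giraph 1.1 examples, but the input/output paths are assumptions, and the option names (in particular -yj) should be checked against GiraphRunner's help output for your version:

```shell
# Hypothetical single-jar Giraph-on-YARN submission; paths are placeholders.
hadoop jar giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.5.1-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/ale/input/tiny_graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/ale/output/shortestpaths \
  -w 1 \
  -yj giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.5.1-jar-with-dependencies.jar
```

Since the same jar is the only one shipped to the containers, GiraphYarnTask should find its classes without the duplicate-SLF4J-binding noise that two overlapping jars produce.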
Issue in Aggregator
Hi All,
In my algo, I use an Aggregator that takes a Text value. I have written my custom aggregator class for this, as given below.

public class MyAgg extends BasicAggregator<Text> {
    ...
}

This works fine when running on my laptop with one worker. However, when running it on the cluster, it sometimes does not return the correctly aggregated value. It seems to be returning the locally aggregated value of one of the workers, while it should have used my logic to decide which of the aggregated values sent by the various workers should be chosen as the final aggregated value. (In fact I have not written such code anywhere, so it is doing the best it can.)

Following is my analysis of this issue.
a. I guess every worker aggregates the values locally.
b. Then there is a global aggregation step, which simply compares the values sent by the various aggregators.
c. For global aggregation it uses the Text.compareTo() method. This is the default Hadoop implementation and does not include the logic of my program.
d. It seems that, because of the above, the value returned by my aggregator on the cluster is actually not globally aggregated; instead the locally aggregated value of one of the workers gets taken.

If the above analysis is correct, here is how I think I can solve this: I should write my own class that implements the Writable interface. In this class I would also write a compareTo method, and as a result things will start working fine.

If it used the MyAgg class itself to decide which of the values returned by the various workers should be taken as the globally aggregated value, this problem would not have occurred.

I seek your guidance on whether my analysis is correct.

- Puneet
IIT Delhi, India
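A Hadoop-free sketch of the fix proposed above (all names here are hypothetical, not from the thread): a value type whose compareTo encodes the program's own ordering rather than Text's raw byte-wise order. A real Giraph value class would additionally implement Writable's write/readFields for serialization:

```java
// Hypothetical value type: ordered by an application-level score, with the
// label only as a tie-breaker, instead of plain lexicographic comparison.
public class ScoredValue implements Comparable<ScoredValue> {
    final String label;
    final long score;

    ScoredValue(String label, long score) {
        this.label = label;
        this.score = score;
    }

    @Override
    public int compareTo(ScoredValue other) {
        // Custom logic: compare by score first, then by label for ties.
        int byScore = Long.compare(this.score, other.score);
        return byScore != 0 ? byScore : this.label.compareTo(other.label);
    }

    public static void main(String[] args) {
        ScoredValue a = new ScoredValue("w1-result", 10);
        ScoredValue b = new ScoredValue("w2-result", 3);
        // Lexicographically "w1-result" < "w2-result", but the custom
        // ordering ranks a higher because its score is larger.
        System.out.println(a.compareTo(b) > 0);  // prints "true"
    }
}
```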
Re: java.net.ConnectException: Connection refused
Thanks Xenia. I also managed to solve the issue. Here is how.

I ran netstat on all the computers in my cluster while running the Giraph job in parallel. I learnt that this process runs on 127.0.0.1 while the other machines try to connect on 172.21.xx.xxx. That is why it gives the java.net.ConnectException error. I then opened the /etc/hosts file and found that the machine's hostname was also mapped to 127.0.0.1. I removed that entry and it worked.

It took me quite a while to resolve this issue, but finally it worked.

Puneet
IIT Delhi, India

On Tuesday, November 4, 2014 4:09 AM, Xenia Demetriou wrote:

Hi Puneet, I am not an expert, but I had the same error and I solved it by changing the hostnames of the cluster PCs to lowercase, e.g. iHadoop3 -> ihadoop3.

--
Xenia

2014-11-02 14:08 GMT+02:00 Puneet Agarwal:

I have set up a cluster of 4 computers for running my Pregel jobs. When running a job I often get the error given below. I followed another thread on the Giraph forums and learnt that this problem is caused by a firewall stopping network traffic. I have stopped the firewall service on all the machines; they run RHEL 5.5 and I stopped the service using the command "service iptables stop". But I still get the same error. Can someone tell me what could be causing this service to be blocked on port 30001 on this computer?

Regards
Puneet (IIT Delhi, India)

[Linked thread: "Re: Problem running the PageRank example in a cluster" — this is the output of the command on all servers: Chain INPUT (policy ACCEPT) target prot opt source destination ACCEPT tcp -- anywhere anywhere state NEW tcp dpts:3:30010 ACCEPT tcp -- anywhere anywhere ...]

Error
===
Using Netty without authentication.
2014-11-02 14:26:24,458 WARN org.apache.giraph.comm.netty.NettyClient: connectAllAddresses: Future failed to connect with iHadoop3/172.21.208.178:30001 with 0 failures because of java.net.ConnectException: Connection refused
2014-11-02 14:26:24,458 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without authentication.
2014-11-02 14:26:24,459 INFO org.apache.giraph.comm.netty.NettyClient: connectAllAddresses: Successfully added 0 connections, (0 total connected) 1 failed, 1 failures total.
2014-11-02 14:26:24,499 WARN org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught: Channel failed with remote address null
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
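For readers hitting the same error, the /etc/hosts change Puneet describes looks roughly like this. The hostname and address are taken from the log in this thread, but the exact file lines are a reconstruction, not a copy of his file:

```
# Before (problematic): the machine's real hostname is also bound to loopback,
# so the Giraph worker listens on 127.0.0.1 while peers try 172.21.208.178:30001.
127.0.0.1      localhost ihadoop3

# After (working): the hostname maps only to the routable cluster address.
127.0.0.1      localhost
172.21.208.178 ihadoop3
```

The symptom to look for is exactly what netstat showed here: the local process bound to 127.0.0.1:30001 while remote workers connect to the 172.21.x.x address.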