Re: Local-only aggregators
Hi, I'm not sure aggregators necessarily require high traffic: aggregated values are first aggregated locally on each worker before they are aggregated on the (corresponding) master worker. Anyway, assuming you want to proceed, my understanding is that you want vertices on the same worker to share (aggregated) information. In that case, I'd suggest just using a WorkerContext. Hope this helps. Claudio

On Wed, Mar 25, 2015 at 12:47 AM Alessio Arleo ingar...@icloud.com wrote:

Hello everybody, I was wondering if it is possible to extend the concept of aggregator from a "global" to a "local-only" perspective. Normally, aggregators DO cause network traffic because of the cycle: Workers -> AggregatorOwner -> MasterAggregator -> AggregatorOwner -> Workers. What if I'd like to fetch and aggregate values as I would normally do with aggregators, but without causing this traffic?

Let's assume this situation:

1 - Define a custom partitioning class and let it partition the graph. This is the partitioning used to assign vertices to workers.
2 - In the computation class, every time the compute method is called on a vertex, the data needed for the computation is stored inside the vertex's neighbours but also in non-neighbouring vertices (think about a force-directed layout algorithm, for example: to compute the forces, the distances to both neighbouring and non-neighbouring vertices are needed, applying different kinds of forces).

Given that the compute class is computing on vertex X:

a - I pick information from X's neighbours as I would normally do (iterating over its edges or the incoming messages).
b - When it comes to non-neighbouring vertices, I would like to use data from X's worker only.

The first thing I tried to understand before asking this question was: does this make any sense? I am probably wrong, but I think it actually does. If I partition my graph to maximize locality, what I am actually trying to do is to reduce the network traffic as much as possible.

My doubt is that if I use aggregators to achieve this, the network traffic would be heavy, probably losing the advantages of the initial partitioning. What if I could access and modify an aggregator-like local data structure in the same fashion (i.e. "getAggregatedValue") but without broadcasting it (assuming that I do not need the aggregator to be accessible to every worker)? Or could it be possible to manually assign partition owners in order to minimise network traffic (if I need to aggregate all values from vertices in partition 3 and 3 only, I assign the partition 3 aggregator owner to the partition 3 worker)?

I hope in your comprehension and I hope I somehow caught your attention, even if for a brief moment. Ask me if something is not clear ;) Cheers!

~~~ Ing. Alessio Arleo Dottorando in Ingegneria Industriale e dell'Informazione Dottore Magistrale in Ingegneria Informatica e dell'Automazione Dottore in Ingegneria Informatica ed Elettronica Linkedin: it.linkedin.com/in/IngArleo Skype: Ing. Alessio Arleo Tel: +39 075 5853920 Cell: +39 349 0575782 ~~~
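To make the WorkerContext suggestion above concrete, here is a minimal, self-contained sketch of the worker-local aggregation pattern. It deliberately avoids Giraph's actual API: in a real job this state would live in a WorkerContext subclass, shared by all vertices computed on the same worker; the class and method names here are invented for illustration.

```java
// Minimal sketch of worker-local aggregation, independent of Giraph's API.
// In Giraph, this state would live in a WorkerContext subclass; every vertex
// computed on the same worker sees the same instance, so values can be
// aggregated without any network traffic.
public class WorkerLocalSum {
    private double sum = 0.0;   // shared per-worker state

    // Each vertex's compute() would call this on its worker's context.
    public synchronized void aggregate(double value) {
        sum += value;
    }

    // Any vertex on the same worker can read the running aggregate.
    public synchronized double get() {
        return sum;
    }
}
```

Unlike a regular aggregator, the value never leaves the worker, which is exactly the locality-preserving behaviour asked about; the trade-off is that other workers cannot see it.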
Re: Compiling Giraph for Hadoop 2.5.x and 2.6.0 -- SASL_PROPS variable error
I see more and more people running into this. I wonder whether we should add the fix to the pure_yarn profile by default, as it feels like it's here to stay. Ideas?

On Sat, Jan 10, 2015 at 7:38 PM, Eugene Koontz ekoo...@hiro-tan.org wrote:

Hi Alessio and Eli, compiling with mvn -Phadoop_yarn -Dhadoop.version=2.6.0 clean will avoid the SASL_PROPS compilation error below if you remove STATIC_SASL_SYMBOL from the munge.symbols of the hadoop_yarn profile, as follows:

diff --git a/pom.xml b/pom.xml
index cf0e1f9..8c2a561 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1194,7 +1194,7 @@ under the License.
   </modules>
   <properties>
     <hadoop.version>SET_HADOOP_VERSION_USING_MVN_DASH_D_OPTION</hadoop.version>
-    <munge.symbols>PURE_YARN,STATIC_SASL_SYMBOL</munge.symbols>
+    <munge.symbols>PURE_YARN</munge.symbols>
     <!-- TODO: add these checks eventually -->
     <project.enforcer.skip>true</project.enforcer.skip>
     <giraph.maven.dependency.plugin.skip>true</giraph.maven.dependency.plugin.skip>

In other words, when compiling Giraph against newer releases of Hadoop, there is no need for this munge symbol. The distinction between newer and older seems to be release 2.4.0 of Hadoop, as given here: https://issues.apache.org/jira/browse/HADOOP-10221 ("Add a plugin to specify SaslProperties for RPC protocol based on connection properties"). It seems like we need to add some additional profiles to distinguish pre-2.4 Hadoop (which requires the munge symbol STATIC_SASL_SYMBOL) from newer releases (which should not use it). -Eugene

On 1/8/15, 11:13 PM, Eugene Koontz wrote: Hi Alessio, I am able to reproduce your problem: https://gist.github.com/ekoontz/7dbaaf6218abb4fd7832 I'll try building Hadoop 2.6.0 and getting Giraph to work with it. -Eugene

On 1/8/15, 10:55 AM, Eli Reisman wrote: This looks like a munge symbol that needs to be added to the hadoop_yarn profile in the pom.xml. I'm thinking this is the issue a couple of people have been having on 2.5 and 2.6 trying to build the hadoop_yarn profile?

On Thu, Dec 4, 2014 at 1:01 PM, Dr. Alessio Arleo ingar...@icloud.com wrote:

Hello everybody, I am trying to compile Giraph release-1.1 for Hadoop 2.5.x and Hadoop 2.6.0 with the Maven profile hadoop_yarn. It works fine up to Hadoop 2.4.1, but when trying with a newer version of Hadoop the following error comes up. I am working with JDK 1.7 and Maven 3.2.1.

[ERROR] COMPILATION ERROR :
[INFO] -
[ERROR] /home/hadoop/git/giraph/1.1/giraph-core/target/munged/main/org/apache/giraph/comm/netty/SaslNettyClient.java:[84,68] cannot find symbol
  symbol: variable SASL_PROPS
  location: class org.apache.hadoop.security.SaslRpcServer
[ERROR] /home/hadoop/git/giraph/1.1/giraph-core/target/munged/main/org/apache/giraph/comm/netty/SaslNettyServer.java:[105,62] cannot find symbol
  symbol: variable SASL_PROPS
  location: class org.apache.hadoop.security.SaslRpcServer

Do you have any suggestions? Any would be much appreciated :) Kind regards, Alessio

-- Claudio Martella
Re: Please welcome our newest committer, Sergey Edunov!
Congrats Sergey and welcome!

On Wed, Dec 3, 2014 at 7:34 PM, Maja Kabiljo majakabi...@fb.com wrote:

I am happy to announce that the Project Management Committee (PMC) for Apache Giraph has elected Sergey Edunov to become a committer, and he accepted. Sergey has been an active member of the Giraph community, finding issues, submitting patches and reviewing code. We're looking forward to Sergey's larger involvement and future work. List of his contributions:

GIRAPH-895: Trim the edges in Giraph
GIRAPH-896: Memory leak in SuperstepMetricsRegistry
GIRAPH-897: Add an option to dump only live objects to JMap
GIRAPH-898: Remove giraph-accumulo from Facebook profile
GIRAPH-903: Detect crashes on Netty threads
GIRAPH-924: Fix checkpointing
GIRAPH-925: Unit tests should pass even if zookeeper port not available
GIRAPH-927: Decouple netty server threads from message processing
GIRAPH-933: Checkpointing improvements
GIRAPH-936: Decouple netty server threads from message processing
GIRAPH-940: Cleanup the list of supported hadoop versions
GIRAPH-950: Auto-restart from checkpoint doesn't pick up latest checkpoint
GIRAPH-963: Aggregators may fail with IllegalArgumentException upon deserialization

Best, Maja

-- Claudio Martella
Giraph counters on Yarn
Hello, is anybody on the list able to get the standard job counters printed at the end of a job when using pure YARN? I can get the logs, but I cannot find the usual stats that are printed to the command line with MapReduce. Thanks, Claudio -- Claudio Martella
Re: Enabling Giraph Level Logging - Hadoop-2.2.0
For completeness and future reference, where can they be found if you run it as a pure YARN app?

On Sun, Nov 16, 2014 at 9:07 PM, Eli Reisman apache.mail...@gmail.com wrote: If you mean running as a MapReduce application as opposed to running directly on YARN, the logs should be where MR is configured to put them, with per-worker logs in the MR task logs for the cluster job.

On Mon, Nov 10, 2014 at 11:41 AM, Charith Wickramarachchi charith.dhanus...@gmail.com wrote: Hi, I am running Apache Giraph 1.1.0 on Hadoop 2.2.0 as a MapReduce application, but I could not find the Giraph logs. It would be great if someone could tell me how to enable Apache Giraph logging. Also, I see that Giraph collects very detailed runtime statistics; how can I collect those stats? Thanks, Charith

-- Charith Dhanushka Wickramaarachchi Tel +1 213 447 4253 Web http://apache.org/~charith http://www-scf.usc.edu/~cwickram/ http://charith.wickramaarachchi.org/ Blog http://charith.wickramaarachchi.org/ http://charithwiki.blogspot.com/ Twitter @charithwiki https://twitter.com/charithwiki

This communication may contain privileged or other confidential information and is intended exclusively for the addressee/s. If you are not the intended recipient/s, or believe that you may have received this communication in error, please reply to the sender indicating that fact and delete the copy you received and in addition, you should not print, copy, retransmit, disseminate, or otherwise use the information contained in this communication. Internet communications cannot be guaranteed to be timely, secure, error or virus-free. The sender does not accept liability for any errors or omissions

-- Claudio Martella
Re: [VOTE] Apache Giraph 1.1.0 RC2
+1.

On Thu, Nov 13, 2014 at 2:28 PM, Roman Shaposhnik ro...@shaposhnik.org wrote:

This vote is for Apache Giraph, version 1.1.0 release. It fixes the following issues: http://s.apache.org/a8X

*** Please download, test and vote by Mon 11/17 noon PST

Note that we are voting upon the source (tag): release-1.1.0-RC2

Source and binary files are available at: http://people.apache.org/~rvs/giraph-1.1.0-RC2/
Staged website is available at: http://people.apache.org/~rvs/giraph-1.1.0-RC2/site/
Maven staging repo is available at: https://repository.apache.org/content/repositories/orgapachegiraph-1003

Please notice that, as per earlier agreement, two sets of artifacts are published, differentiated by the version ID:
* version ID 1.1.0 corresponds to the artifacts built for the hadoop_1 profile
* version ID 1.1.0-hadoop2 corresponds to the artifacts built for the hadoop_2 profile.

The tag to be voted upon (release-1.1.0-RC2): https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=log;h=refs/tags/release-1.1.0-RC2

The KEYS file containing the PGP keys we use to sign the release: http://svn.apache.org/repos/asf/bigtop/dist/KEYS

Thanks, Roman.

-- Claudio Martella
Re: [VOTE] Apache Giraph 1.1.0 RC1
Yes, I did re-run the build this weekend, and it built successfully for the default profile and the hadoop_2 one. I ran a couple of examples on the cluster, and they ran successfully. I'm +1.

On Tue, Nov 4, 2014 at 8:10 PM, Roman Shaposhnik ro...@shaposhnik.org wrote:

On Tue, Nov 4, 2014 at 5:47 AM, Claudio Martella claudio.marte...@gmail.com wrote: I am indeed having some problems. mvn install will fail because the test is opening too many files: [snip] I have to investigate why this happens. I'm not using a different ulimit than what I have on my Mac OS X by default. Where are you building yours?

This is really weird. I have no issues whatsoever on Mac OS X with the following setup:

$ uname -a
Darwin usxxshaporm1.corp.emc.com 12.4.1 Darwin Kernel Version 12.4.1: Tue May 21 17:04:50 PDT 2013; root:xnu-2050.40.51~1/RELEASE_X86_64 x86_64

$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 2560
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 709
virtual memory (kbytes, -v) unlimited

$ mvn --version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T13:58:10-07:00)
Maven home: /Users/shapor/dist/apache-maven-3.2.3
Java version: 1.7.0_51, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: mac os x, version: 10.8.4, arch: x86_64, family: mac

Thanks, Roman.

-- Claudio Martella
Re: Compiling Giraph 1.1
I just built trunk with that command. Are you sure you're building the latest trunk?

On Fri, Nov 7, 2014 at 3:21 PM, Ryan freelanceflashga...@gmail.com wrote: Any updated thoughts on this?

On Tue, Nov 4, 2014 at 5:59 PM, Ryan freelanceflashga...@gmail.com wrote: It's 'mvn -Phadoop_2 -fae -DskipTests clean install'. Thanks, Ryan

On Tue, Nov 4, 2014 at 2:02 PM, Roman Shaposhnik ro...@shaposhnik.org wrote: What's the exact compilation incantation you use? Thanks, Roman.

On Tue, Nov 4, 2014 at 9:56 AM, Ryan freelanceflashga...@gmail.com wrote: I'm attempting to build, compile and install Giraph 1.1 on a server running CDH 5.1.2. A few weeks ago I successfully compiled it by changing the hadoop_2 profile version to 2.3.0-cdh5.1.2. I recently did a fresh install and was unable to build, compile and install (perhaps due to the latest code updates). The error seems to be related to SaslNettyClient and SaslNettyServer. Any ideas on fixes? Here's part of the error log:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.0:compile (default-compile) on project giraph-core: Compilation failure: Compilation failure:
[ERROR] /[myPath]/giraph/giraph-core/src/main/java/org/apache/giraph/comm/netty/SaslNettyClient.java:[28,34] cannot find symbol
[ERROR] symbol: class SaslPropertiesResolver
[ERROR] location: package org.apache.hadoop.security
...
[ERROR] /[myPath]/giraph/giraph-core/src/main/java/org/apache/giraph/comm/netty/SaslNettyServer.java:[108,11] cannot find symbol
[ERROR] symbol: variable SaslPropertiesResolver
[ERROR] location: class org.apache.giraph.comm.netty.SaslNettyServer

-- Claudio Martella
Re: [VOTE] Apache Giraph 1.1.0 RC1
I am indeed having some problems. mvn install will fail because the test is opening too many files:

Caused by: java.io.FileNotFoundException: /private/var/folders/5b/8yx5dbyn40nbt_70syjs86chgp/T/giraph-hive-1415098102276/metastore_db/seg0/c90.dat (Too many open files in system)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.derby.impl.io.DirRandomAccessFile.<init>(Unknown Source)
at org.apache.derby.impl.io.DirRandomAccessFile4.<init>(Unknown Source)
at org.apache.derby.impl.io.DirFile4.getRandomAccessFile(Unknown Source)
at org.apache.derby.impl.store.raw.data.RAFContainer.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.derby.impl.store.raw.data.RAFContainer.createContainer(Unknown Source)
at org.apache.derby.impl.store.raw.data.RAFContainer4.createContainer(Unknown Source)
at org.apache.derby.impl.store.raw.data.FileContainer.createIdent(Unknown Source)
at org.apache.derby.impl.store.raw.data.RAFContainer.createIdentity(Unknown Source)
at org.apache.derby.impl.services.cache.ConcurrentCache.create(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.addContainer(Unknown Source)
at org.apache.derby.impl.store.raw.xact.Xact.addContainer(Unknown Source)
at org.apache.derby.impl.store.access.heap.Heap.create(Unknown Source)
at org.apache.derby.impl.store.access.heap.HeapConglomerateFactory.createConglomerate(Unknown Source)
at org.apache.derby.impl.store.access.RAMTransaction.createConglomerate(Unknown Source)
at org.apache.derby.impl.sql.catalog.DataDictionaryImpl.createConglomerate(Unknown Source)
at org.apache.derby.impl.sql.catalog.DataDictionaryImpl.createDictionaryTables(Unknown Source)
at org.apache.derby.impl.sql.catalog.DataDictionaryImpl.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.createPersistentService(Unknown Source)
at org.apache.derby.iapi.services.monitor.Monitor.createPersistentService(Unknown Source)
... 96 more

I have to investigate why this happens. I'm not using a different ulimit than what I have on my Mac OS X by default. Where are you building yours?

On Sat, Nov 1, 2014 at 11:49 PM, Roman Shaposhnik ro...@shaposhnik.org wrote: Ping! Any progress on testing the current RC? Thanks, Roman.

On Fri, Oct 31, 2014 at 9:00 AM, Claudio Martella claudio.marte...@gmail.com wrote: Oh, thanks for the info!

On Fri, Oct 31, 2014 at 3:06 PM, Roman Shaposhnik ro...@shaposhnik.org wrote:

On Fri, Oct 31, 2014 at 3:26 AM, Claudio Martella claudio.marte...@gmail.com wrote: Hi Roman, thanks again for this. I have had a look at the staging site so far (our cluster has been down the whole week... universities...), and I was wondering if you have an insight into why some of the docs are missing, e.g. the gora and rexster documentation.

None of them are missing. The links moved to a "User Docs -> Modules" menu though: http://people.apache.org/~rvs/giraph-1.1.0-RC1/site/gora.html http://people.apache.org/~rvs/giraph-1.1.0-RC1/site/rexster.html and so forth. Thanks, Roman.

-- Claudio Martella
Re: Graph partitioning and data locality
Hi, answers are inline.

On Tue, Nov 4, 2014 at 8:36 AM, Martin Junghanns martin.jungha...@gmx.net wrote:

Hi group, I have a question concerning the graph partitioning step. If I understood the code correctly, the graph is distributed into n partitions by using vertexID.hashCode() % n. I have two questions concerning that step.

1) Is the whole graph loaded and partitioned only by the master? This would mean the whole data has to be moved to that master map job and then moved to the physical node the specific worker for the partition runs on. As this sounds like a huge overhead, I further inspected the code: I saw that there is also a WorkerGraphPartitioner, and I assume it calls the partitioning method on its local data (let's say its local HDFS blocks), and if the resulting partition for a vertex is not itself, the data gets moved to that worker, which reduces the overhead. Is this assumption correct?

That is correct: workers forward vertex data to the worker responsible for that vertex via hash partitioning (by default), meaning that the master is not involved.

2) Let's say the graph is already partitioned in the file system, e.g. blocks on physical nodes contain logically connected graph nodes. Is it possible to just read the data as it is and skip the partitioning step? In that case I currently assume that the vertexID should contain the partitionID, and the custom partitioning would be an identity function (instead of hashing or ranges).

In principle you can. You would need to organize the splits so that they contain all the data for each particular worker, and then assign the relevant splits to the corresponding worker.

Thanks for your time and help! Cheers, Martin

-- Claudio Martella
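The default assignment discussed above (vertexID.hashCode() % n) boils down to a one-liner. The following is a simplified, hypothetical restatement of the idea, not Giraph's actual partitioner code; Math.floorMod is used so that negative hash codes still map to a valid partition.

```java
public class HashPartitioning {
    // Assign a vertex to one of numPartitions partitions by hashing its id,
    // mirroring the default hash-partitioning scheme described above.
    public static int partitionFor(long vertexId, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions), even for
        // negative hash codes, unlike the raw % operator.
        return Math.floorMod(Long.hashCode(vertexId), numPartitions);
    }
}
```

For question 2, an "identity" scheme would simply decode the partition id out of the vertex id instead of hashing it.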
Re: [VOTE] Apache Giraph 1.1.0 RC1
Hi Roman, thanks again for this. I have had a look at the staging site so far (our cluster has been down the whole week... universities...), and I was wondering if you have an insight into why some of the docs are missing, e.g. the gora and rexster documentation. Thanks, Claudio

On Fri, Oct 31, 2014 at 6:38 AM, Roman Shaposhnik ro...@shaposhnik.org wrote:

On Wed, Oct 29, 2014 at 6:51 PM, Maja Kabiljo majakabi...@fb.com wrote: Roman, again thanks for taking care of the release. We found one issue, https://issues.apache.org/jira/browse/GIRAPH-961 - any application using MasterLoggingAggregator fails without this fix. Can we backport it to the release?

This looks like a really good idea to me. I will be re-cutting the RC over the weekend. For now I'd really, really, really ask everybody to once again consider issues like GIRAPH-961 so that we don't have to re-cut multiple times. Thanks, Roman.

-- Claudio Martella
Re: [VOTE] Apache Giraph 1.1.0 RC1
Oh, thanks for the info!

On Fri, Oct 31, 2014 at 3:06 PM, Roman Shaposhnik ro...@shaposhnik.org wrote:

On Fri, Oct 31, 2014 at 3:26 AM, Claudio Martella claudio.marte...@gmail.com wrote: Hi Roman, thanks again for this. I have had a look at the staging site so far (our cluster has been down the whole week... universities...), and I was wondering if you have an insight into why some of the docs are missing, e.g. the gora and rexster documentation.

None of them are missing. The links moved to a "User Docs -> Modules" menu though: http://people.apache.org/~rvs/giraph-1.1.0-RC1/site/gora.html http://people.apache.org/~rvs/giraph-1.1.0-RC1/site/rexster.html and so forth. Thanks, Roman.

-- Claudio Martella
Re: Resource Allocation Model Of Apache Giraph
giraph.userPartitionCount is the way to go, but not giraph.maxPartitionsInMemory: that one is for the out-of-core graph functionality.

On Fri, Oct 24, 2014 at 1:23 PM, Matthew Saltz sal...@gmail.com wrote: You may set giraph.userPartitionCount=<number of workers> and giraph.maxPartitionsInMemory=1. As Avery said, though, since parallelism occurs at the partition level (each thread processes a different partition), if you only have one partition per worker you cannot take advantage of multithreading. Best, Matthew

On Fri, Oct 24, 2014 at 3:53 AM, Zhang, David (Paypal Risk) pengzh...@ebay.com wrote: I think there is no good solution. You can try to run a Java application using FileInputFormat.getSplits to get the size of the splits array, which you can set as the number of Giraph workers. Or run a simple map-reduce job using IdentityMapper to see how many mappers there are. Thanks, Zhang, David (Paypal Risk)

From: Charith Wickramarachchi [mailto:charith.dhanus...@gmail.com] Sent: 2014-10-24 5:37 To: user Subject: Re: Resource Allocation Model Of Apache Giraph

Thanks Claudio and Avery, I found a way to configure hadoop to have the desired number of mappers per machine, as Claudio mentioned. Avery, could you please tell me how I can configure Giraph to make each worker handle only a single partition? Thanks, Charith

On Thu, Oct 23, 2014 at 2:26 PM, Avery Ching ach...@apache.org wrote: Regarding your second point, partitions are decoupled from workers. A worker can handle zero or more partitions. You can make each worker handle one partition, but we typically like multiple partitions since we can use multi-threading per machine.

On 10/23/14, 9:04 AM, Claudio Martella wrote: the way mappers (or containers), and hence workers, are assigned to machines is not under the control of Giraph, but of the underlying Hadoop environment (with different responsibilities that depend on the Hadoop version, e.g. YARN). You'll have to tweak your Hadoop configuration to control the maximum number of workers assigned to one machine (optimally one, with multiple threads).

On Thu, Oct 23, 2014 at 5:53 PM, Charith Wickramarachchi charith.dhanus...@gmail.com wrote: Hi Folks, I'm wondering what the resource allocation model for Apache Giraph is. As I understand it, each worker is mapped one-to-one with a Mapper, and a worker can process multiple partitions with a user-defined number of threads. Is it possible to make sure that one worker processes only a single partition? Also, is it possible to control the worker assignment in the cluster nodes? (E.g. make sure only N workers run on a single machine, assuming we have enough resources.) Thanks, Charith

-- Charith Dhanushka Wickramaarachchi Tel +1 213 447 4253 Web http://apache.org/~charith http://www-scf.usc.edu/%7Ecwickram/ Blog http://charith.wickramaarachchi.org/ http://charithwiki.blogspot.com/ Twitter @charithwiki https://twitter.com/charithwiki

-- Claudio Martella

-- Charith Dhanushka Wickramaarachchi Tel +1 213 447 4253 Web http://apache.org/~charith http://www-scf.usc.edu/~cwickram/ Blog http://charith.wickramaarachchi.org/ http://charithwiki.blogspot.com/ Twitter @charithwiki https://twitter.com/charithwiki

-- Claudio Martella
Re: Resource Allocation Model Of Apache Giraph
the way mappers (or containers), and hence workers, are assigned to machines is not under the control of Giraph, but of the underlying Hadoop environment (with different responsibilities that depend on the Hadoop version, e.g. YARN). You'll have to tweak your Hadoop configuration to control the maximum number of workers assigned to one machine (optimally one, with multiple threads).

On Thu, Oct 23, 2014 at 5:53 PM, Charith Wickramarachchi charith.dhanus...@gmail.com wrote: Hi Folks, I'm wondering what the resource allocation model for Apache Giraph is. As I understand it, each worker is mapped one-to-one with a Mapper, and a worker can process multiple partitions with a user-defined number of threads. Is it possible to make sure that one worker processes only a single partition? Also, is it possible to control the worker assignment in the cluster nodes? (E.g. make sure only N workers run on a single machine, assuming we have enough resources.) Thanks, Charith

-- Charith Dhanushka Wickramaarachchi Tel +1 213 447 4253 Web http://apache.org/~charith http://www-scf.usc.edu/~cwickram/ http://charith.wickramaarachchi.org/ Blog http://charith.wickramaarachchi.org/ http://charithwiki.blogspot.com/ Twitter @charithwiki https://twitter.com/charithwiki

-- Claudio Martella
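Pulling the thread's advice together, a launch command might look like the following. The jar name, job class, and input/output arguments are placeholders, and the exact option names should be checked against your Giraph version (giraph.userPartitionCount and giraph.numComputeThreads are the knobs commonly mentioned for partition count and per-worker threading).

```shell
# Hypothetical invocation: 4 workers, one partition per worker,
# single compute thread per worker (names and paths are placeholders).
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
  my.app.MyComputation \
  -vif my.app.MyVertexInputFormat -vip /input \
  -vof my.app.MyVertexOutputFormat -op /output \
  -w 4 \
  -ca giraph.userPartitionCount=4 \
  -ca giraph.numComputeThreads=1
```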
Re: how do I maintain a cached List across supersteps?
I would use a WorkerContext: it is shared by all the vertices in a worker and persistent across supersteps during the computation. If it's read-only, you won't have to manage concurrency.

On Tue, Sep 16, 2014 at 9:42 PM, Matthew Cornell m...@matthewcornell.org wrote: Hi Folks. I have a custom argument that's passed into my Giraph job that needs parsing. The parsed value is accessed by my Vertex#compute. To avoid excessive GC I'd like to cache the parsing results. What's a good way to do so? I looked at using the ImmutableClassesGiraphConfiguration returned by getConf(), but it supports only String properties. I looked at using my custom MasterCompute to manage it, but I couldn't find how to access the master compute instance from the vertex. My last idea is to use (abuse?) an aggregator to do this. I'd appreciate your thoughts! -- matt

-- Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34 Dickinson Street, Amherst MA 01002 | matthewcornell.org

-- Claudio Martella
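The parse-once caching pattern being suggested can be sketched in plain Java. This is a stand-in for illustration only: in Giraph the field would live in your WorkerContext (initialized once before computation starts), and the comma-separated argument format here is invented.

```java
import java.util.Arrays;
import java.util.List;

// Worker-level cache that parses an argument exactly once; all subsequent
// calls return the cached result, avoiding repeated parsing and GC churn.
public class ParsedArgCache {
    private volatile List<String> parsed;  // read-only after initialization

    public List<String> get(String rawArg) {
        if (parsed == null) {
            synchronized (this) {           // double-checked lazy init
                if (parsed == null) {
                    parsed = Arrays.asList(rawArg.split(","));
                }
            }
        }
        return parsed;
    }
}
```

Since the cached value is never mutated after initialization, vertices can read it concurrently without further locking, matching the "read-only means no concurrency management" point above.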
Re: OrientDB Rexster Apache Giraph combination
are you using trunk? The giraph-rexster documentation is here: http://giraph.apache.org/rexster.html

On Wed, May 28, 2014 at 8:41 AM, Arun Km arunkm@gmail.com wrote: Hi Claudio, I'm a beginner with Giraph. I was following the Apache Giraph Quick Start link, and I could see only the Giraph accumulo, core, examples, hbase, hcatalog and hive subprojects. Will you please direct me to the giraph-rexster subproject? About the documentation, I was referring to http://giraph.apache.org/apidocs/ . Will you please direct me to the right link? Thanks a lot, Arun

On 23 May 2014 13:34, Arun Km arunkm@gmail.com wrote: Thanks for the help, let me look into the subprojects. Cheers!

On 22 May 2014 20:20, Claudio Martella claudio.marte...@gmail.com wrote: You can have a look at the giraph-rexster subproject within the giraph codebase. There is also some documentation on our site.

On Thu, May 22, 2014 at 3:25 PM, Arun Km arunkm@gmail.com wrote: Hello, I would like to know your opinion on the OrientDB -- Rexster -- Apache Giraph combination. I'm more interested in the *input/output formats* to be used between these three. BR, Arun

-- Claudio Martella
Re: n-ary relationship on Giraph
Well, you don't know how many supersteps his computation takes. What he's asking is very typical in the semantic web community, and he's basically hinting at indexes on the labels of the edges. Imagine he wants to involve in a superstep of the computation (potentially of multiple supersteps) all the vertices that have a particular edge. He either scans all the vertices and lets them decide who has these edges, or he creates a supernode that connects to all the vertices that have such an outgoing edge (at loading time? beforehand, with MapReduce?). In a homogeneous dataset like Facebook's, where every vertex has more or less the same set of edge labels (friends, comments, likes, etc.), these indices would be overkill and unnecessary, but with datasets that are heterogeneous with respect to schema, like DBpedia, this question may still be relevant.

To answer Sujan's question: no, we don't have indices on edges. You may want to create specific vertices to do that (but be aware of their degree!), or store some of these indices in a WorkerContext (or even aggregators). It really depends on the size of these indices. Hope this helps, Claudio

On Thu, May 22, 2014 at 2:57 AM, Pavan Kumar A pava...@outlook.com wrote: The state of the triplet A - C - B can be stored in the edge value for C (the edge from A to B). I would like to remind you that Giraph is a batch processing framework, and not a graph database. You can do complex graph processing on the input graph, and such questions can be answered very trivially, but performance need not be great. You must write Java code and run a map-reduce job. For this case your compute function consists of just one superstep, which filters edges for a vertex based on the criterion, and then you can write the output back to one of the supported storage formats.

Date: Wed, 21 May 2014 16:32:44 -0700 From: sujanu...@yahoo.com Subject: Re: n-ary relationship on Giraph To: user@giraph.apache.org

Let's say I have nodes A and B, linked with edge C. Now I have properties which belong to this A - C - B triplet. For example, I have the property 'date created'; 'date created' belongs to A - C - B. Can I represent this in Giraph? Also, does Giraph have a querying mechanism, so that I can retrieve triplets which were created before a particular date? Sujan Perera

On Wednesday, May 21, 2014 3:51 PM, Pavan Kumar A pava...@outlook.com wrote: Can you please provide more context? vertex - edge (the edge value can store any properties required of that edge) - vertex (the vertex value can store any property required for the vertex)

Date: Wed, 21 May 2014 13:50:34 -0700 From: sujanu...@yahoo.com Subject: n-ary relationship on Giraph To: user@giraph.apache.org

Hi, does Giraph support n-ary relationships? I need to store some properties of a triplet (vertex - edge - vertex) and be able to query with those properties. Sujan Perera

-- Claudio Martella
Re: OrientDB Rexster Apache Giraph combination
You can have a look at the giraph-rexster subproject within the Giraph codebase. There is also some documentation on our site.

On Thu, May 22, 2014 at 3:25 PM, Arun Km arunkm@gmail.com wrote:

Hello, I would like to know your opinion on the OrientDB -- Rexster -- Apache Giraph combination. I'm mostly interested in the *input/output formats* to be used between these three.

BR
Arun

--
Claudio Martella
Re: Superstep duration increases
I'd start by taking HBase out of the equation.

On Thu, May 8, 2014 at 1:46 PM, Pascal Jäger pas...@pascaljaeger.de wrote:

Hi all, I have implemented a label propagation algorithm to find clusters in a graph. I just realized that the time the algorithm takes for one superstep is increasing, and I don't know why. The graph is static and the number of messages is the same throughout all supersteps. During every superstep each node sends its label to its neighbors, which then calculate their label based on the received messages and then again send their label. At the end of each superstep each node writes a nodeID - label pair to an HBase table. Do you have any general hints on where I can look? I absolutely have no clue where to start. Thanks for your help!

Regards
Pascal

--
Claudio Martella
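For context, the label-update step Pascal describes (each vertex adopts the most frequent label among its neighbors' messages) boils down to a frequency count. A minimal sketch, with Giraph's vertex and message types omitted; ties here go to the first label that reaches the winning count, which is just one of several reasonable policies:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of a label propagation update: return the most
// frequent label among the received neighbor labels.
public class LabelUpdate {
    public static long mostFrequent(long[] neighborLabels) {
        Map<Long, Integer> counts = new HashMap<>();
        long best = neighborLabels[0];
        int bestCount = 0;
        for (long label : neighborLabels) {
            int c = counts.merge(label, 1, Integer::sum);
            if (c > bestCount) {  // strictly greater: first label wins ties
                bestCount = c;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(mostFrequent(new long[] {7, 3, 7, 5})); // 7
    }
}
```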
Re: MessageCombiner
You can check the combiner used by the shortest paths algorithm; it has the inverted semantics of yours, as it keeps the minimum value.

On Mon, May 12, 2014 at 8:03 AM, nishant gandhi nishantgandh...@gmail.com wrote:

Let's say I have 5 nodes which are sending messages to a 6th node, each of the 5 nodes sending one message to the 6th node containing some value. I want to intercept all those 5 messages going towards the 6th node, find the maximum value contained among them, and send a single message from the combiner to the 6th node containing only the maximum value. I have tried to write simple code for it but it seems not to be working, and I don't know what I am doing wrong. In the code below, node 0 tries to send 3 messages to node 1. My final goal is that in superstep 1, node 1 should receive only one message, containing only 4. My current code receives all 3 messages, and hence I get a final value of 7 for the variable test.

public class CombinerTest extends BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> arg0,
      Iterable<DoubleWritable> arg1) throws IOException {
    if (getSuperstep() == 0 && arg0.getId().get() == 0) {
      sendMessage(new LongWritable(1), new DoubleWritable(1));
      sendMessage(new LongWritable(1), new DoubleWritable(2));
      sendMessage(new LongWritable(1), new DoubleWritable(4));
    }
    DoubleWritable test = new DoubleWritable(0);
    if (getSuperstep() == 1) {
      for (DoubleWritable message : arg1) {
        test.set(test.get() + message.get());
      }
      arg0.setValue(test);
    }
    arg0.voteToHalt();
  }

  public void combine(LongWritable vertexIndex, DoubleWritable originalMessage,
      DoubleWritable messageToCombine) {
    if (originalMessage.get() < messageToCombine.get()) {
      originalMessage.set(messageToCombine.get());
    }
  }

  public DoubleWritable createInitialMessage() {
    return new DoubleWritable(Double.MAX_VALUE);
  }
}

Please help me to figure out the correct way to write this code.

Thanks Maria.

Nishant Gandhi
M.Tech. CSE IIT Patna

On Mon, May 12, 2014 at 11:11 AM, Maria Stylianou mars...@gmail.com wrote:

Hi Nishant, can you be more specific? Are you trying to combine all incoming messages of a vertex into one message? What do you mean by combine? Add values? Or append to a list? The message can be a list, so you can put all values together.

Maria

On Sunday, May 11, 2014, nishant gandhi nishantgandh...@gmail.com wrote:

Hi, I am trying to write code that uses a Combiner. I want to combine all messages into one for each vertex. That one message should contain the value bigger than all the other message values. Please help.

Nishant Gandhi
M.Tech. CSE IIT Patna

--
Sent from Android Mobile

--
Claudio Martella
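For reference, the max-combining logic Nishant is after can be sketched stand-alone. In real Giraph it would live in a separate MessageCombiner class registered with the job, not as an unregistered method on the computation class, and note that the identity element for a maximum must be negative infinity, not Double.MAX_VALUE; the Giraph types are dropped here so the logic itself can be checked:

```java
// Stand-alone sketch of a max message combiner. In Giraph this logic
// would implement MessageCombiner and be registered with the job;
// here it is plain Java for clarity.
public class MaxCombiner {
    // Fold one incoming message into the running maximum.
    public static double combine(double current, double incoming) {
        return Math.max(current, incoming);
    }

    // Neutral initial message for a max-combiner: -infinity, so that
    // any real message replaces it (Double.MAX_VALUE would be wrong here).
    public static double initialMessage() {
        return Double.NEGATIVE_INFINITY;
    }

    public static void main(String[] args) {
        double acc = initialMessage();
        for (double msg : new double[] {1, 2, 4}) {
            acc = combine(acc, msg);
        }
        // The target vertex would see a single message carrying 4.0.
        System.out.println(acc); // 4.0
    }
}
```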
Re: Blogpost: Large-scale graph partitioning with Apache Giraph
Very interesting. We recently wrote an article about a very similar technique: http://arxiv.org/pdf/1404.3861v1.pdf and we also evaluated it on 1B vertices. It would be nice to test it on your graph.

On Tue, Apr 22, 2014 at 8:24 PM, Avery Ching ach...@apache.org wrote:

Hi Giraphers, recently a few internal Giraph users at Facebook published a really cool blog post on how we partition huge graphs (1.15 billion people and 150 billion friendships - 300B directed edges): https://code.facebook.com/posts/274771932683700/large-scale-graph-partitioning-with-apache-giraph/

Avery

--
Claudio Martella
Re: Using out of core messages
Answers are inline.

On Thu, Apr 24, 2014 at 4:21 PM, Pascal Jäger pas...@pascaljaeger.de wrote:

Hi all, I am struggling with the settings to use out-of-core messages. I have 3 nodes with 16 GB RAM each (one master, two workers). I ran into a Java heap space OOM error. First question: where do I set the mapred.child.java-opts options? Do I need to add them via the "-ca mapred.child..." option or by using "-Dmapred.child..."? I tried both, but nothing seems to work out. I run it on a Cloudera cluster, and when looking in the web frontend I see that it only uses 3 GB of my 16 GB RAM. Are those even the right options?

You can use both, but the correct parameter name is mapred.child.java.opts.

giraph.maxMessagesInMemory - is it per worker? Or what exactly is counted here? And how does it correlate to giraph.messagesBufferSize?

It is per worker, and it tells the maximum number of messages each worker should keep in main memory. The messagesBufferSize defines the buffer used to read and write messages to disk, and you can probably keep the current value.

I am really lost right now. My graph currently has only 8000 nodes and 7 edges. During one step I need to send more than 15,000,000 messages, and this is when I get the OOM error. I turned on the out-of-core messages feature without changing the above-mentioned options and my computation really slowed down, I guess because it was writing 14,000,000 messages to disk.

Each worker is currently keeping 1M messages in memory (if you have activated out-of-core messages but have not played with maxMessagesInMemory). In your case, that's something around 1/8 of the messages a worker receives. Once you're able to increase the heap and use all of your 16 GB of RAM on your workers, you should be able to increase that parameter, depending on the message size.

Hope you can help me.

Regards
Pascal

Hope this helps.

--
Claudio Martella
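Putting the options discussed above together, a launch command might look like the following sketch. The application class names and HDFS paths are placeholders, and the exact set of options can vary with the Giraph version:

```shell
# Hypothetical GiraphRunner invocation: worker heap raised via
# mapred.child.java.opts, out-of-core messaging enabled, and the
# in-memory message cap (giraph.maxMessagesInMemory) increased.
# Class names and paths are placeholders.
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
  -Dmapred.child.java.opts="-Xmx14g" \
  -Dgiraph.useOutOfCoreMessages=true \
  -Dgiraph.maxMessagesInMemory=4000000 \
  my.app.MyComputation \
  -vif my.app.MyVertexInputFormat -vip /user/me/input \
  -vof my.app.MyVertexOutputFormat -op /user/me/output \
  -w 2
```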
Re: Starting a second computation
Hi, there is currently no way to re-activate the vertices from the master. One thing you could do is use an aggregator instead of actually voting to halt. For example, with a sum aggregator VERTEX_FINISHED, where vertices add 1 when they would vote to halt, you can see from the master when all the vertices are finished, and then switch to a new computation.

On Sat, Apr 19, 2014 at 1:36 AM, Schweiger, Tom thschwei...@ebay.com wrote:

Hello Giraph list, I have a problem that has two steps. Step 2 needs to start after step 1 completes. Step 1 is completed when all the vertices have voted to halt and there are no more messages. I know I can switch my computes using a MasterCompute, but it is unclear how I re-awaken all the vertices. Has anyone else solved a problem like this? If so, how did you do it? Is there an easier way to do this? Basically I'm thinking this:

class TwoStep {

  class TwoStepMaster extends DefaultMasterCompute {
    public final void compute() {
      // switch from StepOne to StepTwo if StepOne is done
      if (this.isHalted && this.getComputation().equals(StepOne.class)) {
        setComputation(StepTwo.class);
        // send a message to all vertices???
        // unhalt somehow??
        // suggestions anyone??
      }
    }
  }

  class StepOne extends BasicComputation {
    public void compute(...) {
      // do step one stuff
      vertex.voteToHalt();
    }
  }

  class StepTwo extends BasicComputation {
    public void compute(...) {
      // do step two stuff
      vertex.voteToHalt();
    }
  }
}

--
Claudio Martella
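The master-side check Claudio describes reduces to comparing the aggregated VERTEX_FINISHED count against the total vertex count. A minimal stand-in, with Giraph's MasterCompute API omitted and all names illustrative:

```java
// Sketch of the phase-switch decision: vertices add 1 to a sum
// aggregator instead of voting to halt, and the master switches
// computations once every vertex has checked in. Names are illustrative,
// not Giraph API.
public class PhaseSwitch {
    // True when all vertices reported themselves finished this superstep.
    public static boolean allFinished(long finishedCount, long totalVertices) {
        return finishedCount >= totalVertices;
    }

    public static void main(String[] args) {
        // Superstep n: 3 of 5 vertices finished -> keep running StepOne.
        System.out.println(allFinished(3, 5)); // false
        // Superstep n+1: all 5 finished -> the master would call
        // setComputation(StepTwo.class) here.
        System.out.println(allFinished(5, 5)); // true
    }
}
```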
Re: Changing index of a graph
The only solution I know is usually done via a so-called dictionary outside of Giraph (e.g. for semantic web graphs, which also have URIs as IDs), through a datastore like HBase/Cassandra: basically the hashmap you mentioned. While initially computationally expensive, it allows you to scale in the long run, because adding an edge is just incrementing a counter in the store and adding the mapping.

On Tue, Apr 15, 2014 at 3:33 PM, Martin Neumann mneum...@spotify.com wrote:

Hej, I have a huge edge list (several billion edges) where node IDs are URLs. The algorithm I want to run needs the IDs to be longs, and there should be no holes in the ID space (so I can't simply hash the URLs). Is anyone aware of a simple solution that does not require an impractically huge hash map? My current idea is to load the graph into another Giraph job and then assign a number to each node. This way the mapping of number to URL would be stored in the node. The problem is that I have to assign the numbers sequentially to ensure there are no holes and the numbers are unique. No idea if this is even possible in Giraph. Any input is welcome.

cheers
Martin

--
Claudio Martella
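The dictionary approach Claudio mentions reduces to a counter plus a map from URL to the next free long. A minimal in-process sketch; in practice the map would live in a datastore like HBase/Cassandra rather than a Java HashMap, which is exactly why it scales where an in-memory map would not:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the external "dictionary": each URL gets the next free long
// the first time it is seen, producing dense IDs with no holes.
public class UrlDictionary {
    private final Map<String, Long> ids = new HashMap<>();
    private long next = 0;

    // Existing ID for the URL, or the next free one if unseen.
    public long idFor(String url) {
        return ids.computeIfAbsent(url, u -> next++);
    }

    public static void main(String[] args) {
        UrlDictionary d = new UrlDictionary();
        System.out.println(d.idFor("http://a.example")); // 0
        System.out.println(d.idFor("http://b.example")); // 1
        System.out.println(d.idFor("http://a.example")); // 0 again: stable, no holes
    }
}
```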
Powered-by Giraph page
Hello giraphers, as Giraph is getting more visibility and users, I think it would be nice to add a Powered-by page to our site, where we collect the names of companies that are using Giraph (and want to share that). So this is basically a small survey about who is using Giraph. For those that I know: - Facebook. Anybody else? Thanks! Claudio -- Claudio Martella
Re: How to set giraph runtime parameters?
You can set them on the command line, by using -D (e.g. -Dgiraph.isStaticGraph=true) after GiraphRunner.

On Wed, Apr 9, 2014 at 4:44 PM, Suijian Zhou suijian.z...@gmail.com wrote:

Hi, does anybody know how to set runtime parameters in Giraph? Should they be set on the command line or in a *.xml file? I tried -Dgiraph.zkSessionMsecTimeout=90 (googled) on the command line but it failed. Thanks!

Best Regards,
Suijian

--
Claudio Martella
Re: How to set more zooKeeper nodes in giraph.
You don't need to recompile it; you can set it at runtime by setting giraph.zkServerCount accordingly. But are you trying to get Giraph to start multiple instances of ZooKeeper on multiple nodes, or do you want Giraph to use several of your existing ZooKeepers?

On Tue, Apr 8, 2014 at 11:17 PM, Suijian Zhou suijian.z...@gmail.com wrote:

Hi, does anybody know how to set more ZooKeeper nodes in Giraph? I tried to modify ZOOKEEPER_SERVER_COUNT in the file giraph-core/target/munged/main/org/apache/giraph/conf/GiraphConstants.java, but recompiling Giraph shows no effect at all (Giraph seems to always use 1 ZooKeeper node?), and when it fails (e.g. due to a timeout), the client cannot connect, and finally the Giraph job fails too. It's also strange that although I see "negotiated timeout = 60", which means the session is supposed to run for 10 minutes, the job failed to connect to it after only ~1 minute:

14/04/08 15:58:18 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=compute-0-23.local:22181 sessionTimeout=6 watcher=org.apache.giraph.job.JobProgressTracker@69b28a51
14/04/08 15:58:18 INFO mapred.JobClient: Running job: job_201404081444_0011
14/04/08 15:58:18 INFO zookeeper.ClientCnxn: Opening socket connection to server compute-0-23.local/10.1.255.231:22181. Will not attempt to authenticate using SASL (unknown error)
14/04/08 15:58:18 INFO zookeeper.ClientCnxn: Socket connection established to compute-0-23.local/10.1.255.231:22181, initiating session
14/04/08 15:58:18 INFO zookeeper.ClientCnxn: Session establishment complete on server compute-0-23.local/10.1.255.231:22181, sessionid = 0x14543222b640009, negotiated timeout = 60
14/04/08 15:59:48 INFO job.JobProgressTracker: Data from 8 workers - Compute superstep 2: 0 out of 4847571 vertices computed; 0 out of 64 partitions computed; min free memory on worker 8 - 152.48MB, average 217.58MB
14/04/08 15:59:51 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x14543222b640009, likely server has closed socket, closing socket connection and attempting reconnect
14/04/08 15:59:52 INFO zookeeper.ClientCnxn: Opening socket connection to server compute-0-23.local/10.1.255.231:22181. Will not attempt to authenticate using SASL (unknown error)
14/04/08 15:59:52 WARN zookeeper.ClientCnxn: Session 0x14543222b640009 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused

Best Regards,
Suijian

--
Claudio Martella
Re: Information
It looks like you're expecting to use Giraph in an online fashion, such as you would use a database, to answer queries within milliseconds or seconds. Giraph is an offline batch processing system.

On Wed, Mar 26, 2014 at 11:11 AM, Angelo Immediata angelo...@gmail.com wrote:

Hi there. In my project I have to implement a routing system with good performance; at the beginning this system should be able to give route information only for one Italian region (Lombardia), but it could be used for the whole of Italy (or the world). Let's stick to Lombardia for now. By reading OSM files I can create my own graph in the best format I can use; then I need to use Dijkstra (or any other algorithm) in order to propose to the user K possible paths from point A to point B (K because I need to show the customer the alternatives as well). I can't use the Contraction Hierarchies algorithm because I need to take care of external events that can modify the weights of my built graph; this implies that I would have to build the contracted graph again each time, which can be a very onerous operation. From my experiments, I saw that by reading the Lombardia OSM file I would create a graph with around 1 million vertices and 6 million edges, and I was thinking of using Giraph to solve my problem (I saw the link http://giraph.apache.org/intro.html where you talk about the shortest paths problem). I have a couple of questions for you Giraph/Hadoop gurus:

- does it make sense to use Giraph for my scenario?
- must I respect some graph format to pass to the Giraph algorithm in order to get K shortest paths from point A to point B? If so, which format should I respect?
- what would the performance be using Giraph? I know that Dijkstra's problem is that it is slow; by using Giraph, will I be able to improve its performance on very large graphs?

I know these can seem very basic questions, but I'm pretty new to Giraph and I'm trying to understand it. Thank you.

Angelo

--
Claudio Martella
Re: Information
Nope, you can think of Giraph as MapReduce for graphs. Probably neo4j is the way to go for you.

On Wed, Mar 26, 2014 at 3:18 PM, Angelo Immediata angelo...@gmail.com wrote:

Hi Sebastian, OK, I got it. I was thinking I could use it for an online scenario. Thank you.

Angelo

2014-03-26 14:52 GMT+01:00 Sebastian Schelter s...@apache.org:

Hi Angelo, it very much depends on your use case. Do you want to precompute paths offline in batch, or are you looking for a system that answers online? Giraph has been built for the first scenario.

--sebastian

On 03/26/2014 02:48 PM, Angelo Immediata wrote:

Hi Claudio, so, if I understood correctly, it makes no sense to use Giraph for shortest path calculation in my scenario. Am I right?

--
Claudio Martella
Re: Is it possible to know the mapper task a particular vertex is assigned to?
By default, vertices stay where they are loaded.

On Thu, Mar 6, 2014 at 7:31 AM, Pankaj Malhotra pankajiit...@gmail.com wrote:

There is a vertex with a large outgoing edge list. I wanted to compare the memory usage, number of messages, and a few other statistics for the worker with this vertex against the average statistics across workers. Does the mapping change within the same job?

Thanks, Pankaj

On 6 March 2014 11:38, Roman Shaposhnik shaposh...@gmail.com wrote:

On Wed, Mar 5, 2014 at 9:53 PM, Pankaj Malhotra pankajiit...@gmail.com wrote:

Hi, how can I find the mapper task a particular vertex is assigned to? I can do this with a sysout and then looking at the logs, but there must be a smarter way. Please suggest.

That mapping is not static and can change. In theory you can rely on the info in ZK, but that would be relying on what is, essentially, an implementation detail of Giraph. What's the reason for you to need this info?

Thanks, Roman.

--
Claudio Martella
Re: Giraph program stucks.
Did you actually increase the heap?

On Thu, Mar 6, 2014 at 11:43 PM, Suijian Zhou suijian.z...@gmail.com wrote:

Hi, I tried to process only 2 of the input files, i.e. 2 GB + 2 GB of input, and the program finished successfully in 6 minutes. But as I have 39 nodes, shouldn't they be enough to load and process the 8*2GB = 16 GB graph? Can somebody give some hints? (Do all the nodes participate in loading the graph from HDFS, or does only the master node load the graph?) Thanks!

Best Regards,
Suijian

2014-03-06 16:24 GMT-06:00 Suijian Zhou suijian.z...@gmail.com:

Hi, experts, I'm trying to process a graph with PageRank in Giraph, but the program always gets stuck. There are 8 input files, each of size ~2 GB, all copied onto HDFS. I use 39 nodes, and each node has 16 GB of memory and 8 cores. It keeps printing the same info (as follows) on the screen after 2 hours; there seems to be no progress at all. What are the possible reasons? Small test example files run without problems. Thanks!

14/03/06 16:17:42 INFO job.JobProgressTracker: Data from 39 workers - Compute superstep 0: 5854829 out of 4920 vertices computed; 181 out of 1521 partitions computed
14/03/06 16:17:47 INFO job.JobProgressTracker: Data from 39 workers - Compute superstep 0: 5854829 out of 4920 vertices computed; 181 out of 1521 partitions computed

Best Regards,
Suijian

--
Claudio Martella
Re: To process a BIG input graph in giraph.
-vip /user/hadoop/input should be enough.

On Wed, Mar 5, 2014 at 5:31 PM, Suijian Zhou suijian.z...@gmail.com wrote:

Hi, experts, could anybody remind me how to load multiple input files on a Giraph command line? The following do not work; they only load the first input file:

-vip /user/hadoop/input/ttt.txt /user/hadoop/input/ttt2.txt

or

-vip /user/hadoop/input/ttt.txt -vip /user/hadoop/input/ttt2.txt

Best Regards,
Suijian

2014-03-01 16:12 GMT-06:00 Suijian Zhou suijian.z...@gmail.com:

Hi, here I'm trying to process a very big input file through Giraph, ~70 GB. I'm running the Giraph program on a 40-node Linux cluster, but the program just gets stuck after it reads in a small fraction of the input file. Although each node has 16 GB of memory, it looks like only one node reads the input file from HDFS (into its memory). As the input file is so big, is there a way to scatter the input file across all the nodes, so that each node reads in a fraction of the file and then starts processing the graph? Would it help to split the single big input file into many smaller files and let each node read in one of them to process (keeping the overall structure of the graph, of course)? Thanks!

Best Regards,
Suijian

--
Claudio Martella
Re: Giraph talks at Hadoop Summit
Btw guys, my talk at the Hadoop Summit in Amsterdam this April was accepted, so we'll have another one there.

On Friday, February 28, 2014, Avery Ching ach...@apache.org wrote:

That's great Roman! I certainly hope it gets accepted. We also have a submission. Hopefully there will be at least one Giraph talk at the Hadoop Summit. https://hadoopsummit.uservoice.com/forums/242790-committer-track/suggestions/5568083-dynamic-graph-iterative-computation-on-apache-gira

Avery

On 2/27/14, 2:19 PM, Roman Shaposhnik wrote:

Hi! Not sure if anybody from the Giraph community submitted any talks to Hadoop Summit, but here's the one I submitted: https://hadoopsummit.uservoice.com/forums/242790-committer-track/suggestions/5568061-apache-giraph-start-analyzing-graph-relationships Feel free to upvote if you feel like Giraph deserves to be well represented at Hadoop Summit.

Thanks, Roman.

--
Claudio Martella
Re: Giraph avro input format
I'm not sure about what I'm going to say, but Gora should read from Avro, and we do support reading transparently through Gora. You could check that out.

On Mon, Feb 17, 2014 at 1:32 PM, Martin Neumann mneum...@spotify.com wrote:

Hej, is there an Avro input format for Giraph? I saw some older (July 2013) entries on the mailing list, and none existed by then. Have things changed since then, or do I have to write my own? If I write my own, what's a good base class to start from?

cheers
Martin

--
Claudio Martella
Re: Basic questions about Giraph internals
Yes, Giraph hijacks mapper tasks and then does everything else on its own.

On Fri, Feb 7, 2014 at 12:39 PM, Alexander Frolov alexndr.fro...@gmail.com wrote:

On Fri, Feb 7, 2014 at 2:30 PM, Claudio Martella claudio.marte...@gmail.com wrote:

On Fri, Feb 7, 2014 at 9:44 AM, Alexander Frolov alexndr.fro...@gmail.com wrote:

Thank you, I will try to do this. As I understood, I should set the number of threads manually through the Giraph API. BTW, what is the conceptual difference between running multiple workers on a TaskTracker and running a single worker with multiple threads? In terms of vertex fetching, memory sharing, etc.

Basically, better usage of resources: one single JVM, no duplication of core data structures, fewer netty threads and communication points, more locality (fewer messages over the network), fewer actors accessing ZooKeeper, etc.

Also I would like to ask how message transfer between vertices is implemented in terms of Hadoop primitives. A source code reference will be enough.

Communication does not happen via Hadoop primitives, but ad hoc via netty.

OK. It seems that Hadoop has minimal influence on the execution of a Giraph application after the graph is loaded into memory (that is, after the mapping is done).

--
Claudio Martella
Re: Basic questions about Giraph internals
Hi Alex, answers are inline.

On Thu, Feb 6, 2014 at 11:22 AM, Alexander Frolov alexndr.fro...@gmail.com wrote:

Hi, folks! I have started a small study of the Giraph framework, and I don't have much experience with Giraph and Hadoop :-(. I would like to ask several questions about how things work in Giraph which are not straightforward to me. I am trying to use the sources, but sometimes it is not too easy ;-) So here they are:

1) How are Workers assigned to TaskTrackers?

Each worker is a mapper, and mapper tasks are assigned to TaskTrackers by the JobTracker. There's no control by Giraph there, and because Giraph doesn't need data locality like MapReduce does, basically nothing is done.

2) How are vertices assigned to Workers? Does it depend on the distribution of the input file across DataNodes? Is there any choice of distribution policy?

In the default scheme, vertices are assigned through modulo hash partitioning. Given k workers, vertex v is assigned to worker i according to hash(v) % k = i.

3) How are Workers and map tasks related to each other? (1:1)? (n:1)? (1:n)?

It's 1:1. Each worker is implemented by a mapper task. The master is usually (but does not need to be) implemented by an additional mapper.

4) Can Workers migrate from one TaskTracker to another?

Workers do not migrate. A Giraph computation is not dynamic with respect to the assignment and size of the tasks.

5) What is the best way to monitor Giraph app execution (progress, worker assignment, load balancing, etc.)?

Just like you would for a standard MapReduce job: go to the job page on the JobTracker http page.

I think this is all for the moment. Thank you.

Testbed description:
Hardware: 8-node dual-CPU cluster with IB FDR.
Giraph: release-1.0.0-RC2-152-g585511f
Hadoop: hadoop-0.20.203.0, hadoop-rdma-0.9.8

Best, Alex

--
Claudio Martella
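The modulo hash partitioning rule from answer 2 can be illustrated stand-alone. Giraph's actual partitioner classes are not reproduced here; this just shows hash(v) % k for long vertex IDs:

```java
// Sketch of the default assignment rule: vertex v goes to worker
// hash(v) % k. floorMod keeps the result in [0, k) even for negative hashes.
public class HashPartition {
    public static int workerFor(long vertexId, int k) {
        return Math.floorMod(Long.hashCode(vertexId), k);
    }

    public static void main(String[] args) {
        int k = 4;
        for (long v = 0; v < 8; v++) {
            System.out.println("vertex " + v + " -> worker " + workerFor(v, k));
        }
    }
}
```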
Re: Basic questions about Giraph internals
On Thu, Feb 6, 2014 at 11:56 AM, Alexander Frolov alexndr.fro...@gmail.com wrote:

Hi Claudio, thank you. If I understood correctly, a mapper and a mapper task are the same thing.

More or less. A mapper is a functional element of the programming model, while the mapper task is the task that executes the mapper function on the records.

That is, each Worker is created at the beginning of a superstep and then dies, and in the next superstep all Workers are created again. Is that correct?

Nope. The workers are created at the beginning of the computation and destroyed at the end of the computation. A worker is persistent throughout the computation.

This is important for me. So a Giraph Worker (a.k.a. Hadoop mapper) fetches the vertices with the corresponding indices from HDFS and performs the computation. What does it do next with them? As I understood, Giraph is a fully in-memory framework, and in the next superstep each vertex should be fetched from memory by the same Worker. Where are the vertices stored between supersteps? In HDFS or in memory?

As I said, the workers are persistent (in-memory) between supersteps, so they keep everything in memory.

--
Claudio Martella
Re: Basic questions about Giraph internals
On Thu, Feb 6, 2014 at 12:15 PM, Alexander Frolov alexndr.fro...@gmail.comwrote: On Thu, Feb 6, 2014 at 3:00 PM, Claudio Martella claudio.marte...@gmail.com wrote: On Thu, Feb 6, 2014 at 11:56 AM, Alexander Frolov alexndr.fro...@gmail.com wrote: Hi Claudio, thank you. If I understood correctly, mapper and mapper task is the same thing. More or less. A mapper is a functional element of the programming model, while the mapper task is the task that executes the mapper function on the records. Ok, I see. Then mapred.tasktracker.map.tasks.maximum is a maximum number of Workers [or Workers + Master] which will be created at the same node. That is if I have 8 node cluster with mapred.tasktracker.map.tasks.maximum=4, then I can run up to 31 Workers + 1 Master. Is it correct? That is correct. However, if you have total control over your cluster, you may want to run one worker per node (hence setting the max number of map tasks per machine to 1), and use multiple threads (input, compute, output). This is going to make better use of resources. On Thu, Feb 6, 2014 at 2:28 PM, Claudio Martella claudio.marte...@gmail.com wrote: Hi Alex, answers are inline. On Thu, Feb 6, 2014 at 11:22 AM, Alexander Frolov alexndr.fro...@gmail.com wrote: Hi, folks! I have started small research of Giraph framework and I have not much experience with Giraph and Hadoop :-(. I would like to ask several questions about how things are working in Giraph which are not straightforward for me. I am trying to use the sources but sometimes it is not too easy ;-) So here they are: 1) How Workers are assigned to TaskTrackers? Each worker is a mapper, and mapper tasks are assigned to tasktrackers by the jobtracker. That is each Worker is created at the beginning of superstep and then dies. In the next superstep all Workers are created again. Is it correct? Nope. The workers are created at the beginning of the computation, and destroyed at the end of the computation. 
A computation is persistent throughout the computation. There's no control by Giraph there, and because Giraph doesn't need data-locality like Mapreduce does, basically nothing is done. This is important for me. So Giraph Worker (a.k.a Hadoop mapper) fetches vertex with corresponding index from the HDFS and perform computation. What does it do next with it? As I understood Giraph is fully in-memory framework and in the next superstep this vertex should be fetched from the memory by the same Worker. Where the vertices are stored between supersteps? In HDFS or in memory? As I said, the workers are persistent (in-memory) between supersteps, so they keep everything in memory. Ok. Is there any means to see assignment of Workers to TaskTrackers during or after the computation? The jobtracker http interface will show you the mapper running, hence i'd check there And is there any means to see assignment of vertices to Workers (as distribution function, histogram etc.)? You can check the worker logs, I think the information should be there. 2) How vertices are assigned to Workers? Does it depend on distribution of input file on DataNodes? Is there available any choice of distribution politics or no? In the default scheme, vertices are assigned through modulo hash partitioning. Given k workers, vertex v is assigned to worker i according to hash(v) % k = i. 3) How Workers and Map tasks are related to each other? (1:1)? (n:1)? (1:n)? It's 1:1. Each worker is implemented by a mapper task. The master is usually (but does not need to) implemented by an additional mapper . 4) Can Workers migrate from one TaskTracker to the other? Workers does not migrate. A Giraph computation is not dynamic wrt to assignment and size of the tasks. 5) What is the best way to monitor Giraph app execution (progress, worker assignment, load balancing etc.)? Just like you would for a standard Mapreduce job. Go to the job page on the jobtracker http page. I think this is all for the moment. Thank you. 
Testbed description: Hardware: 8-node dual-CPU cluster with IB FDR. Giraph: release-1.0.0-RC2-152-g585511f Hadoop: hadoop-0.20.203.0, hadoop-rdma-0.9.8 Best, Alex -- Claudio Martella
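Claudio's answer to question 2 (modulo hash partitioning) can be sketched in plain Java. This is a toy illustration of the scheme, not Giraph's actual partitioner code:

```java
import java.util.Arrays;

// Toy sketch of Giraph's default scheme: vertex v goes to worker hash(v) % k.
public class HashPartitionDemo {
    static int workerFor(long vertexId, int numWorkers) {
        // Math.abs guards against negative hash codes.
        return Math.abs(Long.hashCode(vertexId)) % numWorkers;
    }

    public static void main(String[] args) {
        int k = 4; // number of workers
        int[] counts = new int[k];
        for (long v = 0; v < 8; v++) {
            int w = workerFor(v, k);
            counts[w]++;
            System.out.println("vertex " + v + " -> worker " + w);
        }
        // Hash partitioning spreads vertices roughly evenly across workers.
        System.out.println("per-worker counts: " + Arrays.toString(counts));
    }
}
```

In real deployments the assignment is done by the configured partitioner factory; the point here is only that the mapping is a pure function of the vertex id and the worker count.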
Re: Basic questions about Giraph internals
On Thu, Feb 6, 2014 at 3:04 PM, Alexander Frolov alexndr.fro...@gmail.com wrote: Claudio, thank you very much for your help. On Thu, Feb 6, 2014 at 4:06 PM, Claudio Martella claudio.marte...@gmail.com wrote: [...] Should I explicitly force Giraph to use multiple threads for input, compute, output? Only three threads, I suppose? But I have 12 cores available in each node (24 if HT is enabled). You're right, I was not clear. I suggest you use N threads for each of those three classes, where N is something close to the number of processing units (e.g. cores) you have available on each machine. Consider that Giraph has a number of other threads running in the background, for example to handle communication etc. I suggest you try different setups through benchmarking.
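The one-worker-per-node setup with in-worker parallelism that Claudio suggests can be sketched as a configuration fragment. Option names are as in Giraph 1.x and Hadoop 1.x; verify them against your versions:

```
# One map slot (hence one Giraph worker) per tasktracker, in mapred-site.xml:
mapred.tasktracker.map.tasks.maximum = 1

# Then parallelize inside each worker (N ~ cores per node, e.g. 12):
-Dgiraph.numInputThreads=12
-Dgiraph.numComputeThreads=12
-Dgiraph.numOutputThreads=12

# With 8 nodes: 7 workers (-w 7) plus one map task for the master.
```

As the thread notes, Giraph also runs background threads (e.g. for Netty communication), so the best N is usually found by benchmarking rather than set to the exact core count.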
Re: Giraph installation without internet connection
you should not have problems if you build the jar with dependencies elsewhere and then deploy it to your cluster. On Tue, Feb 4, 2014 at 2:40 PM, Alexander Frolov alexndr.fro...@gmail.com wrote: Hi, Is it possible to build Giraph without Internet access? The target cluster has no internet connection. Best, Alex -- Claudio Martella
Re: constraint about no of supersteps
looks like one of your workers died. If you expect such a long job, I'd suggest you turn checkpointing on. On Wed, Jan 29, 2014 at 5:30 PM, Jyoti Yadav rao.jyoti26ya...@gmail.com wrote: Thanks all for your reply.. Actually I am working with an algorithm in which the single source shortest path algorithm runs for thousands of vertices. Suppose on average this algo takes 5-6 supersteps per vertex; then for thousands of vertices the superstep count is extremely large. In that case the following error is thrown at run time... ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=kanha-Vostro-1014, MRtaskID=1, port=30001) on superstep 19528 2014-01-28 05:11:36,852 INFO org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 19528 took 636.831 seconds ended with state WORKER_FAILURE and is now on superstep 19528 2014-01-28 05:11:39,446 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with ArrayIndexOutOfBoundsException java.lang.ArrayIndexOutOfBoundsException: -1 Any ideas?? Thanks Jyoti On Wed, Jan 29, 2014 at 8:55 PM, Peter Grman peter.gr...@gmail.com wrote: Yes, but you can disable the counters per superstep if you don't need the data, and then I had around 2000 supersteps after which my algorithm stopped. Cheers Peter On Jan 29, 2014 4:22 PM, Claudio Martella claudio.marte...@gmail.com wrote: the limit is currently defined by the maximum number of counters your jobtracker allows. Hence, by default the max number of supersteps is around 90. check http://giraph.apache.org/faq.html to see how to increase it. On Wed, Jan 29, 2014 at 4:12 PM, Jyoti Yadav rao.jyoti26ya...@gmail.com wrote: Hi folks.. Is there any limit on the maximum number of supersteps while running a giraph job?? Thanks Jyoti -- Claudio Martella
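Turning checkpointing on (and disabling per-superstep counters), as suggested in this thread, can be sketched as a configuration fragment. Option names are as in Giraph 1.x; verify them against your version:

```
# Disable per-superstep counters so long jobs don't hit the
# jobtracker's counter limit (~90 supersteps by default):
-Dgiraph.useSuperstepCounters=false

# Checkpoint periodically so a worker failure doesn't lose the job
# (0, the default, disables checkpointing):
-Dgiraph.checkpointFrequency=100
-Dgiraph.checkpointDirectory=_bsp/_checkpoints/
```

With checkpointing enabled, a failed job can be restarted from the last checkpointed superstep instead of from scratch.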
Re: out of core option
${hadoop.tmp.dir}/mapred/staging
mapred.queue.names = default
dfs.access.time.precision = 360
fs.hsftp.impl = org.apache.hadoop.hdfs.HsftpFileSystem
mapred.task.tracker.http.address = 0.0.0.0:50060
mapred.reduce.parallel.copies = 5
io.seqfile.lazydecompress = true
mapred.output.dir = /user/hduser/output/shortestpaths
io.sort.mb = 100
ipc.client.connection.maxidletime = 1
mapred.compress.map.output = false
hadoop.security.uid.cache.secs = 14400
mapred.task.tracker.report.address = 127.0.0.1:0
mapred.healthChecker.interval = 6
ipc.client.kill.max = 10
ipc.client.connect.max.retries = 10
ipc.ping.interval = 30
mapreduce.user.classpath.first = true
mapreduce.map.class = org.apache.giraph.graph.GraphMapper
fs.s3.impl = org.apache.hadoop.fs.s3.S3FileSystem
mapred.user.jobconf.limit = 5242880
mapred.job.tracker.http.address = 0.0.0.0:50030
io.file.buffer.size = 4096
mapred.jobtracker.restart.recover = false
io.serializations = org.apache.hadoop.io.serializer.WritableSerialization
dfs.datanode.handler.count = 3
mapred.reduce.copy.backoff = 300
mapred.task.profile = false
dfs.replication.considerLoad = true
jobclient.output.filter = FAILED
dfs.namenode.delegation.token.max-lifetime = 60480
mapred.tasktracker.map.tasks.maximum = 4
io.compression.codecs = org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec
fs.checkpoint.size = 67108864
Additionally, if I have more than one worker I get an Exception, too. Are my configurations wrong? best regards, Sebastian -- Claudio Martella claudio.marte...@gmail.com
Re: About LineRank algo ..
do you plan to share it when you're done? :) On Mon, Jan 20, 2014 at 9:15 AM, Sebastian Schelter s...@apache.org wrote: I have a student working on an implementation, do you have questions? On 01/20/2014 08:11 AM, Jyoti Yadav wrote: Hi.. Is there anyone who is working with linerank algorithm?? Thanks Jyoti -- Claudio Martella claudio.marte...@gmail.com
Re: Intermediate output
you can use giraph.doOutputDuringComputation. If you use this option, instead of saving vertices at the end of the application, saveVertex will be called right after each vertex.compute() is called. NOTE: This feature doesn't work well with checkpointing - if you restart from a checkpoint you won't have any output from previous supersteps. On Sat, Jan 18, 2014 at 11:02 AM, Sebastian Schelter s...@apache.org wrote: Hi, Do we have a way to write out the state of the graph after each superstep? I have an algorithm that requires this and I don't want to buffer the intermediate results in memory until the algorithm finishes. --sebastian -- Claudio Martella claudio.marte...@gmail.com
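The option described above is passed as a custom argument on the command line, e.g.:

```
# Write vertex output as supersteps complete instead of only at the end.
# Per the caveat above, avoid combining this with checkpoint restarts:
-ca giraph.doOutputDuringComputation=true
```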
Re: Release date for 1.1.0
I agree, in particular considering that the current patch does not even apply to trunk. On Sat, Jan 18, 2014 at 4:43 PM, Sebastian Schelter s...@apache.org wrote: Hi, I had a look at the list and noticed that https://issues.apache.org/jira/browse/GIRAPH-818 (which is based on a research paper) is marked for 1.1.0. Given that this issue proposes to change the programming and execution model of Giraph, I don't see it in the scope of the upcoming release. --sebastian On 01/18/2014 04:07 PM, Roman Shaposhnik wrote: The easiest way to start helping with a release would be to take a look at the JIRAs in the link I sent to the list a few days ago. Thanks, Roman, On Fri, Jan 17, 2014 at 5:54 PM, Rob Paul urlop...@gmail.com wrote: Hi Roman, I will be happy to extend my help in the new release. If you allow, I can initiate it and then you can jump in, as and when your time permits. Thanks On Wed, Jan 15, 2014 at 5:57 PM, Roman Shaposhnik r...@apache.org wrote: It is the usual community-driven ASF process. Somebody familiar with the project has to step forward as a Release Manager and drive the release. I did a few months back, but since then I went through a career change that made it very difficult for me to find free cycles to drive this release. I fully intend to pick up the slack beginning of Feb. Given that, I think beginning of March should be a realistic deadline, but it all depends on the availability of the Giraph PMC members to cast votes on the release candidate. That said, if there's anybody else who would want to speed up this release I'd be more than happy to yield. By and large though, ASF projects typically don't give any schedule for future releases. The way to speed it up is to join the community, start contributing and volunteering as RM. Thanks, Roman. On Wed, Jan 15, 2014 at 5:02 PM, Zhu, Xia xia@intel.com wrote: Is it possible to release 1.1.0 before March 2014?
Thanks, Xia -Original Message- From: Zhu, Xia [mailto:xia@intel.com] Sent: Wednesday, January 15, 2014 4:36 PM To: user@giraph.apache.org Subject: RE: Release date for 1.1.0 May I know what are the Giraph release process? Thanks, Ivy -Original Message- From: shaposh...@gmail.com [mailto:shaposh...@gmail.com] On Behalf Of Roman Shaposhnik Sent: Monday, January 06, 2014 9:22 PM To: user@giraph.apache.org Subject: Re: Release date for 1.1.0 On Mon, Jan 6, 2014 at 6:13 AM, Ahmet Emre Aladağ aladage...@gmail.com wrote: Hi, Are there any advances so far on the 1.1.0 release schedule? Unfortunately, with my recent job change driving 1.1.0 release dropped from my list. I'll try to pick it up back this month. Still very much would like to help make it happen. Thanks, Roman. -- Claudio Martella claudio.marte...@gmail.com
Re: Release date for 1.1.0
I think that after 1.1.0 we could consider this big change, and that there should be a vote on it. It's changing Giraph's shape at the core. On Sat, Jan 18, 2014 at 5:15 PM, Mirko Kämpf mirko.kae...@cloudera.com wrote: Hi, I also think this major change (or enhancement) might be something which goes into Giraph in a later release. Will there be a vote on such issues? Mirko -- Claudio Martella claudio.marte...@gmail.com
Re: minLocalEdgesRatio in PseudoRandomLocalEdgesHelper
it's the ratio of edges that connect two vertices stored on the same worker. On Sun, Dec 15, 2013 at 8:17 PM, Pushparaj Motamari pushpara...@gmail.com wrote: Hi, Could anyone explain the significance of the minLocalEdgesRatio field in PseudoRandomInputFormat, and its role in the way the graph is generated? Thanks Pushparaj -- Claudio Martella claudio.marte...@gmail.com
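The quantity Claudio describes can be computed for any edge list. A standalone toy sketch (not the PseudoRandom generator itself), using the default modulo hash assignment of vertices to workers:

```java
// Toy sketch: the fraction of edges whose two endpoints land on the same
// worker, i.e. the "local edges ratio" the generator option constrains.
public class LocalEdgeRatioDemo {
    static int workerFor(long v, int k) {
        return Math.abs(Long.hashCode(v)) % k;
    }

    static double localEdgeRatio(long[][] edges, int numWorkers) {
        int local = 0;
        for (long[] e : edges) {
            if (workerFor(e[0], numWorkers) == workerFor(e[1], numWorkers)) {
                local++;
            }
        }
        return (double) local / edges.length;
    }

    public static void main(String[] args) {
        // With k=2 workers: (0,4) local, (1,2) remote, (2,6) local, (3,3) local.
        long[][] edges = {{0, 4}, {1, 2}, {2, 6}, {3, 3}};
        System.out.println("local edge ratio: " + localEdgeRatio(edges, 2));
    }
}
```

A higher ratio means less cross-worker message traffic during computation, which is why the generator exposes it as a tunable parameter.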
Re: Running Giraph on YARN (0.23)
15:22:55,111 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: yarn.app.mapreduce.am.job.client.port-range; Ignoring. 2013-11-20 15:22:55,111 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.admin.reduce.child.java.opts; Ignoring. 2013-11-20 15:22:55,111 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: hadoop.tmp.dir; Ignoring. 2013-11-20 15:22:55,117 WARN [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete hdfs://zaniumtan-nn1.tan.ygrid.yahoo.com:8020/user/bordino/test-giraph-tmp/_temporary/1/_temporary/attempt_1382563758657_470916_m_56_0 2013-11-20 15:22:55,121 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system... 2013-11-20 15:22:55,122 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped. 2013-11-20 15:22:55,122 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete. Cheers, -- Gianmarco -- Claudio Martella claudio.marte...@gmail.com
Re: Using the RandomEdge ... RandomVertex InputFormat
Every inputformat has an I/V/E signature with the types of the vertex index, vertex value and edge value. They have to match the signature of the computation class you're using. In your case, the inputformat generates edges with Double values, while the computation class expects Float edge values. On Mon, Nov 4, 2013 at 10:57 AM, Mirko Kämpf mirko.kae...@cloudera.com wrote: Hello, I try to use the RandomInputFormat. My giraph-job is submitted via the following command: hadoop jar giraph-ex.jar org.apache.giraph.GiraphRunner -Dgiraph.zkList=127.0.0.1:2181 -libjars giraph-core.jar org.apache.giraph.examples.SimpleShortestPathsVertex -eif org.apache.giraph.io.formats.PseudoRandomEdgeInputFormat -vif org.apache.giraph.io.formats.PseudoRandomVertexInputFormat -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/cloudera/goutput/shortestpaths_rand_$NOW -w 1 -ca giraph.pseudoRandomInputFormat.edgesPerVertex=10 but I get the following exception: 13/11/04 01:28:54 INFO utils.ConfigurationUtils: Setting custom argument [giraph.pseudoRandomInputFormat.edgesPerVertex] to [10] in GiraphConfiguration 13/11/04 01:28:54 INFO utils.ConfigurationUtils: No input path for vertex data was specified. Ensure your InputFormat does not require one. 13/11/04 01:28:54 INFO utils.ConfigurationUtils: No input path for edge data was specified. Ensure your InputFormat does not require one.
Exception in thread main java.lang.IllegalArgumentException: checkClassTypes: Edge value types don't match, vertex - class org.apache.hadoop.io.FloatWritable, vertex input format - class org.apache.hadoop.io.DoubleWritable at org.apache.giraph.job.GiraphConfigurationValidator.verifyVertexInputFormatGenericTypes(GiraphConfigurationValidator.java:245) at org.apache.giraph.job.GiraphConfigurationValidator.validateConfiguration(GiraphConfigurationValidator.java:122) at org.apache.giraph.utils.ConfigurationUtils.parseArgs(ConfigurationUtils.java:154) at org.apache.giraph.GiraphRunner.run(GiraphRunner.java:74) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at org.apache.giraph.GiraphRunner.main(GiraphRunner.java:124) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:208) I think I do not have all required command line parameters set. But the problem is, I cannot find any documentation which explains how to run giraph with random networks generated on the fly. The job runs with the tiny_graph.txt file (and appropriate parameters) but not with the random format. Could anybody please help me find out how to use the random graph and the watts strogatz model which are mentioned by Claudio in this mail: http://mail-archives.apache.org/mod_mbox/giraph-user/201310.mbox/%3cof4e9a3736.19e56fe9-on85257bff.000b4243-85257bff.000d8...@us.ibm.com%3E Can I use the RandomVertex and the RandomEdgeInputFormat to build random graphs on the fly? Thanks a lot in advance.
Best wishes Mirko -- Claudio Martella claudio.marte...@gmail.com
Re: Using the RandomEdge ... RandomVertex InputFormat
Yes, you'll have to make sure that the PseudoRandomEdgeInputFormat provides the right types. The code for the watts strogatz model is in the same package as the PseudoRandom... classes, but in trunk and not in 1.0. On Mon, Nov 4, 2013 at 12:14 PM, Mirko Kämpf mirko.kae...@cloudera.com wrote: Thanks, Claudio. I conclude from your mail that I have to create my own PseudoRandomEdgeInputFormat and PseudoRandomVertexInputFormat with types which fit the algorithm I want to use. So I misunderstood the concept, and not all InputFormats fit any given implemented algorithm. Is this right? But what about the config parameters I have to provide for the PseudoRandom... InputFormat, and where is the code for the watts strogatz model you mentioned in a previous post? Best wishes Mirko -- Claudio Martella claudio.marte...@gmail.com
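The check that fails in the exception above can be illustrated in plain Java. This is a toy model (the class names below are hypothetical stand-ins, not Giraph's real GiraphConfigurationValidator): the generic I/V/E arguments of the input format and the computation class are compared slot by slot via reflection.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Hypothetical stand-ins for Giraph's generic base classes.
abstract class ToyInputFormat<I, V, E> {}
abstract class ToyComputation<I, V, E> {}

// The input format produces Double edge values...
class RandomishInput extends ToyInputFormat<Long, Double, Double> {}
// ...but the computation expects Float edge values, as in the exception above.
class ShortestPathsLike extends ToyComputation<Long, Double, Float> {}

public class TypeCheckDemo {
    static Type[] typeArgs(Class<?> c) {
        // Read the actual I, V, E arguments bound in "extends Base<...>".
        return ((ParameterizedType) c.getGenericSuperclass()).getActualTypeArguments();
    }

    public static void main(String[] args) {
        Type[] in = typeArgs(RandomishInput.class);
        Type[] comp = typeArgs(ShortestPathsLike.class);
        String[] slot = {"vertex id", "vertex value", "edge value"};
        for (int i = 0; i < 3; i++) {
            // The edge value slot mismatches; Giraph's validator would
            // throw IllegalArgumentException at this point.
            System.out.println(slot[i] + " types match: " + in[i].equals(comp[i]));
        }
    }
}
```

This is why the fix is either a different input format or a computation class whose three type parameters line up with the format's.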
Re: Link Prediction with Giraph
I would assume that it depends on your data. A graph is a very general structure, and it is difficult to attack this problem in general. The most obvious approach is transitive closure (if A is connected to B and B to C, then A could be connected to C). The triangle counting example in our codebase (although the name is misleading) is based on these kinds of assumptions. On Thu, Oct 31, 2013 at 1:26 PM, Pascal Jäger pas...@pascaljaeger.de wrote: Hi, Does anyone happen to know a paper about link prediction using a Pregel-like framework like Giraph? Or has someone an idea about how link prediction could be accomplished with Giraph? Any input is highly appreciated :) Thanks Pascal -- Claudio Martella claudio.marte...@gmail.com
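The transitive-closure intuition Claudio mentions underlies the simplest link-prediction score: counting common neighbours of two unconnected vertices. A standalone sketch (not from the Giraph codebase):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy common-neighbours link-prediction score: the more neighbours two
// unconnected vertices share, the more likely an edge between them.
public class CommonNeighborsDemo {
    static int score(Map<String, Set<String>> adj, String a, String b) {
        Set<String> common = new HashSet<>(adj.getOrDefault(a, Set.of()));
        common.retainAll(adj.getOrDefault(b, Set.of()));
        return common.size();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> adj = Map.of(
            "A", Set.of("B"),
            "B", Set.of("A", "C"),
            "C", Set.of("B"));
        // A and C are not connected but share neighbour B, so an A-C
        // edge is a candidate prediction.
        System.out.println("score(A, C) = " + score(adj, "A", "C"));
    }
}
```

In a Pregel-style implementation, each vertex would send its neighbour list to its neighbours in one superstep and intersect the received lists in the next.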
Re: Release date for 1.1.0
I actually agree, we should start heading towards 1.1.0 with a plan. Avery, what do you think? On Tue, Oct 29, 2013 at 2:13 PM, Ahmet Emre Aladağ aladage...@gmail.com wrote: Hi all, Is there an expected date for 1.1.0? A lot of ground has been covered since 1.0.0. -- Ahmet Emre Aladağ -- Claudio Martella claudio.marte...@gmail.com
Re: master knowing about message traffic
The simplest solution is to use an aggregator. On Mon, Oct 21, 2013 at 3:48 PM, Jyoti Yadav rao.jyoti26ya...@gmail.com wrote: Is there any way for the master to know how much message traffic there is? In my algo, I have to implement something when there are no messages flowing. Any ideas are really appreciated.. Regards Jyoti -- Claudio Martella claudio.marte...@gmail.com
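The aggregator pattern Claudio suggests can be modeled in plain Java. This is a standalone toy mirroring the shape of a long-sum aggregator (Giraph ships LongSumAggregator in org.apache.giraph.aggregators; this sketch is not that class): every vertex reports how many messages it sent, and the master reads the global total at the start of the next superstep.

```java
// Toy long-sum aggregator: same aggregate/read/reset life cycle as a
// Giraph aggregator, but standalone (no Giraph dependency).
class LongSumAggregator {
    private long sum = 0;
    void aggregate(long value) { sum += value; }   // called from each vertex's compute()
    long getAggregatedValue() { return sum; }      // read by the master next superstep
    void reset() { sum = 0; }                      // cleared between supersteps
}

public class MessageTrafficDemo {
    public static void main(String[] args) {
        LongSumAggregator traffic = new LongSumAggregator();
        // During a superstep, every vertex reports how many messages it sent:
        traffic.aggregate(3);
        traffic.aggregate(0);
        traffic.aggregate(2);
        // At the start of the next superstep the master checks the total;
        // zero traffic is the cue to halt the computation.
        System.out.println("messages sent: " + traffic.getAggregatedValue());
        if (traffic.getAggregatedValue() == 0) {
            System.out.println("no traffic -> master can halt the computation");
        }
        traffic.reset();
    }
}
```

In real Giraph the aggregator would be registered in a MasterCompute subclass and values contributed via the aggregate() call available in the computation; the life cycle is the same as in this toy.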
Re: Problem running the PageRank example in a cluster
authentication. 2013-10-21 10:12:15,692 WARN org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught: Channel failed with remote address null java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708) at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:404) at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:366) at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) 2013-10-21 10:12:15,693 INFO org.apache.giraph.comm.netty.NettyClient: connectAllAddresses: Successfully added 0 connections, (0 total connected) 1 failed, 6 failures total. 2013-10-21 10:12:15,693 WARN org.apache.giraph.comm.netty.NettyClient: connectAllAddresses: Future failed to connect with hdnode02/172.24.10.72:30001 with 6 failures because of java.net.ConnectException: Connection refused 2013-10-21 10:12:15,693 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without authentication. 
2013-10-21 10:12:15,694 WARN org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught: Channel failed with remote address null java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708) at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:404) at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:366) at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) -- Claudio Martella claudio.marte...@gmail.com
Re: How to specify parameters in order to run giraph job in parallel
how many mapper tasks do you have set for each node? how many workers are you using for giraph? On Fri, Oct 18, 2013 at 7:12 PM, YAN Da ya...@ust.hk wrote: Dear Claudio Martella, I don't quite get what you mean. Our cluster has 15 servers each with 24 cores, so ideally there can be 15*24 threads/partitions working in parallel, right? (Perhaps deduct one for ZooKeeper.) However, when we set the -Dgiraph.numComputeThreads option, we find that we cannot have even 20 threads, and when it is set to 10, the CPU usage is only about double that of the default setting, nothing close to 100*numComputeThreads%. How can we set it up on our servers to utilize all the processors? Regards, Da Yan It actually depends on the setup of your cluster. Ideally, with 15 nodes (tasktrackers) you'd want 1 mapper slot per node (ideally to run giraph), so that you would have 14 workers, one per computing node, plus one for master+zookeeper. Once that is reached, you would have a number of compute threads equal to the number of threads that you can run on each node (24 in your case). Does this make sense to you? On Thu, Oct 17, 2013 at 5:04 PM, Yi Lu luyi0...@gmail.com wrote: Hi, I have a computer cluster consisting of 15 slave machines and 1 master machine. On each slave machine, there are two Xeon E5-2620 CPUs. With the help of HT, there are 24 threads. I am wondering how to specify parameters in order to run a giraph job in parallel on my cluster. I am using the following parameters to run a pagerank algorithm. hadoop jar ~/giraph-examples.jar org.apache.giraph.GiraphRunner SimplePageRank -vif PageRankInputFormat -vip /input -vof PageRankOutputFormat -op /pagerank -w 1 -mc SimplePageRank\$SimplePageRankMasterCompute -wc SimplePageRank\$SimplePageRankWorkerContext In particular, 1) I know I can use “-w” to specify the number of workers. In my opinion, the number of workers equals the number of mappers in hadoop, except zookeeper. Therefore, in my case (15 slave machines), which number should be chosen? Is 15 a good choice? I find that if I input a large number, e.g. 100, the mappers will hang. 2) I know I can use “-Dgiraph.numComputeThreads=1” to specify the vertex computing thread number. However, if I set it to 10, the total runtime is much longer than the default. I think the default is 1, which I found in the source code. I wonder, if I want to use this parameter, which number should be chosen? 3) When the giraph job is running, I use “top” to monitor CPU usage on the slave machines. I find that the java process can use 200%-300% of CPU resources. However, if I change the number of vertex computing threads to 10, the java process can use 800%. I think it is not a linear relation and I want to know why. Thanks for your help. Best, -Yi -- Claudio Martella claudio.marte...@gmail.com
Re: how to use out of core options
to test the out of core performance of my cluster. Thanks very much, Jian -- Best Regards, Jyotirmoy Sundi Data Engineer, Admobius San Francisco, CA 94158 -- Claudio Martella claudio.marte...@gmail.com
Re: How to specify parameters in order to run giraph job in parallel
It actually depends on the setup of your cluster. Ideally, with 15 nodes (tasktrackers) you'd want 1 mapper slot per node (ideally to run giraph), so that you would have 14 workers, one per computing node, plus one for master+zookeeper. Once that is reached, you would have a number of compute threads equal to the number of threads that you can run on each node (24 in your case). Does this make sense to you? On Thu, Oct 17, 2013 at 5:04 PM, Yi Lu luyi0...@gmail.com wrote: Hi, I have a computer cluster consisting of 15 slave machines and 1 master machine. On each slave machine, there are two Xeon E5-2620 CPUs. With the help of HT, there are 24 threads. I am wondering how to specify parameters in order to run a giraph job in parallel on my cluster. I am using the following parameters to run a pagerank algorithm. hadoop jar ~/giraph-examples.jar org.apache.giraph.GiraphRunner SimplePageRank -vif PageRankInputFormat -vip /input -vof PageRankOutputFormat -op /pagerank -w 1 -mc SimplePageRank\$SimplePageRankMasterCompute -wc SimplePageRank\$SimplePageRankWorkerContext In particular, 1) I know I can use “-w” to specify the number of workers. In my opinion, the number of workers equals the number of mappers in hadoop, except zookeeper. Therefore, in my case (15 slave machines), which number should be chosen? Is 15 a good choice? I find that if I input a large number, e.g. 100, the mappers will hang. 2) I know I can use “-Dgiraph.numComputeThreads=1” to specify the vertex computing thread number. However, if I set it to 10, the total runtime is much longer than the default. I think the default is 1, which I found in the source code. I wonder, if I want to use this parameter, which number should be chosen? 3) When the giraph job is running, I use “top” to monitor CPU usage on the slave machines. I find that the java process can use 200%-300% of CPU resources. However, if I change the number of vertex computing threads to 10, the java process can use 800%. I think it is not a linear relation and I want to know why. Thanks for your help. Best, -Yi -- Claudio Martella claudio.marte...@gmail.com
Re: knowing about the vertex id of the sender of the message.
No, you'll have to add it to the message data. On Thu, Oct 17, 2013 at 6:10 PM, Jyoti Yadav rao.jyoti26ya...@gmail.com wrote: Hi.. In the vertex computation code, at the start of the superstep every vertex processes its received messages. Is there any way for the vertex to know who the sender of the message it is currently processing is? Thanks Jyoti -- Claudio Martella claudio.marte...@gmail.com
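Adding the sender to the message data, as Claudio suggests, means defining a message type that embeds the sender's vertex id. A standalone sketch (in real Giraph the class would implement org.apache.hadoop.io.Writable with write/readFields so it can be serialized between workers):

```java
// Standalone sketch of a message that carries its sender's vertex id.
// In Giraph this class would implement Writable (write/readFields).
class IdMessage {
    final long senderId;
    final double value;

    IdMessage(long senderId, double value) {
        this.senderId = senderId;
        this.value = value;
    }
}

public class SenderIdDemo {
    public static void main(String[] args) {
        // Sender side: vertex 7 sends its value together with its own id.
        IdMessage msg = new IdMessage(7L, 0.85);
        // Receiver side: the vertex can now tell who sent the message.
        System.out.println("from vertex " + msg.senderId + ": " + msg.value);
    }
}
```

The cost is a slightly larger message (one extra id per message), which is usually acceptable when the algorithm genuinely needs sender identity.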
Re: Running the example in http://giraph.apache.org/quick_start.html
format edge value type is not known 13/10/09 14:21:36 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4) 13/10/09 14:21:37 INFO mapred.JobClient: Running job: job_201310091401_0002 13/10/09 14:21:38 INFO mapred.JobClient: map 0% reduce 0% 13/10/09 14:21:52 INFO mapred.JobClient: map 50% reduce 0% 13/10/09 14:21:58 INFO mapred.JobClient: map 100% reduce 0% 13/10/09 14:21:59 INFO mapred.JobClient: map 50% reduce 0% 13/10/09 14:32:01 INFO mapred.JobClient: map 0% reduce 0% 13/10/09 14:32:02 INFO mapred.JobClient: Job complete: job_201310091401_0002 13/10/09 14:32:02 INFO mapred.JobClient: Counters: 6 13/10/09 14:32:02 INFO mapred.JobClient: Job Counters 13/10/09 14:32:02 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=622821 13/10/09 14:32:02 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 13/10/09 14:32:02 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 13/10/09 14:32:02 INFO mapred.JobClient: Launched map tasks=2 13/10/09 14:32:02 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 13/10/09 14:32:02 INFO mapred.JobClient: Failed map tasks=1 I appreciate any help. Maybe I did it wrong. Andro. -- Claudio Martella claudio.marte...@gmail.com
Re: connected components example in giraph 1.0
Can you try applying this one first? http://www.mail-archive.com/user@giraph.apache.org/msg00945/check.diff On Mon, Oct 7, 2013 at 8:40 AM, Silvio Di gregorio silvio.digrego...@gmail.com wrote: As I said, I have built giraph-examples-1.0.0-for-hadoop-2.0.0-cdh4.1.2-jar-with-dependencies.jar for cdh4 successfully. The job starts to monitor the success rate: 13/10/07 08:28:45 INFO mapred.JobClient: map 0% reduce 0% but then: Error running child java.lang.IllegalStateException: run: Caught an unrecoverable exception java.io.FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201309181636_0678/_zkServer does not exist. ... Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201309181636_0678/_zkServer does not exist. 2013/10/5 Silvio Di gregorio silvio.digrego...@gmail.com I have built with the hadoop_cdh4.1.2 parameter. Something has changed; Monday I'll report the result. Now the farm is closed. On 05 Oct 2013 14:06, Claudio Martella claudio.marte...@gmail.com wrote: Oh, right, -vof is in trunk. Anyway it looks like you built giraph for the wrong profile. You mentioned you're running on 2.0, but your giraph is built for 0.20.203. Try building with a profile for your hadoop version. On Fri, Oct 4, 2013 at 2:35 PM, Silvio Di gregorio silvio.digrego...@gmail.com wrote: org.apache.commons.cli.UnrecognizedOptionException: Unrecognized option: -vof in the 1.0 version it is -of,--outputFormat arg Vertex output format -op,--outputPath arg Vertex output path 2013/10/4 Claudio Martella claudio.marte...@gmail.com did you try the argument (-vof) I suggested?
On Fri, Oct 4, 2013 at 2:13 PM, Silvio Di gregorio silvio.digrego...@gmail.com wrote:
I've specified -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat but the same error was produced:

Exception in thread main java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.giraph.bsp.BspOutputFormat.checkOutputSpecs(BspOutputFormat.java:43)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:984)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:566)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596)
at org.apache.giraph.job.GiraphJob.run(GiraphJob.java:237)
at org.apache.giraph.GiraphRunner.run(GiraphRunner.java:94)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.giraph.GiraphRunner.main(GiraphRunner.java:124)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

2013/10/4 Claudio Martella claudio.marte...@gmail.com:
Hi, you need to specify the vertex outputformat class (-vof option), e.g. org.apache.giraph.io.formats.IdWithValueTextOutputFormat.

On Fri, Oct 4, 2013 at 1:06 PM, Silvio Di gregorio silvio.digrego...@gmail.com wrote:
Hi, I hope I have sent this to the right address.
Re: connected components example in giraph 1.0
OK, thanks. I really have to push that patch in.

On Mon, Oct 7, 2013 at 12:17 PM, Silvio Di gregorio silvio.digrego...@gmail.com wrote:
Yes I do; I have seen this in your post at http://www.mail-archive.com/user@giraph.apache.org/msg00957.html. Excuse me: if I had checked the mail archive first, I would have avoided the last post. Now the ZK issue is resolved.
Re: connected components example in giraph 1.0
Oh, right, -vof is in trunk. Anyway, it looks like you built giraph for the wrong profile. You mentioned you're running on 2.0, but your giraph is built for 0.20.203. Try building with a profile for your hadoop version.

On Fri, Oct 4, 2013 at 2:35 PM, Silvio Di gregorio silvio.digrego...@gmail.com wrote:
org.apache.commons.cli.UnrecognizedOptionException: Unrecognized option: -vof
In the 1.0 version it is:
-of,--outputFormat arg   Vertex output format
-op,--outputPath arg     Vertex output path

-- Claudio Martella claudio.marte...@gmail.com
Re: connected components example in giraph 1.0
Hi, you need to specify the vertex output format class (the -vof option), e.g. org.apache.giraph.io.formats.IdWithValueTextOutputFormat.

On Fri, Oct 4, 2013 at 1:06 PM, Silvio Di gregorio silvio.digrego...@gmail.com wrote:
Hi, I hope I have sent this to the right address. I have a graph (directed and unweighted) stored in HDFS as an adjacency list (140 million edges, 6 million vertices):

node<TAB>neighbors
23      2 1343
1       999
99923   909
...

Hadoop version: Hadoop 2.0.0-cdh4.3.0, Java 1.6. I have executed the giraph-1.0 connected components example in this fashion:

hadoop jar /usr/local/giraph/giraph-examples/target/giraph-examples-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.ConnectedComponentsVertex -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip /user/hdfs/lista_adj_txt -op connectedgiraph --workers 4

and then it fails with:

13/10/04 09:28:29 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
13/10/04 09:28:29 INFO utils.ConfigurationUtils: No output format specified. Ensure your OutputFormat does not require one.
13/10/04 09:28:30 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4)
13/10/04 09:28:31 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/10/04 09:28:31 INFO mapred.JobClient: Cleaning up the staging area hdfs://srv-bigdata-dev-01.int.sose.it:8020/user/hdfs/.staging/job_201309181636_0535
Exception in thread main java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.giraph.bsp.BspOutputFormat.checkOutputSpecs(BspOutputFormat.java:43)
..

Thanks in advance

-- Claudio Martella claudio.marte...@gmail.com
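[Editor's note] The input format used here (IntIntNullTextInputFormat) consumes the tab-separated adjacency-list layout shown above. A minimal Python sketch of that parsing, purely for illustration (this is not Giraph's actual Java parser):

```python
def parse_adjacency_line(line):
    """One line of the node<TAB>neighbors format, e.g. "23\t2 1343"."""
    node, _, rest = line.rstrip("\n").partition("\t")
    neighbors = [int(n) for n in rest.split()]
    return int(node), neighbors

def load_graph(lines):
    """Build {vertex_id: [neighbor_ids]} from an iterable of lines."""
    return dict(parse_adjacency_line(line) for line in lines)

# the three sample rows from the message above
graph = load_graph(["23\t2 1343", "1\t999", "99923\t909"])
```

Note that the IncompatibleClassChangeError itself comes from the profile mismatch discussed in this thread, not from the input data.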
Re: workload used to measure Giraph performance number
Hi Wei,

it depends on what you mean by workload for a batch processing system. I believe we can split the problem in two: generating a realistic graph, and using representative algorithms.

To generate graphs we have two options in Giraph:
1) random graph: you specify the number of vertices and the number of edges for each vertex, and the edges will connect two random vertices. This creates a graph with (i) low clustering coefficient, (ii) low average path length, and (iii) a uniform degree distribution.
2) Watts-Strogatz: you specify the number of vertices, the number of edges, and a rewire probability beta. Giraph will generate a ring lattice (each vertex is connected to k preceding vertices and k following vertices) and rewire some of the edges randomly. This creates a graph with (i) high clustering coefficient, (ii) low average path length, and (iii) a Poisson-like degree distribution (depending on beta). This graph will resemble a small-world graph such as a social network, except for the degree distribution, which will not be a power law.

To use representative algorithms you can choose:
1) PageRank: a ranking algorithm where all the vertices are active and send messages along the edges at each superstep (hence you'll have O(V) active vertices and O(E) messages).
2) Shortest Paths: starting from a random vertex, you'll visit all the vertices in the graph (some multiple times). This will have an aggregate O(V) active vertices and O(E) messages, but this is only a lower bound. In general you'll have different areas of the graph explored at each superstep, and hence a potentially varying workload across supersteps.
3) Connected Components: this behaves roughly opposite to (2), as it has many active vertices at the beginning, while the detection is refined towards the end.

Hope this helps,
Claudio

On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang w...@us.ibm.com wrote:
Hi, I am interested in measuring some performance numbers of Giraph on my machine.
I am wondering whether there are some pointers to a (configurable), reasonably large workload I could work on? Thanks! Wei

-- Claudio Martella claudio.marte...@gmail.com
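[Editor's note] The Watts-Strogatz construction described above (ring lattice plus random rewiring) is small enough to sketch. This is an illustrative Python sketch of the general technique, not Giraph's generator; the function name and rewiring details are assumptions:

```python
import random

def watts_strogatz(n, k, beta, seed=0):
    """Build a ring lattice of n vertices, each linked to its k following
    vertices, then rewire every edge to a random endpoint with probability
    beta. Sketch of the construction described above, not Giraph code."""
    rng = random.Random(seed)
    lattice = [(v, (v + i) % n) for v in range(n) for i in range(1, k + 1)]
    edges = set()
    for (u, v) in lattice:
        if rng.random() < beta:
            w = rng.randrange(n)
            # avoid self-loops and duplicate edges from the same source
            while w == u or (u, w) in edges:
                w = rng.randrange(n)
            edges.add((u, w))  # random endpoint shortens average path length
        else:
            edges.add((u, v))  # lattice edge keeps clustering high
    return edges
```

With beta = 0 this is a pure ring lattice (high clustering); as beta grows toward 1 the graph approaches a random graph, which matches the two workload extremes described in the reply.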
Re: Giraph offloadPartition fails creation directory
Weird. This is the code:

if (!parent.exists()) {
  if (!parent.mkdirs()) {
    LOG.error("offloadPartition: Failed to create directory " + parent.getAbsolutePath());
  }
}

The question is why parent.mkdirs() is returning false. It could be a problem of permissions. Could you try passing a different directory for writing, e.g. /tmp/foobar?

On Mon, Sep 23, 2013 at 1:28 PM, Dionysis Logothetis dlogothe...@gmail.com wrote:
offloadPartition: Failed to create directory

-- Claudio Martella claudio.marte...@gmail.com
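[Editor's note] The permission hypothesis is easy to probe outside Giraph. A hypothetical Python sketch that mimics the mkdirs check but surfaces why creation failed instead of only a boolean (all names here are illustrative, not Giraph code):

```python
import os
import tempfile

def ensure_dir(path):
    """Try to create a directory tree and report the OS-level reason on
    failure (e.g. "Permission denied"), mirroring the debugging step
    suggested above. Illustrative sketch only."""
    try:
        os.makedirs(path, exist_ok=True)
        return True, None
    except OSError as e:
        return False, e.strerror

# e.g. probe the /tmp/foobar suggestion from the reply above
ok, err = ensure_dir(os.path.join(tempfile.gettempdir(), "foobar"))
```

Running the equivalent check as the same user the tasktracker runs as would show whether mkdirs() fails due to permissions or to something else (such as a path component being a regular file).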
Re: Number of threads for vertex compute method
By default Giraph uses one compute thread per worker. It uses multiple threads for I/O, e.g. in Netty. The right number of compute threads depends on the number of workers per machine. Imagine you have a machine in your hadoop cluster with 8 cores and 8 mapper tasks (something like the basic setup). Then you don't really need more compute threads per worker, as your cores will be busy all the time. Increasing the number of compute threads is useful when you have a setup with one worker per machine. In that case you'd want one compute thread per core.

On Wed, Sep 11, 2013 at 12:00 PM, Christian Krause m...@ckrause.org wrote:
Hi, by default, how many threads are used for the compute method? I thought that Giraph would automatically use multiple threads by default, but then I stumbled onto this log message:

2013-09-11 11:51:44,501 INFO org.apache.giraph.graph.GraphTaskManager: execute: 6 partitions to process with 1 compute thread(s), originally 1 thread(s) on superstep 7

Does this really mean that it uses only one thread? Cheers, Christian

-- Claudio Martella claudio.marte...@gmail.com
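[Editor's note] The scheduling described above, P partitions worked off by a fixed pool of compute threads, can be sketched as follows. This is illustrative Python, not Giraph code, and the names are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def process_partitions(partitions, num_compute_threads):
    """Drain P partitions with a bounded pool of compute threads, as in
    the log line "6 partitions to process with 1 compute thread(s)"."""
    def compute(partition):
        # stand-in for running compute() on every vertex of the partition
        return sum(partition)
    with ThreadPoolExecutor(max_workers=num_compute_threads) as pool:
        # map() keeps partition order while the pool bounds parallelism
        return list(pool.map(compute, partitions))

# the default per-worker setup: 6 partitions, 1 compute thread
results = process_partitions([[1, 2], [3], [4, 5], [6], [7], [8]], 1)
```

With 8 single-threaded workers on an 8-core box every core is already busy, which is why more compute threads per worker only pay off in a one-worker-per-machine setup.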
Re: Giraph offloadPartition fails creation directory
Giraph does not offload partitions or messages to HDFS in the out-of-core module. It uses the local disk of the computing nodes. By default, it uses the tasktracker local directory, where for example the distributed cache is stored. Could you provide the stacktrace Giraph is spitting out when failing?

On Thu, Sep 12, 2013 at 12:54 AM, Alexander Asplund alexaspl...@gmail.com wrote:
Hi, I'm still trying to get Giraph to work on a graph that requires more memory than is available. The problem is that when the workers try to offload partitions, the offloading fails. The DiskBackedPartitionStore fails to create the directory _bsp/_partitions/job-/part-vertices-xxx (roughly from recall). The input or computation will then continue for a while, which I believe is because it is still managing to hold everything in memory - but at some point it reaches the limit where there simply is no more heap space, and it crashes with OOM. Has anybody had this problem with Giraph failing to make HDFS directories?

-- Claudio Martella claudio.marte...@gmail.com
Re: Out of core execution has no effect on GC crash
As David mentions, even with OOC the objects are still created (and yes, often destroyed soon after being spilled to disk), putting pressure on the GC. Moreover, as the size of the graph increases, the in-memory vertices are not the only growing chunk of memory: there are other memory stores around the codebase that get filled, such as caches etc. Try increasing the heap to something reasonable for your machines.

On Tue, Sep 10, 2013 at 3:21 AM, David Boyd db...@data-tactics-corp.com wrote:
Alexander: You might try turning off the GC overhead limit (-XX:-UseGCOverheadLimit). Also, you could turn on verbose GC logging (-verbose:gc -Xloggc:/tmp/@taskid@.gc) to see what is happening. Because the OOC still has to create and destroy objects, I suspect that the heap is just getting really fragmented. There are options that you can set with Java to change the type of garbage collection and how it is scheduled as well. You might up the heap size slightly - what is the default heap size on your cluster?

On 9/9/2013 8:33 PM, Alexander Asplund wrote:
A small note: I'm not seeing any partitions directory being formed under _bsp, which is where I have understood they should be appearing.

On 9/10/13, Alexander Asplund alexaspl...@gmail.com wrote:
Really appreciate the swift responses! Thanks again. I have not both increased mapper tasks and decreased the max number of partitions at the same time. I first did tests with increased mapper heap available, but reset the setting after it apparently caused other large-volume, non-Giraph jobs to crash nodes when reducers were also running. I'm curious why increasing mapper heap is a requirement. Shouldn't the OOC mode be able to work with the amount of heap that is available? Is there some agreement on the minimum amount of heap necessary for OOC to succeed, to guide the choice of mapper heap amount? Either way, I will try increasing mapper heap again as much as possible, which hopefully will run.
On 9/9/13, Claudio Martella claudio.marte...@gmail.com wrote:
did you extend the heap available to the mapper tasks? e.g. through mapred.child.java.opts.

On Tue, Sep 10, 2013 at 12:50 AM, Alexander Asplund alexaspl...@gmail.com wrote:
Thanks for the reply. I tried setting giraph.maxPartitionsInMemory to 1, but I'm still getting OOM: GC limit exceeded. Are there any particular cases the OOC will not be able to handle, or is it supposed to work in all cases? If the latter, it might be that I have made some configuration error. I do have one concern that might indicate I have done something wrong: to allow OOC to activate without crashing, I had to modify the trunk code. This was because Giraph relied on guava-12, and DiskBackedPartitionStore used hasInt() - a method which does not exist in guava-11, which hadoop 2 depends on. At runtime guava 11 was being used. I suppose this problem might indicate I'm submitting the job using the wrong binary. Currently I am including the giraph dependencies with the jar, and running using hadoop jar.

On 9/7/13, Claudio Martella claudio.marte...@gmail.com wrote:
OOC is used also at input superstep. Try to decrease the number of partitions kept in memory.

On Sat, Sep 7, 2013 at 1:37 AM, Alexander Asplund alexaspl...@gmail.com wrote:
Hi, I'm trying to process a graph that is about 3 times the size of available memory. On the other hand, there is plenty of disk space. I have enabled the giraph.useOutOfCoreGraph property, but it still crashes with OutOfMemoryError: GC limit exceeded when I try running my job. I'm wondering if the spilling is supposed to work during the input step. If so, are there any additional steps that must be taken to ensure it functions? Regards, Alexander Asplund

--
David W. Boyd
Director, Engineering
7901 Jones Branch, Suite 700, McLean, VA 22102
office: +1-571-279-2122  fax: +1-703-506-6703  cell: +1-703-402-7908
mailto:db...@data-tactics.com
http://www.data-tactics.com.com/
First Robotic Mentor - FRC, FTC - www.iliterobotics.org
President - USSTEM Foundation - www.usstem.org

-- Claudio
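[Editor's note] The out-of-core behaviour discussed in this thread, keeping at most giraph.maxPartitionsInMemory partitions resident and spilling the rest to local disk, can be illustrated with a toy store. This is a sketch of the idea only, not Giraph's DiskBackedPartitionStore, and every name in it is invented:

```python
import collections
import os
import pickle
import tempfile

class SpillingPartitionStore:
    """Keep at most max_in_memory partitions resident; spill the least
    recently used ones to local disk. Analogous in spirit to
    giraph.maxPartitionsInMemory, purely for illustration."""

    def __init__(self, max_in_memory, spill_dir=None):
        self.max_in_memory = max_in_memory
        self.spill_dir = spill_dir or tempfile.mkdtemp()
        self.in_memory = collections.OrderedDict()  # insertion = LRU order
        self.on_disk = {}                           # pid -> spill file path

    def _spill_path(self, pid):
        return os.path.join(self.spill_dir, "part-vertices-%d" % pid)

    def put(self, pid, vertices):
        self.in_memory[pid] = vertices
        self.in_memory.move_to_end(pid)
        while len(self.in_memory) > self.max_in_memory:
            victim, data = self.in_memory.popitem(last=False)  # evict LRU
            with open(self._spill_path(victim), "wb") as f:
                pickle.dump(data, f)
            self.on_disk[victim] = self._spill_path(victim)

    def get(self, pid):
        if pid not in self.in_memory:
            with open(self.on_disk.pop(pid), "rb") as f:
                self.put(pid, pickle.load(f))  # reload may evict another
        self.in_memory.move_to_end(pid)
        return self.in_memory[pid]
```

Note how the spilled objects are still created and destroyed on every reload, which is exactly the GC pressure described above: spilling bounds resident data, not allocation churn.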
Re: Counter limit
On the command line, you can use the -D option after the GiraphRunner class and before the GiraphRunner-specific parameters, e.g. -D giraph.useSuperstepCounters=false

On Tue, Sep 10, 2013 at 1:15 PM, Christian Krause m...@ckrause.org wrote:
Thanks a lot. One last question: where do I set options like USE_SUPERSTEP_COUNTERS? Christian

2013/9/9 André Kelpe efeshundert...@googlemail.com:
On older versions of hadoop, you cannot set the counters to a higher value. That was only introduced later. I had this issue on CDH3 (~1.5 years ago), and my solution was to disable all counters for the giraph job to make it work. If you use a more modern version of hadoop, it should be possible to increase the limit though. - André

2013/9/9 Avery Ching ach...@apache.org:
If you are running out of counters, you can turn off the superstep counters:

/** Use superstep counters? (boolean) */
BooleanConfOption USE_SUPERSTEP_COUNTERS =
    new BooleanConfOption("giraph.useSuperstepCounters", true,
        "Use superstep counters? (boolean)");

On 9/9/13 6:43 AM, Claudio Martella wrote:
No, I used a different counters limit on that hadoop version. Setting mapreduce.job.counters.limit to a higher number and restarting the JT and TT worked for me. Maybe 64000 might be too high? Try setting it to 512. It does not look like the case, but who knows.

On Mon, Sep 9, 2013 at 2:57 PM, Christian Krause m...@ckrause.org wrote:
Sorry, it still doesn't work (I ran into a different problem before I reached the limit). I am using Hadoop 0.20.203.0. Is the limit of 120 counters maybe hardcoded? Cheers Christian

On 09.09.2013 08:29, Christian Krause m...@ckrause.org wrote:
I changed the property name to mapred.job.counters.limit and restarted it again. Now it works. Thanks, Christian

2013/9/7 Claudio Martella claudio.marte...@gmail.com:
did you restart TT and JT?
On Sat, Sep 7, 2013 at 7:09 AM, Christian Krause m...@ckrause.org wrote:
Hi, I've increased the counter limit in mapred-site.xml, but I still get the error: Exceeded counter limits - Counters=121 Limit=120. Groups=6 Limit=50. This is my config:

cat conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  ...
  <property>
    <name>mapreduce.job.counters.limit</name>
    <value>64000</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>240</value>
  </property>
  ...
</configuration>

Any ideas? Cheers, Christian

-- Claudio Martella claudio.marte...@gmail.com
Re: Finding missing links in a lineage graph..
Hi Sushanta,

you'll have to write your own algorithm that acts depending on the labels along the edges.

On Tue, Sep 10, 2013 at 9:46 AM, Sushanta Pradhan sushanta.prad...@talentica.com wrote:
Hi, I am trying to create a lineage graph from incomplete data, i.e. a few relationships are missing. Example: if I have the following subset of a lineage graph:

Ram ---child--- Luv
Ram ---wife--- Sita

the full lineage graph would be:

Ram ---child--- Luv
Ram ---wife--- Sita
Sita ---child--- Luv
Luv ---father--- Ram
Luv ---mother--- Sita

Is there an API in Giraph which takes certain rules as input and can find these missing links and create them? Thanks, Sushant

-- Claudio Martella claudio.marte...@gmail.com
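[Editor's note] The "write your own algorithm" suggestion amounts to a small fixed-point computation over labeled edges. A hypothetical sketch over (subject, relation, object) triples; the rules are invented for this one example (they assume a wife's husband is the father) and are not any Giraph API:

```python
def complete(triples):
    """Fixed-point closure of (subject, relation, object) triples under
    three invented rules, illustrative only:
      R1: (x, child, c) and (x, wife, y)  =>  (y, child, c)
      R2: (x, child, c) and (x, wife, _)  =>  (c, father, x)
      R3: (y, child, c) and (_, wife, y)  =>  (c, mother, y)
    """
    facts = set(triples)
    while True:
        new = set()
        for (x, r, c) in facts:
            if r != "child":
                continue
            for (a, r2, b) in facts:
                if r2 != "wife":
                    continue
                if a == x:  # x has wife b: b shares the child, x is father
                    new.add((b, "child", c))
                    new.add((c, "father", x))
                if b == x:  # x is someone's wife: x is the mother
                    new.add((c, "mother", x))
        if new <= facts:  # nothing inferred: fixed point reached
            return facts
        facts |= new

facts = complete({("Ram", "child", "Luv"), ("Ram", "wife", "Sita")})
```

In Giraph the same iteration would map naturally onto supersteps: each vertex applies the rules to its incident edges and messages the inferred edges to the vertices that should own them.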
Re: Counter limit
You can set it in your giraph-site.xml, but it should work on the command line.

On Tue, Sep 10, 2013 at 1:44 PM, Christian Krause m...@ckrause.org wrote:
I still see the number of counters increasing in the job tracker :(. Can I also set it in my giraph-site.xml or directly in my MasterCompute class? Cheers, Christian

-- Claudio Martella claudio.marte...@gmail.com
Re: Out of core execution has no effect on GC crash
did you extend the heap available to the mapper tasks? e.g. through mapred.child.java.opts.

On Tue, Sep 10, 2013 at 12:50 AM, Alexander Asplund alexaspl...@gmail.com wrote:
Thanks for the reply. I tried setting giraph.maxPartitionsInMemory to 1, but I'm still getting OOM: GC limit exceeded. Are there any particular cases the OOC will not be able to handle, or is it supposed to work in all cases? If the latter, it might be that I have made some configuration error.

-- Claudio Martella claudio.marte...@gmail.com
Re: Out of core execution has no effect on GC crash
OOC is used also at input superstep. Try to decrease the number of partitions kept in memory.

On Sat, Sep 7, 2013 at 1:37 AM, Alexander Asplund alexaspl...@gmail.com wrote:
Hi, I'm trying to process a graph that is about 3 times the size of available memory. On the other hand, there is plenty of disk space. I have enabled the giraph.useOutOfCoreGraph property, but it still crashes with OutOfMemoryError: GC limit exceeded when I try running my job. I'm wondering if the spilling is supposed to work during the input step. If so, are there any additional steps that must be taken to ensure it functions? Regards, Alexander Asplund

-- Claudio Martella claudio.marte...@gmail.com
Re: MySQL Table
Hi Bu, no, currently we do not have a DBInputFormat. We have an open issue with a Google Summer of Code student working on a GoraInputFormat, which also supports reading from RDBMSs through Gora. However, if/when it gets in, it will not provide semantics as rich as DBInputFormat's: you'll only be able to issue scan-like/range queries, instead of arbitrary queries as with DBInputFormat. I think that creating a DB[Vertex|Edge]InputFormat starting from the hadoop DBInputFormat should not be too hard and could prove to be a very useful contribution. If you think about providing an implementation, I can provide guidance. Best, Claudio On Fri, Sep 6, 2013 at 1:45 AM, Bu Xiao buxia...@gmail.com wrote: Hi Girapher, I am currently working on an algorithm that requires reading the vertices from a MySQL table and not from HDFS. I thought that there has to be a way of reading data from a SQL table, since Giraph is built on top of Hadoop, but I do not seem to figure this part out. Do you have a class similar to the DBInputFormat in Hadoop? Thank you very much for your help. -- Claudio Martella claudio.marte...@gmail.com
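A DB[Vertex]InputFormat of the kind suggested above would essentially wrap Hadoop's DBInputFormat: delegate split generation and record reading to it, and turn each row into a vertex. The Hadoop half might look like the sketch below; the table, column, connection and class names are invented for illustration, and the Giraph wrapper itself is left out.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

/** One row of a hypothetical "vertices(id BIGINT, value DOUBLE)" table. */
public class VertexRow implements Writable, DBWritable {
  long id;
  double value;

  @Override public void readFields(ResultSet rs) throws SQLException {
    id = rs.getLong("id");
    value = rs.getDouble("value");
  }
  @Override public void write(PreparedStatement ps) throws SQLException {
    ps.setLong(1, id);
    ps.setDouble(2, value);
  }
  @Override public void readFields(DataInput in) throws IOException {
    id = in.readLong();
    value = in.readDouble();
  }
  @Override public void write(DataOutput out) throws IOException {
    out.writeLong(id);
    out.writeDouble(value);
  }

  /** Point DBInputFormat at the MySQL table; splits are ranges over ORDER BY id. */
  public static void configure(Job job) {
    DBConfiguration.configureDB(job.getConfiguration(),
        "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/graphdb", "user", "pass");
    DBInputFormat.setInput(job, VertexRow.class,
        "vertices", null /* WHERE conditions */, "id" /* ORDER BY */,
        "id", "value");
  }
}
```

A Giraph DBVertexInputFormat would then return DBInputFormat's splits from getSplits() and wrap its RecordReader in a VertexReader that builds a Vertex out of each VertexRow.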
Re: Dynamic Graphs
Hi Mirko, this is in general the kind of approach I was suggesting, but looked at from a broader perspective. I'd tend to avoid calling other tools such as Hive or Pig often to compute injections, as Giraph is still a batch-processing system, and this could really introduce latency and reduce throughput. I feel that if the injection of vertices and edges really required such complexity (such as computing them with M/R), then one could just create a pipeline of jobs. But this is only my superficial analysis/speculation; I can see your point on integration, and your proposal is very interesting. On Sun, Aug 25, 2013 at 8:55 AM, Mirko Kämpf mirko.kae...@cloudera.com wrote: Good morning Gentlemen, as far as I understand your thread, you are talking about the same topic I have been thinking about and working on for some time. I work on a research project focused on the evolution of networks and network dynamics in networks of networks. My understanding of Marco's question is that he needs to change node properties or even wants to add nodes to the graph while it is processed, right? With the WorkerContext we could construct a Connector to the outside world, not just for loading data from HDFS, which also requires a preprocessing step for the data to be loaded. I think about HBase often. All my nodes and edges live in HBase. From there it is quite easy to load new data based on a simple Scan, or even, if the WorkerContext triggers a Hive or Pig script, one can automatically reorganize or extract relevant new links / nodes which have to be added to the graph. Such an approach means that after n supersteps of the Giraph layer an additional utility-step (triggered via the WorkerContext, or any other better-fitting Giraph class - not sure yet where to start) is executed. Before such a step the state of the graph is persisted to allow fallback or resume.
The utility-step can be a processing (MR, Mahout) or just a load (from HDFS, HBase) operation, and it allows a kind of clocked data flow directly into a running Giraph application. I think this is a very important feature in Complex Systems research, as we have interacting layers which change in parallel. In this picture the Giraph steps are the steps of layer A, let's say something that's going on on top of a network, and the utility-step expresses the changes in the underlying structure affecting the network itself, but based on the data / properties of the second subsystem, e.g. the agents operating on top of the network. I created a tool which worked like this - but not at scale - and it was at a time before Giraph. What do you think, is there a need for such a kind of extension in the Giraph world? Have a nice Sunday. Best wishes Mirko -- -- Mirko Kämpf *Trainer* @ Cloudera tel: +49 *176 20 63 51 99* skype: *kamir1604* mi...@cloudera.com On Wed, Aug 21, 2013 at 3:30 PM, Claudio Martella claudio.marte...@gmail.com wrote: As I said, the injection of the new vertices/edges would have to be done manually, hence without any support from the infrastructure. I'd suggest you implement a WorkerContext class that supports the reading of a specific file with a specific format (under your control) from HDFS, and that is accessed by this particular special vertex (e.g. based on the vertex ID). Does this make sense? On Wed, Aug 21, 2013 at 2:13 PM, Marco Aurelio Barbosa Fagnani Lotz m.a.b.l...@stu12.qmul.ac.uk wrote: Dear Mr. Martella, Once the conditions for updating the vertex database are met, what is the best way for the Injector Vertex to call an input reader again? I am able to access all the HDFS data, but I guess the vertex would need to have access to the input splits and also the vertex input format that I designate. Am I correct? Or is there a way one can just ask Zookeeper to create new splits and distribute them to the workers, given a path in DFS?
Best Regards, Marco Lotz -- *From:* Claudio Martella claudio.marte...@gmail.com *Sent:* 14 August 2013 15:25 *To:* user@giraph.apache.org *Subject:* Re: Dynamic Graphs Hi Marco, Giraph currently does not support that. One way of doing this would be by having a specific (pseudo-)vertex act as the injector of the new vertices and edges. For example, it would read a file from HDFS and call the mutation API during the computation, superstep after superstep. On Wed, Aug 14, 2013 at 3:02 PM, Marco Aurelio Barbosa Fagnani Lotz m.a.b.l...@stu12.qmul.ac.uk wrote: Hello all, I would like to know if there is any way to use dynamic graphs with Giraph. By dynamic I mean graphs that may change while Giraph is computing. The changes are in the input file and are not caused by the graph computation itself. Is there any way to analyse this using Giraph? If not, does anyone have an idea/suggestion on whether it is possible to modify the framework in order to process it? Best Regards, Marco Lotz
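The injector-vertex idea above can be sketched in code. The following is only an illustration, not a tested implementation: the reserved injector ID, the HDFS path convention, and the tab-separated "src dst weight" update format are all invented, and it assumes Giraph's BasicComputation mutation API (addEdgeRequest and friends).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.giraph.edge.EdgeFactory;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class InjectorComputation extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  /** ID reserved for the pseudo-vertex that injects graph changes. */
  private static final long INJECTOR_ID = -1L;

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (vertex.getId().get() == INJECTOR_ID) {
      // One update file per superstep, dropped into HDFS by an external process.
      Path updates = new Path("/updates/superstep-" + getSuperstep());
      FileSystem fs = FileSystem.get(getConf());
      if (fs.exists(updates)) {
        BufferedReader in =
            new BufferedReader(new InputStreamReader(fs.open(updates)));
        String line;
        while ((line = in.readLine()) != null) {
          String[] t = line.split("\t");   // src \t dst \t weight
          addEdgeRequest(new LongWritable(Long.parseLong(t[0])),
              EdgeFactory.create(new LongWritable(Long.parseLong(t[1])),
                  new FloatWritable(Float.parseFloat(t[2]))));
        }
        in.close();
      }
      return; // the injector takes no part in the actual algorithm
    }
    // ... the normal algorithm for regular vertices goes here ...
    vertex.voteToHalt();
  }
}
```

Mutations requested in superstep S become visible in superstep S+1, which matches the "superstep after superstep" injection described above.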
Re: FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist.
.. WatchedEvent state:SyncConnected type:None path:null [zk: 127.0.0.1:2181(CONNECTED) 0] ls / [hbase, zookeeper] [zk: 127.0.0.1:2181(CONNECTED) 1] However, I am a bit confused. If I look in the zookeeper log-file I see this port 2181 'Address already in use' error, 2013-09-03 10:52:24,412 [myid:] - INFO [main:ZooKeeperServer@735] - minSessionTimeout set to -1 2013-09-03 10:52:24,413 [myid:] - INFO [main:ZooKeeperServer@744] - maxSessionTimeout set to -1 2013-09-03 10:52:24,436 [myid:] - INFO [main:NIOServerCnxnFactory@99] - binding to port 0.0.0.0/0.0.0.0:2181 2013-09-03 10:52:24,447 [myid:] - ERROR [main:ZooKeeperServerMain@68] - Unexpected exception, exiting abnormally java.net.BindException: Address already in use at sun.nio.ch.Net.bind(Native Method) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:52) at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:100) at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:115) at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:91) The process listening on port 2181 is 2892, which turns out to be HBase. [root@localhost giraph]# fuser 2181/tcp 2181/tcp: 2892 [root@localhost giraph]# ps aux | grep 2892 hbase 2892 0.1 3.2 719592 119624 ? Sl Aug29 7:35 /usr/java/jdk1.6.0_31/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx500m -XX:+UseConcMarkSweepGC -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-master-localhost.localdomain.log -Dhbase.home.dir=/usr/lib/hbase/bin/.. .. So I am not sure what my zookeeper client is connecting to. It seems to be connecting to a zookeeper server but when I do 'ps' I cannot see a zookeeper server running. 
Here is my zoo.cfg file,
maxClientCnxns=50
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/lib/zookeeper
# the port at which the clients will connect
clientPort=2181
server.1=localhost:2888:3888
Thanks for any help, Ken -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com
Re: FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist.
(QuorumPeerMain.java:121) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:79) [root@localhost giraph]# Thank you for any help, Ken -- From: claudio.marte...@gmail.com Date: Tue, 3 Sep 2013 12:43:59 +0200 Subject: Re: FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist. To: user@giraph.apache.org can you try defining the zookeeper manager directory from the command line? like this -D giraph.zkManagerDirectory=/path/in/hdfs/foobar you'll have to delete this directory by hand before each job. Just to see if it solves the problem. Then I could know how to fix it. On Tue, Sep 3, 2013 at 12:32 PM, Ken Williams zoo9...@hotmail.com wrote: Hi Pradeep, Yes, the zookeeper server is definitely running, I can connect to it with the command-line client [root@localhost giraph]# zkCli.sh -server 127.0.0.1:2181 Connecting to 127.0.0.1:2181 2013-09-03 11:15:45,987 [myid:] - INFO [main:Environment@100] - Client environment:zookeeper.version=3.4.3-cdh4.1.1--1, built on 10/16/2012 17:34 GMT 2013-09-03 11:15:45,990 [myid:] - INFO [main:Environment@100] - Client environment:host.name=localhost.localdomain 2013-09-03 11:15:45,990 [myid:] - INFO [main:Environment@100] - Client environment:java.version=1.6.0_31 .. WatchedEvent state:SyncConnected type:None path:null [zk: 127.0.0.1:2181(CONNECTED) 0] ls / [hbase, zookeeper] [zk: 127.0.0.1:2181(CONNECTED) 1] However, I am a bit confused. 
If I look in the zookeeper log-file I see this port 2181 'Address already in use' error, 2013-09-03 10:52:24,412 [myid:] - INFO [main:ZooKeeperServer@735] - minSessionTimeout set to -1 2013-09-03 10:52:24,413 [myid:] - INFO [main:ZooKeeperServer@744] - maxSessionTimeout set to -1 2013-09-03 10:52:24,436 [myid:] - INFO [main:NIOServerCnxnFactory@99] - binding to port 0.0.0.0/0.0.0.0:2181 2013-09-03 10:52:24,447 [myid:] - ERROR [main:ZooKeeperServerMain@68] - Unexpected exception, exiting abnormally java.net.BindException: Address already in use at sun.nio.ch.Net.bind(Native Method) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:52) at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:100) at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:115) at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:91) The process listening on port 2181 is 2892, which turns out to be HBase. [root@localhost giraph]# fuser 2181/tcp 2181/tcp: 2892 [root@localhost giraph]# ps aux | grep 2892 hbase 2892 0.1 3.2 719592 119624 ? Sl Aug29 7:35 /usr/java/jdk1.6.0_31/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx500m -XX:+UseConcMarkSweepGC -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-master-localhost.localdomain.log -Dhbase.home.dir=/usr/lib/hbase/bin/.. .. So I am not sure what my zookeeper client is connecting to. It seems to be connecting to a zookeeper server but when I do 'ps' I cannot see a zookeeper server running. 
Here is my zoo.cfg file, maxClientCnxns=50 # The number of milliseconds of each tick tickTime=2000 # The number of ticks that the initial synchronization phase can take initLimit=10 # The number of ticks that can pass between # sending a request and getting an acknowledgement syncLimit=5 # the directory where the snapshot is stored. dataDir=/var/lib/zookeeper # the port at which the clients will connect clientPort=2181 server.1=localhost:2888:3888 Thanks for any help, Ken -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com
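Given the diagnosis above (something already answering on 2181 while Giraph's own managed ZooKeeper fails to come up), two experiments suggest themselves. Both commands are only sketches: the jar, class, and path names are placeholders.

```shell
# Option 1: reuse the ZooKeeper that zkCli.sh can already reach,
# instead of letting Giraph spawn and manage its own instance.
hadoop jar giraph-ex.jar org.apache.giraph.GiraphRunner \
  -D giraph.zkList=localhost:2181 \
  org.example.MyComputation -vif org.example.MyVertexInputFormat \
  -vip /input -of org.example.MyVertexOutputFormat -op /output -w 1

# Option 2: relocate the ZooKeeper manager directory, as suggested
# earlier in the thread, clearing it by hand before each job.
hadoop fs -rmr /tmp/giraph-zk
hadoop jar giraph-ex.jar org.apache.giraph.GiraphRunner \
  -D giraph.zkManagerDirectory=/tmp/giraph-zk \
  org.example.MyComputation -vif org.example.MyVertexInputFormat \
  -vip /input -of org.example.MyVertexOutputFormat -op /output -w 1
```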
Re: FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist.
job.GiraphConfigurationValidator: Output format vertex index type is not known
13/09/02 17:06:36 WARN job.GiraphConfigurationValidator: Output format vertex value type is not known
13/09/02 17:06:36 WARN job.GiraphConfigurationValidator: Output format edge value type is not known
13/09/02 17:06:36 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4)
13/09/02 17:06:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/09/02 17:06:40 INFO mapred.JobClient: Running job: job_201308291126_0029
13/09/02 17:06:41 INFO mapred.JobClient: map 0% reduce 0%
13/09/02 17:06:51 INFO mapred.JobClient: Job complete: job_201308291126_0029
13/09/02 17:06:51 INFO mapred.JobClient: Counters: 6
13/09/02 17:06:51 INFO mapred.JobClient: Job Counters
13/09/02 17:06:51 INFO mapred.JobClient: Failed map tasks=1
13/09/02 17:06:51 INFO mapred.JobClient: Launched map tasks=2
13/09/02 17:06:51 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=16515
13/09/02 17:06:51 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
13/09/02 17:06:51 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/09/02 17:06:51 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
[root@localhost giraph]#
There are no errors but no output is produced, and in the Web UI I can see the 2 map tasks have both failed. When I look in the log files this is the exception I see thrown:
java.lang.IllegalStateException: run: Caught an unrecoverable exception java.io.FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist.
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:102)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:645)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist.
at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:790)
at org.apache.giraph.graph.GraphTaskManager.startZooKeeperManager(GraphTaskManager.java
Every time I run a new job, it throws this same error. I have a copy of Zookeeper installed here,
[root@localhost giraph]# /usr/lib/zookeeper/bin/zkServer.sh status
JMX enabled by default
Using config: /usr/lib/zookeeper/bin/../conf/zoo.cfg
Mode: standalone
[root@localhost giraph]#
Any help would be greatly appreciated. Thank you, Ken -- Pradeep Kumar -- Claudio Martella claudio.marte...@gmail.com
Re: FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist.
) at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:115) at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:91) The process listening on port 2181 is 2892, which turns out to be HBase. [root@localhost giraph]# fuser 2181/tcp 2181/tcp: 2892 [root@localhost giraph]# ps aux | grep 2892 hbase 2892 0.1 3.2 719592 119624 ? Sl Aug29 7:35 /usr/java/jdk1.6.0_31/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx500m -XX:+UseConcMarkSweepGC -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-master-localhost.localdomain.log -Dhbase.home.dir=/usr/lib/hbase/bin/.. .. So I am not sure what my zookeeper client is connecting to. It seems to be connecting to a zookeeper server but when I do 'ps' I cannot see a zookeeper server running. Here is my zoo.cfg file, maxClientCnxns=50 # The number of milliseconds of each tick tickTime=2000 # The number of ticks that the initial synchronization phase can take initLimit=10 # The number of ticks that can pass between # sending a request and getting an acknowledgement syncLimit=5 # the directory where the snapshot is stored. dataDir=/var/lib/zookeeper # the port at which the clients will connect clientPort=2181 server.1=localhost:2888:3888 Thanks for any help, Ken -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com
Re: Passing Custom Arguments for giraph.zkList
zk1 is supposed to be a hostname. On Thu, Aug 29, 2013 at 11:05 PM, Ramani, Arun aram...@paypal.com wrote: Hi, I am trying to pass a zookeeper quorum to my giraph job and it throws the following exception: 13/08/29 13:14:38 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one. 13/08/29 13:14:38 INFO utils.ConfigurationUtils: No output format specified. Ensure your OutputFormat does not require one. 13/08/29 13:14:38 INFO utils.ConfigurationUtils: Setting custom argument [giraph.zkList] to zk1 in GiraphConfiguration Exception in thread main java.lang.IllegalArgumentException: Unable to parse custom argument: zk2:port at org.apache.giraph.utils.ConfigurationUtils.populateGiraphConfiguration(ConfigurationUtils.java:288) at org.apache.giraph.utils.ConfigurationUtils.parseArgs(ConfigurationUtils.java:147) at com.paypal.risk.rd.giraph.AccountPropagation.run(AccountPropagation.java:46) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at com.paypal.risk.rd.giraph.AccountPropagation.main(AccountPropagation.java:98) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:197) I pass the zklist like this: Hadoop jar GRAPH.jar CLASSNAME -vip CLASS NAME -vif CLASS NAME -wc CLASS NAME -w worker number -ca giraph.zkList=zk1:port,zk2:port,zk3:port,zk4:port,zk5:port Please suggest what is wrong with this invocation. Thanks Arun Ramani -- Claudio Martella claudio.marte...@gmail.com
Re: Passing Custom Arguments for giraph.zkList
the problem is not the format of the string, but the way you're passing it. Try passing it as -D giraph.zkList=... before the giraphrunner options. that should work. On Thu, Aug 29, 2013 at 11:47 PM, Ramani, Arun aram...@paypal.com wrote: Hi Claudio, Yes zk1, zk2, zk3, zk4 and zk5 are all zookeeper hostnames. These 5 hosts make a zookeeper quorum. Please let me know how to pass this. Thanks Arun Ramani From: Claudio Martella claudio.marte...@gmail.com Reply-To: user@giraph.apache.org user@giraph.apache.org Date: Thursday, August 29, 2013 2:18 PM To: user@giraph.apache.org user@giraph.apache.org Subject: Re: Passing Custom Arguments for giraph.zkList zk1 is supposed to be a hostname. On Thu, Aug 29, 2013 at 11:05 PM, Ramani, Arun aram...@paypal.com wrote: Hi, I am trying to pass a zookeeper quorum to my giraph job and it throws the following exception: 13/08/29 13:14:38 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one. 13/08/29 13:14:38 INFO utils.ConfigurationUtils: No output format specified. Ensure your OutputFormat does not require one. 
13/08/29 13:14:38 INFO utils.ConfigurationUtils: Setting custom argument [giraph.zkList] to zk1 in GiraphConfiguration Exception in thread main java.lang.IllegalArgumentException: Unable to parse custom argument: zk2:port at org.apache.giraph.utils.ConfigurationUtils.populateGiraphConfiguration(ConfigurationUtils.java:288) at org.apache.giraph.utils.ConfigurationUtils.parseArgs(ConfigurationUtils.java:147) at com.paypal.risk.rd.giraph.AccountPropagation.run(AccountPropagation.java:46) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at com.paypal.risk.rd.giraph.AccountPropagation.main(AccountPropagation.java:98) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:197) I pass the zklist like this: Hadoop jar GRAPH.jar CLASSNAME -vip CLASS NAME -vif CLASS NAME -wc CLASS NAME -w worker number -ca giraph.zkList=zk1:port,zk2:port,zk3:port,zk4:port,zk5:port Please suggest what is wrong with this invocation. Thanks Arun Ramani -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com
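Concretely, the fix suggested above is to hand the quorum to Hadoop's GenericOptionsParser via -D rather than to Giraph's custom-argument (-ca) parser, which, judging by the "Unable to parse custom argument: zk2:port" error, splits the value at the commas and then expects each piece to be a key=value pair. A sketch of the corrected invocation (class names and paths are placeholders; 2181 stands in for the elided "port" only because it is the conventional ZooKeeper client port):

```shell
hadoop jar graph.jar org.example.AccountPropagation \
  -D giraph.zkList=zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181 \
  -vif org.example.MyVertexInputFormat -vip /input \
  -wc org.example.MyWorkerContext -w 10
```

The -D options must come before the GiraphRunner-style options, since GenericOptionsParser only consumes leading arguments.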
Re: Help needed for Running my own java programs in Giraph
OK, then I'm going to open an issue for that. On Mon, Aug 26, 2013 at 11:23 AM, Vivek Sembium vivek.semb...@gmail.com wrote: Yes, for the zookeeper problem I passed a separate jar through the -libjars command. If I use additional jars zookeeper fails. On Aug 26, 2013 2:51 PM, Claudio Martella claudio.marte...@gmail.com wrote: there must be a misunderstanding. i was referring to the zookeeper problem. On Mon, Aug 26, 2013 at 11:14 AM, Vivek Sembium vivek.semb...@gmail.com wrote: No. I added my files (it was just a copy of one of the example programs in a different package) to the jar files of giraph. But it was still giving me a ClassNotFoundException. Can you give me some simple example program with instructions on how to deploy it? Then I can start playing with Giraph, make changes to the program and learn, and then start working on my project in Giraph. I will be very thankful if you can help me with this. Thanking you -Vivek Sembium On Mon, Aug 26, 2013 at 2:37 PM, Claudio Martella claudio.marte...@gmail.com wrote: but you were still using an additional jar added through -libjars, right? On Mon, Aug 26, 2013 at 8:43 AM, Vivek Sembium vivek.semb...@gmail.com wrote: @Claudio Martella Your solution didn't work either. I basically tried copying the PageRankBenchmark to my own package and renamed the package. It compiles fine with giraph, but I couldn't run it even after adding those files to the giraph jar before deployment. Help? On Sun, Aug 25, 2013 at 6:33 PM, Claudio Martella claudio.marte...@gmail.com wrote: you have this problem when you use two jars (one with giraph and one with your classes) instead of a single fat-jar, correct? I tracked the same problem a few weeks ago; basically zookeeper is run passing the wrong jar. On Sat, Aug 24, 2013 at 4:51 PM, Vivek Sembium vivek.semb...@gmail.com wrote: Thank you for your suggestion. It worked; it's no longer giving a class-not-found exception. But it's giving me a new error: it's stopping at map 0% and reduce 0%.
Upon inspection I found that it's unable to connect to the zookeeper service.
java.lang.IllegalStateException: run: Caught an unrecoverable exception onlineZooKeeperServers: Failed to connect in 10 tries!
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.IllegalStateException: onlineZooKeeperServers: Failed to connect in 10 tries!
at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:727)
at org.apache.giraph.graph.GraphTaskManager.startZooKeeperManager(GraphTaskManager.java:371)
at org.apache.giraph.graph.GraphTaskManager.setup(GraphTaskManager.java:204)
at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:59)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:89)
... 7 more
Immediately afterwards I ran the page rank benchmark and it executed successfully, both from giraph in the lib directory and also from giraph's own directory. Can you give me a very simple java program (finding the maximum in a graph, or a simple page rank program) in giraph, along with its jar file and input files, which I can place in the lib directory of hadoop and test if it's working? And also the command to execute it. This should be added to the documentation, so newcomers can quickly set up giraph and concentrate on their project. On Sat, Aug 24, 2013 at 7:12 PM, Ahmet Emre Aladağ emre.ala...@agmlab.com wrote: It isn't asking for edge input. It says make sure you don't need it. A warning for the case you may have forgotten to give edge input when you really needed it.
The cause of your error is what I'm wondering about nowadays. I'm having a similar problem. Currently I'm using a workaround: put all the jars (giraph-core and my module giraph-nutch) in the lib folder of hadoop. Then it works. But there should be a clean way of doing this. I should be able to say hadoop jar fat.jar ... Any help appreciated. -- *From:* Vivek Sembium vivek.semb...@gmail.com *To:* user@giraph.apache.org *Sent:* Saturday, 24 August 2013 11:51:49 *Subject:* Re: Help needed for Running my own java programs in Giraph I tried with and without exporting the hadoop classpath; I get the same error. Here's the command that I tried: hadoop jar /mnt/a1/sda4/hadoop/giraph/giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-1.0.2-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -libjars /mnt/a99/d0/vivek/workspace/Giraph/bin/SimplePageRankComputation.jar
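The workaround discussed above (copying jars into Hadoop's lib folder, or fighting -libjars) can be sidestepped with the single fat jar everyone in this thread converges on, so that your classes, giraph-core, and its dependencies travel together and ZooKeeper is launched with the right jar. A sketch, assuming a Maven project configured with the shade or assembly plugin; the artifact and class names are placeholders:

```shell
mvn clean package
hadoop jar target/mygraph-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner org.example.MyComputation \
  -vif org.example.MyVertexInputFormat -vip /input \
  -of org.example.MyVertexOutputFormat -op /output -w 4
```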
Re: Help needed for Running my own java programs in Giraph
you mean by running zookeeper independently? On Mon, Aug 26, 2013 at 3:16 PM, Kyle Orlando kyle.r.orla...@gmail.comwrote: We were also experiencing similar problems when specifying -libjars as opposed to just using a fat jar. I believe we fixed it by setting the giraph.zkList property, but this only appears to work when we list one node as a zookeeper. On Mon, Aug 26, 2013 at 8:55 AM, Claudio Martella claudio.marte...@gmail.com wrote: OK, then I'm going to open an issue for that. On Mon, Aug 26, 2013 at 11:23 AM, Vivek Sembium vivek.semb...@gmail.comwrote: Yes for the zookeeper problem I passed a seperate jar through -libjars command. If I use additional jars zookeeper fails. On Aug 26, 2013 2:51 PM, Claudio Martella claudio.marte...@gmail.com wrote: there must be a misunderstanding. i was referring to the zookeeper problem. On Mon, Aug 26, 2013 at 11:14 AM, Vivek Sembium vivek.semb...@gmail.com wrote: No. I added my files(it was just a copy of one of the example program to a different package) to the jar files of giraph. But it was still giving me classNotFoundException. Can you give me some simple example program with instructions on how to deploy it. So I can start playing with giraph and make changes to the program and learn, then start working on my project in giraph. I will be very thankful if you can help me with this. Thanking you -Vivek Sembium On Mon, Aug 26, 2013 at 2:37 PM, Claudio Martella claudio.marte...@gmail.com wrote: but you were still using an additional jar added through -libjars, right? On Mon, Aug 26, 2013 at 8:43 AM, Vivek Sembium vivek.semb...@gmail.com wrote: @Claudio Martella Your solution didnt work either. I basically tried copying the pageRankBenchmark to my own package, renamed the package. It compiles fine with giraph. But I couldnt run it even if I add those files to giraph jar before deployment. Help? 
On Sun, Aug 25, 2013 at 6:33 PM, Claudio Martella claudio.marte...@gmail.com wrote: you have this problem when you use two jars (one with giraph and one with your classes) instead of a single fat-jar, correct? I tracked the same problem a few weeks ago, basically zookeeper is run passing the wrong jar. On Sat, Aug 24, 2013 at 4:51 PM, Vivek Sembium vivek.semb...@gmail.com wrote: Thank you for your suggestion. It worked. Its not giving class not found exception. But its giving me a new error Its stopping at map 0% and reduce 0%. Upon inspection I found that its unable to connect to zookeeper service. java.lang.IllegalStateException: run: Caught an unrecoverable exception onlineZooKeeperServers: Failed to connect in 10 tries! at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.Child.main(Child.java:249) Caused by: java.lang.IllegalStateException: onlineZooKeeperServers: Failed to connect in 10 tries! at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:727) at org.apache.giraph.graph.GraphTaskManager.startZooKeeperManager(GraphTaskManager.java:371) at org.apache.giraph.graph.GraphTaskManager.setup(GraphTaskManager.java:204) at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:59) at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:89) ... 7 more Immediately I ran page rank benchmark and it executed successfully both from giraph in lib directory and also from giraphs own directory. 
Can you give me a very simple java program(finding maximum in a graph or simple page rank program) in giraph along with its jar file and input files which I can place in my lib directory of hadoop and test if its working. And also the command to execute it. This should be added in the documentation as new comers can quickly setup giraph and concentrate on their project. On Sat, Aug 24, 2013 at 7:12 PM, Ahmet Emre Aladağ emre.ala...@agmlab.com wrote: It isn't asking for edge input. It says make sure you don't need it. A warning for the case you may have forgotten to give edge input when you really needed. The cause of your error is what I'm wondering nowadays. I'm having a similar problem. Currently I'm using a workaround: put all the jars (giraph-core and my module giraph-nutch) in the lib folder of hadoop. Then it works. But there should be a clean way of doing this. I should be able to say hadoop jar fat.jar ... Any help appreciated. -- *Kimden: *Vivek Sembium vivek.semb...@gmail.com *Kime: *user@giraph.apache.org *Gönderilenler
Re: Help needed for Running my own java programs in Giraph
yeah. i tracked the problem to what i mentioned earlier. ZK is run with the wrong jar when using -libjars. I have to figure out what's the expected behavior though, because the logic is kind of obscure in the code. On Mon, Aug 26, 2013 at 11:24 PM, Kyle Orlando kyle.r.orla...@gmail.com wrote: Yeah, exactly. We couldn't get it to work otherwise. On Mon, Aug 26, 2013 at 11:00 AM, Claudio Martella claudio.marte...@gmail.com wrote: you mean by running zookeeper independently? On Mon, Aug 26, 2013 at 3:16 PM, Kyle Orlando kyle.r.orla...@gmail.com wrote: We were also experiencing similar problems when specifying -libjars as opposed to just using a fat jar. I believe we fixed it by setting the giraph.zkList property, but this only appears to work when we list one node as a zookeeper. On Mon, Aug 26, 2013 at 8:55 AM, Claudio Martella claudio.marte...@gmail.com wrote: OK, then I'm going to open an issue for that. On Mon, Aug 26, 2013 at 11:23 AM, Vivek Sembium vivek.semb...@gmail.com wrote: Yes for the zookeeper problem I passed a separate jar through -libjars command. If I use additional jars zookeeper fails. On Aug 26, 2013 2:51 PM, Claudio Martella claudio.marte...@gmail.com wrote: there must be a misunderstanding. i was referring to the zookeeper problem. On Mon, Aug 26, 2013 at 11:14 AM, Vivek Sembium vivek.semb...@gmail.com wrote: No. I added my files (it was just a copy of one of the example program to a different package) to the jar files of giraph. But it was still giving me classNotFoundException. Can you give me some simple example program with instructions on how to deploy it. So I can start playing with giraph and make changes to the program and learn, then start working on my project in giraph. I will be very thankful if you can help me with this. Thanking you -Vivek Sembium On Mon, Aug 26, 2013 at 2:37 PM, Claudio Martella claudio.marte...@gmail.com wrote: but you were still using an additional jar added through -libjars, right?
On Mon, Aug 26, 2013 at 8:43 AM, Vivek Sembium vivek.semb...@gmail.com wrote: @Claudio Martella Your solution didn't work either. I basically tried copying the PageRankBenchmark to my own package and renamed the package. It compiles fine with giraph, but I couldn't run it even if I add those files to the giraph jar before deployment. Help? On Sun, Aug 25, 2013 at 6:33 PM, Claudio Martella claudio.marte...@gmail.com wrote: you have this problem when you use two jars (one with giraph and one with your classes) instead of a single fat jar, correct? I tracked the same problem a few weeks ago: basically, zookeeper is run passing the wrong jar. On Sat, Aug 24, 2013 at 4:51 PM, Vivek Sembium vivek.semb...@gmail.com wrote: Thank you for your suggestion. It worked: it's no longer giving the class not found exception. But it's giving me a new error: it's stopping at map 0% and reduce 0%. Upon inspection I found that it's unable to connect to the zookeeper service. java.lang.IllegalStateException: run: Caught an unrecoverable exception onlineZooKeeperServers: Failed to connect in 10 tries! at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.Child.main(Child.java:249) Caused by: java.lang.IllegalStateException: onlineZooKeeperServers: Failed to connect in 10 tries!
at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:727) at org.apache.giraph.graph.GraphTaskManager.startZooKeeperManager(GraphTaskManager.java:371) at org.apache.giraph.graph.GraphTaskManager.setup(GraphTaskManager.java:204) at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:59) at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:89) ... 7 more Immediately afterwards I ran the page rank benchmark and it executed successfully, both from giraph in the lib directory and from giraph's own directory. Can you give me a very simple java program (finding the maximum in a graph, or a simple page rank program) in Giraph, along with its jar file and input files, which I can place in the lib directory of hadoop and test whether it's working? And also the command to execute it. This should be added to the documentation, as newcomers could then quickly set up Giraph and concentrate on their project. On Sat, Aug 24, 2013 at 7:12 PM, Ahmet Emre Aladağ emre.ala...@agmlab.com wrote: It isn't asking for edge input. It says: make sure you don't need it. A warning for the case where you may have forgotten to give edge input when you really needed it. The cause of your error is what I'm
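For later readers: the workaround that emerges from this thread is to ship your classes in a single fat jar together with Giraph, rather than as a second jar via -libjars. A sketch of the invocation (the jar name, input/output paths, and computation class below are placeholders, not files from this thread):

```shell
# Build one jar that contains both Giraph and your own classes
# (e.g. with the maven-assembly or shade plugin), then run it directly.
# All names and paths here are placeholders.
hadoop jar my-app-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner my.pkg.MyComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /in/graph.txt \
  -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /out -w 2

# The problematic variant from the thread: a separate jar via -libjars,
# which at the time caused ZooKeeper to be launched with the wrong jar.
# hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
#   -libjars my-classes.jar my.pkg.MyComputation ...
```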
Re: How to utilize combiners
Hi Kyle, combiners are set by the user, as you recognized, and called automatically by the infrastructure at different moments in the path. Combined messages are passed transparently to the compute method (namely, fewer messages than a vertex would have received without a combiner). Have a look at the PageRank examples and benchmark code. Best, Claudio On Tue, Aug 20, 2013 at 8:51 PM, Kyle Orlando kyle.r.orla...@gmail.com wrote: Hey all, I was wondering if there was any example code I could look at that uses a combiner. Creating your own Combiner is easy enough, e.g. DoubleSumCombiner, but I am confused as to how/where I would use the classes in my code. For example, say I wanted to utilize the DoubleSumCombiner class to sum up all of the messages arriving at a particular vertex at the beginning of the superstep, and I wanted to do this for each vertex in the graph. Where should I instantiate a DoubleSumCombiner, when should I call the combine() and createInitialMessage() methods, etc. in the compute() method? What further confuses me is that I see that the MasterCompute class has methods for setCombiner() and getCombiner(), and that there is also a command line option -c to specify a Combiner. I'm not really sure if these are even necessary, but if they are, I don't know how these come into play either. Some clarification or direction towards an example would be nice! Thanks, -- Kyle Orlando Computer Engineering Major University of Maryland -- Claudio Martella claudio.marte...@gmail.com
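To make the mechanics concrete, here is a standalone sketch of what a sum combiner does conceptually. This is not the actual Giraph Combiner API (in the real thing you extend a combiner class such as DoubleSumCombiner and register it, e.g. via the -c option, as discussed above); it only illustrates the fold the infrastructure applies to pending messages before compute() is called, which is why you never invoke combine() yourself:

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch of a sum combiner's behavior. The combine/
// createInitialMessage pair mirrors what DoubleSumCombiner provides;
// combineAll mirrors what the framework does with them internally.
public class SumCombinerSketch {

    // Merge one incoming message into the running accumulator.
    public static double combine(double original, double message) {
        return original + message;
    }

    // The identity element the fold starts from.
    public static double createInitialMessage() {
        return 0.0;
    }

    // What the infrastructure effectively does: collapse all messages
    // addressed to one vertex into a single combined message.
    public static double combineAll(List<Double> messages) {
        double combined = createInitialMessage();
        for (double m : messages) {
            combined = combine(combined, m);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Double> pending = Arrays.asList(0.25, 0.5, 1.25);
        // The vertex's compute() would observe a single message: 2.0
        System.out.println(combineAll(pending));
    }
}
```

The point is that combineAll happens inside the framework: your compute() method simply sees fewer (here: one) incoming messages.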
Re: Dynamic Graphs
As I said, the injection of the new vertices/edges would have to be done manually, hence without any support from the infrastructure. I'd suggest you implement a WorkerContext class that supports the reading of a specific file with a specific format (under your control) from HDFS, and that is accessed by this particular special vertex (e.g. based on the vertex ID). Does this make sense? On Wed, Aug 21, 2013 at 2:13 PM, Marco Aurelio Barbosa Fagnani Lotz m.a.b.l...@stu12.qmul.ac.uk wrote: Dear Mr. Martella, Once the conditions for updating the vertex data base are met, what is the best way for the Injector Vertex to call an input reader again? I am able to access all the HDFS data, but I guess the vertex would need to have access to the input splits and also the vertex input format that I designate. Am I correct? Or is there a way one can just ask Zookeeper to create new splits and distribute them to the workers given a path in DFS? Best Regards, Marco Lotz -- From: Claudio Martella claudio.marte...@gmail.com Sent: 14 August 2013 15:25 To: user@giraph.apache.org Subject: Re: Dynamic Graphs Hi Marco, Giraph currently does not support that. One way of doing this would be by having a specific (pseudo-)vertex act as the injector of the new vertices and edges. For example, it would read a file from HDFS and call the mutation API during the computation, superstep after superstep. On Wed, Aug 14, 2013 at 3:02 PM, Marco Aurelio Barbosa Fagnani Lotz m.a.b.l...@stu12.qmul.ac.uk wrote: Hello all, I would like to know if there is any way to use dynamic graphs with Giraph. By dynamic one can read graphs that may change while Giraph is computing/deliberating. The changes are in the input file and are not caused by the graph computation itself. Is there any way to analyse this using Giraph? If not, does anyone have any idea/suggestion whether it is possible to modify the framework in order to process it?
Best Regards, Marco Lotz -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com
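As a rough illustration of the injector idea discussed above, here is a self-contained sketch of the parsing half only. It assumes an invented file format of superstep<TAB>source<TAB>target lines; the actual HDFS access would live in your WorkerContext, and the selected mutations would then be applied via the mutation API (e.g. addVertexRequest/addEdgeRequest) from the injector vertex's compute(). Those Giraph calls are omitted here:

```java
import java.util.ArrayList;
import java.util.List;

// Parsing side of an "injector" vertex for dynamic graphs. The file
// format (superstep \t src \t dst per line) is invented for this
// example; the real file, its location in HDFS, and the mutation
// calls are all under your control.
public class InjectorSketch {

    public static final class EdgeMutation {
        public final long src;
        public final long dst;
        public EdgeMutation(long src, long dst) {
            this.src = src;
            this.dst = dst;
        }
    }

    // Keep only the mutations scheduled for the given superstep.
    public static List<EdgeMutation> edgesForSuperstep(String fileContents,
                                                       long superstep) {
        List<EdgeMutation> result = new ArrayList<>();
        for (String line : fileContents.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split("\t");
            if (Long.parseLong(parts[0]) == superstep) {
                result.add(new EdgeMutation(Long.parseLong(parts[1]),
                                            Long.parseLong(parts[2])));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String file = "0\t1\t2\n1\t2\t3\n1\t3\t4\n";
        // Two edges are scheduled for injection at superstep 1.
        System.out.println(edgesForSuperstep(file, 1).size());
    }
}
```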
Re: Giraph vs good-old PVM/MPI ?
In principle you could implement Pregel through MPI (and it has been done). The idea behind Pregel was precisely to factor out typical patterns of graph processing that used to be based on message passing and barriers. A framework like Pregel/Giraph hides this complexity behind a well-defined API and programming pattern, leaving the user with only the application logic. How the rest is implemented under the hood is another story that the user does not have to worry about. On Tue, Aug 6, 2013 at 7:19 PM, Yang tedd...@gmail.com wrote: it seems that the paradigm offered by Giraph/Pregel is very similar to the programming paradigm of PVM, and to a lesser degree, MPI. Using PVM, we often engage in iterative cycles where all the nodes sync on a barrier and then enter the next cycle. So what are the extra features offered by Giraph/Pregel? I can see persistence/restarting of tasks, and maybe abstraction of the user-code-specific part into the API so that users are not concerned with the actual message passing (message passing is done by the framework). Thanks Yang -- Claudio Martella claudio.marte...@gmail.com
Re: Question regarding bin/giraph and bin/giraph-env
I think the giraph script is currently broken. I remember it not working last time I checked for a similar problem. On Thu, Aug 1, 2013 at 10:16 PM, Eli Reisman apache.mail...@gmail.com wrote: I'm not sure anyone has been running Giraph via the giraph scripts with Hbase input, maybe it's messed up. I think those messages are from a time when you could unpack the tar.gz build product in target/ somewhere else and run from that instead of passing the fat jar to the hadoop jar command yourself. On Mon, Jul 29, 2013 at 2:08 PM, Kyle Orlando kyle.r.orla...@gmail.com wrote: Hello, I am trying to use the giraph script in $GIRAPH_HOME/bin to run my giraph code. However, I cannot seem to get it to work: I keep getting: No lib directory, assuming dev environment No target directory. Build Giraph jar before proceeding. After looking at the code, I notice that it runs giraph-env. Within giraph-env, I see the following:

if [ -d $GIRAPH_HOME/lib ]; then
  for f in $GIRAPH_HOME/lib/*.jar; do
    CLASSPATH=${CLASSPATH}:$f
  done
  for f in $GIRAPH_HOME/giraph*.jar ; do
    if [ -e $f ]; then
      JAR=$f
      CLASSPATH=${CLASSPATH}:$f
      break
    fi
  done
else
  echo No lib directory, assuming dev environment
  if [ ! -d $GIRAPH_HOME/target ]; then
    echo No target directory. Build Giraph jar before proceeding.
    exit 1
  fi
  CLASSPATH2=`mvn dependency:build-classpath | grep -v [INFO]`
  CLASSPATH=$CLASSPATH:$CLASSPATH2
  for f in $GIRAPH_HOME/giraph/target/giraph*.jar; do
    if [ -e $f ]; then
      JAR=$f
      break
    fi
  done
fi

This worries me. To obtain my version of giraph, I simply cloned the git repository and used mvn -Phadoop_1.0 clean install -DskipTests in /usr/local/giraph to build everything. It appears that this script sets my GIRAPH_HOME as /usr/local/giraph, but I do not have a /usr/local/giraph/target directory. Instead, I have $GIRAPH_HOME/giraph-core/target, $GIRAPH_HOME/giraph-hbase/target, etc. Are these scripts out of date, or have I built my project incorrectly?
Thanks -- Kyle Orlando Computer Engineering Major University of Maryland -- Claudio Martella claudio.marte...@gmail.com
Re: How to retrieve and display the values aggregated by the aggregators?
Hi Kyle, good catch. ALWAYS should be set to 1. Want to write a patch to fix this? Try setting the property on the command line by putting -D giraph.textAggregatorWriter.frequency=-1 right after the GiraphRunner class. Hope this helps. Best, Claudio On Wed, Jul 24, 2013 at 10:31 PM, Kyle Orlando kyle.r.orla...@gmail.com wrote: Hi Claudio, So I checked out TextAggregatorWriter and was initially still a bit confused about how to use it to write to a text file. That's when I noticed that, in org.apache.giraph.utils.ConfigurationUtils, there is an option aw, which corresponds to an AggregatorWriterClass. I tried this out when running the SimplePageRankComputation program using my data as input by specifying this as an option: -aw org.apache.giraph.aggregators.TextAggregatorWriter. Here's the full command: hadoop jar /home/hduser/Documents/combined.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimplePageRankComputation -eif StackExchangeParsee.StackExchangeLongFloatTextEdgeInput -vif StackExchangeParsee.StackExchangeLongDoubleTextVertexValueInput -eip /in/gaming_edges.txt -vip /in/gaming_vertices.txt -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -aw org.apache.giraph.aggregators.TextAggregatorWriter -op /outPR -w 2 -mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute According to TextAggregatorWriter, it writes by default to a file called aggregatorValues. I checked my HDFS and did not see that particular file. That's when I noticed that there is a configuration option giraph.textAggregatorWriter.frequency, and that by default the frequency is set to NEVER, which means that nothing is ever created/written to a file for the aggregators. The other two frequencies are AT_THE_END and ALWAYS, which strangely both correspond to the same integer: -1. Could someone explain why this is so?
Ignoring the above uncertainty, I surmised that the property giraph.textAggregatorWriter.frequency was to be added to my giraph-site.xml. I wanted the AT_THE_END frequency, which corresponds to the value of -1. Here's the contents of my giraph-site.xml file:

<configuration>
  <property>
    <name>giraph.textAggregatorWriter.frequency</name>
    <value>-1</value>
  </property>
</configuration>

I ran the SimplePageRankComputation program again (using the verbose hadoop jar command above), and still, I couldn't find aggregatorValues on my HDFS. Could someone help me out, or at the very least rectify any misconceptions and uncertainties that I have? On Wed, Jul 24, 2013 at 12:25 PM, Claudio Martella claudio.marte...@gmail.com wrote: Hi Kyle, you can check out the AggregatorWriter interface which allows you to do that. As a matter of fact there is already a class that implements what you need (org.apache.giraph.aggregators.TextAggregatorWriter). Hope it helps. On Wed, Jul 24, 2013 at 5:19 PM, Kyle Orlando kyle.r.orla...@gmail.com wrote: Hello, I am new to Giraph and was just wondering how one could retrieve and display the global values/statistics that the aggregators keep track of. What classes and methods would I use, and would this be done in a class that extends VertexOutputFormat, or would it be done elsewhere? As an example, in the provided SimplePageRankComputation in org.apache.giraph.examples, there are three aggregators: sum, min, and max. I would like to display all of their final values (after the final superstep) in some way, such as writing them to a text file. -- Kyle Orlando Computer Engineering Major University of Maryland -- Claudio Martella claudio.marte...@gmail.com -- Kyle Orlando Computer Engineering Major University of Maryland -- Claudio Martella claudio.marte...@gmail.com
zookeeper not starting
Am I the only one who has recently been experiencing problems with zookeeper? I get the workers failing to connect to zookeeper; I presume it is not starting at all. I'm using trunk and hadoop 1.0.3. It used to work smoothly. -- Claudio Martella claudio.marte...@gmail.com
Re: Global factory for vertex IDs?
you can make use of a WorkerContext. There is one per worker, and you can put your factory there. The factory can make use of the Mapper.Context class from getContext(), and use the methods inherited from the TaskAttemptContext class (e.g. the unique task id) to get some form of worker id. Hope this helps. On Thu, Jul 4, 2013 at 8:18 AM, Christian Krause m...@ckrause.org wrote: Yes, that would be perfectly fine. How can I do this? Specifically, how do I get the ID of the worker? And can I then just use a counter field in my computation which I increase whenever I need a new ID? (So my global ID would be a pair of the worker ID and the number derived from incrementing the counter.) Cheers, Christian 2013/7/3 Avery Ching ach...@apache.org What are the requirements of your global ids? If they simply need to be unique, you can split the id space across workers and assign them incrementally. On 6/30/13 1:09 AM, Christian Krause wrote: Hi, I was wondering if there is a way to register a global factory for new vertex IDs. Currently, I have to come up with new IDs in my compute method, which does work, but with the penalty that the required memory for vertex IDs is unnecessarily high. If there was a global vertex ID factory I could just keep a global counter and increase it by one when I need a new ID. Is something like that possible, or does it conflict with the BSP computation model? The thing is, in the end vertex ID collisions are detected by Giraph, so why not also allow a global vertex ID factory... Cheers, Christian -- Claudio Martella claudio.marte...@gmail.com
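A minimal sketch of the suggested scheme, with the Giraph/Hadoop plumbing left out: assume the worker id has already been derived once per worker (e.g. from the task attempt id reachable through the WorkerContext), then pack (workerId, localCounter) into a single long so ids from different workers can never collide. The 20/44 bit split is an arbitrary choice for illustration:

```java
// Worker-local vertex-ID factory sketch. How workerId is obtained
// (e.g. via the WorkerContext's Mapper.Context / TaskAttemptContext)
// is omitted; only the collision-free packing scheme is shown.
public class VertexIdFactorySketch {

    private final long workerId;  // derived once per worker
    private long counter = 0;     // worker-local, incremented per new id

    public VertexIdFactorySketch(long workerId) {
        this.workerId = workerId;
    }

    // High 20 bits: worker id; low 44 bits: local counter.
    public long nextId() {
        return (workerId << 44) | (counter++);
    }

    public static void main(String[] args) {
        VertexIdFactorySketch w0 = new VertexIdFactorySketch(0);
        VertexIdFactorySketch w1 = new VertexIdFactorySketch(1);
        // Ids from different workers occupy disjoint ranges.
        System.out.println(w0.nextId());
        System.out.println(w1.nextId());
    }
}
```

With this approach each worker only needs one long counter, instead of coordinating a truly global counter across workers between supersteps.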
Re: Are new vertices active?
Hi, inline are my (tentative) answers. On Wed, Jun 26, 2013 at 6:34 PM, Christian Krause m...@ckrause.org wrote: Hi, if I create new vertices, will they be executed in the next superstep? And does it make a difference whether I create them using addVertexRequest() or sendMessage()? The vertex will be active. The case of sendMessage is intuitive, because a message wakes up a vertex. Another question: if I mutate the graph in superstep X and X is the last superstep, will the changes be executed? It is not clear to me whether the graph changes are executed during or before the next superstep. I'm actually not sure about our internal implementation (maybe somebody can shed light on this), but I'd expect them to be executed, given the above (presence of active vertices). And related to the last question, if I mutate the graph in superstep X, and I call getTotalNumVertices() in the next step, can I expect the updated number of vertices, or the number of vertices before the mutation? The mutations are applied at the end of a superstep and are visible in the following one. Hence in s+1 you'd see the new number of vertices. Sorry for these very basic questions, but I did not find any documentation on these details. If this is documented somewhere, it would be helpful to get a link. Cheers, Christian -- Claudio Martella claudio.marte...@gmail.com
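The timing described above can be modeled in a few lines. This toy class is not Giraph internals; it only illustrates the contract: mutations requested during superstep s are buffered and applied at the superstep barrier, so getTotalNumVertices() reflects them in s+1:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of mutation timing in a BSP superstep. Method names echo
// the Giraph API but this is an illustration, not the framework.
public class MutationTimingSketch {

    private final Set<Long> vertices = new HashSet<>();
    private final List<Long> pendingAdds = new ArrayList<>();

    public MutationTimingSketch(long... initial) {
        for (long v : initial) {
            vertices.add(v);
        }
    }

    // Called from compute() during a superstep: only buffers the request.
    public void addVertexRequest(long id) {
        pendingAdds.add(id);
    }

    // What getTotalNumVertices() would report in the current superstep.
    public int totalNumVertices() {
        return vertices.size();
    }

    // At the superstep barrier the buffered mutations are applied,
    // so they become visible in superstep s+1.
    public void superstepBarrier() {
        vertices.addAll(pendingAdds);
        pendingAdds.clear();
    }

    public static void main(String[] args) {
        MutationTimingSketch graph = new MutationTimingSketch(1, 2, 3);
        graph.addVertexRequest(4);                    // requested in s
        System.out.println(graph.totalNumVertices()); // still the old count
        graph.superstepBarrier();
        System.out.println(graph.totalNumVertices()); // updated in s+1
    }
}
```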
Re: SimpleShortestPathsComputation with Edge List input file
with the only problem that you picked an abstract class again... I advised you to use an input format that has the names of the types in the class name, hence org.apache.giraph.io.formats.IntNullTextEdgeInputFormat should work for you. On Mon, Jun 3, 2013 at 9:34 PM, Peter Holland d1...@mydit.ie wrote: Thank you for the advice Claudio. I updated the run command to use different io classes: bin/hadoop jar /home/ubuntu/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-1.0.2-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -eif org.apache.giraph.io.EdgeInputFormat -eip /simpleEdgeList/SimpleEdgeList.tsv -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /outShortestEdgeList01 -w 1 This code does start a MapReduce job but progress stays at 0%. The log file for the job has the following IOException error: MapAttempt TASK_TYPE=MAP TASKID=task_201306031954_0002_m_00 TASK_ATTEMPT_ID=attempt_201306031954_0002_m_00_0 TASK_STATUS=FAILED FINISH_TIME=1370282492527 HOSTNAME=ubuntu-VirtualBox ERROR=java.lang.Throwable: Child Error at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271) Caused by: java.io.IOException: Task process exit with nonzero status of 1. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258) So, this leaves 3 questions: Is the edge list file format correct? (a tab separated file with a .tsv extension) Is the input class correct? Is the output class correct? Thank you, Peter On 3 June 2013 01:05, Claudio Martella claudio.marte...@gmail.com wrote: Hi Peter, shortly, those are abstract classes; that's why you cannot instantiate them. You'll have to use a specific class extending those classes that is aware of the types of the signature of the vertex (I, V, E, M). Check out some classes in the format package that have those types in the class name.
On Mon, Jun 3, 2013 at 1:25 AM, Peter Holland d1...@mydit.ie wrote: Hello, I'm new to Giraph and I'm trying to run SimpleShortestPathsComputation using an edge list input file. I have some questions and an error message that hopefully I can get some help with. Edge List File Format: What is the correct format for an edge list input file? I have a .tsv file with a vertex represented as an integer. Is this correct? 5 11 1 6 6 9 6 8 8 9 . Input File Class: Is org.apache.giraph.io.formats.TextEdgeInputFormat the only input format that can be used for edge lists? Output File Class: Does the output format depend on the job you are running? I have been using org.apache.giraph.io.formats.TextVertexOutputFormat for SimpleShortestPathsComputation. Run Command: So this is the command I am using to try to run the SimpleShortestPathsComputation using an edge list input file: bin/hadoop jar /home/ubuntu/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-1.0.2-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -eif org.apache.giraph.io.formats.TextEdgeInputFormat -eip /simpleEdgeList/SimpleEdgeList.tsv -of org.apache.giraph.io.formats.TextVertexOutputFormat -op /outShortest -w 3 Error Message: When I run the above command I get the following error message: Exception in thread "main" java.lang.IllegalStateException: newInstance: Couldn't instantiate org.apache.giraph.io.formats.TextVertexOutputFormat Thank you, Peter -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com
Re: SimpleShortestPathsComputation with Edge List input file
The reason is that the particular computation (SimpleShortestPathsComputation) is expecting vertices with Long ids, while the EdgeInputFormat is parsing Integers. You have to fix one of the two accordingly. On Mon, Jun 3, 2013 at 11:22 PM, Peter Holland d1...@mydit.ie wrote: Thank you for your response Claudio. I updated the command with the input class you suggested: bin/hadoop jar /home/ubuntu/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-1.0.2-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -eif org.apache.giraph.io.formats.IntNullTextEdgeInputFormat -eip /simpleEdgeList/SimpleEdgeList.tsv -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /outShortestEdgeList01 -w 1 Unfortunately I am getting an error message: 13/06/03 23:00:08 INFO utils.ConfigurationUtils: No vertex input format specified. Ensure your InputFormat does not require one. Exception in thread "main" java.lang.IllegalArgumentException: checkClassTypes: Vertex index types don't match, vertex - class org.apache.hadoop.io.LongWritable, edge input format - class org.apache.hadoop.io.IntWritable at org.apache.giraph.job.GiraphConfigurationValidator.verifyEdgeInputFormatGenericTypes(GiraphConfigurationValidator.java:266) at org.apache.giraph.job.GiraphConfigurationValidator.validateConfiguration(GiraphConfigurationValidator.java:125) at org.apache.giraph.utils.ConfigurationUtils.parseArgs(ConfigurationUtils.java:155) at org.apache.giraph.GiraphRunner.run(GiraphRunner.java:74) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.giraph.GiraphRunner.main(GiraphRunner.java:124) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) On 3 June 2013 21:00, Claudio Martella claudio.marte...@gmail.com wrote: with the only problem that you picked an abstract class again... I advised you to use an input format that has the names of the types in the class name, hence org.apache.giraph.io.formats.IntNullTextEdgeInputFormat should work for you. On Mon, Jun 3, 2013 at 9:34 PM, Peter Holland d1...@mydit.ie wrote: Thank you for the advice Claudio. I updated the run command to use different io classes: bin/hadoop jar /home/ubuntu/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-1.0.2-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -eif org.apache.giraph.io.EdgeInputFormat -eip /simpleEdgeList/SimpleEdgeList.tsv -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /outShortestEdgeList01 -w 1 This code does start a MapReduce job but progress stays at 0%. The log file for the job has the following IOException error: MapAttempt TASK_TYPE=MAP TASKID=task_201306031954_0002_m_00 TASK_ATTEMPT_ID=attempt_201306031954_0002_m_00_0 TASK_STATUS=FAILED FINISH_TIME=1370282492527 HOSTNAME=ubuntu-VirtualBox ERROR=java.lang.Throwable: Child Error at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271) Caused by: java.io.IOException: Task process exit with nonzero status of 1. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258) So, this leaves 3 questions: Is the edge list file format correct? (a tab separated file with a .tsv extension) Is the input class correct? Is the output class correct?
Thank you, Peter On 3 June 2013 01:05, Claudio Martella claudio.marte...@gmail.com wrote: Hi Peter, shortly, those are abstract classes; that's why you cannot instantiate them. You'll have to use a specific class extending those classes that is aware of the types of the signature of the vertex (I, V, E, M). Check out some classes in the format package that have those types in the class name. On Mon, Jun 3, 2013 at 1:25 AM, Peter Holland d1...@mydit.ie wrote: Hello, I'm new to Giraph and I'm trying to run SimpleShortestPathsComputation using an edge list input file. I have some questions and an error message that hopefully I can get some help with. Edge List File Format: What is the correct format for an edge list input file? I have a .tsv file with a vertex represented as an integer. Is this correct? 5 11 1 6 6 9 6 8 8 9 . Input File Class: Is org.apache.giraph.io.formats.TextEdgeInputFormat the only input format that can be used for edge lists? Output File Class: Does the output format depend on the job you are running? I have been using
Re: External Documentation about Giraph
Hi Yazan, I suggest you insert the tutorial with the user docs in the site/ directory (hence in the Users Docs menu). It is certainly where new users would look for it, and requires less navigation than the community wiki. Thanks! On Sun, Jun 2, 2013 at 7:36 AM, Yazan Boshmaf bosh...@ece.ubc.ca wrote: JIRA issue: https://issues.apache.org/jira/browse/GIRAPH-676 On Sat, Jun 1, 2013 at 10:12 PM, Yazan Boshmaf bosh...@ece.ubc.ca wrote: @Puneet, sure! I'll ping you once we have a draft ready. Thanks! Cheers, Yazan On Sat, Jun 1, 2013 at 9:58 PM, Puneet Agarwal puagar...@yahoo.com wrote: Dear Yazan, I don't know if you need this, still - I volunteer to review such documentation, from a novice user's perspective. I am a newbie on Giraph :) Cheers - Puneet - Original Message - From: Yazan Boshmaf bosh...@ece.ubc.ca To: Maria Stylianou mars...@gmail.com Cc: user@giraph.apache.org Sent: Sunday, June 2, 2013 9:33 AM Subject: Re: External Documentation about Giraph @Maria, this sounds great! I will start drafting one based on your posts + my own experience + know-how from user/dev emails that I have gathered. I will open a JIRA ticket and keep you in the loop. Once you're available, you can give the docs another pass to improve quality. I'm certain that experienced Giraph committers will also add their own input but let's at least get a first version ready. So take your time and good luck on your thesis presentation :) @Avery, should I update the Giraph mvn site and generate a patch (as in http://giraph.apache.org/build_site.html) or just update the community's Confluence wiki? On Sat, Jun 1, 2013 at 12:09 PM, Maria Stylianou mars...@gmail.com wrote: Yazan let's do it! But I'm afraid I will be super busy till 1st of July - day of thesis presentation. After that, I can dedicate more time. On Sat, Jun 1, 2013 at 5:29 AM, Avery Ching ach...@apache.org wrote: Improving our documentation is always very nice. Thanks for doing this you two!
On 5/31/13 7:32 PM, Yazan Boshmaf wrote: Maria, I can help you with this if you are interested and have the time. If you are busy, please let me know and I will update the site docs with a variant of your tutorial. Thanks! On Thu, May 30, 2013 at 4:13 PM, Roman Shaposhnik r...@apache.org wrote: On Wed, May 29, 2013 at 2:25 PM, Maria Stylianou mars...@gmail.com wrote: Hello guys, This semester I'm doing my master thesis using Giraph on a daily basis. In my blog (marsty5.wordpress.com) I wrote some posts about Giraph, some of the new users may find them useful! And maybe some of the experienced ones can give me feedback and correct any mistakes :D So far, I described: 1. How to set up Giraph 2. What to do next - after setting up Giraph 3. How to run ShortestPaths 4. How to run PageRank Good stuff! As a shameless plug, one more way to install Giraph is via Apache Bigtop. All it takes is hooking one of these files: http://bigtop01.cloudera.org:8080/view/Bigtop-trunk/job/Bigtop-trunk-Repository/label=fedora18/lastSuccessfulBuild/artifact/repo/bigtop.repo http://bigtop01.cloudera.org:8080/view/Bigtop-trunk/job/Bigtop-trunk-Repository/label=opensuse12/lastSuccessfulBuild/artifact/repo/bigtop.repo to your yum/apt system and typing: $ sudo yum install hadoop-conf-pseudo giraph In fact we're about to release Bigtop 0.6.0 with Hadoop 2.0.4.1 and Giraph 1.0 -- so anybody interested in helping us to test this stuff -- that would be really appreciated. Thanks, Roman. P.S. There are quite a few other platforms available as well: http://bigtop01.cloudera.org:8080/view/Bigtop-trunk/job/Bigtop-trunk-Repository/ -- Maria Stylianou Intern at Telefonica, Barcelona, Spain Master Student of European Master in Distributed Computing marsty5.wordpress.com -- Claudio Martella claudio.marte...@gmail.com
Re: External Documentation about Giraph
This is a good idea. One of the things we actually miss in the new documentation is a tutorial-like entry. Maria, it could be a nice contribution. On Thu, May 30, 2013 at 11:59 PM, Yazan Boshmaf bosh...@ece.ubc.ca wrote: Maria, the posts are very helpful. Thank you. Maybe you can update Giraph's site documentation with your tutorial? On Wed, May 29, 2013 at 2:25 PM, Maria Stylianou mars...@gmail.com wrote: Hello guys, This semester I'm doing my master thesis using Giraph on a daily basis. In my blog (marsty5.wordpress.com) I wrote some posts about Giraph, some of the new users may find them useful! And maybe some of the experienced ones can give me feedback and correct any mistakes :D So far, I described: 1. How to set up Giraph 2. What to do next - after setting up Giraph 3. How to run ShortestPaths 4. How to run PageRank Thank you! Enjoy reading! (hopefully ;p) -- Maria Stylianou Intern at Telefonica, Barcelona, Spain Master Student of European Master in Distributed Computing -- Claudio Martella claudio.marte...@gmail.com
Re: Modifying a benchmark to use real input
You can still use the classes in the examples package, which are similar to those in the benchmark package but are more flexible for your own tests. On Fri, May 24, 2013 at 3:42 PM, Matt Molek mpmo...@gmail.com wrote: Oh, never mind, I think I found it by looking through GiraphRunner.java: GiraphFileInputFormat.addVertexInputPath(conf, new Path("/some/path")); On Thu, May 23, 2013 at 5:22 PM, Matt Molek mpmo...@gmail.com wrote: Hi, I'm just getting started with Giraph, and struggling a bit to understand what exactly is needed to run a minimal Giraph computation on real data, rather than the PseudoRandomVertexInputFormat. Apologies if this is covered somewhere in the docs or mailing list archives. I looked but couldn't find anything applying to the current version, and I couldn't figure out exactly how things have changed through the versions. Some older code that I tried was clearly incompatible with the current version. Trying to learn by example, I copied the current o.a.g.benchmark.ShortestPathsBenchmark and o.a.g.benchmark.ShortestPathsComputation into my own project, and modified them to run on their own without GiraphBenchmark and BenchmarkOption. Here is the new ShortestPathsBenchmark I ended up with: http://pastebin.com/h3rH6jTm When using the PseudoRandomVertexInputFormat, and some hard coded options for aggregateVertices and edgesPerVertex, this runs fine from my jar with the command: hadoop jar giraph-testing-jar-with-dependencies.jar modified_benchmarks.ShortestPathsBenchmark --workers 10 Now I'd like to use JsonLongDoubleFloatDoubleVertexInputFormat with some real data, but I see no way to specify the input path. If this was plain hadoop, I'd expect to be able to say something like JsonLongDoubleFloatDoubleVertexInputFormat.addInputPath(job, new Path("/some/path")); That's not available though. Could someone point me in the right direction with this? Am I going about this all wrong? Thanks for any help, Matt -- Claudio Martella claudio.marte...@gmail.com
Re: Error compiling Giraph
) at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
Caused by: org.apache.maven.project.DependencyResolutionException: Could not resolve dependencies for project org.apache.giraph:giraph-examples:jar:1.0.0: Failure to find org.apache.giraph:giraph-core:jar:tests:1.0.0 in http://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced
at org.apache.maven.project.DefaultProjectDependenciesResolver.resolve(DefaultProjectDependenciesResolver.java:189)
at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.getDependencies(LifecycleDependencyResolver.java:185)
... 22 more
Caused by: org.sonatype.aether.resolution.DependencyResolutionException: Failure to find org.apache.giraph:giraph-core:jar:tests:1.0.0 in http://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced
at org.sonatype.aether.impl.internal.DefaultRepositorySystem.resolveDependencies(DefaultRepositorySystem.java:375)
at org.apache.maven.project.DefaultProjectDependenciesResolver.resolve(DefaultProjectDependenciesResolver.java:183)
... 23 more
Caused by: org.sonatype.aether.resolution.ArtifactResolutionException: Failure to find org.apache.giraph:giraph-core:jar:tests:1.0.0 in http://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced
at org.sonatype.aether.impl.internal.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:538)
at org.sonatype.aether.impl.internal.DefaultArtifactResolver.resolveArtifacts(DefaultArtifactResolver.java:216)
at org.sonatype.aether.impl.internal.DefaultRepositorySystem.resolveDependencies(DefaultRepositorySystem.java:358)
... 24 more
Caused by: org.sonatype.aether.transfer.ArtifactNotFoundException: Failure to find org.apache.giraph:giraph-core:jar:tests:1.0.0 in http://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced
at org.sonatype.aether.impl.internal.DefaultUpdateCheckManager.newException(DefaultUpdateCheckManager.java:230)
at org.sonatype.aether.impl.internal.DefaultUpdateCheckManager.checkArtifact(DefaultUpdateCheckManager.java:204)
at org.sonatype.aether.impl.internal.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:427)
... 26 more
[ERROR]
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <goals> -rf :giraph-examples

Why is there a dependency problem? I have not added any code; it's the clean Giraph distribution. Is this Maven's problem? Or the fact that I did not get Giraph directly from the repo, but from the HTTP mirror directly?
Thanks in advance, Alexandros -- Claudio Martella claudio.marte...@gmail.com
Re: Extra data on vertex
Keep in mind that you cannot access a neighbor's value directly from a vertex. What you are proposing now is possible because you are using the vertex id to store your information (the URL), which makes sense in the context of a web page. As soon as you store data in the vertex value, as Avery suggests, you will have to rely on messages to inform the neighbors of the value.

On Tue, May 7, 2013 at 4:47 PM, Ahmet Emre Aladağ emre.ala...@agmlab.com wrote:

Hi,

1) What's the best way of storing extra data (such as a URL) on a vertex? I thought this would be through a class variable, but I could not find a way to access that variable from a neighbor. For example, I'd like to remove the duplicate edges going towards nodes with the same URL (the duplicate-removal phase of LinkRank). How can I learn my neighbor's url variable, targetUrl?

2) Is removing edges like this a valid approach?

public class LinkRankVertex extends Vertex<IntWritable, FloatWritable, NullWritable, FloatWritable> {

  public String url;

  public void removeDuplicateLinks() {
    int targetId;
    String targetUrl;
    Set<String> urls = new HashSet<String>();
    ArrayList<Edge<IntWritable, NullWritable>> edges = new ArrayList<Edge<IntWritable, NullWritable>>();
    for (Edge<IntWritable, NullWritable> edge : getEdges()) {
      targetId = edge.getTargetVertexId().get();
      targetUrl = ...??
      if (!urls.contains(targetUrl)) {
        urls.add(targetUrl);
        edges.add(edge);
      }
    }
    setEdges(edges);
  }
}

Thanks, Emre.

-- Claudio Martella claudio.marte...@gmail.com
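[Editor's note] The dedup logic Emre is after is simply "keep the first edge seen for each target URL", which a HashSet expresses directly. A self-contained sketch over a plain stand-in edge type (Giraph's Vertex/Edge types are replaced with a simple class here; obtaining targetUrl in a real computation would still require messaging, as noted above):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Self-contained sketch of the duplicate-removal idea: keep only the
// first edge seen per target URL. Edge is a stand-in for Giraph's Edge;
// in a real Giraph computation, targetUrl would arrive via messages.
public class DedupSketch {
    static class Edge {
        final int targetId;
        final String targetUrl;
        Edge(int targetId, String targetUrl) {
            this.targetId = targetId;
            this.targetUrl = targetUrl;
        }
    }

    static List<Edge> removeDuplicateLinks(List<Edge> in) {
        Set<String> seen = new HashSet<>();
        List<Edge> out = new ArrayList<>();
        for (Edge e : in) {
            if (seen.add(e.targetUrl)) {  // add() returns false for duplicates
                out.add(e);
            }
        }
        return out;
    }
}
```

Note that Set.add already reports whether the element was new, so the separate contains check in the original snippet can be folded into a single call.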
Google Summer of Code 2013 Giraph + Tinkerpop project
Hello lists, we have added an issue to the Giraph JIRA that we would like to have as a GSoC 2013 project. The idea is to integrate Tinkerpop Blueprints/Rexster as an input format for Giraph, to run batch computations on data stored in Blueprints-compliant graph databases. Please consider advertising this issue to potential students or people interested in this project. The related issue can be found here: https://issues.apache.org/jira/browse/GIRAPH-549 An entry in the Giraph Wiki will be added soon. Best, Claudio -- Claudio Martella claudio.marte...@gmail.com