Re: Local-only aggregators
Hi, I'm not sure aggregators necessarily require high traffic: aggregated values are first aggregated locally on each worker before they are aggregated on the (corresponding) master worker. Anyway, assuming you want to proceed, my understanding is that you want vertices on the same worker to share (aggregated) information. In that case, I'd suggest just using a WorkerContext. Hope this helps. Claudio

On Wed, Mar 25, 2015 at 12:47 AM Alessio Arleo ingar...@icloud.com wrote:

Hello everybody, I was wondering if it is possible to extend the concept of aggregator from a "global" to a "local-only" perspective. Normally, aggregators DO cause network traffic because of the cycle: Workers -> AggregatorOwner -> MasterAggregator -> AggregatorOwner -> Workers. What if I'd like to fetch and aggregate values as I would normally do with aggregators, but without causing this traffic?

Let's assume this situation:

1 - Define a custom partitioning class and let it partition the graph. This is the partitioning used to assign vertices to workers.
2 - In the computation class, every time the compute method is called on a vertex, the data needed for the computation is stored inside the vertex's neighbours but also in non-neighbouring vertices (think about a force-directed layout algorithm, for example: to compute the forces, the distances to both neighbouring and non-neighbouring vertices are needed, applying different kinds of forces).

Given that the compute class is computing on vertex X:

a - I pick information from X's neighbours as I would normally do (iterating over its edges or the incoming messages).
b - When it comes to non-neighbouring vertices, I would like to use data from X's worker only.

The first thing I tried to understand before asking this question was: does this make any sense? I am probably wrong, but I think it actually does. If I partition my graph to maximize locality, what I am actually trying to do is to reduce the network traffic as much as possible.

My doubt is that if I use aggregators to achieve this, the network traffic would be heavy, probably losing the advantages of the initial partitioning. What if I could access and modify an aggregator-like local data structure in the same fashion (i.e. "getAggregatedValue") but without broadcasting it (assuming that I do not need the aggregator to be accessible to every worker)? Or could it be possible to manually assign partition owners in order to minimise network traffic (if I need to aggregate all values from vertices in partition 3 and 3 only, I assign the partition 3 aggregator owner to the partition 3 worker)?

I hope in your comprehension and I hope I somehow caught your attention, even if for a brief moment. Ask me if something is not clear ;) Cheers!

~~~ Ing. Alessio Arleo Dottorando in Ingegneria Industriale e dell'Informazione Dottore Magistrale in Ingegneria Informatica e dell'Automazione Dottore in Ingegneria Informatica ed Elettronica Linkedin: it.linkedin.com/in/IngArleo Skype: Ing. Alessio Arleo Tel: +39 075 5853920 Cell: +39 349 0575782 ~~~
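To make the WorkerContext suggestion above concrete, here is a minimal, self-contained sketch of the worker-local aggregation pattern. It deliberately avoids Giraph's actual API: in a real job this state would live in a WorkerContext subclass, shared by all vertices computed on the same worker; the class and method names here are invented for illustration.

```java
// Minimal sketch of worker-local aggregation, independent of Giraph's API.
// In Giraph, this state would live in a WorkerContext subclass; every vertex
// computed on the same worker sees the same instance, so values can be
// aggregated without any network traffic.
public class WorkerLocalSum {
    private double sum = 0.0;   // shared per-worker state

    // Each vertex's compute() would call this on its worker's context.
    public synchronized void aggregate(double value) {
        sum += value;
    }

    // Any vertex on the same worker can read the running aggregate.
    public synchronized double get() {
        return sum;
    }
}
```

Unlike a regular aggregator, the value never leaves the worker, which is exactly the locality-preserving behaviour asked about; the trade-off is that other workers cannot see it.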
Re: Compiling Giraph for Hadoop 2.5.x and 2.6.0 -- SASL_PROPS variable error
I see more and more people running into this. I wonder whether we should add the fix to the pure_yarn profile by default, as it feels like it's here to stay. Ideas?

On Sat, Jan 10, 2015 at 7:38 PM, Eugene Koontz ekoo...@hiro-tan.org wrote:

Hi Alessio and Eli, compiling with mvn -Phadoop_yarn -Dhadoop.version=2.6.0 clean will avoid the SASL_PROPS compilation error below if you remove STATIC_SASL_SYMBOL from the munge.symbols of the hadoop_yarn profile, as follows:

diff --git a/pom.xml b/pom.xml
index cf0e1f9..8c2a561 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1194,7 +1194,7 @@ under the License.
   </modules>
   <properties>
     <hadoop.version>SET_HADOOP_VERSION_USING_MVN_DASH_D_OPTION</hadoop.version>
-    <munge.symbols>PURE_YARN,STATIC_SASL_SYMBOL</munge.symbols>
+    <munge.symbols>PURE_YARN</munge.symbols>
     <!-- TODO: add these checks eventually -->
     <project.enforcer.skip>true</project.enforcer.skip>
     <giraph.maven.dependency.plugin.skip>true</giraph.maven.dependency.plugin.skip>

In other words, when compiling Giraph against newer releases of Hadoop, there is no need for this munge symbol. The distinction between newer and older seems to be release 2.4.0 of Hadoop, as given here: https://issues.apache.org/jira/browse/HADOOP-10221 ("Add a plugin to specify SaslProperties for RPC protocol based on connection properties"). It seems like we need to add some additional profiles to distinguish pre-2.4 Hadoop (which requires the munge symbol STATIC_SASL_SYMBOL) from newer releases (which should not use it). -Eugene

On 1/8/15, 11:13 PM, Eugene Koontz wrote: Hi Alessio, I am able to reproduce your problem: https://gist.github.com/ekoontz/7dbaaf6218abb4fd7832 I'll try building Hadoop 2.6.0 and getting Giraph to work with it. -Eugene

On 1/8/15, 10:55 AM, Eli Reisman wrote: This looks like a munge symbol that needs to be added to the hadoop_yarn profile in the pom.xml. I'm thinking this is the issue a couple of people have been having on 2.5 and 2.6 trying to build the hadoop_yarn profile?

On Thu, Dec 4, 2014 at 1:01 PM, Dr. Alessio Arleo ingar...@icloud.com wrote:

Hello everybody, I am trying to compile Giraph release-1.1 for Hadoop 2.5.x and Hadoop 2.6.0 with the Maven profile hadoop_yarn. It works fine up to Hadoop 2.4.1, but when trying with a newer version of Hadoop the following error comes up. I am working with JDK 1.7 and Maven 3.2.1.

[ERROR] COMPILATION ERROR :
[INFO] -
[ERROR] /home/hadoop/git/giraph/1.1/giraph-core/target/munged/main/org/apache/giraph/comm/netty/SaslNettyClient.java:[84,68] cannot find symbol
  symbol: variable SASL_PROPS
  location: class org.apache.hadoop.security.SaslRpcServer
[ERROR] /home/hadoop/git/giraph/1.1/giraph-core/target/munged/main/org/apache/giraph/comm/netty/SaslNettyServer.java:[105,62] cannot find symbol
  symbol: variable SASL_PROPS
  location: class org.apache.hadoop.security.SaslRpcServer

Do you have any suggestions? Any would be much appreciated :) Kind regards, Alessio

-- Claudio Martella
Re: Please welcome our newest committer, Sergey Edunov!
Congrats Sergey and welcome!

On Wed, Dec 3, 2014 at 7:34 PM, Maja Kabiljo majakabi...@fb.com wrote:

I am happy to announce that the Project Management Committee (PMC) for Apache Giraph has elected Sergey Edunov to become a committer, and he accepted. Sergey has been an active member of the Giraph community, finding issues, submitting patches and reviewing code. We're looking forward to Sergey's larger involvement and future work. List of his contributions:

GIRAPH-895: Trim the edges in Giraph
GIRAPH-896: Memory leak in SuperstepMetricsRegistry
GIRAPH-897: Add an option to dump only live objects to JMap
GIRAPH-898: Remove giraph-accumulo from Facebook profile
GIRAPH-903: Detect crashes on Netty threads
GIRAPH-924: Fix checkpointing
GIRAPH-925: Unit tests should pass even if zookeeper port not available
GIRAPH-927: Decouple netty server threads from message processing
GIRAPH-933: Checkpointing improvements
GIRAPH-936: Decouple netty server threads from message processing
GIRAPH-940: Cleanup the list of supported hadoop versions
GIRAPH-950: Auto-restart from checkpoint doesn't pick up latest checkpoint
GIRAPH-963: Aggregators may fail with IllegalArgumentException upon deserialization

Best, Maja

-- Claudio Martella
Giraph counters on Yarn
Hello, is anybody on the list able to get the standard job counters printed at the end of a job when using pure YARN? I can get the logs, but I cannot find the usual stats that are printed to the command line with MapReduce. Thanks, Claudio -- Claudio Martella
Re: Enabling Giraph Level Logging - Hadoop-2.2.0
For completeness and future reference, where can they be found if you run it as a pure YARN app?

On Sun, Nov 16, 2014 at 9:07 PM, Eli Reisman apache.mail...@gmail.com wrote: If you mean running as a MapReduce application as opposed to running directly on YARN, the logs should be where MR is configured to put them, with per-worker logs in the MR task logs for the cluster job.

On Mon, Nov 10, 2014 at 11:41 AM, Charith Wickramarachchi charith.dhanus...@gmail.com wrote: Hi, I am running Apache Giraph 1.1.0 on Hadoop 2.2.0 as a MapReduce application, but I could not find the Giraph logs. It would be great if someone could tell me how to enable Apache Giraph logging. Also, I see that Giraph collects very detailed runtime statistics; how can I collect those stats? Thanks, Charith

-- Charith Dhanushka Wickramaarachchi Tel +1 213 447 4253 Web http://apache.org/~charith http://www-scf.usc.edu/~cwickram/ http://charith.wickramaarachchi.org/ Blog http://charith.wickramaarachchi.org/ http://charithwiki.blogspot.com/ Twitter @charithwiki https://twitter.com/charithwiki

This communication may contain privileged or other confidential information and is intended exclusively for the addressee/s. If you are not the intended recipient/s, or believe that you may have received this communication in error, please reply to the sender indicating that fact and delete the copy you received and in addition, you should not print, copy, retransmit, disseminate, or otherwise use the information contained in this communication. Internet communications cannot be guaranteed to be timely, secure, error or virus-free. The sender does not accept liability for any errors or omissions

-- Claudio Martella
Re: [VOTE] Apache Giraph 1.1.0 RC2
+1.

On Thu, Nov 13, 2014 at 2:28 PM, Roman Shaposhnik ro...@shaposhnik.org wrote:

This vote is for Apache Giraph, version 1.1.0 release. It fixes the following issues: http://s.apache.org/a8X

*** Please download, test and vote by Mon 11/17 noon PST

Note that we are voting upon the source (tag): release-1.1.0-RC2

Source and binary files are available at: http://people.apache.org/~rvs/giraph-1.1.0-RC2/
Staged website is available at: http://people.apache.org/~rvs/giraph-1.1.0-RC2/site/
Maven staging repo is available at: https://repository.apache.org/content/repositories/orgapachegiraph-1003

Please notice that, as per earlier agreement, two sets of artifacts are published, differentiated by the version ID:
* version ID 1.1.0 corresponds to the artifacts built for the hadoop_1 profile
* version ID 1.1.0-hadoop2 corresponds to the artifacts built for the hadoop_2 profile.

The tag to be voted upon (release-1.1.0-RC2): https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=log;h=refs/tags/release-1.1.0-RC2

The KEYS file containing the PGP keys we use to sign the release: http://svn.apache.org/repos/asf/bigtop/dist/KEYS

Thanks, Roman.

-- Claudio Martella
Re: [VOTE] Apache Giraph 1.1.0 RC1
Yes, I did re-run the build this weekend, and it built successfully for the default profile and the hadoop_2 one. I ran a couple of examples on the cluster, and they ran successfully. I'm +1.

On Tue, Nov 4, 2014 at 8:10 PM, Roman Shaposhnik ro...@shaposhnik.org wrote:

On Tue, Nov 4, 2014 at 5:47 AM, Claudio Martella claudio.marte...@gmail.com wrote: I am indeed having some problems. mvn install will fail because the test is opening too many files: [snip] I have to investigate why this happens. I'm not using a different ulimit than what I have on my Mac OS X by default. Where are you building yours?

This is really weird. I have no issues whatsoever on Mac OS X with the following setup:

$ uname -a
Darwin usxxshaporm1.corp.emc.com 12.4.1 Darwin Kernel Version 12.4.1: Tue May 21 17:04:50 PDT 2013; root:xnu-2050.40.51~1/RELEASE_X86_64 x86_64

$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 2560
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 709
virtual memory (kbytes, -v) unlimited

$ mvn --version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T13:58:10-07:00)
Maven home: /Users/shapor/dist/apache-maven-3.2.3
Java version: 1.7.0_51, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_51.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: mac os x, version: 10.8.4, arch: x86_64, family: mac

Thanks, Roman.

-- Claudio Martella
Re: Compiling Giraph 1.1
I just built trunk with that command. Are you sure you're building the latest trunk?

On Fri, Nov 7, 2014 at 3:21 PM, Ryan freelanceflashga...@gmail.com wrote: Any updated thoughts on this?

On Tue, Nov 4, 2014 at 5:59 PM, Ryan freelanceflashga...@gmail.com wrote: It's 'mvn -Phadoop_2 -fae -DskipTests clean install'. Thanks, Ryan

On Tue, Nov 4, 2014 at 2:02 PM, Roman Shaposhnik ro...@shaposhnik.org wrote: What's the exact compilation incantation you use? Thanks, Roman.

On Tue, Nov 4, 2014 at 9:56 AM, Ryan freelanceflashga...@gmail.com wrote: I'm attempting to build, compile and install Giraph 1.1 on a server running CDH 5.1.2. A few weeks ago I successfully compiled it by changing the hadoop_2 profile version to 2.3.0-cdh5.1.2. I recently did a fresh install and was unable to build, compile and install (perhaps due to the latest code updates). The error seems to be related to SaslNettyClient and SaslNettyServer. Any ideas on fixes? Here's part of the error log:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.0:compile (default-compile) on project giraph-core: Compilation failure: Compilation failure:
[ERROR] /[myPath]/giraph/giraph-core/src/main/java/org/apache/giraph/comm/netty/SaslNettyClient.java:[28,34] cannot find symbol
[ERROR] symbol: class SaslPropertiesResolver
[ERROR] location: package org.apache.hadoop.security
...
[ERROR] /[myPath]/giraph/giraph-core/src/main/java/org/apache/giraph/comm/netty/SaslNettyServer.java:[108,11] cannot find symbol
[ERROR] symbol: variable SaslPropertiesResolver
[ERROR] location: class org.apache.giraph.comm.netty.SaslNettyServer

-- Claudio Martella
Re: [VOTE] Apache Giraph 1.1.0 RC1
I am indeed having some problems. mvn install will fail because the test is opening too many files:

Caused by: java.io.FileNotFoundException: /private/var/folders/5b/8yx5dbyn40nbt_70syjs86chgp/T/giraph-hive-1415098102276/metastore_db/seg0/c90.dat (Too many open files in system)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.derby.impl.io.DirRandomAccessFile.<init>(Unknown Source)
at org.apache.derby.impl.io.DirRandomAccessFile4.<init>(Unknown Source)
at org.apache.derby.impl.io.DirFile4.getRandomAccessFile(Unknown Source)
at org.apache.derby.impl.store.raw.data.RAFContainer.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.derby.impl.store.raw.data.RAFContainer.createContainer(Unknown Source)
at org.apache.derby.impl.store.raw.data.RAFContainer4.createContainer(Unknown Source)
at org.apache.derby.impl.store.raw.data.FileContainer.createIdent(Unknown Source)
at org.apache.derby.impl.store.raw.data.RAFContainer.createIdentity(Unknown Source)
at org.apache.derby.impl.services.cache.ConcurrentCache.create(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.addContainer(Unknown Source)
at org.apache.derby.impl.store.raw.xact.Xact.addContainer(Unknown Source)
at org.apache.derby.impl.store.access.heap.Heap.create(Unknown Source)
at org.apache.derby.impl.store.access.heap.HeapConglomerateFactory.createConglomerate(Unknown Source)
at org.apache.derby.impl.store.access.RAMTransaction.createConglomerate(Unknown Source)
at org.apache.derby.impl.sql.catalog.DataDictionaryImpl.createConglomerate(Unknown Source)
at org.apache.derby.impl.sql.catalog.DataDictionaryImpl.createDictionaryTables(Unknown Source)
at org.apache.derby.impl.sql.catalog.DataDictionaryImpl.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
at org.apache.derby.impl.services.monitor.BaseMonitor.createPersistentService(Unknown Source)
at org.apache.derby.iapi.services.monitor.Monitor.createPersistentService(Unknown Source)
... 96 more

I have to investigate why this happens. I'm not using a different ulimit than what I have on my Mac OS X by default. Where are you building yours?

On Sat, Nov 1, 2014 at 11:49 PM, Roman Shaposhnik ro...@shaposhnik.org wrote: Ping! Any progress on testing the current RC? Thanks, Roman.

On Fri, Oct 31, 2014 at 9:00 AM, Claudio Martella claudio.marte...@gmail.com wrote: Oh, thanks for the info!

On Fri, Oct 31, 2014 at 3:06 PM, Roman Shaposhnik ro...@shaposhnik.org wrote:

On Fri, Oct 31, 2014 at 3:26 AM, Claudio Martella claudio.marte...@gmail.com wrote: Hi Roman, thanks again for this. I have had a look at the staging site so far (our cluster has been down the whole week... universities...), and I was wondering if you have an insight into why some of the docs are missing, e.g. the gora and rexster documentation.

None of them are missing. The links moved to a "User Docs -> Modules" menu though: http://people.apache.org/~rvs/giraph-1.1.0-RC1/site/gora.html http://people.apache.org/~rvs/giraph-1.1.0-RC1/site/rexster.html and so forth. Thanks, Roman.

-- Claudio Martella
Re: Graph partitioning and data locality
Hi, answers are inline.

On Tue, Nov 4, 2014 at 8:36 AM, Martin Junghanns martin.jungha...@gmx.net wrote:

Hi group, I have a question concerning the graph partitioning step. If I understood the code correctly, the graph is distributed into n partitions by using vertexID.hashCode() % n. I have two questions concerning that step.

1) Is the whole graph loaded and partitioned only by the master? This would mean the whole data has to be moved to that master map job and then moved to the physical node the specific worker for the partition runs on. As this sounds like a huge overhead, I further inspected the code: I saw that there is also a WorkerGraphPartitioner, and I assume it calls the partitioning method on its local data (let's say its local HDFS blocks), and if the resulting partition for a vertex is not itself, the data gets moved to that worker, which reduces the overhead. Is this assumption correct?

That is correct: workers forward vertex data to the worker responsible for that vertex via hash partitioning (by default), meaning that the master is not involved.

2) Let's say the graph is already partitioned in the file system, e.g. blocks on physical nodes contain logically connected graph nodes. Is it possible to just read the data as it is and skip the partitioning step? In that case I currently assume that the vertexID should contain the partitionID, and the custom partitioning would be an identity function (instead of hashing or ranges).

In principle you can. You would need to organize the splits so that they contain all the data for each particular worker, and then assign the relevant splits to the corresponding worker.

Thanks for your time and help! Cheers, Martin

-- Claudio Martella
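The default assignment discussed above (vertexID.hashCode() % n) boils down to a one-liner. The following is a simplified, hypothetical restatement of the idea, not Giraph's actual partitioner code; Math.floorMod is used so that negative hash codes still map to a valid partition.

```java
public class HashPartitioning {
    // Assign a vertex to one of numPartitions partitions by hashing its id,
    // mirroring the default hash-partitioning scheme described above.
    public static int partitionFor(long vertexId, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions), even for
        // negative hash codes, unlike the raw % operator.
        return Math.floorMod(Long.hashCode(vertexId), numPartitions);
    }
}
```

For question 2, an "identity" scheme would simply decode the partition id out of the vertex id instead of hashing it.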
Re: [VOTE] Apache Giraph 1.1.0 RC1
Hi Roman, thanks again for this. I have had a look at the staging site so far (our cluster has been down the whole week... universities...), and I was wondering if you have an insight into why some of the docs are missing, e.g. the gora and rexster documentation. Thanks, Claudio

On Fri, Oct 31, 2014 at 6:38 AM, Roman Shaposhnik ro...@shaposhnik.org wrote:

On Wed, Oct 29, 2014 at 6:51 PM, Maja Kabiljo majakabi...@fb.com wrote: Roman, again thanks for taking care of the release. We found one issue, https://issues.apache.org/jira/browse/GIRAPH-961 - any application using MasterLoggingAggregator fails without this fix. Can we backport it to the release?

This looks like a really good idea to me. I will be re-cutting the RC over the weekend. For now I'd really, really, really ask everybody to once again consider issues like GIRAPH-961 so that we don't have to re-cut multiple times. Thanks, Roman.

-- Claudio Martella
Re: [VOTE] Apache Giraph 1.1.0 RC1
Oh, thanks for the info!

On Fri, Oct 31, 2014 at 3:06 PM, Roman Shaposhnik ro...@shaposhnik.org wrote:

On Fri, Oct 31, 2014 at 3:26 AM, Claudio Martella claudio.marte...@gmail.com wrote: Hi Roman, thanks again for this. I have had a look at the staging site so far (our cluster has been down the whole week... universities...), and I was wondering if you have an insight into why some of the docs are missing, e.g. the gora and rexster documentation.

None of them are missing. The links moved to a "User Docs -> Modules" menu though: http://people.apache.org/~rvs/giraph-1.1.0-RC1/site/gora.html http://people.apache.org/~rvs/giraph-1.1.0-RC1/site/rexster.html and so forth. Thanks, Roman.

-- Claudio Martella
Re: Resource Allocation Model Of Apache Giraph
giraph.userPartitionCount is the way to go, but not giraph.maxPartitionsInMemory: that one is for the out-of-core graph functionality.

On Fri, Oct 24, 2014 at 1:23 PM, Matthew Saltz sal...@gmail.com wrote: You may set giraph.userPartitionCount=<number of workers> and giraph.maxPartitionsInMemory=1. As Avery said, though, since parallelism occurs at the partition level (each thread processes a different partition), if you only have one partition per worker you cannot take advantage of multithreading. Best, Matthew

On Fri, Oct 24, 2014 at 3:53 AM, Zhang, David (Paypal Risk) pengzh...@ebay.com wrote: I think there is no good solution. You can try to run a Java application using FileInputFormat.getSplits to get the size of the splits array, which you can set as the number of Giraph workers. Or run a simple map-reduce job using IdentityMapper to see how many mappers there are. Thanks, Zhang, David (Paypal Risk)

From: Charith Wickramarachchi [mailto:charith.dhanus...@gmail.com] Sent: 2014-10-24 5:37 To: user Subject: Re: Resource Allocation Model Of Apache Giraph

Thanks Claudio and Avery, I found a way to configure hadoop to have the desired number of mappers per machine, as Claudio mentioned. Avery, could you please tell me how I can configure Giraph to make each worker handle only a single partition? Thanks, Charith

On Thu, Oct 23, 2014 at 2:26 PM, Avery Ching ach...@apache.org wrote: Regarding your second point, partitions are decoupled from workers. A worker can handle zero or more partitions. You can make each worker handle one partition, but we typically like multiple partitions since we can use multi-threading per machine.

On 10/23/14, 9:04 AM, Claudio Martella wrote: the way mappers (or containers), and hence workers, are assigned to machines is not under the control of Giraph, but of the underlying Hadoop environment (with different responsibilities that depend on the Hadoop version, e.g. YARN). You'll have to tweak your Hadoop configuration to control the maximum number of workers assigned to one machine (optimally one, with multiple threads).

On Thu, Oct 23, 2014 at 5:53 PM, Charith Wickramarachchi charith.dhanus...@gmail.com wrote: Hi Folks, I'm wondering what the resource allocation model for Apache Giraph is. As I understand it, each worker is mapped one-to-one with a Mapper, and a worker can process multiple partitions with a user-defined number of threads. Is it possible to make sure that one worker processes only a single partition? Also, is it possible to control the worker assignment in the cluster nodes? (E.g. make sure only N workers run on a single machine, assuming we have enough resources.) Thanks, Charith

-- Charith Dhanushka Wickramaarachchi Tel +1 213 447 4253 Web http://apache.org/~charith http://www-scf.usc.edu/%7Ecwickram/ Blog http://charith.wickramaarachchi.org/ http://charithwiki.blogspot.com/ Twitter @charithwiki https://twitter.com/charithwiki

-- Claudio Martella

-- Charith Dhanushka Wickramaarachchi Tel +1 213 447 4253 Web http://apache.org/~charith http://www-scf.usc.edu/~cwickram/ Blog http://charith.wickramaarachchi.org/ http://charithwiki.blogspot.com/ Twitter @charithwiki https://twitter.com/charithwiki

-- Claudio Martella
Re: Resource Allocation Model Of Apache Giraph
the way mappers (or containers), and hence workers, are assigned to machines is not under the control of Giraph, but of the underlying Hadoop environment (with different responsibilities that depend on the Hadoop version, e.g. YARN). You'll have to tweak your Hadoop configuration to control the maximum number of workers assigned to one machine (optimally one, with multiple threads).

On Thu, Oct 23, 2014 at 5:53 PM, Charith Wickramarachchi charith.dhanus...@gmail.com wrote: Hi Folks, I'm wondering what the resource allocation model for Apache Giraph is. As I understand it, each worker is mapped one-to-one with a Mapper, and a worker can process multiple partitions with a user-defined number of threads. Is it possible to make sure that one worker processes only a single partition? Also, is it possible to control the worker assignment in the cluster nodes? (E.g. make sure only N workers run on a single machine, assuming we have enough resources.) Thanks, Charith

-- Charith Dhanushka Wickramaarachchi Tel +1 213 447 4253 Web http://apache.org/~charith http://www-scf.usc.edu/~cwickram/ http://charith.wickramaarachchi.org/ Blog http://charith.wickramaarachchi.org/ http://charithwiki.blogspot.com/ Twitter @charithwiki https://twitter.com/charithwiki

-- Claudio Martella
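Pulling the thread's advice together, a launch command might look like the following. The jar name, job class, and input/output arguments are placeholders, and the exact option names should be checked against your Giraph version (giraph.userPartitionCount and giraph.numComputeThreads are the knobs commonly mentioned for partition count and per-worker threading).

```shell
# Hypothetical invocation: 4 workers, one partition per worker,
# single compute thread per worker (names and paths are placeholders).
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
  my.app.MyComputation \
  -vif my.app.MyVertexInputFormat -vip /input \
  -vof my.app.MyVertexOutputFormat -op /output \
  -w 4 \
  -ca giraph.userPartitionCount=4 \
  -ca giraph.numComputeThreads=1
```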
Re: how do I maintain a cached List across supersteps?
I would use a WorkerContext: it is shared by all the vertices in a worker and persistent across supersteps during the computation. If it's read-only, you won't have to manage concurrency.

On Tue, Sep 16, 2014 at 9:42 PM, Matthew Cornell m...@matthewcornell.org wrote: Hi Folks. I have a custom argument that's passed into my Giraph job that needs parsing. The parsed value is accessed by my Vertex#compute. To avoid excessive GC I'd like to cache the parsing results. What's a good way to do so? I looked at using the ImmutableClassesGiraphConfiguration returned by getConf(), but it supports only String properties. I looked at using my custom MasterCompute to manage it, but I couldn't find how to access the master compute instance from the vertex. My last idea is to use (abuse?) an aggregator to do this. I'd appreciate your thoughts! -- matt

-- Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34 Dickinson Street, Amherst MA 01002 | matthewcornell.org

-- Claudio Martella
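The parse-once caching pattern being suggested can be sketched in plain Java. This is a stand-in for illustration only: in Giraph the field would live in your WorkerContext (initialized once before computation starts), and the comma-separated argument format here is invented.

```java
import java.util.Arrays;
import java.util.List;

// Worker-level cache that parses an argument exactly once; all subsequent
// calls return the cached result, avoiding repeated parsing and GC churn.
public class ParsedArgCache {
    private volatile List<String> parsed;  // read-only after initialization

    public List<String> get(String rawArg) {
        if (parsed == null) {
            synchronized (this) {           // double-checked lazy init
                if (parsed == null) {
                    parsed = Arrays.asList(rawArg.split(","));
                }
            }
        }
        return parsed;
    }
}
```

Since the cached value is never mutated after initialization, vertices can read it concurrently without further locking, matching the "read-only means no concurrency management" point above.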
Re: OrientDB Rexster Apache Giraph combination
are you using trunk? The giraph-rexster documentation is here: http://giraph.apache.org/rexster.html

On Wed, May 28, 2014 at 8:41 AM, Arun Km arunkm@gmail.com wrote: Hi Claudio, I'm a beginner with Giraph. I was following the Apache Giraph Quick Start link, and I could see only the Giraph accumulo, core, examples, hbase, hcatalog and hive subprojects. Will you please direct me to the giraph-rexster subproject? About the documentation, I was referring to http://giraph.apache.org/apidocs/ . Will you please direct me to the right link? Thanks a lot, Arun

On 23 May 2014 13:34, Arun Km arunkm@gmail.com wrote: Thanks for the help, let me look into the subprojects. Cheers!

On 22 May 2014 20:20, Claudio Martella claudio.marte...@gmail.com wrote: You can have a look at the giraph-rexster subproject within the giraph codebase. There is also some documentation on our site.

On Thu, May 22, 2014 at 3:25 PM, Arun Km arunkm@gmail.com wrote: Hello, I would like to know your opinion on the OrientDB -- Rexster -- Apache Giraph combination. I'm more interested in the *input/output formats* to be used between these three. BR, Arun

-- Claudio Martella
Re: n-ary relationship on Giraph
Well, you don't know how many supersteps his computation takes. What he's asking is very typical in the semantic web community, and he's basically hinting at indexes on the labels of the edges. Imagine he wants to involve in a superstep of the computation (potentially of multiple supersteps) all the vertices that have a particular edge. He either scans all the vertices and lets them decide who has these edges, or he creates a supernode that connects to all the vertices that have such an outgoing edge (at loading time? beforehand, with MapReduce?). In a homogeneous dataset like Facebook's, where every vertex has more or less the same set of edge labels (friends, comments, likes, etc.), these indices would be overkill and unnecessary, but with datasets that are heterogeneous with respect to schema, like DBpedia, this question may still be relevant.

To answer Sujan's question: no, we don't have indices on edges. You may want to create specific vertices to do that (but be aware of their degree!), or store some of these indices in a WorkerContext (or even aggregators). It really depends on the size of these indices. Hope this helps, Claudio

On Thu, May 22, 2014 at 2:57 AM, Pavan Kumar A pava...@outlook.com wrote: The state of the triplet A - C - B can be stored in the edge value for C (the edge from A to B). I would like to remind you that Giraph is a batch processing framework, and not a graph database. You can do complex graph processing on the input graph, and such questions can be answered very trivially, but performance need not be great. You must write Java code and run a map-reduce job. For this case your compute function consists of just one superstep, which filters edges for a vertex based on the criterion, and then you can write the output back to one of the supported storage formats.

Date: Wed, 21 May 2014 16:32:44 -0700 From: sujanu...@yahoo.com Subject: Re: n-ary relationship on Giraph To: user@giraph.apache.org

Let's say I have nodes A and B, linked with edge C. Now I have properties which belong to this A - C - B triplet. For example, I have the property 'date created'; 'date created' belongs to A - C - B. Can I represent this in Giraph? Also, does Giraph have a querying mechanism, so that I can retrieve triplets which were created before a particular date? Sujan Perera

On Wednesday, May 21, 2014 3:51 PM, Pavan Kumar A pava...@outlook.com wrote: Can you please provide more context? vertex - edge (the edge value can store any properties required of that edge) - vertex (the vertex value can store any property required for the vertex)

Date: Wed, 21 May 2014 13:50:34 -0700 From: sujanu...@yahoo.com Subject: n-ary relationship on Giraph To: user@giraph.apache.org

Hi, does Giraph support n-ary relationships? I need to store some properties of a triplet (vertex - edge - vertex) and be able to query with those properties. Sujan Perera

-- Claudio Martella
Re: OrientDB Rexster Apache Giraph combination
You can have a look at the giraph-rexster subproject within the Giraph codebase. There is also some documentation on our site.

On Thu, May 22, 2014 at 3:25 PM, Arun Km arunkm@gmail.com wrote:

Hello, I would like to know your opinion on the OrientDB -- Rexster -- Apache Giraph combination. I'm mostly interested in the *input/output formats* to be used between these three.

BR
Arun

--
Claudio Martella
Re: Superstep duration increases
I'd start by taking HBase out of the equation.

On Thu, May 8, 2014 at 1:46 PM, Pascal Jäger pas...@pascaljaeger.de wrote:

Hi all, I have implemented a label propagation algorithm to find clusters in a graph. I just realized that the time the algorithm takes for one superstep is increasing, and I don't know why. The graph is static and the number of messages is the same throughout all supersteps. During every superstep each node sends its label to its neighbors, which then calculate their label based on the received messages and then again send their label. At the end of each superstep each node writes a nodeID - label pair to an HBase table. Do you have any general hints on where I can look? I absolutely have no clue where to start. Thanks for your help!

Regards
Pascal

--
Claudio Martella
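For context, the label-update step Pascal describes (each vertex adopts the most frequent label among its neighbors' messages) boils down to a frequency count. A minimal sketch, with Giraph's vertex and message types omitted; ties here go to the first label that reaches the winning count, which is just one of several reasonable policies:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of a label propagation update: return the most
// frequent label among the received neighbor labels.
public class LabelUpdate {
    public static long mostFrequent(long[] neighborLabels) {
        Map<Long, Integer> counts = new HashMap<>();
        long best = neighborLabels[0];
        int bestCount = 0;
        for (long label : neighborLabels) {
            int c = counts.merge(label, 1, Integer::sum);
            if (c > bestCount) {  // strictly greater: first label wins ties
                bestCount = c;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(mostFrequent(new long[] {7, 3, 7, 5})); // 7
    }
}
```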
Re: MessageCombiner
You can check the combiner used by the shortest paths algorithm; it has the inverted semantics of yours, as it keeps the minimum value.

On Mon, May 12, 2014 at 8:03 AM, nishant gandhi nishantgandh...@gmail.com wrote:

Let's say I have 5 nodes which are sending messages to a 6th node, each of the 5 nodes sending one message to the 6th node containing some value. I want to intercept all those 5 messages going towards the 6th node, find the maximum value contained among them, and send a single message from the combiner to the 6th node containing only the maximum value. I have tried to write simple code for it but it seems not to be working, and I don't know what I am doing wrong. In the code below, node 0 tries to send 3 messages to node 1. My final goal is that in superstep 1, node 1 should receive only one message, containing only 4. My current code receives all 3 messages, and hence I get a final value of 7 for the variable test.

public class CombinerTest extends BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> arg0,
      Iterable<DoubleWritable> arg1) throws IOException {
    if (getSuperstep() == 0 && arg0.getId().get() == 0) {
      sendMessage(new LongWritable(1), new DoubleWritable(1));
      sendMessage(new LongWritable(1), new DoubleWritable(2));
      sendMessage(new LongWritable(1), new DoubleWritable(4));
    }
    DoubleWritable test = new DoubleWritable(0);
    if (getSuperstep() == 1) {
      for (DoubleWritable message : arg1) {
        test.set(test.get() + message.get());
      }
      arg0.setValue(test);
    }
    arg0.voteToHalt();
  }

  public void combine(LongWritable vertexIndex, DoubleWritable originalMessage,
      DoubleWritable messageToCombine) {
    if (originalMessage.get() < messageToCombine.get()) {
      originalMessage.set(messageToCombine.get());
    }
  }

  public DoubleWritable createInitialMessage() {
    return new DoubleWritable(Double.MAX_VALUE);
  }
}

Please help me to figure out the correct way to write this code.

Thanks Maria.

Nishant Gandhi
M.Tech. CSE IIT Patna

On Mon, May 12, 2014 at 11:11 AM, Maria Stylianou mars...@gmail.com wrote:

Hi Nishant, can you be more specific? Are you trying to combine all incoming messages of a vertex into one message? What do you mean by combine? Add values? Or append to a list? The message can be a list, so you can put all values together.

Maria

On Sunday, May 11, 2014, nishant gandhi nishantgandh...@gmail.com wrote:

Hi, I am trying to write code that uses a Combiner. I want to combine all messages into one for each vertex. That one message should contain the value bigger than all the other message values. Please help.

Nishant Gandhi
M.Tech. CSE IIT Patna

--
Sent from Android Mobile

--
Claudio Martella
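For reference, the max-combining logic Nishant is after can be sketched stand-alone. In real Giraph it would live in a separate MessageCombiner class registered with the job, not as an unregistered method on the computation class, and note that the identity element for a maximum must be negative infinity, not Double.MAX_VALUE; the Giraph types are dropped here so the logic itself can be checked:

```java
// Stand-alone sketch of a max message combiner. In Giraph this logic
// would implement MessageCombiner and be registered with the job;
// here it is plain Java for clarity.
public class MaxCombiner {
    // Fold one incoming message into the running maximum.
    public static double combine(double current, double incoming) {
        return Math.max(current, incoming);
    }

    // Neutral initial message for a max-combiner: -infinity, so that
    // any real message replaces it (Double.MAX_VALUE would be wrong here).
    public static double initialMessage() {
        return Double.NEGATIVE_INFINITY;
    }

    public static void main(String[] args) {
        double acc = initialMessage();
        for (double msg : new double[] {1, 2, 4}) {
            acc = combine(acc, msg);
        }
        // The target vertex would see a single message carrying 4.0.
        System.out.println(acc); // 4.0
    }
}
```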
Re: Blogpost: Large-scale graph partitioning with Apache Giraph
Very interesting. We recently wrote an article about a very similar technique: http://arxiv.org/pdf/1404.3861v1.pdf and we also evaluated it on 1B vertices. It would be nice to test it on your graph.

On Tue, Apr 22, 2014 at 8:24 PM, Avery Ching ach...@apache.org wrote:

Hi Giraphers, recently a few internal Giraph users at Facebook published a really cool blog post on how we partition huge graphs (1.15 billion people and 150 billion friendships - 300B directed edges): https://code.facebook.com/posts/274771932683700/large-scale-graph-partitioning-with-apache-giraph/

Avery

--
Claudio Martella
Re: Using out of core messages
Answers are inline.

On Thu, Apr 24, 2014 at 4:21 PM, Pascal Jäger pas...@pascaljaeger.de wrote:

Hi all, I am struggling with the settings to use out-of-core messages. I have 3 nodes with 16 GB RAM each (one master, two workers). I ran into a Java heap space OOM error. First question: where do I set the mapred.child.java-opts options? Do I need to add them via the "-ca mapred.child..." option or by using "-Dmapred.child..."? I tried both, but nothing seems to work out. I run it on a Cloudera cluster, and when looking in the web frontend I see that it only uses 3 GB of my 16 GB RAM. Are those even the right options?

You can use both, but the correct parameter name is mapred.child.java.opts.

giraph.maxMessagesInMemory - is it per worker? Or what exactly is counted here? And how does it correlate to giraph.messagesBufferSize?

It is per worker, and it tells the maximum number of messages each worker should keep in main memory. The messagesBufferSize defines the buffer used to read and write messages to disk, and you can probably keep the current value.

I am really lost right now. My graph currently has only 8000 nodes and 7 edges. During one step I need to send more than 15,000,000 messages, and this is when I get the OOM error. I turned on the out-of-core messages feature without changing the above-mentioned options and my computation really slowed down, I guess because it was writing 14,000,000 messages to disk.

Each worker is currently keeping 1M messages in memory (if you have activated out-of-core messages but have not played with maxMessagesInMemory). In your case, that's something around 1/8 of the messages a worker receives. Once you're able to increase the heap and use all of your 16 GB of RAM on your workers, you should be able to increase that parameter, depending on the message size.

Hope you can help me.

Regards
Pascal

Hope this helps.

--
Claudio Martella
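Putting the options discussed above together, a launch command might look like the following sketch. The application class names and HDFS paths are placeholders, and the exact set of options can vary with the Giraph version:

```shell
# Hypothetical GiraphRunner invocation: worker heap raised via
# mapred.child.java.opts, out-of-core messaging enabled, and the
# in-memory message cap (giraph.maxMessagesInMemory) increased.
# Class names and paths are placeholders.
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
  -Dmapred.child.java.opts="-Xmx14g" \
  -Dgiraph.useOutOfCoreMessages=true \
  -Dgiraph.maxMessagesInMemory=4000000 \
  my.app.MyComputation \
  -vif my.app.MyVertexInputFormat -vip /user/me/input \
  -vof my.app.MyVertexOutputFormat -op /user/me/output \
  -w 2
```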
Re: Starting a second computation
Hi, there is currently no way to re-activate the vertices from the master. One thing you could do is use an aggregator instead of actually voting to halt. For example, with a sum aggregator VERTEX_FINISHED, where vertices add 1 when they would vote to halt, you can see from the master when all the vertices are finished, and then switch to a new computation.

On Sat, Apr 19, 2014 at 1:36 AM, Schweiger, Tom thschwei...@ebay.com wrote:

Hello Giraph list, I have a problem that has two steps. Step 2 needs to start after step 1 completes. Step 1 is completed when all the vertices have voted to halt and there are no more messages. I know I can switch my computes using a MasterCompute, but it is unclear how I re-awaken all the vertices. Has anyone else solved a problem like this? If so, how did you do it? Is there an easier way to do this? Basically I'm thinking this:

class TwoStep {

  class TwoStepMaster extends DefaultMasterCompute {
    public final void compute() {
      // switch from StepOne to StepTwo if StepOne is done
      if (this.isHalted && this.getComputation().equals(StepOne.class)) {
        setComputation(StepTwo.class);
        // send a message to all vertices???
        // unhalt somehow??
        // suggestions anyone??
      }
    }
  }

  class StepOne extends BasicComputation {
    public void compute(...) {
      // do step one stuff
      vertex.voteToHalt();
    }
  }

  class StepTwo extends BasicComputation {
    public void compute(...) {
      // do step two stuff
      vertex.voteToHalt();
    }
  }
}

--
Claudio Martella
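The master-side check Claudio describes reduces to comparing the aggregated VERTEX_FINISHED count against the total vertex count. A minimal stand-in, with Giraph's MasterCompute API omitted and all names illustrative:

```java
// Sketch of the phase-switch decision: vertices add 1 to a sum
// aggregator instead of voting to halt, and the master switches
// computations once every vertex has checked in. Names are illustrative,
// not Giraph API.
public class PhaseSwitch {
    // True when all vertices reported themselves finished this superstep.
    public static boolean allFinished(long finishedCount, long totalVertices) {
        return finishedCount >= totalVertices;
    }

    public static void main(String[] args) {
        // Superstep n: 3 of 5 vertices finished -> keep running StepOne.
        System.out.println(allFinished(3, 5)); // false
        // Superstep n+1: all 5 finished -> the master would call
        // setComputation(StepTwo.class) here.
        System.out.println(allFinished(5, 5)); // true
    }
}
```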
Re: Changing index of a graph
The only solution I know is usually done via a so-called dictionary outside of Giraph (e.g. for semantic web graphs, which also have URIs as IDs), through a datastore like HBase/Cassandra: basically the hashmap you mentioned. While initially computationally expensive, it allows you to scale in the long run, because adding an edge is just incrementing a counter in the store and adding the mapping.

On Tue, Apr 15, 2014 at 3:33 PM, Martin Neumann mneum...@spotify.com wrote:

Hej, I have a huge edge list (several billion edges) where node IDs are URLs. The algorithm I want to run needs the IDs to be longs, and there should be no holes in the ID space (so I can't simply hash the URLs). Is anyone aware of a simple solution that does not require an impractically huge hash map? My current idea is to load the graph into another Giraph job and then assign a number to each node. This way the mapping of number to URL would be stored in the node. The problem is that I have to assign the numbers sequentially to ensure there are no holes and the numbers are unique. No idea if this is even possible in Giraph. Any input is welcome.

cheers
Martin

--
Claudio Martella
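The dictionary approach Claudio mentions reduces to a counter plus a map from URL to the next free long. A minimal in-process sketch; in practice the map would live in a datastore like HBase/Cassandra rather than a Java HashMap, which is exactly why it scales where an in-memory map would not:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the external "dictionary": each URL gets the next free long
// the first time it is seen, producing dense IDs with no holes.
public class UrlDictionary {
    private final Map<String, Long> ids = new HashMap<>();
    private long next = 0;

    // Existing ID for the URL, or the next free one if unseen.
    public long idFor(String url) {
        return ids.computeIfAbsent(url, u -> next++);
    }

    public static void main(String[] args) {
        UrlDictionary d = new UrlDictionary();
        System.out.println(d.idFor("http://a.example")); // 0
        System.out.println(d.idFor("http://b.example")); // 1
        System.out.println(d.idFor("http://a.example")); // 0 again: stable, no holes
    }
}
```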
Powered-by Giraph page
Hello giraphers, as Giraph is getting more visibility and users, I think it would be nice to add a Powered-by page to our site, where we collect the names of companies that are using Giraph (and want to share that). So this is basically a small survey about who is using Giraph. For those that I know: - Facebook. Anybody else? Thanks! Claudio -- Claudio Martella
Re: How to set giraph runtime parameters?
You can set them on the command line, by using -D (e.g. -Dgiraph.isStaticGraph=true) after GiraphRunner.

On Wed, Apr 9, 2014 at 4:44 PM, Suijian Zhou suijian.z...@gmail.com wrote:

Hi, does anybody know how to set runtime parameters in Giraph? Should they be set on the command line or in a *.xml file? I tried -Dgiraph.zkSessionMsecTimeout=90 (googled) on the command line but it failed. Thanks!

Best Regards,
Suijian

--
Claudio Martella
Re: How to set more zooKeeper nodes in giraph.
You don't need to recompile it; you can set it at runtime by setting giraph.zkServerCount accordingly. But are you trying to get Giraph to start multiple instances of ZooKeeper on multiple nodes, or do you want Giraph to use several of your existing ZooKeepers?

On Tue, Apr 8, 2014 at 11:17 PM, Suijian Zhou suijian.z...@gmail.com wrote:

Hi, does anybody know how to set more ZooKeeper nodes in Giraph? I tried to modify ZOOKEEPER_SERVER_COUNT in the file giraph-core/target/munged/main/org/apache/giraph/conf/GiraphConstants.java, but recompiling Giraph shows no effect at all (Giraph seems to always use 1 ZooKeeper node?), and when it fails (e.g. due to a timeout), the client cannot connect, and finally the Giraph job fails too. It's also strange that although I see "negotiated timeout = 60", which means the session is supposed to run for 10 minutes, the job failed to connect to it after only ~1 minute:

14/04/08 15:58:18 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=compute-0-23.local:22181 sessionTimeout=6 watcher=org.apache.giraph.job.JobProgressTracker@69b28a51
14/04/08 15:58:18 INFO mapred.JobClient: Running job: job_201404081444_0011
14/04/08 15:58:18 INFO zookeeper.ClientCnxn: Opening socket connection to server compute-0-23.local/10.1.255.231:22181. Will not attempt to authenticate using SASL (unknown error)
14/04/08 15:58:18 INFO zookeeper.ClientCnxn: Socket connection established to compute-0-23.local/10.1.255.231:22181, initiating session
14/04/08 15:58:18 INFO zookeeper.ClientCnxn: Session establishment complete on server compute-0-23.local/10.1.255.231:22181, sessionid = 0x14543222b640009, negotiated timeout = 60
14/04/08 15:59:48 INFO job.JobProgressTracker: Data from 8 workers - Compute superstep 2: 0 out of 4847571 vertices computed; 0 out of 64 partitions computed; min free memory on worker 8 - 152.48MB, average 217.58MB
14/04/08 15:59:51 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x14543222b640009, likely server has closed socket, closing socket connection and attempting reconnect
14/04/08 15:59:52 INFO zookeeper.ClientCnxn: Opening socket connection to server compute-0-23.local/10.1.255.231:22181. Will not attempt to authenticate using SASL (unknown error)
14/04/08 15:59:52 WARN zookeeper.ClientCnxn: Session 0x14543222b640009 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused

Best Regards,
Suijian

--
Claudio Martella
Re: Information
It looks like you're expecting to use Giraph in an online fashion, such as you would use a database, to answer queries within milliseconds or seconds. Giraph is an offline batch processing system.

On Wed, Mar 26, 2014 at 11:11 AM, Angelo Immediata angelo...@gmail.com wrote:

Hi there. In my project I have to implement a routing system with good performance; at the beginning this system should be able to give route information only for one Italian region (Lombardia), but it could be used for the whole of Italy (or the world). Let's stick to Lombardia for now. By reading OSM files I can create my own graph in the best format I can use; then I need to use Dijkstra (or any other algorithm) in order to propose to the user K possible paths from point A to point B (K because I need to show the customer the alternatives as well). I can't use the Contraction Hierarchies algorithm because I need to take care of external events that can modify the weights of my built graph; this implies that I would have to build the contracted graph again each time, which can be a very onerous operation. From my experiments, I saw that by reading the Lombardia OSM file I would create a graph with around 1 million vertices and 6 million edges, and I was thinking of using Giraph to solve my problem (I saw the link http://giraph.apache.org/intro.html where you talk about the shortest paths problem). I have a couple of questions for you Giraph/Hadoop gurus:

- does it make sense to use Giraph for my scenario?
- must I respect some graph format to pass to the Giraph algorithm in order to get K shortest paths from point A to point B? If so, which format should I respect?
- what would the performance be using Giraph? I know that Dijkstra's problem is that it is slow; by using Giraph, will I be able to improve its performance on very large graphs?

I know these can seem very basic questions, but I'm pretty new to Giraph and I'm trying to understand it. Thank you.

Angelo

--
Claudio Martella
Re: Information
Nope, you can think of Giraph as MapReduce for graphs. Probably neo4j is the way to go for you.

On Wed, Mar 26, 2014 at 3:18 PM, Angelo Immediata angelo...@gmail.com wrote:

Hi Sebastian, OK, I got it. I was thinking I could use it for an online scenario. Thank you.

Angelo

2014-03-26 14:52 GMT+01:00 Sebastian Schelter s...@apache.org:

Hi Angelo, it very much depends on your use case. Do you want to precompute paths offline in batch, or are you looking for a system that answers online? Giraph has been built for the first scenario.

--sebastian

On 03/26/2014 02:48 PM, Angelo Immediata wrote:

Hi Claudio, so, if I understood correctly, it makes no sense to use Giraph for shortest path calculation in my scenario. Am I right?

--
Claudio Martella
Re: Is it possible to know the mapper task a particular vertex is assigned to?
By default, vertices stay where they are loaded.

On Thu, Mar 6, 2014 at 7:31 AM, Pankaj Malhotra pankajiit...@gmail.com wrote:

There is a vertex with a large outgoing edge list. I wanted to compare the memory usage, number of messages, and a few other statistics for the worker with this vertex against the average statistics across workers. Does the mapping change within the same job?

Thanks, Pankaj

On 6 March 2014 11:38, Roman Shaposhnik shaposh...@gmail.com wrote:

On Wed, Mar 5, 2014 at 9:53 PM, Pankaj Malhotra pankajiit...@gmail.com wrote:

Hi, how can I find the mapper task a particular vertex is assigned to? I can do this with a sysout and then looking at the logs, but there must be a smarter way. Please suggest.

That mapping is not static and can change. In theory you can rely on the info in ZK, but that would be relying on what is, essentially, an implementation detail of Giraph. What's the reason for you to need this info?

Thanks, Roman.

--
Claudio Martella
Re: Giraph program stucks.
Did you actually increase the heap?

On Thu, Mar 6, 2014 at 11:43 PM, Suijian Zhou suijian.z...@gmail.com wrote:

Hi, I tried to process only 2 of the input files, i.e. 2 GB + 2 GB of input, and the program finished successfully in 6 minutes. But as I have 39 nodes, shouldn't they be enough to load and process the 8*2GB = 16 GB graph? Can somebody give some hints? (Do all the nodes participate in loading the graph from HDFS, or does only the master node load the graph?) Thanks!

Best Regards,
Suijian

2014-03-06 16:24 GMT-06:00 Suijian Zhou suijian.z...@gmail.com:

Hi, experts, I'm trying to process a graph with PageRank in Giraph, but the program always gets stuck. There are 8 input files, each of size ~2 GB, all copied onto HDFS. I use 39 nodes, and each node has 16 GB of memory and 8 cores. It keeps printing the same info (as follows) on the screen after 2 hours; there seems to be no progress at all. What are the possible reasons? Small test example files run without problems. Thanks!

14/03/06 16:17:42 INFO job.JobProgressTracker: Data from 39 workers - Compute superstep 0: 5854829 out of 4920 vertices computed; 181 out of 1521 partitions computed
14/03/06 16:17:47 INFO job.JobProgressTracker: Data from 39 workers - Compute superstep 0: 5854829 out of 4920 vertices computed; 181 out of 1521 partitions computed

Best Regards,
Suijian

--
Claudio Martella
Re: To process a BIG input graph in giraph.
-vip /user/hadoop/input should be enough.

On Wed, Mar 5, 2014 at 5:31 PM, Suijian Zhou suijian.z...@gmail.com wrote:

Hi, experts, could anybody remind me how to load multiple input files on a Giraph command line? The following do not work; they only load the first input file:

-vip /user/hadoop/input/ttt.txt /user/hadoop/input/ttt2.txt

or

-vip /user/hadoop/input/ttt.txt -vip /user/hadoop/input/ttt2.txt

Best Regards,
Suijian

2014-03-01 16:12 GMT-06:00 Suijian Zhou suijian.z...@gmail.com:

Hi, here I'm trying to process a very big input file through Giraph, ~70 GB. I'm running the Giraph program on a 40-node Linux cluster, but the program just gets stuck after it reads in a small fraction of the input file. Although each node has 16 GB of memory, it looks like only one node reads the input file from HDFS (into its memory). As the input file is so big, is there a way to scatter the input file across all the nodes, so that each node reads in a fraction of the file and then starts processing the graph? Would it help to split the single big input file into many smaller files and let each node read in one of them to process (keeping the overall structure of the graph, of course)? Thanks!

Best Regards,
Suijian

--
Claudio Martella
Re: Giraph talks at Hadoop Summit
Btw guys, my talk at the Hadoop Summit in Amsterdam this April was accepted, so we'll have another one there.

On Friday, February 28, 2014, Avery Ching ach...@apache.org wrote:

That's great Roman! I certainly hope it gets accepted. We also have a submission. Hopefully there will be at least one Giraph talk at the Hadoop Summit. https://hadoopsummit.uservoice.com/forums/242790-committer-track/suggestions/5568083-dynamic-graph-iterative-computation-on-apache-gira

Avery

On 2/27/14, 2:19 PM, Roman Shaposhnik wrote:

Hi! Not sure if anybody from the Giraph community submitted any talks to Hadoop Summit, but here's the one I submitted: https://hadoopsummit.uservoice.com/forums/242790-committer-track/suggestions/5568061-apache-giraph-start-analyzing-graph-relationships Feel free to upvote if you feel like Giraph deserves to be well represented at Hadoop Summit.

Thanks, Roman.

--
Claudio Martella
Re: Giraph avro input format
I'm not sure about what I'm going to say, but Gora should read from Avro, and we do support reading transparently through Gora. You could check that out.

On Mon, Feb 17, 2014 at 1:32 PM, Martin Neumann mneum...@spotify.com wrote:

Hej, is there an Avro input format for Giraph? I saw some older (July 2013) entries on the mailing list, and none existed by then. Have things changed since then, or do I have to write my own? If I write my own, what's a good base class to start from?

cheers
Martin

--
Claudio Martella
Re: Basic questions about Giraph internals
Yes, Giraph hijacks mapper tasks and then does everything else on its own.

On Fri, Feb 7, 2014 at 12:39 PM, Alexander Frolov alexndr.fro...@gmail.com wrote:

On Fri, Feb 7, 2014 at 2:30 PM, Claudio Martella claudio.marte...@gmail.com wrote:

On Fri, Feb 7, 2014 at 9:44 AM, Alexander Frolov alexndr.fro...@gmail.com wrote:

Thank you, I will try to do this. As I understood, I should set the number of threads manually through the Giraph API. BTW, what is the conceptual difference between running multiple workers on a TaskTracker and running a single worker with multiple threads? In terms of vertex fetching, memory sharing, etc.

Basically, better usage of resources: one single JVM, no duplication of core data structures, fewer netty threads and communication points, more locality (fewer messages over the network), fewer actors accessing ZooKeeper, etc.

Also I would like to ask how message transfer between vertices is implemented in terms of Hadoop primitives. A source code reference will be enough.

Communication does not happen via Hadoop primitives, but ad hoc via netty.

OK. It seems that Hadoop has minimal influence on the execution of a Giraph application after the graph is loaded into memory (that is, after the mapping is done).

--
Claudio Martella
Re: Basic questions about Giraph internals
Hi Alex, answers are inline.

On Thu, Feb 6, 2014 at 11:22 AM, Alexander Frolov alexndr.fro...@gmail.com wrote:

Hi, folks! I have started a small study of the Giraph framework, and I don't have much experience with Giraph and Hadoop :-(. I would like to ask several questions about how things work in Giraph which are not straightforward to me. I am trying to use the sources, but sometimes it is not too easy ;-) So here they are:

1) How are Workers assigned to TaskTrackers?

Each worker is a mapper, and mapper tasks are assigned to TaskTrackers by the JobTracker. There's no control by Giraph there, and because Giraph doesn't need data locality like MapReduce does, basically nothing is done.

2) How are vertices assigned to Workers? Does it depend on the distribution of the input file across DataNodes? Is there any choice of distribution policy?

In the default scheme, vertices are assigned through modulo hash partitioning. Given k workers, vertex v is assigned to worker i according to hash(v) % k = i.

3) How are Workers and map tasks related to each other? (1:1)? (n:1)? (1:n)?

It's 1:1. Each worker is implemented by a mapper task. The master is usually (but does not need to be) implemented by an additional mapper.

4) Can Workers migrate from one TaskTracker to another?

Workers do not migrate. A Giraph computation is not dynamic with respect to the assignment and size of the tasks.

5) What is the best way to monitor Giraph app execution (progress, worker assignment, load balancing, etc.)?

Just like you would for a standard MapReduce job: go to the job page on the JobTracker http page.

I think this is all for the moment. Thank you.

Testbed description:
Hardware: 8-node dual-CPU cluster with IB FDR.
Giraph: release-1.0.0-RC2-152-g585511f
Hadoop: hadoop-0.20.203.0, hadoop-rdma-0.9.8

Best, Alex

--
Claudio Martella
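The modulo hash partitioning rule from answer 2 can be illustrated stand-alone. Giraph's actual partitioner classes are not reproduced here; this just shows hash(v) % k for long vertex IDs:

```java
// Sketch of the default assignment rule: vertex v goes to worker
// hash(v) % k. floorMod keeps the result in [0, k) even for negative hashes.
public class HashPartition {
    public static int workerFor(long vertexId, int k) {
        return Math.floorMod(Long.hashCode(vertexId), k);
    }

    public static void main(String[] args) {
        int k = 4;
        for (long v = 0; v < 8; v++) {
            System.out.println("vertex " + v + " -> worker " + workerFor(v, k));
        }
    }
}
```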
Re: Basic questions about Giraph internals
On Thu, Feb 6, 2014 at 11:56 AM, Alexander Frolov alexndr.fro...@gmail.com wrote:

Hi Claudio, thank you. If I understood correctly, a mapper and a mapper task are the same thing.

More or less. A mapper is a functional element of the programming model, while the mapper task is the task that executes the mapper function on the records.

That is, each Worker is created at the beginning of a superstep and then dies, and in the next superstep all Workers are created again. Is that correct?

Nope. The workers are created at the beginning of the computation and destroyed at the end of the computation. A worker is persistent throughout the computation.

This is important for me. So a Giraph Worker (a.k.a. Hadoop mapper) fetches the vertices with the corresponding indices from HDFS and performs the computation. What does it do next with them? As I understood, Giraph is a fully in-memory framework, and in the next superstep each vertex should be fetched from memory by the same Worker. Where are the vertices stored between supersteps? In HDFS or in memory?

As I said, the workers are persistent (in-memory) between supersteps, so they keep everything in memory.

--
Claudio Martella
Re: Basic questions about Giraph internals
On Thu, Feb 6, 2014 at 12:15 PM, Alexander Frolov alexndr.fro...@gmail.comwrote: On Thu, Feb 6, 2014 at 3:00 PM, Claudio Martella claudio.marte...@gmail.com wrote: On Thu, Feb 6, 2014 at 11:56 AM, Alexander Frolov alexndr.fro...@gmail.com wrote: Hi Claudio, thank you. If I understood correctly, mapper and mapper task is the same thing. More or less. A mapper is a functional element of the programming model, while the mapper task is the task that executes the mapper function on the records. Ok, I see. Then mapred.tasktracker.map.tasks.maximum is a maximum number of Workers [or Workers + Master] which will be created at the same node. That is if I have 8 node cluster with mapred.tasktracker.map.tasks.maximum=4, then I can run up to 31 Workers + 1 Master. Is it correct? That is correct. However, if you have total control over your cluster, you may want to run one worker per node (hence setting the max number of map tasks per machine to 1), and use multiple threads (input, compute, output). This is going to make better use of resources. On Thu, Feb 6, 2014 at 2:28 PM, Claudio Martella claudio.marte...@gmail.com wrote: Hi Alex, answers are inline. On Thu, Feb 6, 2014 at 11:22 AM, Alexander Frolov alexndr.fro...@gmail.com wrote: Hi, folks! I have started small research of Giraph framework and I have not much experience with Giraph and Hadoop :-(. I would like to ask several questions about how things are working in Giraph which are not straightforward for me. I am trying to use the sources but sometimes it is not too easy ;-) So here they are: 1) How Workers are assigned to TaskTrackers? Each worker is a mapper, and mapper tasks are assigned to tasktrackers by the jobtracker. That is each Worker is created at the beginning of superstep and then dies. In the next superstep all Workers are created again. Is it correct? Nope. The workers are created at the beginning of the computation, and destroyed at the end of the computation. 
A computation is persistent throughout the computation. There's no control by Giraph there, and because Giraph doesn't need data-locality like Mapreduce does, basically nothing is done. This is important for me. So Giraph Worker (a.k.a Hadoop mapper) fetches vertex with corresponding index from the HDFS and perform computation. What does it do next with it? As I understood Giraph is fully in-memory framework and in the next superstep this vertex should be fetched from the memory by the same Worker. Where the vertices are stored between supersteps? In HDFS or in memory? As I said, the workers are persistent (in-memory) between supersteps, so they keep everything in memory. Ok. Is there any means to see assignment of Workers to TaskTrackers during or after the computation? The jobtracker http interface will show you the mapper running, hence i'd check there And is there any means to see assignment of vertices to Workers (as distribution function, histogram etc.)? You can check the worker logs, I think the information should be there. 2) How vertices are assigned to Workers? Does it depend on distribution of input file on DataNodes? Is there available any choice of distribution politics or no? In the default scheme, vertices are assigned through modulo hash partitioning. Given k workers, vertex v is assigned to worker i according to hash(v) % k = i. 3) How Workers and Map tasks are related to each other? (1:1)? (n:1)? (1:n)? It's 1:1. Each worker is implemented by a mapper task. The master is usually (but does not need to) implemented by an additional mapper . 4) Can Workers migrate from one TaskTracker to the other? Workers does not migrate. A Giraph computation is not dynamic wrt to assignment and size of the tasks. 5) What is the best way to monitor Giraph app execution (progress, worker assignment, load balancing etc.)? Just like you would for a standard Mapreduce job. Go to the job page on the jobtracker http page. I think this is all for the moment. Thank you. 
Testbed description: Hardware: 8-node dual-CPU cluster with IB FDR. Giraph: release-1.0.0-RC2-152-g585511f Hadoop: hadoop-0.20.203.0, hadoop-rdma-0.9.8 Best, Alex -- Claudio Martella
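Claudio's answer to question 2 (modulo hash partitioning) can be sketched in plain Java. This is a toy illustration of the scheme, not Giraph's actual partitioner code:

```java
import java.util.Arrays;

// Toy sketch of Giraph's default scheme: vertex v goes to worker hash(v) % k.
public class HashPartitionDemo {
    static int workerFor(long vertexId, int numWorkers) {
        // Math.abs guards against negative hash codes.
        return Math.abs(Long.hashCode(vertexId)) % numWorkers;
    }

    public static void main(String[] args) {
        int k = 4; // number of workers
        int[] counts = new int[k];
        for (long v = 0; v < 8; v++) {
            int w = workerFor(v, k);
            counts[w]++;
            System.out.println("vertex " + v + " -> worker " + w);
        }
        // Hash partitioning spreads vertices roughly evenly across workers.
        System.out.println("per-worker counts: " + Arrays.toString(counts));
    }
}
```

In real deployments the assignment is done by the configured partitioner factory; the point here is only that the mapping is a pure function of the vertex id and the worker count.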
Re: Basic questions about Giraph internals
On Thu, Feb 6, 2014 at 3:04 PM, Alexander Frolov alexndr.fro...@gmail.com wrote: Claudio, thank you very much for your help. On Thu, Feb 6, 2014 at 4:06 PM, Claudio Martella claudio.marte...@gmail.com wrote: [...] Should I explicitly force Giraph to use multiple threads for input, compute, output? Only three threads, I suppose? But I have 12 cores available in each node (24 if HT is enabled). You're right, I was not clear. I suggest you use N threads for each of those three classes, where N is something close to the number of processing units (e.g. cores) you have available on each machine. Consider that Giraph has a number of other threads running in the background, for example to handle communication etc. I suggest you try different setups through benchmarking.
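The one-worker-per-node setup with in-worker parallelism that Claudio suggests can be sketched as a configuration fragment. Option names are as in Giraph 1.x and Hadoop 1.x; verify them against your versions:

```
# One map slot (hence one Giraph worker) per tasktracker, in mapred-site.xml:
mapred.tasktracker.map.tasks.maximum = 1

# Then parallelize inside each worker (N ~ cores per node, e.g. 12):
-Dgiraph.numInputThreads=12
-Dgiraph.numComputeThreads=12
-Dgiraph.numOutputThreads=12

# With 8 nodes: 7 workers (-w 7) plus one map task for the master.
```

As the thread notes, Giraph also runs background threads (e.g. for Netty communication), so the best N is usually found by benchmarking rather than set to the exact core count.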
Re: Giraph installation without internet connection
you should not have problems if you build the jar with dependencies elsewhere and then deploy it to your cluster. On Tue, Feb 4, 2014 at 2:40 PM, Alexander Frolov alexndr.fro...@gmail.com wrote: Hi, Is it possible to build Giraph without Internet access? The target cluster has no internet connection. Best, Alex -- Claudio Martella
Re: constraint about no of supersteps
looks like one of your workers died. If you expect such a long job, I'd suggest you turn checkpointing on. On Wed, Jan 29, 2014 at 5:30 PM, Jyoti Yadav rao.jyoti26ya...@gmail.com wrote: Thanks all for your reply.. Actually I am working with an algorithm in which the single source shortest path algorithm runs for thousands of vertices. Suppose on average this algo takes 5-6 supersteps per vertex; then for thousands of vertices the superstep count is extremely large. In that case the following error is thrown at run time... ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=kanha-Vostro-1014, MRtaskID=1, port=30001) on superstep 19528 2014-01-28 05:11:36,852 INFO org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 19528 took 636.831 seconds ended with state WORKER_FAILURE and is now on superstep 19528 2014-01-28 05:11:39,446 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with ArrayIndexOutOfBoundsException java.lang.ArrayIndexOutOfBoundsException: -1 Any ideas?? Thanks Jyoti On Wed, Jan 29, 2014 at 8:55 PM, Peter Grman peter.gr...@gmail.com wrote: Yes, but you can disable the counters per superstep if you don't need the data, and then I had around 2000 supersteps after which my algorithm stopped. Cheers Peter On Jan 29, 2014 4:22 PM, Claudio Martella claudio.marte...@gmail.com wrote: the limit is currently defined by the maximum number of counters your jobtracker allows. Hence, by default the max number of supersteps is around 90. check http://giraph.apache.org/faq.html to see how to increase it. On Wed, Jan 29, 2014 at 4:12 PM, Jyoti Yadav rao.jyoti26ya...@gmail.com wrote: Hi folks.. Is there any limit on the maximum number of supersteps while running a giraph job?? Thanks Jyoti -- Claudio Martella
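Turning checkpointing on (and disabling per-superstep counters), as suggested in this thread, can be sketched as a configuration fragment. Option names are as in Giraph 1.x; verify them against your version:

```
# Disable per-superstep counters so long jobs don't hit the
# jobtracker's counter limit (~90 supersteps by default):
-Dgiraph.useSuperstepCounters=false

# Checkpoint periodically so a worker failure doesn't lose the job
# (0, the default, disables checkpointing):
-Dgiraph.checkpointFrequency=100
-Dgiraph.checkpointDirectory=_bsp/_checkpoints/
```

With checkpointing enabled, a failed job can be restarted from the last checkpointed superstep instead of from scratch.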
Re: out of core option
${hadoop.tmp.dir}/mapred/staging
mapred.queue.names = default
dfs.access.time.precision = 360
fs.hsftp.impl = org.apache.hadoop.hdfs.HsftpFileSystem
mapred.task.tracker.http.address = 0.0.0.0:50060
mapred.reduce.parallel.copies = 5
io.seqfile.lazydecompress = true
mapred.output.dir = /user/hduser/output/shortestpaths
io.sort.mb = 100
ipc.client.connection.maxidletime = 1
mapred.compress.map.output = false
hadoop.security.uid.cache.secs = 14400
mapred.task.tracker.report.address = 127.0.0.1:0
mapred.healthChecker.interval = 6
ipc.client.kill.max = 10
ipc.client.connect.max.retries = 10
ipc.ping.interval = 30
mapreduce.user.classpath.first = true
mapreduce.map.class = org.apache.giraph.graph.GraphMapper
fs.s3.impl = org.apache.hadoop.fs.s3.S3FileSystem
mapred.user.jobconf.limit = 5242880
mapred.job.tracker.http.address = 0.0.0.0:50030
io.file.buffer.size = 4096
mapred.jobtracker.restart.recover = false
io.serializations = org.apache.hadoop.io.serializer.WritableSerialization
dfs.datanode.handler.count = 3
mapred.reduce.copy.backoff = 300
mapred.task.profile = false
dfs.replication.considerLoad = true
jobclient.output.filter = FAILED
dfs.namenode.delegation.token.max-lifetime = 60480
mapred.tasktracker.map.tasks.maximum = 4
io.compression.codecs = org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec
fs.checkpoint.size = 67108864
Additionally, if I have more than one worker I get an Exception, too. Are my configurations wrong? best regards, Sebastian -- Claudio Martella claudio.marte...@gmail.com
Re: About LineRank algo ..
do you plan to share it when you're done? :) On Mon, Jan 20, 2014 at 9:15 AM, Sebastian Schelter s...@apache.org wrote: I have a student working on an implementation, do you have questions? On 01/20/2014 08:11 AM, Jyoti Yadav wrote: Hi.. Is there anyone who is working with linerank algorithm?? Thanks Jyoti -- Claudio Martella claudio.marte...@gmail.com
Re: Intermediate output
you can use giraph.doOutputDuringComputation. If you use this option, instead of saving vertices at the end of the application, saveVertex will be called right after each vertex.compute() is called. NOTE: This feature doesn't work well with checkpointing - if you restart from a checkpoint you won't have any output from previous supersteps. On Sat, Jan 18, 2014 at 11:02 AM, Sebastian Schelter s...@apache.org wrote: Hi, Do we have a way to write out the state of the graph after each superstep? I have an algorithm that requires this and I don't want to buffer the intermediate results in memory until the algorithm finishes. --sebastian -- Claudio Martella claudio.marte...@gmail.com
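The option described above is passed as a custom argument on the command line, e.g.:

```
# Write vertex output as supersteps complete instead of only at the end.
# Per the caveat above, avoid combining this with checkpoint restarts:
-ca giraph.doOutputDuringComputation=true
```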
Re: Release date for 1.1.0
I agree, in particular considering that the current patch does not even apply to trunk. On Sat, Jan 18, 2014 at 4:43 PM, Sebastian Schelter s...@apache.org wrote: Hi, I had a look at the list and noticed that https://issues.apache.org/jira/browse/GIRAPH-818 (which is based on a research paper) is marked for 1.1.0. Given that this issue proposes to change the programming and execution model of Giraph, I don't see it in the scope of the upcoming release. --sebastian On 01/18/2014 04:07 PM, Roman Shaposhnik wrote: The easiest way to start helping with a release would be to take a look at the JIRAs in the link I sent to the list a few days ago. Thanks, Roman, On Fri, Jan 17, 2014 at 5:54 PM, Rob Paul urlop...@gmail.com wrote: Hi Roman, I will be happy to extend my help in the new release. If you allow, I can initiate it and then you can jump in, as and when your time permits. Thanks On Wed, Jan 15, 2014 at 5:57 PM, Roman Shaposhnik r...@apache.org wrote: It is the usual community-driven ASF process. Somebody familiar with the project has to step forward as a Release Manager and drive the release. I did a few months back, but since then I went through a career change that made it very difficult for me to find free cycles to drive this release. I fully intend to pick up the slack beginning of Feb. Given that, I think beginning of March should be a realistic deadline, but it all depends on the availability of the Giraph PMC members to cast votes on the release candidate. That said, if there's anybody else who would want to speed up this release I'd be more than happy to yield. By and large though, ASF projects typically don't give any schedule for future releases. The way to speed it up is to join the community, start contributing and volunteering as RM. Thanks, Roman. On Wed, Jan 15, 2014 at 5:02 PM, Zhu, Xia xia@intel.com wrote: Is it possible to release 1.1.0 before March 2014?
Thanks, Xia -Original Message- From: Zhu, Xia [mailto:xia@intel.com] Sent: Wednesday, January 15, 2014 4:36 PM To: user@giraph.apache.org Subject: RE: Release date for 1.1.0 May I know what are the Giraph release process? Thanks, Ivy -Original Message- From: shaposh...@gmail.com [mailto:shaposh...@gmail.com] On Behalf Of Roman Shaposhnik Sent: Monday, January 06, 2014 9:22 PM To: user@giraph.apache.org Subject: Re: Release date for 1.1.0 On Mon, Jan 6, 2014 at 6:13 AM, Ahmet Emre Aladağ aladage...@gmail.com wrote: Hi, Are there any advances so far on the 1.1.0 release schedule? Unfortunately, with my recent job change driving 1.1.0 release dropped from my list. I'll try to pick it up back this month. Still very much would like to help make it happen. Thanks, Roman. -- Claudio Martella claudio.marte...@gmail.com
Re: Release date for 1.1.0
I think that after 1.1.0 we could consider this big change, and that there should be a vote on it. It's changing Giraph's shape at the core. On Sat, Jan 18, 2014 at 5:15 PM, Mirko Kämpf mirko.kae...@cloudera.com wrote: Hi, I also think this major change (or enhancement) might be something which goes into Giraph in a later release. Will there be a vote on such issues? Mirko -- Claudio Martella claudio.marte...@gmail.com
Re: minLocalEdgesRatio in PseudoRandomLocalEdgesHelper
it's the ratio of edges that connect two vertices stored on the same worker. On Sun, Dec 15, 2013 at 8:17 PM, Pushparaj Motamari pushpara...@gmail.com wrote: Hi, Could anyone explain the significance of the minLocalEdgesRatio field in PseudoRandomInputFormat, and its role in the way the graph is generated? Thanks Pushparaj -- Claudio Martella claudio.marte...@gmail.com
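The quantity Claudio describes can be computed for any edge list. A standalone toy sketch (not the PseudoRandom generator itself), using the default modulo hash assignment of vertices to workers:

```java
// Toy sketch: the fraction of edges whose two endpoints land on the same
// worker, i.e. the "local edges ratio" the generator option constrains.
public class LocalEdgeRatioDemo {
    static int workerFor(long v, int k) {
        return Math.abs(Long.hashCode(v)) % k;
    }

    static double localEdgeRatio(long[][] edges, int numWorkers) {
        int local = 0;
        for (long[] e : edges) {
            if (workerFor(e[0], numWorkers) == workerFor(e[1], numWorkers)) {
                local++;
            }
        }
        return (double) local / edges.length;
    }

    public static void main(String[] args) {
        // With k=2 workers: (0,4) local, (1,2) remote, (2,6) local, (3,3) local.
        long[][] edges = {{0, 4}, {1, 2}, {2, 6}, {3, 3}};
        System.out.println("local edge ratio: " + localEdgeRatio(edges, 2));
    }
}
```

A higher ratio means less cross-worker message traffic during computation, which is why the generator exposes it as a tunable parameter.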
Re: Running Giraph on YARN (0.23)
15:22:55,111 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: yarn.app.mapreduce.am.job.client.port-range; Ignoring. 2013-11-20 15:22:55,111 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.admin.reduce.child.java.opts; Ignoring. 2013-11-20 15:22:55,111 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: hadoop.tmp.dir; Ignoring. 2013-11-20 15:22:55,117 WARN [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Could not delete hdfs://zaniumtan-nn1.tan.ygrid.yahoo.com:8020/user/bordino/test-giraph-tmp/_temporary/1/_temporary/attempt_1382563758657_470916_m_56_0 2013-11-20 15:22:55,121 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system... 2013-11-20 15:22:55,122 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped. 2013-11-20 15:22:55,122 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete. Cheers, -- Gianmarco -- Claudio Martella claudio.marte...@gmail.com
Re: Using the RandomEdge ... RandomVertex InputFormat
Every inputformat has an I/V/E signature with the types of the vertex index, vertex value and edge value. They have to match the signature of the computation class you're using. In your case, the inputformat generates edges with Double values, while the computation class expects Float edge values. On Mon, Nov 4, 2013 at 10:57 AM, Mirko Kämpf mirko.kae...@cloudera.com wrote: Hello, I try to use the RandomInputFormat. My giraph-job is submitted via the following command: hadoop jar giraph-ex.jar org.apache.giraph.GiraphRunner -Dgiraph.zkList=127.0.0.1:2181 -libjars giraph-core.jar org.apache.giraph.examples.SimpleShortestPathsVertex -eif org.apache.giraph.io.formats.PseudoRandomEdgeInputFormat -vif org.apache.giraph.io.formats.PseudoRandomVertexInputFormat -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/cloudera/goutput/shortestpaths_rand_$NOW -w 1 -ca giraph.pseudoRandomInputFormat.edgesPerVertex=10 but I get the following exception: 13/11/04 01:28:54 INFO utils.ConfigurationUtils: Setting custom argument [giraph.pseudoRandomInputFormat.edgesPerVertex] to [10] in GiraphConfiguration 13/11/04 01:28:54 INFO utils.ConfigurationUtils: No input path for vertex data was specified. Ensure your InputFormat does not require one. 13/11/04 01:28:54 INFO utils.ConfigurationUtils: No input path for edge data was specified. Ensure your InputFormat does not require one.
Exception in thread main java.lang.IllegalArgumentException: checkClassTypes: Edge value types don't match, vertex - class org.apache.hadoop.io.FloatWritable, vertex input format - class org.apache.hadoop.io.DoubleWritable at org.apache.giraph.job.GiraphConfigurationValidator.verifyVertexInputFormatGenericTypes(GiraphConfigurationValidator.java:245) at org.apache.giraph.job.GiraphConfigurationValidator.validateConfiguration(GiraphConfigurationValidator.java:122) at org.apache.giraph.utils.ConfigurationUtils.parseArgs(ConfigurationUtils.java:154) at org.apache.giraph.GiraphRunner.run(GiraphRunner.java:74) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at org.apache.giraph.GiraphRunner.main(GiraphRunner.java:124) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:208) I think I do not have all required command line parameters set. But the problem is, I cannot find any documentation which explains how to run giraph with random networks generated on the fly. The job runs with the tiny_graph.txt file (and appropriate parameters) but not with the random format. Could anybody please help me find out how to use the random graph and the watts strogatz model which are mentioned by Claudio in this mail: http://mail-archives.apache.org/mod_mbox/giraph-user/201310.mbox/%3cof4e9a3736.19e56fe9-on85257bff.000b4243-85257bff.000d8...@us.ibm.com%3E Can I use the RandomVertex and the RandomEdgeInputFormat to build random graphs on the fly? Thanks a lot in advance.
Best wishes Mirko -- Claudio Martella claudio.marte...@gmail.com
Re: Using the RandomEdge ... RandomVertex InputFormat
Yes, you'll have to make sure that the PseudoRandomEdgeInputFormat provides the right types. The code for the watts strogatz model is in the same package as the PseudoRandom... classes, but in trunk and not in 1.0. On Mon, Nov 4, 2013 at 12:14 PM, Mirko Kämpf mirko.kae...@cloudera.com wrote: Thanks, Claudio. I conclude from your mail that I have to create my own PseudoRandomEdgeInputFormat and PseudoRandomVertexInputFormat with types which fit the algorithm I want to use. So I misunderstood the concept, and not all InputFormats fit any given implemented algorithm. Is this right? But what about the config parameters I have to provide for the PseudoRandom... InputFormat, and where is the code for the watts strogatz model you mentioned in a previous post? Best wishes Mirko -- Claudio Martella claudio.marte...@gmail.com
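The check that fails in the exception above can be illustrated in plain Java. This is a toy model (the class names below are hypothetical stand-ins, not Giraph's real GiraphConfigurationValidator): the generic I/V/E arguments of the input format and the computation class are compared slot by slot via reflection.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Hypothetical stand-ins for Giraph's generic base classes.
abstract class ToyInputFormat<I, V, E> {}
abstract class ToyComputation<I, V, E> {}

// The input format produces Double edge values...
class RandomishInput extends ToyInputFormat<Long, Double, Double> {}
// ...but the computation expects Float edge values, as in the exception above.
class ShortestPathsLike extends ToyComputation<Long, Double, Float> {}

public class TypeCheckDemo {
    static Type[] typeArgs(Class<?> c) {
        // Read the actual I, V, E arguments bound in "extends Base<...>".
        return ((ParameterizedType) c.getGenericSuperclass()).getActualTypeArguments();
    }

    public static void main(String[] args) {
        Type[] in = typeArgs(RandomishInput.class);
        Type[] comp = typeArgs(ShortestPathsLike.class);
        String[] slot = {"vertex id", "vertex value", "edge value"};
        for (int i = 0; i < 3; i++) {
            // The edge value slot mismatches; Giraph's validator would
            // throw IllegalArgumentException at this point.
            System.out.println(slot[i] + " types match: " + in[i].equals(comp[i]));
        }
    }
}
```

This is why the fix is either a different input format or a computation class whose three type parameters line up with the format's.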
Re: Link Prediction with Giraph
I would assume that it depends on your data. A graph is a very general structure, and it is difficult to attack this problem in general. The most obvious approach is transitive closure (if A is connected to B and B to C, then A could be connected to C). The triangle counting example in our codebase (although the name is misleading) is based on these kinds of assumptions. On Thu, Oct 31, 2013 at 1:26 PM, Pascal Jäger pas...@pascaljaeger.de wrote: Hi, Does anyone happen to know a paper about link prediction using a Pregel-like framework like Giraph? Or has someone an idea about how link prediction could be accomplished with Giraph? Any input is highly appreciated :) Thanks Pascal -- Claudio Martella claudio.marte...@gmail.com
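The transitive-closure intuition Claudio mentions underlies the simplest link-prediction score: counting common neighbours of two unconnected vertices. A standalone sketch (not from the Giraph codebase):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy common-neighbours link-prediction score: the more neighbours two
// unconnected vertices share, the more likely an edge between them.
public class CommonNeighborsDemo {
    static int score(Map<String, Set<String>> adj, String a, String b) {
        Set<String> common = new HashSet<>(adj.getOrDefault(a, Set.of()));
        common.retainAll(adj.getOrDefault(b, Set.of()));
        return common.size();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> adj = Map.of(
            "A", Set.of("B"),
            "B", Set.of("A", "C"),
            "C", Set.of("B"));
        // A and C are not connected but share neighbour B, so an A-C
        // edge is a candidate prediction.
        System.out.println("score(A, C) = " + score(adj, "A", "C"));
    }
}
```

In a Pregel-style implementation, each vertex would send its neighbour list to its neighbours in one superstep and intersect the received lists in the next.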
Re: Release date for 1.1.0
I actually agree, we should start heading towards 1.1.0 with a plan. Avery, what do you think? On Tue, Oct 29, 2013 at 2:13 PM, Ahmet Emre Aladağ aladage...@gmail.com wrote: Hi all, Is there an expected date for 1.1.0? A lot of ground has been covered since 1.0.0. -- Ahmet Emre Aladağ -- Claudio Martella claudio.marte...@gmail.com
Re: master knowing about message traffic
The simplest solution is to use an aggregator. On Mon, Oct 21, 2013 at 3:48 PM, Jyoti Yadav rao.jyoti26ya...@gmail.com wrote: Is there any way for the master to know how much message traffic there is? In my algo, I have to implement something when there are no messages flowing. Any ideas are really appreciated.. Regards Jyoti -- Claudio Martella claudio.marte...@gmail.com
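The aggregator pattern Claudio suggests can be modeled in plain Java. This is a standalone toy mirroring the shape of a long-sum aggregator (Giraph ships LongSumAggregator in org.apache.giraph.aggregators; this sketch is not that class): every vertex reports how many messages it sent, and the master reads the global total at the start of the next superstep.

```java
// Toy long-sum aggregator: same aggregate/read/reset life cycle as a
// Giraph aggregator, but standalone (no Giraph dependency).
class LongSumAggregator {
    private long sum = 0;
    void aggregate(long value) { sum += value; }   // called from each vertex's compute()
    long getAggregatedValue() { return sum; }      // read by the master next superstep
    void reset() { sum = 0; }                      // cleared between supersteps
}

public class MessageTrafficDemo {
    public static void main(String[] args) {
        LongSumAggregator traffic = new LongSumAggregator();
        // During a superstep, every vertex reports how many messages it sent:
        traffic.aggregate(3);
        traffic.aggregate(0);
        traffic.aggregate(2);
        // At the start of the next superstep the master checks the total;
        // zero traffic is the cue to halt the computation.
        System.out.println("messages sent: " + traffic.getAggregatedValue());
        if (traffic.getAggregatedValue() == 0) {
            System.out.println("no traffic -> master can halt the computation");
        }
        traffic.reset();
    }
}
```

In real Giraph the aggregator would be registered in a MasterCompute subclass and values contributed via the aggregate() call available in the computation; the life cycle is the same as in this toy.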
Re: Problem running the PageRank example in a cluster
authentication. 2013-10-21 10:12:15,692 WARN org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught: Channel failed with remote address null java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708) at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:404) at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:366) at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) 2013-10-21 10:12:15,693 INFO org.apache.giraph.comm.netty.NettyClient: connectAllAddresses: Successfully added 0 connections, (0 total connected) 1 failed, 6 failures total. 2013-10-21 10:12:15,693 WARN org.apache.giraph.comm.netty.NettyClient: connectAllAddresses: Future failed to connect with hdnode02/172.24.10.72:30001 with 6 failures because of java.net.ConnectException: Connection refused 2013-10-21 10:12:15,693 INFO org.apache.giraph.comm.netty.NettyClient: Using Netty without authentication. 
2013-10-21 10:12:15,694 WARN org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught: Channel failed with remote address null java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708) at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:404) at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:366) at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) -- Claudio Martella claudio.marte...@gmail.com
Re: How to specify parameters in order to run giraph job in parallel
how many mapper tasks do you have set for each node? how many workers are you using for giraph? On Fri, Oct 18, 2013 at 7:12 PM, YAN Da ya...@ust.hk wrote: Dear Claudio Martella, I don't quite get what you mean. Our cluster has 15 servers each with 24 cores, so ideally there can be 15*24 threads/partitions working in parallel, right? (Perhaps deduct one for ZooKeeper.) However, when we set the -Dgiraph.numComputeThreads option, we find that we cannot have even 20 threads, and when it is set to 10, the CPU usage is only about double that of the default setting, nothing close to 100*numComputeThreads%. How can we set it up on our servers to utilize all the processors? Regards, Da Yan It actually depends on the setup of your cluster. Ideally, with 15 nodes (tasktrackers) you'd want 1 mapper slot per node (ideally to run giraph), so that you would have 14 workers, one per computing node, plus one for master+zookeeper. Once that is reached, you would have a number of compute threads equal to the number of threads that you can run on each node (24 in your case). Does this make sense to you? On Thu, Oct 17, 2013 at 5:04 PM, Yi Lu luyi0...@gmail.com wrote: Hi, I have a computer cluster consisting of 15 slave machines and 1 master machine. On each slave machine, there are two Xeon E5-2620 CPUs. With the help of HT, there are 24 threads. I am wondering how to specify parameters in order to run a giraph job in parallel on my cluster. I am using the following parameters to run a pagerank algorithm. hadoop jar ~/giraph-examples.jar org.apache.giraph.GiraphRunner SimplePageRank -vif PageRankInputFormat -vip /input -vof PageRankOutputFormat -op /pagerank -w 1 -mc SimplePageRank\$SimplePageRankMasterCompute -wc SimplePageRank\$SimplePageRankWorkerContext In particular, 1) I know I can use “-w” to specify the number of workers. In my opinion, the number of workers equals the number of mappers in hadoop, except zookeeper. Therefore, in my case (15 slave machines), which number should be chosen? Is 15 a good choice? I find that if I input a large number, e.g. 100, the mappers will hang. 2) I know I can use “-Dgiraph.numComputeThreads=1” to specify the vertex computing thread number. However, if I set it to 10, the total runtime is much longer than the default. I think the default is 1, which I found in the source code. I wonder, if I want to use this parameter, which number should be chosen? 3) When the giraph job is running, I use “top” to monitor CPU usage on the slave machines. I find that the java process can use 200%-300% of CPU resources. However, if I change the number of vertex computing threads to 10, the java process can use 800%. I think it is not a linear relation and I want to know why. Thanks for your help. Best, -Yi -- Claudio Martella claudio.marte...@gmail.com
Re: how to use out of core options
to test the out of core performance of my cluster. Thanks very much, Jian -- Best Regards, Jyotirmoy Sundi Data Engineer, Admobius San Francisco, CA 94158 -- Claudio Martella claudio.marte...@gmail.com
Re: How to specify parameters in order to run giraph job in parallel
It actually depends on the setup of your cluster. Ideally, with 15 nodes (tasktrackers) you'd want 1 mapper slot per node (ideally to run giraph), so that you would have 14 workers, one per computing node, plus one for master+zookeeper. Once that is reached, you would have a number of compute threads equal to the number of threads that you can run on each node (24 in your case). Does this make sense to you? On Thu, Oct 17, 2013 at 5:04 PM, Yi Lu luyi0...@gmail.com wrote: Hi, I have a computer cluster consisting of 15 slave machines and 1 master machine. On each slave machine, there are two Xeon E5-2620 CPUs. With the help of HT, there are 24 threads. I am wondering how to specify parameters in order to run a giraph job in parallel on my cluster. I am using the following parameters to run a pagerank algorithm. hadoop jar ~/giraph-examples.jar org.apache.giraph.GiraphRunner SimplePageRank -vif PageRankInputFormat -vip /input -vof PageRankOutputFormat -op /pagerank -w 1 -mc SimplePageRank\$SimplePageRankMasterCompute -wc SimplePageRank\$SimplePageRankWorkerContext In particular, 1) I know I can use “-w” to specify the number of workers. In my opinion, the number of workers equals the number of mappers in hadoop, except zookeeper. Therefore, in my case (15 slave machines), which number should be chosen? Is 15 a good choice? I find that if I input a large number, e.g. 100, the mappers will hang. 2) I know I can use “-Dgiraph.numComputeThreads=1” to specify the vertex computing thread number. However, if I set it to 10, the total runtime is much longer than the default. I think the default is 1, which I found in the source code. I wonder, if I want to use this parameter, which number should be chosen? 3) When the giraph job is running, I use “top” to monitor CPU usage on the slave machines. I find that the java process can use 200%-300% of CPU resources. However, if I change the number of vertex computing threads to 10, the java process can use 800%. I think it is not a linear relation and I want to know why. Thanks for your help. Best, -Yi -- Claudio Martella claudio.marte...@gmail.com
Re: knowing about the vertex id of the sender of the message.
No, you'll have to add it to the message data. On Thu, Oct 17, 2013 at 6:10 PM, Jyoti Yadav rao.jyoti26ya...@gmail.com wrote: Hi.. In the vertex computation code, at the start of the superstep every vertex processes its received messages. Is there any way for the vertex to know who the sender of the message it is currently processing is? Thanks Jyoti -- Claudio Martella claudio.marte...@gmail.com
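Adding the sender to the message data, as Claudio suggests, means defining a message type that embeds the sender's vertex id. A standalone sketch (in real Giraph the class would implement org.apache.hadoop.io.Writable with write/readFields so it can be serialized between workers):

```java
// Standalone sketch of a message that carries its sender's vertex id.
// In Giraph this class would implement Writable (write/readFields).
class IdMessage {
    final long senderId;
    final double value;

    IdMessage(long senderId, double value) {
        this.senderId = senderId;
        this.value = value;
    }
}

public class SenderIdDemo {
    public static void main(String[] args) {
        // Sender side: vertex 7 sends its value together with its own id.
        IdMessage msg = new IdMessage(7L, 0.85);
        // Receiver side: the vertex can now tell who sent the message.
        System.out.println("from vertex " + msg.senderId + ": " + msg.value);
    }
}
```

The cost is a slightly larger message (one extra id per message), which is usually acceptable when the algorithm genuinely needs sender identity.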
Re: Running the example in http://giraph.apache.org/quick_start.html
format edge value type is not known 13/10/09 14:21:36 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4) 13/10/09 14:21:37 INFO mapred.JobClient: Running job: job_201310091401_0002 13/10/09 14:21:38 INFO mapred.JobClient: map 0% reduce 0% 13/10/09 14:21:52 INFO mapred.JobClient: map 50% reduce 0% 13/10/09 14:21:58 INFO mapred.JobClient: map 100% reduce 0% 13/10/09 14:21:59 INFO mapred.JobClient: map 50% reduce 0% 13/10/09 14:32:01 INFO mapred.JobClient: map 0% reduce 0% 13/10/09 14:32:02 INFO mapred.JobClient: Job complete: job_201310091401_0002 13/10/09 14:32:02 INFO mapred.JobClient: Counters: 6 13/10/09 14:32:02 INFO mapred.JobClient: Job Counters 13/10/09 14:32:02 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=622821 13/10/09 14:32:02 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 13/10/09 14:32:02 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 13/10/09 14:32:02 INFO mapred.JobClient: Launched map tasks=2 13/10/09 14:32:02 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 13/10/09 14:32:02 INFO mapred.JobClient: Failed map tasks=1 I appreciate any help. Maybe I did it wrong. Andro. -- Claudio Martella claudio.marte...@gmail.com
Re: connected components example in giraph 1.0
Can you try applying this one first? http://www.mail-archive.com/user@giraph.apache.org/msg00945/check.diff On Mon, Oct 7, 2013 at 8:40 AM, Silvio Di gregorio silvio.digrego...@gmail.com wrote: As I said, I have built giraph-examples-1.0.0-for-hadoop-2.0.0-cdh4.1.2-jar-with-dependencies.jar for cdh4 successfully. The job starts to monitor the success rate: 13/10/07 08:28:45 INFO mapred.JobClient: map 0% reduce 0% but then: Error running child java.lang.IllegalStateException: run: Caught an unrecoverable exception java.io.FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201309181636_0678/_zkServer does not exist. ... Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201309181636_0678/_zkServer does not exist. 2013/10/5 Silvio Di gregorio silvio.digrego...@gmail.com I have built with the hadoop_cdh4.1.2 parameter. Something has changed; Monday I'll report the result. Now the farm is closed. On 05 Oct 2013 14:06, Claudio Martella claudio.marte...@gmail.com wrote: Oh, right, -vof is in trunk. Anyway it looks like you built giraph for the wrong profile. You mentioned you're running on 2.0, but your giraph is built for 0.20.203. Try building with a profile for your hadoop version. On Fri, Oct 4, 2013 at 2:35 PM, Silvio Di gregorio silvio.digrego...@gmail.com wrote: org.apache.commons.cli.UnrecognizedOptionException: Unrecognized option: -vof in the 1.0 version it is -of,--outputFormat arg Vertex output format -op,--outputPath arg Vertex output path 2013/10/4 Claudio Martella claudio.marte...@gmail.com did you try the argument (-vof) I suggested?
On Fri, Oct 4, 2013 at 2:13 PM, Silvio Di gregorio silvio.digrego...@gmail.com wrote:
I've specified -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat but the same error was produced:

Exception in thread main java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.giraph.bsp.BspOutputFormat.checkOutputSpecs(BspOutputFormat.java:43)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:984)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:566)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596)
at org.apache.giraph.job.GiraphJob.run(GiraphJob.java:237)
at org.apache.giraph.GiraphRunner.run(GiraphRunner.java:94)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.giraph.GiraphRunner.main(GiraphRunner.java:124)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

2013/10/4 Claudio Martella claudio.marte...@gmail.com:
Hi, you need to specify the vertex outputformat class (-vof option), e.g. org.apache.giraph.io.formats.IdWithValueTextOutputFormat.

On Fri, Oct 4, 2013 at 1:06 PM, Silvio Di gregorio silvio.digrego...@gmail.com wrote:
Hi, I hope I have sent this to the right address.
Re: connected components example in giraph 1.0
OK, thanks. I really have to push that patch in.

On Mon, Oct 7, 2013 at 12:17 PM, Silvio Di gregorio silvio.digrego...@gmail.com wrote:
Yes I do; I have seen this in your post at http://www.mail-archive.com/user@giraph.apache.org/msg00957.html. Excuse me: if I had checked the mail archive first, I would have avoided the last post. Now the ZK issue is resolved.
Re: connected components example in giraph 1.0
Oh, right, -vof is in trunk. Anyway, it looks like you built giraph for the wrong profile. You mentioned you're running on 2.0, but your giraph is built for 0.20.203. Try building with a profile for your hadoop version.

On Fri, Oct 4, 2013 at 2:35 PM, Silvio Di gregorio silvio.digrego...@gmail.com wrote:
org.apache.commons.cli.UnrecognizedOptionException: Unrecognized option: -vof
In the 1.0 version it is:
-of,--outputFormat arg   Vertex output format
-op,--outputPath arg     Vertex output path

-- Claudio Martella claudio.marte...@gmail.com
Re: connected components example in giraph 1.0
Hi, you need to specify the vertex output format class (the -vof option), e.g. org.apache.giraph.io.formats.IdWithValueTextOutputFormat.

On Fri, Oct 4, 2013 at 1:06 PM, Silvio Di gregorio silvio.digrego...@gmail.com wrote:
Hi, I hope I have sent this to the right address. I have a graph (directed and unweighted) stored in HDFS as an adjacency list (140 million edges, 6 million vertices):

node<TAB>neighbors
23      2 1343
1       999
99923   909
...

Hadoop version: Hadoop 2.0.0-cdh4.3.0, Java 1.6. I have executed the giraph-1.0 connected components example in this fashion:

hadoop jar /usr/local/giraph/giraph-examples/target/giraph-examples-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.ConnectedComponentsVertex -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat -vip /user/hdfs/lista_adj_txt -op connectedgiraph --workers 4

and then it fails with:

13/10/04 09:28:29 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
13/10/04 09:28:29 INFO utils.ConfigurationUtils: No output format specified. Ensure your OutputFormat does not require one.
13/10/04 09:28:30 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4)
13/10/04 09:28:31 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/10/04 09:28:31 INFO mapred.JobClient: Cleaning up the staging area hdfs://srv-bigdata-dev-01.int.sose.it:8020/user/hdfs/.staging/job_201309181636_0535
Exception in thread main java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.giraph.bsp.BspOutputFormat.checkOutputSpecs(BspOutputFormat.java:43)
..

Thanks in advance

-- Claudio Martella claudio.marte...@gmail.com
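[Editor's note] The input format used here (IntIntNullTextInputFormat) consumes the tab-separated adjacency-list layout shown above. A minimal Python sketch of that parsing, purely for illustration (this is not Giraph's actual Java parser):

```python
def parse_adjacency_line(line):
    """One line of the node<TAB>neighbors format, e.g. "23\t2 1343"."""
    node, _, rest = line.rstrip("\n").partition("\t")
    neighbors = [int(n) for n in rest.split()]
    return int(node), neighbors

def load_graph(lines):
    """Build {vertex_id: [neighbor_ids]} from an iterable of lines."""
    return dict(parse_adjacency_line(line) for line in lines)

# the three sample rows from the message above
graph = load_graph(["23\t2 1343", "1\t999", "99923\t909"])
```

Note that the IncompatibleClassChangeError itself comes from the profile mismatch discussed in this thread, not from the input data.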
Re: workload used to measure Giraph performance number
Hi Wei,

it depends on what you mean by workload for a batch processing system. I believe we can split the problem in two: generating a realistic graph, and using representative algorithms.

To generate graphs we have two options in Giraph:
1) random graph: you specify the number of vertices and the number of edges for each vertex, and the edges will connect two random vertices. This creates a graph with (i) low clustering coefficient, (ii) low average path length, and (iii) a uniform degree distribution.
2) Watts-Strogatz: you specify the number of vertices, the number of edges, and a rewire probability beta. Giraph will generate a ring lattice (each vertex is connected to k preceding vertices and k following vertices) and rewire some of the edges randomly. This creates a graph with (i) high clustering coefficient, (ii) low average path length, and (iii) a Poisson-like degree distribution (depending on beta). This graph will resemble a small-world graph such as a social network, except for the degree distribution, which will not be a power law.

To use representative algorithms you can choose:
1) PageRank: a ranking algorithm where all the vertices are active and send messages along the edges at each superstep (hence you'll have O(V) active vertices and O(E) messages).
2) Shortest Paths: starting from a random vertex, you'll visit all the vertices in the graph (some multiple times). This will have an aggregate O(V) active vertices and O(E) messages, but this is only a lower bound. In general you'll have different areas of the graph explored at each superstep, and hence a potentially varying workload across supersteps.
3) Connected Components: this behaves roughly opposite to (2), as it has many active vertices at the beginning, while the detection is refined towards the end.

Hope this helps,
Claudio

On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang w...@us.ibm.com wrote:
Hi, I am interested in measuring some performance numbers of Giraph on my machine.
I am wondering whether there are some pointers to a (configurable), reasonably large workload I could work on? Thanks! Wei

-- Claudio Martella claudio.marte...@gmail.com
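[Editor's note] The Watts-Strogatz construction described above (ring lattice plus random rewiring) is small enough to sketch. This is an illustrative Python sketch of the general technique, not Giraph's generator; the function name and rewiring details are assumptions:

```python
import random

def watts_strogatz(n, k, beta, seed=0):
    """Build a ring lattice of n vertices, each linked to its k following
    vertices, then rewire every edge to a random endpoint with probability
    beta. Sketch of the construction described above, not Giraph code."""
    rng = random.Random(seed)
    lattice = [(v, (v + i) % n) for v in range(n) for i in range(1, k + 1)]
    edges = set()
    for (u, v) in lattice:
        if rng.random() < beta:
            w = rng.randrange(n)
            # avoid self-loops and duplicate edges from the same source
            while w == u or (u, w) in edges:
                w = rng.randrange(n)
            edges.add((u, w))  # random endpoint shortens average path length
        else:
            edges.add((u, v))  # lattice edge keeps clustering high
    return edges
```

With beta = 0 this is a pure ring lattice (high clustering); as beta grows toward 1 the graph approaches a random graph, which matches the two workload extremes described in the reply.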
Re: Giraph offloadPartition fails creation directory
Weird. This is the code:

if (!parent.exists()) {
  if (!parent.mkdirs()) {
    LOG.error("offloadPartition: Failed to create directory " + parent.getAbsolutePath());
  }
}

The question is why parent.mkdirs() is returning false. It could be a problem of permissions. Could you try passing a different directory for writing, e.g. /tmp/foobar?

On Mon, Sep 23, 2013 at 1:28 PM, Dionysis Logothetis dlogothe...@gmail.com wrote:
offloadPartition: Failed to create directory

-- Claudio Martella claudio.marte...@gmail.com
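[Editor's note] The permission hypothesis is easy to probe outside Giraph. A hypothetical Python sketch that mimics the mkdirs check but surfaces why creation failed instead of only a boolean (all names here are illustrative, not Giraph code):

```python
import os
import tempfile

def ensure_dir(path):
    """Try to create a directory tree and report the OS-level reason on
    failure (e.g. "Permission denied"), mirroring the debugging step
    suggested above. Illustrative sketch only."""
    try:
        os.makedirs(path, exist_ok=True)
        return True, None
    except OSError as e:
        return False, e.strerror

# e.g. probe the /tmp/foobar suggestion from the reply above
ok, err = ensure_dir(os.path.join(tempfile.gettempdir(), "foobar"))
```

Running the equivalent check as the same user the tasktracker runs as would show whether mkdirs() fails due to permissions or to something else (such as a path component being a regular file).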
Re: Number of threads for vertex compute method
By default Giraph uses one compute thread per worker. It uses multiple threads for I/O, e.g. in Netty. The right number of compute threads depends on the number of workers per machine. Imagine you have a machine in your hadoop cluster with 8 cores and 8 mapper tasks (something like the basic setup). Then you don't really need more compute threads per worker, as your cores will be busy all the time. Increasing the number of compute threads is useful when you have a setup with one worker per machine. In that case you'd want one compute thread per core.

On Wed, Sep 11, 2013 at 12:00 PM, Christian Krause m...@ckrause.org wrote:
Hi, by default, how many threads are used for the compute method? I thought that Giraph would automatically use multiple threads by default, but then I stumbled onto this log message:

2013-09-11 11:51:44,501 INFO org.apache.giraph.graph.GraphTaskManager: execute: 6 partitions to process with 1 compute thread(s), originally 1 thread(s) on superstep 7

Does this really mean that it uses only one thread? Cheers, Christian

-- Claudio Martella claudio.marte...@gmail.com
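[Editor's note] The scheduling described above, P partitions worked off by a fixed pool of compute threads, can be sketched as follows. This is illustrative Python, not Giraph code, and the names are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def process_partitions(partitions, num_compute_threads):
    """Drain P partitions with a bounded pool of compute threads, as in
    the log line "6 partitions to process with 1 compute thread(s)"."""
    def compute(partition):
        # stand-in for running compute() on every vertex of the partition
        return sum(partition)
    with ThreadPoolExecutor(max_workers=num_compute_threads) as pool:
        # map() keeps partition order while the pool bounds parallelism
        return list(pool.map(compute, partitions))

# the default per-worker setup: 6 partitions, 1 compute thread
results = process_partitions([[1, 2], [3], [4, 5], [6], [7], [8]], 1)
```

With 8 single-threaded workers on an 8-core box every core is already busy, which is why more compute threads per worker only pay off in a one-worker-per-machine setup.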
Re: Giraph offloadPartition fails creation directory
Giraph does not offload partitions or messages to HDFS in the out-of-core module. It uses the local disk of the computing nodes. By default, it uses the tasktracker local directory, where for example the distributed cache is stored. Could you provide the stacktrace Giraph is spitting out when failing?

On Thu, Sep 12, 2013 at 12:54 AM, Alexander Asplund alexaspl...@gmail.com wrote:
Hi, I'm still trying to get Giraph to work on a graph that requires more memory than is available. The problem is that when the workers try to offload partitions, the offloading fails. The DiskBackedPartitionStore fails to create the directory _bsp/_partitions/job-/part-vertices-xxx (roughly from recall). The input or computation will then continue for a while, which I believe is because it is still managing to hold everything in memory - but at some point it reaches the limit where there simply is no more heap space, and it crashes with OOM. Has anybody had this problem with Giraph failing to make HDFS directories?

-- Claudio Martella claudio.marte...@gmail.com
Re: Out of core execution has no effect on GC crash
As David mentions, even with OOC the objects are still created (and yes, often destroyed soon after being spilled to disk), putting pressure on the GC. Moreover, as the size of the graph increases, the in-memory vertices are not the only growing chunk of memory: there are other memory stores around the codebase that get filled, such as caches etc. Try increasing the heap to something reasonable for your machines.

On Tue, Sep 10, 2013 at 3:21 AM, David Boyd db...@data-tactics-corp.com wrote:
Alexander: You might try turning off the GC overhead limit (-XX:-UseGCOverheadLimit). Also, you could turn on verbose GC logging (-verbose:gc -Xloggc:/tmp/@taskid@.gc) to see what is happening. Because the OOC still has to create and destroy objects, I suspect that the heap is just getting really fragmented. There are options that you can set with Java to change the type of garbage collection and how it is scheduled as well. You might up the heap size slightly - what is the default heap size on your cluster?

On 9/9/2013 8:33 PM, Alexander Asplund wrote:
A small note: I'm not seeing any partitions directory being formed under _bsp, which is where I have understood they should be appearing.

On 9/10/13, Alexander Asplund alexaspl...@gmail.com wrote:
Really appreciate the swift responses! Thanks again. I have not both increased mapper tasks and decreased the max number of partitions at the same time. I first did tests with increased mapper heap available, but reset the setting after it apparently caused other large-volume, non-Giraph jobs to crash nodes when reducers were also running. I'm curious why increasing mapper heap is a requirement. Shouldn't the OOC mode be able to work with the amount of heap that is available? Is there some agreement on the minimum amount of heap necessary for OOC to succeed, to guide the choice of mapper heap amount? Either way, I will try increasing mapper heap again as much as possible, which hopefully will run.
On 9/9/13, Claudio Martella claudio.marte...@gmail.com wrote:
did you extend the heap available to the mapper tasks? e.g. through mapred.child.java.opts.

On Tue, Sep 10, 2013 at 12:50 AM, Alexander Asplund alexaspl...@gmail.com wrote:
Thanks for the reply. I tried setting giraph.maxPartitionsInMemory to 1, but I'm still getting OOM: GC limit exceeded. Are there any particular cases the OOC will not be able to handle, or is it supposed to work in all cases? If the latter, it might be that I have made some configuration error. I do have one concern that might indicate I have done something wrong: to allow OOC to activate without crashing, I had to modify the trunk code. This was because Giraph relied on guava-12, and DiskBackedPartitionStore used hasInt() - a method which does not exist in guava-11, which hadoop 2 depends on. At runtime guava 11 was being used. I suppose this problem might indicate I'm submitting the job using the wrong binary. Currently I am including the giraph dependencies with the jar, and running using hadoop jar.

On 9/7/13, Claudio Martella claudio.marte...@gmail.com wrote:
OOC is used also at input superstep. Try to decrease the number of partitions kept in memory.

On Sat, Sep 7, 2013 at 1:37 AM, Alexander Asplund alexaspl...@gmail.com wrote:
Hi, I'm trying to process a graph that is about 3 times the size of available memory. On the other hand, there is plenty of disk space. I have enabled the giraph.useOutOfCoreGraph property, but it still crashes with OutOfMemoryError: GC limit exceeded when I try running my job. I'm wondering if the spilling is supposed to work during the input step. If so, are there any additional steps that must be taken to ensure it functions? Regards, Alexander Asplund

--
David W. Boyd
Director, Engineering
7901 Jones Branch, Suite 700, McLean, VA 22102
office: +1-571-279-2122  fax: +1-703-506-6703  cell: +1-703-402-7908
mailto:db...@data-tactics.com
http://www.data-tactics.com.com/
First Robotic Mentor - FRC, FTC - www.iliterobotics.org
President - USSTEM Foundation - www.usstem.org

-- Claudio
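[Editor's note] The out-of-core behaviour discussed in this thread, keeping at most giraph.maxPartitionsInMemory partitions resident and spilling the rest to local disk, can be illustrated with a toy store. This is a sketch of the idea only, not Giraph's DiskBackedPartitionStore, and every name in it is invented:

```python
import collections
import os
import pickle
import tempfile

class SpillingPartitionStore:
    """Keep at most max_in_memory partitions resident; spill the least
    recently used ones to local disk. Analogous in spirit to
    giraph.maxPartitionsInMemory, purely for illustration."""

    def __init__(self, max_in_memory, spill_dir=None):
        self.max_in_memory = max_in_memory
        self.spill_dir = spill_dir or tempfile.mkdtemp()
        self.in_memory = collections.OrderedDict()  # insertion = LRU order
        self.on_disk = {}                           # pid -> spill file path

    def _spill_path(self, pid):
        return os.path.join(self.spill_dir, "part-vertices-%d" % pid)

    def put(self, pid, vertices):
        self.in_memory[pid] = vertices
        self.in_memory.move_to_end(pid)
        while len(self.in_memory) > self.max_in_memory:
            victim, data = self.in_memory.popitem(last=False)  # evict LRU
            with open(self._spill_path(victim), "wb") as f:
                pickle.dump(data, f)
            self.on_disk[victim] = self._spill_path(victim)

    def get(self, pid):
        if pid not in self.in_memory:
            with open(self.on_disk.pop(pid), "rb") as f:
                self.put(pid, pickle.load(f))  # reload may evict another
        self.in_memory.move_to_end(pid)
        return self.in_memory[pid]
```

Note how the spilled objects are still created and destroyed on every reload, which is exactly the GC pressure described above: spilling bounds resident data, not allocation churn.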
Re: Counter limit
On the command line, you can use the -D option after the GiraphRunner class and before the GiraphRunner-specific parameters, e.g. -D giraph.useSuperstepCounters=false

On Tue, Sep 10, 2013 at 1:15 PM, Christian Krause m...@ckrause.org wrote:
Thanks a lot. One last question: where do I set options like USE_SUPERSTEP_COUNTERS? Christian

2013/9/9 André Kelpe efeshundert...@googlemail.com:
On older versions of hadoop, you cannot set the counters to a higher value. That was only introduced later. I had this issue on CDH3 (~1.5 years ago), and my solution was to disable all counters for the giraph job to make it work. If you use a more modern version of hadoop, it should be possible to increase the limit though. - André

2013/9/9 Avery Ching ach...@apache.org:
If you are running out of counters, you can turn off the superstep counters:

/** Use superstep counters? (boolean) */
BooleanConfOption USE_SUPERSTEP_COUNTERS =
    new BooleanConfOption("giraph.useSuperstepCounters", true,
        "Use superstep counters? (boolean)");

On 9/9/13 6:43 AM, Claudio Martella wrote:
No, I used a different counters limit on that hadoop version. Setting mapreduce.job.counters.limit to a higher number and restarting the JT and TT worked for me. Maybe 64000 might be too high? Try setting it to 512. It does not look like the case, but who knows.

On Mon, Sep 9, 2013 at 2:57 PM, Christian Krause m...@ckrause.org wrote:
Sorry, it still doesn't work (I ran into a different problem before I reached the limit). I am using Hadoop 0.20.203.0. Is the limit of 120 counters maybe hardcoded? Cheers Christian

On 09.09.2013 08:29, Christian Krause m...@ckrause.org wrote:
I changed the property name to mapred.job.counters.limit and restarted it again. Now it works. Thanks, Christian

2013/9/7 Claudio Martella claudio.marte...@gmail.com:
did you restart TT and JT?
On Sat, Sep 7, 2013 at 7:09 AM, Christian Krause m...@ckrause.org wrote:
Hi, I've increased the counter limit in mapred-site.xml, but I still get the error: Exceeded counter limits - Counters=121 Limit=120. Groups=6 Limit=50. This is my config:

cat conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  ...
  <property>
    <name>mapreduce.job.counters.limit</name>
    <value>64000</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>240</value>
  </property>
  ...
</configuration>

Any ideas? Cheers, Christian

-- Claudio Martella claudio.marte...@gmail.com
Re: Finding missing links in a lineage graph..
Hi Sushanta,

you'll have to write your own algorithm that acts depending on the labels along the edges.

On Tue, Sep 10, 2013 at 9:46 AM, Sushanta Pradhan sushanta.prad...@talentica.com wrote:
Hi, I am trying to create a lineage graph from incomplete data, i.e. a few relationships are missing. Example: if I have the following subset of a lineage graph:

Ram ---child--- Luv
Ram ---wife--- Sita

the full lineage graph would be:

Ram ---child--- Luv
Ram ---wife--- Sita
Sita ---child--- Luv
Luv ---father--- Ram
Luv ---mother--- Sita

Is there an API in Giraph which takes certain rules as input and can find these missing links and create them? Thanks, Sushant

-- Claudio Martella claudio.marte...@gmail.com
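[Editor's note] The "write your own algorithm" suggestion amounts to a small fixed-point computation over labeled edges. A hypothetical sketch over (subject, relation, object) triples; the rules are invented for this one example (they assume a wife's husband is the father) and are not any Giraph API:

```python
def complete(triples):
    """Fixed-point closure of (subject, relation, object) triples under
    three invented rules, illustrative only:
      R1: (x, child, c) and (x, wife, y)  =>  (y, child, c)
      R2: (x, child, c) and (x, wife, _)  =>  (c, father, x)
      R3: (y, child, c) and (_, wife, y)  =>  (c, mother, y)
    """
    facts = set(triples)
    while True:
        new = set()
        for (x, r, c) in facts:
            if r != "child":
                continue
            for (a, r2, b) in facts:
                if r2 != "wife":
                    continue
                if a == x:  # x has wife b: b shares the child, x is father
                    new.add((b, "child", c))
                    new.add((c, "father", x))
                if b == x:  # x is someone's wife: x is the mother
                    new.add((c, "mother", x))
        if new <= facts:  # nothing inferred: fixed point reached
            return facts
        facts |= new

facts = complete({("Ram", "child", "Luv"), ("Ram", "wife", "Sita")})
```

In Giraph the same iteration would map naturally onto supersteps: each vertex applies the rules to its incident edges and messages the inferred edges to the vertices that should own them.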
Re: Counter limit
You can set it in your giraph-site.xml, but it should work on the command line.

On Tue, Sep 10, 2013 at 1:44 PM, Christian Krause m...@ckrause.org wrote:
I still see the number of counters increasing in the job tracker :(. Can I also set it in my giraph-site.xml or directly in my MasterCompute class? Cheers, Christian

-- Claudio Martella claudio.marte...@gmail.com
Re: Out of core execution has no effect on GC crash
did you extend the heap available to the mapper tasks? e.g. through mapred.child.java.opts.

On Tue, Sep 10, 2013 at 12:50 AM, Alexander Asplund alexaspl...@gmail.com wrote:
Thanks for the reply. I tried setting giraph.maxPartitionsInMemory to 1, but I'm still getting OOM: GC limit exceeded. Are there any particular cases the OOC will not be able to handle, or is it supposed to work in all cases? If the latter, it might be that I have made some configuration error.

-- Claudio Martella claudio.marte...@gmail.com
Re: Out of core execution has no effect on GC crash
OOC is used also at input superstep. Try to decrease the number of partitions kept in memory.

On Sat, Sep 7, 2013 at 1:37 AM, Alexander Asplund alexaspl...@gmail.com wrote:
Hi, I'm trying to process a graph that is about 3 times the size of available memory. On the other hand, there is plenty of disk space. I have enabled the giraph.useOutOfCoreGraph property, but it still crashes with OutOfMemoryError: GC limit exceeded when I try running my job. I'm wondering if the spilling is supposed to work during the input step. If so, are there any additional steps that must be taken to ensure it functions? Regards, Alexander Asplund

-- Claudio Martella claudio.marte...@gmail.com
Re: MySQL Table
Hi Bu, no, currently we do not have a DBInputFormat. We have an open issue with a Google Summer of Code student working on a GoraInputFormat, which also supports reading from RDBMSs through Gora. However, if/when it gets in, it will not provide semantics as rich as DBInputFormat's: you'll only be able to issue scan-like/range queries, instead of arbitrary queries as with DBInputFormat. I think that creating a DB[Vertex|Edge]InputFormat starting from the hadoop DBInputFormat should not be too hard and could prove to be a very useful contribution. If you think about providing an implementation, I can provide guidance. Best, Claudio On Fri, Sep 6, 2013 at 1:45 AM, Bu Xiao buxia...@gmail.com wrote: Hi Girapher, I am currently working on an algorithm that requires reading the vertices from a MySQL table and not from HDFS. I thought that there has to be a way of reading data from a SQL table, since Giraph is built on top of Hadoop, but I do not seem to figure this part out. Do you have a class similar to the DBInputFormat in Hadoop? Thank you very much for your help. -- Claudio Martella claudio.marte...@gmail.com
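A DB[Vertex]InputFormat of the kind suggested above would essentially wrap Hadoop's DBInputFormat: delegate split generation and record reading to it, and turn each row into a vertex. The Hadoop half might look like the sketch below; the table, column, connection and class names are invented for illustration, and the Giraph wrapper itself is left out.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

/** One row of a hypothetical "vertices(id BIGINT, value DOUBLE)" table. */
public class VertexRow implements Writable, DBWritable {
  long id;
  double value;

  @Override public void readFields(ResultSet rs) throws SQLException {
    id = rs.getLong("id");
    value = rs.getDouble("value");
  }
  @Override public void write(PreparedStatement ps) throws SQLException {
    ps.setLong(1, id);
    ps.setDouble(2, value);
  }
  @Override public void readFields(DataInput in) throws IOException {
    id = in.readLong();
    value = in.readDouble();
  }
  @Override public void write(DataOutput out) throws IOException {
    out.writeLong(id);
    out.writeDouble(value);
  }

  /** Point DBInputFormat at the MySQL table; splits are ranges over ORDER BY id. */
  public static void configure(Job job) {
    DBConfiguration.configureDB(job.getConfiguration(),
        "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/graphdb", "user", "pass");
    DBInputFormat.setInput(job, VertexRow.class,
        "vertices", null /* WHERE conditions */, "id" /* ORDER BY */,
        "id", "value");
  }
}
```

A Giraph DBVertexInputFormat would then return DBInputFormat's splits from getSplits() and wrap its RecordReader in a VertexReader that builds a Vertex out of each VertexRow.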
Re: Dynamic Graphs
Hi Mirko, this is in general the kind of approach I was suggesting, but looked at from a broader perspective. I'd tend to avoid calling other tools such as Hive or Pig often to compute injections, as Giraph is still a batch-processing system, and this could really introduce latency and reduce throughput. I feel that if the injection of vertices and edges really required such complexity (such as computing them with M/R), then one could just create a pipeline of jobs. But this is only my superficial analysis/speculation; I can see your point on integration, and your proposal is very interesting. On Sun, Aug 25, 2013 at 8:55 AM, Mirko Kämpf mirko.kae...@cloudera.com wrote: Good morning Gentlemen, as far as I understand your thread, you are talking about the same topic I have been thinking about and working on for some time. I work on a research project focused on the evolution of networks and network dynamics in networks of networks. My understanding of Marco's question is that he needs to change node properties or even wants to add nodes to the graph while it is processed, right? With the WorkerContext we could construct a Connector to the outside world, not just for loading data from HDFS, which also requires a preprocessing step for the data to be loaded. I think about HBase often. All my nodes and edges live in HBase. From there it is quite easy to load new data based on a simple Scan, or even, if the WorkerContext triggers a Hive or Pig script, one can automatically reorganize or extract relevant new links / nodes which have to be added to the graph. Such an approach means that after n supersteps of the Giraph layer an additional utility-step (triggered via the WorkerContext, or any other better-fitting Giraph class - not sure yet where to start) is executed. Before such a step the state of the graph is persisted to allow fallback or resume.
The utility-step can be a processing (MR, Mahout) or just a load (from HDFS, HBase) operation, and it allows a kind of clocked data flow directly into a running Giraph application. I think this is a very important feature in Complex Systems research, as we have interacting layers which change in parallel. In this picture the Giraph steps are the steps of layer A, let's say something that's going on on top of a network, and the utility-step expresses the changes in the underlying structure affecting the network itself, but based on the data / properties of the second subsystem, e.g. the agents operating on top of the network. I created a tool which worked like this - but not at scale - and it was at a time before Giraph. What do you think, is there a need for such a kind of extension in the Giraph world? Have a nice Sunday. Best wishes Mirko -- -- Mirko Kämpf *Trainer* @ Cloudera tel: +49 *176 20 63 51 99* skype: *kamir1604* mi...@cloudera.com On Wed, Aug 21, 2013 at 3:30 PM, Claudio Martella claudio.marte...@gmail.com wrote: As I said, the injection of the new vertices/edges would have to be done manually, hence without any support from the infrastructure. I'd suggest you implement a WorkerContext class that supports the reading of a specific file with a specific format (under your control) from HDFS, and that is accessed by this particular special vertex (e.g. based on the vertex ID). Does this make sense? On Wed, Aug 21, 2013 at 2:13 PM, Marco Aurelio Barbosa Fagnani Lotz m.a.b.l...@stu12.qmul.ac.uk wrote: Dear Mr. Martella, Once the conditions for updating the vertex database are met, what is the best way for the Injector Vertex to call an input reader again? I am able to access all the HDFS data, but I guess the vertex would need to have access to the input splits and also the vertex input format that I designate. Am I correct? Or is there a way one can just ask Zookeeper to create new splits and distribute them to the workers, given a path in DFS?
Best Regards, Marco Lotz -- *From:* Claudio Martella claudio.marte...@gmail.com *Sent:* 14 August 2013 15:25 *To:* user@giraph.apache.org *Subject:* Re: Dynamic Graphs Hi Marco, Giraph currently does not support that. One way of doing this would be by having a specific (pseudo-)vertex act as the injector of the new vertices and edges. For example, it would read a file from HDFS and call the mutation API during the computation, superstep after superstep. On Wed, Aug 14, 2013 at 3:02 PM, Marco Aurelio Barbosa Fagnani Lotz m.a.b.l...@stu12.qmul.ac.uk wrote: Hello all, I would like to know if there is any way to use dynamic graphs with Giraph. By dynamic I mean graphs that may change while Giraph is computing. The changes are in the input file and are not caused by the graph computation itself. Is there any way to analyse this using Giraph? If not, does anyone have an idea/suggestion on whether it is possible to modify the framework in order to process it? Best Regards, Marco Lotz
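The injector-vertex idea above can be sketched in code. The following is only an illustration, not a tested implementation: the reserved injector ID, the HDFS path convention, and the tab-separated "src dst weight" update format are all invented, and it assumes Giraph's BasicComputation mutation API (addEdgeRequest and friends).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.giraph.edge.EdgeFactory;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class InjectorComputation extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  /** ID reserved for the pseudo-vertex that injects graph changes. */
  private static final long INJECTOR_ID = -1L;

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (vertex.getId().get() == INJECTOR_ID) {
      // One update file per superstep, dropped into HDFS by an external process.
      Path updates = new Path("/updates/superstep-" + getSuperstep());
      FileSystem fs = FileSystem.get(getConf());
      if (fs.exists(updates)) {
        BufferedReader in =
            new BufferedReader(new InputStreamReader(fs.open(updates)));
        String line;
        while ((line = in.readLine()) != null) {
          String[] t = line.split("\t");   // src \t dst \t weight
          addEdgeRequest(new LongWritable(Long.parseLong(t[0])),
              EdgeFactory.create(new LongWritable(Long.parseLong(t[1])),
                  new FloatWritable(Float.parseFloat(t[2]))));
        }
        in.close();
      }
      return; // the injector takes no part in the actual algorithm
    }
    // ... the normal algorithm for regular vertices goes here ...
    vertex.voteToHalt();
  }
}
```

Mutations requested in superstep S become visible in superstep S+1, which matches the "superstep after superstep" injection described above.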
Re: FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist.
.. WatchedEvent state:SyncConnected type:None path:null [zk: 127.0.0.1:2181(CONNECTED) 0] ls / [hbase, zookeeper] [zk: 127.0.0.1:2181(CONNECTED) 1] However, I am a bit confused. If I look in the zookeeper log-file I see this port 2181 'Address already in use' error, 2013-09-03 10:52:24,412 [myid:] - INFO [main:ZooKeeperServer@735] - minSessionTimeout set to -1 2013-09-03 10:52:24,413 [myid:] - INFO [main:ZooKeeperServer@744] - maxSessionTimeout set to -1 2013-09-03 10:52:24,436 [myid:] - INFO [main:NIOServerCnxnFactory@99] - binding to port 0.0.0.0/0.0.0.0:2181 2013-09-03 10:52:24,447 [myid:] - ERROR [main:ZooKeeperServerMain@68] - Unexpected exception, exiting abnormally java.net.BindException: Address already in use at sun.nio.ch.Net.bind(Native Method) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:52) at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:100) at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:115) at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:91) The process listening on port 2181 is 2892, which turns out to be HBase. [root@localhost giraph]# fuser 2181/tcp 2181/tcp: 2892 [root@localhost giraph]# ps aux | grep 2892 hbase 2892 0.1 3.2 719592 119624 ? Sl Aug29 7:35 /usr/java/jdk1.6.0_31/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx500m -XX:+UseConcMarkSweepGC -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-master-localhost.localdomain.log -Dhbase.home.dir=/usr/lib/hbase/bin/.. .. So I am not sure what my zookeeper client is connecting to. It seems to be connecting to a zookeeper server but when I do 'ps' I cannot see a zookeeper server running. 
Here is my zoo.cfg file,
maxClientCnxns=50
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/lib/zookeeper
# the port at which the clients will connect
clientPort=2181
server.1=localhost:2888:3888
Thanks for any help, Ken -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com
Re: FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist.
(QuorumPeerMain.java:121) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:79) [root@localhost giraph]# Thank you for any help, Ken -- From: claudio.marte...@gmail.com Date: Tue, 3 Sep 2013 12:43:59 +0200 Subject: Re: FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist. To: user@giraph.apache.org can you try defining the zookeeper manager directory from the command line? like this -D giraph.zkManagerDirectory=/path/in/hdfs/foobar you'll have to delete this directory by hand before each job. Just to see if it solves the problem. Then I could know how to fix it. On Tue, Sep 3, 2013 at 12:32 PM, Ken Williams zoo9...@hotmail.com wrote: Hi Pradeep, Yes, the zookeeper server is definitely running, I can connect to it with the command-line client [root@localhost giraph]# zkCli.sh -server 127.0.0.1:2181 Connecting to 127.0.0.1:2181 2013-09-03 11:15:45,987 [myid:] - INFO [main:Environment@100] - Client environment:zookeeper.version=3.4.3-cdh4.1.1--1, built on 10/16/2012 17:34 GMT 2013-09-03 11:15:45,990 [myid:] - INFO [main:Environment@100] - Client environment:host.name=localhost.localdomain 2013-09-03 11:15:45,990 [myid:] - INFO [main:Environment@100] - Client environment:java.version=1.6.0_31 .. WatchedEvent state:SyncConnected type:None path:null [zk: 127.0.0.1:2181(CONNECTED) 0] ls / [hbase, zookeeper] [zk: 127.0.0.1:2181(CONNECTED) 1] However, I am a bit confused. 
If I look in the zookeeper log-file I see this port 2181 'Address already in use' error, 2013-09-03 10:52:24,412 [myid:] - INFO [main:ZooKeeperServer@735] - minSessionTimeout set to -1 2013-09-03 10:52:24,413 [myid:] - INFO [main:ZooKeeperServer@744] - maxSessionTimeout set to -1 2013-09-03 10:52:24,436 [myid:] - INFO [main:NIOServerCnxnFactory@99] - binding to port 0.0.0.0/0.0.0.0:2181 2013-09-03 10:52:24,447 [myid:] - ERROR [main:ZooKeeperServerMain@68] - Unexpected exception, exiting abnormally java.net.BindException: Address already in use at sun.nio.ch.Net.bind(Native Method) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:52) at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:100) at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:115) at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:91) The process listening on port 2181 is 2892, which turns out to be HBase. [root@localhost giraph]# fuser 2181/tcp 2181/tcp: 2892 [root@localhost giraph]# ps aux | grep 2892 hbase 2892 0.1 3.2 719592 119624 ? Sl Aug29 7:35 /usr/java/jdk1.6.0_31/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx500m -XX:+UseConcMarkSweepGC -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-master-localhost.localdomain.log -Dhbase.home.dir=/usr/lib/hbase/bin/.. .. So I am not sure what my zookeeper client is connecting to. It seems to be connecting to a zookeeper server but when I do 'ps' I cannot see a zookeeper server running. 
Here is my zoo.cfg file, maxClientCnxns=50 # The number of milliseconds of each tick tickTime=2000 # The number of ticks that the initial synchronization phase can take initLimit=10 # The number of ticks that can pass between # sending a request and getting an acknowledgement syncLimit=5 # the directory where the snapshot is stored. dataDir=/var/lib/zookeeper # the port at which the clients will connect clientPort=2181 server.1=localhost:2888:3888 Thanks for any help, Ken -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com
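Given the diagnosis above (something already answering on 2181 while Giraph's own managed ZooKeeper fails to come up), two experiments suggest themselves. Both commands are only sketches: the jar, class, and path names are placeholders.

```shell
# Option 1: reuse the ZooKeeper that zkCli.sh can already reach,
# instead of letting Giraph spawn and manage its own instance.
hadoop jar giraph-ex.jar org.apache.giraph.GiraphRunner \
  -D giraph.zkList=localhost:2181 \
  org.example.MyComputation -vif org.example.MyVertexInputFormat \
  -vip /input -of org.example.MyVertexOutputFormat -op /output -w 1

# Option 2: relocate the ZooKeeper manager directory, as suggested
# earlier in the thread, clearing it by hand before each job.
hadoop fs -rmr /tmp/giraph-zk
hadoop jar giraph-ex.jar org.apache.giraph.GiraphRunner \
  -D giraph.zkManagerDirectory=/tmp/giraph-zk \
  org.example.MyComputation -vif org.example.MyVertexInputFormat \
  -vip /input -of org.example.MyVertexOutputFormat -op /output -w 1
```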
Re: FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist.
job.GiraphConfigurationValidator: Output format vertex index type is not known
13/09/02 17:06:36 WARN job.GiraphConfigurationValidator: Output format vertex value type is not known
13/09/02 17:06:36 WARN job.GiraphConfigurationValidator: Output format edge value type is not known
13/09/02 17:06:36 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4)
13/09/02 17:06:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/09/02 17:06:40 INFO mapred.JobClient: Running job: job_201308291126_0029
13/09/02 17:06:41 INFO mapred.JobClient: map 0% reduce 0%
13/09/02 17:06:51 INFO mapred.JobClient: Job complete: job_201308291126_0029
13/09/02 17:06:51 INFO mapred.JobClient: Counters: 6
13/09/02 17:06:51 INFO mapred.JobClient: Job Counters
13/09/02 17:06:51 INFO mapred.JobClient: Failed map tasks=1
13/09/02 17:06:51 INFO mapred.JobClient: Launched map tasks=2
13/09/02 17:06:51 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=16515
13/09/02 17:06:51 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
13/09/02 17:06:51 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/09/02 17:06:51 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
[root@localhost giraph]#
There are no errors but no output is produced, and in the Web UI I can see the 2 map tasks have both failed. When I look in the log files this is the exception I see thrown:
java.lang.IllegalStateException: run: Caught an unrecoverable exception java.io.FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist.
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:102)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:645)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist.
at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:790)
at org.apache.giraph.graph.GraphTaskManager.startZooKeeperManager(GraphTaskManager.java
Every time I run a new job, it throws this same error. I have a copy of Zookeeper installed here,
[root@localhost giraph]# /usr/lib/zookeeper/bin/zkServer.sh status
JMX enabled by default
Using config: /usr/lib/zookeeper/bin/../conf/zoo.cfg
Mode: standalone
[root@localhost giraph]#
Any help would be greatly appreciated. Thank you, Ken -- Pradeep Kumar -- Claudio Martella claudio.marte...@gmail.com
Re: FileNotFoundException: File _bsp/_defaultZkManagerDir/job_201308291126_0029/_zkServer does not exist.
) at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:115) at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:91) The process listening on port 2181 is 2892, which turns out to be HBase. [root@localhost giraph]# fuser 2181/tcp 2181/tcp: 2892 [root@localhost giraph]# ps aux | grep 2892 hbase 2892 0.1 3.2 719592 119624 ? Sl Aug29 7:35 /usr/java/jdk1.6.0_31/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx500m -XX:+UseConcMarkSweepGC -Dhbase.log.dir=/var/log/hbase -Dhbase.log.file=hbase-hbase-master-localhost.localdomain.log -Dhbase.home.dir=/usr/lib/hbase/bin/.. .. So I am not sure what my zookeeper client is connecting to. It seems to be connecting to a zookeeper server but when I do 'ps' I cannot see a zookeeper server running. Here is my zoo.cfg file, maxClientCnxns=50 # The number of milliseconds of each tick tickTime=2000 # The number of ticks that the initial synchronization phase can take initLimit=10 # The number of ticks that can pass between # sending a request and getting an acknowledgement syncLimit=5 # the directory where the snapshot is stored. dataDir=/var/lib/zookeeper # the port at which the clients will connect clientPort=2181 server.1=localhost:2888:3888 Thanks for any help, Ken -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com
Re: Passing Custom Arguments for giraph.zkList
zk1 is supposed to be a hostname. On Thu, Aug 29, 2013 at 11:05 PM, Ramani, Arun aram...@paypal.com wrote: Hi, I am trying to pass a zookeeper quorum to my giraph job and it throws the following exception: 13/08/29 13:14:38 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one. 13/08/29 13:14:38 INFO utils.ConfigurationUtils: No output format specified. Ensure your OutputFormat does not require one. 13/08/29 13:14:38 INFO utils.ConfigurationUtils: Setting custom argument [giraph.zkList] to zk1 in GiraphConfiguration Exception in thread main java.lang.IllegalArgumentException: Unable to parse custom argument: zk2:port at org.apache.giraph.utils.ConfigurationUtils.populateGiraphConfiguration(ConfigurationUtils.java:288) at org.apache.giraph.utils.ConfigurationUtils.parseArgs(ConfigurationUtils.java:147) at com.paypal.risk.rd.giraph.AccountPropagation.run(AccountPropagation.java:46) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at com.paypal.risk.rd.giraph.AccountPropagation.main(AccountPropagation.java:98) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:197) I pass the zklist like this: Hadoop jar GRAPH.jar CLASSNAME -vip CLASS NAME -vif CLASS NAME -wc CLASS NAME -w worker number -ca giraph.zkList=zk1:port,zk2:port,zk3:port,zk4:port,zk5:port Please suggest what is wrong with this invocation. Thanks Arun Ramani -- Claudio Martella claudio.marte...@gmail.com
Re: Passing Custom Arguments for giraph.zkList
the problem is not the format of the string, but the way you're passing it. Try passing it as -D giraph.zkList=... before the giraphrunner options. that should work. On Thu, Aug 29, 2013 at 11:47 PM, Ramani, Arun aram...@paypal.com wrote: Hi Claudio, Yes zk1, zk2, zk3, zk4 and zk5 are all zookeeper hostnames. These 5 hosts make a zookeeper quorum. Please let me know how to pass this. Thanks Arun Ramani From: Claudio Martella claudio.marte...@gmail.com Reply-To: user@giraph.apache.org user@giraph.apache.org Date: Thursday, August 29, 2013 2:18 PM To: user@giraph.apache.org user@giraph.apache.org Subject: Re: Passing Custom Arguments for giraph.zkList zk1 is supposed to be a hostname. On Thu, Aug 29, 2013 at 11:05 PM, Ramani, Arun aram...@paypal.com wrote: Hi, I am trying to pass a zookeeper quorum to my giraph job and it throws the following exception: 13/08/29 13:14:38 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one. 13/08/29 13:14:38 INFO utils.ConfigurationUtils: No output format specified. Ensure your OutputFormat does not require one. 
13/08/29 13:14:38 INFO utils.ConfigurationUtils: Setting custom argument [giraph.zkList] to zk1 in GiraphConfiguration Exception in thread main java.lang.IllegalArgumentException: Unable to parse custom argument: zk2:port at org.apache.giraph.utils.ConfigurationUtils.populateGiraphConfiguration(ConfigurationUtils.java:288) at org.apache.giraph.utils.ConfigurationUtils.parseArgs(ConfigurationUtils.java:147) at com.paypal.risk.rd.giraph.AccountPropagation.run(AccountPropagation.java:46) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at com.paypal.risk.rd.giraph.AccountPropagation.main(AccountPropagation.java:98) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:197) I pass the zklist like this: Hadoop jar GRAPH.jar CLASSNAME -vip CLASS NAME -vif CLASS NAME -wc CLASS NAME -w worker number -ca giraph.zkList=zk1:port,zk2:port,zk3:port,zk4:port,zk5:port Please suggest what is wrong with this invocation. Thanks Arun Ramani -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com
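Concretely, the fix suggested above is to hand the quorum to Hadoop's GenericOptionsParser via -D rather than to Giraph's custom-argument (-ca) parser, which, judging by the "Unable to parse custom argument: zk2:port" error, splits the value at the commas and then expects each piece to be a key=value pair. A sketch of the corrected invocation (class names and paths are placeholders; 2181 stands in for the elided "port" only because it is the conventional ZooKeeper client port):

```shell
hadoop jar graph.jar org.example.AccountPropagation \
  -D giraph.zkList=zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181 \
  -vif org.example.MyVertexInputFormat -vip /input \
  -wc org.example.MyWorkerContext -w 10
```

The -D options must come before the GiraphRunner-style options, since GenericOptionsParser only consumes leading arguments.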
Re: Help needed for Running my own java programs in Giraph
OK, then I'm going to open an issue for that. On Mon, Aug 26, 2013 at 11:23 AM, Vivek Sembium vivek.semb...@gmail.com wrote: Yes, for the zookeeper problem I passed a separate jar through the -libjars command. If I use additional jars zookeeper fails. On Aug 26, 2013 2:51 PM, Claudio Martella claudio.marte...@gmail.com wrote: there must be a misunderstanding. i was referring to the zookeeper problem. On Mon, Aug 26, 2013 at 11:14 AM, Vivek Sembium vivek.semb...@gmail.com wrote: No. I added my files (it was just a copy of one of the example programs in a different package) to the jar files of giraph. But it was still giving me a ClassNotFoundException. Can you give me some simple example program with instructions on how to deploy it? Then I can start playing with Giraph, make changes to the program and learn, and then start working on my project in Giraph. I will be very thankful if you can help me with this. Thanking you -Vivek Sembium On Mon, Aug 26, 2013 at 2:37 PM, Claudio Martella claudio.marte...@gmail.com wrote: but you were still using an additional jar added through -libjars, right? On Mon, Aug 26, 2013 at 8:43 AM, Vivek Sembium vivek.semb...@gmail.com wrote: @Claudio Martella Your solution didn't work either. I basically tried copying the PageRankBenchmark to my own package and renamed the package. It compiles fine with giraph, but I couldn't run it even after adding those files to the giraph jar before deployment. Help? On Sun, Aug 25, 2013 at 6:33 PM, Claudio Martella claudio.marte...@gmail.com wrote: you have this problem when you use two jars (one with giraph and one with your classes) instead of a single fat-jar, correct? I tracked the same problem a few weeks ago; basically zookeeper is run passing the wrong jar. On Sat, Aug 24, 2013 at 4:51 PM, Vivek Sembium vivek.semb...@gmail.com wrote: Thank you for your suggestion. It worked; it's no longer giving a class-not-found exception. But it's giving me a new error: it's stopping at map 0% and reduce 0%.
Upon inspection I found that it's unable to connect to the zookeeper service.
java.lang.IllegalStateException: run: Caught an unrecoverable exception onlineZooKeeperServers: Failed to connect in 10 tries!
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.IllegalStateException: onlineZooKeeperServers: Failed to connect in 10 tries!
at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:727)
at org.apache.giraph.graph.GraphTaskManager.startZooKeeperManager(GraphTaskManager.java:371)
at org.apache.giraph.graph.GraphTaskManager.setup(GraphTaskManager.java:204)
at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:59)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:89)
... 7 more
Immediately afterwards I ran the page rank benchmark and it executed successfully, both from giraph in the lib directory and also from giraph's own directory. Can you give me a very simple java program (finding the maximum in a graph, or a simple page rank program) in giraph, along with its jar file and input files, which I can place in the lib directory of hadoop and test if it's working? And also the command to execute it. This should be added to the documentation, so newcomers can quickly set up giraph and concentrate on their project. On Sat, Aug 24, 2013 at 7:12 PM, Ahmet Emre Aladağ emre.ala...@agmlab.com wrote: It isn't asking for edge input. It says make sure you don't need it. A warning for the case you may have forgotten to give edge input when you really needed it.
The cause of your error is what I'm wondering about nowadays. I'm having a similar problem. Currently I'm using a workaround: put all the jars (giraph-core and my module giraph-nutch) in the lib folder of hadoop. Then it works. But there should be a clean way of doing this. I should be able to say hadoop jar fat.jar ... Any help appreciated. -- *From:* Vivek Sembium vivek.semb...@gmail.com *To:* user@giraph.apache.org *Sent:* Saturday, 24 August 2013 11:51:49 *Subject:* Re: Help needed for Running my own java programs in Giraph I tried with and without exporting the hadoop classpath; I get the same error. Here's the command that I tried: hadoop jar /mnt/a1/sda4/hadoop/giraph/giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-1.0.2-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -libjars /mnt/a99/d0/vivek/workspace/Giraph/bin/SimplePageRankComputation.jar
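The workaround discussed above (copying jars into Hadoop's lib folder, or fighting -libjars) can be sidestepped with the single fat jar everyone in this thread converges on, so that your classes, giraph-core, and its dependencies travel together and ZooKeeper is launched with the right jar. A sketch, assuming a Maven project configured with the shade or assembly plugin; the artifact and class names are placeholders:

```shell
mvn clean package
hadoop jar target/mygraph-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner org.example.MyComputation \
  -vif org.example.MyVertexInputFormat -vip /input \
  -of org.example.MyVertexOutputFormat -op /output -w 4
```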
Re: Help needed for Running my own java programs in Giraph
you mean by running zookeeper independently? On Mon, Aug 26, 2013 at 3:16 PM, Kyle Orlando kyle.r.orla...@gmail.comwrote: We were also experiencing similar problems when specifying -libjars as opposed to just using a fat jar. I believe we fixed it by setting the giraph.zkList property, but this only appears to work when we list one node as a zookeeper. On Mon, Aug 26, 2013 at 8:55 AM, Claudio Martella claudio.marte...@gmail.com wrote: OK, then I'm going to open an issue for that. On Mon, Aug 26, 2013 at 11:23 AM, Vivek Sembium vivek.semb...@gmail.comwrote: Yes for the zookeeper problem I passed a seperate jar through -libjars command. If I use additional jars zookeeper fails. On Aug 26, 2013 2:51 PM, Claudio Martella claudio.marte...@gmail.com wrote: there must be a misunderstanding. i was referring to the zookeeper problem. On Mon, Aug 26, 2013 at 11:14 AM, Vivek Sembium vivek.semb...@gmail.com wrote: No. I added my files(it was just a copy of one of the example program to a different package) to the jar files of giraph. But it was still giving me classNotFoundException. Can you give me some simple example program with instructions on how to deploy it. So I can start playing with giraph and make changes to the program and learn, then start working on my project in giraph. I will be very thankful if you can help me with this. Thanking you -Vivek Sembium On Mon, Aug 26, 2013 at 2:37 PM, Claudio Martella claudio.marte...@gmail.com wrote: but you were still using an additional jar added through -libjars, right? On Mon, Aug 26, 2013 at 8:43 AM, Vivek Sembium vivek.semb...@gmail.com wrote: @Claudio Martella Your solution didnt work either. I basically tried copying the pageRankBenchmark to my own package, renamed the package. It compiles fine with giraph. But I couldnt run it even if I add those files to giraph jar before deployment. Help? 
On Sun, Aug 25, 2013 at 6:33 PM, Claudio Martella claudio.marte...@gmail.com wrote: you have this problem when you use two jars (one with giraph and one with your classes) instead of a single fat-jar, correct? I tracked the same problem a few weeks ago, basically zookeeper is run passing the wrong jar. On Sat, Aug 24, 2013 at 4:51 PM, Vivek Sembium vivek.semb...@gmail.com wrote: Thank you for your suggestion. It worked. Its not giving class not found exception. But its giving me a new error Its stopping at map 0% and reduce 0%. Upon inspection I found that its unable to connect to zookeeper service. java.lang.IllegalStateException: run: Caught an unrecoverable exception onlineZooKeeperServers: Failed to connect in 10 tries! at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.Child.main(Child.java:249) Caused by: java.lang.IllegalStateException: onlineZooKeeperServers: Failed to connect in 10 tries! at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:727) at org.apache.giraph.graph.GraphTaskManager.startZooKeeperManager(GraphTaskManager.java:371) at org.apache.giraph.graph.GraphTaskManager.setup(GraphTaskManager.java:204) at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:59) at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:89) ... 7 more Immediately I ran page rank benchmark and it executed successfully both from giraph in lib directory and also from giraphs own directory. 
Can you give me a very simple java program(finding maximum in a graph or simple page rank program) in giraph along with its jar file and input files which I can place in my lib directory of hadoop and test if its working. And also the command to execute it. This should be added in the documentation as new comers can quickly setup giraph and concentrate on their project. On Sat, Aug 24, 2013 at 7:12 PM, Ahmet Emre Aladağ emre.ala...@agmlab.com wrote: It isn't asking for edge input. It says make sure you don't need it. A warning for the case you may have forgotten to give edge input when you really needed. The cause of your error is what I'm wondering nowadays. I'm having a similar problem. Currently I'm using a workaround: put all the jars (giraph-core and my module giraph-nutch) in the lib folder of hadoop. Then it works. But there should be a clean way of doing this. I should be able to say hadoop jar fat.jar ... Any help appreciated. -- *Kimden: *Vivek Sembium vivek.semb...@gmail.com *Kime: *user@giraph.apache.org *Gönderilenler
Re: Help needed for Running my own java programs in Giraph
yeah. i tracked the problem to what i mentioned earlier. ZK is run with the wrong jar when using -libjars. I have to figure out what's the expected behavior though, because the logic is kind of obscure in the code. On Mon, Aug 26, 2013 at 11:24 PM, Kyle Orlando kyle.r.orla...@gmail.com wrote: Yeah, exactly. We couldn't get it to work otherwise. On Mon, Aug 26, 2013 at 11:00 AM, Claudio Martella claudio.marte...@gmail.com wrote: you mean by running zookeeper independently? On Mon, Aug 26, 2013 at 3:16 PM, Kyle Orlando kyle.r.orla...@gmail.com wrote: We were also experiencing similar problems when specifying -libjars as opposed to just using a fat jar. I believe we fixed it by setting the giraph.zkList property, but this only appears to work when we list one node as a zookeeper. On Mon, Aug 26, 2013 at 8:55 AM, Claudio Martella claudio.marte...@gmail.com wrote: OK, then I'm going to open an issue for that. On Mon, Aug 26, 2013 at 11:23 AM, Vivek Sembium vivek.semb...@gmail.com wrote: Yes for the zookeeper problem I passed a separate jar through -libjars command. If I use additional jars zookeeper fails. On Aug 26, 2013 2:51 PM, Claudio Martella claudio.marte...@gmail.com wrote: there must be a misunderstanding. i was referring to the zookeeper problem. On Mon, Aug 26, 2013 at 11:14 AM, Vivek Sembium vivek.semb...@gmail.com wrote: No. I added my files (it was just a copy of one of the example program to a different package) to the jar files of giraph. But it was still giving me classNotFoundException. Can you give me some simple example program with instructions on how to deploy it. So I can start playing with giraph and make changes to the program and learn, then start working on my project in giraph. I will be very thankful if you can help me with this. Thanking you -Vivek Sembium On Mon, Aug 26, 2013 at 2:37 PM, Claudio Martella claudio.marte...@gmail.com wrote: but you were still using an additional jar added through -libjars, right?
On Mon, Aug 26, 2013 at 8:43 AM, Vivek Sembium vivek.semb...@gmail.com wrote: @Claudio Martella Your solution didn't work either. I basically tried copying the PageRankBenchmark to my own package and renamed the package. It compiles fine with giraph, but I couldn't run it even if I add those files to the giraph jar before deployment. Help? On Sun, Aug 25, 2013 at 6:33 PM, Claudio Martella claudio.marte...@gmail.com wrote: you have this problem when you use two jars (one with giraph and one with your classes) instead of a single fat jar, correct? I tracked the same problem a few weeks ago: basically, zookeeper is run passing the wrong jar. On Sat, Aug 24, 2013 at 4:51 PM, Vivek Sembium vivek.semb...@gmail.com wrote: Thank you for your suggestion. It worked: it's no longer giving the class not found exception. But it's giving me a new error: it's stopping at map 0% and reduce 0%. Upon inspection I found that it's unable to connect to the zookeeper service. java.lang.IllegalStateException: run: Caught an unrecoverable exception onlineZooKeeperServers: Failed to connect in 10 tries! at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.Child.main(Child.java:249) Caused by: java.lang.IllegalStateException: onlineZooKeeperServers: Failed to connect in 10 tries!
at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:727) at org.apache.giraph.graph.GraphTaskManager.startZooKeeperManager(GraphTaskManager.java:371) at org.apache.giraph.graph.GraphTaskManager.setup(GraphTaskManager.java:204) at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:59) at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:89) ... 7 more Immediately afterwards I ran the page rank benchmark and it executed successfully, both from giraph in the lib directory and from giraph's own directory. Can you give me a very simple java program (finding the maximum in a graph, or a simple page rank program) in Giraph, along with its jar file and input files, which I can place in the lib directory of hadoop and test whether it's working? And also the command to execute it. This should be added to the documentation, as newcomers could then quickly set up Giraph and concentrate on their project. On Sat, Aug 24, 2013 at 7:12 PM, Ahmet Emre Aladağ emre.ala...@agmlab.com wrote: It isn't asking for edge input. It says: make sure you don't need it. A warning for the case where you may have forgotten to give edge input when you really needed it. The cause of your error is what I'm
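For later readers: the workaround that emerges from this thread is to ship your classes in a single fat jar together with Giraph, rather than as a second jar via -libjars. A sketch of the invocation (the jar name, input/output paths, and computation class below are placeholders, not files from this thread):

```shell
# Build one jar that contains both Giraph and your own classes
# (e.g. with the maven-assembly or shade plugin), then run it directly.
# All names and paths here are placeholders.
hadoop jar my-app-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner my.pkg.MyComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /in/graph.txt \
  -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /out -w 2

# The problematic variant from the thread: a separate jar via -libjars,
# which at the time caused ZooKeeper to be launched with the wrong jar.
# hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
#   -libjars my-classes.jar my.pkg.MyComputation ...
```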
Re: How to utilize combiners
Hi Kyle, combiners are set by the user, as you recognized, and called automatically by the infrastructure at different moments in the path. Combined messages are passed transparently to the compute method (namely, fewer messages than a vertex would have received without a combiner). Have a look at the PageRank examples and benchmark code. Best, Claudio On Tue, Aug 20, 2013 at 8:51 PM, Kyle Orlando kyle.r.orla...@gmail.com wrote: Hey all, I was wondering if there was any example code I could look at that uses a combiner. Creating your own Combiner is easy enough, e.g. DoubleSumCombiner, but I am confused as to how/where I would use the classes in my code. For example, say I wanted to utilize the DoubleSumCombiner class to sum up all of the messages arriving at a particular vertex at the beginning of the superstep, and I wanted to do this for each vertex in the graph. Where should I instantiate a DoubleSumCombiner, when should I call the combine() and createInitialMessage() methods, etc. in the compute() method? What further confuses me is that I see that the MasterCompute class has methods for setCombiner() and getCombiner(), and that there is also a command line option -c to specify a Combiner. I'm not really sure if these are even necessary, but if they are, I don't know how these come into play either. Some clarification or direction towards an example would be nice! Thanks, -- Kyle Orlando Computer Engineering Major University of Maryland -- Claudio Martella claudio.marte...@gmail.com
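To make the mechanics concrete, here is a standalone sketch of what a sum combiner does conceptually. This is not the actual Giraph Combiner API (in the real thing you extend a combiner class such as DoubleSumCombiner and register it, e.g. via the -c option, as discussed above); it only illustrates the fold the infrastructure applies to pending messages before compute() is called, which is why you never invoke combine() yourself:

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch of a sum combiner's behavior. The combine/
// createInitialMessage pair mirrors what DoubleSumCombiner provides;
// combineAll mirrors what the framework does with them internally.
public class SumCombinerSketch {

    // Merge one incoming message into the running accumulator.
    public static double combine(double original, double message) {
        return original + message;
    }

    // The identity element the fold starts from.
    public static double createInitialMessage() {
        return 0.0;
    }

    // What the infrastructure effectively does: collapse all messages
    // addressed to one vertex into a single combined message.
    public static double combineAll(List<Double> messages) {
        double combined = createInitialMessage();
        for (double m : messages) {
            combined = combine(combined, m);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Double> pending = Arrays.asList(0.25, 0.5, 1.25);
        // The vertex's compute() would observe a single message: 2.0
        System.out.println(combineAll(pending));
    }
}
```

The point is that combineAll happens inside the framework: your compute() method simply sees fewer (here: one) incoming messages.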
Re: Dynamic Graphs
As I said, the injection of the new vertices/edges would have to be done manually, hence without any support from the infrastructure. I'd suggest you implement a WorkerContext class that supports the reading of a specific file with a specific format (under your control) from HDFS, and that is accessed by this particular special vertex (e.g. based on the vertex ID). Does this make sense? On Wed, Aug 21, 2013 at 2:13 PM, Marco Aurelio Barbosa Fagnani Lotz m.a.b.l...@stu12.qmul.ac.uk wrote: Dear Mr. Martella, Once the conditions for updating the vertex data base are met, what is the best way for the Injector Vertex to call an input reader again? I am able to access all the HDFS data, but I guess the vertex would need to have access to the input splits and also the vertex input format that I designate. Am I correct? Or is there a way one can just ask Zookeeper to create new splits and distribute them to the workers given a path in DFS? Best Regards, Marco Lotz -- From: Claudio Martella claudio.marte...@gmail.com Sent: 14 August 2013 15:25 To: user@giraph.apache.org Subject: Re: Dynamic Graphs Hi Marco, Giraph currently does not support that. One way of doing this would be by having a specific (pseudo-)vertex act as the injector of the new vertices and edges. For example, it would read a file from HDFS and call the mutation API during the computation, superstep after superstep. On Wed, Aug 14, 2013 at 3:02 PM, Marco Aurelio Barbosa Fagnani Lotz m.a.b.l...@stu12.qmul.ac.uk wrote: Hello all, I would like to know if there is any way to use dynamic graphs with Giraph. By dynamic one can read graphs that may change while Giraph is computing/deliberating. The changes are in the input file and are not caused by the graph computation itself. Is there any way to analyse this using Giraph? If not, does anyone have any idea/suggestion whether it is possible to modify the framework in order to process it?
Best Regards, Marco Lotz -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com
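As a rough illustration of the injector idea discussed above, here is a self-contained sketch of the parsing half only. It assumes an invented file format of superstep<TAB>source<TAB>target lines; the actual HDFS access would live in your WorkerContext, and the selected mutations would then be applied via the mutation API (e.g. addVertexRequest/addEdgeRequest) from the injector vertex's compute(). Those Giraph calls are omitted here:

```java
import java.util.ArrayList;
import java.util.List;

// Parsing side of an "injector" vertex for dynamic graphs. The file
// format (superstep \t src \t dst per line) is invented for this
// example; the real file, its location in HDFS, and the mutation
// calls are all under your control.
public class InjectorSketch {

    public static final class EdgeMutation {
        public final long src;
        public final long dst;
        public EdgeMutation(long src, long dst) {
            this.src = src;
            this.dst = dst;
        }
    }

    // Keep only the mutations scheduled for the given superstep.
    public static List<EdgeMutation> edgesForSuperstep(String fileContents,
                                                       long superstep) {
        List<EdgeMutation> result = new ArrayList<>();
        for (String line : fileContents.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split("\t");
            if (Long.parseLong(parts[0]) == superstep) {
                result.add(new EdgeMutation(Long.parseLong(parts[1]),
                                            Long.parseLong(parts[2])));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String file = "0\t1\t2\n1\t2\t3\n1\t3\t4\n";
        // Two edges are scheduled for injection at superstep 1.
        System.out.println(edgesForSuperstep(file, 1).size());
    }
}
```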
Re: Giraph vs good-old PVM/MPI ?
In principle you could implement Pregel through MPI (and it has been done). The idea behind Pregel was precisely to factor out typical patterns of graph processing that used to be based on message passing and barriers. A framework like Pregel/Giraph hides this complexity behind a well-defined API and programming pattern, leaving the user with only the application logic. How the rest is implemented under the hood is another story that the user does not have to worry about. On Tue, Aug 6, 2013 at 7:19 PM, Yang tedd...@gmail.com wrote: it seems that the paradigm offered by Giraph/Pregel is very similar to the programming paradigm of PVM, and to a lesser degree, MPI. Using PVM, we often engage in iterative cycles where all the nodes sync on a barrier and then enter the next cycle. So what are the extra features offered by Giraph/Pregel? I can see persistence/restarting of tasks, and maybe abstraction of the user-code-specific part into the API so that users are not concerned with the actual message passing (message passing is done by the framework). Thanks Yang -- Claudio Martella claudio.marte...@gmail.com
Re: Question regarding bin/giraph and bin/giraph-env
I think the giraph script is currently broken. I remember it not working last time I checked for a similar problem. On Thu, Aug 1, 2013 at 10:16 PM, Eli Reisman apache.mail...@gmail.com wrote: I'm not sure anyone has been running Giraph via the giraph scripts with Hbase input, maybe it's messed up. I think those messages are from a time when you could unpack the tar.gz build product in target/ somewhere else and run from that instead of passing the fat jar to the hadoop jar command yourself. On Mon, Jul 29, 2013 at 2:08 PM, Kyle Orlando kyle.r.orla...@gmail.com wrote: Hello, I am trying to use the giraph script in $GIRAPH_HOME/bin to run my giraph code. However, I cannot seem to get it to work: I keep getting: No lib directory, assuming dev environment No target directory. Build Giraph jar before proceeding. After looking at the code, I notice that it runs giraph-env. Within giraph-env, I see the following:

if [ -d $GIRAPH_HOME/lib ]; then
  for f in $GIRAPH_HOME/lib/*.jar; do
    CLASSPATH=${CLASSPATH}:$f
  done
  for f in $GIRAPH_HOME/giraph*.jar ; do
    if [ -e $f ]; then
      JAR=$f
      CLASSPATH=${CLASSPATH}:$f
      break
    fi
  done
else
  echo No lib directory, assuming dev environment
  if [ ! -d $GIRAPH_HOME/target ]; then
    echo No target directory. Build Giraph jar before proceeding.
    exit 1
  fi
  CLASSPATH2=`mvn dependency:build-classpath | grep -v [INFO]`
  CLASSPATH=$CLASSPATH:$CLASSPATH2
  for f in $GIRAPH_HOME/giraph/target/giraph*.jar; do
    if [ -e $f ]; then
      JAR=$f
      break
    fi
  done
fi

This worries me. To obtain my version of giraph, I simply cloned the git repository and used mvn -Phadoop_1.0 clean install -DskipTests in /usr/local/giraph to build everything. It appears that this script sets my GIRAPH_HOME as /usr/local/giraph, but I do not have a /usr/local/giraph/target directory. Instead, I have $GIRAPH_HOME/giraph-core/target, $GIRAPH_HOME/giraph-hbase/target, etc. Are these scripts out of date, or have I built my project incorrectly?
Thanks -- Kyle Orlando Computer Engineering Major University of Maryland -- Claudio Martella claudio.marte...@gmail.com
Re: How to retrieve and display the values aggregated by the aggregators?
Hi Kyle, good catch. ALWAYS should be set to 1. Want to write a patch to fix this? Try setting the property on the command line by putting -D giraph.textAggregatorWriter.frequency=-1 right after the GiraphRunner class. Hope this helps. Best, Claudio On Wed, Jul 24, 2013 at 10:31 PM, Kyle Orlando kyle.r.orla...@gmail.com wrote: Hi Claudio, So I checked out TextAggregatorWriter and was initially still a bit confused about how to use it to write to a text file. That's when I noticed that, in org.apache.giraph.utils.ConfigurationUtils, there is an option aw, which corresponds to an AggregatorWriterClass. I tried this out when running the SimplePageRankComputation program using my data as input by specifying this as an option: -aw org.apache.giraph.aggregators.TextAggregatorWriter. Here's the full command: hadoop jar /home/hduser/Documents/combined.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimplePageRankComputation -eif StackExchangeParsee.StackExchangeLongFloatTextEdgeInput -vif StackExchangeParsee.StackExchangeLongDoubleTextVertexValueInput -eip /in/gaming_edges.txt -vip /in/gaming_vertices.txt -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -aw org.apache.giraph.aggregators.TextAggregatorWriter -op /outPR -w 2 -mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute According to TextAggregatorWriter, it writes by default to a file called aggregatorValues. I checked my HDFS and did not see that particular file. That's when I noticed that there is a configuration option giraph.textAggregatorWriter.frequency, and that by default the frequency is set to NEVER, which means that nothing is ever created/written to a file for the aggregators. The other two frequencies are AT_THE_END and ALWAYS, which strangely both correspond to the same integer: -1. Could someone explain why this is so?
Ignoring the above uncertainty, I surmised that the property giraph.textAggregatorWriter.frequency was to be added to my giraph-site.xml. I wanted the AT_THE_END frequency, which corresponds to the value of -1. Here's the contents of my giraph-site.xml file:

<configuration>
  <property>
    <name>giraph.textAggregatorWriter.frequency</name>
    <value>-1</value>
  </property>
</configuration>

I ran the SimplePageRankComputation program again (using the verbose hadoop jar command above), and still, I couldn't find aggregatorValues on my HDFS. Could someone help me out, or at the very least rectify any misconceptions and uncertainties that I have? On Wed, Jul 24, 2013 at 12:25 PM, Claudio Martella claudio.marte...@gmail.com wrote: Hi Kyle, you can check out the AggregatorWriter interface which allows you to do that. As a matter of fact there is already a class that implements what you need (org.apache.giraph.aggregators.TextAggregatorWriter). Hope it helps. On Wed, Jul 24, 2013 at 5:19 PM, Kyle Orlando kyle.r.orla...@gmail.com wrote: Hello, I am new to Giraph and was just wondering how one could retrieve and display the global values/statistics that the aggregators keep track of. What classes and methods would I use, and would this be done in a class that extends VertexOutputFormat, or would it be done elsewhere? As an example, in the provided SimplePageRankComputation in org.apache.giraph.examples, there are three aggregators: sum, min, and max. I would like to display all of their final values (after the final superstep) in some way, such as writing them to a text file. -- Kyle Orlando Computer Engineering Major University of Maryland -- Claudio Martella claudio.marte...@gmail.com -- Kyle Orlando Computer Engineering Major University of Maryland -- Claudio Martella claudio.marte...@gmail.com
zookeeper not starting
Am I the only one who has recently been experiencing problems with zookeeper? I get the workers failing to connect to zookeeper; I presume it is not starting at all. I'm using trunk and hadoop 1.0.3. It used to work smoothly. -- Claudio Martella claudio.marte...@gmail.com
Re: Global factory for vertex IDs?
you can make use of a WorkerContext. There is one per worker, and you can put your factory there. The factory can make use of the Mapper.Context class from getContext(), and use the methods inherited from the TaskAttemptContext class (e.g. the unique task id) to get some form of worker id. Hope this helps. On Thu, Jul 4, 2013 at 8:18 AM, Christian Krause m...@ckrause.org wrote: Yes, that would be perfectly fine. How can I do this? Specifically, how do I get the ID of the worker? And can I then just use a counter field in my computation which I increase whenever I need a new ID? (So my global ID would be a pair of the worker ID and the number derived from incrementing the counter.) Cheers, Christian 2013/7/3 Avery Ching ach...@apache.org What are the requirements of your global ids? If they simply need to be unique, you can split the id space across workers and assign them incrementally. On 6/30/13 1:09 AM, Christian Krause wrote: Hi, I was wondering if there is a way to register a global factory for new vertex IDs. Currently, I have to come up with new IDs in my compute method, which does work, but with the penalty that the required memory for vertex IDs is unnecessarily high. If there was a global vertex ID factory I could just keep a global counter and increase it by one when I need a new ID. Is something like that possible, or does it conflict with the BSP computation model? The thing is, in the end vertex ID collisions are detected by Giraph, so why not also allow a global vertex ID factory... Cheers, Christian -- Claudio Martella claudio.marte...@gmail.com
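A minimal sketch of the suggested scheme, with the Giraph/Hadoop plumbing left out: assume the worker id has already been derived once per worker (e.g. from the task attempt id reachable through the WorkerContext), then pack (workerId, localCounter) into a single long so ids from different workers can never collide. The 20/44 bit split is an arbitrary choice for illustration:

```java
// Worker-local vertex-ID factory sketch. How workerId is obtained
// (e.g. via the WorkerContext's Mapper.Context / TaskAttemptContext)
// is omitted; only the collision-free packing scheme is shown.
public class VertexIdFactorySketch {

    private final long workerId;  // derived once per worker
    private long counter = 0;     // worker-local, incremented per new id

    public VertexIdFactorySketch(long workerId) {
        this.workerId = workerId;
    }

    // High 20 bits: worker id; low 44 bits: local counter.
    public long nextId() {
        return (workerId << 44) | (counter++);
    }

    public static void main(String[] args) {
        VertexIdFactorySketch w0 = new VertexIdFactorySketch(0);
        VertexIdFactorySketch w1 = new VertexIdFactorySketch(1);
        // Ids from different workers occupy disjoint ranges.
        System.out.println(w0.nextId());
        System.out.println(w1.nextId());
    }
}
```

With this approach each worker only needs one long counter, instead of coordinating a truly global counter across workers between supersteps.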
Re: Are new vertices active?
Hi, inline are my (tentative) answers. On Wed, Jun 26, 2013 at 6:34 PM, Christian Krause m...@ckrause.org wrote: Hi, if I create new vertices, will they be executed in the next superstep? And does it make a difference whether I create them using addVertexRequest() or sendMessage()? The vertex will be active. The case of sendMessage is intuitive, because a message wakes up a vertex. Another question: if I mutate the graph in superstep X and X is the last superstep, will the changes be executed? It is not clear to me whether the graph changes are executed during or before the next superstep. I'm actually not sure about our internal implementation (maybe somebody can shed light on this), but I'd expect them to be executed, given the above (presence of active vertices). And related to the last question, if I mutate the graph in superstep X, and I call getTotalNumVertices() in the next step, can I expect the updated number of vertices, or the number of vertices before the mutation? The mutations are applied at the end of a superstep and are visible in the following one. Hence in s+1 you'd see the new number of vertices. Sorry for these very basic questions, but I did not find any documentation on these details. If this is documented somewhere, it would be helpful to get a link. Cheers, Christian -- Claudio Martella claudio.marte...@gmail.com
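The timing described above can be modeled in a few lines. This toy class is not Giraph internals; it only illustrates the contract: mutations requested during superstep s are buffered and applied at the superstep barrier, so getTotalNumVertices() reflects them in s+1:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of mutation timing in a BSP superstep. Method names echo
// the Giraph API but this is an illustration, not the framework.
public class MutationTimingSketch {

    private final Set<Long> vertices = new HashSet<>();
    private final List<Long> pendingAdds = new ArrayList<>();

    public MutationTimingSketch(long... initial) {
        for (long v : initial) {
            vertices.add(v);
        }
    }

    // Called from compute() during a superstep: only buffers the request.
    public void addVertexRequest(long id) {
        pendingAdds.add(id);
    }

    // What getTotalNumVertices() would report in the current superstep.
    public int totalNumVertices() {
        return vertices.size();
    }

    // At the superstep barrier the buffered mutations are applied,
    // so they become visible in superstep s+1.
    public void superstepBarrier() {
        vertices.addAll(pendingAdds);
        pendingAdds.clear();
    }

    public static void main(String[] args) {
        MutationTimingSketch graph = new MutationTimingSketch(1, 2, 3);
        graph.addVertexRequest(4);                    // requested in s
        System.out.println(graph.totalNumVertices()); // still the old count
        graph.superstepBarrier();
        System.out.println(graph.totalNumVertices()); // updated in s+1
    }
}
```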
Re: SimpleShortestPathsComputation with Edge List input file
with the only problem that you picked an abstract class again... I advised you to use an input format that has the names of the types in the class name, hence org.apache.giraph.io.formats.IntNullTextEdgeInputFormat should work for you. On Mon, Jun 3, 2013 at 9:34 PM, Peter Holland d1...@mydit.ie wrote: Thank you for the advice Claudio. I updated the run command to use different io classes: bin/hadoop jar /home/ubuntu/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-1.0.2-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -eif org.apache.giraph.io.EdgeInputFormat -eip /simpleEdgeList/SimpleEdgeList.tsv -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /outShortestEdgeList01 -w 1 This code does start a MapReduce job but progress stays at 0%. The log file for the job has the following IOException error: MapAttempt TASK_TYPE=MAP TASKID=task_201306031954_0002_m_00 TASK_ATTEMPT_ID=attempt_201306031954_0002_m_00_0 TASK_STATUS=FAILED FINISH_TIME=1370282492527 HOSTNAME=ubuntu-VirtualBox ERROR=java.lang.Throwable: Child Error at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271) Caused by: java.io.IOException: Task process exit with nonzero status of 1. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258) So, this leaves 3 questions: Is the edge list file format correct? (a tab separated file with a .tsv extension) Is the input class correct? Is the output class correct? Thank you, Peter On 3 June 2013 01:05, Claudio Martella claudio.marte...@gmail.com wrote: Hi Peter, shortly, those are abstract classes; that's why you cannot instantiate them. You'll have to use a specific class extending those classes that is aware of the types of the signature of the vertex (I, V, E, M). Check out some classes in the format package that have those types in the class name.
On Mon, Jun 3, 2013 at 1:25 AM, Peter Holland d1...@mydit.ie wrote: Hello, I'm new to Giraph and I'm trying to run SimpleShortestPathsComputation using an edge list input file. I have some questions and an error message that hopefully I can get some help with. Edge List File Format: What is the correct format for an edge list input file? I have a .tsv file with a vertex represented as an integer. Is this correct? 5 11 1 6 6 9 6 8 8 9 . Input File Class: Is org.apache.giraph.io.formats.TextEdgeInputFormat the only input format that can be used for edge lists? Output File Class: Does the output format depend on the job you are running? I have been using org.apache.giraph.io.formats.TextVertexOutputFormat for SimpleShortestPathsComputation. Run Command: So this is the command I am using to try to run the SimpleShortestPathsComputation using an edge list input file: bin/hadoop jar /home/ubuntu/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-1.0.2-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -eif org.apache.giraph.io.formats.TextEdgeInputFormat -eip /simpleEdgeList/SimpleEdgeList.tsv -of org.apache.giraph.io.formats.TextVertexOutputFormat -op /outShortest -w 3 Error Message: When I run the above command I get the following error message: Exception in thread "main" java.lang.IllegalStateException: newInstance: Couldn't instantiate org.apache.giraph.io.formats.TextVertexOutputFormat Thank you, Peter -- Claudio Martella claudio.marte...@gmail.com -- Claudio Martella claudio.marte...@gmail.com
Re: SimpleShortestPathsComputation with Edge List input file
The reason is that the particular computation (SimpleShortestPathsComputation) is expecting vertices with Long ids, while the EdgeInputFormat is parsing Integers. You have to fix one of the two accordingly. On Mon, Jun 3, 2013 at 11:22 PM, Peter Holland d1...@mydit.ie wrote: Thank you for your response Claudio. I updated the command with the input class you suggested: bin/hadoop jar /home/ubuntu/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-1.0.2-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -eif org.apache.giraph.io.formats.IntNullTextEdgeInputFormat -eip /simpleEdgeList/SimpleEdgeList.tsv -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /outShortestEdgeList01 -w 1 Unfortunately I am getting an error message: 13/06/03 23:00:08 INFO utils.ConfigurationUtils: No vertex input format specified. Ensure your InputFormat does not require one. Exception in thread "main" java.lang.IllegalArgumentException: checkClassTypes: Vertex index types don't match, vertex - class org.apache.hadoop.io.LongWritable, edge input format - class org.apache.hadoop.io.IntWritable at org.apache.giraph.job.GiraphConfigurationValidator.verifyEdgeInputFormatGenericTypes(GiraphConfigurationValidator.java:266) at org.apache.giraph.job.GiraphConfigurationValidator.validateConfiguration(GiraphConfigurationValidator.java:125) at org.apache.giraph.utils.ConfigurationUtils.parseArgs(ConfigurationUtils.java:155) at org.apache.giraph.GiraphRunner.run(GiraphRunner.java:74) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.giraph.GiraphRunner.main(GiraphRunner.java:124) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) On 3 June 2013 21:00, Claudio Martella claudio.marte...@gmail.com wrote: with the only problem that you picked an abstract class again... I advised you to use an input format that has the names of the types in the class name, hence org.apache.giraph.io.formats.IntNullTextEdgeInputFormat should work for you. On Mon, Jun 3, 2013 at 9:34 PM, Peter Holland d1...@mydit.ie wrote: Thank you for the advice Claudio. I updated the run command to use different io classes: bin/hadoop jar /home/ubuntu/giraph/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-1.0.2-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsComputation -eif org.apache.giraph.io.EdgeInputFormat -eip /simpleEdgeList/SimpleEdgeList.tsv -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /outShortestEdgeList01 -w 1 This code does start a MapReduce job but progress stays at 0%. The log file for the job has the following IOException error: MapAttempt TASK_TYPE=MAP TASKID=task_201306031954_0002_m_00 TASK_ATTEMPT_ID=attempt_201306031954_0002_m_00_0 TASK_STATUS=FAILED FINISH_TIME=1370282492527 HOSTNAME=ubuntu-VirtualBox ERROR=java.lang.Throwable: Child Error at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271) Caused by: java.io.IOException: Task process exit with nonzero status of 1. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258) So, this leaves 3 questions: Is the edge list file format correct? (a tab separated file with a .tsv extension) Is the input class correct? Is the output class correct?
Thank you, Peter On 3 June 2013 01:05, Claudio Martella claudio.marte...@gmail.com wrote: Hi Peter, shortly, those are abstract classes; that's why you cannot instantiate them. You'll have to use a specific class extending those classes that is aware of the types of the signature of the vertex (I, V, E, M). Check out some classes in the format package that have those types in the class name. On Mon, Jun 3, 2013 at 1:25 AM, Peter Holland d1...@mydit.ie wrote: Hello, I'm new to Giraph and I'm trying to run SimpleShortestPathsComputation using an edge list input file. I have some questions and an error message that hopefully I can get some help with. Edge List File Format: What is the correct format for an edge list input file? I have a .tsv file with a vertex represented as an integer. Is this correct? 5 11 1 6 6 9 6 8 8 9 . Input File Class: Is org.apache.giraph.io.formats.TextEdgeInputFormat the only input format that can be used for edge lists? Output File Class: Does the output format depend on the job you are running? I have been using
Re: External Documentation about Giraph
Hi Yazan, I suggest you insert the tutorial with the user docs in the site/ directory (hence in the Users Docs menu). It is certainly where new users would look for it, and requires less navigation than the community wiki. Thanks! On Sun, Jun 2, 2013 at 7:36 AM, Yazan Boshmaf bosh...@ece.ubc.ca wrote: JIRA issue: https://issues.apache.org/jira/browse/GIRAPH-676 On Sat, Jun 1, 2013 at 10:12 PM, Yazan Boshmaf bosh...@ece.ubc.ca wrote: @Puneet, sure! I'll ping you once we have a draft ready. Thanks! Cheers, Yazan On Sat, Jun 1, 2013 at 9:58 PM, Puneet Agarwal puagar...@yahoo.com wrote: Dear Yazan, I don't know if you need this, still - I volunteer to review such documentation, from a novice user's perspective. I am a newbie on Giraph :) Cheers - Puneet - Original Message - From: Yazan Boshmaf bosh...@ece.ubc.ca To: Maria Stylianou mars...@gmail.com Cc: user@giraph.apache.org Sent: Sunday, June 2, 2013 9:33 AM Subject: Re: External Documentation about Giraph @Maria, this sounds great! I will start drafting one based on your posts + my own experience + know-how from user/dev emails that I have gathered. I will open a JIRA ticket and keep you in the loop. Once you're available, you can give the docs another pass to improve quality. I'm certain that experienced Giraph committers will also add their own input but let's at least get a first version ready. So take your time and good luck on your thesis presentation :) @Avery, should I update the Giraph mvn site and generate a patch (as in http://giraph.apache.org/build_site.html) or just update the community's Confluence wiki? On Sat, Jun 1, 2013 at 12:09 PM, Maria Stylianou mars...@gmail.com wrote: Yazan let's do it! But I'm afraid I will be super busy till 1st of July - day of thesis presentation. After that, I can dedicate more time. On Sat, Jun 1, 2013 at 5:29 AM, Avery Ching ach...@apache.org wrote: Improving our documentation is always very nice. Thanks for doing this you two!
On 5/31/13 7:32 PM, Yazan Boshmaf wrote: Maria, I can help you with this if you are interested and have the time. If you are busy, please let me know and I will update the site docs with a variant of your tutorial. Thanks! On Thu, May 30, 2013 at 4:13 PM, Roman Shaposhnik r...@apache.org wrote: On Wed, May 29, 2013 at 2:25 PM, Maria Stylianou mars...@gmail.com wrote: Hello guys, This semester I'm doing my master thesis using Giraph on a daily basis. In my blog (marsty5.wordpress.com) I wrote some posts about Giraph, some of the new users may find them useful! And maybe some of the experienced ones can give me feedback and correct any mistakes :D So far, I described: 1. How to set up Giraph 2. What to do next - after setting up Giraph 3. How to run ShortestPaths 4. How to run PageRank Good stuff! As a shameless plug, one more way to install Giraph is via Apache Bigtop. All it takes is hooking one of these files: http://bigtop01.cloudera.org:8080/view/Bigtop-trunk/job/Bigtop-trunk-Repository/label=fedora18/lastSuccessfulBuild/artifact/repo/bigtop.repo http://bigtop01.cloudera.org:8080/view/Bigtop-trunk/job/Bigtop-trunk-Repository/label=opensuse12/lastSuccessfulBuild/artifact/repo/bigtop.repo to your yum/apt system and typing: $ sudo yum install hadoop-conf-pseudo giraph In fact we're about to release Bigtop 0.6.0 with Hadoop 2.0.4.1 and Giraph 1.0 -- so anybody interested in helping us to test this stuff -- that would be really appreciated. Thanks, Roman. P.S. There are quite a few other platforms available as well: http://bigtop01.cloudera.org:8080/view/Bigtop-trunk/job/Bigtop-trunk-Repository/ -- Maria Stylianou Intern at Telefonica, Barcelona, Spain Master Student of European Master in Distributed Computing marsty5.wordpress.com -- Claudio Martella claudio.marte...@gmail.com
Re: External Documentation about Giraph
This is a good idea. One of the things we actually miss in the new documentation is a tutorial-like entry. Maria, it could be a nice contribution. On Thu, May 30, 2013 at 11:59 PM, Yazan Boshmaf bosh...@ece.ubc.ca wrote: Maria, the posts are very helpful. Thank you. Maybe you can update Giraph's site documentation with your tutorial? On Wed, May 29, 2013 at 2:25 PM, Maria Stylianou mars...@gmail.com wrote: Hello guys, This semester I'm doing my master thesis using Giraph on a daily basis. In my blog (marsty5.wordpress.com) I wrote some posts about Giraph, some of the new users may find them useful! And maybe some of the experienced ones can give me feedback and correct any mistakes :D So far, I described: 1. How to set up Giraph 2. What to do next - after setting up Giraph 3. How to run ShortestPaths 4. How to run PageRank Thank you! Enjoy reading! (hopefully ;p) -- Maria Stylianou Intern at Telefonica, Barcelona, Spain Master Student of European Master in Distributed Computing -- Claudio Martella claudio.marte...@gmail.com
Re: Modifying a benchmark to use real input
You can still use the classes in the examples package, which are similar to those in the benchmark package but are more flexible for your own tests. On Fri, May 24, 2013 at 3:42 PM, Matt Molek mpmo...@gmail.com wrote: Oh, never mind, I think I found it by looking through GiraphRunner.java: GiraphFileInputFormat.addVertexInputPath(conf, new Path("/some/path")); On Thu, May 23, 2013 at 5:22 PM, Matt Molek mpmo...@gmail.com wrote: Hi, I'm just getting started with Giraph, and struggling a bit to understand what exactly is needed to run a minimal Giraph computation on real data, rather than the PseudoRandomVertexInputFormat. Apologies if this is covered somewhere in the docs or mailing list archives. I looked but couldn't find anything applying to the current version, and I couldn't figure out exactly how things have changed through the versions. Some older code that I tried was clearly incompatible with the current version. Trying to learn by example, I copied the current o.a.g.benchmark.ShortestPathsBenchmark and o.a.g.benchmark.ShortestPathsComputation into my own project, and modified them to run on their own without GiraphBenchmark and BenchmarkOption. Here is the new ShortestPathsBenchmark I ended up with: http://pastebin.com/h3rH6jTm When using the PseudoRandomVertexInputFormat, and some hard coded options for aggregateVertices and edgesPerVertex, this runs fine from my jar with the command: hadoop jar giraph-testing-jar-with-dependencies.jar modified_benchmarks.ShortestPathsBenchmark --workers 10 Now I'd like to use JsonLongDoubleFloatDoubleVertexInputFormat with some real data, but I see no way to specify the input path. If this was plain hadoop, I'd expect to be able to say something like JsonLongDoubleFloatDoubleVertexInputFormat.addInputPath(job, new Path("/some/path")); That's not available though. Could someone point me in the right direction with this? Am I going about this all wrong? Thanks for any help, Matt -- Claudio Martella claudio.marte...@gmail.com
Re: Error compiling Giraph
) at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
Caused by: org.apache.maven.project.DependencyResolutionException: Could not resolve dependencies for project org.apache.giraph:giraph-examples:jar:1.0.0: Failure to find org.apache.giraph:giraph-core:jar:tests:1.0.0 in http://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced
at org.apache.maven.project.DefaultProjectDependenciesResolver.resolve(DefaultProjectDependenciesResolver.java:189)
at org.apache.maven.lifecycle.internal.LifecycleDependencyResolver.getDependencies(LifecycleDependencyResolver.java:185)
... 22 more
Caused by: org.sonatype.aether.resolution.DependencyResolutionException: Failure to find org.apache.giraph:giraph-core:jar:tests:1.0.0 in http://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced
at org.sonatype.aether.impl.internal.DefaultRepositorySystem.resolveDependencies(DefaultRepositorySystem.java:375)
at org.apache.maven.project.DefaultProjectDependenciesResolver.resolve(DefaultProjectDependenciesResolver.java:183)
... 23 more
Caused by: org.sonatype.aether.resolution.ArtifactResolutionException: Failure to find org.apache.giraph:giraph-core:jar:tests:1.0.0 in http://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced
at org.sonatype.aether.impl.internal.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:538)
at org.sonatype.aether.impl.internal.DefaultArtifactResolver.resolveArtifacts(DefaultArtifactResolver.java:216)
at org.sonatype.aether.impl.internal.DefaultRepositorySystem.resolveDependencies(DefaultRepositorySystem.java:358)
... 24 more
Caused by: org.sonatype.aether.transfer.ArtifactNotFoundException: Failure to find org.apache.giraph:giraph-core:jar:tests:1.0.0 in http://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced
at org.sonatype.aether.impl.internal.DefaultUpdateCheckManager.newException(DefaultUpdateCheckManager.java:230)
at org.sonatype.aether.impl.internal.DefaultUpdateCheckManager.checkArtifact(DefaultUpdateCheckManager.java:204)
at org.sonatype.aether.impl.internal.DefaultArtifactResolver.resolve(DefaultArtifactResolver.java:427)
... 26 more
[ERROR]
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <goals> -rf :giraph-examples

Why is there a dependency problem? I have not added any code; it's the clean Giraph distribution. Is this Maven's problem? Or the fact that I did not get Giraph directly from the repo, but from the HTTP mirror directly?
Thanks in advance, Alexandros -- Claudio Martella claudio.marte...@gmail.com
Re: Extra data on vertex
Keep in mind that you cannot access a neighbor's value directly from a vertex. What you are proposing now is possible because you are using the vertex id to store your information (the URL), which makes sense in the context of a web page. As soon as you store data in the vertex value, as Avery suggests, you will have to rely on messages to inform the neighbors of the value.

On Tue, May 7, 2013 at 4:47 PM, Ahmet Emre Aladağ emre.ala...@agmlab.com wrote:

Hi,

1) What's the best way of storing extra data (such as a URL) on a vertex? I thought this would be through a class variable, but I could not find a way to access that variable from a neighbor. For example, I'd like to remove the duplicate edges going towards nodes with the same URL (the duplicate-removal phase of LinkRank). How can I learn my neighbor's url variable, targetUrl?

2) Is removing edges like this a valid approach?

public class LinkRankVertex extends Vertex<IntWritable, FloatWritable, NullWritable, FloatWritable> {

  public String url;

  public void removeDuplicateLinks() {
    int targetId;
    String targetUrl;
    Set<String> urls = new HashSet<String>();
    ArrayList<Edge<IntWritable, NullWritable>> edges = new ArrayList<Edge<IntWritable, NullWritable>>();
    for (Edge<IntWritable, NullWritable> edge : getEdges()) {
      targetId = edge.getTargetVertexId().get();
      targetUrl = ...??
      if (!urls.contains(targetUrl)) {
        urls.add(targetUrl);
        edges.add(edge);
      }
    }
    setEdges(edges);
  }
}

Thanks, Emre.

-- Claudio Martella claudio.marte...@gmail.com
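[Editor's note] The dedup logic Emre is after is simply "keep the first edge seen for each target URL", which a HashSet expresses directly. A self-contained sketch over a plain stand-in edge type (Giraph's Vertex/Edge types are replaced with a simple class here; obtaining targetUrl in a real computation would still require messaging, as noted above):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Self-contained sketch of the duplicate-removal idea: keep only the
// first edge seen per target URL. Edge is a stand-in for Giraph's Edge;
// in a real Giraph computation, targetUrl would arrive via messages.
public class DedupSketch {
    static class Edge {
        final int targetId;
        final String targetUrl;
        Edge(int targetId, String targetUrl) {
            this.targetId = targetId;
            this.targetUrl = targetUrl;
        }
    }

    static List<Edge> removeDuplicateLinks(List<Edge> in) {
        Set<String> seen = new HashSet<>();
        List<Edge> out = new ArrayList<>();
        for (Edge e : in) {
            if (seen.add(e.targetUrl)) {  // add() returns false for duplicates
                out.add(e);
            }
        }
        return out;
    }
}
```

Note that Set.add already reports whether the element was new, so the separate contains check in the original snippet can be folded into a single call.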
Google Summer of Code 2013 Giraph + Tinkerpop project
Hello lists, we have added an issue to the Giraph JIRA that we would like to have as a GSoC 2013 project. The idea is to integrate Tinkerpop Blueprints/Rexster as an input format for Giraph, to run batch computations on data stored in Blueprints-compliant graph databases. Please consider advertising this issue to potential students or people interested in this project. The related issue can be found here: https://issues.apache.org/jira/browse/GIRAPH-549 An entry in the Giraph Wiki will be added soon. Best, Claudio -- Claudio Martella claudio.marte...@gmail.com