Request for information on Giraph custom Partitioner using external service

2018-07-10 Thread Neha Raj
Hi,

I am working on graph partitioning algorithms and have chosen Giraph as
the graph processing system to run graph problems on; I am very new to both. I
would like to provide external partitioning information (in the form of a txt
file) to Giraph. For this I have created a custom partitioner (something like
HashPartitionerFactory), which reads the partition IDs for the graph from the external file.

While debugging I realized that this partition logic is invoked several times
(during the Giraph supersteps), and reading the same external file multiple
times is not time-efficient. To handle this I want to create a
global (across the distributed system) Map variable that holds {vertex ID,
partition ID} key-value pairs, and populate this variable
from the external file once per Giraph job run. I have tried several
ways to create and initialize such a global variable, but whether it
actually gets populated during a Giraph job is non-deterministic (i.e.
sometimes the map is populated with values, sometimes not).

I think there might be some issue in how I am creating the Map variable and
initializing it before my custom partitioning logic calls it.
Can somebody please point me to the correct place to plug this piece of
information into a Giraph job, and possibly the correct way of creating a
global variable with respect to Giraph's distributed processing?
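A common way to avoid re-reading the file on every partitioner invocation is to cache the mapping in a lazily initialized static structure that is loaded at most once per worker JVM. The sketch below is plain Java with no Giraph dependency; the class name, file format ("vertexId partitionId" per line), and default partition are all invented for illustration. In a real job the file path would come from the job configuration and the file would likely live on HDFS or in the distributed cache.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical holder: loads the external vertex->partition file once per
// JVM, no matter how many times the partition logic consults it.
public class PartitionMapHolder {
    private static volatile ConcurrentHashMap<Long, Integer> map;

    public static int partitionFor(long vertexId, Path file) throws IOException {
        if (map == null) {
            synchronized (PartitionMapHolder.class) {
                if (map == null) {                       // double-checked lazy init
                    ConcurrentHashMap<Long, Integer> loaded = new ConcurrentHashMap<>();
                    for (String line : Files.readAllLines(file)) {
                        String[] parts = line.trim().split("\\s+");
                        if (parts.length == 2) {
                            loaded.put(Long.parseLong(parts[0]),
                                       Integer.parseInt(parts[1]));
                        }
                    }
                    map = loaded;                        // publish fully built map
                }
            }
        }
        return map.getOrDefault(vertexId, 0);            // fall back to partition 0
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("partitions", ".txt");
        Files.write(f, List.of("1 0", "2 1", "3 1"));
        System.out.println(partitionFor(2L, f)); // prints 1
        System.out.println(partitionFor(9L, f)); // unknown vertex -> prints 0
    }
}
```

The key point is that every worker loads its own copy of the map exactly once; there is no truly global, cluster-wide variable, which may explain the non-deterministic behavior described above if initialization was expected to happen on one node only.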

Thanks & Regards,
Neha


Information

2014-03-26 Thread Angelo Immediata
Hi there

In my project I have to implement a routing system with good performance.
At the beginning this system should be able to give route information
only for one Italian region (Lombardia), but it could later be used for the
whole of Italy (or the world).
Let's stick to Lombardia for now. By reading OSM files I can create my
own graph in whatever format best suits me; then I need to use Dijkstra (or
any other algorithm) in order to propose to the user K possible paths from
point A to point B (K because I also need to show the customer the
alternatives). I can't use the Contraction Hierarchies algorithm because I
need to take into account external events that can modify the weights of my
built graph, which would imply rebuilding the contracted graph each time,
and that can be a very onerous operation.

From my experiments, I saw that by reading the Lombardia OSM file I
would create a graph with around 1 million vertices and 6 million
edges, and I was thinking of using Giraph to solve my problem (I saw this
link http://giraph.apache.org/intro.html where you talk about the
shortest-paths problem).
I have a couple of questions for you Giraph/Hadoop gurus:

   - Does it make sense to use Giraph for my scenario?
   - Must I respect some graph format to pass to the Giraph algorithm in
   order to get K shortest paths from point A to point B? If so, which
   format should I respect?
   - What would performance be like with Giraph? I know that Dijkstra's
   problem is that it is slow; by using Giraph, will I be able to improve
   its performance on a very large graph?

I know these may seem very basic questions, but I'm pretty new to Giraph
and I'm trying to understand it.
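For context on what Giraph's shortest-paths example actually does: it follows the vertex-centric BSP model, where in each superstep a vertex takes the minimum of its incoming distance messages and, if its value improves, sends updated distances to its neighbours. The following is a toy single-machine simulation of that model in plain Java; the graph, weights, and class name are invented for illustration, and real Giraph runs the per-vertex logic distributed and in parallel.

```java
import java.util.HashMap;
import java.util.Map;

// Toy simulation of the Pregel/BSP single-source shortest paths computation.
// adj maps a vertex to an array of {neighbour, edgeWeight} pairs.
public class SsspSketch {
    public static Map<Integer, Double> sssp(Map<Integer, int[][]> adj, int source) {
        Map<Integer, Double> dist = new HashMap<>();
        for (int v : adj.keySet()) dist.put(v, Double.POSITIVE_INFINITY);
        Map<Integer, Double> inbox = new HashMap<>();
        inbox.put(source, 0.0);                      // source starts at distance 0
        while (!inbox.isEmpty()) {                   // one iteration = one superstep
            Map<Integer, Double> next = new HashMap<>();
            for (Map.Entry<Integer, Double> msg : inbox.entrySet()) {
                int v = msg.getKey();
                double d = msg.getValue();
                if (d < dist.get(v)) {               // vertex value improves...
                    dist.put(v, d);
                    for (int[] e : adj.get(v))       // ...so message the neighbours
                        next.merge(e[0], d + e[1], Math::min);
                }
            }
            inbox = next;                            // halt when no messages remain
        }
        return dist;
    }

    public static void main(String[] args) {
        Map<Integer, int[][]> adj = new HashMap<>();
        adj.put(0, new int[][]{{1, 4}, {2, 1}});
        adj.put(1, new int[][]{{3, 1}});
        adj.put(2, new int[][]{{1, 2}});
        adj.put(3, new int[][]{});
        System.out.println(sssp(adj, 0)); // 0 -> 3 via 2 then 1 costs 1+2+1 = 4
    }
}
```

Note this computes distances from one source to all vertices; producing K alternative paths between two specific points is a different problem and is not what the bundled example solves.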

Thank you
Angelo


Re: Information

2014-03-26 Thread Sebastian Schelter
For such a small graph, using a single-machine graph processing system
makes more sense IMHO. It should be faster and easier to program. Google for
Cassovary.



Re: Information

2014-03-26 Thread Angelo Immediata
hi Sebastian

thnx for the answer

I was taking a look at Cassovary, but I'm lost on how to integrate it with
neo4j and/or how to calculate paths with it. Do you have any tips?

Angelo






Re: Information

2014-03-26 Thread Claudio Martella
It looks like you're expecting to use Giraph in an online fashion, such as
you would use a database to answer queries within milliseconds or seconds.
Giraph is an offline batch processing system.






-- 
   Claudio Martella


Re: Information

2014-03-26 Thread Angelo Immediata
hi Claudio

so, if I understood correctly, it makes no sense to use Giraph for
shortest-path calculation in my scenario.

Am I right?






Re: Information

2014-03-26 Thread Sebastian Schelter

Hi Angelo,

It very much depends on your use case. Do you want to precompute paths 
offline in batch or are you looking for a system that answers online? 
Giraph has been built for the first scenario.


--sebastian









Re: Information

2014-03-26 Thread Angelo Immediata
hi Sebastian

OK... I got it. I was thinking I could use it for an online scenario.
Thank you

Angelo









Re: Information

2014-03-26 Thread Claudio Martella
Nope; you can think of Giraph as MapReduce for graphs. Probably neo4j 
C is the way to go for you.










-- 
   Claudio Martella


Re: What a worker really is and other interesting runtime information

2012-11-29 Thread Alexandros Daglis
Ok, so I added the partitions flag, going with

 hadoop jar target/giraph-0.1-jar-with-dependencies.jar
org.apache.giraph.examples.SimpleShortestPathsVertex
-Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=12
-Dhash.userPartitionCount=12 input output 12 1

but still I got no overall speedup at all (compared to using 1 thread) and
only 1 out of 12 cores is utilized at most times. Isn't Giraph supposed to
exploit parallelism to get some speedup? Any other suggestion?
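Avery's point later in this thread, that each compute thread works on one partition at a time, suggests an explanation: if a job effectively has only one partition (or one runnable partition at a time), a 12-thread pool still does all the work on a single thread, matching the one busy core observed above. Here is a toy plain-Java model of that constraint; the workload, sleep time, and class name are invented for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model: one task = one partition. The number of partitions caps the
// number of threads that can ever be busy simultaneously.
public class PartitionParallelism {
    static Set<String> threadsUsed(int partitions, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Set<String> used = ConcurrentHashMap.newKeySet();
        List<Future<?>> futures = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            futures.add(pool.submit(() -> {
                used.add(Thread.currentThread().getName()); // record who worked
                try { Thread.sleep(50); } catch (InterruptedException ignored) {}
            }));
        }
        for (Future<?> f : futures) f.get();                // wait for the superstep
        pool.shutdown();
        return used;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(threadsUsed(1, 12).size());  // 1 partition -> prints 1
        System.out.println(threadsUsed(12, 12).size()); // 12 partitions can use up to 12
    }
}
```

This is why `-Dhash.userPartitionCount` needs to be at least as large as `-Dgiraph.numComputeThreads` for the extra threads to have anything to do.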

Thanks,
Alexandros

On 29 November 2012 00:20, Avery Ching ach...@apache.org wrote:

  Oh, forgot one thing.  You need to set the number of partitions to use,
 since each thread works on a single partition at a time.

 Try -Dhash.userPartitionCount=number of threads


 On 11/28/12 5:29 AM, Alexandros Daglis wrote:

 Dear Avery,

 I followed your advice, but the application seems to be totally
 thread-count-insensitive: I literally observe zero scaling of performance,
 while I increase the thread count. Maybe you can point out if I am doing
 something wrong.

 - Using only 4 cores on a single node at the moment
 - Input graph: 14 million vertices, file size is 470 MB
 - Running SSSP as follows: hadoop jar
 target/giraph-0.1-jar-with-dependencies.jar
 org.apache.giraph.examples.SimpleShortestPathsVertex
 -Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=X input output
 12 1
 where X=1,2,3,12,30
 - I notice a total insensitivity to the number of threads I specify.
 Aggregate core utilization is always approximately the same (usually around
 25-30% = only one of the cores running) and overall execution time is
 always the same (~8 mins)

 Why is Giraph's performance not scaling? Is the input size / number of
 workers inappropriate? It's not an IO issue either, because even during
 really low core utilization, time is wasted on idle, not on IO.

 Cheers,
 Alexandros



 On 28 November 2012 11:13, Alexandros Daglis alexandros.dag...@epfl.chwrote:

 Thank you Avery, that helped a lot!

 Regards,
 Alexandros


  On 27 November 2012 20:57, Avery Ching ach...@apache.org wrote:

 Hi Alexandros,

 The extra task is for the master process (a coordination task). In your
 case, since you are using a single machine, you can use a single task.

 -Dgiraph.SplitMasterWorker=false

 and you can try multithreading instead of multiple workers.

 -Dgiraph.numComputeThreads=12

 The reason why cpu usage increases is due to netty threads to handle
 network requests.  By using multithreading instead, you should bypass this.

 Avery


 On 11/27/12 9:40 AM, Alexandros Daglis wrote:

 Hello everybody,

 I went through most of the documentation I could find for Giraph and
 also most of the messages in this email list, but still I have not figured
 out precisely what a worker really is. I would really appreciate it if
 you could help me understand how the framework works.

 At first I thought that a worker has a one-to-one correspondence to a
 map task. Apparently this is not exactly the case, since I have noticed
 that if I ask for x workers, the job finishes after having used x+1 map
 tasks. What is this extra task for?

 I have been trying out the example SSSP application on a single node
 with 12 cores. Giving an input graph of ~400MB and using 1 worker, around
 10 GBs of memory are used during execution. What intrigues me is that if I
 use 2 workers for the same input (and without limiting memory per map
 task), double the memory will be used. Furthermore, there will be no
 improvement in performance. I rather notice a slowdown. Are these
 observations normal?

 Might it be the case that 1 and 2 workers are very few and I should go
 to the 30-100 range that is the proposed number of mappers for a
 conventional MapReduce job?

 Finally, a last observation. Even though I use only 1 worker, I see
 that there are significant periods during execution where up to 90% of the
 12 cores computing power is consumed, that is, almost 10 cores are used in
 parallel. Does each worker spawn multiple threads and dynamically balances
 the load to utilize the available hardware?

 Thanks a lot in advance!

 Best,
 Alexandros









RE: What a worker really is and other interesting runtime information

2012-11-29 Thread Magyar, Bence (US SSA)
Folks,

I have some of the same questions as Alexandros below.  What exactly is a 
worker?  I am not sure I understood Avery's answer below.  I have a 4-node 
cluster.  Each node has 24 cores.  My first node is functioning (in MapReduce 
parlance) as both a job tracker and a task tracker.  So I have 4 
compute nodes.  (I have verified that the master/slave config is correct.)  I am 
launching the Giraph SimpleShortestPathsVertex example on an input graph with 
approximately 140,000 nodes / 410,000 edges, and the computation is taking 
approx. 6 minutes.  Although I don't know what a good number is, 6 minutes 
seems rather slow given all the compute horsepower I have at my disposal.  
When I monitor top on my machines while the compute is running, my cores are 
~80-90% idle.

I am launching my job with the following parameters:

./giraph -Dgiraph.useSuperstepCounters=false 
-DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar 
org.apache.giraph.examples.SimpleShortestPathsVertex -if 
org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip 
/user/hduser/in -of 
org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op 
/user/hduser/out -w 3

Note that I have my number of workers (-w =3).  Should this be some other 
value?  Does anyone have any simple configuration suggestions that will help me 
tune Giraph to my problem?

Thanks!

Bence


Re: What a worker really is and other interesting runtime information

2012-11-29 Thread Alexandros Daglis
Hello Bence,

So, you have 96 cores at your disposal. My guess would be that 3 workers
are not enough to use all of them; you should either try with a lot more,
or try to multithread them as Avery said (thus, try 4 workers with 24
threads each). However, as I already reported, I tried this myself and I
didn't notice any improvement in performance. Scaling with the
number of worker threads is fundamental for a BSP framework, so that's
really weird; I guess I must be doing something wrong.

Could you please try increasing your workers and tell me if you also see
no improvement in performance? Furthermore, have you tried running
with both 1 and 3 workers? Did you see any difference there?

I really want to sort this scalability issue out...

On a final note, please don't use the word node to describe a lot of
different things, it got me quite confused :-) Your cluster's *nodes* have
*cores*, and your input is a graph with 140k *vertices*.

Cheers,
Alexandros

On 29 November 2012 14:02, Magyar, Bence (US SSA) 
bence.mag...@baesystems.com wrote:

  Folks, 

 ** **

 I have some of the same questions as Alexandros below.  What is exactly is
 “a worker”?  I am not sure I understood Avery’s answer below.  I have
 4-node cluster.  Each node has 24 nodes.  My first node is functioning (in
 MapReduce parlance) as both a “job tracker” as well as a “task tracker”.
 So I have 4 compute nodes.  (I have verified that master/slave config is
 correct).  I am launching the Giraph SimpleShortestPathsVertex example on
 an input graph with approximately 140,000 nodes/ 410,000 edges and the
 computation is taking approx. 6 minutes.  Although I don’t know what a
 “good” number is, 6 minutes seems rather “slow” given all the compute
 horsepower I have at my disposal.  When I monitor “top” on my machines
 while the compute is running, my cores are ~ 80-90% idle.

 ** **

 I am launching my job with the following parameters:

 ** **

 ./giraph -Dgiraph.useSuperstepCounters=false
 -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar
 org.apache.giraph.examples.SimpleShortestPathsVertex -if
 org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip
 /user/hduser/in -of
 org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op
 /user/hduser/out -w 3

 ** **

 Note that I have my number of workers (–w =3).  Should this be some other
 value?  Does anyone have any simple configuration suggestions that will
 help me tune Giraph to my problem?


 Thanks! 


 Bence


 *From:* Alexandros Daglis [mailto:alexandros.dag...@epfl.ch]
 *Sent:* Thursday, November 29, 2012 6:19 AM
 *To:* user@giraph.apache.org
 *Subject:* Re: What a worker really is and other interesting runtime
 information


 Ok, so I added the partitions flag, going with

  hadoop jar target/giraph-0.1-jar-with-dependencies.jar
 org.apache.giraph.examples.SimpleShortestPathsVertex
 -Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=12
 -Dhash.userPartitionCount=12 input output 12 1

 but still I got no overall speedup at all (compared to using 1 thread) and
 only 1 out of 12 cores is utilized at most times. Isn't Giraph supposed to
 exploit parallelism to get some speedup? Any other suggestion?

 Thanks,
 Alexandros

 On 29 November 2012 00:20, Avery Ching ach...@apache.org wrote:

 Oh, forgot one thing.  You need to set the number of partitions to use, since
 each thread works on a single partition at a time.

 Try -Dhash.userPartitionCount=number of threads
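[Editorial sketch] Avery's two flags can be tied together so they never drift apart. This is a sketch only: the jar name, example class, and the trailing `input output 12 1` arguments are copied from Alexandros's command earlier in the thread, and the snippet prints the launch line rather than running it.

```shell
# Sketch: keep hash.userPartitionCount in lock-step with the
# compute-thread count, per Avery's suggestion. Printed, not executed.
THREADS=12
CMD="hadoop jar target/giraph-0.1-jar-with-dependencies.jar \
org.apache.giraph.examples.SimpleShortestPathsVertex \
-Dgiraph.SplitMasterWorker=false \
-Dgiraph.numComputeThreads=$THREADS \
-Dhash.userPartitionCount=$THREADS \
input output 12 1"
echo "$CMD"
```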



 On 11/28/12 5:29 AM, Alexandros Daglis wrote:

 Dear Avery,

 I followed your advice, but the application seems to be totally
 thread-count-insensitive: I literally observe zero scaling of performance,
 while I increase the thread count. Maybe you can point out if I am doing
 something wrong.

 - Using only 4 cores on a single node at the moment
 - Input graph: 14 million vertices, file size is 470 MB
 - Running SSSP as follows: hadoop jar
 target/giraph-0.1-jar-with-dependencies.jar
 org.apache.giraph.examples.SimpleShortestPathsVertex
 -Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=X input output
 12 1
 where X=1,2,3,12,30
 - I notice a total insensitivity to the number of threads I specify.
 Aggregate core utilization is always approximately the same (usually around
 25-30% = only one of the cores running) and overall execution time is
 always the same (~8 mins)

 Why is Giraph's performance not scaling? Is the input size / number of
 workers inappropriate? It's not an IO issue either, because even during
 really low core utilization, time is wasted on idle, not on IO.

 Cheers,
 Alexandros



 On 28 November 2012 11:13, Alexandros Daglis alexandros.dag...@epfl.ch
 wrote:

 Thank you Avery, that helped a lot!

 Regards,
 Alexandros 

 ** **

 On 27 November 2012 20:57, Avery Ching ach...@apache.org wrote:

 Hi Alexandros,

 The extra task is for the master process (a coordination task).

RE: What a worker really is and other interesting runtime information

2012-11-29 Thread Magyar, Bence (US SSA)
Hi Alexandros,

I  increased my number of workers to 30, but my job just hangs at 3%:

./giraph -Dgiraph.useSuperstepCounters=false 
-DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar 
org.apache.giraph.examples.SimpleShortestPathsVertex -if 
org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip 
/user/hduser/insight -of 
org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op 
/user/hduser/insight-out-30 -w 30


No lib directory, assuming dev environment
No HADOOP_CONF_DIR set, using /opt/hadoop-1.0.3/conf
12/11/29 14:31:14 INFO mapred.JobClient: Running job: job_201211282132_0002
12/11/29 14:31:15 INFO mapred.JobClient:  map 0% reduce 0%
12/11/29 14:31:34 INFO mapred.JobClient:  map 3% reduce 0%
12/11/29 14:36:39 INFO mapred.JobClient: Job complete: job_201211282132_0002
12/11/29 14:36:39 INFO mapred.JobClient: Counters: 5
12/11/29 14:36:39 INFO mapred.JobClient:   Job Counters
12/11/29 14:36:39 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=2451427
12/11/29 14:36:39 INFO mapred.JobClient: Total time spent by all reduces 
waiting after reserving slots (ms)=0
12/11/29 14:36:39 INFO mapred.JobClient: Total time spent by all maps 
waiting after reserving slots (ms)=0
12/11/29 14:36:39 INFO mapred.JobClient: Launched map tasks=8
12/11/29 14:36:39 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=3725

Looking at the logs on my master machine:

/opt/hadoop-1.0.3/logs/userlogs/job_201211282132_0002/attempt_201211282132_0002_m_00_0
 $

I see:

2012-11-29 14:32:00,732 INFO org.apache.giraph.graph.BspServiceMaster: 
checkWorkers: Only found 7 responses of 30 needed to start superstep -1.
Sleeping for 30000 msecs and used 0 of 10 attempts.
2012-11-29 14:32:30,766 INFO org.apache.giraph.graph.BspServiceMaster:
checkWorkers: Only found 7 responses of 30 needed to start superstep -1.
Sleeping for 30000 msecs and used 1 of 10 attempts.
2012-11-29 14:33:00,807 INFO org.apache.giraph.graph.BspServiceMaster:
checkWorkers: Only found 7 responses of 30 needed to start superstep -1.
Sleeping for 30000 msecs and used 2 of 10 attempts.

This repeats until: Sleeping for 30000 msecs and used 8 of 10 attempts.
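[Editorial sketch] A hedged reading of the log above, not an authoritative diagnosis: the job counters show `Launched map tasks=8`, one of those tasks is the master (with the default `giraph.SplitMasterWorker=true`), so only 7 workers can ever register. A request of `-w 30` can therefore never be satisfied, and the master gives up after its 10 attempts.

```shell
# Sketch: why "-w 30" hangs on this cluster, derived from the counters
# in the log above (Launched map tasks=8).
LAUNCHED_MAP_TASKS=8
MASTER_TASKS=1                                 # default SplitMasterWorker=true
AVAILABLE_WORKERS=$((LAUNCHED_MAP_TASKS - MASTER_TASKS))
REQUESTED_WORKERS=30
echo "only $AVAILABLE_WORKERS of $REQUESTED_WORKERS requested workers can register"
```

This matches the "Only found 7 responses of 30 needed" lines exactly; the practical fix would be `-w 7` or fewer, or raising the cluster's map-slot count.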

Can anyone please provide guidance on what the number of workers should be set
to, given some general characteristics/topology of a cluster?  (Mine are given
in the email chain below.)

-Bence

From: Alexandros Daglis [mailto:alexandros.dag...@epfl.ch]
Sent: Thursday, November 29, 2012 12:30 PM
To: user@giraph.apache.org
Subject: Re: What a worker really is and other interesting runtime information

Hello Bence,

So, you have 96 cores at your disposal. My guess would be that 3 workers are 
not enough to use all of them, you should either try with a lot more, or try to 
multithread them as Avery said (thus, try 4 workers with 24 threads each). 
However, as I already reported, I tried this myself and I didn't notice any 
improvement in performance. I guess scaling with the number of worker threads 
is fundamental for a BSP framework so that's really weird, I guess I must be 
doing something wrong.

Could you please try increasing your workers and tell me if you also don't 
notice any improvement in performance? Furthermore, have you tried running with 
both 1 and 3 workers? Did you see any difference there?

I really want to sort this scalability issue out...

On a final note, please don't use the word node for describing a lot of 
different things, it got me quite confused :-) Your cluster's nodes have cores 
and your input is a graph with 140k vertices.

Cheers,
Alexandros

Re: What a worker really is and other interesting runtime information

2012-11-28 Thread Avery Ching
Oh, forgot one thing.  You need to set the number of partitions to use, since
each thread works on a single partition at a time.


Try -Dhash.userPartitionCount=number of threads

On 11/28/12 5:29 AM, Alexandros Daglis wrote:

Dear Avery,

I followed your advice, but the application seems to be totally 
thread-count-insensitive: I literally observe zero scaling of 
performance, while I increase the thread count. Maybe you can point 
out if I am doing something wrong.


- Using only 4 cores on a single node at the moment
- Input graph: 14 million vertices, file size is 470 MB
- Running SSSP as follows: hadoop jar 
target/giraph-0.1-jar-with-dependencies.jar 
org.apache.giraph.examples.SimpleShortestPathsVertex 
-Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=X input 
output 12 1

where X=1,2,3,12,30
- I notice a total insensitivity to the number of threads I specify.
Aggregate core utilization is always approximately the same (usually 
around 25-30% = only one of the cores running) and overall execution 
time is always the same (~8 mins)


Why is Giraph's performance not scaling? Is the input size / number of 
workers inappropriate? It's not an IO issue either, because even 
during really low core utilization, time is wasted on idle, not on IO.


Cheers,
Alexandros



On 28 November 2012 11:13, Alexandros Daglis 
alexandros.dag...@epfl.ch wrote:


Thank you Avery, that helped a lot!

Regards,
Alexandros


On 27 November 2012 20:57, Avery Ching ach...@apache.org wrote:

Hi Alexandros,

The extra task is for the master process (a coordination
task). In your case, since you are using a single machine, you
can use a single task.

-Dgiraph.SplitMasterWorker=false

and you can try multithreading instead of multiple workers.

-Dgiraph.numComputeThreads=12

The reason why cpu usage increases is due to netty threads to
handle network requests.  By using multithreading instead, you
should bypass this.

Avery


On 11/27/12 9:40 AM, Alexandros Daglis wrote:

Hello everybody,

I went through most of the documentation I could find for
Giraph and also most of the messages in this email list,
but still I have not figured out precisely what a worker
really is. I would really appreciate it if you could help
me understand how the framework works.

At first I thought that a worker has a one-to-one
correspondence to a map task. Apparently this is not
exactly the case, since I have noticed that if I ask for x
workers, the job finishes after having used x+1 map tasks.
What is this extra task for?

I have been trying out the example SSSP application on a
single node with 12 cores. Giving an input graph of ~400MB
and using 1 worker, around 10 GBs of memory are used
during execution. What intrigues me is that if I use 2
workers for the same input (and without limiting memory
per map task), double the memory will be used.
Furthermore, there will be no improvement in performance.
I rather notice a slowdown. Are these observations normal?

Might it be the case that 1 and 2 workers are very few and
I should go to the 30-100 range that is the proposed
number of mappers for a conventional MapReduce job?

Finally, a last observation. Even though I use only 1
worker, I see that there are significant periods during
execution where up to 90% of the 12 cores computing power
is consumed, that is, almost 10 cores are used in
parallel. Does each worker spawn multiple threads and
dynamically balance the load to utilize the available
hardware?

Thanks a lot in advance!

Best,
Alexandros









Re: What a worker really is and other interesting runtime information

2012-11-27 Thread Avery Ching

Hi Alexandros,

The extra task is for the master process (a coordination task). In your 
case, since you are using a single machine, you can use a single task.


-Dgiraph.SplitMasterWorker=false

and you can try multithreading instead of multiple workers.

-Dgiraph.numComputeThreads=12

The reason why cpu usage increases is due to netty threads to handle 
network requests.  By using multithreading instead, you should bypass this.


Avery

On 11/27/12 9:40 AM, Alexandros Daglis wrote:

Hello everybody,

I went through most of the documentation I could find for Giraph and 
also most of the messages in this email list, but still I have not 
figured out precisely what a worker really is. I would really 
appreciate it if you could help me understand how the framework works.


At first I thought that a worker has a one-to-one correspondence to a 
map task. Apparently this is not exactly the case, since I have 
noticed that if I ask for x workers, the job finishes after having 
used x+1 map tasks. What is this extra task for?


I have been trying out the example SSSP application on a single node 
with 12 cores. Giving an input graph of ~400MB and using 1 worker, 
around 10 GBs of memory are used during execution. What intrigues me 
is that if I use 2 workers for the same input (and without limiting 
memory per map task), double the memory will be used. Furthermore, 
there will be no improvement in performance. I rather notice a 
slowdown. Are these observations normal?


Might it be the case that 1 and 2 workers are very few and I should go 
to the 30-100 range that is the proposed number of mappers for a 
conventional MapReduce job?


Finally, a last observation. Even though I use only 1 worker, I see 
that there are significant periods during execution where up to 90% of 
the 12 cores computing power is consumed, that is, almost 10 cores are 
used in parallel. Does each worker spawn multiple threads and 
dynamically balance the load to utilize the available hardware?


Thanks a lot in advance!

Best,
Alexandros