Request for information on Giraph custom Partitioner using external service
Hi, I am working on a Graph Partitioning algorithms, and have chosen Giraph as a Graph processing system to run Graph problems, and very new to both.I would like to provide external partitioning information(in the form of txt file) to Giraph. For this I have created a custom partition (something like HashPartitionFactory), which reads the external file for graph partition Id. While debugg I realize that this parition logic is invoked several times (during the Giraph supersteps) ,and reading the same external file multiple times is not time efficient. To handle this I wish to create a global(across distributed system) Map variable which holds {vertex Id , partition Id} as a key value pair, and I want to populate this variable from external file one time during a Giraph job run. I have tried several ways to create & intialize such a global variable but the fact that global variable will be populated for a Giraph job is very non deterministic (i.e sometime the map is populated with value, sometimes not). I think there might be some issue in how I am creating the Map variable and initializing it to be invoked before My custom Partitioning logic calls it. Can somebody please guide me the correct place to plugin this piece of information to a Giraph job; and possibly a correct way of creating a global variable with respect to Giraph distributed processing Thanks & Regards, Neha
Information
Hi there In my project I have to implement a routing system with good performance; at the beginning this system should be able in giving routes information only for one italian region (Lombardia) but it could be used for the whole Italy (or world) Let's stop to the Lombardia for now. By reading OSM files I can create my own graph in the best format i can use it; then I need to use Dijkstra (or any other algorithm) in order to propose to the user K possible paths from point A to point B (K becouse i need to show to the customer also the alternatives). I can't use Contraction Herarchy algorithm becouse I need to take care of external events that can modify the weights on my built graph and this implies that I should create the contracted graph once again and this can be a very onerous operation By my experimentations, I saw that by reading the Lombardia OSM file I should create a graph with around 1 million of vertexes and 6 million of edges and I was thinking to use Giraph to solve my issue (I saw this link http://giraph.apache.org/intro.html where you talked about shortestpaths problem I have a couple of question for you giraph/hadoop gurus - does it make sense to use giraph for my scenario? - must i respect some graph format to pass to the giraph algorithm in order to have K shortest paths from point A to point B? If sowhich format should I respect? - what would be perfomance by using giraph? I know that Dijstra algorithm problem is that it is slow.by using giraph will I be able in improving its performances on very large graph? I know these can seem very basic questions, but I'm pretty new to giraph and I'm trying to understand it Thank you Angelo
Re: Information
For such a small graph, using a single machine graph processing system makes more sense imho. Should be faster and easier to program. Google for cassovary. Am 26.03.2014 10:12 schrieb Angelo Immediata angelo...@gmail.com: Hi there In my project I have to implement a routing system with good performance; at the beginning this system should be able in giving routes information only for one italian region (Lombardia) but it could be used for the whole Italy (or world) Let's stop to the Lombardia for now. By reading OSM files I can create my own graph in the best format i can use it; then I need to use Dijkstra (or any other algorithm) in order to propose to the user K possible paths from point A to point B (K becouse i need to show to the customer also the alternatives). I can't use Contraction Herarchy algorithm becouse I need to take care of external events that can modify the weights on my built graph and this implies that I should create the contracted graph once again and this can be a very onerous operation By my experimentations, I saw that by reading the Lombardia OSM file I should create a graph with around 1 million of vertexes and 6 million of edges and I was thinking to use Giraph to solve my issue (I saw this link http://giraph.apache.org/intro.html where you talked about shortestpaths problem I have a couple of question for you giraph/hadoop gurus - does it make sense to use giraph for my scenario? - must i respect some graph format to pass to the giraph algorithm in order to have K shortest paths from point A to point B? If sowhich format should I respect? - what would be perfomance by using giraph? I know that Dijstra algorithm problem is that it is slow.by using giraph will I be able in improving its performances on very large graph? I know these can seem very basic questions, but I'm pretty new to giraph and I'm trying to understand it Thank you Angelo
Re: Information
hi Sebastian thnx for the answer I was giving a look to cassovary, but I'm losing hot to integrate it with neo4j and/or hor to calculate paths with it.Do you have any tips? Angelo 2014-03-26 11:31 GMT+01:00 Sebastian Schelter ssc.o...@googlemail.com: For such a small graph, using a single machine graph processing system makes more sense imho. Should be faster and easier to program. Google for cassovary. Am 26.03.2014 10:12 schrieb Angelo Immediata angelo...@gmail.com: Hi there In my project I have to implement a routing system with good performance; at the beginning this system should be able in giving routes information only for one italian region (Lombardia) but it could be used for the whole Italy (or world) Let's stop to the Lombardia for now. By reading OSM files I can create my own graph in the best format i can use it; then I need to use Dijkstra (or any other algorithm) in order to propose to the user K possible paths from point A to point B (K becouse i need to show to the customer also the alternatives). I can't use Contraction Herarchy algorithm becouse I need to take care of external events that can modify the weights on my built graph and this implies that I should create the contracted graph once again and this can be a very onerous operation By my experimentations, I saw that by reading the Lombardia OSM file I should create a graph with around 1 million of vertexes and 6 million of edges and I was thinking to use Giraph to solve my issue (I saw this link http://giraph.apache.org/intro.html where you talked about shortestpaths problem I have a couple of question for you giraph/hadoop gurus - does it make sense to use giraph for my scenario? - must i respect some graph format to pass to the giraph algorithm in order to have K shortest paths from point A to point B? If sowhich format should I respect? - what would be perfomance by using giraph? I know that Dijstra algorithm problem is that it is slow.by using giraph will I be able in improving its performances on very large graph? I know these can seem very basic questions, but I'm pretty new to giraph and I'm trying to understand it Thank you Angelo
Re: Information
It looks like you're expecting to use Giraph in an online fashion, such as you would use a database to answer queries within milliseconds or seconds. Giraph is an offline batch processing system. On Wed, Mar 26, 2014 at 11:11 AM, Angelo Immediata angelo...@gmail.comwrote: Hi there In my project I have to implement a routing system with good performance; at the beginning this system should be able in giving routes information only for one italian region (Lombardia) but it could be used for the whole Italy (or world) Let's stop to the Lombardia for now. By reading OSM files I can create my own graph in the best format i can use it; then I need to use Dijkstra (or any other algorithm) in order to propose to the user K possible paths from point A to point B (K becouse i need to show to the customer also the alternatives). I can't use Contraction Herarchy algorithm becouse I need to take care of external events that can modify the weights on my built graph and this implies that I should create the contracted graph once again and this can be a very onerous operation By my experimentations, I saw that by reading the Lombardia OSM file I should create a graph with around 1 million of vertexes and 6 million of edges and I was thinking to use Giraph to solve my issue (I saw this link http://giraph.apache.org/intro.html where you talked about shortestpaths problem I have a couple of question for you giraph/hadoop gurus - does it make sense to use giraph for my scenario? - must i respect some graph format to pass to the giraph algorithm in order to have K shortest paths from point A to point B? If sowhich format should I respect? - what would be perfomance by using giraph? I know that Dijstra algorithm problem is that it is slow.by using giraph will I be able in improving its performances on very large graph? I know these can seem very basic questions, but I'm pretty new to giraph and I'm trying to understand it Thank you Angelo -- Claudio Martella
Re: Information
hi Claudio so, if I understood correctly, it has no sense to use Giraph for shortest path calculation in my scenario Am I right? 2014-03-26 13:27 GMT+01:00 Claudio Martella claudio.marte...@gmail.com: It looks like you're expecting to use Giraph in an online fashion, such as you would use a database to answer queries within milliseconds or seconds. Giraph is an offline batch processing system. On Wed, Mar 26, 2014 at 11:11 AM, Angelo Immediata angelo...@gmail.comwrote: Hi there In my project I have to implement a routing system with good performance; at the beginning this system should be able in giving routes information only for one italian region (Lombardia) but it could be used for the whole Italy (or world) Let's stop to the Lombardia for now. By reading OSM files I can create my own graph in the best format i can use it; then I need to use Dijkstra (or any other algorithm) in order to propose to the user K possible paths from point A to point B (K becouse i need to show to the customer also the alternatives). I can't use Contraction Herarchy algorithm becouse I need to take care of external events that can modify the weights on my built graph and this implies that I should create the contracted graph once again and this can be a very onerous operation By my experimentations, I saw that by reading the Lombardia OSM file I should create a graph with around 1 million of vertexes and 6 million of edges and I was thinking to use Giraph to solve my issue (I saw this link http://giraph.apache.org/intro.html where you talked about shortestpaths problem I have a couple of question for you giraph/hadoop gurus - does it make sense to use giraph for my scenario? - must i respect some graph format to pass to the giraph algorithm in order to have K shortest paths from point A to point B? If sowhich format should I respect? - what would be perfomance by using giraph? I know that Dijstra algorithm problem is that it is slow.by using giraph will I be able in improving its performances on very large graph? I know these can seem very basic questions, but I'm pretty new to giraph and I'm trying to understand it Thank you Angelo -- Claudio Martella
Re: Information
Hi Angelo, It very much depends on your use case. Do you want to precompute paths offline in batch or are you looking for a system that answers online? Giraph has been built for the first scenario. --sebastian On 03/26/2014 02:48 PM, Angelo Immediata wrote: hi Claudio so, if I understood correctly, it has no sense to use Giraph for shortest path calculation in my scenario Am I right? 2014-03-26 13:27 GMT+01:00 Claudio Martella claudio.marte...@gmail.com: It looks like you're expecting to use Giraph in an online fashion, such as you would use a database to answer queries within milliseconds or seconds. Giraph is an offline batch processing system. On Wed, Mar 26, 2014 at 11:11 AM, Angelo Immediata angelo...@gmail.comwrote: Hi there In my project I have to implement a routing system with good performance; at the beginning this system should be able in giving routes information only for one italian region (Lombardia) but it could be used for the whole Italy (or world) Let's stop to the Lombardia for now. By reading OSM files I can create my own graph in the best format i can use it; then I need to use Dijkstra (or any other algorithm) in order to propose to the user K possible paths from point A to point B (K becouse i need to show to the customer also the alternatives). I can't use Contraction Herarchy algorithm becouse I need to take care of external events that can modify the weights on my built graph and this implies that I should create the contracted graph once again and this can be a very onerous operation By my experimentations, I saw that by reading the Lombardia OSM file I should create a graph with around 1 million of vertexes and 6 million of edges and I was thinking to use Giraph to solve my issue (I saw this link http://giraph.apache.org/intro.html where you talked about shortestpaths problem I have a couple of question for you giraph/hadoop gurus - does it make sense to use giraph for my scenario? - must i respect some graph format to pass to the giraph algorithm in order to have K shortest paths from point A to point B? If sowhich format should I respect? - what would be perfomance by using giraph? I know that Dijstra algorithm problem is that it is slow.by using giraph will I be able in improving its performances on very large graph? I know these can seem very basic questions, but I'm pretty new to giraph and I'm trying to understand it Thank you Angelo -- Claudio Martella
Re: Information
hi Sebastian OK...I got itI was thinking I could use it for an online scenario.. Thank you Angelo 2014-03-26 14:52 GMT+01:00 Sebastian Schelter s...@apache.org: Hi Angelo, It very much depends on your use case. Do you want to precompute paths offline in batch or are you looking for a system that answers online? Giraph has been built for the first scenario. --sebastian On 03/26/2014 02:48 PM, Angelo Immediata wrote: hi Claudio so, if I understood correctly, it has no sense to use Giraph for shortest path calculation in my scenario Am I right? 2014-03-26 13:27 GMT+01:00 Claudio Martella claudio.marte...@gmail.com: It looks like you're expecting to use Giraph in an online fashion, such as you would use a database to answer queries within milliseconds or seconds. Giraph is an offline batch processing system. On Wed, Mar 26, 2014 at 11:11 AM, Angelo Immediata angelo...@gmail.com wrote: Hi there In my project I have to implement a routing system with good performance; at the beginning this system should be able in giving routes information only for one italian region (Lombardia) but it could be used for the whole Italy (or world) Let's stop to the Lombardia for now. By reading OSM files I can create my own graph in the best format i can use it; then I need to use Dijkstra (or any other algorithm) in order to propose to the user K possible paths from point A to point B (K becouse i need to show to the customer also the alternatives). I can't use Contraction Herarchy algorithm becouse I need to take care of external events that can modify the weights on my built graph and this implies that I should create the contracted graph once again and this can be a very onerous operation By my experimentations, I saw that by reading the Lombardia OSM file I should create a graph with around 1 million of vertexes and 6 million of edges and I was thinking to use Giraph to solve my issue (I saw this link http://giraph.apache.org/intro.html where you talked about shortestpaths problem I have a couple of question for you giraph/hadoop gurus - does it make sense to use giraph for my scenario? - must i respect some graph format to pass to the giraph algorithm in order to have K shortest paths from point A to point B? If sowhich format should I respect? - what would be perfomance by using giraph? I know that Dijstra algorithm problem is that it is slow.by using giraph will I be able in improving its performances on very large graph? I know these can seem very basic questions, but I'm pretty new to giraph and I'm trying to understand it Thank you Angelo -- Claudio Martella
Re: Information
Nope, you can think about Giraph as MapReduce for graphs. Probably neo4j C is the way to go for you. On Wed, Mar 26, 2014 at 3:18 PM, Angelo Immediata angelo...@gmail.comwrote: hi Sebastian OK...I got itI was thinking I could use it for an online scenario.. Thank you Angelo 2014-03-26 14:52 GMT+01:00 Sebastian Schelter s...@apache.org: Hi Angelo, It very much depends on your use case. Do you want to precompute paths offline in batch or are you looking for a system that answers online? Giraph has been built for the first scenario. --sebastian On 03/26/2014 02:48 PM, Angelo Immediata wrote: hi Claudio so, if I understood correctly, it has no sense to use Giraph for shortest path calculation in my scenario Am I right? 2014-03-26 13:27 GMT+01:00 Claudio Martella claudio.marte...@gmail.com : It looks like you're expecting to use Giraph in an online fashion, such as you would use a database to answer queries within milliseconds or seconds. Giraph is an offline batch processing system. On Wed, Mar 26, 2014 at 11:11 AM, Angelo Immediata angelo...@gmail.com wrote: Hi there In my project I have to implement a routing system with good performance; at the beginning this system should be able in giving routes information only for one italian region (Lombardia) but it could be used for the whole Italy (or world) Let's stop to the Lombardia for now. By reading OSM files I can create my own graph in the best format i can use it; then I need to use Dijkstra (or any other algorithm) in order to propose to the user K possible paths from point A to point B (K becouse i need to show to the customer also the alternatives). I can't use Contraction Herarchy algorithm becouse I need to take care of external events that can modify the weights on my built graph and this implies that I should create the contracted graph once again and this can be a very onerous operation By my experimentations, I saw that by reading the Lombardia OSM file I should create a graph with around 1 million of vertexes and 6 million of edges and I was thinking to use Giraph to solve my issue (I saw this link http://giraph.apache.org/intro.html where you talked about shortestpaths problem I have a couple of question for you giraph/hadoop gurus - does it make sense to use giraph for my scenario? - must i respect some graph format to pass to the giraph algorithm in order to have K shortest paths from point A to point B? If sowhich format should I respect? - what would be perfomance by using giraph? I know that Dijstra algorithm problem is that it is slow.by using giraph will I be able in improving its performances on very large graph? I know these can seem very basic questions, but I'm pretty new to giraph and I'm trying to understand it Thank you Angelo -- Claudio Martella -- Claudio Martella
Re: What a worker really is and other interesting runtime information
Ok, so I added the partitions flag, going with hadoop jar target/giraph-0.1-jar-with-dependencies.jar org.apache.giraph.examples.SimpleShortestPathsVertex -Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=12 -Dhash.userPartitionCount=12 input output 12 1 but still I got no overall speedup at all (compared to using 1 thread) and only 1 out of 12 cores is utilized at most times. Isn't Giraph supposed to exploit parallelism to get some speedup? Any other suggestion? Thanks, Alexandros On 29 November 2012 00:20, Avery Ching ach...@apache.org wrote: Oh, forgot one thing. You need to set the number of partitions to use single each thread works on a single partition at a time. Try -Dhash.userPartitionCount=number of threads On 11/28/12 5:29 AM, Alexandros Daglis wrote: Dear Avery, I followed your advice, but the application seems to be totally thread-count-insensitive: I literally observe zero scaling of performance, while I increase the thread count. Maybe you can point out if I am doing something wrong. - Using only 4 cores on a single node at the moment - Input graph: 14 million vertices, file size is 470 MB - Running SSSP as follows: hadoop jar target/giraph-0.1-jar-with-dependencies.jar org.apache.giraph.examples.SimpleShortestPathsVertex -Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=X input output 12 1 where X=1,2,3,12,30 - I notice a total insensitivity to the number of thread I specify. Aggregate core utilization is always approximately the same (usually around 25-30% = only one of the cores running) and overall execution time is always the same (~8 mins) Why is Giraph's performance not scaling? Is the input size / number of workers inappropriate? It's not an IO issue either, because even during really low core utilization, time is wasted on idle, not on IO. Cheers, Alexandros On 28 November 2012 11:13, Alexandros Daglis alexandros.dag...@epfl.chwrote: Thank you Avery, that helped a lot! Regards, Alexandros On 27 November 2012 20:57, Avery Ching ach...@apache.org wrote: Hi Alexandros, The extra task is for the master process (a coordination task). In your case, since you are using a single machine, you can use a single task. -Dgiraph.SplitMasterWorker=false and you can try multithreading instead of multiple workers. -Dgiraph.numComputeThreads=12 The reason why cpu usage increases is due to netty threads to handle network requests. By using multithreading instead, you should bypass this. Avery On 11/27/12 9:40 AM, Alexandros Daglis wrote: Hello everybody, I went through most of the documentation I could find for Giraph and also most of the messages in this email list, but still I have not figured out precisely what a worker really is. I would really appreciate it if you could help me understand how the framework works. At first I thought that a worker has a one-to-one correspondence to a map task. Apparently this is not exactly the case, since I have noticed that if I ask for x workers, the job finishes after having used x+1 map tasks. What is this extra task for? I have been trying out the example SSSP application on a single node with 12 cores. Giving an input graph of ~400MB and using 1 worker, around 10 GBs of memory are used during execution. What intrigues me is that if I use 2 workers for the same input (and without limiting memory per map task), double the memory will be used. Furthermore, there will be no improvement in performance. I rather notice a slowdown. Are these observations normal? Might it be the case that 1 and 2 workers are very few and I should go to the 30-100 range that is the proposed number of mappers for a conventional MapReduce job? Finally, a last observation. Even though I use only 1 worker, I see that there are significant periods during execution where up to 90% of the 12 cores computing power is consumed, that is, almost 10 cores are used in parallel. Does each worker spawn multiple threads and dynamically balances the load to utilize the available hardware? Thanks a lot in advance! Best, Alexandros
RE: What a worker really is and other interesting runtime information
Folks, I have some of the same questions as Alexandros below. What is exactly is a worker? I am not sure I understood Avery's answer below. I have 4-node cluster. Each node has 24 nodes. My first node is functioning (in MapReduce parlance) as both a job tracker as well as a task tracker. So I have 4 compute nodes. (I have verified that master/slave config is correct). I am launching the Giraph SimpleShortestPathsVertex example on an input graph with approximately 140,000 nodes/ 410,000 edges and the computation is taking approx. 6 minutes. Although I don't know what a good number is, 6 minutes seems rather slow given all the compute horsepower I have at my disposal. When I monitor top on my machines while the compute is running, my cores are ~ 80-90% idle. I am launching my job with the following parameters: ./giraph -Dgiraph.useSuperstepCounters=false -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar org.apache.giraph.examples.SimpleShortestPathsVertex -if org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip /user/hduser/in -of org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op /user/hduser/out -w 3 Note that I have my number of workers (-w =3). Should this be some other value? Does anyone have any simple configuration suggestions that will help me tune Giraph to my problem? Thanks! Bence From: Alexandros Daglis [mailto:alexandros.dag...@epfl.ch] Sent: Thursday, November 29, 2012 6:19 AM To: user@giraph.apache.org Subject: Re: What a worker really is and other interesting runtime information Ok, so I added the partitions flag, going with hadoop jar target/giraph-0.1-jar-with-dependencies.jar org.apache.giraph.examples.SimpleShortestPathsVertex -Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=12 -Dhash.userPartitionCount=12 input output 12 1 but still I got no overall speedup at all (compared to using 1 thread) and only 1 out of 12 cores is utilized at most times. Isn't Giraph supposed to exploit parallelism to get some speedup? Any other suggestion? Thanks, Alexandros On 29 November 2012 00:20, Avery Ching ach...@apache.orgmailto:ach...@apache.org wrote: Oh, forgot one thing. You need to set the number of partitions to use single each thread works on a single partition at a time. Try -Dhash.userPartitionCount=number of threads On 11/28/12 5:29 AM, Alexandros Daglis wrote: Dear Avery, I followed your advice, but the application seems to be totally thread-count-insensitive: I literally observe zero scaling of performance, while I increase the thread count. Maybe you can point out if I am doing something wrong. - Using only 4 cores on a single node at the moment - Input graph: 14 million vertices, file size is 470 MB - Running SSSP as follows: hadoop jar target/giraph-0.1-jar-with-dependencies.jar org.apache.giraph.examples.SimpleShortestPathsVertex -Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=X input output 12 1 where X=1,2,3,12,30 - I notice a total insensitivity to the number of thread I specify. Aggregate core utilization is always approximately the same (usually around 25-30% = only one of the cores running) and overall execution time is always the same (~8 mins) Why is Giraph's performance not scaling? Is the input size / number of workers inappropriate? It's not an IO issue either, because even during really low core utilization, time is wasted on idle, not on IO. Cheers, Alexandros On 28 November 2012 11:13, Alexandros Daglis alexandros.dag...@epfl.chmailto:alexandros.dag...@epfl.ch wrote: Thank you Avery, that helped a lot! Regards, Alexandros On 27 November 2012 20:57, Avery Ching ach...@apache.orgmailto:ach...@apache.org wrote: Hi Alexandros, The extra task is for the master process (a coordination task). In your case, since you are using a single machine, you can use a single task. -Dgiraph.SplitMasterWorker=false and you can try multithreading instead of multiple workers. -Dgiraph.numComputeThreads=12 The reason why cpu usage increases is due to netty threads to handle network requests. By using multithreading instead, you should bypass this. Avery On 11/27/12 9:40 AM, Alexandros Daglis wrote: Hello everybody, I went through most of the documentation I could find for Giraph and also most of the messages in this email list, but still I have not figured out precisely what a worker really is. I would really appreciate it if you could help me understand how the framework works. At first I thought that a worker has a one-to-one correspondence to a map task. Apparently this is not exactly the case, since I have noticed that if I ask for x workers, the job finishes after having used x+1 map tasks. What is this extra task for? I have been trying out the example SSSP application on a single node with 12 cores. Giving an input graph of ~400MB and using 1 worker, around 10 GBs of memory are used during execution. What intrigues me
Re: What a worker really is and other interesting runtime information
Hello Bence, So, you have 96 cores at your disposal. My guess would be that 3 workers are not enough to use all of them, you should either try with a lot more, or try to multithread them as Avery said (thus, try 4 workers with 24 threads each). However, as I already reported, I tried this myself and I didn't notice any improvement in performance. I guess scaling with the number of worker threads is fundamental for a BSP framework so that's really weird, I guess I must be doing something wrong. Could you please try increasing your workers and tell me if you also don't notice any improvement in performance? Furthermore, have you tried running with both 1 and 3 workers? Did you see any difference there? I really want to sort this scalability issue out... On a final note, please don't use the word node for describing a lot of different things, it got me quite confused :-) Your cluster's *nodes* have * cores* and your input is a graph with 140k *vertices. *Cheers, Alexandros On 29 November 2012 14:02, Magyar, Bence (US SSA) bence.mag...@baesystems.com wrote: Folks, ** ** I have some of the same questions as Alexandros below. What is exactly is “a worker”? I am not sure I understood Avery’s answer below. I have 4-node cluster. Each node has 24 nodes. My first node is functioning (in MapReduce parlance) as both a “job tracker” as well as a “task tracker”. So I have 4 compute nodes. (I have verified that master/slave config is correct). I am launching the Giraph SimpleShortestPathsVertex example on an input graph with approximately 140,000 nodes/ 410,000 edges and the computation is taking approx. 6 minutes. Although I don’t know what a “good” number is, 6 minutes seems rather “slow” given all the compute horsepower I have at my disposal. When I monitor “top” on my machines while the compute is running, my cores are ~ 80-90% idle. ** ** I am launching my job with the following parameters: ** ** ./giraph -Dgiraph.useSuperstepCounters=false -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar org.apache.giraph.examples.SimpleShortestPathsVertex -if org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip /user/hduser/in -of org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op /user/hduser/out -w 3 ** ** Note that I have my number of workers (–w =3). Should this be some other value? Does anyone have any simple configuration suggestions that will help me tune Giraph to my problem? ** ** Thanks! ** ** Bence ** ** *From:* Alexandros Daglis [mailto:alexandros.dag...@epfl.ch] *Sent:* Thursday, November 29, 2012 6:19 AM *To:* user@giraph.apache.org *Subject:* Re: What a worker really is and other interesting runtime information ** ** Ok, so I added the partitions flag, going with hadoop jar target/giraph-0.1-jar-with-dependencies.jar org.apache.giraph.examples.SimpleShortestPathsVertex -Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=12 -Dhash.userPartitionCount=12 input output 12 1 but still I got no overall speedup at all (compared to using 1 thread) and only 1 out of 12 cores is utilized at most times. Isn't Giraph supposed to exploit parallelism to get some speedup? Any other suggestion? Thanks, Alexandros On 29 November 2012 00:20, Avery Ching ach...@apache.org wrote: Oh, forgot one thing. You need to set the number of partitions to use single each thread works on a single partition at a time. Try -Dhash.userPartitionCount=number of threads On 11/28/12 5:29 AM, Alexandros Daglis wrote: Dear Avery, I followed your advice, but the application seems to be totally thread-count-insensitive: I literally observe zero scaling of performance, while I increase the thread count. Maybe you can point out if I am doing something wrong. - Using only 4 cores on a single node at the moment - Input graph: 14 million vertices, file size is 470 MB - Running SSSP as follows: hadoop jar target/giraph-0.1-jar-with-dependencies.jar org.apache.giraph.examples.SimpleShortestPathsVertex -Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=X input output 12 1 where X=1,2,3,12,30 - I notice a total insensitivity to the number of thread I specify. Aggregate core utilization is always approximately the same (usually around 25-30% = only one of the cores running) and overall execution time is always the same (~8 mins) Why is Giraph's performance not scaling? Is the input size / number of workers inappropriate? It's not an IO issue either, because even during really low core utilization, time is wasted on idle, not on IO. Cheers, Alexandros On 28 November 2012 11:13, Alexandros Daglis alexandros.dag...@epfl.ch wrote: Thank you Avery, that helped a lot! Regards, Alexandros ** ** On 27 November 2012 20:57, Avery Ching ach...@apache.org wrote: Hi Alexandros, The extra task
RE: What a worker really is and other interesting runtime information
Hi Alexandros, I increased my number of workers to 30, but my job just hangs at 3%: ./giraph -Dgiraph.useSuperstepCounters=false -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar org.apache.giraph.examples.SimpleShortestPathsVertex -if org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip /user/hduser/insight -of org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op /user/hduser/insight-out-30 -w 30 No lib directory, assuming dev environment No HADOOP_CONF_DIR set, using /opt/hadoop-1.0.3/conf 12/11/29 14:31:14 INFO mapred.JobClient: Running job: job_201211282132_0002 12/11/29 14:31:15 INFO mapred.JobClient: map 0% reduce 0% 12/11/29 14:31:34 INFO mapred.JobClient: map 3% reduce 0% 12/11/29 14:36:39 INFO mapred.JobClient: Job complete: job_201211282132_0002 12/11/29 14:36:39 INFO mapred.JobClient: Counters: 5 12/11/29 14:36:39 INFO mapred.JobClient: Job Counters 12/11/29 14:36:39 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=2451427 12/11/29 14:36:39 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/11/29 14:36:39 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/11/29 14:36:39 INFO mapred.JobClient: Launched map tasks=8 12/11/29 14:36:39 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=3725 Looking at the logs on my master machine: /opt/hadoop-1.0.3/logs/userlogs/job_201211282132_0002/attempt_201211282132_0002_m_00_0 $ I see: 2012-11-29 14:32:00,732 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 7 responses of 30 needed to start superstep -1. Sleeping for 3 msecs and used 0 of 10 attempts. 2012-11-29 14:32:30,766 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 7 responses of 30 needed to start superstep -1. Sleeping for 3 msecs and used 1 of 10 attempts. 2012-11-29 14:33:00,807 INFO org.apache.giraph.graph.BspServiceMaster: checkWorkers: Only found 7 responses of 30 needed to start superstep -1. Sleeping for 3 msecs and used 2 of 10 attempts. This repeats until: Sleeping for 3 msecs and used 8 of 10 attempts. Can anyone please provide guidance on what number of workers should be set to given some general characteristics/topology of a cluster? (Mine are given in the email chain below) -Bence From: Alexandros Daglis [mailto:alexandros.dag...@epfl.ch] Sent: Thursday, November 29, 2012 12:30 PM To: user@giraph.apache.org Subject: Re: What a worker really is and other interesting runtime information Hello Bence, So, you have 96 cores at your disposal. My guess would be that 3 workers are not enough to use all of them, you should either try with a lot more, or try to multithread them as Avery said (thus, try 4 workers with 24 threads each). However, as I already reported, I tried this myself and I didn't notice any improvement in performance. I guess scaling with the number of worker threads is fundamental for a BSP framework so that's really weird, I guess I must be doing something wrong. Could you please try increasing your workers and tell me if you also don't notice any improvement in performance? Furthermore, have you tried running with both 1 and 3 workers? Did you see any difference there? I really want to sort this scalability issue out... On a final note, please don't use the word node for describing a lot of different things, it got me quite confused :-) Your cluster's nodes have cores and your input is a graph with 140k vertices. Cheers, Alexandros On 29 November 2012 14:02, Magyar, Bence (US SSA) bence.mag...@baesystems.commailto:bence.mag...@baesystems.com wrote: Folks, I have some of the same questions as Alexandros below. What is exactly is a worker? I am not sure I understood Avery's answer below. I have 4-node cluster. Each node has 24 nodes. My first node is functioning (in MapReduce parlance) as both a job tracker as well as a task tracker. So I have 4 compute nodes. (I have verified that master/slave config is correct). I am launching the Giraph SimpleShortestPathsVertex example on an input graph with approximately 140,000 nodes/ 410,000 edges and the computation is taking approx. 6 minutes. Although I don't know what a good number is, 6 minutes seems rather slow given all the compute horsepower I have at my disposal. When I monitor top on my machines while the compute is running, my cores are ~ 80-90% idle. I am launching my job with the following parameters: ./giraph -Dgiraph.useSuperstepCounters=false -DSimpleShortestPathsVertex.sourceId=100 ../target/giraph.jar org.apache.giraph.examples.SimpleShortestPathsVertex -if org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexInputFormat -ip /user/hduser/in -of org.apache.giraph.io.JsonLongDoubleFloatDoubleVertexOutputFormat -op /user/hduser/out -w 3 Note that I have my number of workers (-w =3). Should this be some other value? Does anyone have any
Re: What a worker really is and other interesting runtime information
Oh, forgot one thing. You need to set the number of partitions to use single each thread works on a single partition at a time. Try -Dhash.userPartitionCount=number of threads On 11/28/12 5:29 AM, Alexandros Daglis wrote: Dear Avery, I followed your advice, but the application seems to be totally thread-count-insensitive: I literally observe zero scaling of performance, while I increase the thread count. Maybe you can point out if I am doing something wrong. - Using only 4 cores on a single node at the moment - Input graph: 14 million vertices, file size is 470 MB - Running SSSP as follows: hadoop jar target/giraph-0.1-jar-with-dependencies.jar org.apache.giraph.examples.SimpleShortestPathsVertex -Dgiraph.SplitMasterWorker=false -Dgiraph.numComputeThreads=X input output 12 1 where X=1,2,3,12,30 - I notice a total insensitivity to the number of thread I specify. Aggregate core utilization is always approximately the same (usually around 25-30% = only one of the cores running) and overall execution time is always the same (~8 mins) Why is Giraph's performance not scaling? Is the input size / number of workers inappropriate? It's not an IO issue either, because even during really low core utilization, time is wasted on idle, not on IO. Cheers, Alexandros On 28 November 2012 11:13, Alexandros Daglis alexandros.dag...@epfl.ch mailto:alexandros.dag...@epfl.ch wrote: Thank you Avery, that helped a lot! Regards, Alexandros On 27 November 2012 20:57, Avery Ching ach...@apache.org mailto:ach...@apache.org wrote: Hi Alexandros, The extra task is for the master process (a coordination task). In your case, since you are using a single machine, you can use a single task. -Dgiraph.SplitMasterWorker=false and you can try multithreading instead of multiple workers. -Dgiraph.numComputeThreads=12 The reason why cpu usage increases is due to netty threads to handle network requests. By using multithreading instead, you should bypass this. Avery On 11/27/12 9:40 AM, Alexandros Daglis wrote: Hello everybody, I went through most of the documentation I could find for Giraph and also most of the messages in this email list, but still I have not figured out precisely what a worker really is. I would really appreciate it if you could help me understand how the framework works. At first I thought that a worker has a one-to-one correspondence to a map task. Apparently this is not exactly the case, since I have noticed that if I ask for x workers, the job finishes after having used x+1 map tasks. What is this extra task for? I have been trying out the example SSSP application on a single node with 12 cores. Giving an input graph of ~400MB and using 1 worker, around 10 GBs of memory are used during execution. What intrigues me is that if I use 2 workers for the same input (and without limiting memory per map task), double the memory will be used. Furthermore, there will be no improvement in performance. I rather notice a slowdown. Are these observations normal? Might it be the case that 1 and 2 workers are very few and I should go to the 30-100 range that is the proposed number of mappers for a conventional MapReduce job? Finally, a last observation. Even though I use only 1 worker, I see that there are significant periods during execution where up to 90% of the 12 cores computing power is consumed, that is, almost 10 cores are used in parallel. Does each worker spawn multiple threads and dynamically balances the load to utilize the available hardware? Thanks a lot in advance! Best, Alexandros
Re: What a worker really is and other interesting runtime information
Hi Alexandros, The extra task is for the master process (a coordination task). In your case, since you are using a single machine, you can use a single task. -Dgiraph.SplitMasterWorker=false and you can try multithreading instead of multiple workers. -Dgiraph.numComputeThreads=12 The reason why cpu usage increases is due to netty threads to handle network requests. By using multithreading instead, you should bypass this. Avery On 11/27/12 9:40 AM, Alexandros Daglis wrote: Hello everybody, I went through most of the documentation I could find for Giraph and also most of the messages in this email list, but still I have not figured out precisely what a worker really is. I would really appreciate it if you could help me understand how the framework works. At first I thought that a worker has a one-to-one correspondence to a map task. Apparently this is not exactly the case, since I have noticed that if I ask for x workers, the job finishes after having used x+1 map tasks. What is this extra task for? I have been trying out the example SSSP application on a single node with 12 cores. Giving an input graph of ~400MB and using 1 worker, around 10 GBs of memory are used during execution. What intrigues me is that if I use 2 workers for the same input (and without limiting memory per map task), double the memory will be used. Furthermore, there will be no improvement in performance. I rather notice a slowdown. Are these observations normal? Might it be the case that 1 and 2 workers are very few and I should go to the 30-100 range that is the proposed number of mappers for a conventional MapReduce job? Finally, a last observation. Even though I use only 1 worker, I see that there are significant periods during execution where up to 90% of the 12 cores computing power is consumed, that is, almost 10 cores are used in parallel. Does each worker spawn multiple threads and dynamically balances the load to utilize the available hardware? Thanks a lot in advance! Best, Alexandros