Re: Giraph Partitioning

2015-02-25 Thread Arjun Sharma
Thanks for your replies, Matthew! They are quite helpful. Regarding question
number 4, I see a commit of PartitionContext by Maja here:
http://mail-archives.apache.org/mod_mbox/giraph-commits/201302.mbox/%3c20130209001122.ddad73a...@tyr.zones.apache.org%3E
but it seems to have been removed from the current version?




Re: Giraph Partitioning

2015-02-25 Thread Matthew Saltz
Hi,

1) Partitions are processed in parallel, using the number of compute threads
you specify (giraph.numComputeThreads). The vertices within a partition are
processed sequentially. You may want to use more partitions than threads, so
that if one partition takes a particularly long time to process, the other
threads can continue with the remaining partitions. For example, if you have
four machines with 12 threads each and one worker per machine, the default
number of partitions will be 4^2 = 16, whereas you actually have 48 threads
available, so you'd probably want to set the number of partitions manually
to a larger number to take advantage of the parallelism.
2) Yes
3) If each worker runs only a single compute thread, there's no reason to
use multiple partitions per worker.
3 (the second one)) I'm not familiar with the out-of-core functionality.
4) I'm not sure.

I'm basing this on the version of Giraph from last summer, not the most
recent release, but I don't think this part has changed. You may want to
verify this by looking at the code.
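
To make the arithmetic in answer 1 concrete, here is a minimal Python sketch.
It is not Giraph code: the function names and the partition costs are
hypothetical, and it assumes that each idle compute thread simply takes the
next unprocessed partition from a shared queue. It shows the default count
for four workers, how many of the 12 threads per worker can be busy at once,
and how splitting the same work into more partitions can shorten a superstep.

```python
import heapq

def default_partition_count(num_workers):
    """Default described in this thread: partitions = workers ^ 2."""
    return num_workers ** 2

def busy_threads_per_worker(partitions_per_worker, threads_per_worker):
    """A partition is processed by one thread, so surplus threads sit idle."""
    return min(partitions_per_worker, threads_per_worker)

def makespan(partition_costs, num_threads):
    """Finish time if each idle thread takes the next partition in order.

    Assumption: a shared work queue; this is an illustration, not Giraph's
    actual scheduler.
    """
    finish_times = [0.0] * num_threads
    heapq.heapify(finish_times)
    for cost in partition_costs:
        earliest = heapq.heappop(finish_times)   # thread that frees up first
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)

workers, threads = 4, 12
total = default_partition_count(workers)   # 4^2 = 16 partitions
per_worker = total // workers              # 4 partitions on each worker
print(total, per_worker, busy_threads_per_worker(per_worker, threads))

# Same 12 units of work on one worker with 2 threads (made-up costs):
print(makespan([9, 3], 2))                 # two uneven partitions
print(makespan([3, 3, 3, 1, 1, 1], 2))     # six smaller partitions
```

With the default of 16 partitions, only 4 of the 12 threads on each worker
have anything to do. In the second part, the coarse split finishes at 9
units and the finer split at 6, because the second thread is no longer stuck
waiting on one large partition. As far as I know, the option for setting the
count manually is giraph.userPartitionCount, but please check GiraphConstants
in your version.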

Best,
Matthew

On Wed, Feb 25, 2015 at 3:25 AM, Arjun Sharma wrote:

> Hi,
>
> I understand that by default, the number of partitions = number of workers
> ^ 2. So, if we have N workers, each worker will process N partitions. I
> have a number of questions:
>
> 1- By default, does Giraph process the N partitions within a single worker
> sequentially? If yes, when setting the parameter giraph.numComputeThreads,
> will partitions within each thread be computed sequentially?
>
> 2- By default, does Giraph keep all partitions in memory?
>
> 3- If the answers to 1 and 2 are yes and yes, is there any advantage from
> using multiple partitions versus a single partition in the case of single
> threading per worker?
>
> 3- How do the out-of-core partitions affect out-of-core messages? Are
> they completely independent? For example, if the number of partitions to be
> kept in memory is set to a number less than N, and at the same time all
> messages are set to be kept in memory, will ALL messages be kept in memory,
> even those from out-of-core partitions? If the situation is reversed, where
> all partitions are kept in memory, and out-of-core messaging is set, will
> messages from memory-based partitions be saved on disk?
>
> 4- Is there a class like a PartitionContext, where you can access
> preSuperstep and postSuperstep *per partition*, along the lines of
> WorkerContext?
>
>