Hi Muaz,

This is a very interesting topic! First of all, the top 2 (number of workers, heap size) are indeed the most important. I also think giraph.numComputeThreads, and probably the Netty-related thread parameters, are more important than their current ranking suggests.
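To make this a bit more concrete, here is a rough sketch of how the higher-impact parameters could be laid out as a search space for a tuner. The candidate values are purely illustrative assumptions on my side, not recommendations. The choice of giraph.nettyClientThreads and giraph.nettyServerThreads as "the Netty-related thread parameters" is also an assumption, and the helper names (SEARCH_SPACE, sample_config, space_size) are hypothetical.

```python
import random

# Candidate values below are illustrative assumptions, not tuned recommendations.
# "workers" and "yarnheap_mb" stand for the -w and -yarnheap options in the list.
SEARCH_SPACE = {
    "workers": [2, 4, 8, 16],
    "yarnheap_mb": [1024, 2048, 4096, 8192],
    "giraph.numComputeThreads": [1, 2, 4, 8],
    "giraph.nettyClientThreads": [2, 4, 8],
    "giraph.nettyServerThreads": [8, 16, 32],
}


def space_size(space):
    """Return the number of distinct configurations in the search space."""
    total = 1
    for values in space.values():
        total *= len(values)
    return total


def sample_config(space, rng):
    """Draw one configuration uniformly at random (plain random search)."""
    return {name: rng.choice(values) for name, values in space.items()}


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a run can be repeated
    print(space_size(SEARCH_SPACE), "configurations in total")
    for _ in range(3):
        print(sample_config(SEARCH_SPACE, rng))
```

Even with only five parameters and a handful of values each, the space already holds several hundred configurations, which is one reason to drop the low-impact parameters early.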
The following are less important:

- giraph.maxMutationsPerRequest: mutations are a feature that probably kicks in for a more limited set of applications, and usually only in certain phases of an application. I would expect this to have limited impact compared to the other parameters.
- giraph.useMessageSizeEncoding: this applies only to a limited set of applications, depending on the types of vertex IDs, values, etc. that they use.

Also, I would exclude the following:

- giraph.VerticesToUpdateProgress: this is just used to keep stats and is not important for processing. I doubt it will have any performance impact.
- giraph.maxPartitionsInMemory: the out-of-core mechanism can be a bit unreliable, and would make your study harder.
- giraph.checkpointFrequency: checkpointing may not be that common a feature, and hasn't been properly maintained, so you may have trouble using it.

Aside from these, you could consider some GC-related parameters: the type of GC (e.g. parallel), the size of the new generation, and the GC survivor ratio.

I would love to learn more about how you'll be approaching the problem, and of course I'm looking forward to the results.

On Wed, Mar 13, 2019 at 6:12 AM Muaz Twaty <muaz.tw...@euranova.eu> wrote:

> Hello Giraph community,
>
> "*Parameter tuning of graph processing frameworks*" is the domain of
> research for my master thesis. The objective of the thesis is to find an
> automated method to choose an optimal/sub-optimal configuration for the
> graph processing frameworks. At this point, I have reviewed the state of the
> art in the optimization literature and the available graph processing
> frameworks. *Giraph* is the first framework that I started to explore in
> detail and run jobs with, hoping that it will be the
> framework which I will apply the optimization algorithms on.
>
> My question is regarding the set of parameters which should be chosen to
> optimize. Since I am not a Giraph expert, I thought the best way is to ask
> the community.
> I made a list of Giraph parameters which I thought are
> important and are related directly to the framework performance. The
> parameters with higher ranks are parameters which I think are more
> important. I hope that you can give feedback about the list: *is it a good
> set of parameters to optimize? Are there some parameters in the set which
> should be fixed for all different kinds of jobs? Any suggestions to change
> the ranking, add or remove parameters?*
>
> I will add more parameters regarding the used hardware (number of CPUs,
> size of RAM per CPU and hard disk speed), but the point of this email is to
> focus on the parameters of *Giraph*.
>
> Thanks,
> Muaz TWATY
> *EURA NOVA*
>
> Source  Rank  Parameter name                      Default   Details
> Hadoop  1     -w                                  required  Number of workers
> Hadoop  2     -yarnheap                           1024      Heap size, in MB (integer), for each Giraph task (YARN only)
> Giraph  3     giraph.useInputSplitLocality        TRUE      To minimize network usage when reading input splits, each
>                                                             worker can prioritize splits that reside on its host. This,
>                                                             however, comes at the cost of increased load on ZooKeeper.
>                                                             Hence, users with a lot of splits and input threads (or with
>                                                             configurations that can't exploit locality) may want to
>                                                             disable it.
> Giraph  4     giraph.useMessageSizeEncoding       FALSE     Use message size encoding (typically better for complex
>                                                             objects, not meant for primitive wrapped messages)
> Giraph  5     giraph.VerticesToUpdateProgress     100000    Minimum number of vertices to compute before updating
>                                                             worker progress
> Giraph  6     giraph.maxMutationsPerRequest       100       Maximum number of mutations per partition before flush
> Giraph  7     giraph.maxPartitionsInMemory        0         Maximum number of partitions to hold in memory for each
>                                                             worker. By default it is set to 0 (for adaptive out-of-core
>                                                             mechanism)
> Giraph  8     giraph.clientReceiveBufferSize      32768     Client receive buffer size
> Giraph  9     giraph.clientSendBufferSize         524288    Client send buffer size
> Giraph  10    giraph.serverReceiveBufferSize      524288    Server receive buffer size
> Giraph  11    giraph.serverSendBufferSize         32768     Server send buffer size
> Giraph  12    giraph.async.message.store.threads  0         Number of threads to be used in async message store
> Giraph  13    giraph.channelsPerServer            1         Number of channels used per server
> Giraph  14    giraph.nettyClientExecutionThreads  8         Netty client execution threads (execution handler)
> Giraph  15    giraph.nettyClientThreads           4         Netty client threads
> Giraph  16    giraph.nettyServerExecutionThreads  8         Netty server execution threads (execution handler)
> Giraph  17    giraph.nettyServerThreads           16        Netty server threads
> Giraph  18    giraph.numComputeThreads            1         Number of threads for vertex computation
> Giraph  19    giraph.checkpointFrequency          0         How often to checkpoint (i.e. 0 means no checkpoint, 1 means
>                                                             every superstep, 2 is every two supersteps, etc.)