Hi,

My name is Xin Tan, majored in Computer Science, Peking University, China. I am 
a first year PHD student. Recently I??m doing a research on OpenStack about the 
contribution composition of a code file, which in order to explore the 
contribution pattern of different kinds of files. Further, I want to locate the 
risk file to give developers some useful information.

 

I wonder if I could show you my study, including some metrics to describe the 
contribution composition of a code file. I would appreciate it if you could 
show your opinions or give some advices, which would really, really help me a 
lot. And it would only take you a little time.

 

Thank you so much for your kindness.

 

First of all, I would give you a brief introduction to my study. We all know 
large software projects are usually composed of some general file types. For 
different file types, the contribution composition of them are various. For 
example, the files which are responsible for implementing core functionalities 
are usually very complex and they are modified very often, of course, the 
number of contributors are very high. However, relatively independent files are 
usually maintained by a small number of people who have high ownership of the 
code. Previous studies have shown that code ownership is closely related to 
software quality. And it??s very easy to understand if a developer often 
contributes to a certain code file, he is most likely to be an expert to it. To 
some extent, he is the person in charge of the code file, which is conducive to 
the stability of the quality of the code file. But if he leaves suddenly, the 
risk of this code file may be very high. Therefore, starting from code 
ownership, we analysis the contribution composition of different file types, 
which in order to: 

1)         knowing the contribution composition of files in real time.

2)         exploring the contribution pattern of different kinds of files.

3)         Locating the risk file.

First, we define three metrics to describe the contribution composition of 
files.

1)         Centrality: 

The Centrality of a file refers to the proportion of ownership for the 
contributor with the highest proportion of ownership, which is calculated by 
the number of commit times. It could indicate that whether there is one 
developer who ??owns?? the file and has a high level of expertise, who can act 
as a single point of contact for others who need to use the component, need 
changes to it, or just have questions about it.

2)         Diversity:

We measured the uncertainty in a code file's contributions (or the diversity of 
sources of contributions) in a given period using the Teachman/Shannon entropy 
index, a commonly used diversity measure in many scientific disciplines.

H(x)=E[I(xi)]=E[log(2,1/p(xi))]=-??p(xi)log(2,p(xi))(i=1,2,..n), 

p(xi) is the code ownership of developer xi, I(xi) means the information we 
need to judge if a contribution belongs to developer xi. H(x) ranges between 0, 
when all the contribution of the file belong to one developer in a release, and 
log(2, N), when N developers contribute equally (i.e., pi = 1/N) to the file. 
The larger H(x) is, the more diverse the contribution of the file is.

We assume that the more diverse the contribution, the more bugs the code file 
would have in this release. And We have proven that there is a significant 
positive correlation between the contribution diversity and the amount of 
defect of the file.

3)         Stability:

The Stability of file means its personnel scheduling. It calculated by the 
total number of the contributors of this file who leave or join relative to the 
previous cycle. When the number of contributors to a file is instable, it 
usually means the high risk. 

 

Then, we choose a nova release for a case study. We define several different 
files types according to the functionalities of code file and refer to the 
measurement value.

 

File type

Example in nova


The test file for active file

nova/tests/unit/virt/libvirt/test_driver.py


Exception handling file

nova/exception.py


Privilege management file

etc/nova/policy.json


Core interface file

nova/compute/api.py


Key function implementation file

nova/compute/manager.py


Module function implementation file

nova/conductor/manager.py


Function realization file of complex module

nova/db/sqlalchemy/models.py


Module interface file

nova/api/metadata/base.py


Module test file

nova/tests/unit/conductor/test_conductor.py


Module configuration file

nova/conf/scheduler.py


Non function implementation file

requirements.txt


i18n file

nova/locale/zh_CN/LC_MESSAGES/nova.po


 

And we calculate the above metrics of the nova active files (of course it is 
not accurate, because the contribution composition is effected by many factors 
not only file types.). We find three patterns.

Centrality: low

Diversity: high

Stability: low

Metric_1<=0.2

3=<Metric_2

14=<Metric_3

Key function implementation file


Function realization file of complex module


The test file for active file


Exception handling file


Privilege management file


Core interface file


Centrality: medium

Diversity: medium

Stability: medium

0.2< Metric_1 <=0.7

2<= Metric_2<3

5=< Metric_3<14

Module function implementation file


Module interface file


Module test file


Module configuration file


Centrality: high

Diversity: low

Stability: high

0.7< Metric_1<=1

0<= Metric_2<2

0=< Metric_3<5

Non function implementation file


i18n file


 

For locating high risk file, I have two points.

1)        Pattern 1(Centrality: low/ Diversity: high/ Stability: low) should be 
paid much more attention. But there are special cases, for example, Exception 
handling file, although it is modified too often, it is not complex, so the 
risk of it is low.

2)         If the contribution composition of a file are significantly various 
between two cycle, it should be paid more attention to.

 

Ok, that??s almost what I??m doing. I hope that I have expressed my ideas 
clearly. And I really hope to know what you think about my work on the 
following three questions, which would give me great help on my research:

 

1)         Do you think the metrics are useful for developers and project 
managers in some way?

In particular, could the Centrality be used to identify the experts of the file 
and how would it help in practice? And do you think that files with high code 
ownership would result in higher code quality and fewer failures?

Do you think the contribution diversity could act as an indicator for high risk 
of lower code quality of the file in some way and why? And what would it mean 
in practice when the contribution diversity of a file changes a lot?

Do you agree that when contributors left the project, their code would be hard 
to be maintained by others, and contributions made by newcomers would be more 
likely to bring bugs to the files? So would it help by knowing how many people 
left the project and how many people are newcomers to the projects and who are 
them? If yes, how would it help in practice?

2)         What??s your opinion of exploring contribution composition of file 
from different file types is reasonable?

3)         Do you think the type files I divided is reasonable? 

4)         For different type of files, based on your developing experience, 
what??s the idea contribution composition pattern of different file types?

5)        Any other suggestion or ideals? 

Again, I would appreciate it a lot if you could give me some advices. And thank 
you so much for your time.

Looking forward to your reply. Wish you have a good day.
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to