Hi, Hadoop Community.

My name is Igor Wiese, and I am a PhD student from Brazil. I sent an
email a week ago about my research. We received some visits to the
results page, but no feedback was provided.

I am investigating two important questions: What makes two files
change together? Can we predict when they are going to co-change
again?

I have investigated these questions on the Hadoop project. I collected
data from issue reports, discussions, and commits, and used machine
learning techniques to build a prediction model.


I collected a total of 950 commits in which a pair of files changed
together and could correctly predict 47% of them. The following pieces
of information were the most useful for predicting co-changes of files:

- sum of number of lines of code added, modified and removed,

- number of words used to describe and discuss the issues,

- median value of closeness, a social network measure obtained from
issue comments,

- median value of constraint, a social network measure obtained from
issue comments, and

- median value of hierarchy, a social network measure obtained from
issue comments.

To illustrate, consider the following example from our analysis. For
release 0.22, the files "/ipc/Client.java" and
"security/SecurityUtil.java" changed together in 3 commits. In 1 other
commit, only the first file changed. By collecting contextual
information for each commit made to the first file in the previous
release, we were able to predict 2 of the commits in which both files
changed together in release 0.22, and we issued only 1 wrong
prediction. For this pair of files, the most important contextual
information was the set of social network metrics (density, hierarchy,
efficiency) obtained from issue comments.
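
For readers who prefer precision and recall, here is a rough sketch of
how the counts in this example could be converted. It assumes that the
1 wrong prediction was a false positive (a co-change predicted where
only the first file changed) and that the third real co-change was
missed. The function and variable names are only illustrative and are
not part of our tooling.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for a set of co-change predictions."""
    predicted = true_pos + false_pos   # co-changes the model announced
    actual = true_pos + false_neg      # co-changes that really happened
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Counts taken from the Client.java / SecurityUtil.java example.
co_changes = 3   # commits in release 0.22 where both files changed
correct = 2      # co-changes that were predicted correctly
wrong = 1        # assumption: the wrong prediction was a false positive

precision, recall = precision_recall(correct, wrong, co_changes - correct)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under these assumptions both values come out at about 0.67 for this
pair of files.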


- Do these results surprise you? Can you think of any explanation for
the results?

- Do you think that our rate of prediction is good enough to be used
for building tool support for the software community?

- Do you have any suggestions on what can be done to improve the change
recommendation?

You can visit our webpage to inspect the results in detail:
http://flosscoach.com/index.php/17-cochanges/70-hadoop

All the best,
Igor Wiese

PhD Candidate

-- 
=================================
Igor Scaliante Wiese
PhD Candidate - Computer Science @ IME/USP
Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná
