For a new file, the useful history isn't that of the file itself, but that of the directory/package it would be in.
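To make the idea concrete, here is a rough sketch in Python of what I mean, not a description of how the existing models work. It counts how often two directories change in the same commit, then looks up the directory a new file would land in. The commit list, the file paths, the function names and the `min_support` threshold are all made up for illustration; real input would come from the repository history.

```python
from collections import Counter
from itertools import combinations
from pathlib import PurePosixPath


def package_of(path):
    """Return the directory (package) that contains the file."""
    return str(PurePosixPath(path).parent)


def package_cochanges(commits):
    """Count how often each pair of packages changed in the same commit."""
    counts = Counter()
    for files in commits:
        # Deduplicate and sort so each pair has one canonical ordering.
        packages = sorted({package_of(f) for f in files})
        counts.update(combinations(packages, 2))
    return counts


def likely_cochanged_packages(new_file, counts, min_support=2):
    """List packages that co-changed with the new file's package
    at least min_support times, most frequent first."""
    pkg = package_of(new_file)
    related = Counter()
    for (a, b), n in counts.items():
        if n < min_support:
            continue
        if a == pkg:
            related[b] = n
        elif b == pkg:
            related[a] = n
    return related.most_common()


# Hypothetical history: each commit is the list of files it touched.
commits = [
    ["api/command/CreateFooCmd.java", "client/conf/commands.properties"],
    ["api/command/DeleteFooCmd.java", "client/conf/commands.properties"],
    ["engine/schema/FooVO.java", "setup/db/schema-upgrade.sql"],
]

counts = package_cochanges(commits)
# A brand-new file has no commits of its own, but its package does.
print(likely_cochanged_packages("api/command/ListFooCmd.java", counts))
```

With this toy history, the new command file inherits the link between its package and the one holding commands.properties, even though the file itself has never been committed.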
Cheers,

On Thu, Dec 10, 2015 at 3:01 PM, Igor Wiese <igor.wi...@gmail.com> wrote:

> Hi Patrick
>
> The problem with new files is the absence of history to build the
> prediction models. I need at least some commits (10 commits, for example).
> Yes, the link between files is what we are predicting. We can predict
> changes involving commands.properties, XML files in general, .txt files, or
> any source code extension :-)
>
> Thanks for the feedback.
>
>
> 2015-12-10 17:40 GMT-02:00 Patrick Dube <patrickdub...@gmail.com>:
>
> > Are you handling new files as well, or the links between sets of files
> > (or packages)? As an example, if a user creates a new API cmd, then he
> > will update the "commands.properties" file. Another example, if a VO
> > file is updated, then there will be a db migration file added as well.
> > Cool work,
> >
> > On Thu, Dec 10, 2015 at 9:21 AM, Igor Wiese <igor.wi...@gmail.com> wrote:
> >
> > > Hi Sebastien.
> > >
> > > We used only 141 commits because we needed data from the issues. As my
> > > assumption is related to the contextual information from issues and
> > > social aspects, we need to aggregate commits and issues.
> > >
> > > First, I collected the issues from JIRA and then I tried to aggregate
> > > the commits that explicitly mentioned a collected issue. I also used
> > > only closed issues, to be confident that the code used to build my
> > > models had been merged and checked by the community.
> > >
> > > That is the weak point of my approach. I need the past data from the
> > > issues. Sometimes it is not available for past periods.
> > > It is also in my plan to use data from GitHub to make the dataset more
> > > complete.
> > >
> > > All the best,
> > >
> > > 2015-12-10 11:22 GMT-02:00 sebgoa <run...@gmail.com>:
> > >
> > > >
> > > > On Dec 10, 2015, at 12:31 AM, Igor Wiese <igor.wi...@gmail.com> wrote:
> > > >
> > > > > Hi, CloudStack Community.
> > > > >
> > > > > My name is Igor Wiese, PhD student from Brazil. In my research, I am
> > > > > investigating two important questions: What makes two files change
> > > > > together? Can we predict when they are going to co-change again?
> > > > >
> > > > > I've tried to investigate this question on the CloudStack project.
> > > > > I've collected data from issue reports, discussions and commits and
> > > > > used some machine learning techniques to build a prediction model.
> > > > >
> > > > > I collected a total of 141 commits in which a pair of files changed
> > > > > together and could correctly predict 60% of the commits.
> > > >
> > > >
> > > > Hi Igor, why 141 commits? Is that the only commits you found with
> > > > only a pair for changes?
> > > >
> > > > My gut feeling is that you could check the entire history of the
> > > > CloudStack repo (~5 years worth of data) and work on different types
> > > > of tuples.
> > > >
> > > > 141 commits seems like a really small dataset.
> > > >
> > > > -Sebastien
> > > >
> > > > > These were the most
> > > > > useful pieces of information for predicting co-changes of files:
> > > > >
> > > > > - sum of number of lines of code added, modified and removed,
> > > > >
> > > > > - number of words used to describe and discuss the issues,
> > > > >
> > > > > - number of comments in each issue,
> > > > >
> > > > > - median value of closeness, a social network measure obtained from
> > > > > issue comments, and
> > > > >
> > > > > - median value of constraint, a social network measure obtained
> > > > > from issue comments.
> > > > >
> > > > > To illustrate, consider the following example from our analysis. For
> > > > > release 4.4, the files "cloud/hypervisor/XenServerGuru.java" and
> > > > > "cloud/hypervisor/guru/VMwareGuru.java" changed together in 3
> > > > > commits. In another 2 commits, only the first file changed, but not
> > > > > the second. Collecting contextual information for each commit made
> > > > > to the first file in the previous release (4.3), we were able to
> > > > > predict all 3 commits in which both files changed together in
> > > > > release 4.4, and we issued 0 false positives. For this pair of
> > > > > files, the most important contextual information was the number of
> > > > > lines of code added, removed and modified in each commit, the number
> > > > > of comments in each issue, and social network measures (closeness,
> > > > > density, constraint, hierarchy) obtained from issue comments.
> > > > >
> > > > > - Do these results surprise you? Can you think of any explanation
> > > > > for the results?
> > > > >
> > > > > - Do you think that our rate of prediction is good enough to be used
> > > > > for building tool support for the software community?
> > > > >
> > > > > - Do you have any suggestion on what can be done to improve the
> > > > > change recommendation?
> > > > >
> > > > > You can visit our webpage to inspect the results in detail:
> > > > > http://flosscoach.com/index.php/17-cochanges/67-cloudstack
> > > > >
> > > > > All the best,
> > > > > Igor Wiese
> > > > > PhD Candidate
> > > >
> > > >
> > >
> > >
> > > --
> > > =================================
> > > Igor Scaliante Wiese
> > > PhD Candidate - Computer Science @ IME/USP
> > > Faculty in Dept. of Computing at Universidade Tecnológica Federal do
> > > Paraná
> >
>
>
> --
> =================================
> Igor Scaliante Wiese
> PhD Candidate - Computer Science @ IME/USP
> Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná
>