From: <xm...@cs.wisc.edu>
I have been developing a git tool (based on the git internal API) that
finds all the commits that have changed a line, in order to get more
accurate authorship.

The motivation is my binary code authorship research, where I use
machine learning to classify code authorship. To produce training data,
I start with a source code repository that has well-known author labels
for each line and then compile the project into a binary. That way I
know the authorship of the binary code and can apply machine learning
techniques to it.

To get the ground truth of authorship for each line, I started with
git-blame. But I later found that this is not sufficient, because the
last commit may only add comments or change a small part of the line,
in which case I shouldn't attribute the line of code to the last author.

I would suggest there are several kinds of change:
- White space adjustment
- Comment or documentation (assumes you can parse the 'code' to decide that it isn't executable code)
- Word changes within expressions
- Complete replacement of line (whole statement?)

Custom & practice is the likely decider of which of these counts.
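
The four categories above can be sketched as a rough per-line
classifier. This is only an illustration, not anything git provides:
the function names, the "//" comment marker and the 0.5 similarity
threshold are all assumptions, and the comment check is a naive string
split rather than a real parse of the code.

```python
import difflib
import re


def tokens(line):
    """Split a line into words and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", line)


def strip_comment(line, marker="//"):
    """Drop a trailing line comment (naive: ignores markers inside strings)."""
    return line.split(marker, 1)[0]


def classify_change(old, new, marker="//", threshold=0.5):
    """Return a rough label for how `new` differs from `old`.

    `marker` and `threshold` are hypothetical knobs for this sketch.
    """
    if old == new:
        return "unchanged"
    # White space adjustment: identical once all whitespace is removed.
    if "".join(old.split()) == "".join(new.split()):
        return "whitespace"
    # Comment change: the part before the comment marker is the same.
    if strip_comment(old, marker).split() == strip_comment(new, marker).split():
        return "comment"
    # Token-level similarity separates a small edit from a replacement.
    ratio = difflib.SequenceMatcher(None, tokens(old), tokens(new)).ratio()
    return "word-change" if ratio >= threshold else "replacement"
```

Where the threshold sits is exactly the "custom & practice" question;
the sketch only makes the categories concrete.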

Of course, there is bound to be some debate about who should count as
the representative author of a line of code. So what I would like to do
is find all the commits that have ever changed a line; then I can try
different approaches to summarizing over these commits to produce my
final authorship label (or even a tuple).
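
For comparison, stock git can already trace a line range through
history with "git log -L <start>,<end>:<file>". The sketch below wraps
that command and then ranks authors by how many of those commits they
made. It is a minimal sketch and not the tool described above: the
helper names, the "COMMIT" prefix in the format string and the simple
counting scheme are all assumptions made for illustration.

```python
import subprocess
from collections import Counter


def parse_log(output):
    """Extract (hash, author) pairs from lines shaped 'COMMIT<TAB>hash<TAB>author'."""
    commits = []
    for line in output.splitlines():
        # The prefix skips any patch text that older git versions still print.
        if line.startswith("COMMIT\t"):
            _, sha, author = line.split("\t", 2)
            commits.append((sha, author))
    return commits


def commits_touching_line(path, line_no, repo="."):
    """Run `git log -L` for one line of `path` and return its commits."""
    cmd = [
        "git", "-C", repo, "log", "--no-patch",
        "--format=COMMIT%x09%H%x09%an",
        f"-L{line_no},{line_no}:{path}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_log(result.stdout)


def authorship_tuple(commits):
    """Rank authors by the number of commits that touched the line."""
    return Counter(author for _, author in commits).most_common()
```

Counting commits is only one possible summary; weighting each commit by
how much of the line it changed would be another.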

I was wondering whether there have been similar debates over accurate
authorship in this community before and whether there may be other people
interested in this work.

I'd suggest looking at the various 'diff' formats, such as character diff, word diff, and line diff for discussions.
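
To make the three granularities concrete, here is a small sketch using
Python's difflib rather than git itself (git's own word-level view is
"git diff --word-diff"). The function names and the tokenizing regex
are assumptions for illustration. For each granularity it reports how
many units changed out of how many there are.

```python
import difflib
import re

# How to split text into units at each granularity (hypothetical choices).
GRANULARITIES = {
    "line": lambda s: s.splitlines(),
    "word": lambda s: re.findall(r"\w+|[^\w\s]", s),
    "char": list,
}


def changed_units(old, new, split):
    """Return (changed, total) unit counts between `old` and `new`."""
    a, b = split(old), split(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count the larger side of each replaced/inserted/deleted run.
            changed += max(i2 - i1, j2 - j1)
    return changed, max(len(a), len(b))


def diff_summary(old, new):
    """Compare two versions of a line at line, word and character level."""
    return {name: changed_units(old, new, split)
            for name, split in GRANULARITIES.items()}
```

For "total = a + b;" changed to "total = a + c;", the line diff sees
the whole line as changed, while the word and character diffs see one
unit out of 6 and 14 respectively.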


Thanks

--Xiaozhu

Philip
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html