Re: Making empirical data code available

2012-02-21 Thread Jan Erik Moström
There was an attempt at doing something like this for the CSEd
community a few years back: http://www8.cs.umu.se/~dcer/index.html

There are also a few papers about this, for example
http://doi.acm.org/10.1145/1404520.1404534
and if I'm correct there were also a few actual research papers
produced to test things out (no refs though).

- jem

-- 
The Open University is incorporated by Royal Charter (RC 000391), an exempt 
charity in England & Wales and a charity registered in Scotland (SC 038302).



Re: Making empirical data code available

2012-02-18 Thread Derek M Jones

All,

I prefer to think that somebody who knows more about statistics than
me will find something significant that I missed:

"Willingness to Share Research Data Is Related to the Strength of the 
Evidence and the Quality of Reporting of Statistical Results"

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0026828

--
Derek M. Jones  tel: +44 (0) 1252 520 667
Knowledge Software Ltd  blog:shape-of-code.coding-guidelines.com
Source code analysis  http://www.knosof.co.uk




RE: Making empirical data code available

2012-02-17 Thread Lindsay Marshall
>> - permanently accessible URLs or other references (e.g. DOIs). For this
>
>This is certainly the ideal.  Let's not fall into the trap of not
>doing anything until the ideal system is in place.

I was just talking to my friendly local DOI guru and this is definitely 
possible now using datacite and UK repositories, though some negotiation with 
them may be necessary.

L.





Re: Making empirical data code available

2012-02-17 Thread Derek M Jones

Dan,

> IMO the most important things are to have:
>
> - permanently accessible URLs or other references (e.g. DOIs). For this

This is certainly the ideal.  Let's not fall into the trap of not
doing anything until the ideal system is in place.

Data is available now and some of it will probably be lost
before such a system is in place.  It is better to start
encouraging people to archive their data now than let them off the
hook of not having to do something until something in the future
happens.

> - clear licensing that allows sharing (ideally open data such as CC, and
> open source code such as GPL or BSD).
>
> The advantage of open licensing is that if github or archive.org goes

github (just one suggestion that is popular and up and running)
allows the licensing to be clearly specified by the owners of
the data it holds.

> bust, long after I have moved on to other interests, other people can
> re-host my data and code. I don't see any particularly compelling reason
> to gather things into one archive, though it does seem to help in a
> community-building kind of sense.
>
> Dan


--
Derek M. Jones  tel: +44 (0) 1252 520 667
Knowledge Software Ltd  blog:shape-of-code.coding-guidelines.com
Source code analysis  http://www.knosof.co.uk


Re: Making empirical data code available

2012-02-17 Thread Dan Stowell

On 17/02/2012 14:45, Lindsay Marshall wrote:

Ooops - half sent message. Let's just say I am saying nothing at the moment.


OK. I'd be interested to know what you say when you do say something :)

Dan

--
Dan Stowell
Postdoctoral Research Assistant
Centre for Digital Music
Queen Mary, University of London
Mile End Road, London E1 4NS
http://www.elec.qmul.ac.uk/digitalmusic/people/dans.htm
http://www.mcld.co.uk/




RE: Making empirical data code available

2012-02-17 Thread Lindsay Marshall
Ooops - half sent message. Let's just say I am saying nothing at the moment.

L.






RE: Making empirical data code available

2012-02-17 Thread Lindsay Marshall
> I don't know of
>a public service that attaches DOIs to arbitrary datasets (shame), but I
>use archive.org for publishing datasets (e.g.
>) - it is a US
>library-oriented service whose explicit mission is to preserve digital
>data for a very long time.

Well





Re: Making empirical data+code available

2012-02-17 Thread Derek M Jones

Richard,


> While playing with the data, I was struck by two prominent
> lines I kept seeing:
> table(loc_written)
> 0 1 2 3 4 5 6 7
> 2 3 3 8 1 7 3 3
>    ^   ^
>
> I don't suppose it has any significance at all for your results,
> but I wonder why the loc_written data were so clumpy.


Returning to your original question.

I suspect that people are ticking boxes that sound about right rather
than calculating the numbers.  The context of the experiment does
not allow much time for reflection and calculation.  These LOC and
experience questions are part of the introduction that occurs when I
am introducing the experiment and waiting for everybody to turn up.

Having people discuss the issue beforehand and then giving them time
to try and calculate reliable answers might produce more consistent
answers.

On a related note.  I wonder if the ratio of lines of code written
divided by lines of code in the final program is a reliable measure
of experience (at least during the first few years).  Beginners
do seem to write and throw away much more code than more experienced
people.
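
A minimal sketch of the ratio suggested above, written in Python. The
function name and the sample figures are hypothetical; the thread gives
no data for this measure, so this only shows the arithmetic.

```python
def written_to_final_ratio(loc_written, loc_final):
    """Lines written (including discarded ones) per line in the final program."""
    if loc_final <= 0:
        raise ValueError("loc_final must be positive")
    return loc_written / loc_final


# Hypothetical figures, purely for illustration (not experimental data).
beginner = written_to_final_ratio(loc_written=900, loc_final=300)
veteran = written_to_final_ratio(loc_written=360, loc_final=300)
print(beginner, veteran)
```

A ratio close to 1 would mean little code was thrown away. Whether the
ratio actually tracks experience is the open question raised above.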

--
Derek M. Jones  tel: +44 (0) 1252 520 667
Knowledge Software Ltd  blog:shape-of-code.coding-guidelines.com
Source code analysis  http://www.knosof.co.uk




Re: Making empirical data code available

2012-02-17 Thread Dan Stowell

Hi all,

IMO the most important things are to have:

 - permanently accessible URLs or other references (e.g. DOIs). For 
this it helps to choose a website or archive that you believe will last 
a long time. Github is great for software sharing but I don't see any 
reason to expect their URL scheme to be fixed in stone. I don't know of 
a public service that attaches DOIs to arbitrary datasets (shame), but I 
use archive.org for publishing datasets (e.g. 
) - it is a US 
library-oriented service whose explicit mission is to preserve digital 
data for a very long time.


 - clear licensing that allows sharing (ideally open data such as CC, 
and open source code such as GPL or BSD).


The advantage of open licensing is that if github or archive.org goes 
bust, long after I have moved on to other interests, other people can 
re-host my data and code. I don't see any particularly compelling reason 
to gather things into one archive, though it does seem to help in a 
community-building kind of sense.


Dan


--
Dan Stowell
Postdoctoral Research Assistant
Centre for Digital Music
Queen Mary, University of London
Mile End Road, London E1 4NS
http://www.elec.qmul.ac.uk/digitalmusic/people/dans.htm
http://www.mcld.co.uk/


Re: Making empirical data code available

2012-02-17 Thread Derek M Jones

Neil,

> There are some efforts underway to do this. I'm familiar with
> http://datacite.org/ and http://figshare.com. A couple of SE groups have
> started data and model problem repositories, such as http://promisedata.org.

Thanks for the links.  figshare looks interesting.

> The challenge is getting everyone on board. For now, I don't see a compelling
> reason to use these places.

People could just as easily use git-hub, https://github.com/
which is used by a lot of researchers to make their code freely
available (git-hub make their money from people paying for hosting
of privately available code).

Your paper "Automated topic naming to support cross-project analysis
of software maintenance activities" is in my pile of interesting ones
to read in more detail.  You can read about my own interest in naming
in www.knosof.co.uk/cbook/sent792.pdf

> I suspect it won't happen until journals and conferences begin to insist on it.
> There is a reason why retraction rates are so low in CS and SE: no way to
> reproduce results to confirm.
>
> Cameron Neylon is a good point man on the issues around Science 2.0 and open
> access (http://cameronneylon.net/)
>
> Neil Ernst
> http://neilernst.net

--
Derek M. Jones  tel: +44 (0) 1252 520 667
Knowledge Software Ltd  blog:shape-of-code.coding-guidelines.com
Source code analysis  http://www.knosof.co.uk


Re: Making empirical data+code available

2012-02-16 Thread Derek M Jones

Richard,


> There's the corresp() function in library(MASS)
> and Fionn Murtagh's code to go with his correspondence analysis
> book is available over the web.

This is very common practice with R books.

> While playing with the data, I was struck by two prominent
> lines I kept seeing:
> table(loc_written)
> 0 1 2 3 4 5 6 7
> 2 3 3 8 1 7 3 3
>    ^   ^
>
> I don't suppose it has any significance at all for your results,
> but I wonder why the loc_written data were so clumpy.


That 8 caught my eye; it should be 7 (a typo).
I checked the other numbers and they are correct.

What this is saying is that developers don't have a clue how many lines
of code they have read/written (see extract of question below).
In places they are not even consistent and there is a poor correlation
with experience (0s indicate no answer given, which should really be
NA).

---
How many lines of code would you estimate you have \fBwritten\fR in
different languages over your career:
.RS
.IP i)
50,000
.IP ii)
75,000
.IP iii)
100,000
.IP iv)
150,000
.IP v)
200,000
.IP vi)
275,000
.IP vii)
350,000+
.RE
.IP b)
How many lines of code would you estimate you have \fBread\fR in
different languages over your career:
.RS
.IP i)
75,000
.IP ii)
100,000
.IP iii)
150,000
.IP iv)
200,000
.IP v)
300,000
.IP vi)
500,000
.IP vii)
800,000+
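
The point about 0 meaning "no answer" can be sketched as follows. This
is a Python stand-in for the R table() call quoted earlier, not the
analysis code used for the experiment. The band labels are copied from
the "written" question in the extract above; the function name and the
sample responses are hypothetical.

```python
from collections import Counter

# Option codes for the "written" question, taken from the extract above.
# Code 0 is not an option: it marks a question left unanswered.
WRITTEN_BANDS = {
    1: "50,000",
    2: "75,000",
    3: "100,000",
    4: "150,000",
    5: "200,000",
    6: "275,000",
    7: "350,000+",
}


def tabulate_loc(codes):
    """Return (count per band code, number of missing answers)."""
    answered = [c for c in codes if c != 0]  # treat 0 as NA, not as a band
    missing = len(codes) - len(answered)
    tally = Counter(answered)
    return {code: tally.get(code, 0) for code in WRITTEN_BANDS}, missing


# Hypothetical responses, NOT the experiment's data.
example_codes = [0, 3, 5, 3, 7, 0, 1, 5]
counts, missing = tabulate_loc(example_codes)
for code, n in counts.items():
    print(f"{WRITTEN_BANDS[code]:>8}: {n}")
print(f"no answer: {missing}")
```

Keeping the unanswered responses out of the tally stops them from
looking like an eighth band below 50,000.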





--
Derek M. Jones  tel: +44 (0) 1252 520 667
Knowledge Software Ltd  blog:shape-of-code.coding-guidelines.com
Source code analysis  http://www.knosof.co.uk




Re: Making empirical data+code available

2012-02-16 Thread Richard O'Keefe

On 17/02/2012, at 2:53 AM, Derek M Jones wrote:
> You can find mine here (only the 2011 experiment has all the code
> needed to perform the analysis; I'm working on fixing that):
> http://www.knosof.co.uk/dev-experiment.html

This is a wonderful thing you have done.
I note that these days, when I see a lot of subjects (well, 30)
with a bunch of discrete attributes, correspondence analysis is
one of the things I reach for to get some insight.
There's the corresp() function in library(MASS)
and Fionn Murtagh's code to go with his correspondence analysis
book is available over the web.

While playing with the data, I was struck by two prominent
lines I kept seeing:
table(loc_written)
0 1 2 3 4 5 6 7 
2 3 3 8 1 7 3 3 
  ^   ^

I don't suppose it has any significance at all for your results,
but I wonder why the loc_written data were so clumpy.





Re: Making empirical data code available

2012-02-16 Thread Derek M Jones

Lindsay,

A couple of researchers I have contacted to obtain data
told me that they have either lost it or did not make an
effort to keep it.

Having someplace that people could automatically upload their
data to might help preserve more of it, as well as making
life easier for others by cutting down on search time.


> A while back I was asked to prepare an area on the PPIG website where people
> could upload data for public consumption (surrounded by appropriate caveats of
> course). The data I was preparing for didn't ever turn up so the area remains
> hidden, but I can certainly expose this in some way if people wish to use it.


--
Derek M. Jones  tel: +44 (0) 1252 520 667
Knowledge Software Ltd  blog:shape-of-code.coding-guidelines.com
Source code analysis  http://www.knosof.co.uk




RE: Making empirical data code available

2012-02-16 Thread Lindsay Marshall
A while back I was asked to prepare an area on the PPIG website where people 
could upload data for public consumption (surrounded by appropriate caveats of 
course). The data I was preparing for didn't ever turn up so the area remains 
hidden, but I can certainly expose this in some way if people wish to use it.

L.






Making empirical data+code available

2012-02-16 Thread Derek M Jones

All,

Continuing on the theme of empirical research.

There is a growing trend for researchers to make their
experimental data available.

Promise is probably one of the better-known sites:
http://promisedata.org/

What is also needed is the code used to analyze the data.
I have been having a hard time reproducing the numbers
reported in some papers from the data that has been made
available.

You can find mine here (only the 2011 experiment has all the code
needed to perform the analysis; I'm working on fixing that):
http://www.knosof.co.uk/dev-experiment.html

I hope list members will reply with where their own data can be
downloaded.

--
Derek M. Jones  tel: +44 (0) 1252 520 667
Knowledge Software Ltd  blog:shape-of-code.coding-guidelines.com
Source code analysis  http://www.knosof.co.uk
