Re: Making empirical data+code available

2012-02-21 Thread Jan Erik Moström
There was an attempt at doing something like this for the CSEd
community a few years back: http://www8.cs.umu.se/~dcer/index.html

There are also a few papers about this, for example
http://doi.acm.org/10.1145/1404520.1404534
and if I'm correct there were also a few actual research papers
produced to test things out (no refs though).

- jem

-- 
The Open University is incorporated by Royal Charter (RC 000391), an exempt
charity in England & Wales and a charity registered in Scotland (SC 038302).



Re: Making empirical data+code available

2012-02-18 Thread Derek M Jones

All,

I prefer to think that somebody who knows more about statistics than
me will find something significant that I missed:

Willingness to Share Research Data Is Related to the Strength of the 
Evidence and the Quality of Reporting of Statistical Results

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0026828

--
Derek M. Jones  tel: +44 (0) 1252 520 667
Knowledge Software Ltd  blog:shape-of-code.coding-guidelines.com
Source code analysis    http://www.knosof.co.uk




Re: Making empirical data+code available

2012-02-17 Thread Dan Stowell

Hi all,

IMO the most important things are to have:

 - permanently accessible URLs or other references (e.g. DOIs). For 
this it helps to choose a website or archive that you believe will last 
a long time. Github is great for software sharing but I don't see any 
reason to expect their URL scheme to be fixed in stone. I don't know of 
a public service that attaches DOIs to arbitrary datasets (shame), but I 
use archive.org for publishing datasets (e.g. 
http://www.archive.org/details/beatboxset1) - it is a US 
library-oriented service whose explicit mission is to preserve digital 
data for a very long time.


 - clear licensing that allows sharing (ideally open data such as CC, 
and open source code such as GPL or BSD).


The advantage of open licensing is that if github or archive.org goes 
bust, long after I have moved on to other interests, other people can 
re-host my data and code. I don't see any particularly compelling reason 
to gather things into one archive, though it does seem to help in a 
community-building kind of sense.


Dan


On 17/02/2012 13:53, Derek M Jones wrote:

Neil,


There are some efforts underway to do this. I'm familiar with
http://datacite.org/ and http://figshare.com. A couple of SE groups
have started data and model problem repositories, such as
http://promisedata.org.


Thanks for the links. figshare looks interesting.


The challenge is getting everyone on board. For now, I don't see a
compelling reason to use these places.


People could just as easily use GitHub, https://github.com/,
which is used by a lot of researchers to make their code freely
available (GitHub makes its money from people paying for hosting
of privately available code).

Your paper "Automated topic naming to support cross-project analysis
of software maintenance activities" is in my pile of interesting ones
to read in more detail. You can read about my own interest in naming
in www.knosof.co.uk/cbook/sent792.pdf


I suspect it won't happen until journals and conferences begin to
insist on it. There is a reason why retraction rates are so low in CS
and SE: no way to reproduce results to confirm.

Cameron Neylon is a good point man on the issues around Science 2.0
and open access (http://cameronneylon.net/)


Neil Ernst
http://neilernst.net

On 2012-02-16, at 7:15, Derek M Jones wrote:


Lindsay,

A couple of researchers I have contacted to obtain data
told me that they have either lost it or did not make an
effort to keep it.

Having someplace that people could automatically upload their
data to might help preserve more of it, as well as making
life easier for others by cutting down on search time.


A while back I was asked to prepare an area on the PPIG website
where people could upload data for public consumption (surrounded by
appropriate caveats of course). The data I was preparing for didn't
ever turn up so the area remains hidden, but I can certainly expose
this in some way if people wish to use it.


--
Derek M. Jones tel: +44 (0) 1252 520 667
Knowledge Software Ltd blog:shape-of-code.coding-guidelines.com
Source code analysis http://www.knosof.co.uk

--
Dan Stowell
Postdoctoral Research Assistant
Centre for Digital Music
Queen Mary, University of London
Mile End Road, London E1 4NS
http://www.elec.qmul.ac.uk/digitalmusic/people/dans.htm
http://www.mcld.co.uk/


RE: Making empirical data+code available

2012-02-17 Thread Lindsay Marshall
 I don't know of
a public service that attaches DOIs to arbitrary datasets (shame), but I
use archive.org for publishing datasets (e.g.
http://www.archive.org/details/beatboxset1) - it is a US
library-oriented service whose explicit mission is to preserve digital
data for a very long time.

Well





RE: Making empirical data+code available

2012-02-17 Thread Lindsay Marshall
Ooops - half sent message. Let's just say I am saying nothing at the moment.

L.






RE: Making empirical data+code available

2012-02-17 Thread Lindsay Marshall
 - permanently accessible URLs or other references (e.g. DOIs). For this

This is certainly the ideal.  Let's not fall into the trap of not
doing anything until the ideal system is in place.

I was just talking to my friendly local DOI guru and this is definitely 
possible now using datacite and UK repositories, though some negotiation with 
them may be necessary.

L.





Re: Making empirical data+code available

2012-02-16 Thread Derek M Jones

Lindsay,

A couple of researchers I have contacted to obtain data
told me that they have either lost it or did not make an
effort to keep it.

Having someplace that people could automatically upload their
data to might help preserve more of it, as well as making
life easier for others by cutting down on search time.


A while back I was asked to prepare an area on the PPIG website where people 
could upload data for public consumption (surrounded by appropriate caveats of 
course). The data I was preparing for didn't ever turn up so the area remains 
hidden, but I can certainly expose this in some way if people wish to use it.


--
Derek M. Jones  tel: +44 (0) 1252 520 667
Knowledge Software Ltd  blog:shape-of-code.coding-guidelines.com
Source code analysis    http://www.knosof.co.uk




Re: Making empirical data+code available

2012-02-16 Thread Richard O'Keefe

On 17/02/2012, at 2:53 AM, Derek M Jones wrote:
 You can find mine here (only the 2011 experiment has all the code
 needed to perform the analysis; I'm working on fixing that):
 http://www.knosof.co.uk/dev-experiment.html

This is a wonderful thing you have done.
I note that these days, when I see a lot of subjects (well, 30)
with a bunch of discrete attributes, correspondence analysis is
one of the things I reach for to get some insight.
There's the corresp() function in library(MASS)
and Fionn Murtagh's code to go with his correspondence analysis
book is available over the web.

While playing with the data, I was struck by two prominent
lines I kept seeing:
table(loc_written)
0 1 2 3 4 5 6 7
2 3 3 8 1 7 3 3
      ^   ^

I don't suppose it has any significance at all for your results,
but I wonder why the loc_written data were so clumpy.
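
For anyone who wants to repeat this check without R, here is a minimal Python sketch of the same tabulation. It is not the analysis code from the experiment: the loc_written list is rebuilt from the counts exactly as posted above, and the tabulate() and peaks() helpers and the 1.5 threshold are illustrative assumptions.

```python
from collections import Counter

# Counts exactly as posted in the table(loc_written) output above
# (answer code -> number of subjects). Not read from the real data file.
posted_counts = {0: 2, 1: 3, 2: 3, 3: 8, 4: 1, 5: 7, 6: 3, 7: 3}

# Expand the counts into one answer per subject, as a stand-in
# for the real loc_written column.
loc_written = [code for code, n in posted_counts.items() for _ in range(n)]


def tabulate(values):
    """Return (value, count) pairs sorted by value, similar to R's table()."""
    return sorted(Counter(values).items())


def peaks(table, factor=1.5):
    """Return values whose count exceeds `factor` times the mean count.

    The threshold is an arbitrary choice made for this illustration.
    """
    mean_count = sum(n for _, n in table) / len(table)
    return [value for value, n in table if n > factor * mean_count]


table = tabulate(loc_written)
print(table)         # the same counts as the table above
print(peaks(table))  # the answer codes that stand out
```

With the posted counts, the two codes that stand out are 3 and 5, which matches the clumpiness described above.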





Re: Making empirical data+code available

2012-02-16 Thread Derek M Jones

Richard,


There's the corresp() function in library(MASS)
and Fionn Murtagh's code to go with his correspondence analysis
book is available over the web.


This is very common practice with R books.


While playing with the data, I was struck by two prominent
lines I kept seeing:
table(loc_written)
0 1 2 3 4 5 6 7
2 3 3 8 1 7 3 3
      ^   ^

I don't suppose it has any significance at all for your results,
but I wonder why the loc_written data were so clumpy.


That 8 caught my eye, it should be 7 (a typo).
I checked the other numbers and they are correct.

What this is saying is that developers don't have a clue how many lines
of code they have read/written (see extract of question below).
In places they are not even consistent and there is a poor correlation
with experience (0s indicate no answer given, which should really be
NA).

---
How many lines of code would you estimate you have \fBwritten\fR in
different languages over your career:
.RS
.IP i)
50,000
.IP ii)
75,000
.IP iii)
100,000
.IP iv)
150,000
.IP v)
200,000
.IP vi)
275,000
.IP vii)
350,000+
.RE
.IP b)
How many lines of code would you estimate you have \fBread\fR in
different languages over your career:
.RS
.IP i)
75,000
.IP ii)
100,000
.IP iii)
150,000
.IP iv)
200,000
.IP v)
300,000
.IP vi)
500,000
.IP vii)
800,000+
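
The point above about 0 meaning "no answer" can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes answer codes 1 to 7 correspond to options i) to vii) of the "written" question in the extract, the names WRITTEN_OPTIONS, decode() and mean_ignoring_missing() are made up for this example, and the answers list is hypothetical rather than taken from the experiment.

```python
# Option values for the "written" question in the extract above.
# Assumption: answer codes 1..7 correspond to options i)..vii).
# The last option is open-ended ("350,000+"), so 350_000 is a lower bound.
WRITTEN_OPTIONS = {
    1: 50_000,
    2: 75_000,
    3: 100_000,
    4: 150_000,
    5: 200_000,
    6: 275_000,
    7: 350_000,
}


def decode(answer_code, options=WRITTEN_OPTIONS):
    """Map an answer code to its option value; 0 (no answer) becomes None."""
    if answer_code == 0:
        return None
    return options[answer_code]


def mean_ignoring_missing(values):
    """Mean of the values that are present; None if nothing was answered."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical answers, for illustration only (not from the experiment).
answers = [0, 3, 5, 3, 0, 7]
decoded = [decode(a) for a in answers]

print(decoded)
print(mean_ignoring_missing(decoded))
```

Treating the two zeros as missing gives a mean over four answers. Leaving them in as literal zeros would pull the mean down, which is why they should be recorded as NA.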





--
Derek M. Jones  tel: +44 (0) 1252 520 667
Knowledge Software Ltd  blog:shape-of-code.coding-guidelines.com
Source code analysis    http://www.knosof.co.uk
