Re: Making empirical data code available
There was an attempt at doing something like this for the CSEd community a few years back: http://www8.cs.umu.se/~dcer/index.html There are also a few papers about this, for example http://doi.acm.org/10.1145/1404520.1404534 and, if I'm correct, there were also a few actual research papers produced to test things out (no refs though).

- jem

-- The Open University is incorporated by Royal Charter (RC 000391), an exempt charity in England & Wales and a charity registered in Scotland (SC 038302).
Re: Making empirical data code available
All,

I prefer to think that somebody who knows more about statistics than me will find something significant that I missed:

Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0026828

--
Derek M. Jones  tel: +44 (0) 1252 520 667
Knowledge Software Ltd  blog: shape-of-code.coding-guidelines.com
Source code analysis: http://www.knosof.co.uk
Re: Making empirical data code available
Hi all,

IMO the most important things are to have:

- permanently accessible URLs or other references (e.g. DOIs). For this it helps to choose a website or archive that you believe will last a long time. GitHub is great for software sharing, but I don't see any reason to expect their URL scheme to be set in stone. I don't know of a public service that attaches DOIs to arbitrary datasets (a shame), but I use archive.org for publishing datasets (e.g. http://www.archive.org/details/beatboxset1) - it is a US library-oriented service whose explicit mission is to preserve digital data for a very long time.

- clear licensing that allows sharing (ideally open data such as CC, and open source code such as GPL or BSD). The advantage of open licensing is that if GitHub or archive.org goes bust, long after I have moved on to other interests, other people can re-host my data and code.

I don't see any particularly compelling reason to gather things into one archive, though it does seem to help in a community-building kind of sense.

Dan

On 17/02/2012 13:53, Derek M Jones wrote:
> Neil,
>
>> There are some efforts underway to do this. I'm familiar with http://datacite.org/ and http://figshare.com. A couple of SE groups have started data and model problem repositories, such as http://promisedata.org.
>
> Thanks for the links. figshare looks interesting. The challenge is getting everyone on board. For now, I don't see a compelling reason to use these places. People could just as easily use GitHub, https://github.com/ which is used by a lot of researchers to make their code freely available (GitHub make their money from people paying for hosting of privately available code).
>
> Your paper "Automated topic naming to support cross-project analysis of software maintenance activities" is in my pile of interesting ones to read in more detail. You can read about my own interest in naming in www.knosof.co.uk/cbook/sent792.pdf
>
>> I suspect it won't happen until journals and conferences begin to insist on it.
>> There is a reason why retraction rates are so low in CS and SE: no way to reproduce results to confirm. Cameron Neylon is a good point man on the issues around Science 2.0 and open access (http://cameronneylon.net/)
>>
>> Neil Ernst
>> http://neilernst.net
>>
>> On 2012-02-16, at 7:15, Derek M Jones wrote:
>>> Lindsay,
>>>
>>> A couple of researchers I have contacted to obtain data told me that they have either lost it or did not make an effort to keep it. Having someplace that people could automatically upload their data to might help preserve more of it, as well as making life easier for others by cutting down on search time.
>>>
>>> A while back I was asked to prepare an area on the PPIG website where people could upload data for public consumption (surrounded by appropriate caveats of course). The data I was preparing for didn't ever turn up so the area remains hidden, but I can certainly expose this in some way if people wish to use it.
>>>
>>> --
>>> Derek M. Jones  tel: +44 (0) 1252 520 667
>>> Knowledge Software Ltd  blog: shape-of-code.coding-guidelines.com
>>> Source code analysis: http://www.knosof.co.uk

--
Dan Stowell
Postdoctoral Research Assistant
Centre for Digital Music
Queen Mary, University of London
Mile End Road, London E1 4NS
http://www.elec.qmul.ac.uk/digitalmusic/people/dans.htm
http://www.mcld.co.uk/
RE: Making empirical data code available
> I don't know of a public service that attaches DOIs to arbitrary datasets (shame), but I use archive.org for publishing datasets (e.g. http://www.archive.org/details/beatboxset1) - it is a US library-oriented service whose explicit mission is to preserve digital data for a very long time.

Well
RE: Making empirical data code available
Oops - half-sent message. Let's just say I am saying nothing at the moment.

L.
RE: Making empirical data code available
> - permanently accessible URLs or other references (e.g. DOIs). For this

This is certainly the ideal. Let's not fall into the trap of not doing anything until the ideal system is in place. I was just talking to my friendly local DOI guru and this is definitely possible now using DataCite and UK repositories, though some negotiation with them may be necessary.

L.
Re: Making empirical data code available
Lindsay,

A couple of researchers I have contacted to obtain data told me that they have either lost it or did not make an effort to keep it. Having someplace that people could automatically upload their data to might help preserve more of it, as well as making life easier for others by cutting down on search time.

A while back I was asked to prepare an area on the PPIG website where people could upload data for public consumption (surrounded by appropriate caveats of course). The data I was preparing for didn't ever turn up so the area remains hidden, but I can certainly expose this in some way if people wish to use it.

--
Derek M. Jones  tel: +44 (0) 1252 520 667
Knowledge Software Ltd  blog: shape-of-code.coding-guidelines.com
Source code analysis: http://www.knosof.co.uk
Re: Making empirical data+code available
On 17/02/2012, at 2:53 AM, Derek M Jones wrote:
> You can find mine here (only the 2011 experiment has all the code needed to perform the analysis; I'm working on fixing that): http://www.knosof.co.uk/dev-experiment.html

This is a wonderful thing you have done. I note that these days, when I see a lot of subjects (well, 30) with a bunch of discrete attributes, correspondence analysis is one of the things I reach for to get some insight. There's the corresp() function in library(MASS), and Fionn Murtagh's code to go with his correspondence analysis book is available over the web.

While playing with the data, I was struck by two prominent lines I kept seeing:

table(loc_written)
0 1 2 3 4 5 6 7
2 3 3 8 1 7 3 3
      ^   ^

I don't suppose it has any significance at all for your results, but I wonder why the loc_written data were so clumpy.
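For readers unfamiliar with corresp(), here is a minimal sketch of the kind of thing described above: correspondence analysis of a small contingency table with MASS. The table values are invented for illustration and have nothing to do with Derek's experiment data.

```r
library(MASS)  # corresp() ships with the MASS package

# Hypothetical contingency table: experience bands crossed with
# discrete answer categories (all numbers made up).
tab <- matrix(c(5, 2, 1,
                3, 6, 2,
                1, 3, 7),
              nrow = 3, byrow = TRUE,
              dimnames = list(exp = c("low", "mid", "high"),
                              ans = c("a", "b", "c")))

ca <- corresp(tab, nf = 2)  # two-dimensional solution
print(ca$cor)               # canonical correlations per dimension
biplot(ca)                  # joint plot of row and column scores
```

The biplot places row and column categories in the same space, which is what makes the method handy for eyeballing structure in a subjects-by-attributes table.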
Re: Making empirical data+code available
Richard,

> There's the corresp() function in library(MASS) and Fionn Murtagh's code to go with his correspondence analysis book is available over the web.

This is very common practice with R books.

> While playing with the data, I was struck by two prominent lines I kept seeing:
>
> table(loc_written)
> 0 1 2 3 4 5 6 7
> 2 3 3 8 1 7 3 3
>       ^   ^
>
> I don't suppose it has any significance at all for your results, but I wonder why the loc_written data were so clumpy.

That 8 caught my eye, it should be 7 (a typo). I checked the other numbers and they are correct. What this is saying is that developers don't have a clue how many lines of code they have read/written (see the extract of the question below). In places they are not even consistent, and there is a poor correlation with experience (0s indicate no answer given, which should really be NA).

---
a) How many lines of code would you estimate you have *written* in different languages over your career:
   i) 50,000  ii) 75,000  iii) 100,000  iv) 150,000  v) 200,000  vi) 275,000  vii) 350,000+

b) How many lines of code would you estimate you have *read* in different languages over your career:
   i) 75,000  ii) 100,000  iii) 150,000  iv) 200,000  v) 300,000  vi) 500,000  vii) 800,000+

--
Derek M. Jones  tel: +44 (0) 1252 520 667
Knowledge Software Ltd  blog: shape-of-code.coding-guidelines.com
Source code analysis: http://www.knosof.co.uk
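The point about 0 meaning "no answer given, which should really be NA" can be sketched in a few lines of R. The vectors below are invented example values, not the experiment's data; only the variable name loc_written comes from the discussion, and years_exp is hypothetical.

```r
# Invented responses: category codes 1-7, with 0 meaning "no answer".
loc_written <- c(3, 5, 0, 7, 2, 5, 3, 1)
years_exp   <- c(4, 10, 2, 20, 3, 12, 6, 1)  # hypothetical experience values

# Recode the non-answers as NA so they don't distort the correlation.
loc_written[loc_written == 0] <- NA

# Rank correlation, silently dropping the NA pairs.
cor(loc_written, years_exp, use = "complete.obs", method = "spearman")
```

Leaving the 0s in would treat "no answer" as the lowest response category and pull any correlation with experience downwards, which is exactly the artefact the NA recoding avoids.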