On Tuesday 5 May 2015 at 21:26:36, Shane Curcuru wrote:
> On 5/5/15 7:33 AM, Boris Baldassari wrote:
> > Hi Folks,
> >
> > Sorry for the late answer on this thread. Don't know what has been done
> > since then, but I've some experience to share on this, so here are my 2c..
>
> No, more input is always appreciated! Hervé is doing some
> centralization of the projects-new.a.o data capture, which is related
> but slightly separate.

+1
This can give a common place to put code once experiments show that we
should add a new data source.

> But this is going to be a long-term project

+1

> with
> plenty of different people helping I bet.

I hope so...

>
> ...
>
> > * Parsing mboxes for software repository data mining:
> > There is a suite of tools exactly targeted at this kind of duty on
> > github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I
> > don't know how they manage time zones, but the toolsuite is widely used
> > around (see [3] or [4] as examples) so I believe they are quite robust.
> > It includes tools for data retrieval as well as visualisation.
>
> Drat. Metrics Grimoire looks pretty nifty - essentially a set of
> frameworks for extracting metadata from a bunch of sources - but it's
> GPL, so personally I have no interest in working on it. If someone else
> uses it to generate datasets that's great.
>
> > * As for the feedback/thoughts about the architecture and formats:
> > I love the REST-API idea proposed by Rob. That's really easy to access
> > and retrieve through scripts on-demand. CSV and JSON are my favourite
> > formats, because they are, again, easy to parse and widely used -- every
> > language and library has some facility to read them natively.
>
> Yup - again, like project visualization, to make any of this simple for
> newcomers to try stuff, we need to separate data gathering / model /
> visualization. Since most of these are spare time projects, having easy
> chunks makes it simpler for different people to try their hand at it.

For visualization, for sure, JSON is the current natural format when data
is consumed from the browser.
I don't have much experience with this, and what I'm missing with JSON
currently is a common practice for documenting a structure: are there
common practices? For a simple JSON structure, documentation is not
really necessary, but once the structure gets complex, documentation
becomes a key requirement for people to use or extend it.
And I already see this shortcoming with the 11 JSON files from
projects-new.a.o: https://projects-new.apache.org/json/foundation/

Regards,

Hervé

>
> Thanks,
>
> - Shane
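
PS: on the question of documenting a JSON structure, one low-effort starting point might be to generate an outline of a file (every path and the type found there) and then annotate that outline by hand. The sketch below is only an illustration under my own assumptions: `describe_structure` is a hypothetical helper, and the sample document is made up, so it does not reflect the actual layout of any projects-new.a.o file.

```python
import json


def describe_structure(value, path="$"):
    """Return a sorted list of (path, type name) pairs for a parsed JSON value.

    Array elements are summarized under "path[]", so all elements of a list
    contribute to the same entries instead of one entry per index.
    """
    found = set()

    def walk(node, where):
        if isinstance(node, dict):
            found.add((where, "object"))
            for key, child in node.items():
                walk(child, f"{where}.{key}")
        elif isinstance(node, list):
            found.add((where, "array"))
            for child in node:
                walk(child, f"{where}[]")
        elif isinstance(node, bool):
            # Checked before numbers because bool is a subclass of int in Python.
            found.add((where, "boolean"))
        elif isinstance(node, (int, float)):
            found.add((where, "number"))
        elif isinstance(node, str):
            found.add((where, "string"))
        elif node is None:
            found.add((where, "null"))

    walk(value, path)
    return sorted(found)


# Made-up sample document (an assumption for illustration only).
SAMPLE = json.loads("""
{"projects": [
    {"name": "example", "committers": 12, "retired": false},
    {"name": "other", "committers": 3, "retired": false, "pmc": null}
]}
""")

if __name__ == "__main__":
    for where, kind in describe_structure(SAMPLE):
        print(f"{where}: {kind}")
```

The outline only says what is in a file, not what each field means, so it would still need a sentence of explanation per path; a formal schema language such as JSON Schema may be a better fit once the structures stabilize.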