Re: Standards for mail archive statistics gathering?
I'd like to submit to this group that no batch job is necessary to compute many useful statistics - rather, with a suitable representation, an indexed event stream of mailing list messages, commits, releases, etc. can be searched, aggregated, and visualized in real time.

Hacked a bit this afternoon - parsed much of community-dev mbox history using mime4j into activity streams json, indexed in elasticsearch with kibana as UI. Visit the link below for an idea of what an indexed activity streams representation of a mailing list could look like.

http://72.182.111.65:5601/#/discover?_a=(columns:!(actor.displayName,published,content,summary),index:community-dev_activity,interval:auto,query:(query_string:(analyze_wildcard:!t,query:'*')),sort:!(published,asc))&_g=(time:(from:'2009-10-07T22:26:18.843Z',mode:absolute,to:'2015-05-07T22:26:18.843Z'))

Of course much more discussion and rigor would be required before something like this could become official: determining the appropriate structure/identifier/format/enumeration of each field, adding robust error handling, testing that no messages are lost in translation, resolving email addresses back to apache LDAP ids, etc. - but I wanted to show the potential of this approach and what can be developed with minimal net new code.

All code used to build this has been pushed to http://github.com/steveblackmon/streams-apache

Regards,
Steve Blackmon
sblack...@apache.org

On Wed, May 6, 2015 at 10:44 PM, Hervé BOUTEMY wrote:
> Le mercredi 6 mai 2015 12:48:34 Steve Blackmon a écrit :
>> > For visualization, for sure, json is the current natural format when data
>> > is consumed from the browser.
>> > I don't have great experience on this, and what I'm missing with json
>> > currently is a common practice on documenting a structure: are there
>> > common practices?
>>
>> In podling streams [0], we make extensive use of json schema [1]
> thank you: that's exactly the initial info I was looking for: json schema!
>> from which we generate POJOs with a maven plugin jsonschema2pojo [2]
>> which makes manipulating the objects in Java/Scala pleasant. I expect
>> other languages have similar jsonschema-based ORM paradigms as well.
> As a usual Java developer, your tooling is interesting.
> But in the projects-new.a.o case, the data extraction is coded in Python: if
> we create json schema, having Python classes generated could simplify coding.
> Anyone with Python+json schema experience around?
>
>> This pattern supports inheritance both within and across projects - for
>> example see how [3] extends [4] which extends [5]. These schemas are
>> relatively self documenting, but generating documentation or other
>> artifacts is straight-forward as they are themselves json documents.
> yeah, json schema document is easy to read (at least the examples on the
> site...)
>
>> > Because for simple json structure, documentation is not really necessary,
>> > but once the structure goes complex, documentation is really a key
>> > requirement for people to use or extend. And I already see this
>> > shortcoming with the 11 json files from projects-new.a.o =
>> > https://projects-new.apache.org/json/foundation/
>> Having used these json documents a few weeks ago to build an apache
>> community visualization [6]
> yeah, really nice visualization!
>
>> IMO the current crop of project-new jsons are intermediate artifacts
>> rather than a sufficiently cross-purpose data model, a role currently
>> held by DOAP, mbox, and misc others, all with some inherent shortcomings,
>> most notably lack of navigability between silos.
> +1
> I'm at a point where I start to really understand the concepts involved and
> want to code a simple data model: I'll report here once I have a first
> version available.
>> I'd like to nominate activity streams [7] with community-specific
>> extensions (such as those roughly prototyped here: [8]) as a potential
>> core data model for this effort going forward
> I had a first look at it: it is more complex than what I had in mind
> We'll have to share and see what's the best bet
>> and I'm happy to help apply some of the useful tools and connectors
>> within podling streams toward that end. Converting external structured
>> sources into normalized documents and indexing those activities to
>> power data-centric APIs and visualizations are wheelhouse use cases
>> for this project, as they say.
> Great, stay tuned: I'll probably work on it this week-end
>
> Regards,
>
> Hervé
>
>> [0] http://streams.incubator.apache.org/
>> [1] http://json-schema.org/documentation.html
>> [2] http://www.jsonschema2pojo.org/
>> [3] https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema/objectTypes/committee.json
>> [4] https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/objectTypes/group.json
>> [5] https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema
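The mbox-to-activity-streams pipeline described above can be approximated with only the Python standard library. This is a minimal sketch, not the mime4j-based code in the linked repo; the output field names simply mirror the Kibana columns in the URL above (actor.displayName, published, summary), and the real field structure is exactly the open question raised in the message.

```python
import json
import mailbox
from email.utils import parseaddr, parsedate_to_datetime

def mbox_to_activities(path):
    """Yield a minimal activity-streams-style dict per mbox message.

    A sketch only: robust error handling, LDAP id resolution, and the
    final field enumeration are the open questions noted above.
    """
    for msg in mailbox.mbox(path):
        name, addr = parseaddr(msg.get("From", ""))
        try:
            published = parsedate_to_datetime(msg.get("Date")).isoformat()
        except (TypeError, ValueError):
            published = None  # malformed or missing Date header
        yield {
            "verb": "post",
            "actor": {"displayName": name or addr},
            "published": published,
            "summary": msg.get("Subject", ""),
        }

# Usage sketch: one JSON document per line, a shape Elasticsearch's
# bulk API can ingest.
#   for activity in mbox_to_activities("community-dev.mbox"):
#       print(json.dumps(activity))
```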
Re: Standards for mail archive statistics gathering?
Betty James wrote:
> How do I get off this thread. don't know how I got on

Just follow the instructions others have already provided. I'm putting you in CC in case you have already unsubscribed from this discussion list.

> but I am just a totally ignorant individual using Open Office and trying
> to donate (which doesn't sound necessary anymore)

Actually, donations to OpenOffice and to the entire non-profit Apache Software Foundation are welcome! See http://www.openoffice.org/donations.html to discover the existing possibilities we have for donations.

Thank you for your intentions, and sorry if some misleading instructions led you to believe you needed to subscribe to this discussion list!

Regards,
Andrea.
Re: Standards for mail archive statistics gathering?
Le mercredi 6 mai 2015 12:48:34 Steve Blackmon a écrit :
> > For visualization, for sure, json is the current natural format when data
> > is consumed from the browser.
> > I don't have great experience on this, and what I'm missing with json
> > currently is a common practice on documenting a structure: are there
> > common practices?
>
> In podling streams [0], we make extensive use of json schema [1]
thank you: that's exactly the initial info I was looking for: json schema!
> from which we generate POJOs with a maven plugin jsonschema2pojo [2] which
> makes manipulating the objects in Java/Scala pleasant. I expect other
> languages have similar jsonschema-based ORM paradigms as well.
As a usual Java developer, your tooling is interesting.
But in the projects-new.a.o case, the data extraction is coded in Python: if
we create json schema, having Python classes generated could simplify coding.
Anyone with Python+json schema experience around?

> This pattern supports inheritance both within and across projects - for
> example see how [3] extends [4] which extends [5]. These schemas are
> relatively self documenting, but generating documentation or other
> artifacts is straight-forward as they are themselves json documents.
yeah, json schema document is easy to read (at least the examples on the
site...)

> > Because for simple json structure, documentation is not really necessary,
> > but once the structure goes complex, documentation is really a key
> > requirement for people to use or extend. And I already see this
> > shortcoming with the 11 json files from projects-new.a.o =
> > https://projects-new.apache.org/json/foundation/
> Having used these json documents a few weeks ago to build an apache
> community visualization [6]
yeah, really nice visualization!
> IMO the current crop of project-new jsons > are intermediate artifacts rather than a sufficiently cross-purpose > data model, a role currently held by DOAP mbox and misc others all > with some inherent shortcomings most notably lack of navigability > between silos. +1 I'm at a point where I start to really understand the concepts involved and want to code a simple data model: I'll report here once I have a first version available. > I'd like to nominate activity streams [7] with > community-specific extensions (such as those roughly prototyped here: > [8] ) as a potential core data model for this effort going forward I had a first look at it: it is more complex than what I had in mind We'll have to share and see what's the best bet > and > I'm happy to help apply some of the useful tools and connectors within > podling streams toward that end. Converting external structured > sources into normalized documents and indexing those activities to > power data-centric APIs and visualizations are wheelhouse use cases > for this project, as they say. 
Great, stay tuned: I'll probably work on it this week-end Regards, Hervé > > [0] http://streams.incubator.apache.org/ > [1] http://json-schema.org/documentation.html > [2] http://www.jsonschema2pojo.org/ > [3] > https://github.com/steveblackmon/streams-apache/blob/master/activities/src/ > main/jsonschema/objectTypes/committee.json [4] > https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/ma > in/jsonschema/objectTypes/group.json [5] > https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/ma > in/jsonschema/object.json [6] http://72.182.111.65:3000/workspace/3 > [7] http://activitystrea.ms/ > [8] > https://github.com/steveblackmon/streams-apache/blob/master/activities/src/ > main/jsonschema > > Steve Blackmon > sblack...@apache.org > > On Wed, May 6, 2015 at 2:05 AM, Hervé BOUTEMY wrote: > > Le mardi 5 mai 2015 21:26:36 Shane Curcuru a écrit : > >> On 5/5/15 7:33 AM, Boris Baldassari wrote: > >> > Hi Folks, > >> > > >> > Sorry for the late answer on this thread. Don't know what has been done > >> > since then, but I've some experience to share on this, so here are my > >> > 2c.. > >> > >> No, more input is always appreciated! Hervé is doing some > >> centralization of the projects-new.a.o data capture, which is related > >> but slightly separate. > > > > +1 > > this can give a common place to put code once experiments show that we > > should add a new data source > > > >> But this is going to be a long-term project > > > > +1 > > > >> with > >> plenty of different people helping I bet. > > > > I hope so... > > > >> ... > >> > >> > * Parsing mboxes for software repository data mining: > >> > There is a suite of tools exactly targeted at this kind of duty on > >> > github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I > >> > don't know how they manage time zones, but the toolsuite is widely used > >> > around (see [3] or [4] as examples) so I believe they are quite robust. 
> >> > It includes tools for data retrieval as well as visualisation.
> >> Drat. Metrics Grimoire looks pretty nifty - essentially a set of
> >> frameworks for extracting metadata from a bunch of sources - but it's
> >> GPL, so personally I have no interest in working on it.
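On the Python + json schema question above: the `jsonschema` package (pip install jsonschema) covers validation, and class generation in the spirit of jsonschema2pojo is handled by separate tools (e.g. python-jsonschema-objects), not shown here. A minimal validation sketch - the schema below is purely illustrative, not the actual projects-new.a.o structure:

```python
from jsonschema import ValidationError, validate

# Illustrative schema only -- field names are a guess, not the real
# projects-new.a.o layout.
committee_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Committee name"},
        "established": {"type": "string", "description": "YYYY-MM"},
    },
    "required": ["name"],
}

def is_valid_committee(doc):
    """True when doc conforms to the (hypothetical) committee schema."""
    try:
        validate(doc, committee_schema)
        return True
    except ValidationError:
        return False
```

With this, `is_valid_committee({"name": "Community Development"})` passes while a document missing `name` is rejected, and the `description` fields double as per-field documentation.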
Re: Standards for mail archive statistics gathering?
If you want to unsubscribe, please find instructions at http://apache.org/foundation/mailinglists.html And the name of this list is dev@community.apache.org Cheers Niclas On Thu, May 7, 2015 at 7:48 AM, Betty James wrote: > Oh my gosh. How do I get off this thread. don't know how I got on, but I > am just a totally ignorant individual using Open Office and trying to > donate (which doesn't sound necessary anymore)so unless you are in good > shape and in your 70's try to figure out how I can get off the list! > > Betty B. James > > On Tue, May 5, 2015 at 7:33 AM, Boris Baldassari < > castalia.laborat...@gmail.com> wrote: > > > Hi Folks, > > > > Sorry for the late answer on this thread. Don't know what has been done > > since then, but I've some experience to share on this, so here are my > 2c.. > > > > * Parsing dates and time zones: > > If you are to use Perl, the Date::Parse module handles dates and time > > zones pretty well. As for Python I don't know -- there probably is a > module > > for that too.. > > I used Date::Parse to parse ASF mboxes (notably for Ant and JMeter, the > > data sets have been published here [0]), and it worked great. I do have a > > Perl script to do that, which I can provide -- but I have no access I'm > > aware of in the dev scm, and not sure if Perl is the most common language > > here.. so please let me know. > > > > * Parsing mboxes for software repository data mining: > > There is a suite of tools exactly targeted at this kind of duty on > github: > > Metrics Grimoire [1], developed (and used) by Bitergia [2]. I don't know > > how they manage time zones, but the toolsuite is widely used around (see > > [3] or [4] as examples) so I believe they are quite robust. It includes > > tools for data retrieval as well as visualisation. > > > > * As for the feedback/thoughts about the architecture and formats: > > I love the REST-API idea proposed by Rob. That's really easy to access > and > > retrieve through scripts on-demand. 
CSV and JSON are my favourite > formats, > > because they are, again, easy to parse and widely used -- every language > > and library has some facility to read them natively. > > > > > > Cheers, > > > > > > [0] http://castalia.solutions/datasets/ > > [1] https://metricsgrimoire.github.io/ > > [2] http://bitergia.com > > [3] Eclipse Dashboard: http://dashboard.eclipse.org/ > > [4] OpenStack Dashboard: http://activity.openstack.org/dash/browser/ > > > > > > > > -- > > Boris Baldassari > > Castalia Solutions -- Elegant Software Engineering > > Web: http://castalia.solutions > > Phone: +33 6 48 03 82 89 > > > > > > Le 28/04/2015 16:11, Rich Bowen a écrit : > > > >> > >> > >> On 04/27/2015 09:36 AM, Shane Curcuru wrote: > >> > >>> I'm interested in working on some visualizations of mailing list > >>> activity over time, in particular some simple analyses, like thread > >>> length/participants and the like. Given that the raw data can all be > >>> precomputed from mbox archives, is there any semi-standard way to > >>> distill and save metadata about mboxes? > >>> > >>> If we had a generic static database of past mail metadata and > statistics > >>> (i.e. not details of contents, but perhaps overall # of lines of text > or > >>> something), it would be interesting to see what kinds of visualizations > >>> that different people would come up with. > >>> > >>> Anyone have pointers to either a data format or the best parsing > library > >>> for this? I'm trying to think ahead, and work on the parsing, storing > >>> statistics, and visualizations as separate pieces so it's easier for > >>> different people to collaborate on something. > >>> > >> > >> Roberto posted something to the list a month or so ago about the efforts > >> that he's been working on for this kind of thing. You might ping him. > >> > >> --Rich > >> > >> > >> > > > -- Niclas Hedhman, Software Developer http://zest.apache.org - New Energy for Java
Re: Standards for mail archive statistics gathering?
Oh my gosh. How do I get off this thread. don't know how I got on, but I am just a totally ignorant individual using Open Office and trying to donate (which doesn't sound necessary anymore)so unless you are in good shape and in your 70's try to figure out how I can get off the list! Betty B. James On Tue, May 5, 2015 at 7:33 AM, Boris Baldassari < castalia.laborat...@gmail.com> wrote: > Hi Folks, > > Sorry for the late answer on this thread. Don't know what has been done > since then, but I've some experience to share on this, so here are my 2c.. > > * Parsing dates and time zones: > If you are to use Perl, the Date::Parse module handles dates and time > zones pretty well. As for Python I don't know -- there probably is a module > for that too.. > I used Date::Parse to parse ASF mboxes (notably for Ant and JMeter, the > data sets have been published here [0]), and it worked great. I do have a > Perl script to do that, which I can provide -- but I have no access I'm > aware of in the dev scm, and not sure if Perl is the most common language > here.. so please let me know. > > * Parsing mboxes for software repository data mining: > There is a suite of tools exactly targeted at this kind of duty on github: > Metrics Grimoire [1], developed (and used) by Bitergia [2]. I don't know > how they manage time zones, but the toolsuite is widely used around (see > [3] or [4] as examples) so I believe they are quite robust. It includes > tools for data retrieval as well as visualisation. > > * As for the feedback/thoughts about the architecture and formats: > I love the REST-API idea proposed by Rob. That's really easy to access and > retrieve through scripts on-demand. CSV and JSON are my favourite formats, > because they are, again, easy to parse and widely used -- every language > and library has some facility to read them natively. 
> > > Cheers, > > > [0] http://castalia.solutions/datasets/ > [1] https://metricsgrimoire.github.io/ > [2] http://bitergia.com > [3] Eclipse Dashboard: http://dashboard.eclipse.org/ > [4] OpenStack Dashboard: http://activity.openstack.org/dash/browser/ > > > > -- > Boris Baldassari > Castalia Solutions -- Elegant Software Engineering > Web: http://castalia.solutions > Phone: +33 6 48 03 82 89 > > > Le 28/04/2015 16:11, Rich Bowen a écrit : > >> >> >> On 04/27/2015 09:36 AM, Shane Curcuru wrote: >> >>> I'm interested in working on some visualizations of mailing list >>> activity over time, in particular some simple analyses, like thread >>> length/participants and the like. Given that the raw data can all be >>> precomputed from mbox archives, is there any semi-standard way to >>> distill and save metadata about mboxes? >>> >>> If we had a generic static database of past mail metadata and statistics >>> (i.e. not details of contents, but perhaps overall # of lines of text or >>> something), it would be interesting to see what kinds of visualizations >>> that different people would come up with. >>> >>> Anyone have pointers to either a data format or the best parsing library >>> for this? I'm trying to think ahead, and work on the parsing, storing >>> statistics, and visualizations as separate pieces so it's easier for >>> different people to collaborate on something. >>> >> >> Roberto posted something to the list a month or so ago about the efforts >> that he's been working on for this kind of thing. You might ping him. >> >> --Rich >> >> >> >
Re: Standards for mail archive statistics gathering?
> For visualization, for sure, json is the current natural format when data is > consumed from the browser. > I don't have great experience on this, and what I'm missing with json > currently is a common practice on documenting a structure: are there common > practices? In podling streams [0], we make extensive use of json schema [1] from which we generate POJOs with a maven plugin jsonschema2pojo [2] which makes manipulating the objects in Java/Scala pleasant. I expect other languages have similar jsonschema-based ORM paradigms as well. This pattern supports inheritance both within and across projects - for example see how [3] extends [4] which extends [5]. These schemas are relatively self documenting, but generating documentation or other artifacts is straight-forward as they are themselves json documents. > Because for simple json structure, documentation is not really necessary, but > once the structure goes complex, documentation is really a key requirement for > people to use or extend. And I already see this shortcoming with the 11 json > files from projects-new.a.o = https://projects-new.apache.org/json/foundation/ Having used these json documents a few weeks ago to build an apache community visualization [6] IMO the current crop of project-new jsons are intermediate artifacts rather than a sufficiently cross-purpose data model, a role currently held by DOAP mbox and misc others all with some inherent shortcomings most notably lack of navigability between silos. I'd like to nominate activity streams [7] with community-specific extensions (such as those roughly prototyped here: [8] ) as a potential core data model for this effort going forward and I'm happy to help apply some of the useful tools and connectors within podling streams toward that end. Converting external structured sources into normalized documents and indexing those activities to power data-centric APIs and visualizations are wheelhouse use cases for this project, as they say. 
[0] http://streams.incubator.apache.org/ [1] http://json-schema.org/documentation.html [2] http://www.jsonschema2pojo.org/ [3] https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema/objectTypes/committee.json [4] https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/objectTypes/group.json [5] https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/object.json [6] http://72.182.111.65:3000/workspace/3 [7] http://activitystrea.ms/ [8] https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema Steve Blackmon sblack...@apache.org On Wed, May 6, 2015 at 2:05 AM, Hervé BOUTEMY wrote: > Le mardi 5 mai 2015 21:26:36 Shane Curcuru a écrit : >> On 5/5/15 7:33 AM, Boris Baldassari wrote: >> > Hi Folks, >> > >> > Sorry for the late answer on this thread. Don't know what has been done >> > since then, but I've some experience to share on this, so here are my 2c.. >> >> No, more input is always appreciated! Hervé is doing some >> centralization of the projects-new.a.o data capture, which is related >> but slightly separate. > +1 > this can give a common place to put code once experiments show that we should > add a new data source > >> But this is going to be a long-term project > +1 > >> with >> plenty of different people helping I bet. > I hope so... > >> >> ... >> >> > * Parsing mboxes for software repository data mining: >> > There is a suite of tools exactly targeted at this kind of duty on >> > github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I >> > don't know how they manage time zones, but the toolsuite is widely used >> > around (see [3] or [4] as examples) so I believe they are quite robust. >> > It includes tools for data retrieval as well as visualisation. >> >> Drat. 
Metrics Grimoire looks pretty nifty - essentially a set of >> frameworks for extracting metadata from a bunch of sources - but it's >> GPL, so personally I have no interest in working on it. If someone else >> uses it to generate datasets that's great. >> >> > * As for the feedback/thoughts about the architecture and formats: >> > I love the REST-API idea proposed by Rob. That's really easy to access >> > and retrieve through scripts on-demand. CSV and JSON are my favourite >> > formats, because they are, again, easy to parse and widely used -- every >> > language and library has some facility to read them natively. >> >> Yup - again, like project visualization, to make any of this simple for >> newcomers to try stuff, we need to separate data gathering / model / >> visualization. Since most of these are spare time projects, having easy >> chunks makes it simpler for different people to try their hand at it. > For visualization, for sure, json is the current natural format when data is > consumed from the browser. > I don't have great experience on this, and what I'm missing with json > currently is a common practice on documenting a structure: are there common > practices?
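The "[3] extends [4] which extends [5]" chain mentioned above uses json schema draft-3's `extends` keyword; in draft-4 and later the equivalent composition is `allOf`. A small sketch with the Python `jsonschema` package - the two schemas are simplified stand-ins, not the contents of the real group.json/committee.json files:

```python
from jsonschema import ValidationError, validate

# Stand-in for streams' group.json -- a simplified guess, not the real file.
group = {
    "type": "object",
    "properties": {"displayName": {"type": "string"}},
    "required": ["displayName"],
}

# committee extends group: draft-3 spelled this `extends`; draft-4+
# composes the same constraints with `allOf`.
committee = {
    "allOf": [
        group,
        {
            "properties": {"chair": {"type": "string"}},
            "required": ["chair"],
        },
    ]
}

def conforms(doc, schema):
    """True when doc satisfies every schema in the composition."""
    try:
        validate(doc, schema)
        return True
    except ValidationError:
        return False
```

A document must satisfy both the parent's and the child's constraints, which is exactly the cross-project inheritance the message describes.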
Re: Standards for mail archive statistics gathering?
Hi all,

Le 06/05/2015 03:26, Shane Curcuru a écrit :
> Drat. Metrics Grimoire looks pretty nifty - essentially a set of
> frameworks for extracting metadata from a bunch of sources - but it's
> GPL, so personally I have no interest in working on it. If someone else
> uses it to generate datasets that's great.

Argh. I had forgotten about the licensing incompatibility, my mistake. Well, as you point out, I guess the product can still be used without modifications, if needed.

I'm following this ml for a few months now, but I'm not sure how far this project is planned to go. Are the mailing lists the only artefact to be analysed, or do you intend to provide also configuration management or issue tracking data?

> Yup - again, like project visualization, to make any of this simple for
> newcomers to try stuff, we need to separate data gathering / model /
> visualization. Since most of these are spare time projects, having easy
> chunks makes it simpler for different people to try their hand at it.

+1 for the architectural separation of concerns, definitely. For analysability and maintainability, easy access and usage, and as a consequence for dissemination.

--
Boris Baldassari
Castalia Solutions -- Elegant Software Engineering
Web: http://castalia.solutions
Tel: +33 6 48 03 82 89
Re: Standards for mail archive statistics gathering?
Le mardi 5 mai 2015 21:26:36 Shane Curcuru a écrit : > On 5/5/15 7:33 AM, Boris Baldassari wrote: > > Hi Folks, > > > > Sorry for the late answer on this thread. Don't know what has been done > > since then, but I've some experience to share on this, so here are my 2c.. > > No, more input is always appreciated! Hervé is doing some > centralization of the projects-new.a.o data capture, which is related > but slightly separate. +1 this can give a common place to put code once experiments show that we should add a new data source > But this is going to be a long-term project +1 > with > plenty of different people helping I bet. I hope so... > > ... > > > * Parsing mboxes for software repository data mining: > > There is a suite of tools exactly targeted at this kind of duty on > > github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I > > don't know how they manage time zones, but the toolsuite is widely used > > around (see [3] or [4] as examples) so I believe they are quite robust. > > It includes tools for data retrieval as well as visualisation. > > Drat. Metrics Grimoire looks pretty nifty - essentially a set of > frameworks for extracting metadata from a bunch of sources - but it's > GPL, so personally I have no interest in working on it. If someone else > uses it to generate datasets that's great. > > > * As for the feedback/thoughts about the architecture and formats: > > I love the REST-API idea proposed by Rob. That's really easy to access > > and retrieve through scripts on-demand. CSV and JSON are my favourite > > formats, because they are, again, easy to parse and widely used -- every > > language and library has some facility to read them natively. > > Yup - again, like project visualization, to make any of this simple for > newcomers to try stuff, we need to separate data gathering / model / > visualization. Since most of these are spare time projects, having easy > chunks makes it simpler for different people to try their hand at it. 
For visualization, for sure, json is the current natural format when data is consumed from the browser. I don't have great experience on this, and what I'm missing with json currently is a common practice on documenting a structure: are there common practices? Because for simple json structure, documentation is not really necessary, but once the structure goes complex, documentation is really a key requirement for people to use or extend. And I already see this shortcoming with the 11 json files from projects-new.a.o = https://projects-new.apache.org/json/foundation/ Regards, Hervé > > Thanks, > > - Shane
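One common practice for the documentation concern above: json schema's `description` keyword makes a schema its own documentation, and a few lines of Python (standard library only) can flatten those descriptions into a field reference. The schema fragment below is hypothetical, merely in the spirit of the projects-new.a.o files:

```python
def schema_docs(schema, prefix=""):
    """Yield (field-path, type, description) rows from a json schema dict."""
    for name, sub in schema.get("properties", {}).items():
        path = prefix + name
        yield path, sub.get("type", "?"), sub.get("description", "")
        if sub.get("type") == "object":
            # Recurse into nested objects so complex structures still
            # produce a flat, readable field list.
            yield from schema_docs(sub, path + ".")

# Hypothetical fragment, not the actual projects-new.a.o structure:
project_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Project name"},
        "pmc": {
            "type": "object",
            "description": "Owning PMC",
            "properties": {
                "chair": {"type": "string", "description": "PMC chair id"},
            },
        },
    },
}

for field, ftype, desc in schema_docs(project_schema):
    print("%-10s %-7s %s" % (field, ftype, desc))
```

The same walk could just as easily emit markdown or HTML, so the documentation never drifts from the schema itself.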
Re: Standards for mail archive statistics gathering?
On 5/5/15 7:33 AM, Boris Baldassari wrote: > Hi Folks, > > Sorry for the late answer on this thread. Don't know what has been done > since then, but I've some experience to share on this, so here are my 2c.. No, more input is always appreciated! Hervé is doing some centralization of the projects-new.a.o data capture, which is related but slightly separate. But this is going to be a long-term project with plenty of different people helping I bet. ... > * Parsing mboxes for software repository data mining: > There is a suite of tools exactly targeted at this kind of duty on > github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I > don't know how they manage time zones, but the toolsuite is widely used > around (see [3] or [4] as examples) so I believe they are quite robust. > It includes tools for data retrieval as well as visualisation. Drat. Metrics Grimoire looks pretty nifty - essentially a set of frameworks for extracting metadata from a bunch of sources - but it's GPL, so personally I have no interest in working on it. If someone else uses it to generate datasets that's great. > > * As for the feedback/thoughts about the architecture and formats: > I love the REST-API idea proposed by Rob. That's really easy to access > and retrieve through scripts on-demand. CSV and JSON are my favourite > formats, because they are, again, easy to parse and widely used -- every > language and library has some facility to read them natively. Yup - again, like project visualization, to make any of this simple for newcomers to try stuff, we need to separate data gathering / model / visualization. Since most of these are spare time projects, having easy chunks makes it simpler for different people to try their hand at it. Thanks, - Shane
Re: Standards for mail archive statistics gathering?
> On 05 May 2015, at 07:33, Boris Baldassari > wrote: > > Hi Folks, > > Sorry for the late answer on this thread. Don't know what has been done since > then, but I've some experience to share on this, so here are my 2c.. > > * Parsing dates and time zones: > If you are to use Perl, the Date::Parse module handles dates and time zones > pretty well. As for Python I don't know -- there probably is a module for > that too.. > I used Date::Parse to parse ASF mboxes (notably for Ant and JMeter, the data > sets have been published here [0]), and it worked great. I do have a Perl > script to do that, which I can provide -- but I have no access I'm aware of > in the dev scm, and not sure if Perl is the most common language here.. so > please let me know. > > * Parsing mboxes for software repository data mining: > There is a suite of tools exactly targeted at this kind of duty on github: > Metrics Grimoire [1], developed (and used) by Bitergia [2]. I don't know how > they manage time zones, but the toolsuite is widely used around (see [3] or > [4] as examples) so I believe they are quite robust. It includes tools for > data retrieval as well as visualisation. > > * As for the feedback/thoughts about the architecture and formats: > I love the REST-API idea proposed by Rob. That's really easy to access and > retrieve through scripts on-demand. CSV and JSON are my favourite formats, > because they are, again, easy to parse and widely used -- every language and > library has some facility to read them natively. I have to endorse Bitergia, too. If they don’t immediately have what is wanted, they are likely to be interested in working on it. But you know this, I’m guessing. 
louis > > > Cheers, > > > [0] http://castalia.solutions/datasets/ > [1] https://metricsgrimoire.github.io/ > [2] http://bitergia.com > [3] Eclipse Dashboard: http://dashboard.eclipse.org/ > [4] OpenStack Dashboard: http://activity.openstack.org/dash/browser/ > > > > -- > Boris Baldassari > Castalia Solutions -- Elegant Software Engineering > Web: http://castalia.solutions > Phone: +33 6 48 03 82 89 > > > Le 28/04/2015 16:11, Rich Bowen a écrit : >> >> >> On 04/27/2015 09:36 AM, Shane Curcuru wrote: >>> I'm interested in working on some visualizations of mailing list >>> activity over time, in particular some simple analyses, like thread >>> length/participants and the like. Given that the raw data can all be >>> precomputed from mbox archives, is there any semi-standard way to >>> distill and save metadata about mboxes? >>> >>> If we had a generic static database of past mail metadata and statistics >>> (i.e. not details of contents, but perhaps overall # of lines of text or >>> something), it would be interesting to see what kinds of visualizations >>> that different people would come up with. >>> >>> Anyone have pointers to either a data format or the best parsing library >>> for this? I'm trying to think ahead, and work on the parsing, storing >>> statistics, and visualizations as separate pieces so it's easier for >>> different people to collaborate on something. >> >> Roberto posted something to the list a month or so ago about the efforts >> that he's been working on for this kind of thing. You might ping him. >> >> --Rich
Re: Standards for mail archive statistics gathering?
Hi Folks,

Sorry for the late answer on this thread. Don't know what has been done since then, but I've some experience to share on this, so here are my 2c.

* Parsing dates and time zones:
If you are to use Perl, the Date::Parse module handles dates and time zones pretty well. As for Python I don't know -- there probably is a module for that too. I used Date::Parse to parse ASF mboxes (notably for Ant and JMeter; the data sets have been published here [0]), and it worked great. I do have a Perl script to do that, which I can provide -- but I have no access I'm aware of in the dev scm, and I'm not sure Perl is the most common language here, so please let me know.

* Parsing mboxes for software repository data mining:
There is a suite of tools targeted at exactly this kind of duty on GitHub: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I don't know how they manage time zones, but the toolsuite is widely used (see [3] or [4] as examples), so I believe it is quite robust. It includes tools for data retrieval as well as visualisation.

* As for the feedback/thoughts about the architecture and formats:
I love the REST-API idea proposed by Rob. That's really easy to access and retrieve through scripts on demand. CSV and JSON are my favourite formats because they are, again, easy to parse and widely used -- every language and library has some facility to read them natively.
Cheers,

[0] http://castalia.solutions/datasets/
[1] https://metricsgrimoire.github.io/
[2] http://bitergia.com
[3] Eclipse Dashboard: http://dashboard.eclipse.org/
[4] OpenStack Dashboard: http://activity.openstack.org/dash/browser/

--
Boris Baldassari
Castalia Solutions -- Elegant Software Engineering
Web: http://castalia.solutions
Phone: +33 6 48 03 82 89

On 28/04/2015 16:11, Rich Bowen wrote:
> On 04/27/2015 09:36 AM, Shane Curcuru wrote:
>> I'm interested in working on some visualizations of mailing list activity over time, in particular some simple analyses, like thread length/participants and the like. Given that the raw data can all be precomputed from mbox archives, is there any semi-standard way to distill and save metadata about mboxes?
>>
>> If we had a generic static database of past mail metadata and statistics (i.e. not details of contents, but perhaps overall # of lines of text or something), it would be interesting to see what kinds of visualizations that different people would come up with.
>>
>> Anyone have pointers to either a data format or the best parsing library for this? I'm trying to think ahead, and work on the parsing, storing statistics, and visualizations as separate pieces so it's easier for different people to collaborate on something.
>
> Roberto posted something to the list a month or so ago about the efforts that he's been working on for this kind of thing. You might ping him.
>
> --Rich
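[Editor's note: Boris wonders whether Python has an equivalent of Perl's Date::Parse for mbox dates and time zones. The standard library does: `email.utils.parsedate_to_datetime` handles most RFC 2822 Date headers, including the zone offsets Rob flags later in the thread. A minimal sketch; the function name and the fall-back to UTC for zone-less dates are editorial choices, not anything agreed in the thread.]

```python
from email.utils import parsedate_to_datetime
from datetime import timezone

def to_utc(date_header):
    """Parse an RFC 2822 Date header and normalize it to UTC.

    Returns None for headers parsedate_to_datetime cannot handle, so a
    caller can count and inspect failures instead of crashing mid-mbox.
    """
    try:
        dt = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):  # malformed or missing header
        return None
    if dt.tzinfo is None:            # some mboxes omit the zone entirely
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)
```

For example, `to_utc("Wed, 6 May 2015 22:44:00 +0200")` yields 20:44 UTC, so posts from every zone can be bucketed by the same calendar day.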
Re: Standards for mail archive statistics gathering?
On 04/27/2015 09:36 AM, Shane Curcuru wrote:
> I'm interested in working on some visualizations of mailing list activity over time, in particular some simple analyses, like thread length/participants and the like. Given that the raw data can all be precomputed from mbox archives, is there any semi-standard way to distill and save metadata about mboxes?
>
> If we had a generic static database of past mail metadata and statistics (i.e. not details of contents, but perhaps overall # of lines of text or something), it would be interesting to see what kinds of visualizations that different people would come up with.
>
> Anyone have pointers to either a data format or the best parsing library for this? I'm trying to think ahead, and work on the parsing, storing statistics, and visualizations as separate pieces so it's easier for different people to collaborate on something.

Roberto posted something to the list a month or so ago about the efforts that he's been working on for this kind of thing. You might ping him.

--Rich

--
Rich Bowen - rbo...@rcbowen.com - @rbowen
http://apachecon.com/ - @apachecon
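[Editor's note: Shane's "thread length/participants" statistics fall out of the Message-ID and In-Reply-To headers once messages are distilled to metadata. A sketch of the grouping step, assuming messages have already been reduced to dicts; the field names are illustrative, not a proposed standard.]

```python
def thread_stats(messages):
    """Group messages into threads by walking In-Reply-To links.

    `messages`: list of dicts with 'id', 'in_reply_to', 'sender' keys.
    Returns {root_id: {'length': n, 'participants': set_of_senders}}.
    Replies to messages missing from the archive become their own roots.
    """
    parent = {m['id']: m.get('in_reply_to') for m in messages}

    def root(mid):
        seen = set()  # guard against pathological reply cycles
        while parent.get(mid) in parent and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    threads = {}
    for m in messages:
        t = threads.setdefault(root(m['id']),
                               {'length': 0, 'participants': set()})
        t['length'] += 1
        t['participants'].add(m['sender'])
    return threads
```

Thread length is then `t['length']` and participant count `len(t['participants'])`, which is all the visualization layer needs.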
Re: Standards for mail archive statistics gathering?
On 4/27/15 2:29 PM, Rob Weir wrote:
> On Mon, Apr 27, 2015 at 9:36 AM, Shane Curcuru wrote:
...
> If you do Python, you might take a look at
> https://svn.apache.org/repos/asf/openoffice/devtools/list-stats/ for a
> simple program that could be adapted easily enough. It uses the
> Python mailbox library to do the parsing.

ACK, will look at it. Yes, I started with a Python library, but my issue is finding a chunk of time to start, code, and actually finish any one piece, so having a starting place is what I need.

> The biggest challenge making sense of such data, for me at least, was
> the multiple email addresses a single person can use. Determining
> these aliases for a project you are involved in is possible, though
> tedious. Doing it for an unfamiliar project borders on the
> impossible.

Yes - a huge part of the value is in identity tracking. Many committer records now have alternate emails filled in within the LDAP data behind id.apache.org, and Members certainly can work with infra to get access, so we certainly can do this for most Apache lists.

> Another "fun" problem is getting all the post time data into the same
> UTC timezone. The mbox format does not seem to enforce a consistent
> way of encoding these.

Ah, good point. I was going to start cheap and simply categorize by calendar day, and call it good enough.

> I see I have a few other analysis scripts on my harddrive I haven't
> checked in that handle the TZ and other issues. I'll get those
> checked in. It seems that almost as good as pre-extracted data
> would be an easy API.
>
> Ever think of having a contest related to "Visualizing Apache"? I
> was considering proposing something like that for OpenOffice.
> Provide the data for download (already extracted from our transaction
> systems, so we don't get a harmful amount of load on those servers) and
> invite the community to do the analysis, see what insights they can
> generate.
Yes, that's exactly why I want to treat this as an actual architecture, so to speak: really separate data finding from parsing from identity matching, and then settle on some interim format that visualization people can work from. It makes it much simpler for a volunteer, or someone with limited time, to accomplish a real task when they only have to focus on one piece.

- Shane

> Regards,
>
> -Rob
>
>> Thanks,
>> - Shane
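[Editor's note: both Rob and Shane flag multi-address identity as the hard part. The low-tech interim version of Shane's identity-matching stage is a hand-maintained alias table, eventually to be replaced by lookups against the LDAP data behind id.apache.org that Shane mentions. A sketch; all addresses and ids below are made up.]

```python
# Hypothetical hand-maintained alias table. In Shane's proposal this
# would eventually be populated from the committer LDAP records.
ALIASES = {
    'jdoe@example.org': 'jdoe',
    'john.doe@gmail.example': 'jdoe',
    'jdoe@apache.example': 'jdoe',
}

def canonical_id(address):
    """Map an email address to a canonical id.

    Falls back to the normalized address itself when no alias is known,
    so unmatched senders still aggregate consistently.
    """
    addr = address.strip().lower()
    return ALIASES.get(addr, addr)
```

Keeping this stage separate means the table can be improved later without re-parsing any mboxes, which is exactly the decoupling Shane is after.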
Re: Standards for mail archive statistics gathering?
On Mon, Apr 27, 2015 at 9:36 AM, Shane Curcuru wrote:
> I'm interested in working on some visualizations of mailing list activity over time, in particular some simple analyses, like thread length/participants and the like. Given that the raw data can all be precomputed from mbox archives, is there any semi-standard way to distill and save metadata about mboxes?

I've done some analysis of OpenOffice email archives, including some social network analysis that I wrote up here: http://www.robweir.com/blog/wp-content/uploads/2013/12/aoo-graph-large.png

> If we had a generic static database of past mail metadata and statistics (i.e. not details of contents, but perhaps overall # of lines of text or something), it would be interesting to see what kinds of visualizations that different people would come up with.
>
> Anyone have pointers to either a data format or the best parsing library for this? I'm trying to think ahead, and work on the parsing, storing statistics, and visualizations as separate pieces so it's easier for different people to collaborate on something.

If you do Python, you might take a look at https://svn.apache.org/repos/asf/openoffice/devtools/list-stats/ for a simple program that could be adapted easily enough. It uses the Python mailbox library to do the parsing.

The biggest challenge making sense of such data, for me at least, was the multiple email addresses a single person can use. Determining these aliases for a project you are involved in is possible, though tedious. Doing it for an unfamiliar project borders on the impossible.

Another "fun" problem is getting all the post time data into the same UTC timezone. The mbox format does not seem to enforce a consistent way of encoding these.

I see I have a few other analysis scripts on my harddrive I haven't checked in that handle the TZ and other issues. I'll get those checked in. It seems that almost as good as pre-extracted data would be an easy API.
Ever think of having a contest related to "Visualizing Apache"? I was considering proposing something like that for OpenOffice. Provide the data for download (already extracted from our transaction systems, so we don't get a harmful amount of load on those servers) and invite the community to do the analysis, and see what insights they can generate.

Regards,

-Rob

> Thanks,
> - Shane