Re: Standards for mail archive statistics gathering?

2015-05-07 Thread Steve Blackmon
I'd like to submit to this group that no batch job is necessary to compute
many useful statistics - rather, with a suitable representation, an indexed
event stream of mailing list messages, commits, releases, etc. can be
searched, aggregated, and visualized in real-time.

I hacked a bit this afternoon - parsed much of the community-dev mbox history
using mime4j into activity streams json, indexed in Elasticsearch with
Kibana as the UI.
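
For anyone without the Java/mime4j stack handy, a minimal Python sketch of
the same pipeline using only the standard library (the mbox path, index
name, and local Elasticsearch endpoint are placeholders, and the document
shape is a simplified guess at an activity streams representation, not the
streams project's schema):

    import json
    import mailbox
    import urllib.request
    from email.utils import parsedate_to_datetime

    MBOX_PATH = "community-dev.mbox"  # placeholder path
    ES_URL = "http://localhost:9200/community-dev_activity/_doc"  # assumed local ES 7+

    for msg in mailbox.mbox(MBOX_PATH):
        date = msg["date"]
        # One minimal activity-streams-style document per message; real code
        # would need the robust error handling mentioned below.
        activity = {
            "verb": "post",
            "actor": {"objectType": "person", "displayName": str(msg["from"])},
            "object": {"objectType": "note", "summary": str(msg["subject"])},
            "published": parsedate_to_datetime(date).isoformat() if date else None,
        }
        req = urllib.request.Request(
            ES_URL,
            data=json.dumps(activity).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # one index request per message; fine for a demo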

Visit the link below for an idea of what an indexed activity streams
representation of a mailing list could look like.

http://72.182.111.65:5601/#/discover?_a=(columns:!(actor.displayName,published,content,summary),index:community-dev_activity,interval:auto,query:(query_string:(analyze_wildcard:!t,query:'*')),sort:!(published,asc))&_g=(time:(from:'2009-10-07T22:26:18.843Z',mode:absolute,to:'2015-05-07T22:26:18.843Z'))
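
The same index answers aggregate questions on demand, with no batch job in
sight - for example, messages per sender and per month via Elasticsearch's
aggregation API (the endpoint and index name are the same assumptions as
above; on a default dynamic mapping the terms aggregation needs the
.keyword sub-field, and calendar_interval is the ES 7+ spelling):

    import json
    import urllib.request

    query = {
        "size": 0,
        "aggs": {
            "per_sender": {"terms": {"field": "actor.displayName.keyword",
                                     "size": 10}},
            "per_month": {"date_histogram": {"field": "published",
                                             "calendar_interval": "month"}},
        },
    }
    req = urllib.request.Request(
        "http://localhost:9200/community-dev_activity/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    print(json.load(urllib.request.urlopen(req)))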

Of course much more discussion and rigor would be required before something
like this could become official: determining appropriate
structure/identifier/format/enumeration of each field, adding robust error
handling, testing that no messages are lost in translation, resolving email
addresses back to Apache LDAP ids, etc. - but I wanted to show the
potential of this approach and what can be developed with minimal net new
code.

All code used to build this has been pushed to
http://github.com/steveblackmon/streams-apache

Regards,

Steve Blackmon
sblack...@apache.org

On Wed, May 6, 2015 at 10:44 PM, Hervé BOUTEMY wrote:
> On Wednesday, May 6, 2015 at 12:48:34, Steve Blackmon wrote:
>> > For visualization, for sure, json is the current natural format when data
>> > is consumed from the browser.
>> > I don't have great experience on this, and what I'm missing with json
>> > currently is a common practice on documenting a structure: are there
>> > common
>> > practices?
>>
>> In podling streams [0], we make extensive use of json schema [1]
> thank you: that's exactly the initial info I was looking for: json schema!
>
>> from
>> which we generate POJOs with a maven
>> plugin jsonschema2pojo [2] which makes manipulating the objects in
>> Java/Scala pleasant.  I expect other languages have
>> similar jsonschema-based ORM paradigms as well.
> As a typical Java developer, your tooling is interesting.
> But in the projects-new.a.o case, the data extraction is coded in Python: if
> we create json schema, having Python classes generated could simplify coding.
> Anyone with Python+json schema experience around?
>
>
>> This pattern supports
>> inheritance both within
>> and across projects - for example see how [3] extends [4] which
>> extends [5].  These schemas are relatively self-documenting,
>> but generating documentation or other artifacts is straightforward as
>> they are themselves json documents.
> yeah, json schema document is easy to read (at least the examples on the
> site...)
>
>>
>> > Because for simple json structure, documentation is not really necessary,
>> > but once the structure goes complex, documentation is really a key
>> > requirement for people to use or extend. And I already see this
>> > shortcoming with the 11 json files from projects-new.a.o =
>> > https://projects-new.apache.org/json/foundation/
>> Having used these json documents a few weeks ago to build an apache
>> community visualization [6]
> yeah, really nice visualization!
>
>> IMO the current crop of projects-new jsons
>> are intermediate artifacts rather than a sufficiently cross-purpose
>> data model, a role currently held by DOAP, mbox, and misc others, all
>> with some inherent shortcomings, most notably a lack of navigability
>> between silos.
> +1
> I'm at a point where I start to really understand the concepts involved and
> want to code a simple data model: I'll report here once I have a first version
> available.
>
>> I'd like to nominate activity streams [7] with
>> community-specific extensions (such as those roughly prototyped here:
>> [8] ) as a potential core data model for this effort going forward
> I had a first look at it: it is more complex than what I had in mind
> We'll have to share and see what's the best bet
>
>> and
>> I'm happy to help apply some of the useful tools and connectors within
>> podling streams toward that end. Converting external structured
>> sources into normalized documents and indexing those activities to
>> power data-centric APIs and visualizations are wheelhouse use cases
>> for this project, as they say.
> Great, stay tuned: I'll probably work on it this week-end
>
> Regards,
>
> Hervé
>
>>
>> [0] http://streams.incubator.apache.org/
>> [1] http://json-schema.org/documentation.html
>> [2] http://www.jsonschema2pojo.org/
>> [3]
>> https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema/objectTypes/committee.json
>> [4]
>> https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/objectTypes/group.json
>> [5]
>> https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/object.json

Re: Standards for mail archive statistics gathering?

2015-05-07 Thread Andrea Pescetti

Betty James wrote:

How do I get off this thread.  don't know how I got on


Just follow the instructions others have already provided. I'm putting 
you in CC in case you have already unsubscribed from this discussion list.



but I  am just a totally ignorant individual using Open Office and trying to
donate (which doesn't sound necessary anymore)


Actually, donations to OpenOffice and to the entire non-profit Apache 
Software Foundation are welcome! See 
http://www.openoffice.org/donations.html to discover the existing 
possibilities we have for donations. Thank you for your intentions, and 
sorry if some misleading instructions led you to believe you needed to 
subscribe to this discussion list!


Regards,
  Andrea.


Re: Standards for mail archive statistics gathering?

2015-05-06 Thread Hervé BOUTEMY
On Wednesday, May 6, 2015 at 12:48:34, Steve Blackmon wrote:
> > For visualization, for sure, json is the current natural format when data
> > is consumed from the browser.
> > I don't have great experience on this, and what I'm missing with json
> > currently is a common practice on documenting a structure: are there
> > common
> > practices?
> 
> In podling streams [0], we make extensive use of json schema [1]
thank you: that's exactly the initial info I was looking for: json schema!

> from
> which we generate POJOs with a maven
> plugin jsonschema2pojo [2] which makes manipulating the objects in
> Java/Scala pleasant.  I expect other languages have
> similar jsonschema-based ORM paradigms as well.
As a typical Java developer, your tooling is interesting.
But in the projects-new.a.o case, the data extraction is coded in Python: if
we create json schema, having Python classes generated could simplify coding.
Anyone with Python+json schema experience around?
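
As a starting point, class generation from a schema can be sketched in a few
lines of Python - the jsonschema package (pip install jsonschema) handles
validation, and the property list can drive a naive class factory; the
committee schema below is illustrative only, and real generators (such as
python-jsonschema-objects) go much further:

    from collections import namedtuple
    from jsonschema import validate

    committee_schema = {  # illustrative, not the actual streams schema
        "type": "object",
        "properties": {"displayName": {"type": "string"},
                       "chair": {"type": "string"}},
        "required": ["displayName"],
    }

    def class_from_schema(name, schema):
        # Naive generator: one attribute per declared property.
        return namedtuple(name, sorted(schema.get("properties", {})))

    Committee = class_from_schema("Committee", committee_schema)
    doc = {"displayName": "Community Development", "chair": "someone"}
    validate(doc, committee_schema)  # raises ValidationError on bad input
    committee = Committee(**doc)
    print(committee.displayName)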


> This pattern supports
> inheritance both within
> and across projects - for example see how [3] extends [4] which
> extends [5].  These schemas are relatively self-documenting,
> but generating documentation or other artifacts is straightforward as
> they are themselves json documents.
yeah, json schema document is easy to read (at least the examples on the 
site...)

> 
> > Because for simple json structure, documentation is not really necessary,
> > but once the structure goes complex, documentation is really a key
> > requirement for people to use or extend. And I already see this
> > shortcoming with the 11 json files from projects-new.a.o =
> > https://projects-new.apache.org/json/foundation/
> Having used these json documents a few weeks ago to build an apache
> community visualization [6]
yeah, really nice visualization!

> IMO the current crop of projects-new jsons
> are intermediate artifacts rather than a sufficiently cross-purpose
> data model, a role currently held by DOAP, mbox, and misc others, all
> with some inherent shortcomings, most notably a lack of navigability
> between silos.
+1
I'm at a point where I start to really understand the concepts involved and 
want to code a simple data model: I'll report here once I have a first version 
available.

> I'd like to nominate activity streams [7] with
> community-specific extensions (such as those roughly prototyped here:
> [8] ) as a potential core data model for this effort going forward
I had a first look at it: it is more complex than what I had in mind
We'll have to share and see what's the best bet

> and
> I'm happy to help apply some of the useful tools and connectors within
> podling streams toward that end. Converting external structured
> sources into normalized documents and indexing those activities to
> power data-centric APIs and visualizations are wheelhouse use cases
> for this project, as they say.
Great, stay tuned: I'll probably work on it this week-end

Regards,

Hervé

> 
> [0] http://streams.incubator.apache.org/
> [1] http://json-schema.org/documentation.html
> [2] http://www.jsonschema2pojo.org/
> [3]
> https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema/objectTypes/committee.json
> [4]
> https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/objectTypes/group.json
> [5]
> https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/object.json
> [6] http://72.182.111.65:3000/workspace/3
> [7] http://activitystrea.ms/
> [8]
> https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema
> 
> Steve Blackmon
> sblack...@apache.org
> 
> On Wed, May 6, 2015 at 2:05 AM, Hervé BOUTEMY  wrote:
> > On Tuesday, May 5, 2015 at 21:26:36, Shane Curcuru wrote:
> >> On 5/5/15 7:33 AM, Boris Baldassari wrote:
> >> > Hi Folks,
> >> > 
> >> > Sorry for the late answer on this thread. Don't know what has been done
> >> > since then, but I've some experience to share on this, so here are my
> >> > 2c..
> >> 
> >> No, more input is always appreciated!  Hervé is doing some
> >> centralization of the projects-new.a.o data capture, which is related
> >> but slightly separate.
> > 
> > +1
> > this can give a common place to put code once experiments show that we
> > should add a new data source
> > 
> >> But this is going to be a long-term project
> > 
> > +1
> > 
> >> with
> >> plenty of different people helping I bet.
> > 
> > I hope so...
> > 
> >> ...
> >> 
> >> > * Parsing mboxes for software repository data mining:
> >> > There is a suite of tools exactly targeted at this kind of duty on
> >> > github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I
> >> > don't know how they manage time zones, but the toolsuite is widely used
> >> > around (see [3] or [4] as examples) so I believe they are quite robust.
> >> > It includes tools for data retrieval as well as visualisation.
> >> 
> >> Drat.  Metrics Grimoire looks pretty nifty - essentially a set of
> >> frameworks for extracting metadata from a bunch of sources - but it's
> >> GPL, so personally I have no interest in working on it.  If someone else
> >> uses it to generate datasets that's great.

Re: Standards for mail archive statistics gathering?

2015-05-06 Thread Niclas Hedhman
If you want to unsubscribe, please find instructions at
http://apache.org/foundation/mailinglists.html

And the name of this list is dev@community.apache.org

Cheers
Niclas

On Thu, May 7, 2015 at 7:48 AM, Betty James  wrote:

> Oh my gosh.  How do I get off this thread.  don't know how I got on, but I
> am just a totally ignorant individual using Open Office and trying to
> donate (which doesn't sound necessary anymore) so unless you are in good
> shape and in your 70's try to figure out how I can get off the list!
>
> Betty B. James
>
> On Tue, May 5, 2015 at 7:33 AM, Boris Baldassari <
> castalia.laborat...@gmail.com> wrote:
>
> > Hi Folks,
> >
> > Sorry for the late answer on this thread. Don't know what has been done
> > since then, but I've some experience to share on this, so here are my
> 2c..
> >
> > * Parsing dates and time zones:
> > If you are to use Perl, the Date::Parse module handles dates and time
> > zones pretty well. As for Python I don't know -- there probably is a
> module
> > for that too..
> > I used Date::Parse to parse ASF mboxes (notably for Ant and JMeter, the
> > data sets have been published here [0]), and it worked great. I do have a
> > Perl script to do that, which I can provide -- but I have no access I'm
> > aware of in the dev scm, and not sure if Perl is the most common language
> > here.. so please let me know.
> >
> > * Parsing mboxes for software repository data mining:
> > There is a suite of tools exactly targeted at this kind of duty on
> github:
> > Metrics Grimoire [1], developed (and used) by Bitergia [2]. I don't know
> > how they manage time zones, but the toolsuite is widely used around (see
> > [3] or [4] as examples) so I believe they are quite robust. It includes
> > tools for data retrieval as well as visualisation.
> >
> > * As for the feedback/thoughts about the architecture and formats:
> > I love the REST-API idea proposed by Rob. That's really easy to access
> and
> > retrieve through scripts on-demand. CSV and JSON are my favourite
> formats,
> > because they are, again, easy to parse and widely used -- every language
> > and library has some facility to read them natively.
> >
> >
> > Cheers,
> >
> >
> > [0] http://castalia.solutions/datasets/
> > [1] https://metricsgrimoire.github.io/
> > [2] http://bitergia.com
> > [3] Eclipse Dashboard: http://dashboard.eclipse.org/
> > [4] OpenStack Dashboard: http://activity.openstack.org/dash/browser/
> >
> >
> >
> > --
> > Boris Baldassari
> > Castalia Solutions -- Elegant Software Engineering
> > Web: http://castalia.solutions
> > Phone: +33 6 48 03 82 89
> >
> >
> > On 28/04/2015 16:11, Rich Bowen wrote:
> >
> >>
> >>
> >> On 04/27/2015 09:36 AM, Shane Curcuru wrote:
> >>
> >>> I'm interested in working on some visualizations of mailing list
> >>> activity over time, in particular some simple analyses, like thread
> >>> length/participants and the like.  Given that the raw data can all be
> >>> precomputed from mbox archives, is there any semi-standard way to
> >>> distill and save metadata about mboxes?
> >>>
> >>> If we had a generic static database of past mail metadata and
> statistics
> >>> (i.e. not details of contents, but perhaps overall # of lines of text
> or
> >>> something), it would be interesting to see what kinds of visualizations
> >>> that different people would come up with.
> >>>
> >>> Anyone have pointers to either a data format or the best parsing
> library
> >>> for this?  I'm trying to think ahead, and work on the parsing, storing
> >>> statistics, and visualizations as separate pieces so it's easier for
> >>> different people to collaborate on something.
> >>>
> >>
> >> Roberto posted something to the list a month or so ago about the efforts
> >> that he's been working on for this kind of thing. You might ping him.
> >>
> >> --Rich
> >>
> >>
> >>
> >
>



-- 
Niclas Hedhman, Software Developer
http://zest.apache.org - New Energy for Java


Re: Standards for mail archive statistics gathering?

2015-05-06 Thread Betty James
Oh my gosh.  How do I get off this thread.  don't know how I got on, but I
am just a totally ignorant individual using Open Office and trying to
donate (which doesn't sound necessary anymore) so unless you are in good
shape and in your 70's try to figure out how I can get off the list!

Betty B. James

On Tue, May 5, 2015 at 7:33 AM, Boris Baldassari <
castalia.laborat...@gmail.com> wrote:

> Hi Folks,
>
> Sorry for the late answer on this thread. Don't know what has been done
> since then, but I've some experience to share on this, so here are my 2c..
>
> * Parsing dates and time zones:
> If you are to use Perl, the Date::Parse module handles dates and time
> zones pretty well. As for Python I don't know -- there probably is a module
> for that too..
> I used Date::Parse to parse ASF mboxes (notably for Ant and JMeter, the
> data sets have been published here [0]), and it worked great. I do have a
> Perl script to do that, which I can provide -- but I have no access I'm
> aware of in the dev scm, and not sure if Perl is the most common language
> here.. so please let me know.
>
> * Parsing mboxes for software repository data mining:
> There is a suite of tools exactly targeted at this kind of duty on github:
> Metrics Grimoire [1], developed (and used) by Bitergia [2]. I don't know
> how they manage time zones, but the toolsuite is widely used around (see
> [3] or [4] as examples) so I believe they are quite robust. It includes
> tools for data retrieval as well as visualisation.
>
> * As for the feedback/thoughts about the architecture and formats:
> I love the REST-API idea proposed by Rob. That's really easy to access and
> retrieve through scripts on-demand. CSV and JSON are my favourite formats,
> because they are, again, easy to parse and widely used -- every language
> and library has some facility to read them natively.
>
>
> Cheers,
>
>
> [0] http://castalia.solutions/datasets/
> [1] https://metricsgrimoire.github.io/
> [2] http://bitergia.com
> [3] Eclipse Dashboard: http://dashboard.eclipse.org/
> [4] OpenStack Dashboard: http://activity.openstack.org/dash/browser/
>
>
>
> --
> Boris Baldassari
> Castalia Solutions -- Elegant Software Engineering
> Web: http://castalia.solutions
> Phone: +33 6 48 03 82 89
>
>
> On 28/04/2015 16:11, Rich Bowen wrote:
>
>>
>>
>> On 04/27/2015 09:36 AM, Shane Curcuru wrote:
>>
>>> I'm interested in working on some visualizations of mailing list
>>> activity over time, in particular some simple analyses, like thread
>>> length/participants and the like.  Given that the raw data can all be
>>> precomputed from mbox archives, is there any semi-standard way to
>>> distill and save metadata about mboxes?
>>>
>>> If we had a generic static database of past mail metadata and statistics
>>> (i.e. not details of contents, but perhaps overall # of lines of text or
>>> something), it would be interesting to see what kinds of visualizations
>>> that different people would come up with.
>>>
>>> Anyone have pointers to either a data format or the best parsing library
>>> for this?  I'm trying to think ahead, and work on the parsing, storing
>>> statistics, and visualizations as separate pieces so it's easier for
>>> different people to collaborate on something.
>>>
>>
>> Roberto posted something to the list a month or so ago about the efforts
>> that he's been working on for this kind of thing. You might ping him.
>>
>> --Rich
>>
>>
>>
>


Re: Standards for mail archive statistics gathering?

2015-05-06 Thread Steve Blackmon
> For visualization, for sure, json is the current natural format when data is
> consumed from the browser.
> I don't have great experience on this, and what I'm missing with json
> currently is a common practice on documenting a structure: are there common
> practices?

In podling streams [0], we make extensive use of json schema [1] from
which we generate POJOs with a maven
plugin jsonschema2pojo [2] which makes manipulating the objects in
Java/Scala pleasant.  I expect other languages have
similar jsonschema-based ORM paradigms as well.  This pattern supports
inheritance both within
and across projects - for example see how [3] extends [4] which
extends [5].  These schemas are relatively self-documenting,
but generating documentation or other artifacts is straightforward as
they are themselves json documents.
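
A sketch of what that inheritance looks like at the schema level, expressed
here with draft-04 allOf composition and checked with the Python jsonschema
package (field names are illustrative, not the actual streams definitions):

    from jsonschema import validate

    group = {
        "type": "object",
        "properties": {"displayName": {"type": "string"}},
        "required": ["displayName"],
    }

    # "committee" extends "group": a committee must also validate as a
    # group, plus whatever extra fields it declares.
    committee = {
        "allOf": [group,
                  {"properties": {"chair": {"type": "string"}}}],
    }

    validate({"displayName": "ComDev", "chair": "someone"}, committee)  # passes
    # validate({"chair": "someone"}, committee) would fail: displayName required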

> Because for simple json structure, documentation is not really necessary, but
> once the structure goes complex, documentation is really a key requirement for
> people to use or extend. And I already see this shortcoming with the 11 json
> files from projects-new.a.o = https://projects-new.apache.org/json/foundation/

Having used these json documents a few weeks ago to build an apache
community visualization [6], IMO the current crop of projects-new jsons
are intermediate artifacts rather than a sufficiently cross-purpose
data model, a role currently held by DOAP, mbox, and misc others, all
with some inherent shortcomings, most notably a lack of navigability
between silos.  I'd like to nominate activity streams [7] with
community-specific extensions (such as those roughly prototyped here:
[8] ) as a potential core data model for this effort going forward and
I'm happy to help apply some of the useful tools and connectors within
podling streams toward that end.  Converting external structured
sources into normalized documents and indexing those activities to
power data-centric APIs and visualizations are wheelhouse use cases
for this project, as they say.

[0] http://streams.incubator.apache.org/
[1] http://json-schema.org/documentation.html
[2] http://www.jsonschema2pojo.org/
[3] 
https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema/objectTypes/committee.json
[4] 
https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/objectTypes/group.json
[5] 
https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/object.json
[6] http://72.182.111.65:3000/workspace/3
[7] http://activitystrea.ms/
[8] 
https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema

Steve Blackmon
sblack...@apache.org

On Wed, May 6, 2015 at 2:05 AM, Hervé BOUTEMY  wrote:
> On Tuesday, May 5, 2015 at 21:26:36, Shane Curcuru wrote:
>> On 5/5/15 7:33 AM, Boris Baldassari wrote:
>> > Hi Folks,
>> >
>> > Sorry for the late answer on this thread. Don't know what has been done
>> > since then, but I've some experience to share on this, so here are my 2c..
>>
>> No, more input is always appreciated!  Hervé is doing some
>> centralization of the projects-new.a.o data capture, which is related
>> but slightly separate.
> +1
> this can give a common place to put code once experiments show that we should
> add a new data source
>
>> But this is going to be a long-term project
> +1
>
>> with
>> plenty of different people helping I bet.
> I hope so...
>
>>
>> ...
>>
>> > * Parsing mboxes for software repository data mining:
>> > There is a suite of tools exactly targeted at this kind of duty on
>> > github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I
>> > don't know how they manage time zones, but the toolsuite is widely used
>> > around (see [3] or [4] as examples) so I believe they are quite robust.
>> > It includes tools for data retrieval as well as visualisation.
>>
>> Drat.  Metrics Grimoire looks pretty nifty - essentially a set of
>> frameworks for extracting metadata from a bunch of sources - but it's
>> GPL, so personally I have no interest in working on it.  If someone else
>> uses it to generate datasets that's great.
>>
>> > * As for the feedback/thoughts about the architecture and formats:
>> > I love the REST-API idea proposed by Rob. That's really easy to access
>> > and retrieve through scripts on-demand. CSV and JSON are my favourite
>> > formats, because they are, again, easy to parse and widely used -- every
>> > language and library has some facility to read them natively.
>>
>> Yup - again, like project visualization, to make any of this simple for
>> newcomers to try stuff, we need to separate data gathering / model /
>> visualization.  Since most of these are spare time projects, having easy
>> chunks makes it simpler for different people to try their hand at it.
> For visualization, for sure, json is the current natural format when data is
> consumed from the browser.
> I don't have great experience on this, and what I'm missing with json
> currently is a common practice on documenting a structure: are there common
> practices?

Re: Standards for mail archive statistics gathering?

2015-05-06 Thread Boris Baldassari

Hi all,


On 06/05/2015 03:26, Shane Curcuru wrote:

Drat.  Metrics Grimoire looks pretty nifty - essentially a set of
frameworks for extracting metadata from a bunch of sources - but it's
GPL, so personally I have no interest in working on it.  If someone else
uses it to generate datasets that's great.

Argh. I had forgotten about the licensing incompatibility, my mistake.
Well, as you point out I guess the product can still be used without 
modifications, if needed.


I've been following this ml for a few months now, but I'm not sure how far
this project is planned to go. Are the mailing lists the only artefact
to be analysed, or do you intend to also provide configuration
management or issue tracking data?



Yup - again, like project visualization, to make any of this simple for
newcomers to try stuff, we need to separate data gathering / model /
visualization.  Since most of these are spare time projects, having easy
chunks makes it simpler for different people to try their hand at it.
+1 for the architectural separation of concerns, definitely. For 
analysability and maintainability, easy access and usage, and as a 
consequence for dissemination.




--
Boris Baldassari
Castalia Solutions -- Elegant Software Engineering
Web: http://castalia.solutions
Tel: +33 6 48 03 82 89


Re: Standards for mail archive statistics gathering?

2015-05-05 Thread Hervé BOUTEMY
On Tuesday, May 5, 2015 at 21:26:36, Shane Curcuru wrote:
> On 5/5/15 7:33 AM, Boris Baldassari wrote:
> > Hi Folks,
> > 
> > Sorry for the late answer on this thread. Don't know what has been done
> > since then, but I've some experience to share on this, so here are my 2c..
> 
> No, more input is always appreciated!  Hervé is doing some
> centralization of the projects-new.a.o data capture, which is related
> but slightly separate.
+1
this can give a common place to put code once experiments show that we should 
add a new data source

> But this is going to be a long-term project
+1

> with
> plenty of different people helping I bet.
I hope so...

> 
> ...
> 
> > * Parsing mboxes for software repository data mining:
> > There is a suite of tools exactly targeted at this kind of duty on
> > github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I
> > don't know how they manage time zones, but the toolsuite is widely used
> > around (see [3] or [4] as examples) so I believe they are quite robust.
> > It includes tools for data retrieval as well as visualisation.
> 
> Drat.  Metrics Grimoire looks pretty nifty - essentially a set of
> frameworks for extracting metadata from a bunch of sources - but it's
> GPL, so personally I have no interest in working on it.  If someone else
> uses it to generate datasets that's great.
> 
> > * As for the feedback/thoughts about the architecture and formats:
> > I love the REST-API idea proposed by Rob. That's really easy to access
> > and retrieve through scripts on-demand. CSV and JSON are my favourite
> > formats, because they are, again, easy to parse and widely used -- every
> > language and library has some facility to read them natively.
> 
> Yup - again, like project visualization, to make any of this simple for
> newcomers to try stuff, we need to separate data gathering / model /
> visualization.  Since most of these are spare time projects, having easy
> chunks makes it simpler for different people to try their hand at it.
For visualization, for sure, json is the current natural format when data is 
consumed from the browser.
I don't have great experience on this, and what I'm missing with json 
currently is a common practice on documenting a structure: are there common 
practices?
Because for simple json structure, documentation is not really necessary, but 
once the structure goes complex, documentation is really a key requirement for 
people to use or extend. And I already see this shortcoming with the 11 json 
files from projects-new.a.o = https://projects-new.apache.org/json/foundation/

Regards,

Hervé

> 
> Thanks,
> 
> - Shane



Re: Standards for mail archive statistics gathering?

2015-05-05 Thread Shane Curcuru
On 5/5/15 7:33 AM, Boris Baldassari wrote:
> Hi Folks,
> 
> Sorry for the late answer on this thread. Don't know what has been done
> since then, but I've some experience to share on this, so here are my 2c..

No, more input is always appreciated!  Hervé is doing some
centralization of the projects-new.a.o data capture, which is related
but slightly separate.  But this is going to be a long-term project with
plenty of different people helping I bet.

...
> * Parsing mboxes for software repository data mining:
> There is a suite of tools exactly targeted at this kind of duty on
> github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I
> don't know how they manage time zones, but the toolsuite is widely used
> around (see [3] or [4] as examples) so I believe they are quite robust.
> It includes tools for data retrieval as well as visualisation.

Drat.  Metrics Grimoire looks pretty nifty - essentially a set of
frameworks for extracting metadata from a bunch of sources - but it's
GPL, so personally I have no interest in working on it.  If someone else
uses it to generate datasets that's great.

> 
> * As for the feedback/thoughts about the architecture and formats:
> I love the REST-API idea proposed by Rob. That's really easy to access
> and retrieve through scripts on-demand. CSV and JSON are my favourite
> formats, because they are, again, easy to parse and widely used -- every
> language and library has some facility to read them natively.

Yup - again, like project visualization, to make any of this simple for
newcomers to try stuff, we need to separate data gathering / model /
visualization.  Since most of these are spare time projects, having easy
chunks makes it simpler for different people to try their hand at it.

Thanks,

- Shane



Re: Standards for mail archive statistics gathering?

2015-05-05 Thread Louis Suárez-Potts

> On 05 May 2015, at 07:33, Boris Baldassari  
> wrote:
> 
> Hi Folks,
> 
> Sorry for the late answer on this thread. Don't know what has been done since 
> then, but I've some experience to share on this, so here are my 2c..
> 
> * Parsing dates and time zones:
> If you are to use Perl, the Date::Parse module handles dates and time zones 
> pretty well. As for Python I don't know -- there probably is a module for 
> that too..
> I used Date::Parse to parse ASF mboxes (notably for Ant and JMeter, the data 
> sets have been published here [0]), and it worked great. I do have a Perl 
> script to do that, which I can provide -- but I have no access I'm aware of 
> in the dev scm, and not sure if Perl is the most common language here.. so 
> please let me know.
> 
> * Parsing mboxes for software repository data mining:
> There is a suite of tools exactly targeted at this kind of duty on github: 
> Metrics Grimoire [1], developed (and used) by Bitergia [2]. I don't know how 
> they manage time zones, but the toolsuite is widely used around (see [3] or 
> [4] as examples) so I believe they are quite robust. It includes tools for 
> data retrieval as well as visualisation.
> 
> * As for the feedback/thoughts about the architecture and formats:
> I love the REST-API idea proposed by Rob. That's really easy to access and 
> retrieve through scripts on-demand. CSV and JSON are my favourite formats, 
> because they are, again, easy to parse and widely used -- every language and 
> library has some facility to read them natively.

I have to endorse Bitergia, too. If they don’t immediately have what is wanted, 
they are likely to be interested in working on it. But you know this, I’m 
guessing.

louis

> 
> 
> Cheers,
> 
> 
> [0] http://castalia.solutions/datasets/
> [1] https://metricsgrimoire.github.io/
> [2] http://bitergia.com
> [3] Eclipse Dashboard: http://dashboard.eclipse.org/
> [4] OpenStack Dashboard: http://activity.openstack.org/dash/browser/
> 
> 
> 
> --
> Boris Baldassari
> Castalia Solutions -- Elegant Software Engineering
> Web: http://castalia.solutions
> Phone: +33 6 48 03 82 89
> 
> 
> On 28/04/2015 16:11, Rich Bowen wrote:
>> 
>> 
>> On 04/27/2015 09:36 AM, Shane Curcuru wrote:
>>> I'm interested in working on some visualizations of mailing list
>>> activity over time, in particular some simple analyses, like thread
>>> length/participants and the like.  Given that the raw data can all be
>>> precomputed from mbox archives, is there any semi-standard way to
>>> distill and save metadata about mboxes?
>>> 
>>> If we had a generic static database of past mail metadata and statistics
>>> (i.e. not details of contents, but perhaps overall # of lines of text or
>>> something), it would be interesting to see what kinds of visualizations
>>> that different people would come up with.
>>> 
>>> Anyone have pointers to either a data format or the best parsing library
>>> for this?  I'm trying to think ahead, and work on the parsing, storing
>>> statistics, and visualizations as separate pieces so it's easier for
>>> different people to collaborate on something.
>> 
>> Roberto posted something to the list a month or so ago about the efforts 
>> that he's been working on for this kind of thing. You might ping him.
>> 
>> --Rich
>> 
>> 
> 





Re: Standards for mail archive statistics gathering?

2015-05-05 Thread Boris Baldassari

Hi Folks,

Sorry for the late answer on this thread. Don't know what has been done 
since then, but I've some experience to share on this, so here are my 2c..


* Parsing dates and time zones:
If you are to use Perl, the Date::Parse module handles dates and time 
zones pretty well. As for Python I don't know -- there probably is a 
module for that too..
I used Date::Parse to parse ASF mboxes (notably for Ant and JMeter, the 
data sets have been published here [0]), and it worked great. I do have 
a Perl script to do that, which I can provide -- but I have no access 
I'm aware of in the dev scm, and not sure if Perl is the most common 
language here.. so please let me know.
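
On the Python side, the standard library does cover this - a quick check
with an RFC 2822 date of the kind found in mbox headers:

    from email.utils import parsedate_to_datetime

    dt = parsedate_to_datetime("Tue, 5 May 2015 07:33:00 -0400")
    print(dt.isoformat())  # 2015-05-05T07:33:00-04:00, offset preserved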


* Parsing mboxes for software repository data mining:
There is a suite of tools exactly targeted at this kind of duty on 
github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I 
don't know how they manage time zones, but the toolsuite is widely used 
around (see [3] or [4] as examples) so I believe they are quite robust. 
It includes tools for data retrieval as well as visualisation.


* As for the feedback/thoughts about the architecture and formats:
I love the REST-API idea proposed by Rob. That's really easy to access 
and retrieve through scripts on-demand. CSV and JSON are my favourite 
formats, because they are, again, easy to parse and widely used -- every 
language and library has some facility to read them natively.



Cheers,


[0] http://castalia.solutions/datasets/
[1] https://metricsgrimoire.github.io/
[2] http://bitergia.com
[3] Eclipse Dashboard: http://dashboard.eclipse.org/
[4] OpenStack Dashboard: http://activity.openstack.org/dash/browser/



--
Boris Baldassari
Castalia Solutions -- Elegant Software Engineering
Web: http://castalia.solutions
Phone: +33 6 48 03 82 89


On 28/04/2015 16:11, Rich Bowen wrote:



On 04/27/2015 09:36 AM, Shane Curcuru wrote:

I'm interested in working on some visualizations of mailing list
activity over time, in particular some simple analyses, like thread
length/participants and the like.  Given that the raw data can all be
precomputed from mbox archives, is there any semi-standard way to
distill and save metadata about mboxes?

If we had a generic static database of past mail metadata and statistics
(i.e. not details of contents, but perhaps overall # of lines of text or
something), it would be interesting to see what kinds of visualizations
that different people would come up with.

Anyone have pointers to either a data format or the best parsing library
for this?  I'm trying to think ahead, and work on the parsing, storing
statistics, and visualizations as separate pieces so it's easier for
different people to collaborate on something.


Roberto posted something to the list a month or so ago about the 
efforts that he's been working on for this kind of thing. You might 
ping him.


--Rich






Re: Standards for mail archive statistics gathering?

2015-04-28 Thread Rich Bowen



On 04/27/2015 09:36 AM, Shane Curcuru wrote:

I'm interested in working on some visualizations of mailing list
activity over time, in particular some simple analyses, like thread
length/participants and the like.  Given that the raw data can all be
precomputed from mbox archives, is there any semi-standard way to
distill and save metadata about mboxes?

If we had a generic static database of past mail metadata and statistics
(i.e. not details of contents, but perhaps overall # of lines of text or
something), it would be interesting to see what kinds of visualizations
that different people would come up with.

Anyone have pointers to either a data format or the best parsing library
for this?  I'm trying to think ahead, and work on the parsing, storing
statistics, and visualizations as separate pieces so it's easier for
different people to collaborate on something.


Roberto posted something to the list a month or so ago about the efforts 
that he's been working on for this kind of thing. You might ping him.


--Rich


--
Rich Bowen - rbo...@rcbowen.com - @rbowen
http://apachecon.com/ - @apachecon


Re: Standards for mail archive statistics gathering?

2015-04-28 Thread Shane Curcuru
On 4/27/15 2:29 PM, Rob Weir wrote:
> On Mon, Apr 27, 2015 at 9:36 AM, Shane Curcuru  wrote:
...
> If you do Python, you might take a look at
> https://svn.apache.org/repos/asf/openoffice/devtools/list-stats/ for a
> simple program that could be adapted easily enough.   It uses the
> Python mailbox library to do the parsing.

ACK, will look at.  Yes, I started with a python library, but my issue
is finding a chunk of time to start, code, and actually finish any one
piece, so having a starting place is what I need.

> 
> The biggest challenge making sense of such data, for me at least, was
> the multiple email addresses a single person can use.   Determining
> these aliases for a project you are involved in is possible, though
> tedious.   Doing it for an unfamiliar project borders on the
> impossible.

Yes - a huge part of the value is in identity tracking.  Many committer
records now have alternate emails recorded in the LDAP data that is
behind id.apache.org, and Members certainly can work with infra to get
access, so we certainly can do this for most Apache lists.

> Another "fun" problem is getting all the post time data into the same
> UTC timezone.   The mbox format does not seem to enforce a consistent
> way of encoding these.

Ah, good point.  I was going to start cheap and simply categorize by
calendar day, and call it good enough.

> 
> I see I have a few other analysis scripts on my harddrive I haven't
> checked in that handle the TZ and other issues.   I'll get those
> checked in.   It seems that, almost as good as pre-extracted data
> would be an easy API.
> 
> 
> Ever think of having a contest related to "Visualizing Apache"?   I
> was considering proposing something like that for OpenOffice.
> Provide the data for download (already extracted from our transaction
> systems, so we don't get a harmful amount of load on those servers) and
> invite the community to do the analysis, see what insights they can
> generate.

Yes, that's exactly why I want to treat this as an actual architecture,
so to speak.  Really separate out data finding from parsing from
identity matching, and then just find some interim format that
visualization people could just look at.  Makes it much simpler for a
volunteer or someone with limited time to accomplish a real task to only
have to focus on one bit.

- Shane

> 
> Regards,
> 
> -Rob
> 
> 
>> Thanks,
>> - Shane



Re: Standards for mail archive statistics gathering?

2015-04-27 Thread Rob Weir
On Mon, Apr 27, 2015 at 9:36 AM, Shane Curcuru  wrote:
> I'm interested in working on some visualizations of mailing list
> activity over time, in particular some simple analyses, like thread
> length/participants and the like.  Given that the raw data can all be
> precomputed from mbox archives, is there any semi-standard way to
> distill and save metadata about mboxes?
>

I've done some analysis of OpenOffice email archives, including some
social network analysis that I wrote up here:

http://www.robweir.com/blog/wp-content/uploads/2013/12/aoo-graph-large.png

> If we had a generic static database of past mail metadata and statistics
> (i.e. not details of contents, but perhaps overall # of lines of text or
> something), it would be interesting to see what kinds of visualizations
> that different people would come up with.
>
> Anyone have pointers to either a data format or the best parsing library
> for this?  I'm trying to think ahead, and work on the parsing, storing
> statistics, and visualizations as separate pieces so it's easier for
> different people to collaborate on something.
>

If you do Python, you might take a look at
https://svn.apache.org/repos/asf/openoffice/devtools/list-stats/ for a
simple program that could be adapted easily enough.   It uses the
Python mailbox library to do the parsing.
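
For a flavor of that approach, a tiny sketch with the mailbox module (the
path is a placeholder, and this is not the linked script itself):

    import mailbox
    from collections import Counter

    senders = Counter(str(msg["from"]) for msg in mailbox.mbox("dev.mbox"))
    for sender, count in senders.most_common(10):
        print(count, sender)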

The biggest challenge making sense of such data, for me at least, was
the multiple email addresses a single person can use.   Determining
these aliases for a project you are involved in is possible, though
tedious.   Doing it for an unfamiliar project borders on the
impossible.
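
One workable, if tedious, approach is a hand-curated alias table consulted
during parsing (the addresses below are made up for illustration):

    # Every known address for a person maps to one canonical id.
    ALIASES = {
        "jdoe@example.org": "jdoe",
        "john.doe@example.com": "jdoe",
    }

    def canonical_id(address):
        return ALIASES.get(address.lower(), address.lower())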

Another "fun" problem is getting all the post time data into the same
UTC timezone.   The mbox format does not seem to enforce a consistent
way of encoding these.
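
A defensive normalization sketch: parse whatever the Date header contains
and convert it to UTC, returning None when the parser gives up (treating
naive dates as UTC is itself a guess):

    from datetime import timezone
    from email.utils import parsedate_to_datetime

    def to_utc(date_header):
        try:
            dt = parsedate_to_datetime(date_header)
        except (TypeError, ValueError):
            return None
        if dt.tzinfo is None:  # some archives carry naive dates
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc)

    print(to_utc("Wed, 6 May 2015 12:48:34 +0200"))  # 2015-05-06 10:48:34+00:00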

I see I have a few other analysis scripts on my harddrive I haven't
checked in that handle the TZ and other issues.   I'll get those
checked in.   It seems that, almost as good as pre-extracted data
would be an easy API.


Ever think of having a contest related to "Visualizing Apache"?   I
was considering proposing something like that for OpenOffice.
Provide the data for download (already extracted from our transaction
systems, so we don't get a harmful amount of load on those servers) and
invite the community to do the analysis, see what insights they can
generate.

Regards,

-Rob


> Thanks,
> - Shane