Re: Standards for mail archive statistics gathering?

2015-05-05 Thread Hervé BOUTEMY
Le mardi 5 mai 2015 21:26:36 Shane Curcuru a écrit :
> On 5/5/15 7:33 AM, Boris Baldassari wrote:
> > Hi Folks,
> > 
> > Sorry for the late answer on this thread. Don't know what has been done
> > since then, but I've some experience to share on this, so here are my 2c..
> 
> No, more input is always appreciated!  Hervé is doing some
> centralization of the projects-new.a.o data capture, which is related
> but slightly separate.
+1
this can give a common place to put code once experiments show that we should 
add a new data source

> But this is going to be a long-term project
+1

> with
> plenty of different people helping I bet.
I hope so...

> 
> ...
> 
> > * Parsing mboxes for software repository data mining:
> > There is a suite of tools exactly targeted at this kind of duty on
> > github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I
> > don't know how they manage time zones, but the toolsuite is widely used
> > around (see [3] or [4] as examples) so I believe they are quite robust.
> > It includes tools for data retrieval as well as visualisation.
> 
> Drat.  Metrics Grimoire looks pretty nifty - essentially a set of
> frameworks for extracting metadata from a bunch of sources - but it's
> GPL, so personally I have no interest in working on it.  If someone else
> uses it to generate datasets that's great.
> 
> > * As for the feedback/thoughts about the architecture and formats:
> > I love the REST-API idea proposed by Rob. That's really easy to access
> > and retrieve through scripts on-demand. CSV and JSON are my favourite
> > formats, because they are, again, easy to parse and widely used -- every
> > language and library has some facility to read them natively.
> 
> Yup - again, like project visualization, to make any of this simple for
> newcomers to try stuff, we need to separate data gathering / model /
> visualization.  Since most of these are spare time projects, having easy
> chunks makes it simpler for different people to try their hand at it.
For visualization, JSON is certainly the natural format today, since the data 
is consumed from the browser.
I don't have much experience here, and what I'm missing with JSON at the 
moment is a common practice for documenting a structure: are there any?
For a simple JSON structure, documentation is not really necessary, but once 
the structure gets complex, documentation becomes a key requirement for people 
who want to use or extend it. And I already see this shortcoming with the 11 
JSON files from projects-new.a.o = https://projects-new.apache.org/json/foundation/

Regards,

Hervé

> 
> Thanks,
> 
> - Shane



Re: Chairs: A small addition to the Marvin email you received yesterday.

2015-05-05 Thread Shazron
Hi Daniel,
If this is not the right avenue, let me know.
It's not in the interface of course, but is there any way a release in
"Releases" can be removed in reporter.apache.org? I mistakenly added a
Release that should not be there for Apache Cordova.

I suppose I can just edit it out before I copy and paste for a report,
but it's easier to fix it beforehand in case I forget to do so (since we
have a lot of releases per quarter).

Shaz


On Thu, Mar 5, 2015 at 6:31 AM, Daniel Gruno  wrote:
> Hi Project chairs,
> In yesterday's email to you about your upcoming board report, we forgot to
> mention that we have a new tool that can help you in cobbling together a
> report, or just view statistics of the PMCs you are on.
>
> The new service is located at: https://reporter.apache.org and is PMC
> members only.
> Should you choose to make use of the board report template in this system,
> do remember to add in the important activity bits and any issues that
> require board activity.
>
> Next time Marvin sends you an email, it will include the URL for the
> reporter system.
>
> If you have ANY feedback about this system, don't hesitate to let us know!
> :)
>
> On behalf of the Community Development Project,
> Daniel.


Re: Standards for mail archive statistics gathering?

2015-05-05 Thread Shane Curcuru
On 5/5/15 7:33 AM, Boris Baldassari wrote:
> Hi Folks,
> 
> Sorry for the late answer on this thread. Don't know what has been done
> since then, but I've some experience to share on this, so here are my 2c..

No, more input is always appreciated!  Hervé is doing some
centralization of the projects-new.a.o data capture, which is related
but slightly separate.  But this is going to be a long-term project with
plenty of different people helping I bet.

...
> * Parsing mboxes for software repository data mining:
> There is a suite of tools exactly targeted at this kind of duty on
> github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I
> don't know how they manage time zones, but the toolsuite is widely used
> around (see [3] or [4] as examples) so I believe they are quite robust.
> It includes tools for data retrieval as well as visualisation.

Drat.  Metrics Grimoire looks pretty nifty - essentially a set of
frameworks for extracting metadata from a bunch of sources - but it's
GPL, so personally I have no interest in working on it.  If someone else
uses it to generate datasets that's great.

> 
> * As for the feedback/thoughts about the architecture and formats:
> I love the REST-API idea proposed by Rob. That's really easy to access
> and retrieve through scripts on-demand. CSV and JSON are my favourite
> formats, because they are, again, easy to parse and widely used -- every
> language and library has some facility to read them natively.

Yup - again, like project visualization, to make any of this simple for
newcomers to try stuff, we need to separate data gathering / model /
visualization.  Since most of these are spare time projects, having easy
chunks makes it simpler for different people to try their hand at it.

Thanks,

- Shane



Re: Captcha on Apache mirror

2015-05-05 Thread Niclas Hedhman
I am still getting it... :-/

On Tue, May 5, 2015 at 8:49 PM, Rich Bowen  wrote:

>
>
> On 05/05/2015 08:36 AM, Konstantin Kolinko wrote:
>
>> 2015-05-05 5:37 GMT+03:00 Niclas Hedhman :
>>
>>> I just tried to download Maven and randomly selected a download mirror;
>>>
>>> http://apache.petsads.us/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
>>>
>>> But get presented with a captcha, and I think that is not
>>> expected/allowed
>>> since it might break automated tools.
>>>
>>>
>> Good question.
>>
>> For information:
>>
>> The captcha on that site is shown by CloudFlare, that serves as a
>> proxy for them,
>> https://www.cloudflare.com/5xx-error-landing
>>
>> Maybe there was a reason for it.
>>
>> http://www.apache.org/info/how-to-mirror.html
>> "How to become a mirror" - does not say anything about captchas,
>> though it says "Your mirror must not be shown "inside" another site
>> using, for instance, frames."
>>
>
>
> Pretty sure that
>
> You must not modify the mirrored tree in any way. In particular,
> HEADER.html and README.html files must not be altered or removed ; see
> below for adding sponsor information.
>
> covers this behavior.
>
> The reason that the site doesn't mention captchas was that the term didn't
> exist when this document was written. However, "modify" is general enough
> that it's covered.
>
> For whatever it's worth, I don't get a captcha at that URL - I get the
> expected file. Probably some transitory problem that has since cleared up.
>
>
> --
> Rich Bowen - rbo...@rcbowen.com - @rbowen
> http://apachecon.com/ - @apachecon
>



-- 
Niclas Hedhman, Software Developer
http://zest.apache.org - New Energy for Java


Re: Standards for mail archive statistics gathering?

2015-05-05 Thread Louis Suárez-Potts

> On 05 May 2015, at 07:33, Boris Baldassari  
> wrote:
> 
> Hi Folks,
> 
> Sorry for the late answer on this thread. Don't know what has been done since 
> then, but I've some experience to share on this, so here are my 2c..
> 
> * Parsing dates and time zones:
> If you are to use Perl, the Date::Parse module handles dates and time zones 
> pretty well. As for Python I don't know -- there probably is a module for 
> that too..
> I used Date::Parse to parse ASF mboxes (notably for Ant and JMeter, the data 
> sets have been published here [0]), and it worked great. I do have a Perl 
> script to do that, which I can provide -- but I have no access I'm aware of 
> in the dev scm, and not sure if Perl is the most common language here.. so 
> please let me know.
> 
> * Parsing mboxes for software repository data mining:
> There is a suite of tools exactly targeted at this kind of duty on github: 
> Metrics Grimoire [1], developed (and used) by Bitergia [2]. I don't know how 
> they manage time zones, but the toolsuite is widely used around (see [3] or 
> [4] as examples) so I believe they are quite robust. It includes tools for 
> data retrieval as well as visualisation.
> 
> * As for the feedback/thoughts about the architecture and formats:
> I love the REST-API idea proposed by Rob. That's really easy to access and 
> retrieve through scripts on-demand. CSV and JSON are my favourite formats, 
> because they are, again, easy to parse and widely used -- every language and 
> library has some facility to read them natively.

I have to endorse Bitergia, too. If they don’t immediately have what is wanted, 
they are likely to be interested in working on it. But you know this, I’m 
guessing.

louis

> 
> 
> Cheers,
> 
> 
> [0] http://castalia.solutions/datasets/
> [1] https://metricsgrimoire.github.io/
> [2] http://bitergia.com
> [3] Eclipse Dashboard: http://dashboard.eclipse.org/
> [4] OpenStack Dashboard: http://activity.openstack.org/dash/browser/
> 
> 
> 
> --
> Boris Baldassari
> Castalia Solutions -- Elegant Software Engineering
> Web: http://castalia.solutions
> Phone: +33 6 48 03 82 89
> 
> 
> Le 28/04/2015 16:11, Rich Bowen a écrit :
>> 
>> 
>> On 04/27/2015 09:36 AM, Shane Curcuru wrote:
>>> I'm interested in working on some visualizations of mailing list
>>> activity over time, in particular some simple analyses, like thread
>>> length/participants and the like.  Given that the raw data can all be
>>> precomputed from mbox archives, is there any semi-standard way to
>>> distill and save metadata about mboxes?
>>> 
>>> If we had a generic static database of past mail metadata and statistics
>>> (i.e. not details of contents, but perhaps overall # of lines of text or
>>> something), it would be interesting to see what kinds of visualizations
>>> that different people would come up with.
>>> 
>>> Anyone have pointers to either a data format or the best parsing library
>>> for this?  I'm trying to think ahead, and work on the parsing, storing
>>> statistics, and visualizations as separate pieces so it's easier for
>>> different people to collaborate on something.
>> 
>> Roberto posted something to the list a month or so ago about the efforts 
>> that he's been working on for this kind of thing. You might ping him.
>> 
>> --Rich
>> 
>> 
> 





Re: DOAP format question

2015-05-05 Thread sebb
On 5 May 2015 at 16:57, Sergio Fernández  wrote:
> Hi,
>
> On Tue, May 5, 2015 at 4:39 PM, sebb  wrote:
>>
>> > One question, sebb: how is the site development organized? Do you use
>> > JIRA or something else, as other projects do? Just so I do things
>> > properly according to your guidelines.
>>
>> It's not a regular project.
>> I don't know who "owns" the code - possibly Infra or maybe ComDev.
>>
>> I have just been making the occasional fix as I notice problems.
>>
>> The site-dev and dev@community mailing list are probably the place to
>> discuss changes.
>
>
> OK, then I'll stay in this thread to discuss this.
>
> I didn't have much time today, but I did implement the basics of what the
> DOAP processing could look like. For the moment it is at
> https://github.com/wikier/asf-doap; once I get something more functional,
> I'll commit it to the ASF repo.
>
> Basically, what that simple code currently does is fetch all the DOAP/PMC
> files and report some basics (size). You can run it yourself by executing:
>
> $ python doap.py
>
> What I can already say is that I do not understand what
> https://svn.apache.org/repos/asf/infrastructure/site-tools/trunk/projects/data_files
> aims to represent.

This is the default location for the PMC data [1] files which provide
data about the PMC.
A single such file may be referenced by multiple DOAPs.
E.g. all the Commons components refer to the same PMC data file.

The contents and locations of the various files are documented on the site.

[1] http://projects.apache.org/docs/pmc.html

> asfext:pmc is defined as a property in the
> namespace (as we discussed a couple of days ago), so I am missing the
> subject it refers to (normally it would be used as  asfext:pmc
> <...>). Given that usage of the term, I guess they actually wanted to
> define a class.
>
> But please, let me evolve the code a bit more to give you some basic
> tools, and then we can discuss such aspects further.
>
> Cheers.
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernan...@redlink.co
> w: http://redlink.co


Re: DOAP format question

2015-05-05 Thread Sergio Fernández
Hi,

On Tue, May 5, 2015 at 4:39 PM, sebb  wrote:
>
> > One question, sebb: how is the site development organized? Do you use
> > JIRA or something else, as other projects do? Just so I do things
> > properly according to your guidelines.
>
> It's not a regular project.
> I don't know who "owns" the code - possibly Infra or maybe ComDev.
>
> I have just been making the occasional fix as I notice problems.
>
> The site-dev and dev@community mailing list are probably the place to
> discuss changes.


OK, then I'll stay in this thread to discuss this.

I didn't have much time today, but I did implement the basics of what the
DOAP processing could look like. For the moment it is at
https://github.com/wikier/asf-doap; once I get something more functional,
I'll commit it to the ASF repo.

Basically, what that simple code currently does is fetch all the DOAP/PMC
files and report some basics (size). You can run it yourself by executing:

$ python doap.py

What I can already say is that I do not understand what
https://svn.apache.org/repos/asf/infrastructure/site-tools/trunk/projects/data_files
aims to represent. asfext:pmc is defined as a property in the
namespace (as we discussed a couple of days ago), so I am missing the
subject it refers to (normally it would be used as  asfext:pmc
<...>). Given that usage of the term, I guess they actually wanted to
define a class.

But please, let me evolve the code a bit more to give you some basic
tools, and then we can discuss such aspects further.

Cheers.

-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co


Re: DOAP format question

2015-05-05 Thread sebb
On 5 May 2015 at 14:58, Sergio Fernández  wrote:
> On Tue, May 5, 2015 at 12:51 PM, sebb  wrote:
>>
>> Checking for such simple typos could be done with almost any scripting
>> language.
>>
>> The RDF files are listed in
>>
>>
>> https://svn.apache.org/repos/asf/infrastructure/site-tools/trunk/projects/files.xml
>> (DOAPs)
>> and
>>
>> https://svn.apache.org/repos/asf/infrastructure/site-tools/trunk/projects/pmc_list.xml
>> (PMC definitions)
>>
>> Most of the PMC definitions are stored locally, and have already been
>> fixed.
>
>
> OK, in the next days I'll provide a basic implementation (Python+RDFLib) that
> provides some kind of validation and reporting of pitfalls. Later we can
> extend it to add proper DOAP support to projects-new.a.o.
>
> One question, sebb: how is the site development organized? Do you use JIRA
> or something else, as other projects do? Just so I do things properly
> according to your guidelines.

It's not a regular project.
I don't know who "owns" the code - possibly Infra or maybe ComDev.

I have just been making the occasional fix as I notice problems.

The site-dev and dev@community mailing list are probably the place to
discuss changes.

> Cheers,
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernan...@redlink.co
> w: http://redlink.co


Re: DOAP format question

2015-05-05 Thread Sergio Fernández
On Tue, May 5, 2015 at 12:51 PM, sebb  wrote:
>
> Checking for such simple typos could be done with almost any scripting
> language.
>
> The RDF files are listed in
>
>
> https://svn.apache.org/repos/asf/infrastructure/site-tools/trunk/projects/files.xml
> (DOAPs)
> and
>
> https://svn.apache.org/repos/asf/infrastructure/site-tools/trunk/projects/pmc_list.xml
> (PMC definitions)
>
> Most of the PMC definitions are stored locally, and have already been
> fixed.
>

OK, in the next days I'll provide a basic implementation (Python+RDFLib)
that provides some kind of validation and reporting of pitfalls. Later
we can extend it to add proper DOAP support to projects-new.a.o.

One question, sebb: how is the site development organized? Do you use JIRA
or something else, as other projects do? Just so I do things properly
according to your guidelines.

Cheers,

-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co


Re: Captcha on Apache mirror

2015-05-05 Thread Rich Bowen



On 05/05/2015 08:36 AM, Konstantin Kolinko wrote:

2015-05-05 5:37 GMT+03:00 Niclas Hedhman :

I just tried to download Maven and randomly selected a download mirror;
http://apache.petsads.us/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz

But get presented with a captcha, and I think that is not expected/allowed
since it might break automated tools.



Good question.

For information:

The captcha on that site is shown by CloudFlare, that serves as a
proxy for them,
https://www.cloudflare.com/5xx-error-landing

Maybe there was a reason for it.

http://www.apache.org/info/how-to-mirror.html
"How to become a mirror" - does not say anything about captchas,
though it says "Your mirror must not be shown "inside" another site
using, for instance, frames."



Pretty sure that

You must not modify the mirrored tree in any way. In particular, 
HEADER.html and README.html files must not be altered or removed ; see 
below for adding sponsor information.


covers this behavior.

The reason that the site doesn't mention captchas was that the term 
didn't exist when this document was written. However, "modify" is 
general enough that it's covered.


For whatever it's worth, I don't get a captcha at that URL - I get the 
expected file. Probably some transitory problem that has since cleared up.



--
Rich Bowen - rbo...@rcbowen.com - @rbowen
http://apachecon.com/ - @apachecon


Re: Captcha on Apache mirror

2015-05-05 Thread Konstantin Kolinko
2015-05-05 5:37 GMT+03:00 Niclas Hedhman :
> I just tried to download Maven and randomly selected a download mirror;
> http://apache.petsads.us/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
>
> But get presented with a captcha, and I think that is not expected/allowed
> since it might break automated tools.
>

Good question.

For information:

The captcha on that site is shown by CloudFlare, that serves as a
proxy for them,
https://www.cloudflare.com/5xx-error-landing

Maybe there was a reason for it.

http://www.apache.org/info/how-to-mirror.html
"How to become a mirror" - does not say anything about captchas,
though it says "Your mirror must not be shown "inside" another site
using, for instance, frames."

Best regards,
Konstantin Kolinko


Re: Standards for mail archive statistics gathering?

2015-05-05 Thread Boris Baldassari

Hi Folks,

Sorry for the late answer on this thread. Don't know what has been done 
since then, but I've some experience to share on this, so here are my 2c..


* Parsing dates and time zones:
If you are to use Perl, the Date::Parse module handles dates and time 
zones pretty well. As for Python I don't know -- there probably is a 
module for that too..
I used Date::Parse to parse ASF mboxes (notably for Ant and JMeter, the 
data sets have been published here [0]), and it worked great. I do have 
a Perl script to do that, which I can provide -- but I have no access 
I'm aware of in the dev scm, and not sure if Perl is the most common 
language here.. so please let me know.
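
For the record, Python's standard library can do the Date::Parse part; a
small sketch (the header value below is just an example string):

from datetime import timezone
from email.utils import parsedate_to_datetime

raw = "Tue, 5 May 2015 21:26:36 +0200"   # typical mbox "Date:" header value
dt = parsedate_to_datetime(raw)          # timezone-aware datetime (Python 3.3+)
print(dt.isoformat())                    # 2015-05-05T21:26:36+02:00
print(dt.astimezone(timezone.utc))       # normalised to UTC for statistics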


* Parsing mboxes for software repository data mining:
There is a suite of tools exactly targeted at this kind of duty on 
github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I 
don't know how they manage time zones, but the toolsuite is widely used 
around (see [3] or [4] as examples) so I believe they are quite robust. 
It includes tools for data retrieval as well as visualisation.
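
If a lighter-weight route than Metrics Grimoire is wanted, here is a rough
sketch of per-message metadata extraction with Python's standard mailbox
module (the file name is a placeholder, and the chosen fields are just one
reasonable starting point):

import mailbox
from datetime import timezone
from email.utils import parseaddr, parsedate_to_datetime

records = []
for msg in mailbox.mbox("dev-community.mbox"):   # placeholder local archive
    _, sender = parseaddr(msg.get("From", ""))
    date_hdr = msg.get("Date")
    date = parsedate_to_datetime(date_hdr) if date_hdr else None
    records.append({
        "from": sender,
        "date_utc": date.astimezone(timezone.utc).isoformat()
                    if date and date.tzinfo else None,
        "subject": msg.get("Subject", ""),
        "message_id": msg.get("Message-ID", ""),
    })
print(len(records), "messages parsed")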


* As for the feedback/thoughts about the architecture and formats:
I love the REST-API idea proposed by Rob. That's really easy to access 
and retrieve through scripts on-demand. CSV and JSON are my favourite 
formats, because they are, again, easy to parse and widely used -- every 
language and library has some facility to read them natively.
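
As an illustration of that combination -- the URL and field names below are
invented placeholders, not an existing ASF endpoint -- fetching JSON on
demand and flattening it to CSV takes only the standard library:

import csv
import json
from urllib.request import urlopen

URL = "https://stats.example.org/lists/dev-community.json"  # hypothetical endpoint

with urlopen(URL) as resp:
    monthly = json.loads(resp.read().decode("utf-8"))  # assume per-month summaries

with open("list-stats.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=["month", "messages", "participants"])
    writer.writeheader()
    for row in monthly:
        writer.writerow({field: row.get(field) for field in writer.fieldnames})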



Cheers,


[0] http://castalia.solutions/datasets/
[1] https://metricsgrimoire.github.io/
[2] http://bitergia.com
[3] Eclipse Dashboard: http://dashboard.eclipse.org/
[4] OpenStack Dashboard: http://activity.openstack.org/dash/browser/



--
Boris Baldassari
Castalia Solutions -- Elegant Software Engineering
Web: http://castalia.solutions
Phone: +33 6 48 03 82 89


Le 28/04/2015 16:11, Rich Bowen a écrit :



On 04/27/2015 09:36 AM, Shane Curcuru wrote:

I'm interested in working on some visualizations of mailing list
activity over time, in particular some simple analyses, like thread
length/participants and the like.  Given that the raw data can all be
precomputed from mbox archives, is there any semi-standard way to
distill and save metadata about mboxes?

If we had a generic static database of past mail metadata and statistics
(i.e. not details of contents, but perhaps overall # of lines of text or
something), it would be interesting to see what kinds of visualizations
that different people would come up with.

Anyone have pointers to either a data format or the best parsing library
for this?  I'm trying to think ahead, and work on the parsing, storing
statistics, and visualizations as separate pieces so it's easier for
different people to collaborate on something.


Roberto posted something to the list a month or so ago about the 
efforts that he's been working on for this kind of thing. You might 
ping him.


--Rich






Re: DOAP format question

2015-05-05 Thread sebb
On 5 May 2015 at 07:06, Hervé BOUTEMY  wrote:
> Le mardi 5 mai 2015 01:05:31 sebb a écrit :
>> > OK, but that's because whoever coded the XSLT decided to be defensive
>> > against such an interpretation.
>>
>> If you read back in this thread you will see that I did this in order
>> to support both asfext:PMC and asfext:pmc.
>>
>> > But that does not mean it is right.
>>
>> The code is right in the sense that it works with the input files that
>> are provided.
> +1
> that's a temporary workaround that we should try to not need any more in the
> future

Ideally, yes, but as already noted that is not trivial.

> [...]
>
>> >> Also if it is possible to validate that the various RDF files are
>> >> correct according to the formal definitions.
>> >> PMCs could then submit their files for checking.
>> >
>> > I think we can discuss that infrastructure for the new site. I'm happy to
>> > help. Python provides the required libraries. I'll open a thread, probably
>> > tomorrow.
>>
>> I think there needs to be a way for PMCs to check their RDF files
>> against the formal definitions.
>> For example, a CGI script that accepts the URL of a file.
> +1
> I tried W3C checker, but as it is only a syntax checker, it checked only
> syntax, not references to the namespace
> and I couldn't find any other useful tool :(
>
> Other tools to make effective use of the DOAP files would be useful too: but I
> completely agree that the first priority seems to have a more complete checker

It's possible to add warning checks to the cron job scripts, but this
will create a lot of noisy e-mails until projects have been notified
and fixed their files.
Experience shows that fixing DOAPs can take months for some PMCs.

One approach that might be worth trying is creating an additional
on-demand report that checks the list of RDFs for known issues.
Initially it could just check for asfext:PMC, but could be extended as
other issues are found or better syntax checking is available.

Checking for such simple typos could be done with almost any scripting language.

The RDF files are listed in

https://svn.apache.org/repos/asf/infrastructure/site-tools/trunk/projects/files.xml
(DOAPs)
and
https://svn.apache.org/repos/asf/infrastructure/site-tools/trunk/projects/pmc_list.xml
(PMC definitions)

Most of the PMC definitions are stored locally, and have already been fixed.
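
As a sketch of what such an on-demand check could look like (using rdflib,
which Sergio mentioned; the file name is a placeholder, and the asfext
namespace URI is assumed to be the usual
http://projects.apache.org/ns/asfext#):

from rdflib import Graph, URIRef

ASFEXT = "http://projects.apache.org/ns/asfext#"   # assumed namespace URI
WRONG_PMC = URIRef(ASFEXT + "PMC")                 # the known typo to flag

g = Graph()
g.parse("doap_Example.rdf", format="xml")          # placeholder DOAP file

hits = [(s, p, o) for s, p, o in g if WRONG_PMC in (s, p, o)]
if hits:
    print("%d statement(s) use asfext:PMC where asfext:pmc is expected" % len(hits))
else:
    print("no asfext:PMC occurrences found")

Run over the DOAP list from files.xml, something like this would give the
on-demand report without adding noise to the cron-job e-mails.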

> Regards,
>
> Hervé
>
>>
>> > Cheers,
>> >
>> > --
>> > Sergio Fernández
>> > Partner Technology Manager
>> > Redlink GmbH
>> > m: +43 6602747925
>> > e: sergio.fernan...@redlink.co
>> > w: http://redlink.co
>