Re: [Wikitech-l] Complete (basic) analysis of MediaWiki
Hi Andre,

On Dec 3, 2012, at 7:51 PM, Andre Klapper aklap...@wikimedia.org wrote:

> On Mon, 2012-12-03 at 19:40 +0100, Federico Leva (Nemo) wrote:
>> Compare e.g. https://www.ohloh.net/p/mediawiki/contributors?query=&sort=commits
>> Also, https://bugzilla.wikimedia.org/weekly-bug-summary.cgi?tops=10&days=10 proves they're not talking of the whole bugzilla but then they don't say which components.
>
> Would be helpful to mention the exact dataset you refer to.
>
> Also I'd rather challenge weekly-bug-summary.cgi's results: MediaWiki extensions has 2031 open bugs, and only 1883 have been filed in the last 10 days? => 148 bug reports got opened more than 274 years ago? But maybe I fail to read weekly-bug-summary.cgi correctly.

Well, you don't. I think the UI is just misleading because the 10 days are just automatically positioned in the table header. The script does not account for the real age of the bug.

The oldest bug, number 1, was created by Brion on Aug 10, 2004. Between that day and today there are 3039 days (including today). Therefore, replacing the number of days in the request yields the same result. At least in my data, the first bug was closed on May 22, 2005, which is then 2754 days ago... But now it becomes complicated because there are too many changes that do not show up in my data (which is based on the bugzilla API). But you get my point :)

However, I very much like the Bitergia stats - very good first step.

> andre
> --
> Andre Klapper | Wikimedia Bugwrangler
> http://blogs.gnome.org/aklapper/

Claudia

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
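For anyone who wants to re-check the two day counts in the message above, here is a minimal Python sketch. It assumes the reply was written on 2012-12-04; the message does not state its own date, so that value is an assumption chosen because it reproduces both figures. The helper name `days_inclusive` is made up for this example.

```python
from datetime import date


def days_inclusive(start: date, end: date) -> int:
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


# Assumption: "today" in the reply is 2012-12-04 (not stated in the message).
today = date(2012, 12, 4)
first_bug_opened = date(2004, 8, 10)  # bug 1 created, per the message
first_bug_closed = date(2005, 5, 22)  # first closure in that data set, per the message

print(days_inclusive(first_bug_opened, today))  # 3039
print(days_inclusive(first_bug_closed, today))  # 2754
```

With the "including today" convention both numbers match the ones quoted in the reply.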
Re: [Wikitech-l] Complete (basic) analysis of MediaWiki
On Mon, 2012-12-03 at 23:46 +0100, Platonides wrote:

> On 03/12/12 22:45, Jesus M. Gonzalez-Barahona wrote:
>> On Mon, 2012-12-03 at 21:04 +0100, Platonides wrote:
>>> On 03/12/12 19:40, Federico Leva (Nemo) wrote:
>>>> That data is hardly useful, it doesn't explain what it refers to
>>
>> I guess I missed your message, Federico.
>
> He forgot to keep you in CC, so it was sent only to the mailing list.

Thanks. I happen to be a subscriber to the list, but I automatically archive it, so I didn't notice the message. I have seen it now.

>>> I agree a glossary of each term would be useful. It took me a while to realise that committers/closers/senders were the terms used for users of git/bugzilla/mailing list.
>>
>> Well, in fact committers are committers to the git repository (you also have authors, see below), closers are specifically ticket closers (you also have people opening or changing tickets) and senders are indeed senders of mailing list messages. We'll work to make this much clearer. Again, thanks for the suggestion.
>>
>>> They should track authors instead of committers, though (preferably skipping merge commits)
>>
>> We do both. In the summary (main) page you have authors in the summary (orange) chart, since authors seemed more meaningful than committers. Same for the blue chart in that page. In the source code page you have committers (orange chart) and both authors and committers (blue charts), for a more detailed comparison.
>
> I was specifically thinking of the table of Top committers. Also, the summary page has an authors graph but http://bitergia.com/public/previews/2012_11_mediawiki/scm.html has a committers one.
>
> When the committer is different than the author there are usually two options:
> - It was a merge and the committer is 'gerrit'.
> - The patchset was (slightly) changed by the committer from the original by the author.
> There's also a less common one of committing a patch from a different source, such as a bugzilla patch.

Thanks for the info.

> The number of commits by gerrit is meaningless, and committers with small changes inflate some numbers but are not too useful. The number of comments / approvals in gerrit would be more appropriate than that.
>
> Equally, the author field of merges should IMHO be ignored since that's not a commit which really touches the code (could be measured in a different statistic), so many commits produce two entries.

You're right, thanks. However, we intended to measure raw commits, as found in the git repo. One of the filters after this first pass filters out those commits. You're right that in this case at least, those numbers would be more appropriate.

>>> Seems that Jesús did a fine job. It could be polished quite a bit more with some local knowledge, merging users, hiding bots, etc.
>>
>> Thanks a lot. We usually go, after this first stage, with that identification of bots, unification of identities, identification of large commits, classification of different kinds of tickets, etc. In this case, we were mainly testing the automated (first) stage: the second one, as you mention, usually needs some detailed knowledge about the project, and some manual intervention.
>
> Sure. I wasn't intending to put pressure on you.

None taken ;-)

> A few quirks I noticed:
> - nore...@sourceforge.net is abusing its second place as sender (2525).
> - I bet the two brion are the same, with different emails (4561+1285=5846, wow!)

We will look into that.

> l10n-bot is indeed a bot. On svn, localisation commits weren't done with a specific account, but you can look for commit messages like «Localisation updates for core and extension messages from translatewiki.net»

ok

> Commits migrated from svn will have emails of usern...@users.mediawiki.org. All commits done on git use a different one. Moreover, some people have used different mails (see other threads on the mailing list about this in ohloh).

We noticed this. But it is a bit risky to assume that x...@users.mediawiki.org is the same as xxx@otherdomain: probably in most cases the merge is perfect, but in some it could be wrong. We preferred to provide the raw numbers since we didn't have the resources to do the manual checking needed. But we can try to produce accumulated stats assuming that match: probably it would be accurate enough.

[...]

Regards,

Jesus.
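The "accumulated stats assuming that match" idea from the message above could look roughly like the following Python sketch. It is not Bitergia's tooling: the function name `merge_by_local_part` and the example.org/example.com addresses are hypothetical, and only the two counts for "brion" are taken from the thread. It merges counts on the part of the address before the `@` and keeps the list of source addresses, so the risky cases (one name seen under several domains) can still be checked by hand.

```python
from collections import defaultdict


def merge_by_local_part(counts):
    """Sum per-email counts under the part before '@'.

    Returns (totals, sources): totals maps local part -> summed count,
    sources maps local part -> sorted list of the addresses merged into it.
    """
    totals = defaultdict(int)
    sources = defaultdict(set)
    for email, n in counts.items():
        local = email.split("@", 1)[0].lower()
        totals[local] += n
        sources[local].add(email)
    return dict(totals), {k: sorted(v) for k, v in sources.items()}


# Hypothetical addresses; only the two "brion" counts come from the thread.
raw = {
    "brion@users.example.org": 4561,
    "brion@other.example.com": 1285,
    "someone@users.example.org": 10,
}

totals, sources = merge_by_local_part(raw)
print(totals["brion"])  # 5846

# Identities built from more than one address are the ones worth a manual look.
needs_review = {name: addrs for name, addrs in sources.items() if len(addrs) > 1}
print(needs_review)
```

Publishing the `needs_review` list next to the merged totals would keep the raw numbers recoverable when a match turns out to be wrong.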
Re: [Wikitech-l] Complete (basic) analysis of MediaWiki
On Mon, 2012-12-03 at 19:40 +0100, Federico Leva (Nemo) wrote:

> [...]
> Also, https://bugzilla.wikimedia.org/weekly-bug-summary.cgi?tops=10&days=10 proves they're not talking of the whole bugzilla but then they don't say which components.

Our mining is for the MediaWiki product. In particular, the url we're using is:

https://bugzilla.wikimedia.org/buglist.cgi?product=MediaWiki

If you look in the bicho database available at http://bitergia.com/public/previews/2012_11_mediawiki/data/db/ you can count the tickets:

    mysql> select count(id) from issues;
    +-----------+
    | count(id) |
    +-----------+
    |     19776 |
    +-----------+

This is consistent with the 19953 tickets that I can see right now in Bugzilla.

You're right that it is not obvious that we're only considering this product, we're fixing that. Thanks for the advice!

Jesus.
Re: [Wikitech-l] Complete (basic) analysis of MediaWiki
On Mon, 2012-12-03 at 19:40 +0100, Federico Leva (Nemo) wrote:

> That data is hardly useful, it doesn't explain what it refers to and, even when it does, seems wrong. Compare e.g. https://www.ohloh.net/p/mediawiki/contributors?query=&sort=commits
> [...]

ok, some info about this one. It seems Ohloh is counting commits in the master branch. If you just use the git log to get the main stats:

    $ git log --format=format:%ae > Authors
    $ grep brion Authors | wc -l
    4493
    $ grep tstarling Authors | wc -l
    2554

which is pretty much what you see in Ohloh. In our case, we're counting *all* the activity in the repository (all branches):

    $ git log --all --format=format:%ae > Authors
    $ grep brion Authors | wc -l
    5425
    $ grep tstarling Authors | wc -l
    3068

Which is pretty much our data.

To be honest, I'm not sure which one (counting only the master branch, or all branches) is better: probably we should be providing both, or even a separate count for each branch, so that users may decide which data better suits their needs. I'll take note of this.

Again, thanks for pointing it out.

Regards,

Jesus.
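The master-versus-all-branches comparison in the message above can also be scripted. The Python sketch below is only an illustration under stated assumptions: the helper names `author_emails` and `count_matching` are invented here, `git` is assumed to be on the PATH, and the sample addresses are placeholders. `count_matching` mirrors `grep NAME Authors | wc -l` by counting lines that contain the given substring.

```python
import subprocess


def author_emails(repo_path, all_branches=False):
    """Return one author email per commit, as printed by `git log --format=format:%ae`.

    With all_branches=True the `--all` option is added, so commits reachable
    from any ref are listed instead of only those reachable from HEAD.
    """
    cmd = ["git", "-C", repo_path, "log", "--format=format:%ae"]
    if all_branches:
        cmd.append("--all")
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.splitlines()


def count_matching(emails, needle):
    """Count entries containing needle, like `grep needle | wc -l`."""
    return sum(1 for email in emails if needle in email)


# Placeholder data standing in for the contents of the Authors file.
sample = ["brion@example.org", "tstarling@example.org", "brion@example.com"]
print(count_matching(sample, "brion"))      # 2
print(count_matching(sample, "tstarling"))  # 1
```

In a real checkout, `count_matching(author_emails("."), "brion")` and `count_matching(author_emails(".", all_branches=True), "brion")` would give the two kinds of figures compared in the message; being a substring match, it shares the limitation of the `grep` approach that any address containing the name is counted.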
[Wikitech-l] Complete (basic) analysis of MediaWiki
If last October we got a bunch of MediaWiki developer stats thanks to the aggregation of data by Ohloh [1], now we are getting plenty more stats from Bitergia, including data from bug reporting and mailing lists:

http://blog.bitergia.com/2012/12/03/complete-basic-analysis-of-mediawiki/

Bitergia is a company based in Madrid formed by a small team of developers that have been working on FLOSS stats software for a long time. All the tools they develop are free software, publicly available and open to contributions.

They have been kind enough to contribute some time and work setting up stats for the MediaWiki community. They also welcome feedback about the service and the data collected. I'm CCing Jesús M. González-Barahona, who has been my regular contact for this task in the past weeks.

All good news for http://www.mediawiki.org/wiki/Community_Metrics !

[1] https://www.ohloh.net/orgs/wikimedia

--
Quim Gil
Technical Contributor Coordinator
Wikimedia Foundation
Re: [Wikitech-l] Complete (basic) analysis of MediaWiki
That data is hardly useful, it doesn't explain what it refers to and, even when it does, seems wrong. Compare e.g. https://www.ohloh.net/p/mediawiki/contributors?query=&sort=commits

Also, https://bugzilla.wikimedia.org/weekly-bug-summary.cgi?tops=10&days=10 proves they're not talking of the whole bugzilla but then they don't say which components.

Nemo
Re: [Wikitech-l] Complete (basic) analysis of MediaWiki
On Mon, 2012-12-03 at 19:40 +0100, Federico Leva (Nemo) wrote:

> Compare e.g. https://www.ohloh.net/p/mediawiki/contributors?query=&sort=commits
> Also, https://bugzilla.wikimedia.org/weekly-bug-summary.cgi?tops=10&days=10 proves they're not talking of the whole bugzilla but then they don't say which components.

Would be helpful to mention the exact dataset you refer to.

Also I'd rather challenge weekly-bug-summary.cgi's results: MediaWiki extensions has 2031 open bugs, and only 1883 have been filed in the last 10 days? => 148 bug reports got opened more than 274 years ago? But maybe I fail to read weekly-bug-summary.cgi correctly.

andre
--
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/
Re: [Wikitech-l] Complete (basic) analysis of MediaWiki
On 03/12/12 19:40, Federico Leva (Nemo) wrote:

> That data is hardly useful, it doesn't explain what it refers to

I agree a glossary of each term would be useful. It took me a while to realise that committers/closers/senders were the terms used for users of git/bugzilla/mailing list.

They should track authors instead of committers, though (preferably skipping merge commits)

> Also, https://bugzilla.wikimedia.org/weekly-bug-summary.cgi?tops=10&days=10 proves they're not talking of the whole bugzilla but then they don't say which components.
>
> Nemo

Looking at http://bitergia.com/public/previews/2012_11_mediawiki/data/db/acs_bicho_mediawiki.sql.bz2 they seem to have obtained data from bugs 1 to 19775. So it's not that they skipped bugs based on components.

Seems that Jesús did a fine job. It could be polished quite a bit more with some local knowledge, merging users, hiding bots, etc.

I would also change the layout of the summary page, making the graphs larger and placing the tables below. Plus some cosmetics: empty brackets, missing name...
Re: [Wikitech-l] Complete (basic) analysis of MediaWiki
On 03/12/12 22:45, Jesus M. Gonzalez-Barahona wrote:

> On Mon, 2012-12-03 at 21:04 +0100, Platonides wrote:
>> On 03/12/12 19:40, Federico Leva (Nemo) wrote:
>>> That data is hardly useful, it doesn't explain what it refers to
>
> I guess I missed your message, Federico.

He forgot to keep you in CC, so it was sent only to the mailing list.

>> I agree a glossary of each term would be useful. It took me a while to realise that committers/closers/senders were the terms used for users of git/bugzilla/mailing list.
>
> Well, in fact committers are committers to the git repository (you also have authors, see below), closers are specifically ticket closers (you also have people opening or changing tickets) and senders are indeed senders of mailing list messages. We'll work to make this much clearer. Again, thanks for the suggestion.
>
>> They should track authors instead of committers, though (preferably skipping merge commits)
>
> We do both. In the summary (main) page you have authors in the summary (orange) chart, since authors seemed more meaningful than committers. Same for the blue chart in that page. In the source code page you have committers (orange chart) and both authors and committers (blue charts), for a more detailed comparison.

I was specifically thinking of the table of Top committers. Also, the summary page has an authors graph but http://bitergia.com/public/previews/2012_11_mediawiki/scm.html has a committers one.

When the committer is different than the author there are usually two options:
- It was a merge and the committer is 'gerrit'.
- The patchset was (slightly) changed by the committer from the original by the author.
There's also a less common one of committing a patch from a different source, such as a bugzilla patch.

The number of commits by gerrit is meaningless, and committers with small changes inflate some numbers but are not too useful. The number of comments / approvals in gerrit would be more appropriate than that.

Equally, the author field of merges should IMHO be ignored since that's not a commit which really touches the code (could be measured in a different statistic), so many commits produce two entries.

>> Seems that Jesús did a fine job. It could be polished quite a bit more with some local knowledge, merging users, hiding bots, etc.
>
> Thanks a lot. We usually go, after this first stage, with that identification of bots, unification of identities, identification of large commits, classification of different kinds of tickets, etc. In this case, we were mainly testing the automated (first) stage: the second one, as you mention, usually needs some detailed knowledge about the project, and some manual intervention.

Sure. I wasn't intending to put pressure on you.

A few quirks I noticed:
- nore...@sourceforge.net is abusing its second place as sender (2525).
- I bet the two brion are the same, with different emails (4561+1285=5846, wow!)

l10n-bot is indeed a bot. On svn, localisation commits weren't done with a specific account, but you can look for commit messages like «Localisation updates for core and extension messages from translatewiki.net»

Commits migrated from svn will have emails of usern...@users.mediawiki.org. All commits done on git use a different one. Moreover, some people have used different mails (see other threads on the mailing list about this in ohloh).

>> I would also change the layout of the summary page, making the graphs larger and placing the tables below. Plus some cosmetics: empty brackets, missing name...
>
> This is a very good point, and something we didn't put much work into. I'll take note. Thanks a lot for the feedback!

You are welcome!
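The filtering suggested in the message above (ignore merge commits, hide bot accounts, count by author rather than committer) might be sketched as follows in Python. Everything here is illustrative: the `Commit` record, the `top_authors` helper, the sample names and the contents of `BOTS` are assumptions made for the example, with only 'gerrit' and 'l10n-bot' taken from the thread. Merge commits are recognised by having more than one parent, which is also what `git log --no-merges` excludes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    author: str     # who wrote the change
    committer: str  # who applied it to the repository
    parents: int    # number of parent commits; more than 1 means a merge


# Illustrative list; the thread names 'gerrit' and 'l10n-bot'.
BOTS = frozenset({"gerrit", "l10n-bot"})


def top_authors(commits, bots=BOTS):
    """Count commits per author, skipping merge commits and known bot accounts."""
    counts = Counter()
    for commit in commits:
        if commit.parents > 1:     # merge commit: does not touch code itself
            continue
        if commit.author in bots:  # automated account
            continue
        counts[commit.author] += 1
    return counts


# Hypothetical history: two real changes by alice, one by bob,
# a merge recorded with committer 'gerrit', and a localisation update by a bot.
history = [
    Commit(author="alice", committer="alice", parents=1),
    Commit(author="alice", committer="bob", parents=1),
    Commit(author="bob", committer="bob", parents=1),
    Commit(author="bob", committer="gerrit", parents=2),
    Commit(author="l10n-bot", committer="l10n-bot", parents=1),
]

print(top_authors(history).most_common())  # [('alice', 2), ('bob', 1)]
```

Counting on `author` means the second commit is credited to alice even though bob committed it, which is the behaviour asked for in the "Top committers" remark; merges could be tallied separately with the same loop if a statistic for them is wanted.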