I wonder how many points you get for nerd snipping a software engineer :)

Wonderful work Daniel, kudos!
In the given list of bad dependencies, I find User class depending on Skin
class one of the most problematic ones. Also, ApiQuery class depending on
all of its subclasses is one of the biggest issue making all of API query
modules act as giant big monolith. That can (hopefully) easily addressed
and untangle significant portion of the 40%, a rather low hanging fruit.

Best

On Fri, May 10, 2019 at 6:39 PM Daniel Kinzler <dkinz...@wikimedia.org>
wrote:

> Hi all!
>
> I invite you to try out "Project Ruprecht"[1][2], a tool that measures the
> "tangledness" of PHP code, and provides you with a "naughty list" of
> things to fix.
>
> For now, you will have to install this locally. I hope however to soon
> have this
> run automatically against core on a regular basis, perhaps by integrating
> it
> with SonarQube. Maybe some day we can also integrate it with CI, to
> generate a
> warning when a new cyclic dependency is about to be introduced.
>
> So that was the tl;dr. Now for some context, history, and shout-outs. And
> some
> actual real world science, too!
>
> For a while now, I have been talking about the how much of a problem cyclic
> dependencies are in MediaWiki core: When two components (classes,
> namespaces,
> libraries, whatever) depend on each other, directly or indirectly, this
> means
> that one cannot be used without the other, nor can it be tested,
> understood, or
> modified without also considering the other. So, in effect, they behave as
> *one*
> component, not two. Applied to MW core, this means that roughly half of
> our 1600
> classes effectively behave like a single giant class. This makes the code
> rather
> hard to deal with.
>
> To fix this, I have been looking for tools that let me identify "tangles"
> of
> classes that depend on each other, and metrics' to measure the progress of
> "untangling" the code. However, the classic code quality metrics focus on
> "local" properties of the code, so they can't tell us much about the
> progress of
> untangling. And the tools I found that would detect cyclic dependencies in
> PHP
> code would all choke on MediaWiki core: they would try to list all detected
> cycles - which, by the super-exponential nature of possible paths through a
> graph, would be millions and millions. So, the tools would choke and die.
> That
> approach isn't practical for us.
>
> Two discoveries allowed me to come up with a working solution:
> First, I decided to leave the PHP world and turned towards graph analysis
> tools
> built for large data sets. Python's graph-tool did the trick. It's build
> on top
> of boost and numpy, and it's *fast*. It crunched through the 7500 or so
> class
> dependencies in MW core in a split second, and told me that we have 14
> "tangles"
> (non-trivial strongly connected components), and that 43% of our classes
> are in
> these tangles, with 40% being part of one big tangle that is essentially
> our
> monolith manifest. So now I had a metric to work with: the number of
> classes in
> tangles.
>
> That was great, but still didn't tell me where to start. Graph-tool was
> still
> not fast enough to deal with millions of cycles, and even if it had been,
> that
> data wouldn't be very useful. I needed some smart heuristics. Luckily, I
> (totally unintentionally, promise!) nerd sniped[5] Amir Sarabadani one
> evening
> at the WMDE office by telling him about this problem. The next day, he
> told me
> that he had been digging into the problem all night, and he had found a
> paper
> that sounded relevant, and it also came with working code: "Breaking
> Cycles in
> Noisy Hierarchies"[3] by J. Sun, D. Ajwani, P.K. Nicholson, A. Sala, and S.
> Parthasarathy. I played with the code a bit, and yes! It spat out a list
> of 290
> or so dependencies[4] that it thought were bad - and I agree for a good
> number
> of them. It's not a clean working list, but it gives a very good idea of
> where
> to start looking.
>
> I find it quite fascinating that this works so well for cleaning up a
> codebase.
> After all, the heuristic wasn't design for this - it was designed for
> fixing
> messy ontologies. Indeed, one of their test data sets was (English
> language)
> Wikipedia's category system! I'd love to see what it does with Wikidata's
> subclass hierarchy :)
>
> But I suppose it makes sense - dependencies in software are conceptually a
> lot
> like an ontology, and the same strategies of stratification and abstraction
> apply. And the same difficulties, too - it's easy enough to spot a
> problematic
> cycle, but often hard to say where it should be cut. And how to cut it -
> often,
> the solution is not to just remove the dependency, but to introduce a new
> abstraction that allows the relationship to exist without a cycle. I'd
> love to
> see the research continue in that direction!
>
> So, a big shout out to the researchers, and to Amir who found the paper!
>
> I hope my ramblings have made you curious to play with Ruprecht, and see
> what it
> has to say about other code bases. There's also another feature to play
> with
> which I haven't discussed here: detection of risky classes using the Page
> Rank
> algorithm. Fun!
>
> Cheers,
> Daniel
>
> [1] https://phabricator.wikimedia.org/diffusion/MTDA/repository/master/
> [2]
>
> https://gerrit.wikimedia.org/r/admin/projects/mediawiki/tools/dependency-analysis
> [3] https://github.com/zhenv5/breaking_cycles_in_noisy_hierarchies
> [4] https://phabricator.wikimedia.org/P8513
> [5] https://xkcd.com/356/
>
> --
> Daniel Kinzler
> Principal Software Engineer, Core Platform
> Wikimedia Foundation
>


-- 
Amir (he/him)
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Reply via email to