Re: [Wikitech-l] RFC: Refactoring the Title object

Daniel Kinzler Wed, 30 Oct 2013 07:12:23 -0700

Am 10.10.2013 18:40, schrieb Rob Lanphier:
> Hi folks,
> 
> I think Daniel buried the lede here (see his mail below), so I'm
> mailing this out with a subject line that will hopefully provoke more
> discussion.  :-)


Thanks for bumping this, Rob. And thanks to Tim for moderating this discussion
so far, and to everyone for the criticism.

And sorry for my late reply, I'm just now catching up on email after a brief
vacation and a few days of conferencing.

I'll reply to the various comments in this thread in this mail, to keep the
discussion focused. I hope I managed to reply to all the relevant points.

First off, TLDR:

* I maintain that dependency injection is useful, and less painful than one
might think.
* I'm open to discussion about whether to use a namespace ID or a canonical
namespace name in the TitleValue object
* I'm interested in discussing how to best slice and dice the parser and
formatter services.

So, here are my replies to some comments:

> I agree with this as well. The idea behind this RFC is the "hair can't cut
> itself" pattern. However, a value object needs to be easily serializable.
> So what representation is used for serializing a TitleValue? It can't be
> the display title or DB key since that's part of the TitleFormatter class. 

TitleValue as proposed can be serialized without any problems using standard PHP
serialization, or as a JSON structure containing the namespace ID and name
string. Or we can come up with something nicer, like $nsid:$title or some such.
The current Title object cannot be serialized at all directly.

> Maybe, but the RFC says, "As a value object, this has no knowledge about
> namespaces, permissions, etc.". I think there comes a point when you have
> to acknowledge that some properties of Title objects are indeed part of the
> value object

The point is avoiding access to config information (which is global state).
Namespace names are configuration and require global state for
normalization/lookup. This should not be done in a value object.

> I read it as TitleValue doesn't know about texual namespaces like
> Category:, Project:, or User:. But just contains an integer namespace
> such as `4`.

I understand that using the numeric namespace ID is controversial. TitleValue
could also require the canonical (!) namespace name to be provided to the
constructor, instead of the numeric id. This would make it harder though to use
it for DB queries.

> $title = Title::newFromText( $text );
> if ( $title ) {
>   return $title->getLocalUrl( 'foo=bar' );
> }

newFromText() uses global state to look up namespaces, interwiki prefixes, title
normalization options, etc. If we want to get away from relying on global state,
we have to avoid this.

getLocalUrl() uses global state to construct the URL using the wiki's base URL.
Again, if we want to have reusable, testable code, this must be avoided.

Sure, global state is convenient. You don't have to think about what information
is needed where, you don't have to be explicit about your dependencies. This
makes it easier to write code. It makes it very hard to understand, test, reuse
or refactor the code, though.

Yes, it's more terse and "easier to read". But it hides information flow and
implicit dependencies, making the "easy read" quite misleading. It makes it
harder to truly understand what is going on.

> $tp = new TextTitleParser();
> try {
>   $title = $tp->parse( $text );
> $tf = new UrlTitleFormatter( $title, 'foo=bar );
>   return $tf->format();
> } catch( MWException ) {
>   return null;
> }

As Jeroen already pointed out in his reply, you should rarely have the need to
turn a page title string into a URL. When generating the URL, you would
typically already have a TitleValue object (if not, ask yourself why not).

Catching exceptions should be done as late as possible - ideally, in the
presentation layer. Generally, throwing an exception is preferable to returning
a "special" value, so the try/catch would not be here in the code.

However, you are right that being explicit about which information and services
are needed where means writing more code. That's what "explicit" means. If we
agree that it's a good thing to explicitly expose information flow and
dependencies, then this implies that we need to actually write the additional
(declarative) code.

> Maybe my hair can't cut itself, but I can cut my own hair without having to
> find a Barber instance. ;)

But you will need to find a Scissors instance.

> Why do we need
> a separate TitleParser and TitleFormatter? They use the same configuration
> data to do complementary operations. And then there's talk about making
> "TitleFormatter" a one-method class that has subclasses for every different
> way to format a title, and about having a subclass for wikilinks (huh?)
> which has subclasses for internal and external and interwiki links (wtf?),
> and so on.

I'm very much open to discussion regarding how the parser/renderer services are
designed. For example:

* It would probably be fine to have a single class for parsing titles in
wiki-link syntax, and generating wiki-link style titles.
* I'm reluctant to include functions for generating URLs / HTML links in the
same class (this needs additional information / services ).
* Beware that TitleValue objects represent *titles*, not links - they do not
handle interwiki prefixes and such. An interwiki link would need to be
represented by a different object (because of the prefix, but also because of
namespace IDs vs. namespace names - there is room for discussion here).

> High-level methods to actually do anything useful like "move a
> page" or "parse some wikitext" seem like they are going to need to take
> dozens of these objects to be able to actually do what they need to do.

I read this as: high level functions rely on multiple services (which is what
makes them high level), which we then can control, configure and replace
individually. This is a good thing.

> (And yes, I know the argument is that then the useful function should be
> split into multiple smaller functions with fewer arguments, to which I
> reply "you just pushed the problem up the call stack, you didn't solve it").

No - I *distributed* it along the call stack in a sensible way.

> So then the RFC proposes a "ServiceRegistry" to try to get around this
> problem of everything requiring dozens of arguments. But after observing
> that RequestContext already *is* this, it then argues against using
> RequestContext on the basis of additional dependencies.

The difference is that RequestContext is passed around the code a lot, while a
registry should be passed around as little as possible. Access to such a
registry should be limited to code that, for some reason, cannot have the
services themselves injected.

A lot of code uses (read: depends on) RequestContext. Thus, all that code
depends on anything RequestContext depends on. Adding more things to
RequestContext means making a lot of code depend on more things, making it
harder to test, understand and re-use.

"Kitchen-Sink" objects are bad, I'm trying to avoid making the kitchen sink 
bigger.

> Why not have one
> ServiceRegistry, maybe even part of RequestContext, and use it
> preferentially for passing services instead of just as a fallback? 

Because then everything depends on all the services.

That is, to use or test something that needs e.g. a UserPermissionService, you'd
also have to provide all the other services, our you'd at least somehow have to
know which services it needs, without this being declared explicitly anywhere.

> Do we
> really have the need for multiple differently-configured services all over
> the place in "real" code that we need to be passing them around
> individually, or is it just for testing where we could as easily solve the
> problem by creating a registry instance that throws errors for all the
> services not needed for each test?

The point is modularity. Modularity makes it easier to re-use code, easier to
understand it, to refactor it. Unit tests just expose dependency issues nicely,
because they require modular isolation per definition.

But there are numerous examples in the MediaWiki code base where it is obvious
that splitting things into services would have been beneficial, and would have
avoided duplicate code and annoying workarounds.

The fact that Title needs the database and thus makes all tests need the
database is just one glaring example. Another example is that the underlying
logic of special pages is often the same as the underlying logic of API modules.
But currently, the logic on special pages is mixed up with HTML generation, and
the code in API modules is mixed with parameter processing and result
generation, making re-use hard and leading to redundant code.

> At the top level something *does*
> have to do all three things, even if it does it by calling into various
> "trees" and each of those trees has its own branches and each branch has
> its own twigs and so on.

Yes, but it then doesn't have to know *how* these things are done, and the logic
for doing these things can easily be replaced.

> The claimed problem behind a lot of this is "too many dependencies" making
> things hard to test and the idea that you can somehow make this go away by
> dividing everything into tinier and tinier pieces. To some extent this
> works, but at the cost of making the system as a whole harder to understand
> because you have to track all the little pieces.

You *don't* have to track the pieces, because they are all mentioned explicitly,
and your editor (hopefully) allows you to navigate between them easily.

In my experience, what makes a system hard to understand are hidden dependencies
and side effects (Title is full of those). Splitting things into mode classes
and smaller functions leads to code that is more easily readable.

It increases the number of files, yes. But files are really not the entity we
should use to think about software architecture. If the number of files is an
issue, something is wrong with the tools we use.

> I doubt MediaWiki has
> reached the point of diminishing returns on that, but I'm not really sure
> that the end-goal envisioned here is the *right* division.

Well, how would *you* re-factor the Title class, then?

> So in the end you either have to also
> test the thing as a whole anyways. Or you end up with unit tests that
> are even less useful than you started because they won't catch
> regressions they would've originally.

Integration tests are useful, but they are no replacement for unit tests (nor
vice versa). One big is the number of code pathes to cover (another is the
effort of locating errors).

A heavily simplified example: We have three methods to test, where a() uses b()
and b() uses c(). Let's say each of these has 3 code paths to test.

If we can test a(), b(), and c() separately (making a() use a mock version of
b(), etc), we need 5+5+5 = 15 test cases for full coverage.

If we use an integration test, we'll need one test case for every possible
combination of code pathes - that's 5*5*5 = 225 test cases.

In practice, we'll not do this. We'll test, say, 20 typical cases. Which means a
lot of code paths are not covered, and may well lead to problems.

> Would it make sense to cut a TitleValue branch as a playground? If we
> can get some code before the January summit, that would give us all a
> bit more experience. 

My experience from introducing the ContentHandler makes me very wary of using a
development branch. Rebasing that over and over is going to be very painful.

-- daniel


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] RFC: Refactoring the Title object

Reply via email to