On 02/11/2016 09:15 PM, Patrick Lauer wrote:
> Now instead of looking up [metadata.xml] -> (herd name) -> [herds.xml]
> -> email it goes backwards:
> [metadata.xml] -> (maintainer type=project) -> email -> [projects.xml]
> -> Project name
>
> Since this involves XML and python's ElementTree library it's a
> nontrivial change that also removes a few now useless helpers
> (_get_herd_email has no reason to be, but we'd need a _get_herd_name
> helper instead. Err, get_proj ... ah well, whatever name works)
>
> And all that just so (1) gentoolkit output works and (2) euscan updates
> properly. Both of which I don't really care about much, but now that
> I've invested ~4h into debugging and trying to fix it I'm a tiny bit
> IRRITATED.
>

So this turns out to be more fun than expected.

Having spent a little bit of time staring at XML, DTDs and wondering why
we do things the most difficult way ...

Previously the herd tag was defined as:

    <!ELEMENT herd (#PCDATA)>

So we end up with, for example:

    <herd>kde</herd>

The new schema collapses herd (err, project!) into maintainers (err,
sustainers ... staff ... linchpin?)

And maintainer is defined as:

    <!ELEMENT maintainer ( email, (description| name)* )>

Which means that only email is mandatory. So instead of searching by
name you are now required to search by email. And it leads to
inconsistent (partial) duplication: some metadata.xml entries carry a
name, some a description, and some are email only.

For gentoolkit, for example, this means that search by name has to
become search by email, and the previous search-by-name functionality
requires herds.xml, err, projects.xml to figure out the name of a
project. Which might not match the one in metadata.xml!
(And you may need to filter out maintainers-that-are-not-projects, and
what about maintainers that are undefined? So much extra code
complexity!)

And this is why I avoided the topic and hoped that the 'migration' would
make sense:

(1) Using XML is mildly insane. Neither machine- nor human-readable
(2) The DTD is even more insane, and few people have the patience to
    figure it out
(3) The recent changes to the DTD change the data model in subtle ways
    so that there's even *more* denormalization possible
(4) The tooling is, due to XML, wonderfully horrible and requires things
    like XPath to get the required data (because query by attribute is
    harder than query by tag)

There are fundamental questions that should be handled before doing more
modifications - for example, should the data be more normalized (e.g.
name only in projects.xml / maintainers.xml and only email in
metadata.xml)?
If we allow denormalization, do we have tools to check and autocorrect
(e.g. a maintainer changing name)?
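
To make the reversed lookup and the "name might not match" problem
concrete, here is a rough sketch using Python's ElementTree. It is not
the gentoolkit code: the function names are made up, the sample data is
invented, and the projects.xml layout (a <project> element with <email>
and <name> children) is an assumption rather than something spelled out
above.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the projects.xml layout is an assumption.
PROJECTS_XML = """\
<projects>
  <project>
    <email>kde@gentoo.org</email>
    <name>Gentoo KDE</name>
  </project>
</projects>
"""

METADATA_XML = """\
<pkgmetadata>
  <maintainer type="project">
    <email>kde@gentoo.org</email>
    <name>KDE herd</name>
  </maintainer>
  <maintainer type="person">
    <email>someone@example.org</email>
  </maintainer>
  <maintainer type="project">
    <email>unknown@example.org</email>
  </maintainer>
</pkgmetadata>
"""


def load_project_names(projects_xml):
    """Map project email (lowercased) -> project name."""
    root = ET.fromstring(projects_xml)
    names = {}
    for project in root.iter("project"):
        email = project.findtext("email")
        if email:
            names[email.strip().lower()] = (project.findtext("name") or "").strip()
    return names


def resolve_projects(metadata_xml, project_names):
    """Return (email, canonical_name, local_name) per project maintainer.

    canonical_name is None when the email is not in projects.xml;
    local_name is None when metadata.xml carries no <name>.
    """
    root = ET.fromstring(metadata_xml)
    resolved = []
    # Query by attribute: skips maintainers that are not projects.
    for maint in root.findall("maintainer[@type='project']"):
        email = (maint.findtext("email") or "").strip()
        local = maint.findtext("name")
        local = local.strip() if local else None
        resolved.append((email, project_names.get(email.lower()), local))
    return resolved


def name_mismatches(resolved):
    """Entries where metadata.xml and projects.xml disagree on the name."""
    return [r for r in resolved if r[1] and r[2] and r[1] != r[2]]


if __name__ == "__main__":
    names = load_project_names(PROJECTS_XML)
    for entry in resolve_projects(METADATA_XML, names):
        print(entry)
```

With this sample data the kde entry resolves to "Gentoo KDE" while
metadata.xml says "KDE herd", so name_mismatches() reports it, and the
unknown@example.org project resolves to no name at all. Those are the
two cases any caller now has to handle.
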

Once we decide to abstract it away so that people should use tools and
not mangle it manually (have you looked at herds.xml ?! omg ...) there's
the question ... why XML? It's about the worst format for this job, INI
format is sufficient and easier to parse. Or JSON, or YAML, or whatever
is trendy now. Or do we autogenerate from templates?

Another funny thing: projects.xml is not in the same repository, so
synchronizing changes gets more tricky. And the metadata.dtd is in yet
another place. Wouldn't it make sense to have this organized in a less
confusing way?

You see where this is going - and why I didn't object loudly enough to
the changes: I want to not care about this whole cluster of topics and
do things that are more rewarding. But that choice got taken away when
things broke (oh, they didn't break, they Function Differently now) and
I had to spend some time investigating why things deviate.

Sigh. Am I grumpy?