Has anyone seen a tool to extract a "template" from a set of similar
web pages? We acquired a website that uses the same code across
multiple web pages. Each web page was copied and pasted from the last;
no includes were used. Each is slightly different from the next, even
where they should be the same. (For example, some have <title> tags;
some don't.) To the human eye, it's obvious what's template and what's
content, but I can't do a find/replace because there's no good
pattern to the code.
Adrian Holovaty (creator of ChicagoCrime.org and Django) has a Python
script called templatemaker[1][2], which in theory would do what I
want. You feed it a bunch of similar web pages and it produces a
template with "holes" where the data was different across each web
page. In practice, it's too granular; it doesn't recognize HTML. It
looks at every character. I don't care about spaces between tags. I only care
about substantial content differences across pages. Everything else
can be moved to the template.
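
To make the goal concrete, here is roughly the behavior I have in mind, as a minimal Python sketch and not a finished tool. It compares pages token by token (whole tags and whole text runs) instead of character by character, and it throws away whitespace between tags before comparing. The names tokenize, merge, make_template and the "{{ hole }}" marker are made up for illustration; none of this is templatemaker's API. The regex split is also a naive assumption that will trip on comments, scripts, or a ">" inside an attribute value.

```python
import re
from difflib import SequenceMatcher

HOLE = "{{ hole }}"  # hypothetical placeholder marker, chosen for this sketch


def tokenize(html):
    """Split HTML into tag tokens and text tokens.

    Whitespace runs are collapsed and whitespace-only text is dropped,
    so spacing between tags never counts as a difference.
    """
    tokens = []
    for part in re.split(r"(<[^>]+>)", html):
        text = " ".join(part.split())
        if text:
            tokens.append(text)
    return tokens


def merge(template, tokens):
    """Keep tokens shared by both sequences; put one hole where they differ."""
    matcher = SequenceMatcher(None, template, tokens, autojunk=False)
    merged = []
    for op, a1, a2, _b1, _b2 in matcher.get_opcodes():
        if op == "equal":
            merged.extend(template[a1:a2])
        elif not merged or merged[-1] != HOLE:
            # Any insert/delete/replace becomes a single hole.
            merged.append(HOLE)
    return merged


def make_template(pages):
    """Fold every page into a running template, starting from the first."""
    template = tokenize(pages[0])
    for page in pages[1:]:
        template = merge(template, tokenize(page))
    return template
```

Calling "".join(make_template(pages)) on a list of page strings gives the shared markup with a hole wherever the pages disagree. One consequence of this approach: if some pages lack a <title> element entirely, the whole element ends up inside a hole instead of staying in the template, so that case would still need manual cleanup.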
Any ideas come to mind?
Richard
[1] http://code.google.com/p/templatemaker/
[2] http://www.holovaty.com/blog/archive/2007/07/06/0128
_______________________________________________
UPHPU mailing list
[email protected]
http://uphpu.org/mailman/listinfo/uphpu
IRC: #uphpu on irc.freenode.net