Thanks, I will. A final update for anyone who may want to use the code above: don't trust the results. I've just tested it against a web crawler I've been using for quite a while, and in many cases there is still breakage, leaving some results out.
Until the parser can repair broken HTML, I recommend using a webdriver and querying the DOM there. [https://github.com/dom96/webdriver](https://github.com/dom96/webdriver) is a good starting point.