Hi Nick!

> I'm not sure where and how you're manipulating the DOM but I'd also
> be curious as to how it works with potentially horribly XML unfriendly
> content eg something that has been posted that originated in Microsoft
> Word for example. I just remember in some of the PHP4
> XML based templating engines I played with that they had a tendency
> to choke on the kind of real world content that users put in.

Yes, I was also thinking of Word and the like when implementing the
DOM-based approach :/
Initially, I used regular expressions to find all <a href> links and form
actions for link replacement. Unfortunately, I'm no regex expert, so that
turned out to be quite difficult for me. I was also afraid that a regex
would only solve part of the problem: it might cope with both valid and
malformed documents, but still miss some of the forms that href links and
form actions can take.
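
To illustrate the kind of variation I mean, here is a minimal sketch. The
function name, the pattern and the sample strings are made up for this mail
and are not taken from the portlet code; the point is only that a pattern
written for one quoting style silently skips the others.

```php
<?php
// Hypothetical helper: returns the href value of the first <a> tag,
// or null if the (deliberately naive) pattern does not match.
function extractHrefNaive($html)
{
    // Only understands double-quoted attribute values.
    $pattern = '/<a\s[^>]*href="([^"]*)"/i';
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}

$samples = array(
    '<a href="page.php?id=1">double quotes</a>',
    "<a href='page.php?id=2'>single quotes</a>",
    '<A HREF=page.php?id=3>no quotes</A>',
);

foreach ($samples as $html) {
    $href = extractHrefNaive($html);
    echo ($href === null ? '(no match)' : $href), "\n";
}
```

Each additional quoting style, attribute order or line break needs its own
extension to the pattern, which is where I gave up.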

The current code basically looks like this:

$responseDoc = new DOMDocument();
$responseDoc->loadHTML($response);

// process the form action links
$formTags = $responseDoc->getElementsByTagName("form");
foreach ($formTags as $formTag)
{
  if ($formTag->hasAttribute("action"))
  {
    $action = $formTag->getAttribute("action");
    $newAction = $this->_postProcessUrl($action,
                   $previousPortletactionParam);
    $formTag->setAttribute("action", $newAction);
  }
}

which was really easy to implement. Do you see a chance to improve the
parsing part?
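
For anyone who wants to try this against their own ugly content, here is a
self-contained sketch of the same idea. It is not the actual portlet code:
rewriteLinks() and the callback are hypothetical stand-ins for
_postProcessUrl(), and the libxml_use_internal_errors() call is one possible
way to keep malformed markup from flooding the log with parser warnings.

```php
<?php
// Sketch only: rewriteLinks() and $rewrite are hypothetical names.
function rewriteLinks($html, $rewrite)
{
    // Collect libxml parse errors internally instead of raising a
    // PHP warning for every piece of malformed markup.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // tag name => attribute that holds the URL
    $targets = array('a' => 'href', 'form' => 'action');
    foreach ($targets as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            if ($node->hasAttribute($attr)) {
                $old = $node->getAttribute($attr);
                $node->setAttribute($attr, call_user_func($rewrite, $old));
            }
        }
    }

    // Serialises the whole document, including the html/body wrapper
    // that loadHTML() adds around fragments.
    return $doc->saveHTML();
}

// Deliberately sloppy input: unquoted attribute, unclosed <p>.
$sloppy = '<p><a href=page.php?id=1>link</a>'
        . '<form action="save.php"><p>unclosed</form>';

echo rewriteLinks($sloppy, function ($url) {
    return 'portal.php?target=' . urlencode($url);
});
```

Handling both tag types through one lookup table keeps the loop identical for
links and forms, so further attributes could be added in one place.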

regs,

Stephan
