Danny,
Would it be possible to convert the HTML you’re analyzing from a string into an 
XML structure? With XML, link detection would be a trivial XPath 
expression. If the HTML doesn’t parse as well-formed XML out of the box, you can 
take a look at the Tidy integration that’s built into MarkLogic 
<http://developer.marklogic.com:8040/5.0doc/docapp.xqy#search.xqy?query=tidy>.
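A minimal sketch of that approach, assuming the built-in xdmp:tidy function, which returns a status node followed by the cleaned-up XHTML. The sample markup bound to $htmlBody is made up for illustration; in the original question it would come from $asset/assetContent/htmlBody/string().

```xquery
xquery version "1.0-ml";

(: Hypothetical sample input standing in for
   $asset/assetContent/htmlBody/string() :)
let $htmlBody := '<p>See <a href="http://www.marklogic.com/">MarkLogic</a><br>for details'

(: xdmp:tidy returns two nodes: a status report, then the tidied document :)
let $tidied := xdmp:tidy($htmlBody)[2]

(: Link detection is now a plain XPath expression. The *:a name test
   matches anchors whatever namespace Tidy puts them in. :)
for $a in $tidied//*:a[@href]
return
  <link href="{ $a/@href }">{ fn:string($a) }</link>
```

The wildcard name test is a deliberate choice here: Tidy output is normally in the XHTML namespace, and *:a avoids having to declare it.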

Justin

Justin Makeig
Director, Product Management
MarkLogic Corporation
[email protected]
Phone: +1 650 655 2387
http://www.marklogic.com/

On Jul 16, 2012, at 11:51 AM, Danny Sinang wrote:

Hi,

I'm trying to use a regex to detect HTML anchor tags.

So far, my Googling has yielded this as the best regex to use:
<a[\s]+[^>]*?href[\s]?=[\s\"\']*(.*?)[\"\']*.*?>([^<]+|.*?)?<\/a>

My problem is: how do I assign that to a variable in XQuery so I can call 
fn:analyze-string()? I was hoping to do it this way:

let $htmlBody := $asset/assetContent/htmlBody/string()
let $pattern := 
<a[\s]+[^>]*?href[\s]?=[\s\"\']*(.*?)[\"\']*.*?>([^<]+|.*?)?<\/a>
return fn:analyze-string($htmlBody, $pattern)

But I can't enclose the regex in either single or double quotes.

Any ideas?

Regards,
Danny
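
One possible way to write the pattern above as an XQuery string literal, offered as a sketch rather than a tested answer. It rests on two points: inside a double-quoted literal, a double quote is written as "", and the XPath regex dialect does not accept \", \' or \/ as escapes, so those backslashes are dropped. The sample $htmlBody value is hypothetical.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample input standing in for
   $asset/assetContent/htmlBody/string() :)
let $htmlBody := '<p>See <a href="http://www.marklogic.com/">MarkLogic</a></p>'

(: Double-quoted literal: each double quote inside the pattern is doubled.
   Quotes and the slash are written as plain characters, without backslashes. :)
let $pattern :=
  "<a\s+[^>]*?href\s?=[\s""']*(.*?)[""']*.*?>([^<]+|.*?)?</a>"

return fn:analyze-string($htmlBody, $pattern)
```

The < characters need no escaping inside a string literal; only & and the enclosing quote character do.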

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
