Re: HELP: arbitrary HTML 2 XML parser that works with XmlParse()
> I've tried jTidy, but it seems to choke if the HTML document it receives is
> not well-formed...

Matthew: jTidy will handle ill-formed documents... JournURL uses it pretty much constantly, anywhere HTML is involved. The problem is puzzling out which of the gazillion methods you need to call to get the results you're after. Suggestions:

    jTidy.setNumEntities(true);
    jTidy.setXHTML(true);
    jTidy.setXmlOut(true);
    jTidy.setForceOutput(true);

That last one is crucial, and caused me weeks of headaches before I finally figured out what was happening.

--
Roger Benningfield
JournURL
http://admin.support.journurl.com/
http://admin.mxblogspace.journurl.com/

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:199092
Archives: http://www.houseoffusion.com/cf_lists/threads.cfm/4
Subscription: http://www.houseoffusion.com/lists.cfm/link=s:4
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=89.70.4
Donations & Support: http://www.houseoffusion.com/tiny.cfm/54
Re: HELP: arbitrary HTML 2 XML parser that works with XmlParse()
I haven't used any Java HTML parser, but you could check:

http://jerichohtml.sourceforge.net/
http://htmlparser.sourceforge.net/

Massimo Foti
DW tools: http://www.massimocorner.com
CF tools: http://www.olimpo.ch/tmt/

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:198794
Re: HELP: arbitrary HTML 2 XML parser that works with XmlParse()
On Tuesday 15 Mar 2005 11:42 am, Matthew Lesko wrote:
> to XML parser (especially code samples) would be much appreciated.

HTML is not an XML format. XHTML is a lot stricter, which is what makes it an XML format. But no matter what some web sites claim to be, very few sites that have been around longer than a year or so are actually compliant.

IIRC, YMMV etc. etc.

--
Tom Chiverton
Advanced ColdFusion Programmer

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:198795
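Tom's distinction between HTML and XHTML can be checked mechanically. The following is a minimal sketch that uses the JDK's built-in XML parser as a stand-in for XmlParse(); the class name `WellFormedCheck` and the method `isWellFormed` are made up for illustration and are not part of ColdFusion or jTidy. It shows that an unclosed `<br>`, which is legal HTML, is rejected by an XML parser, while the XHTML form `<br/>` is accepted.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    /** Hypothetical helper: true if the markup parses as well-formed XML. */
    public static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser found a well-formedness error.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not markup errors.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML-style unclosed <br>: not well-formed XML.
        System.out.println(isWellFormed("<p>line one<br>line two</p>"));   // prints false
        // XHTML-style self-closed <br/>: well-formed XML.
        System.out.println(isWellFormed("<p>line one<br/>line two</p>"));  // prints true
    }
}
```

A check like this only tests well-formedness, not validity against an XHTML DTD, which is a stricter requirement.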
RE: HELP: arbitrary HTML 2 XML parser that works with XmlParse()
Matthew,

This is not possible. Not only does XmlParse() require well-formed XML (XML, XHTML), but XPath needs it to function at all. A valid document tree has to be built before those functions can find their way through the markup.

So either provide valid, well-formed XML, or start looking for a parser that does tag-soup extraction (which I think you won't find, because that is what browsers do).

Micha Schopman
Project Manager

Modern Media, Databankweg 12 M, 3821 AL Amersfoort
Tel 033-4535377, Fax 033-4535388
KvK Amersfoort 39081679, Rabo 39.48.05.380

Modern Media, Making You Interact Smarter. Our solutions improve interaction with your target audience. Do you want more revenue, lower costs, or a better level of service? For more information, see www.modernmedia.nl

-----Original Message-----
From: Matthew Lesko [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 15 March 2005 12:43
To: CF-Talk
Subject: HELP: arbitrary HTML 2 XML parser that works with XmlParse()

All,

I'm trying to write some functionality that is capable of retrieving a document from a URL, a la CFHTTP for instance, and then parsing that document, warts and all, into something that XmlParse() can work with, so that I can use XPath to pull out certain pieces.

I've tried jTidy, but it seems to choke if the HTML document it receives is not well-formed, and I have no control over the documents being pulled. So it seems as if nekoHTML might fit the bill, but I'm at a loss as to how to either get output from it that can be read by XmlParse() or use XPath directly on the object(s) it creates.

Any help with some sort of HTML to XML parser (especially code samples) would be much appreciated.

Thanks,
Matthew Lesko

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:198793
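For the XPath step that Matthew is ultimately after, here is a minimal sketch of what becomes possible once the markup is well-formed. It assumes the HTML has already been cleaned up into XHTML by some tidying step (for instance jTidy), and it uses the JDK's own DOM and XPath APIs rather than ColdFusion's XmlParse(). The class name `XPathOnXhtml`, the method `linkTargets`, and the sample markup are hypothetical; the sample deliberately has no DOCTYPE or namespace declaration, both of which real tidied output may carry and which may need extra handling.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathOnXhtml {

    /** Hypothetical helper: returns the href of every <a> element, in document order. */
    public static List<String> linkTargets(String xhtml) {
        try {
            // Build the document tree; this throws if the input is not well-formed.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

            // Select the href attribute nodes of all anchors.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs =
                (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

            List<String> targets = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                targets.add(hrefs.item(i).getNodeValue());
            }
            return targets;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        String xhtml =
            "<html><body><p>"
            + "<a href=\"http://jerichohtml.sourceforge.net/\">Jericho</a> "
            + "<a href=\"http://htmlparser.sourceforge.net/\">HTML Parser</a>"
            + "</p></body></html>";
        // Prints each link target on its own line.
        for (String target : linkTargets(xhtml)) {
            System.out.println(target);
        }
    }
}
```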