Re: HELP: arbitrary HTML 2 XML parser that works with XmlParse()

2005-03-17 Thread Roger B.
> I've tried jTidy, but it seems to choke if the HTML document it receives is 
> not well-formed...

Matthew: jTidy will handle ill-formed documents... JournURL uses it
pretty much constantly, anywhere HTML is involved. The problem is
puzzling out which of the gazillion methods you need to call to get
the results you're after.

Suggestions:

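// Assumes jTidy is an org.w3c.tidy.Tidy instance.
jTidy.setNumEntities(true);  // write numeric rather than named entities
jTidy.setXHTML(true);        // emit XHTML
jTidy.setXmlOut(true);       // emit well-formed XML
jTidy.setForceOutput(true);  // still produce output when errors are found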

That last one is crucial, and caused me weeks of headaches before I
finally figured out what was happening.

--
Roger Benningfield
JournURL
http://admin.support.journurl.com/
http://admin.mxblogspace.journurl.com/

~|
Logware (www.logware.us): a new and convenient web-based time tracking 
application. Start tracking and documenting hours spent on a project or with a 
client with Logware today. Try it for free with a 15 day trial account.
http://www.houseoffusion.com/banners/view.cfm?bannerid=67

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:199092
Archives: http://www.houseoffusion.com/cf_lists/threads.cfm/4
Subscription: http://www.houseoffusion.com/lists.cfm/link=s:4
Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=89.70.4
Donations & Support: http://www.houseoffusion.com/tiny.cfm/54


Re: HELP: arbitrary HTML 2 XML parser that works with XmlParse()

2005-03-15 Thread Massimo Foti
I haven't used any Java HTML parser, but you could check:

http://jerichohtml.sourceforge.net/

http://htmlparser.sourceforge.net/


Massimo Foti
DW tools: http://www.massimocorner.com
CF tools:  http://www.olimpo.ch/tmt/




Message: http://www.houseoffusion.com/lists.cfm/link=i:4:198794


Re: HELP: arbitrary HTML 2 XML parser that works with XmlParse()

2005-03-15 Thread Thomas Chiverton
On Tuesday 15 Mar 2005 11:42 am, Matthew Lesko wrote:
> to XML parser (especially code samples) would be much appreciated.

HTML is not an XML format.
XHTML is much stricter, which is what makes it an XML format. But whatever
some web sites claim to be, very few that have been around longer than a
year or so are actually compliant.

IIRC, YMMV etc. etc.
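
To make the distinction concrete, here is a minimal Java sketch (an
illustration, not something posted in the thread) that asks the JDK's
built-in XML parser whether a piece of markup is well-formed. The class
name WellFormedCheck, the method isWellFormed, and the sample strings
are made up for the example. Typical tag soup fails on unclosed tags
and on named entities such as &nbsp; that XML does not declare.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper: reports whether markup is well-formed XML.
final class WellFormedCheck {

    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler, which prints errors to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { }
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e; // well-formedness violations are fatal errors
                }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the markup
        } catch (Exception e) {
            // Parser configuration or I/O trouble, unrelated to the markup.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unclosed <br> and <p>: ordinary HTML, but not XML.
        System.out.println(isWellFormed("<html><body><p>one<br>two</body></html>"));
        // The same content written as XHTML-style markup.
        System.out.println(isWellFormed("<html><body><p>one<br/>two</p></body></html>"));
    }
}
```

The sample strings carry no DOCTYPE on purpose, so the parser never
tries to fetch an external DTD.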

-- 
Tom Chiverton 
Advanced ColdFusion Programmer

Message: http://www.houseoffusion.com/lists.cfm/link=i:4:198795


RE: HELP: arbitrary HTML 2 XML parser that works with XmlParse()

2005-03-15 Thread Micha Schopman
Matthew, 

This is not possible. XmlParse() is bound to well-formed XML (XML,
XHTML), and XPath needs well-formed input to function at all.

A valid document tree needs to be built before those functions can
find their way through the markup.

So either provide valid, well-formed XML, or start looking for a
tag-soup parser (which I think you won't find, because that is what
browsers do).
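
The dependency described above, tree first and XPath second, can be
sketched in Java with the JDK's own javax.xml classes. This is an
illustration under assumptions: XPathOverDom and select are hypothetical
names, and the JDK parser merely stands in for whatever XmlParse() does
internally.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: build a DOM tree, then run XPath against it.
final class XPathOverDom {

    /** Returns the text content of every node the expression selects. */
    static List<String> select(String xml, String expression) {
        try {
            // Step 1: the tree. This fails if the input is not well-formed.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Step 2: XPath, evaluated against the finished tree.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException(
                    "could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"/a\">A</a>"
                + "<p><a href=\"/b\">B</a></p></body></html>";
        System.out.println(select(page, "//a/@href"));
    }
}
```

Because XPath only ever sees the tree, any parser that can produce a
standard DOM Document from messy HTML would slot into step 1.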

Micha Schopman
Project Manager

Modern Media, Databankweg 12 M, 3821 AL  Amersfoort
Tel 033-4535377, Fax 033-4535388
KvK Amersfoort 39081679, Rabo 39.48.05.380



-
Modern Media, Making You Interact Smarter. Our solutions improve
interaction with your target audience.
Would you like more revenue, lower costs, or a better level of service? For more
information, see www.modernmedia.nl


-

-Original Message-
From: Matthew Lesko [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 15 March 2005 12:43
To: CF-Talk
Subject: HELP: arbitrary HTML 2 XML parser that works with XmlParse()

All,

I'm trying to write something that retrieves a document from a URL
(via CFHTTP, for instance) and then parses that document - warts and
all - into something XmlParse() can work with, so that I can use XPath
to pull out certain pieces.

I've tried jTidy, but it seems to choke if the HTML document it receives
is not well-formed, and I have no control over the documents being
pulled. So, it seems as if nekoHTML might fit the bill, but I'm at a
loss as to how to either get output from it that can be read by
XmlParse() or how to use XPath on the object(s) it creates directly. 
Any help with some sort of HTML-to-XML parser (especially code samples)
would be much appreciated.
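
One possible route for the first half of that question, sketched with
JDK classes only: if an HTML parser hands back a standard
org.w3c.dom.Document (NekoHTML's DOM parser is assumed to do so, and is
not used in the sketch), an identity Transformer can write that tree
out as an XML string, which could then be passed to XmlParse().
DomToXmlString, serialize, and sampleDocument are hypothetical names,
and the sample tree is built by hand as a stand-in for parser output.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: turn a DOM tree into a well-formed XML string.
final class DomToXmlString {

    static String serialize(Document doc) {
        try {
            // A Transformer with no stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            // Set explicitly: with an <html> root the default output method
            // may otherwise be HTML, which would leave <br> unclosed.
            identity.setOutputProperty(OutputKeys.METHOD, "xml");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    /** Hand-built stand-in for the tree an HTML parser might return. */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            body.appendChild(doc.createTextNode("one"));
            body.appendChild(doc.createElement("br"));
            body.appendChild(doc.createTextNode("two"));
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not build sample tree", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```

For the second half of the question, javax.xml.xpath can also evaluate
expressions directly against a DOM Document, which would skip the
string round trip entirely.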

Thanks,

Matthew Lesko



Message: http://www.houseoffusion.com/lists.cfm/link=i:4:198793