(Wow, it seems it took one or two days before this thread appeared here, so
sorry for the multipost; I posted another one in the meantime.)

Thanks for your help.
I first tried 3) but gave up, then tried a pure NekoHTML solution,
which works perfectly well :)
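
For anyone who lands here with the same '&' error and only needs a quick
fix: below is a minimal sketch of the preprocessing route (option 3), under
the assumption that unescaped ampersands are the only thing wrong with the
markup. The class name AmpersandFix and the helpers escapeBareAmpersands
and parses are made up for this illustration; they are not part of any
library. It uses only java.util.regex and the standard SAX classes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class AmpersandFix {
    // Matches an '&' that does NOT already start an entity reference (&name;)
    // or a character reference (&#38; or &#x26;).
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // Hypothetical helper: rewrites every bare '&' as "&amp;".
    static String escapeBareAmpersands(String markup) {
        return BARE_AMP.matcher(markup).replaceAll("&amp;");
    }

    // Hypothetical helper: true if the platform SAX parser accepts the markup.
    static boolean parses(String markup) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler());
            return true;
        } catch (Exception e) {
            // SAXException for malformed input, plus configuration or I/O errors.
            return false;
        }
    }

    public static void main(String[] args) {
        String raw = "<div id=\"submenu\"><a href=\"/compte/console.pl?"
                + "id=382730&idt=1cf6b94aa1a4cf84\"></a></div>";
        String fixed = escapeBareAmpersands(raw);
        System.out.println("raw parses:   " + parses(raw));
        System.out.println("fixed parses: " + parses(fixed));
        System.out.println(fixed);
    }
}
```

This only repairs bare ampersands. HTML-only entities such as &nbsp; are
left alone and will probably still be rejected by an XML parser, and
unbalanced tags are not touched at all, which is where a tolerant parser
like NekoHTML is the safer choice.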

Thierry.

On 3 jan, 01:18, Frank Weiss <fewe...@gmail.com> wrote:
> There have been a few similar questions. The basic issue is that SAX parsers
> require well-formed XML or XHTML as input. If you have control over (or can
> influence the authors of) the service, make the output well-formed, which, as
> you well know, means that <, >, ", ' and & need to be escaped. In PHP, this is
> easily done with the htmlspecialchars function. Tip: use validator.w3.org to
> see what's wrong with the document.
>
> If you can't change the service or it is HTML anyway, here are some
> suggestions:
>
> 1) Use NekoHTML to preprocess the flaky markup into a DOM. You can then
> use SAXParser, XPath, XSLT, etc. to get the data. I haven't tried it on
> Android - it may be a bit heavyweight, but otherwise it is a great way to
> deal with flaky markup.
> 2) See if you can modify the SAXParser itself (can you say Open Source?) to
> relax the particular issues. If the source document is really bad
> (unbalanced tags, etc.) this is probably going to get too hairy.
> 3) Use regex to parse the page.
>
> There are probably some other creative solutions. Which one is best depends
> on the details of the source document and what you want to do with it.
>
> On Fri, Jan 1, 2010 at 6:12 AM, tlegras <tleg...@gmail.com> wrote:
> > Happy New Year :)
>
> > I am using SAXParser to parse an HTML page (any better solution?) and
> > get this exception:
>
> >            W/System.err( 1358): org.apache.harmony.xml.ExpatParser$ParseException: At line 1, column 59: not well-formed (invalid token)
>
> > I have reduced the page to this:
>
> >            <div id="submenu"><a href="/compte/console.pl?id=382730&idt=1cf6b94aa1a4cf84"></a></div>
>
> > and what causes the exception is the '&' inside the href attribute
> > value.
>
> > Here is a minimalist test code:
>
> >            DefaultHandler emptySaxHandler = new DefaultHandler() {};
> >            String xmlstr = "<div id=\"submenu\"><a href=\"/compte/console.pl?id=382730&idt=1cf6b94aa1a4cf84\"></a></div>";
>
> >            SAXParserFactory factory = SAXParserFactory.newInstance();
> >            SAXParser saxParser = factory.newSAXParser();
> >            saxParser.parse(new ByteArrayInputStream(xmlstr.getBytes()), emptySaxHandler);
>
> > Is this normal behaviour or some kind of bug? If it is normal, what should
> > I do to preprocess the string before parsing?
>
> > Thanks for any help,
> > Thierry.
>
> > --
> > You received this message because you are subscribed to the Google
> > Groups "Android Developers" group.
> > To post to this group, send email to android-developers@googlegroups.com
> > To unsubscribe from this group, send email to
> > android-developers+unsubscr...@googlegroups.com
> > For more options, visit this group at
> > http://groups.google.com/group/android-developers?hl=en