(Wow, it seems it took one or two days before these threads appeared here, so sorry for the multipost; I posted another one in the meantime.)
Thanks for your help. I first tried 3) but gave up, then tried a pure NekoHTML solution, which works perfectly well :)

Thierry.

On 3 jan, 01:18, Frank Weiss <fewe...@gmail.com> wrote:
> There have been a few similar questions. The basic issue is that SAX parsers
> require well-formed XML or XHTML as input. If you have control over (or can
> influence the authors of) the service, make the output well-formed, which as
> you well know means that <, >, ", ', & need to be escaped. In PHP, this is
> easily done with the htmlspecialchars function. Tip: use validator.w3.org to
> see what's wrong with the document.
>
> If you can't change the service or it is HTML anyway, here are some
> suggestions:
>
> 1) Use NekoHTML to preprocess the flaky markup into a DOM. You can then
> use SAXParser, XPath, XSLT, etc. to get the data. I haven't tried it on
> Android - it may be a bit heavyweight, but otherwise it is a great way to
> deal with flaky markup.
> 2) See if you can modify the SAXParser itself (can you say Open Source?) to
> relax the particular issues. If the source document is really bad
> (unbalanced tags, etc.) this is probably going to get too hairy.
> 3) Use regex to parse the page.
>
> There are probably some other creative solutions. Which one is best depends
> on the details of the source document and what you want to do with it.
>
> On Fri, Jan 1, 2010 at 6:12 AM, tlegras <tleg...@gmail.com> wrote:
> > Happy new year :)
> >
> > I am using SAXParser to parse an HTML page (any better solution?) and
> > I get this exception:
> >
> > W/System.err( 1358): org.apache.harmony.xml.ExpatParser$ParseException: At line 1, column 59: not well-formed (invalid token)
> >
> > I have reduced the page to this:
> >
> > <div id="submenu"><a href="/compte/console.pl?id=382730&idt=1cf6b94aa1a4cf84"></a></div>
> >
> > and what causes the exception is the '&' inside the href attribute
> > value.
> >
> > Here is a minimal test case:
> >
> > DefaultHandler emptySaxHandler = new DefaultHandler() {};
> > String xmlstr = "<div id=\"submenu\"><a href=\"/compte/console.pl?id=382730&idt=1cf6b94aa1a4cf84\"></a></div>";
> >
> > SAXParserFactory factory = SAXParserFactory.newInstance();
> > SAXParser saxParser = factory.newSAXParser();
> > saxParser.parse(new ByteArrayInputStream(xmlstr.getBytes()), emptySaxHandler);
> >
> > Is this normal behaviour or some kind of bug? If it is normal, what should
> > I do to preprocess the string before parsing?
> >
> > Thanks for any help,
> > Thierry.
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "Android Developers" group.
> > To post to this group, send email to android-developers@googlegroups.com
> > To unsubscribe from this group, send email to
> > android-developers+unsubscr...@googlegroups.com
> > For more options, visit this group at
> > http://groups.google.com/group/android-developers?hl=en
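
On the preprocessing question at the end of the quoted message: if the only problem in the page is unescaped '&' characters, one possible approach is to escape them before handing the string to the SAX parser. Below is a minimal sketch in plain Java. The class and method names (AmpersandEscaper, escapeBareAmpersands, isWellFormed) are made up for illustration and are not from the thread, and I have not tried it on Android.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library mentioned in the thread.
class AmpersandEscaper {

    // Matches an '&' that does NOT start a named (&amp;), decimal (&#38;)
    // or hexadecimal (&#x26;) reference.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // Replaces every bare '&' with "&amp;" and leaves existing references alone.
    static String escapeBareAmpersands(String markup) {
        return BARE_AMP.matcher(markup).replaceAll("&amp;");
    }

    // Returns true if the SAX parser accepts the markup, false if it reports a parse error.
    static boolean isWellFormed(String markup) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // includes SAXParseException for malformed input
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "<div id=\"submenu\"><a href=\"/compte/console.pl"
                + "?id=382730&idt=1cf6b94aa1a4cf84\"></a></div>";
        System.out.println(isWellFormed(raw));                       // bare '&' in href
        System.out.println(isWellFormed(escapeBareAmpersands(raw))); // '&' escaped first
    }
}
```

This only helps with stray ampersands. It leaves references such as &nbsp; untouched, which a plain XML parser will still reject as an undefined entity, and it does nothing for unbalanced tags. For markup that bad, NekoHTML is the better route.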