For that kind of incremental parsing, you could also use XMLParserStAX, a pull-parser that parses a document as a stream of event objects you control with #next, #peek, and #atEnd. It also supports pull-DOM parsing with messages like #nextNode, #nextElement, and #nextElementNamed:, which return the next event object(s) as DOM subtrees (searchable with XPath). See the StAXParser class comment for an example. (The StAXHTMLParser class requires XMLParserHTML be installed to work.)
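
As a rough sketch of the pull-DOM style (untested here; it assumes XMLParserStAX is loaded and that StAXParser accepts a string via #on:, and the sample markup is made up for illustration):

```smalltalk
"Sketch only. #on: and the sample document are assumptions; the
 pull messages are the ones mentioned above."
| parser head |
parser := StAXParser on:
	'<html><head><title>Example</title></head><body><p>ignored</p></body></html>'.

"Pull events up to the <head> element and get it back as a DOM subtree."
head := parser nextElementNamed: 'head'.

"The subtree is an ordinary DOM node, so the usual DOM queries apply."
head findElementNamed: 'title'.
```

Since the application decides when to pull the next event, there should be no need for a stop condition that reaches into the parser's private state. For a real web page you would presumably use StAXHTMLParser in place of StAXParser.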
> Sent: Friday, May 12, 2017 at 5:30 AM
> From: PBKResearch <pe...@pbkresearch.co.uk>
> To: "'Any question about pharo is welcome'" <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding
>
> With reference to Norbert's comment, there /may/ be an ambiguity about the
> word 'header' in Udo's reply. It could refer to the http HEAD section, in
> which case Norbert is of course right. It could also refer to the <head>
> section of the html file, which is part of the content of the http response.
> If it is the latter, this is similar to a question that Paul deBruicker
> posted last November ("[Pharo-users] ZnClient GET, but just the content of
> the <head> tag?"). I tried the method I devised for Paul's case on Udo's
> problem website, and read the html header with no problem. Incidentally, the
> header includes 'charset=iso-8859-1', which does not agree with Sven's
> findings.
>
> In case it is of interest, I used XMLHTMLParser to read and parse the
> header. Try the following in a Playground:
>
> par := XMLHTMLParser onURL:
>     'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.
> par parseDocumentUntil: [ | top |
>     (top := par topNode) notNil
>         and: [ top isElement and: [ top isNamed: 'body' ] ] ].
> par parsingResult findElementNamed: 'head'.
>
> If you 'Do it and go', the full header appears. The way I get it to stop
> after the header may not be quite correct, because it uses
> XMLHTMLParser>>topNode, which is a private method. On the other hand, I
> can't see how to make the stop condition for
> XMLHTMLParser>>parseDocumentUntil: depend on the parsed results without
> using a private method.
>
> Hope this is helpful
>
> Peter Kenny
>
> -----Original Message-----
> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of Norbert Hartl
> Sent: 12 May 2017 08:04
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding
>
> Just to mention. If you are not interested in the content body you could do
> a HEAD request instead of GET.
>
> Norbert
>
> > On 11.05.2017 at 22:44, Udo Schneider <udo.schnei...@homeaddress.de> wrote:
> >
> > Hi Sven,
> >
> > that's perfect. To be honest I don't care about the content - I'm just
> > parsing the header. And even if there is a wrong decoding in there... I can
> > live with that.
> >
> > Thank you very very much! For your help but also your stuff in general.
> >
> > CU,
> >
> > Udo
> >
> >> On 11/05/17 at 22:35, Sven Van Caekenberghe wrote:
> >> Hi Udo,
> >>> On 11 May 2017, at 21:37, Udo Schneider <udo.schnei...@homeaddress.de> wrote:
> >>>
> >>> All,
> >>>
> >>> I'm hitting an error where fetching web content fails. The website does
> >>> indeed use invalid characters.
> >>>
> >>> The easiest way to reproduce:
> >>>
> >>> ZnEasy get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'
> >>>
> >>> Is there any way to tell Zinc to simply ignore that error and to continue?
> >>>
> >>> CU,
> >>>
> >>> Udo
> >>
> >> That server/page has a mime-type text/plain with no explicit encoding
> >> (charset) setting, so we have to guess. Like utf-8, pure latin1/iso88591
> >> does not work. The following does work, but you can't be sure everything
> >> went well (beLenient takes some bytes as they are).
> >>
> >> ZnDefaultCharacterEncoder
> >>     value: ZnCharacterEncoder latin1 beLenient
> >>     during: [
> >>         ZnClient new
> >>             get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
> >>             yourself ].
> >>
> >> I added some API earlier today, so that the following should also work
> >> (you need to load Zn #bleedingEdge first).
> >>
> >> ZnClient new
> >>     defaultEncoder: ZnCharacterEncoder latin1 beLenient;
> >>     get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
> >>     yourself.
> >>
> >> HTH,
> >> Regards,
> >> Sven