With reference to Norbert's comment, there /may/ be an ambiguity about the word 'header' in Udo's reply. It could refer to the HTTP response headers, in which case Norbert is of course right. It could also refer to the <head> section of the HTML file, which is part of the content of the HTTP response. If it is the latter, this is similar to a question that Paul deBruicker posted last November ("[Pharo-users] ZnClient GET, but just the content of the <head> tag?"). I tried the method I devised for Paul's case on Udo's problem website, and read the HTML header with no problem. Incidentally, the header includes 'charset=iso-8859-1', which does not agree with Sven's findings.

In case it is of interest, I used XMLHTMLParser to read and parse the header. Try the following in a Playground:

par := XMLHTMLParser onURL: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.
par parseDocumentUntil: [ | top |
    (top := par topNode) notNil
        and: [ top isElement and: [ top isNamed: 'body' ] ] ].
par parsingResult findElementNamed: 'head'.

If you 'Do it and go', the full header appears. The way I get it to stop after the header may not be quite correct, because it uses XMLHTMLParser>>topNode, which is a private method. On the other hand, I can't see how to make the stop condition for XMLHTMLParser>>parseDocumentUntil: depend on the parsed results without using a private method.

Hope this is helpful

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of Norbert Hartl
Sent: 12 May 2017 08:04
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

Just to mention. If you are not interested in the content body you could do a HEAD request instead of GET.

Norbert

> On 11.05.2017 at 22:44, Udo Schneider <udo.schnei...@homeaddress.de> wrote:
>
> Hi Sven,
>
> that's perfect. To be honest I don't care about the content - I'm just parsing the header. And even if there is a wrong decoding in there... I can live with that.
>
> Thank you very very much! For your help but also your stuff in general.
>
> CU,
>
> Udo
>
>
>> On 11/05/17 at 22:35, Sven Van Caekenberghe wrote:
>> Hi Udo,
>>> On 11 May 2017, at 21:37, Udo Schneider <udo.schnei...@homeaddress.de> wrote:
>>>
>>> All,
>>>
>>> I'm hitting an error where fetching web content fails. The website does indeed use invalid characters.
>>>
>>> The easiest way to reproduce:
>>>
>>> ZnEasy get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'
>>>
>>> Is there any way to tell Zinc to simply ignore that error and to continue?
>>>
>>> CU,
>>>
>>> Udo
>> That server/page has a mime-type text/plain with no explicit encoding (charset) setting, so we have to guess. Like utf-8, pure latin1/iso88591 does not work. The following does work, but you can't be sure everything went well (beLenient takes some bytes as they are).
>> ZnDefaultCharacterEncoder
>>   value: ZnCharacterEncoder latin1 beLenient
>>   during: [
>>     ZnClient new
>>       get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
>>       yourself ].
>> I added some API earlier today, so that the following should also work (you need to load Zn #bleedingEdge first).
>> ZnClient new
>>   defaultEncoder: ZnCharacterEncoder latin1 beLenient;
>>   get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
>>   yourself.
>> HTH,
>> Regards,
>> Sven
>
>
>