Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding

monty Mon, 15 May 2017 04:17:05 -0700

For that kind of incremental parsing, you could also use XMLParserStAX, a 
pull-parser that parses a document as a stream of event objects you control 
with #next, #peek, and #atEnd. It also supports pull-DOM parsing with messages 
like #nextNode, #nextElement, and #nextElementNamed:, which return the next 
event object(s) as DOM subtrees (searchable with XPath). See the StAXParser 
class comment for an example. (The StAXHTMLParser class requires XMLParserHTML 
be installed to work.)
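
As a rough, untested sketch of the pull-DOM style applied to the <head> case: 
the inline document string and the variable names below are made up for 
illustration, and I am assuming here that the class-side #on: constructor and 
the XMLElement accessors #elementAt: and #contentString behave as they do in 
the rest of XMLParser.

```smalltalk
| parser head |
"Illustrative input only; a real use would feed the fetched page source instead."
parser := StAXParser on:
	'<html><head><title>Example</title></head><body><p>ignored</p></body></html>'.

"Pull events up to the first <head> element and answer it as a DOM subtree.
 Whatever follows that element is left unread on the event stream."
head := parser nextElementNamed: 'head'.

"The subtree is an ordinary XMLElement, so the usual DOM accessors should apply."
(head elementAt: 'title') contentString.
```

For real-world HTML, StAXHTMLParser should be usable the same way once 
XMLParserHTML is installed, and no private method is needed to decide where 
to stop.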


> Sent: Friday, May 12, 2017 at 5:30 AM
> From: PBKResearch <pe...@pbkresearch.co.uk>
> To: "'Any question about pharo is welcome'" <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for 
> utf-8 encoding
>
> With reference to Norbert's comment, there /may/ be an ambiguity about the
> word 'header' in Udo's reply. It could refer to the http HEAD section, in
> which case Norbert is of course right. It could also refer to the <head>
> section of the html file, which is part of the content of the http response.
> If it is the latter, this is similar to a question that Paul deBruicker
> posted last November ("[Pharo-users] ZnClient GET, but just the content of
> the <head> tag?"). I tried the method I devised for Paul's case on Udo's
> problem website, and read the html header with no problem. Incidentally, the
> header includes 'charset=iso-8859-1', which does not agree with Sven's
> findings.
> 
> In case it is of interest, I used XMLHTMLParser to read and parse the
> header. Try the following in a Playground:
> 
> par := XMLHTMLParser onURL: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.
> par parseDocumentUntil: [ | top |
>     (top := par topNode) notNil
>         and: [ top isElement and: [ top isNamed: 'body' ] ] ].
> par parsingResult findElementNamed: 'head'.
> 
> If you 'Do it and go', the full header appears. The way I get it to stop
> after the header may not be quite correct, because it uses
> XMLHTMLParser>>topNode, which is a private method. On the other hand, I
> can't see how to make the stop condition for
> XMLHTMLParser>>parseDocumentUntil: depend on the parsed results without
> using a private method.
> 
> Hope this is helpful
> 
> Peter Kenny
> 
> -----Original Message-----
> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of
> Norbert Hartl
> Sent: 12 May 2017 08:04
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for
> utf-8 encoding
> 
> Just to mention: if you are not interested in the content body, you could do
> a HEAD request instead of a GET.
> 
> Norbert
> 
> > On 11.05.2017 at 22:44, Udo Schneider <udo.schnei...@homeaddress.de> wrote:
> > 
> > Hi Sven,
> > 
> > that's perfect. To be honest I don't care about the content - I'm just
> > parsing the header. And even if there is a wrong decoding in there... I can
> > live with that.
> > 
> > Thank you very very much! For your help but also your stuff in general.
> > 
> > CU,
> > 
> > Udo
> > 
> > 
> >> On 11/05/17 at 22:35, Sven Van Caekenberghe wrote:
> >> Hi Udo,
> >>> On 11 May 2017, at 21:37, Udo Schneider <udo.schnei...@homeaddress.de> wrote:
> >>> 
> >>> All,
> >>> 
> >>> I'm hitting an error where fetching web content fails. The website does
> >>> indeed use invalid characters.
> >>> 
> >>> The easiest way to reproduce:
> >>> 
> >>> ZnEasy get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'
> >>> 
> >>> Is there any way to tell Zinc to simply ignore that error and to
> >>> continue?
> >>> 
> >>> CU,
> >>> 
> >>> Udo
> >> That server/page has a mime-type text/plain with no explicit encoding
> >> (charset) setting, so we have to guess. Like utf-8, pure latin1/iso88591
> >> does not work. The following does work, but you can't be sure everything
> >> went well (beLenient takes some bytes as they are).
> >> ZnDefaultCharacterEncoder
> >>   value: ZnCharacterEncoder latin1 beLenient
> >>   during: [
> >>     ZnClient new
> >>       get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
> >>       yourself ].
> >> I added some API earlier today, so that the following should also work
> >> (you need to load Zn #bleedingEdge first).
> >> ZnClient new
> >>   defaultEncoder: ZnCharacterEncoder latin1 beLenient;
> >>   get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
> >>   yourself.
> >> HTH,
> >> Regards,
> >> Sven
> > 
> > 
> > 
> 
> 
> 
>
