Monty

Many thanks for this. My original purpose was just to answer Paul deBruicker's 
query, namely to parse an html file and stop reading at the end of the <head> 
section. I solved this by trial and error using the code shown below (which 
actually stops at the opening tag of the body). This was not my problem at all, 
but Paul's; I just tackled it for fun.

However, your note has prompted me to update my version of the whole XML system 
- I was using the version I downloaded with Moose 6.0, which was dated August 
2016. I am looking at the StAX parsers as a possible way of simplifying what I 
currently do, which involves downloading an entire web page as a DOM and then 
manipulating it with XPath to extract the bits I am interested in. I may be 
able to use StAX to do some of the selection and manipulation as I am reading.

It's all a new topic to me, so I foresee a lot of experimentation. It all helps 
to keep the grey matter active.

Thanks again

Peter Kenny

-----Original Message-----
From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of 
monty
Sent: 15 May 2017 12:15
To: pharo-users@lists.pharo.org
Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 
encoding

For that kind of incremental parsing, you could also use XMLParserStAX, a 
pull-parser that parses a document as a stream of event objects you control 
with #next, #peek, and #atEnd. It also supports pull-DOM parsing with messages 
like #nextNode, #nextElement, and #nextElementNamed:, which return the next 
event object(s) as DOM subtrees (searchable with XPath). See the StAXParser 
class comment for an example. (The StAXHTMLParser class requires XMLParserHTML 
be installed to work.)
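
A minimal, untested sketch of the pull-DOM approach described above, for the 
"stop after <head>" case. It assumes XMLParserStAX and XMLParserHTML are both 
loaded, and that StAXHTMLParser answers to #onURL: the same way XMLHTMLParser 
does; that class-side message and the placeholder URL are assumptions, not 
taken from the class comment.

```smalltalk
| parser head |
"Assumption: StAXHTMLParser class>>onURL: exists, mirroring XMLHTMLParser onURL:.
 The URL is a placeholder."
parser := StAXHTMLParser onURL: 'http://www.example.com/'.

"#nextElementNamed: is described above as returning the next matching
 element as a DOM subtree, so reading should stop once <head> is complete."
head := parser nextElementNamed: 'head'.

"Inspect the subtree ('Do it and go' in a Playground)."
head
```

If this behaves as described, the stop condition does not need any private 
parser method, because the parser only pulls as much input as the requested 
subtree requires.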

> Sent: Friday, May 12, 2017 at 5:30 AM
> From: PBKResearch <pe...@pbkresearch.co.uk>
> To: "'Any question about pharo is welcome'" 
> <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte 
> for utf-8 encoding
>
> With reference to Norbert's comment, there /may/ be an ambiguity about 
> the word 'header' in Udo's reply. It could refer to the http HEAD 
> section, in which case Norbert is of course right. It could also refer 
> to the <head> section of the html file, which is part of the content of the 
> http response.
> If it is the latter, this is similar to a question that Paul 
> deBruicker posted last November ("[Pharo-users] ZnClient GET, but just 
> the content of the <head> tag?"). I tried the method I devised for 
> Paul's case on Udo's problem website, and read the html header with no 
> problem. Incidentally, the header includes 'charset=iso-8859-1', which 
> does not agree with Sven's findings.
> 
> In case it is of interest, I used XMLHTMLParser to read and parse the 
> header. Try the following in a Playground:
> 
> par := XMLHTMLParser onURL: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'.
> par parseDocumentUntil: [ | top |
>     (top := par topNode) notNil
>         and: [ top isElement and: [ top isNamed: 'body' ] ] ].
> par parsingResult findElementNamed: 'head'.
> 
> If you 'Do it and go', the full header appears. The way I get it to 
> stop after the header may not be quite correct, because it uses 
> XMLHTMLParser>>topNode, which is a private method. On the other hand, I 
> can't see how to make the stop condition for 
> XMLHTMLParser>>parseDocumentUntil: depend on the parsed results without 
> using a private method.
> 
> Hope this is helpful
> 
> Peter Kenny
> 
> -----Original Message-----
> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On 
> Behalf Of Norbert Hartl
> Sent: 12 May 2017 08:04
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte 
> for utf-8 encoding
> 
> Just to mention. If you are not interested in the content body you 
> could do a HEAD request instead of GET.
> 
> Norbert
> 
> > On 11.05.2017 at 22:44, Udo Schneider <udo.schnei...@homeaddress.de> wrote:
> > 
> > Hi Sven,
> > 
> > that's perfect. To be honest I don't care about the content - I'm just 
> > parsing the header. And even if there is a wrong decoding in there... I 
> > can live with that.
> > 
> > Thank you very very much! For your help but also your stuff in general.
> > 
> > CU,
> > 
> > Udo
> > 
> > 
> >> On 11/05/17 at 22:35, Sven Van Caekenberghe wrote:
> >> Hi Udo,
> >>> On 11 May 2017, at 21:37, Udo Schneider <udo.schnei...@homeaddress.de> wrote:
> >>> 
> >>> All,
> >>> 
> >>> I'm hitting an error where fetching web content fails. The website 
> >>> does indeed use invalid characters.
> >>> 
> >>> The easiest way to reproduce:
> >>> 
> >>> ZnEasy get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723'
> >>> 
> >>> Is there any way to tell Zinc to simply ignore that error and to 
> >>> continue?
> >>> 
> >>> CU,
> >>> 
> >>> Udo
> >> That server/page has a mime-type text/plain with no explicit encoding 
> >> (charset) setting, so we have to guess. Like utf-8, pure latin1/iso88591 
> >> does not work. The following does work, but you can't be sure everything 
> >> went well (beLenient takes some bytes as they are).
> >> ZnDefaultCharacterEncoder
> >>   value: ZnCharacterEncoder latin1 beLenient
> >>   during: [
> >>     ZnClient new
> >>       get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
> >>       yourself ].
> >> I added some API earlier today, so that the following should also work 
> >> (you need to load Zn #bleedingEdge first).
> >>  ZnClient new
> >>   defaultEncoder: ZnCharacterEncoder latin1 beLenient;
> >>   get: 'http://www.darkreading.com/partner-perspectives/malwarebytes/locky-returns-with-a-new-(borrowed)-distribution-method/a/d-id/1328723';
> >>   yourself.
> >> HTH,
> >> Regards,
> >> Sven
> > 
> > 
> > 
> 
> 
> 
> 

