For example, this:

    ((StAXHTMLParser onURL: aURLString) nextElementNamed: 'head')
        ifNotNil: [:headElement | ...]

parses the document up to the next "head" element and returns it and any descendants as a DOM subtree. If there's no next "head" element, it exhausts the event stream looking for one. If you don't want that, test it first:

    (parser peek isStartTagNamed: 'head')
        ifTrue: [| headElement |
            headElement := parser nextNode.
            ...]

Because you now know what kind of DOM subtree the next events represent, you can use #nextNode, which builds any DOM subtree out of the next events, including an element with descendants, a string or comment node, or even an entire document (if sent before reading the start-of-document event). So this:

    (StAXHTMLParser onURL: aURLString) nextNode

is equivalent to this:

    XMLHTMLParser parseURL: aURLString

StAX is more useful with XML than HTML, because XML documents can be huge.

> Sent: Tuesday, May 16, 2017 at 6:39 PM
> From: PBKResearch <pe...@pbkresearch.co.uk>
> To: "'Any question about pharo is welcome'" <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc]
> ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
>
> Monty
>
> Many thanks for your help. I have followed your advice to start again in a
> clean Moose 6.1 image, and so far everything is working fine. Apologies for
> getting you to sort out the results of my stupidity. In Pharo I am really an
> experienced beginner.
>
> Thanks again
>
> Peter Kenny
>
> -----Original Message-----
> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of
> monty
> Sent: 16 May 2017 03:37
> To: pharo-users@lists.pharo.org
> Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc]
> ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
>
> Something went wrong during your upgrade with class initialization.
>
> Installing the latest versions of these projects into a clean image would
> work, and so would installing the latest XMLParserHTML and XMLParserStAX into
> the newest Moose-6.1 image (which has the latest XMLParser and XPath).
>
> But if you insist on upgrading your old image, try the latest
> ConfigurationOfXMLParser (.303.mcz) and ConfigurationOfXPath (.149.mcz) from
> their PharoExtras repos and install their latest project versions, and do the
> same with XMLParserHTML and XMLParserStAX (the older versions aren't
> compatible with newer XMLParser versions). Then open the test runner and run
> all "XML|XPath" tests. If you get any failures, evaluate this:
>
> #('XML-Parser' 'XPath-Core') do: [:package |
>     (SystemNavigation default allClassesInPackageNamed: package) do: [:class |
>         class initialize]]
>
> and try running the tests again.
>
> > Sent: Monday, May 15, 2017 at 6:50 PM
> > From: PBKResearch <pe...@pbkresearch.co.uk>
> > To: "'Any question about pharo is welcome'" <pharo-users@lists.pharo.org>
> > Subject: [Pharo-users] Problems loading XML System ( was Re: [Zinc]
> > ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
> >
> > Monty
> >
> > As an update, I have rebuilt from the Moose 6.0 download. The version of
> > XML-Parser in that was dated 18 July 2016 (configuration monty.233), so I
> > installed versions of XML-Parser-HTML and XML-Parser-StAX contemporary with
> > that. (The respective configurations are monty.48 and monty.39). With these
> > versions all my previous XMLHTMLParser operations work as before, and I
> > have been able to use the StAX parser in a simple way. So I can start
> > exploring as I intended.
> >
> > I have made repeated attempts to update this rebuilt image to more recent
> > versions of the HTML and StAX parsers, and every time I run into the same
> > error reported below.
> > I started from the latest version and worked
> > backwards, but gave up quickly; it takes about 6 minutes on my machine to
> > load and compile a version, and it soon gets tedious. If I feel more
> > enthusiastic tomorrow, I might start working forwards from my current
> > versions.
> >
> > Anyway, I now have a working system with the StAX and HTML parsers, so I
> > can continue to explore.
> >
> > Best wishes
> >
> > Peter Kenny
> >
> > -----Original Message-----
> > From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of
> > PBKResearch
> > Sent: 15 May 2017 20:44
> > To: 'Any question about pharo is welcome' <pharo-users@lists.pharo.org>
> > Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for
> > utf-8 encoding
> >
> > Monty
> >
> > I have just started trying to use the StAX parsers, and I have found that
> > the update has introduced a problem, which means that XMLHTMLParser no
> > longer works on examples I have used before. I updated to
> > ConfigurationOfXMLParser(monty.302), which is the latest version on the
> > smalltalkhub repository, and then used the load version in the class
> > comment, which loads the stable default. Similarly, I loaded
> > ConfigurationOfXMLParserHTML(monty.62) and
> > ConfigurationOfXMLParserStAX(monty.51), again using stable and default.
> > When I try to run the XMLHTMLParser example I quoted below, I get an error
> > message 'MessageNotUnderstood: receiver of "critical:" is nil'. The same
> > message comes up with anything else I try with XMLHTMLParser or with
> > StAXHTMLParser.
> >
> > I am not really up to using the debugger on someone else's code, but the
> > one thing I can see is that the problem lies in
> > XMLKeyValueCache>>critical:, which has the code:
> >     ^ self mutex critical: aBlock
> > The problem being that mutex is nil.
> >
> > In my enthusiasm, I saved the updated image with the same name as the old
> > image, which is now therefore overwritten.
> > If I cannot solve this problem,
> > my only way out is to rebuild my image from the Moose 6.0 download. Any
> > suggestions gratefully received.
> >
> > Thanks in advance
> >
> > Peter Kenny
> >
> > -----Original Message-----
> > From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of
> > PBKResearch
> > Sent: 15 May 2017 19:16
> > To: 'Any question about pharo is welcome' <pharo-users@lists.pharo.org>
> > Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for
> > utf-8 encoding
> >
> > Monty
> >
> > Many thanks for this. My original purpose was just to answer Paul
> > deBruicker's query, namely to parse an html file and stop reading at the
> > end of the <head> section. I solved this by trial and error using the code
> > shown below (which actually stops at the opening tag of the body). This
> > was not my problem at all, but Paul's; I just tackled it for fun.
> >
> > However, your note has prompted me to update my version of the whole XML
> > system - I was using the version I downloaded with Moose 6.0, which was
> > dated August 2016. I am looking at the StAX parsers as a possible way of
> > simplifying what I currently do, which involves downloading an entire web
> > page as a DOM and then manipulating it with XPath to extract the bits I am
> > interested in. I may be able to use StAX to do some of the selection and
> > manipulation as I am reading.
> >
> > It's all a new topic to me, so I foresee a lot of experimentation. It all
> > helps to keep the grey matter active.
> >
> > Thanks again
> >
> > Peter Kenny
> >
> > -----Original Message-----
> > From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of
> > monty
> > Sent: 15 May 2017 12:15
> > To: pharo-users@lists.pharo.org
> > Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for
> > utf-8 encoding
> >
> > For that kind of incremental parsing, you could also use XMLParserStAX, a
> > pull-parser that parses a document as a stream of event objects you control
> > with #next, #peek, and #atEnd. It also supports pull-DOM parsing with
> > messages like #nextNode, #nextElement, and #nextElementNamed:, which return
> > the next event object(s) as DOM subtrees (searchable with XPath). See the
> > StAXParser class comment for an example. (The StAXHTMLParser class requires
> > XMLParserHTML be installed to work.)