[Pharo-users] PetitParser question parsing HTML meta tags
This is kind of an "I'm tired of thinking about this and not making much progress for the amount of time I'm putting in" question, but here it is:

I'm trying to parse descriptions from HTML meta elements. I can't use Soup because there isn't a working GemStone port.

I've got it to work with the structure:

    [meta tag example stripped from the archive]

and

    [meta tag example stripped from the archive]

but I'm running into instances of:

    [meta tag example stripped from the archive]

and

    [meta tag example stripped from the archive]

and am having trouble adapting my parsing code (such as it is).

The parsing code that addresses the first two cases is:

    parseHtmlPageForDescription: htmlString
        | startParser endParser ppStream descParser result text lower str doubleQuoteIndex |
        lower := 'escription' asParser.
        startParser := '[rest of this statement stripped from the archive]
        endParser := '>' asParser.
        ppStream := htmlString readStream asPetitStream.
        descParser := ((#'any' asParser starLazy: startParser , lower)
            , (#'any' asParser starLazy: endParser)) ==> #'second'.
        result := descParser parse: ppStream.
        text := (result
            inject: (WriteStream on: String new)
            into: [ :stream :char |
                stream nextPut: char.
                stream ])
            contents trimBoth.
        str := text copyFrom: (text findString: 'content=') + 9 to: text size.
        doubleQuoteIndex := 8 - ((str last: 7) indexOf: $").
        ^ str copyFrom: 1 to: str size - doubleQuoteIndex

I can't figure out how to change the startParser parser to accept the second idiom. And maybe there's a better approach altogether. Anyway. If anyone has any ideas on different approaches I'd appreciate learning them.

Thanks for giving it some thought

Paul
[Pharo-users] PetitParser question parsing HTML meta tags
It's easier this way (still in the form of a script):

    parseHtmlPageForDescription: htmlString
        | parser endParser metaParser descriptionParser contentParser res1Parser res2Parser quoteParser nonQuoteParser |
        metaParser := '[rest of this statement and the definitions of the other parsers stripped from the archive]
        parser := (metaParser, res1Parser, contentParser, res2Parser, endParser) end
            ==> [:nodes | Array with: (nodes at: 2) inputValue with: (nodes at: 4) inputValue ].
        ^parser parse: htmlString.

        "self parseHtmlPageForDescription: self htmlString1
         self parseHtmlPageForDescription: self htmlString2
         self parseHtmlPageForDescription: self htmlString3"

with

    htmlString1
        ^''
        [the HTML inside this string was stripped from the archive]

etc.

You may want to read http://www.lukas-renggli.ch/blog/petitparser-1

Good luck,
Hartmut

--
Hartmut Krasemann
Königsberger Str. 41 c
D 22869 Schenefeld
Tel. 040.8307097
Mobile 0171.6451283
krasem...@acm.org
Re: [Pharo-users] PetitParser question parsing HTML meta tags
On 03/30/2017 10:58 AM, PAUL DEBRUICKER wrote:
> I can't figure out how to change the startParser parser to accept the second
> idiom. And maybe there's a better approach altogether. Anyway. If anyone
> has any ideas on different approaches I'd appreciate learning them.

This looks like a job for the ordered choice operator. Perhaps something very roughly like this...

    "Each parser is defined before it is used, and string literals are sent asParser before being combined with , or /"
    descAnyCase := 'description' asParser / 'Description' asParser.
    nameDescTag := 'name=' asParser , descAnyCase.
    httpDescTag := 'http-equiv=' asParser , descAnyCase.
    descTag := nameDescTag / httpDescTag.

And so on.

HTH.

Regards,
-Martin
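The ordered-choice idea can be fleshed out into a small workspace script. This is only a sketch under stated assumptions, not Paul's or Martin's actual code: it assumes the attribute values are double-quoted, that the name or http-equiv attribute comes directly before content with single spaces in between, and that literal parsers in your PetitParser version understand caseInsensitive. The variable names and the sample string are made up for illustration.

```smalltalk
| quoted descAnyCase descAttr metaTag finder sample result |

"A double-quoted attribute value. Answers only the text between the quotes."
quoted := ($" asParser , $" asParser negate star flatten , $" asParser) ==> #second.

"Matches description in any letter case (assumes caseInsensitive is available)."
descAnyCase := 'description' asParser caseInsensitive.

"Ordered choice: try the name attribute first, then http-equiv."
descAttr := ('name="' asParser / 'http-equiv="' asParser) , descAnyCase , $" asParser.

"The whole tag up to the end of the content value. Answers the content value."
metaTag := ('<meta ' asParser , descAttr , ' content=' asParser , quoted)
	==> [ :nodes | nodes last ].

"Skip lazily over anything in front of the first matching tag."
finder := ((#any asParser starLazy: metaTag) , metaTag) ==> #second.

"Hypothetical sample page, only for illustration."
sample := '<html><head><meta http-equiv="Description" content="A page about parsing"></head></html>'.

result := finder parse: sample.
result isPetitFailure ifTrue: [ result := nil ].
```

With the sample above, result is expected to hold the text of the content attribute, or nil when no description tag is found. Real pages vary attribute order, quoting and whitespace, so a grammar like this would need more alternatives before it is robust.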
Re: [Pharo-users] PetitParser question parsing HTML meta tags
You could use XMLHTMLParser from STHub PharoExtras/XMLParserHTML (supported on Pharo, Squeak, and GS):

    descriptions := OrderedCollection new.
    (XMLHTMLParser parseURL: aURL)
        allElementsNamed: 'meta'
        do: [:each |
            ((each attributeAt: 'name') asLowercase = 'description'
                or: [(each attributeAt: 'http-equiv') asLowercase = 'description'])
                    ifTrue: [descriptions addLast: (each attributeAt: 'content')]].

It accepts messy HTML and produces an XML DOM tree from it.

> Sent: Thursday, March 30, 2017 at 1:58 PM
> From: "PAUL DEBRUICKER"
> To: "Any question about pharo is welcome"
> Subject: [Pharo-users] PetitParser question parsing HTML meta tags
>
> This is kind of a "I'm tired of thinking about this and not making much
> progress for the amount of time I'm putting in question" but here it is:
>
> I'm trying to parse descriptions from HTML meta elements. I can't use Soup
> because there isn't a working GemStone port.
>
> I've got it to work with the structure:
>
> and
>
> but I'm running into instances of:
>
> and
>
> and am having trouble adapting my parsing code (such as it is).
>
> The parsing code that addresses the first two cases is:
>
> parseHtmlPageForDescription: htmlString
>     | startParser endParser ppStream descParser result text lower str
>       doubleQuoteIndex |
>     lower := 'escription' asParser.
>     startParser := '
>     endParser := '>' asParser.
>     ppStream := htmlString readStream asPetitStream.
>     descParser := ((#'any' asParser starLazy: startParser , lower)
>         , (#'any' asParser starLazy: endParser)) ==> #'second'.
>     result := descParser parse: ppStream.
>     text := (result
>         inject: (WriteStream on: String new)
>         into: [ :stream :char |
>             stream nextPut: char.
>             stream ])
>         contents trimBoth.
>     str := text copyFrom: (text findString: 'content=') + 9 to: text size.
>     doubleQuoteIndex := 8 - ((str last: 7) indexOf: $").
>     ^ str copyFrom: 1 to: str size - doubleQuoteIndex
>
> I can't figure out how to change the startParser parser to accept the second
> idiom. And maybe there's a better approach altogether. Anyway. If anyone
> has any ideas on different approaches I'd appreciate learning them.
>
> Thanks for giving it some thought
>
> Paul
Re: [Pharo-users] PetitParser question parsing HTML meta tags
XMLParserHTML is the fastest HTML parser on Pharo, Squeak, and GS. It has DOM and SAX parsers and works with other libs such as PharoExtras/XPath and PharoExtras/XMLParserStAX.

Element and attribute names are normalized to lowercase, and printing XML DOM trees back as HTML is complicated by browsers not recognizing XML-style self-closing tags ending with "/>" for some elements (like "script"), so use #printedWithoutSelfClosingTags/#printWithoutSelfClosingTagsOn:/#printWithoutSelfClosingTagsToFileNamed: instead.
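The printing advice in monty's message can be tried from a workspace. A minimal sketch, assuming XMLParserHTML is loaded and that XMLHTMLParser answers a DOM document for the class-side parse: message when given a string. The sample markup and the variable names are made up for illustration.

```smalltalk
| doc html |

"Parse a small, hypothetical HTML string into an XML DOM tree."
doc := XMLHTMLParser parse: '<html><head><script src="a.js"></script></head><body></body></html>'.

"Print the tree back as HTML without XML-style self-closing tags,
so that empty elements such as script keep an explicit end tag."
html := doc printedWithoutSelfClosingTags.

Transcript show: html; cr.
```

Comparing this output with what the ordinary printing messages produce for the same document should show the difference for empty elements like script.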
Re: [Pharo-users] PetitParser question parsing HTML meta tags
Thanks. I really appreciate everyone's help on this. Was at a high level of frustration the other day.

--
View this message in context: http://forum.world.st/PetitParser-question-parsing-HTML-meta-tags-tp4940587p4941367.html
Sent from the Pharo Smalltalk Users mailing list archive at Nabble.com.