[Pharo-users] PetitParser question parsing HTML meta tags

2017-03-30 Thread PAUL DEBRUICKER
This is kind of an "I'm tired of thinking about this and not making much 
progress for the amount of time I'm putting in" question, but here it is: 



I'm trying to parse descriptions from HTML meta elements.  I can't use Soup 
because there isn't a working GemStone port.  

I've got it to work with the structure:



and 




but I'm running into instances of: 



and




and am having trouble adapting my parsing code (such as it is). 


The parsing code that addresses the first two cases is:



parseHtmlPageForDescription: htmlString
  | startParser endParser ppStream descParser result text lower str doubleQuoteIndex |
  lower := 'escription' asParser.
  startParser := '…
  endParser := '>' asParser.
  ppStream := htmlString readStream asPetitStream.
  descParser := ((#'any' asParser starLazy: startParser , lower)
    , (#'any' asParser starLazy: endParser)) ==> #'second'.
  result := descParser parse: ppStream.
  text := (result
    inject: (WriteStream on: String new)
    into: [ :stream :char | 
      stream nextPut: char.
      stream ])
    contents trimBoth.
  str := text copyFrom: (text findString: 'content=') + 9 to: text size.
  doubleQuoteIndex := 8 - ((str last: 7) indexOf: $").
  ^ str copyFrom: 1 to: str size - doubleQuoteIndex


I can't figure out how to change the startParser parser to accept the second 
idiom.  And maybe there's a better approach altogether.  Anyway.  If anyone has 
any ideas on different approaches I'd appreciate learning them.  


Thanks for giving it some thought

Paul


[Pharo-users] PetitParser question parsing HTML meta tags

2017-03-31 Thread Hartmut Krasemann

It's easier this way (still in the form of a script):

parseHtmlPageForDescription: htmlString
  | parser endParser metaParser descriptionParser contentParser
    res1Parser res2Parser quoteParser nonQuoteParser |

  metaParser := '…
  parser := (metaParser, res1Parser, contentParser, res2Parser, endParser) end
    ==> [:nodes | Array with: (nodes at: 2) inputValue with: (nodes at: 4) inputValue].

  ^parser parse: htmlString.


"self parseHtmlPageForDescription:  self htmlString1
self parseHtmlPageForDescription:  self htmlString2
self parseHtmlPageForDescription:  self htmlString3   "

with
htmlString1
  ^''
etc..

You may want to read http://www.lukas-renggli.ch/blog/petitparser-1

Good luck,
Hartmut







Re: [Pharo-users] PetitParser question parsing HTML meta tags

2017-03-30 Thread Martin McClure
On 03/30/2017 10:58 AM, PAUL DEBRUICKER wrote:
> I can't figure out how to change the startParser parser to accept the second 
> idiom.  And maybe there's a better approach altogether.  Anyway.  If anyone 
> has any ideas on different approaches I'd appreciate learning them.  

This looks like a job for the ordered choice operator. Perhaps something
very roughly like this...

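"Define the inner parsers first so each one exists before it is combined."
descAnyCase := 'description' asParser / 'Description' asParser.
nameDescTag := 'name=' asParser , descAnyCase.
httpDescTag := 'http-equiv=' asParser , descAnyCase.
"Ordered choice: try the name= form first, then the http-equiv= form."
descTag := nameDescTag / httpDescTag.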

And so on.
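
Carrying the ordered-choice idea a little further, a workspace sketch might look 
like the one below. It is not anyone's code from this thread. It assumes 
PetitParser is loaded, that attribute values are double-quoted, that the tag is 
written in lowercase as <meta, and that content= directly follows the name= or 
http-equiv= attribute. All variable names and the sample HTML are made up for 
illustration.

```smalltalk
"Sketch only. Assumes PetitParser is loaded, attribute values are double-quoted,
 and content= directly follows the name= or http-equiv= attribute.
 All variable names here are illustrative."
| quote ws descAnyCase keyAttr value contentAttr metaTag html |
quote := $" asParser.
ws := #space asParser plus.
"caseInsensitive accepts description, Description, DESCRIPTION and so on."
descAnyCase := 'description' asParser caseInsensitive.
"Ordered choice: try name= first, then fall back to http-equiv=."
keyAttr := ('name=' asParser / 'http-equiv=' asParser) , quote , descAnyCase , quote.
"Everything up to the closing quote, flattened into a String."
value := quote negate star flatten.
"Keep only the third element of the sequence, which is the value."
contentAttr := ('content=' asParser , quote , value , quote) ==> #third.
"Answer only the content value of a matching tag."
metaTag := ('<meta' asParser , ws , keyAttr , ws , contentAttr) ==> #last.
html := '<head><meta http-equiv="Description" content="A short summary"></head>'.
"Scan the whole page and collect the content of every matching meta tag."
metaTag matchesSkipIn: html
```

Because #matchesSkipIn: scans the whole input, no starLazy: prefix is needed to 
skip to the tag. Tags with extra attributes, single quotes, or a different 
attribute order would need additional alternatives.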

HTH.

Regards,

-Martin



Re: [Pharo-users] PetitParser question parsing HTML meta tags

2017-03-31 Thread monty
You could use XMLHTMLParser from STHub PharoExtras/XMLParserHTML (supported on 
Pharo, Squeak, and GS):

descriptions := OrderedCollection new.
(XMLHTMLParser parseURL: aURL)
  allElementsNamed: 'meta'
  do: [:each |
    ((each attributeAt: 'name') asLowercase = 'description'
      or: [(each attributeAt: 'http-equiv') asLowercase = 'description'])
        ifTrue: [descriptions addLast: (each attributeAt: 'content')]].

It accepts messy HTML and produces an XML DOM tree from it.
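
Since the method in the original post receives the page as a string rather than 
a URL, the same idea could be adapted roughly as follows. This is a sketch, 
assuming XMLParserHTML is loaded and that XMLHTMLParser understands the 
class-side #parse: that XMLDOMParser provides for strings. The method name is 
reused from the original post, and answering nil when no tag matches is an 
assumption of this sketch.

```smalltalk
parseHtmlPageForDescription: htmlString
    "Sketch: answer the content of the first description meta tag, or nil if none is found."
    (XMLHTMLParser parse: htmlString)
        allElementsNamed: 'meta'
        do: [ :each |
            "Compare in lowercase so description and Description both match."
            ((each attributeAt: 'name') asLowercase = 'description'
                or: [ (each attributeAt: 'http-equiv') asLowercase = 'description' ])
                ifTrue: [ ^ each attributeAt: 'content' ] ].
    ^ nil
```

Returning from inside the block stops at the first match. Collecting into an 
OrderedCollection, as in the snippet above, would gather every matching tag.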




Re: [Pharo-users] PetitParser question parsing HTML meta tags

2017-04-02 Thread monty
XMLParserHTML is the fastest HTML parser on Pharo, Squeak, and GS. It has DOM 
and SAX parsers and works with other libs such as PharoExtras/XPath and 
PharoExtras/XMLParserStAX.

Element and attribute names are normalized to lowercase. Printing XML DOM 
trees back as HTML is complicated by browsers not recognizing XML-style 
self-closing tags ending with "/>" for some elements (like "script"), so use 
#printedWithoutSelfClosingTags, #printWithoutSelfClosingTagsOn:, or 
#printWithoutSelfClosingTagsToFileNamed: instead of the default printing 
methods.




Re: [Pharo-users] PetitParser question parsing HTML meta tags

2017-04-05 Thread Paul DeBruicker
Thanks.  I really appreciate everyone's help on this.  Was at a high level of
frustration the other day.  







