[pywikibot] Re: Text between two comments?

John Thu, 02 Feb 2023 16:49:08 -0800

When I was parsing similar text, I did a .split() on the header part and
then parsed the resulting sections.
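
Roughly, the split approach could look like the sketch below. It is only an
illustration under some assumptions: the markers are the <!--Hooks--> and
<!--HooksEnd--> comments from Roy's example, each hook sits on its own line
in the real page text (the wrapping in the quoted sample comes from the
email), and extract_hooks is a made-up name, not part of pywikibot or
mwparserfromhell.

```python
import re

# Marker comments from the quoted prep-area example; tolerating extra
# whitespace and case differences is an assumption.
HOOKS_START = re.compile(r"<!--\s*Hooks\s*-->", re.I)
HOOKS_END = re.compile(r"<!--\s*HooksEnd\s*-->", re.I)


def extract_hooks(text):
    """Return the hook lines found between the two marker comments."""
    # Split once on the start marker; one part back means it was not found.
    parts = HOOKS_START.split(text, maxsplit=1)
    if len(parts) != 2:
        raise ValueError("start marker not found")
    # Split the remainder once on the end marker.
    parts = HOOKS_END.split(parts[1], maxsplit=1)
    if len(parts) != 2:
        raise ValueError("end marker not found")
    section = parts[0]
    # Keep only the lines that begin like a hook.
    return [line for line in section.splitlines()
            if line.startswith("* ... that")]
```

With pywikibot you would pass page.text in; each returned line can then be
handed to mwparserfromhell.parse() if you need the links or templates
inside a hook.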


On Thu, Feb 2, 2023 at 7:26 PM Roy Smith <r...@panix.com> wrote:

> Thanks.
>
> Sadly, I think treating this as flat text will end up being the most
> straightforward way to do it.
>
>
>
> On Feb 2, 2023, at 7:03 PM, JJMC89 <jjmc89.wikime...@gmail.com> wrote:
>
> For similar cases, I have used a regex to find the part marked by comments
> and then parse the part between.
>
> import re
> import mwparserfromhell
>
> START_END = re.compile(
>     r"^(.*?<!--\s*Hooks\s*-->)(.*?)(<!--\s*HooksEnd\s*-->.*)$",
>     flags=re.I | re.S,
> )
> m = START_END.search(page.text)
> if m:  # m is None when either comment is missing
>     wikicode = mwparserfromhell.parse(m.group(2))
>     # do stuff with wikicode
>
>
> You may be able to do it with the parser.
> # assume start and end represent comment objects you found from
> # wikicode.filter_comments()
> start_index = wikicode.index(start)
> end_index = wikicode.index(end)
> # + 1 skips the start comment itself
> inside = wikicode.nodes[start_index + 1:end_index]
>
>
> On Thu, Feb 2, 2023 at 3:39 PM Roy Smith <r...@panix.com> wrote:
>
>> I'm trying to parse DYK prep area templates, for example Template:Did
>> you know/Preparation area 3
>> <https://en.wikipedia.org/wiki/Template:Did_you_know/Preparation_area_3>.
>> Unfortunately, these are more like flat text files than any kind of nicely
>> structured data.  The stuff of interest is everything between two HTML
>> comments:
>>
>> <!--Hooks-->
>> {{main page image/DYK|image=Melissa Ong.webp|caption=Selfie of Ong,
>> commonly replicated by the Step Chickens<!--the caption length is
>> intentional, it highlights that this image is there for a specific purpose
>> and isn't just any image of Ong – please don't shorten it! Same for the
>> ''(shown)'' –leek -->}}
>> * ... that "Step Chickens" on TikTok replace their profile pictures with
>> an image ''(shown)'' of '''[[Melissa Ong]]''', whom they call "Mother Hen"?
>> * ... that '''[[interfaith greetings in Indonesia]]''' include phrases
>> from Islam, Christianity, Hinduism, Buddhism, and Confucianism?
>> * ... that '''[[Kimmo Leinonen]]''' helped establish both the [[Finnish
>> Hockey Hall of Fame]] and the [[IIHF Hall of Fame]]?
>> * ... that the [[Pulitzer Prize for Fiction|Pulitzer Prize]]-winning
>> novel '''''[[All the Light We Cannot See]]''''' contains a sympathetic
>> [[Nazism|Nazi]]?
>> * ... that a {{Convert|10|ft|m|adj=mid|-tall|0}} '''[[Lady Rainier|statue
>> of a woman]]''' in [[Seattle]] was commissioned by a local brewery in 1903?
>> * ... that ...
>> * ... that prior to entering politics, '''[[Herbert Salvatierra]]''' led
>> a troupe of [[carnival]] ''[[comparsa]]s''?
>> * ... that [[Winston Churchill]] published '''[[Are There Men on the
>> Moon?|an essay on extraterrestrial life]]''' during the Second World War?
>> <!--HooksEnd-->
>>
>>
>> I can find the comments with Wikicode.filter_comments().  But once I've
>> found the two delimiting comments, how do I grab the text between them?  Or
>> is the parser the wrong tool?  Would I do better to treat the content of
>> the page as flat text and just iterate over it line by line, teasing it
>> apart with regexes?
>>
>>
>> _______________________________________________
>> pywikibot mailing list -- pywikibot@lists.wikimedia.org
>> Public archives at
>> https://lists.wikimedia.org/hyperkitty/list/pywikibot@lists.wikimedia.org/message/XA2Y2ZFSFSLRG5TWHIV5G3QRMAK27H56/
>> To unsubscribe send an email to pywikibot-le...@lists.wikimedia.org
>>

