When I was parsing similar text, I did a .split on the header part and parsed the sections.
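
A minimal sketch of that .split approach, assuming the markers are spelled exactly <!--Hooks--> and <!--HooksEnd--> as in the page sample quoted below. Unlike the regex in the quoted reply, it does not tolerate extra whitespace or different casing inside the comments. The function name extract_hooks and the "* ..." line filter are illustrative assumptions, not part of pywikibot or mwparserfromhell.

```python
# Marker strings copied from the quoted page sample; exact spelling is assumed.
START = "<!--Hooks-->"
END = "<!--HooksEnd-->"


def extract_hooks(text):
    """Return the hook lines found between the two marker comments.

    Hypothetical helper for illustration only.
    """
    # Split once on the start marker; the second piece is everything after it.
    parts = text.split(START, 1)
    if len(parts) != 2:
        raise ValueError("start marker not found")

    # Split the remainder once on the end marker; the first piece is the body.
    body_parts = parts[1].split(END, 1)
    if len(body_parts) != 2:
        raise ValueError("end marker not found")
    body = body_parts[0]

    # Keep only bullet lines that look like hooks ("* ... that ...").
    return [
        line.strip()
        for line in body.splitlines()
        if line.lstrip().startswith("* ...")
    ]


sample = (
    "intro text\n"
    "<!--Hooks-->\n"
    "{{main page image/DYK|image=Example.webp|caption=Example}}\n"
    "* ... that first hook?\n"
    "* ... that second hook?\n"
    "<!--HooksEnd-->\n"
    "footer text\n"
)
hooks = extract_hooks(sample)
```

With pywikibot, the input would presumably be page.text, as in the quoted regex example. Each returned line can then be handed to mwparserfromhell.parse() on its own if the links inside a hook are needed.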

On Thu, Feb 2, 2023 at 7:26 PM Roy Smith <r...@panix.com> wrote:

> Thanks.
>
> Sadly, I think treating this as flat text will end up being the most
> straight-forward way to do it.
>
>
> On Feb 2, 2023, at 7:03 PM, JJMC89 <jjmc89.wikime...@gmail.com> wrote:
>
> For similar cases, I have used a regex to find the part marked by comments
> and then parse the part between.
>
> START_END = re.compile(
>     r"^(.*?<!--\s*Hooks\s*-->)(.*?)(<!--\s*HooksEnd\s*-->.*)$",
>     flags=re.I | re.S)
> m = START_END.search(page.text)
> wikicode = mwparserfromhell.parse(m.group(2))
> # do stuff with wikicode
>
> You may be able to do it with the parser.
>
> # assume start and end represent comment objects you found from
> # wikicode.filter_comments()
> start_index = wikicode.index(start)
> end_index = wikicode.index(end)
> inside = wikicode.nodes[start_index:end_index]
>
>
> On Thu, Feb 2, 2023 at 3:39 PM Roy Smith <r...@panix.com> wrote:
>
>> I'm trying to parse DYK prep area templates, for example Template:Did
>> you know/Preparation area 3
>> <https://en.wikipedia.org/wiki/Template:Did_you_know/Preparation_area_3>.
>> Unfortunately, these are more like flat text files than any kind of nicely
>> structured data. The stuff of interest is everything between two HTML
>> comments:
>>
>> <!--Hooks-->
>> {{main page image/DYK|image=Melissa Ong.webp|caption=Selfie of Ong,
>> commonly replicated by the Step Chickens<!--the caption length is
>> intentional, it highlights that this image is there for a specific purpose
>> and isn't just any image of Ong – please don't shorten it! Same for the
>> ''(shown)'' –leek -->}}
>> * ... that "Step Chickens" on TikTok replace their profile pictures with
>> an image ''(shown)'' of '''[[Melissa Ong]]''', whom they call "Mother Hen"?
>> * ... that '''[[interfaith greetings in Indonesia]]''' include phrases
>> from Islam, Christianity, Hinduism, Buddhism, and Confucianism?
>> * ... that '''[[Kimmo Leinonen]]''' helped establish both the [[Finnish
>> Hockey Hall of Fame]] and the [[IIHF Hall of Fame]]?
>> * ... that the [[Pulitzer Prize for Fiction|Pulitzer Prize]]-winning
>> novel '''''[[All the Light We Cannot See]]''''' contains a sympathetic
>> [[Nazism|Nazi]]?
>> * ... that a {{Convert|10|ft|m|adj=mid|-tall|0}} '''[[Lady Rainier|statue
>> of a woman]]''' in [[Seattle]] was commissioned by a local brewery in 1903?
>> * ... that ...
>> * ... that prior to entering politics, '''[[Herbert Salvatierra]]''' led
>> a troupe of [[carnival]] ''[[comparsa]]s''?
>> * ... that [[Winston Churchill]] published '''[[Are There Men on the
>> Moon?|an essay on extraterrestrial life]]''' during the Second World War?
>> <!--HooksEnd-->
>>
>>
>> I can find the comments with Wikicode.filter_comments(). But once I've
>> found the two delimiting comments, how do I grab the text between them? Or
>> is the parser the wrong tool? Would I do better to treat the content of
>> the page as flat text and just iterate over it line by line, teasing it
>> apart with regexes?
_______________________________________________
pywikibot mailing list -- pywikibot@lists.wikimedia.org
Public archives at https://lists.wikimedia.org/hyperkitty/list/pywikibot@lists.wikimedia.org/message/TYHHOVQKEYA3Z7GGSANLKLK7LLBKRTCH/
To unsubscribe send an email to pywikibot-le...@lists.wikimedia.org