I use lxml to work with a large collection of TEI-encoded texts (66,000) that
are linguistically annotated. Each token is wrapped in a <w> or <pc> element
with a unique ID and various attributes. I can march through the texts at the
lowest level of <w> and <pc> elements without paying any attention to the
discursive structure of higher elements. I just do
for w in tree.iter(tei + 'w', tei + 'pc'):
    if x:
        do_this()
    if y:
        do_that()
But now I want to create a concordance in which tokens meeting some condition
are pulled out and surrounded with seven words on either side. I do this with
itersiblings(), but that is a tricky operation. The next <w> token may not be a
sibling but a child of a higher level sibling. Remembering that “elements are
lists” you have patterns like
[a, b, c, [d, e, f], g, h, i, [k, l, m, n]]
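Doing this by hand amounts to writing a document-order successor function: take the first child if there is one, else the next sibling, else climb until some ancestor has a next sibling. A sketch of that helper (the name `next_in_doc` is mine, not an lxml API):

```python
from lxml import etree

def next_in_doc(el):
    """Successor in document order: first child if any, else next
    sibling, else the next sibling of the nearest ancestor that has
    one; None at the end of the document."""
    if len(el):                       # descend to the first child
        return el[0]
    while el is not None:             # climb until a sibling exists
        nxt = el.getnext()
        if nxt is not None:
            return nxt
        el = el.getparent()
    return None
```

With the pattern above, this one function covers both transitions: from ‘c’ it steps into the nested group, and from ‘f’ it climbs back out to ‘g’.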
Getting from ‘c’ to ‘d’ is one thing, getting from ‘f’ to ‘g’ is another. In a
large archive of sometimes quite weird encodings, the details become very hairy
very fast. Is there some “Gordian knot” solution, or does one just figure
out this obstacle race one detail at a time? There are “soft” tags that do not
break the continuity of a sentence (<hi>), hard tags that mark a boundary beyond
which you don’t want to go anyhow (<p>), and “jump tags” (<note>), where your
“next sibling” is the first <w> after the <note> element, which may be quite long.
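One “Gordian knot” candidate is to stop walking siblings at all and let tree.iter() do the document-order traversal, as in the simple loop above: collect the tokens of each hard block into a flat list, drop anything inside a jump tag, and take the ±7 window on that list. A sketch along those lines (the tag sets and function names here are illustrative assumptions, not a tested recipe):

```python
from lxml import etree

TEI = '{http://www.tei-c.org/ns/1.0}'
SKIP = {TEI + 'note'}   # “jump tags”: tokens inside are excluded
HARD = TEI + 'p'        # “hard tag”: windows do not cross it

def tokens_by_block(tree):
    """Yield one flat token list per hard block, in document order.
    iter() descends into nested elements, so <hi> and other soft
    wrappers stop mattering."""
    for block in tree.iter(HARD):
        toks = []
        for w in block.iter(TEI + 'w', TEI + 'pc'):
            # skip tokens that sit inside a jump tag like <note>
            if any(a.tag in SKIP for a in w.iterancestors()):
                continue
            toks.append(w)
        yield toks

def concordance(tree, matches, width=7):
    """Return (left, keyword, right) text triples for each token
    for which matches(token) is true."""
    rows = []
    for toks in tokens_by_block(tree):
        for i, w in enumerate(toks):
            if matches(w):
                left = [t.text for t in toks[max(0, i - width):i]]
                right = [t.text for t in toks[i + 1:i + 1 + width]]
                rows.append((left, w.text, right))
    return rows
```

The point of flattening first is that the soft/hard/jump distinctions are decided once, while building the token lists, instead of inside a sibling-walking loop that has to re-solve them at every step.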
I am old enough to have grown up with Winnie-the-Pooh and feel like a “Bear of
Very Little Brain” when confronted with these problems. I’ll be grateful for
any advice, including a confirmation that it’s just the way it is.
Martin Mueller
Professor of English and Classics emeritus
_______________________________________________
lxml - The Python XML Toolkit mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/lxml.python.org/
Member address: [email protected]