Hi everyone,
So here is my problem. I have a bunch of tweets and various metadata that I
want to analyze for sociolinguistic purposes. To do this, I'm trying to infer
users' ages from, among other things, the information they provide in their
bios. For that I'm using regular expressions to match a few recurring patterns
in users' bios, such as a number followed by one of the various spellings of
"years old", as in:
"John, 30 years old, engineer."
The reason I'm using regexes for this is that there are actually very few
ways people mention their age on Twitter, so just three or four regexes
would allow me to infer the age of most users in my dataset. However, in this
case I also want to check what comes after "years old", as many people
mention their children's ages, and I don't want these to be incorrectly
attributed to the user, as in:
"John, father of a 12 year old kid, engineer"
Cases like the one above should be ignored, so that I keep only users for
whom a valid age can be inferred.
My program looks like this:
    import csv
    import re

    with open("test_corpus.csv") as corpus:
        corpus_read = csv.reader(corpus, delimiter=",")
        for row in corpus_read:
            if re.findall(r"\d{2}\s?(?=years old\s?|yo\s?|yr old\s?|y o\s?|yrs old\s?|year old\s?(?!son|daughter|kid|child))", row[5].lower()):
                age = re.findall(r"\d{2}\s?", row[5].lower())
                for i in age:
                    print(i)
The program seems to work in some cases, but in the small test file I created
to try it out, it incorrectly matches the age mentioned in the string "I have a
12 yo son", and returns 12 as a matched age, which I don't want it to. I'm
guessing this has something to do with brackets or delimiters at some point in
the program, but I spent a few days on it, and I could not find anything
helpful around here or on other forums, so any help would be appreciated.
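To make the problem reproducible without the CSV file, here is a minimal
sketch that runs the same pattern directly on the example strings from this
message. The names PATTERN and samples exist only for this illustration and
are not part of my actual program:

```python
import re

# Same pattern as in my program, split across two literals only for readability.
PATTERN = (
    r"\d{2}\s?(?=years old\s?|yo\s?|yr old\s?|y o\s?|yrs old\s?"
    r"|year old\s?(?!son|daughter|kid|child))"
)

samples = [
    "John, 30 years old, engineer.",
    "I have a 12 yo son",
    "John, father of a 12 year old kid, engineer",
]

for bio in samples:
    # A non-empty list means the bio is treated as containing the user's age.
    print(repr(bio), "->", re.findall(PATTERN, bio.lower()))
```

All three strings produce a non-empty list here, including the two that
mention a child's age and that I want to reject.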
So the actual question is: how can I make the program not recognize 12 in
"John, father of a 12 year old kid, engineer" as the age of the user, based on
the program I already have?
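For reference, here is a sketch of the behaviour I am after. It is one
possible direction under assumptions, not something checked against my
dataset: the spellings are wrapped in a non-capturing group so that the
exclusion applies to every alternative, the whitespace is consumed inside
the negative lookahead, and the digits are captured directly instead of
being extracted by a second findall. The names AGE_PATTERN and infer_age
are hypothetical:

```python
import re

AGE_PATTERN = re.compile(
    r"\b(\d{2})\s?"                                    # two-digit number, captured
    r"(?:years old|year old|yrs old|yr old|yo|y o)\b"  # spellings of "years old"
    r"(?!\s*(?:son|daughter|kid|child))"               # not followed by a child word
)


def infer_age(bio):
    """Return the first plausible user age found in bio, or None."""
    match = AGE_PATTERN.search(bio.lower())
    return int(match.group(1)) if match else None


print(infer_age("John, 30 years old, engineer."))                # 30
print(infer_age("I have a 12 yo son"))                           # None
print(infer_age("John, father of a 12 year old kid, engineer"))  # None
```

This only covers the child words listed in my original pattern, so other
phrasings would still need to be added by hand.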
I am somewhat new to programming, so apologies if I forgot to mention something
important. Do not hesitate to tell me if you need more details.
Thanks in advance for any help you could provide!
--
https://mail.python.org/mailman/listinfo/python-list