So... okay. I've got a bunch of PDFs of tournament reports that I want
to sift through for information. I ended up using 'pdftotext -layout
file.pdf file.txt' to extract the text from each PDF. There are still a
few little glitches to iron out there, but I'm getting decent enough
results for the moment to move on.
I've got my script to where it opens the file, ignores the header lines
at the top, then goes through the rest of the file line by line,
skipping lines that don't start with a capital letter (don't need the
separator lines) and adding the ones that do to a list (stripping
whitespace off the right side along the way). So far, so good.
# rstatPDF2csv.py
import sys
import re


def convert(file):
    lines = []
    # 'with' closes the file even if something goes wrong part-way
    with open(file) as data:
        # Skip the first 9 lines of headers
        for _ in range(9):
            next(data, None)
        # Read remaining lines one at a time
        for line in data:
            # If the line begins with a capital letter...
            if re.match(r'^[A-Z]', line):
                # Strip any trailing whitespace and then add to the list
                lines.append(line.rstrip())
    return lines


if __name__ == '__main__':
    print(convert(sys.argv[1]))
What I'm ending up with is a list full of strings that look something
like this:
['JOHN DOE C T HM 445-20*MW* 199-11*MW*
194-5 1HM 393-16*MW* 198-9 1HM 198-11*MW* 396-20*MW*
789-36*MW* 1234-56 *MW*',
Basically... a certain number of characters allotted for competitor
name, then four or five 1-2 char columns for things like classification,
age group, special categories, etc., then a score ('445-20'), then up to
4 chars for an award (if any), then another score, another award, and so on.
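
For what it's worth, here is a rough sketch of how the score/award pairs
might be pulled out of one of those strings with a regular expression.
The token shapes are guesses based only on the sample line above: a score
is assumed to be digits-hyphen-digits, and an award is assumed to look
like '*MW*' or '1HM'. The names SCORE_AWARD and extract_scores are made up
for illustration.

```python
import re

# Assumed token shapes, inferred only from the sample line:
#   score: digits, hyphen, digits (e.g. '445-20')
#   award: '*MW*' or something like '1HM', with optional leading whitespace
SCORE_AWARD = re.compile(r'(\d+)-(\d+)(?:\s*(\*[A-Z]+\*|\d*[A-Z]+))?')


def extract_scores(line):
    """Return a list of (first, second, award) tuples found in line.

    first/second are the two numbers of a score such as '445-20';
    award is None when no award follows the score.
    """
    results = []
    for first, second, award in SCORE_AWARD.findall(line):
        results.append((int(first), int(second), award or None))
    return results


sample = 'JOHN DOE C T HM 445-20*MW* 199-11*MW* 194-5 1HM 393-16*MW*'
for entry in extract_scores(sample):
    print(entry)
```

This ignores column positions entirely, so it cannot tell which stage a
score belongs to if a competitor has a blank in the middle of the row.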
Right now (in the PDF) the scores are batched by one criterion, then
sorted within those groups. Makes life easier for the person giving out
awards at the end of the tournament, not so much for someone trying to
see how their individual score ranks against the whole field, not just
their group or sub-group. I want to be able to pull all the scores out
and then re-sort based on score - mainly the final aggregate score, but
potentially also on stage or daily scores. Eventually I'd like to be
able to calculate standardized z-scores so as to be able to compare
scores from one event/location against another.
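
As a minimal sketch of that last part, assuming the aggregate scores have
already been extracted as plain numbers: sorting is a one-liner with
sorted(), and z-scores only need the mean and standard deviation from the
standard library's statistics module. The function names and the data
below are invented for illustration.

```python
from statistics import mean, pstdev


def rank_by_aggregate(records):
    """Sort (name, aggregate) pairs from highest aggregate to lowest."""
    return sorted(records, key=lambda rec: rec[1], reverse=True)


def z_scores(values):
    """Return (value - mean) / standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)  # population std dev; statistics.stdev() is the sample version
    if sigma == 0:
        # Everyone scored the same, so nobody deviates from the mean
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up data purely for illustration
field = [('COMPETITOR A', 1180), ('COMPETITOR B', 1234), ('COMPETITOR C', 1207)]
ranked = rank_by_aggregate(field)
zs = z_scores([score for _, score in ranked])
for (name, score), z in zip(ranked, zs):
    print(name, score, round(z, 2))
```

Whether the population or sample standard deviation is the right choice
depends on whether the report covers the whole field or just part of it.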
So back to the lines of text I have stored as strings in a list. I
think I want to convert that to a list of lists, i.e. split each line
up, store that info in another list and ditch the whitespace. Or would
I be better off using dicts? Originally I was thinking of how to
process each line and split it up based on what information was
where - some sort of nested for/if mess. Now I'm starting to think that
the lines of text are pretty uniform in structure i.e. the same field is
always in the same location, and that list slicing might be the way to
go, if a bit tedious to set up initially...?
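
One way the slicing idea could look, as a sketch only: keep the column
boundaries in one table of named slice objects, and turn each line into a
dict keyed by field name. The boundaries below are placeholders, not the
real layout; they would have to be measured from the actual pdftotext
output. FIELDS and split_fixed_width are hypothetical names.

```python
# Hypothetical column boundaries. These numbers are placeholders;
# measure the real ones from your own 'pdftotext -layout' output.
FIELDS = {
    'name':      slice(0, 20),
    'class':     slice(20, 23),
    'aggregate': slice(23, 31),
    'award':     slice(31, 36),
}


def split_fixed_width(line, fields=FIELDS):
    """Cut one fixed-width line into a dict of stripped field values."""
    return {key: line[cols].strip() for key, cols in fields.items()}


# Build a demo line that matches the placeholder layout above
demo = 'JOHN DOE'.ljust(20) + 'C'.ljust(3) + '1234-56'.ljust(8) + '*MW*'
row = split_fixed_width(demo)
print(row['name'], row['aggregate'], row['award'])
```

Slicing past the end of a string just gives an empty string, so lines that
were shortened by rstrip() still produce a full dict with '' for the
missing trailing fields. The tedious part stays in one place (the FIELDS
table) rather than being spread through nested for/if logic.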
Any thoughts or suggestions from people who've gone down this particular
path would be greatly appreciated. I think I have a general
idea/direction, but I'm open to other ideas if the path I'm on is just
blatantly wrong.
Thanks,
Monte
--
Shiny! Let's be bad guys.
Reach me @ memilanuk (at) gmail dot com