So... okay. I've got a bunch of PDFs of tournament reports that I want to sift thru for information. Ended up using 'pdftotext -layout file.pdf file.txt' to extract the text from the PDF. Still have a few little glitches to iron out there, but I'm getting decent enough results for the moment to move on.

I've got my script to where it opens the file, ignores the header lines at the top, then goes through the rest of the file line by line, skipping lines if they don't match (don't need the separator lines) and adding them to a list if they do (and stripping whitespace off the right side along the way). So far, so good.

#  rstatPDF2csv.py

import sys
import re


def convert(file):
    lines = []
    data = open(file)

    # Skip first n lines of headers
    for i in range(9):
        data.__next__()

    # Read remaining lines one at a time
    for line in data:

        # If the line begins with a capital letter...
        if re.match(r'^[A-Z]', line):

            # Strip any trailing whitespace and then add to the list
            lines.append(line.rstrip())

    return lines

if __name__ == '__main__':
    print(convert(sys.argv[1]))



What I'm ending up with is a list full of strings that look something like this:

['JOHN DOE C T HM 445-20*MW* 199-11*MW* 194-5 1HM 393-16*MW* 198-9 1HM 198-11*MW* 396-20*MW* 789-36*MW* 1234-56 *MW*',

Basically... a certain number of characters allotted for competitor name, then four or five 1-2 char columns for things like classification, age group, special categories, etc., then a score ('445-20'), then up to 4 char for award (if any), then another score, another award, etc. etc. etc.

Right now (in the PDF) the scores are batched by one criterion, then sorted within those groups. Makes life easier for the person giving out awards at the end of the tournament, not so much for someone trying to see how their individual score ranks against the whole field, not just their group or sub-group. I want to be able to pull all the scores out and then re-sort based on score - mainly the final aggregate score, but potentially also on stage or daily scores. Eventually I'd like to be able to calculate standardized z-scores so as to be able to compare scores from one event/location against another.

So back to the lines of text I have stored as strings in a list. I think I want to convert that to a list of lists, i.e. split each line up, store that info in another list and ditch the whitespace. Or would I be better off using dicts? Originally I was thinking of how to process each line and split it them up based on what information was where - some sort of nested for/if mess. Now I'm starting to think that the lines of text are pretty uniform in structure i.e. the same field is always in the same location, and that list slicing might be the way to go, if a bit tedious to set up initially...?

Any thoughts or suggestions from people who've gone down this particular path would be greatly appreciated. I think I have a general idea/direction, but I'm open to other ideas if the path I'm on is just blatantly wrong.



Thanks,

Monte


--
Shiny!  Let's be bad guys.

Reach me @ memilanuk (at) gmail dot com

--
https://mail.python.org/mailman/listinfo/python-list

Reply via email to