Picking apart a text line

memilanuk Thu, 26 Feb 2015 19:56:12 -0800

So... okay. I've got a bunch of PDFs of tournament reports that I want to sift thru for information. Ended up using 'pdftotext -layout file.pdf file.txt' to extract the text from the PDF. Still have a few little glitches to iron out there, but I'm getting decent enough results for the moment to move on.

I've got my script to where it opens the file, ignores the header lines at the top, then goes through the rest of the file line by line, skipping lines if they don't match (don't need the separator lines) and adding them to a list if they do (and stripping whitespace off the right side along the way). So far, so good.


#  rstatPDF2csv.py

import sys
import re


def convert(file):
    lines = []

    # The with-block closes the file even if an error occurs
    with open(file) as data:

        # Skip first n lines of headers
        for _ in range(9):
            next(data)

        # Read remaining lines one at a time
        for line in data:

            # If the line begins with a capital letter...
            if re.match(r'^[A-Z]', line):

                # Strip any trailing whitespace and then add to the list
                lines.append(line.rstrip())

    return lines

if __name__ == '__main__':
    print(convert(sys.argv[1]))

What I'm ending up with is a list full of strings that look something like this:

['JOHN DOE C T HM 445-20*MW* 199-11*MW*194-5 1HM 393-16*MW* 198-9 1HM 198-11*MW* 396-20*MW*789-36*MW* 1234-56 *MW*',

Basically... a certain number of characters allotted for competitor name, then four or five 1-2 char columns for things like classification, age group, special categories, etc., then a score ('445-20'), then up to 4 char for award (if any), then another score, another award, etc. etc. etc.
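As a rough sketch of pulling just the scores out of one of those strings: a regular expression can pick up every "digits-hyphen-digits" token regardless of which award text sits next to it. The function name extract_scores is made up for illustration, and treating each score as a pair of integers is an assumption based only on the '445-20' example. It also relies on a space separating a score from an award that starts with a digit (like '1HM'); if that space goes missing, the split becomes ambiguous.

```python
import re

# Assumption: a score is two integers joined by a hyphen, e.g. '445-20'.
SCORE_RE = re.compile(r'(\d+)-(\d+)')


def extract_scores(line):
    """Return every score found in the line as a (first, second) tuple of ints."""
    return [(int(a), int(b)) for a, b in SCORE_RE.findall(line)]


sample = ('JOHN DOE C T HM 445-20*MW* 199-11*MW*194-5 1HM 393-16*MW* '
          '198-9 1HM 198-11*MW* 396-20*MW*789-36*MW* 1234-56 *MW*')

for score in extract_scores(sample):
    print(score)
```

This ignores the name, category, and award columns entirely, so it is only useful when the position of a score in the resulting list is enough to tell which stage it belongs to.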

Right now (in the PDF) the scores are batched by one criterion, then sorted within those groups. Makes life easier for the person giving out awards at the end of the tournament, not so much for someone trying to see how their individual score ranks against the whole field, not just their group or sub-group. I want to be able to pull all the scores out and then re-sort based on score - mainly the final aggregate score, but potentially also on stage or daily scores. Eventually I'd like to be able to calculate standardized z-scores so as to be able to compare scores from one event/location against another.
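A minimal sketch of the re-sort and the z-score step, assuming each competitor has already been reduced to a name plus an aggregate score stored as a pair of integers. The records below are invented placeholders, not real results, and the function names are hypothetical. Treating the second number as a tie-breaker and using the population standard deviation (statistics.pstdev) rather than the sample one are both assumptions.

```python
from statistics import mean, pstdev

# Hypothetical (name, (points, second_number)) records for illustration only
results = [
    ("COMPETITOR A", (789, 36)),
    ("COMPETITOR B", (775, 30)),
    ("COMPETITOR C", (789, 40)),
]


def rank_by_score(records):
    """Sort highest score first; tuples compare on points, then second number."""
    return sorted(records, key=lambda rec: rec[1], reverse=True)


def z_scores(records):
    """Map each name to (points - mean) / standard deviation of the field."""
    points = [score[0] for _, score in records]
    mu = mean(points)
    sigma = pstdev(points)
    if sigma == 0:
        # Everyone has the same points, so no one deviates from the mean
        return {name: 0.0 for name, _ in records}
    return {name: (score[0] - mu) / sigma for name, score in records}


ranked = rank_by_score(results)
z = z_scores(results)
for name, score in ranked:
    print(name, score, round(z[name], 3))
```

Because a z-score is expressed in standard deviations from that event's own mean, values from different events land on a common scale, which is what makes the cross-event comparison possible.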

So back to the lines of text I have stored as strings in a list. I think I want to convert that to a list of lists, i.e. split each line up, store that info in another list and ditch the whitespace. Or would I be better off using dicts? Originally I was thinking of how to process each line and split them up based on what information was where - some sort of nested for/if mess. Now I'm starting to think that the lines of text are pretty uniform in structure, i.e. the same field is always in the same location, and that list slicing might be the way to go, if a bit tedious to set up initially...?

Any thoughts or suggestions from people who've gone down this particular path would be greatly appreciated. I think I have a general idea/direction, but I'm open to other ideas if the path I'm on is just blatantly wrong.
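To make the slicing idea concrete, here is a small sketch that keeps the column layout in one table of named slice objects and turns each line into a dict. The field names and column offsets are invented for illustration; the real ones would have to be measured from the pdftotext output. The sample line is built with ljust() so that it matches the made-up layout exactly.

```python
# Hypothetical layout: field name -> column range. These offsets are
# placeholders, not the real positions in the tournament reports.
FIELDS = {
    "name": slice(0, 20),
    "cls": slice(20, 23),
    "age": slice(23, 26),
    "score": slice(26, 34),
    "award": slice(34, 38),
}


def parse_line(line):
    """Cut one fixed-width line into a dict of whitespace-stripped fields."""
    # Slicing past the end of a short line just yields '', so lines that
    # were rstrip()'d earlier are still safe to parse.
    return {field: line[cols].strip() for field, cols in FIELDS.items()}


# Build a sample line that follows the made-up layout above
sample = ("JOHN DOE".ljust(20) + "C".ljust(3) + "T".ljust(3)
          + "445-20".ljust(8) + "*MW*")

record = parse_line(sample)
print(record)
```

Keeping the offsets in a single table means the tedious part is done once, and a dict per line lets later code refer to record["score"] rather than a positional index.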




Thanks,

Monte


--
Shiny!  Let's be bad guys.

Reach me @ memilanuk (at) gmail dot com

--
https://mail.python.org/mailman/listinfo/python-list
