Re: [Tutor] parsing a "chunked" text file

Karim Liateni Wed, 03 Mar 2010 17:28:46 -0800


Hello Steven,

Is there a big difference if I write your first function as below? I ask because I am not familiar with the yield keyword.


def skip_blanks(lines):
    """Remove leading and trailing whitespace, ignore blank lines."""
    return [line.strip() for line in lines if line.strip()]
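
A minimal sketch of the difference between the two styles. The names skip_blanks_list and skip_blanks_gen and the sample lines are made up here just to tell the versions apart. The list version does all its work up front and returns every result at once, while the generator version hands back one line at a time as the caller asks for it.

```python
def skip_blanks_list(lines):
    """Eager version: builds the whole list before returning."""
    return [line.strip() for line in lines if line.strip()]

def skip_blanks_gen(lines):
    """Lazy version: yields one stripped line at a time."""
    for line in lines:
        line = line.strip()
        if line:
            yield line

# Hypothetical sample input, standing in for lines read from a file.
sample = ["  headerA 2\n", "\n", "line 1\n", "   \n", "line 2\n"]

as_list = skip_blanks_list(sample)
as_gen = skip_blanks_gen(sample)

# Both produce the same lines in the same order.
assert as_list == list(skip_blanks_gen(sample))

# The list can be indexed and reused; the generator can only be
# walked through once, front to back.
first = next(as_gen)       # the first non-blank line
rest = list(as_gen)        # everything after it
leftover = list(as_gen)    # empty: the generator is exhausted
```

For a small file the two are interchangeable. For a large file the list version holds every stripped line in memory at once, whereas the generator only ever holds the line it is currently working on.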

I also tried to write the second function that way, but it is not as straightforward.

I am beginning to understand the use of yield in it.

Regards
Karim

Steven D'Aprano wrote:

On Tue, 2 Mar 2010 05:22:43 pm Andrew Fithian wrote:

Hi tutor,

I have a large text file that has chunks of data like this:

headerA n1
line 1
line 2
...
line n1
headerB n2
line 1
line 2
...
line n2

Where each chunk is a header and the lines that follow it (up to the
next header). A header has the number of lines in the chunk as its
second field.

And what happens if the header is wrong? How do you handle situations like missing headers and empty sections, header lines which are wrong, and duplicate headers?


line 1
line 2
headerB 0
headerC 1
line 1
headerD 2
line 1
line 2
line 3
line 4
headerE 23
line 1
line 2
headerB 1
line 1

This is a policy decision: do you try to recover, raise an exception, raise a warning, pad missing lines as blank, throw away excess lines, or what?
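
As one illustration of such a policy, here is a minimal sketch of the strictest choice: raise an exception as soon as a header disagrees with its section. The function name check_section is hypothetical, and the sketch assumes a header is exactly a name followed by an integer count, as in the sample data above.

```python
def check_section(header, lines):
    """Return (name, lines) if the header's declared count is correct.

    Assumes a header looks like 'headerA 3': a name, whitespace, then
    the number of lines in the chunk. Raises ValueError otherwise.
    """
    try:
        # Unpacking fails with ValueError if there are not exactly
        # two fields; int() fails with ValueError on a bad count.
        name, count = header.split()
        declared = int(count)
    except ValueError:
        raise ValueError("malformed header: %r" % header)
    if declared != len(lines):
        raise ValueError(
            "%s declares %d lines but has %d" % (name, declared, len(lines))
        )
    return name, lines
```

With the sample above, check_section("headerC 1", ["line 1"]) passes, while the headerD section (declared 2, actual 4) raises ValueError. A more forgiving policy could log a warning instead of raising. Duplicate headers need a separate check, for example testing whether the name is already a key before storing it.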

I would like to turn this file into a dictionary like:
dict = {'headerA':[line 1, line 2, ... , line n1], 'headerB':[line1,
line 2, ... , line n2]}

Is there a way to do this with a dictionary comprehension or do I
have to iterate over the file with a "while 1" loop?

I wouldn't do either. I would treat this as a pipe-line problem: you have a series of lines that need to be processed. You can feed them through a pipe-line of filters:


def skip_blanks(lines):
    """Remove leading and trailing whitespace, ignore blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield line

def collate_sections(lines):
    """Yield (header, lines) pairs, one for each section."""
    current_header = ""
    accumulator = []
    for line in lines:
        if line.startswith("header"):
            yield (current_header, accumulator)
            current_header = line
            accumulator = []
        else:
            accumulator.append(line)
    yield (current_header, accumulator)


Then put them together like this:


data = {}  # don't shadow the built-in dict
with open("my_file.dat", "r") as fp:  # the file is closed on exit
    non_blank_lines = skip_blanks(fp)
    sections = collate_sections(non_blank_lines)
    for (header, lines) in sections:
        data[header] = lines


Of course you can add your own error checking.
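
To see the pipeline run end to end without a file on disk, here is a small sketch that feeds it from an in-memory string. It repeats the two generators so that it runs on its own. The sample text, the use of io.StringIO as a stand-in for the file, and the two adjustments inside the loop are assumptions made for illustration.

```python
import io

def skip_blanks(lines):
    """Remove leading and trailing whitespace, ignore blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield line

def collate_sections(lines):
    """Yield (header, lines) pairs, one for each section."""
    current_header = ""
    accumulator = []
    for line in lines:
        if line.startswith("header"):
            yield (current_header, accumulator)
            current_header = line
            accumulator = []
        else:
            accumulator.append(line)
    yield (current_header, accumulator)

# Hypothetical sample data; io.StringIO iterates line by line,
# just like an open text file.
fp = io.StringIO("headerA 2\nline 1\nline 2\n\nheaderB 1\nline 1\n")

data = {}
for header, lines in collate_sections(skip_blanks(fp)):
    # When the input starts with a header, the first pair yielded
    # is ("", []), so skip it rather than storing an empty key.
    if not header and not lines:
        continue
    # Key on the header name only ("headerA"), not the full header
    # line ("headerA 2"); headerless leading lines get the key "".
    name = header.split()[0] if header else ""
    data[name] = lines
```

After the loop, data maps "headerA" to its two lines and "headerB" to its one line. Without the two adjustments in the loop, the keys would be the full header lines and there would be an extra "" key with an empty list.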

