Re: A long time helper needs help

Jenda Krynicky Wed, 31 Mar 2004 15:08:21 -0800

From: <[EMAIL PROTECTED]>
> Thanks for your response.  Now I can concentrate on how to hack the
> code. What is your take on how to represent the table entries
> (cells)?  What is the most efficient way to associate each cell with
> its parent header?


It's a bit hard to give good suggestions if I do not understand the 
data.

Are all the data yearly like the stuff in page6.prn?

If so it would IMHO be best to have one table with the headers and 
another with the data.

So you'd have something like
        HEADERS
        ID | Name
        1  | Domestic nonfinancial sectors: Total
        2  | Domestic nonfinancial sectors: Federal government
        3  | Domestic nonfinancial sectors: Nonfederal: Total nonfederal
        ...

and
        DATA
        HeaderID | Year | Value
        1        | 1969 |  7.2
        2        | 1969 | -1.1
        ...
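
Just to show how each cell finds its parent header, here is a rough in-memory sketch of the same two tables in Perl. The names %headers, @data and describe_cell() are made up for this example, and the values are the ones from the tables above; in the real thing these would be database tables.

```perl
use strict;
use warnings;

# HEADERS table: header ID => full header name
my %headers = (
    1 => 'Domestic nonfinancial sectors: Total',
    2 => 'Domestic nonfinancial sectors: Federal government',
);

# DATA table: one row per cell, pointing at its header by ID
my @data = (
    { header_id => 1, year => 1969, value => '7.2'  },
    { header_id => 2, year => 1969, value => '-1.1' },
);

# Resolve the parent header of a cell through its header_id
# (describe_cell is a hypothetical helper, not part of the attached code)
sub describe_cell {
    my ($row) = @_;
    return sprintf '%s (%d): %s',
        $headers{ $row->{header_id} }, $row->{year}, $row->{value};
}

print describe_cell($_), "\n" for @data;
```

Each cell carries only the small header ID, so the long header text is stored once.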


Parsing the headers from the document will be tricky. Since the 
columns are not all the same width (and I believe the Nth column in 
one report will be in a different place than in another), you'll have 
to start by looking at the last line and finding the places to split 
the lowest-level headers.
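
The idea of taking the column boundaries from the last line can be sketched on its own like this. The two sample lines are invented (they are not from page6.prn) and unpack_format_for() is a name I made up for the example, so treat it as an illustration of the technique only.

```perl
use strict;
use warnings;

# Build an unpack() template from the numeric fields of a data line.
# Each captured field includes the spaces in front of the number, so the
# field lengths add up to the column boundaries.
sub unpack_format_for {
    my ($line) = @_;
    my @fields = ($line =~ /(.*?-?\d+(?:\.\d+)?)/g);
    return join '', map { 'A' . length($_) } @fields;
}

# Hypothetical fixed-width lines, built with sprintf so the widths are explicit
my $header_line = sprintf '%4s%7s%7s%8s', qw(Year Total Fed. Nonfed);
my $last_line   = sprintf '%4s%7s%7s%8s', qw(1969 7.2 -1.1 8.3);

my $format = unpack_format_for($last_line);

# Cut the header line at the same positions; 'A' keeps leading spaces,
# so strip them afterwards
my @names = unpack $format, $header_line;
s/^\s+// for @names;

print "$format\n", join('|', @names), "\n";
```

With these sample lines the template comes out as A4A7A7A8 and the header line is cut into Year, Total, Fed. and Nonfed.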

Actually ... I felt like doing some Perl again today, so you can find 
the code attached. It does the tricky job of extracting the complete 
headers; extracting the values and inserting them into the database 
should be simple.
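
For the values, something along these lines might do. It is only a sketch: I am assuming that the first field of each data line is the year and that the header IDs are simply the 1-based column positions, which may not match the real report. rows_from_line() is a made-up name and the sample line is invented.

```perl
use strict;
use warnings;

# Turn one fixed-width data line into DATA rows [header ID, year, value].
# Assumption: the first field is the year, the remaining fields are values.
sub rows_from_line {
    my ($format, $line) = @_;
    my ($year, @values) = unpack $format, $line;
    my @rows;
    for my $i (0 .. $#values) {
        (my $value = $values[$i]) =~ s/^\s+//;   # 'A' keeps leading spaces
        push @rows, [ $i + 1, $year, $value ];   # hypothetical 1-based header ID
    }
    return @rows;
}

# Hypothetical data line with field widths 4, 7, 7 and 8
my $line = sprintf '%4s%7s%7s%8s', qw(1969 7.2 -1.1 8.3);

print join(' | ', @$_), "\n" for rows_from_line('A4A7A7A8', $line);
```

Each returned row maps directly onto one record of the DATA table, so the insert itself is a loop over the rows with whatever database module you use.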

Jenda
===== [EMAIL PROTECTED] === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed 
to get drunk and croon as much as they like.
        -- Terry Pratchett in Sourcery

use strict;

open IN, '< c:\temp\page6.prn' or die "Cannot open c:\\temp\\page6.prn: $!";
my $report_text = do {local $/; <IN>};
close IN;

$report_text =~ s/\n\n+/\n/g; # remove empty lines
$report_text =~ s/^.*^--------+\n//sm; # remove the header

#print "=====================================\n$report_text\n=========================================\n";

my @report = split /\n/, $report_text; # split to lines

#print "lines: " . scalar(@report) . "\n";

# split the last line into fields (including spaces!)
my @last_line = ($report[-1] =~ /(.*?-?\d+(?:\.\d+)?)/g);

#print join("\n", @last_line),"\n";

my @lengths = map {length($_)} @last_line;
my @end_pos = do {
        my $sum = 0; # running total of the lengths; scoped to this do block
        map {$sum += $_} @lengths
};

#print join(", ", @lengths),"\n";
#print join(", ", @end_pos),"\n";

my (@section_lines, @header_lines);
# move the rows starting with spaces and -- to the @section_lines array
while ($report[0] =~ /^\s+--/) {
        push @section_lines, shift(@report);
}
print "----------------------------\n", join("\n", @section_lines), 
"\n----------------------------\n";

# move the rows starting with spaces followed by text to the @header_lines array
while ($report[0] =~ /^\s+\w/) {
        push @header_lines, shift(@report);
}
#print "----------------------------\n", join("\n", @header_lines), "\n----------------------------\n";

shift(@report); # remove the ________________________

my $unpack_format = 'A' . join( 'A', @lengths);

# split the first line of column headers
my @headers = unpack( $unpack_format, shift(@header_lines));

foreach my $header_line (@header_lines) {
        my @next = unpack( $unpack_format, $header_line);
        for(my $i=0; $i <= $#headers; $i++) {
                $headers[$i] .= $next[$i];
        }
}

foreach (@headers) {
        s/^\s+//;
        s/\s+$//;
        s/\s+/ /g;
}

#print "----------------------------\n", join("\n", @headers), "\n----------------------------\n";

foreach my $section_line (reverse(@section_lines)) {
print "\$section_line=$section_line\n";
        for(my $i=0; $i <= $#headers; $i++) {
                my ($begin, $end) = ( substr($section_line, 0, $end_pos[$i]), 
substr($section_line, $end_pos[$i]) );

                if  ($begin =~ /-([^-]*\w\s*)$/ and (my $tmp = $1) and $end =~ 
/^(\s*\w[^-]*)-/) {
                        $headers[$i] = $tmp . $1 . ': ' . $headers[$i];
                } elsif ($begin =~ /-(\w[^-]*)-+$/) {
                        $headers[$i] = $1 . ': ' . $headers[$i];
                } elsif ($end =~ /^-*(\w[^-]*)-/) {
                        $headers[$i] = $1 . ': ' . $headers[$i];
                }
        }
}

print "----------------------------\n", join("\n", @headers), 
"\n----------------------------\n";
