parsing text

Joe Youngquist Tue, 09 Dec 2003 15:19:19 -0800

Hello list,

I've been trying to figure out a generalized method of parsing space-formatted text for output into HTML tables. The data was very likely written out using Perl Reports and Pictures. Has anyone come up with a general method?

Here are a few examples of the text that I'd like to format into HTML tables:

NOTE: Best to use Courier New font to keep the formatting

                    |-----------------OVERALL STATISTICS------------------|
TOTALS               O-REB D-REB TOTAL   PF  FO    A   TO  A/TO Hi Pts
---------------------------------------------------------------------------
Lowe, Kenneth.......     0    15    15   15   0   14   11   1.3     26
Teague, David.......     6    16    22    9   0    9    4   2.2     19
Booker, Chris.......    13    21    34    8   0   10   10   1.0     20
Buckley, Melvin.....     5    17    22   11   0   10    8   1.2     20
McKnight, Brandon...     1    11    12   15   1   18   15   1.2     13
Buscher, Brett......     1     9    10   15   0    9    9   1.0     10
Kartelo, Ivan.......    22    19    41   14   0    2    7   0.3     12
Kiefer, Matt........     9    12    21   14   0    4    9   0.4      7
Parkinson, Austin...     3     5     8    4   0   20    7   2.9      8
Nwankwo, Ije........     2     2     4    2   0    2    2   1.0      2
Carroll, Matt.......     1     3     4    6   0    0    2   0.0      2
Ford, Andrew........     0     1     1    2   0    0    1   0.0      0
Garrity, Kevin......     0     1     1    0   0    0    0   0.0      0
Hartley, Chris......     1     0     1    0   0    0    1   0.0      0
Total...............    72   143   215  115   1   98   86   1.1     78
Opponents...........    72   130   202  131   -   62  103   0.6     68

   TEAM STATISTICS                         PUR          OPP
   --------------------------------------------------------
   SCORING.......................          431          352
     Points per game.............         71.8         58.7
     Scoring margin..............        +13.2            -
   FIELD GOALS-ATT...............      142-328      134-336
     Field goal pct..............         .433         .399
   3 POINT FG-ATT................       36-102        25-99
     3-point FG pct..............         .353         .253
     3-pt FG made per game.......          6.0          4.2
   FREE THROWS-ATT...............      111-147        59-99
     Free throw pct..............         .755         .596
   REBOUNDS......................          215          202
     Rebounds per game...........         35.8         33.7
     Rebounding margin...........         +2.2            -
   ASSISTS.......................           98           62
     Assists per game............         16.3         10.3
   TURNOVERS.....................           86          103
     Turnovers per game..........         14.3         17.2
     Turnover margin.............         +2.8            -
     Assist/turnover ratio.......          1.1          0.6
   STEALS........................           44           31
     Steals per game.............          7.3          5.2
   BLOCKS........................           23           23
     Blocks per game.............          3.8          3.8
   WINNING STREAK................            6            -
     Home win streak.............            3            -
   ATTENDANCE....................        33118        23435
     Home games-Avg/Game.........      3-11039          0-0
     Neutral site-Avg/Game.......            -       3-7812

   BY PERIOD      1st  2nd    Total
   ------------  ---- ----     ----
   Team........   203  228  -   431
   Opponents...   164  188  -   352

The goal I'm trying to reach is to build a method that, no matter what table of data is sent to it, will find where the columns are. It's easy to "see" where the columns are, but my attempts to tell a program how to "see" them have been embarrassing, to say the least.

The road I was walking down was to take each line of a table and look for spaces (treating dashes and pipes as spaces). When one is found, look "down" the rest of the table in that same column. If every row "down" the table has a space (or a dash or pipe) there, then there is likely a column boundary at that location. Once the entire table has been looked at, each change from text to spaces and back to text marks the end of one "cell" of data and the start of a new one. So, looking at the last example table of data, my logic is this:

BY PERIOD      1st  2nd    Total
------------  ---- ----     ----
Team           203  228  -   431
Opponents      164  188  -   352

Line one:
0-9: text (the space at col 3, between "BY" and "PERIOD", would be counted as text because "down" the table there are no other spaces)
10-14: spaces
15-17: text
18-19: spaces
20-22: text
23-26: spaces
27-31: text

Line two:
0-31: spaces (by the logic that dashes are counted like a space)

Line three:
0-4: text
5-14: spaces
15-17: text
18-19: spaces
20-22: text
23-28: spaces
29-31: text

Line four:
0-9: text
10-14: spaces
15-17: text
18-19: spaces
20-22: text
23-28: spaces
29-31: text
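
The scan walked through above could be sketched roughly as follows. This is only a minimal illustration, assuming the table arrives as a list of lines and that spaces, dashes and pipes all count as "space"; the helper name text_ranges is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return inclusive [start, end] pairs for runs of
# columns that hold text in at least one row of the table.
sub text_ranges {
    my @lines = @_;
    my $width = 0;
    for my $line (@lines) {
        $width = length($line) if length($line) > $width;
    }

    # A column is a separator only if every row has a space, dash or pipe
    # there. Rows that end before the column are treated as spaces.
    my @is_sep = (1) x $width;
    for my $line (@lines) {
        my @chars = split //, $line;
        for my $col (0 .. $#chars) {
            $is_sep[$col] = 0 unless $chars[$col] =~ /[ \-|]/;
        }
    }

    # Collapse neighbouring text columns into ranges.
    my (@ranges, $start);
    for my $col (0 .. $width) {
        my $text = $col < $width && !$is_sep[$col];
        if ($text && !defined $start) {
            $start = $col;
        } elsif (!$text && defined $start) {
            push @ranges, [$start, $col - 1];
            undef $start;
        }
    }
    return @ranges;
}

my @table = (
    'BY PERIOD      1st  2nd    Total',
    '------------  ---- ----     ----',
    'Team           203  228  -   431',
    'Opponents      164  188  -   352',
);

# prints 0-8, 15-17, 20-22, 27-31
print join(', ', map { "$_->[0]-$_->[1]" } text_ranges(@table)), "\n";
```

The first range comes out as 0-8 rather than 0-9 because "BY PERIOD" is nine characters wide. The lone dashes in the data rows vanish because dashes are counted as spaces.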

From this I can tell the program for each line in the table:
from 0 to 9 grab the text,
from 15 to 17 grab the text,
from 20 to 22 grab the text,
from 27 to 31 grab the text.
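
Grabbing the text for those ranges might look something like the sketch below. It assumes the ranges are inclusive character offsets as listed above; the helper name cut_cells is hypothetical.

```perl
use strict;
use warnings;

# Ranges from the list above, as inclusive [start, end] offsets.
my @ranges = ([0, 9], [15, 17], [20, 22], [27, 31]);

# Hypothetical helper: cut one line into cells and trim each cell.
sub cut_cells {
    my ($line, @ranges) = @_;
    my @cells;
    for my $r (@ranges) {
        my ($start, $end) = @$r;
        # Lines shorter than the range simply yield an empty cell.
        my $cell = $start < length($line)
            ? substr($line, $start, $end - $start + 1)
            : '';
        $cell =~ s/^\s+|\s+$//g;    # strip leading and trailing space
        push @cells, $cell;
    }
    return @cells;
}

my $line = 'Team           203  228  -   431';
print join('|', cut_cells($line, @ranges)), "\n";    # prints Team|203|228|431
```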

I would end up with this (after ignoring line two and stripping leading and trailing spaces):
<table>
<tr>
<td>BY PERIOD</td>
<td>1st</td>
<td>2nd</td>
<td>Total</td>
</tr>
<tr>
<td>Team</td>
<td>203</td>
<td>228</td>
<td>431</td>
</tr>
<tr>
<td>Opponents</td>
<td>164</td>
<td>188</td>
<td>352</td>
</tr>
</table>
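
Writing out the HTML from the extracted cells could be sketched like this. It is an illustration only: the helper names escape_html and html_table are made up here, and rows consisting of nothing but dashes are assumed to be rule lines that should be skipped.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Hypothetical helper: takes a list of array refs, one per row of cells.
sub html_table {
    my @rows = @_;
    my $html = "<table>\n";
    for my $row (@rows) {
        # Skip rule lines: rows whose cells hold only dashes or whitespace.
        next unless grep { /[^\s\-]/ } @$row;
        $html .= "<tr>\n";
        $html .= "<td>" . escape_html($_) . "</td>\n" for @$row;
        $html .= "</tr>\n";
    }
    return $html . "</table>\n";
}

print html_table(
    ['BY PERIOD',    '1st',  '2nd',  'Total'],
    ['------------', '----', '----', '----'],
    ['Team',         '203',  '228',  '431'],
    ['Opponents',    '164',  '188',  '352'],
);
```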

I dunno, I'm just tossing this out to the list in the hope of a fresh perspective on the problem. Below is some code where I'm trying to tell the program how to spot spaces down the table.

Thanks in advance for your time in reading all this.

Joe Y.

-----------------------------------Code:-----------------------------------------------

my $text = "                    |-----------------OVERALL STATISTICS------------------|
TOTALS               O-REB D-REB TOTAL   PF  FO    A   TO  A/TO Hi Pts
---------------------------------------------------------------------------
....................     0    15    15   15   0   14   11   1.3     26
....................     6    16    22    9   0    9    4   2.2     19
....................    13    21    34    8   0   10   10   1.0     20
....................     5    17    22   11   0   10    8   1.2     20
....................     1    11    12   15   1   18   15   1.2     13
....................     1     9    10   15   0    9    9   1.0     10
....................    22    19    41   14   0    2    7   0.3     12
....................     9    12    21   14   0    4    9   0.4      7
....................     3     5     8    4   0   20    7   2.9      8
....................     2     2     4    2   0    2    2   1.0      2
....................     1     3     4    6   0    0    2   0.0      2
....................     0     1     1    2   0    0    1   0.0      0
....................     0     1     1    0   0    0    0   0.0      0
....................     1     0     1    0   0    0    1   0.0      0
Total...............    72   143   215  115   1   98   86   1.1     78
Opponents...........    72   130   202  131   -   62  103   0.6     68
";

my @lines = split(/\n/, $text);

#
## Scan across the table and, for each column, run down the rows checking whether
## every row has a space, '-' or '|' there. If so, it's likely that the column
## is part of the pattern separating the data in the columns.
##
## If the previous column has only spaces, -'s or pipes and the current column has
## numbers, letters or decimals, then the current column is the beginning of a new cell.
#
my $lineCount = @lines;
print "\nNumber of Lines: $lineCount";

##
#   Build a matrix of characters for the data, where we can find row x col values.
##
my @Matrix;
my $x = 0;
my $MaxCols = 0;
print "\nBuilding Matrix";
foreach my $line (@lines) {
    my @data = split(//, $line);
    my $y = 0;
    foreach my $char (@data) {
        $Matrix[$x][$y] = $char;
        $y++;
    }
    $MaxCols = $y if($y >= $MaxCols);
    $x++;
}

##
#   For each column, look "down" every row. The column is a separator only if
#   no row holds anything other than a space, '-' or '|'.
##
my %Cells;
for(my $y = 0; $y < $MaxCols; $y++) {
    my $space = 1;
    for(my $z = 0; $z < $lineCount; $z++) {
        # A row that ends before this column counts as a space.
        my $char = defined $Matrix[$z][$y] ? $Matrix[$z][$y] : ' ';
        if($char ne ' ' and $char ne '-' and $char ne '|') {
            # One non-space character is enough to rule the column out.
            $space = 0;
            last;
        }
    }
    $Cells{$y} = $space ? "|" : " ";
}

# Print the ruler in column order; hash keys are returned in no particular order.
print "\n";
foreach my $key (sort { $a <=> $b } keys %Cells) {
    print "$Cells{$key}";
}
print "\n";

exit(0);
