Re: parsing text

Joe Youngquist Wed, 10 Dec 2003 07:10:13 -0800

Hello Lee,

My first attempt was to use a regular expression, but there are no
guarantees on the header format...
The real bugger is that sometimes the column headers will not have any
spaces between them. Though this is rare, it is something I'll need to keep
an eye on and change manually - I'm not that great of a programmer to tell my
script to "make a judgment call on that there column chief". :)
My hope right now is just to make something that works with my data 99% of
the time, and as close to 100% of the time as possible as long as the column
headers have a space between them.  Once I do, this would be the first time
I'd have the joy of contributing to the Perl community.


JY
----- Original Message -----
From: "Lee Goddard" <[EMAIL PROTECTED]>
To: "Joe Youngquist" <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>
Sent: Wednesday, December 10, 2003 5:30 AM
Subject: RE: parsing text



Nice idea: I'm surprised it's not been done before
(I didn't look on CPAN ...)

Just a thought, fwiw: if you are sure there will be
no spaces in your "leaders" - the bit between the
row name and the data (...) - and if you can be sure
that each column consists of data without white space
then you could surely use a regular expression to
get at the data?

Your $text string does have a row (number 6) with
a space in the leader: but maybe you can get around
that by requiring a column to have white space on
either side...?

Just a thought.
lee
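
A minimal sketch of the regular-expression idea above, for illustration
only. The helper name split_row is hypothetical, and it assumes what the
suggestion assumes: the row label ends in a leader of two or more dots,
and no data field contains white space.

```perl
use strict;
use warnings;

# Hypothetical helper: split one report row into its label and data fields.
# Assumes the label is followed by a leader of two or more dots and that
# no data field contains white space.
sub split_row {
    my ($row) = @_;
    # $1 is the label before the leader, $2 is everything after it.
    return unless $row =~ /^\s*(.+?)\.{2,}\s+(.*\S)\s*$/;
    my ($label, $rest) = ($1, $2);
    return ($label, split /\s+/, $rest);
}

my @fields = split_row('Team........  203  228  -   431');
print join('|', @fields), "\n";    # prints Team|203|228|-|431
```

A leader with a space in it (like row 6) would leave a stray ".." as the
first data field, so that case would still need separate handling.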

-----Original Message-----
From: Joe Youngquist [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 09, 2003 9:58 PM
To: [EMAIL PROTECTED]
Subject: parsing text


Hello list,

I've been trying to figure out a generalized method of parsing
space-formatted text to output into html tables.  The data is very likely
written out using Perl Reports and Pictures.  Has anyone come up with a
general method?

Here are a few examples of the text that I'd format to html tables:

NOTE: Best to use Courier New font to keep the formatting


                    |-----------------OVERALL STATISTICS------------------|
TOTALS               O-REB D-REB TOTAL   PF  FO    A   TO  A/TO Hi Pts
---------------------------------------------------------------------------
Lowe, Kenneth.......     0    15    15   15   0   14   11   1.3     26
Teague, David.......     6    16    22    9   0    9    4   2.2     19
Booker, Chris.......    13    21    34    8   0   10   10   1.0     20
Buckley, Melvin.....     5    17    22   11   0   10    8   1.2     20
McKnight, Brandon...     1    11    12   15   1   18   15   1.2     13
Buscher, Brett......     1     9    10   15   0    9    9   1.0     10
Kartelo, Ivan.......    22    19    41   14   0    2    7   0.3     12
Kiefer, Matt........     9    12    21   14   0    4    9   0.4      7
Parkinson, Austin...     3     5     8    4   0   20    7   2.9      8
Nwankwo, Ije........     2     2     4    2   0    2    2   1.0      2
Carroll, Matt.......     1     3     4    6   0    0    2   0.0      2
Ford, Andrew........     0     1     1    2   0    0    1   0.0      0
Garrity, Kevin......     0     1     1    0   0    0    0   0.0      0
Hartley, Chris......     1     0     1    0   0    0    1   0.0      0
Total...............    72   143   215  115   1   98   86   1.1     78
Opponents...........    72   130   202  131   -   62  103   0.6     68

   TEAM STATISTICS                         PUR          OPP
   --------------------------------------------------------
   SCORING.......................          431          352
     Points per game.............         71.8         58.7
     Scoring margin..............        +13.2            -
   FIELD GOALS-ATT...............      142-328      134-336
     Field goal pct..............         .433         .399
   3 POINT FG-ATT................       36-102        25-99
     3-point FG pct..............         .353         .253
     3-pt FG made per game.......          6.0          4.2
   FREE THROWS-ATT...............      111-147        59-99
     Free throw pct..............         .755         .596
   REBOUNDS......................          215          202
     Rebounds per game...........         35.8         33.7
     Rebounding margin...........         +2.2            -
   ASSISTS.......................           98           62
     Assists per game............         16.3         10.3
   TURNOVERS.....................           86          103
     Turnovers per game..........         14.3         17.2
     Turnover margin.............         +2.8            -
     Assist/turnover ratio.......          1.1          0.6
   STEALS........................           44           31
     Steals per game.............          7.3          5.2
   BLOCKS........................           23           23
     Blocks per game.............          3.8          3.8
   WINNING STREAK................            6            -
     Home win streak.............            3            -
   ATTENDANCE....................        33118        23435
     Home games-Avg/Game.........      3-11039          0-0
     Neutral site-Avg/Game.......            -       3-7812

   BY PERIOD     1st  2nd    Total
   ------------ ---- ----     ----
   Team........  203  228  -   431
   Opponents...  164  188  -   352


The goal I'm trying to reach is to build a method that, no matter the table
of data sent to it, will find where the columns are for the data.  It's easy
to "see" where the columns are, but my attempt to tell a program how to
"see" the columns has been embarrassing to say the least.

The road I was walking down was to take each line of a table and look for
spaces (skipping dashes and pipes).  When one is found, look "down" the rest
of the table in this current column with the space.  If all the way "down"
the table are spaces (or a dash or pipe), then there is likely a column
boundary at this column location.  Once the entire table of data has been
looked at, wherever there is a change from text to spaces and back to text,
there is the end of one "cell" of data and the start of a new "cell".  So
my logic is this, looking at the last example table of data:

BY PERIOD     1st  2nd    Total
------------ ---- ----     ----
Team          203  228  -   431
Opponents     164  188  -   352


Line one:
0-9:    text     ( at col 3 [the space between "by" and "period"]
                   would be counted as text because "down" the table
                   there are no other spaces)
10-14:  spaces
15-17:  text
18-19:  spaces
20-22:  text
23-26:  spaces
27-31:  text

Line two:
0-31:   spaces (by the logic that dashes are counted like a space)

Line three:
0-4:    text
5-14:   spaces
15-17:  text
18-19:  spaces
20-22:  text
23-28:  spaces
29-31:  text

Line four:
0-9:    text
10-14:  spaces
15-17:  text
18-19:  spaces
20-22:  text
23-28:  spaces
29-31:  text

From this I can tell the program for each line in the table:
from 0 to 9 grab the text,
from 15 to 17 grab the text,
from 20 to 22 grab the text,
from 27 to 31 grab the text,
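
The "look down the table" idea can be sketched roughly as follows. This is
only an illustration, not a tested general solution: find_columns and
extract_cells are hypothetical names, positions are 0-based, and a column
position counts as text as soon as any line has something other than a
space, dash or pipe there.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, end] pairs (0-based, inclusive) for
# each run of positions where at least one line has real text.
# Spaces, dashes and pipes are all treated as blank.
sub find_columns {
    my @lines = @_;
    my $width = 0;
    for my $line (@lines) {
        $width = length($line) if length($line) > $width;
    }

    # $has_text[$i] becomes true when some line has a non-blank char at $i.
    my @has_text = (0) x $width;
    for my $line (@lines) {
        my @chars = split //, $line;
        for my $i (0 .. $#chars) {
            $has_text[$i] = 1 if $chars[$i] !~ /[ \-|]/;
        }
    }

    # Walk left to right, opening a span on text and closing it on blank.
    my (@spans, $start);
    for my $i (0 .. $width) {
        my $text = $i < $width && $has_text[$i];
        if ($text && !defined $start) {
            $start = $i;
        }
        elsif (!$text && defined $start) {
            push @spans, [$start, $i - 1];
            undef $start;
        }
    }
    return @spans;
}

# Hypothetical helper: cut one line at the given spans and trim each cell.
sub extract_cells {
    my ($line, @spans) = @_;
    my @cells;
    for my $span (@spans) {
        my ($from, $to) = @$span;
        my $cell = $from < length($line)
                 ? substr($line, $from, $to - $from + 1)
                 : '';
        $cell =~ s/^\s+|\s+$//g;
        push @cells, $cell;
    }
    return @cells;
}

my @table = (
    'BY PERIOD     1st  2nd    Total',
    '------------ ---- ----     ----',
    'Team          203  228  -   431',
    'Opponents     164  188  -   352',
);

my @spans = find_columns(@table);
for my $line (@table) {
    next if $line =~ /^[\s\-|]*$/;    # skip rule lines made of dashes
    print join('|', extract_cells($line, @spans)), "\n";
}
```

Because a lone dash counts as blank, the "-" column in this table is
dropped, which matches the output I'm after here but would lose the "-"
placeholders in the other tables.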

I would end up with (after ignoring line two and stripping leading and
trailing space)
<table>
 <tr>
  <td>BY PERIOD</td>
  <td>1st</td>
  <td>2nd</td>
  <td>Total</td>
 </tr>
 <tr>
  <td>Team</td>
  <td>203</td>
  <td>228</td>
  <td>431</td>
 </tr>
 <tr>
  <td>Opponents</td>
  <td>164</td>
  <td>188</td>
  <td>352</td>
 </tr>
</table>
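
Once the cells are split out, writing the table itself is the easy part.
A rough sketch, assuming the rows arrive as array refs of already-trimmed
cell strings (rows_to_html and escape_html are hypothetical names, not
from any module):

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: turn a list of rows (array refs of cells) into an
# HTML table laid out like the hand-written example above.
sub rows_to_html {
    my @rows = @_;
    my $html = "<table>\n";
    for my $row (@rows) {
        $html .= " <tr>\n";
        $html .= "  <td>" . escape_html($_) . "</td>\n" for @$row;
        $html .= " </tr>\n";
    }
    return $html . "</table>\n";
}

print rows_to_html(
    ['BY PERIOD', '1st', '2nd', 'Total'],
    ['Team',      '203', '228', '431'],
    ['Opponents', '164', '188', '352'],
);
```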


I dunno, just tossing this out to the list in the hope of a fresh
perspective on the problem.  Below is some code where I'm trying to tell the
program how to spot spaces down the table.

Thanks in advance for your time in reading all this.

Joe Y.

-----------------------------------Code:-----------------------------------


my $text = "                    |-----------------OVERALL STATISTICS------------------|
TOTALS               O-REB D-REB TOTAL   PF  FO    A   TO  A/TO Hi Pts
---------------------------------------------------------------------------
....................     0    15    15   15   0   14   11   1.3     26
....................     6    16    22    9   0    9    4   2.2     19
....................    13    21    34    8   0   10   10   1.0     20
....................     5    17    22   11   0   10    8   1.2     20
....................     1    11    12   15   1   18   15   1.2     13
.................. ..    1     9    10   15   0    9    9   1.0     10
....................    22    19    41   14   0    2    7   0.3     12
....................     9    12    21   14   0    4    9   0.4      7
....................     3     5     8    4   0   20    7   2.9      8
....................     2     2     4    2   0    2    2   1.0      2
....................     1     3     4    6   0    0    2   0.0      2
....................     0     1     1    2   0    0    1   0.0      0
....................     0     1     1    0   0    0    0   0.0      0
....................     1     0     1    0   0    0    1   0.0      0
Total...............    72   143   215  115   1   98   86   1.1     78
Opponents...........    72   130   202  131   -   62  103   0.6     68
";

my @lines = split(/\n/,$text);


#
##  Scan across the line and, for each column, run down the rows checking
##  whether a space or - exists.
##  If there is a space, then it's likely that there is a pattern for
##  separating the data in the columns.
##
##  If the previous column has spaces or -'s and the current column has
##  numbers, letters, pipes or decimals,
##  then the current column is the beginning of a new cell.
#
my $lineCount = @lines;
print "\nNumber of Lines: $lineCount";



##
#   Build a matrix of characters for the data, where we can find row x col
#   values.
##
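my @Matrix;        # $Matrix[row][col] holds one character of the table
my $x = 0;
my $MaxCols = 0;
print "\nBuilding Matrix";
foreach my $line (@lines) {
    my @data = split(//, $line);
    my $y = 0;
    foreach my $char (@data) {
        $Matrix[$x][$y] = $char;
        $y++;
    }
    # Remember the width of the longest line seen so far.
    $MaxCols = $y if($y >= $MaxCols);
    $x++;
}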

my %Cells;
for(my $y = 0; $y < $MaxCols; $y++) {
    # Assume this column is a separator until some row proves otherwise.
    # (Resetting the flag on every row, as before, meant that only the
    # last row of the table decided the result.)
    my $space = 1;
    for(my $z = 0; $z < $lineCount; $z++) {
        my $char = $Matrix[$z][$y];
        # Short rows have no character here; treat that as blank.
        next if not defined $char;
        #print "\nTesting Col: $y";
        if($char ne ' ' and $char ne '-' and $char ne '|') {
            $space = 0;
            last;
        }
    }
    # "|" marks a separator column, " " marks a column that holds data.
    if($space) {
        $Cells{$y} = "|";
    } else {
        $Cells{$y} = " ";
    }
}

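print "\n";
# Hash keys come back in no particular order, so sort the column
# numbers before printing the map.
foreach my $key (sort { $a <=> $b } keys %Cells) {
    print "$Cells{$key}";
}
print "\n";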

exit(0);



_______________________________________________
Perl-Win32-Users mailing list
[EMAIL PROTECTED]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
