Hello Lee,

My first attempt was to use a regular expression, but there are no guarantees on the header format... The real bugger is that sometimes the column headers will not have any spaces between them. This is rare, but it is something I'll need to keep an eye on and fix manually. I'm not a good enough programmer to tell my script to "make a judgment call on that there column, chief". :)

My hope right now is just to make something that works with my data 99% of the time, and as close to 100% of the time as possible as long as the column headers have a space between them. Once I do, this would be the first time I'd have the joy of contributing to the Perl community.

JY

----- Original Message -----
From: "Lee Goddard" <[EMAIL PROTECTED]>
To: "Joe Youngquist" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Wednesday, December 10, 2003 5:30 AM
Subject: RE: parsing text

Nice idea: I'm surprised it's not been done before (I didn't look on CPAN ...)

Just a thought, fwiw: if you are sure there will be no spaces in your "leaders" - the bit between the row name and the data (...) - and if you can be sure that each column consists of data without white space, then you could surely use a regular expression to get at the data?

Your $text string does have a row (number 6) with a space in the leader, but maybe you can get around that by requiring a column to have white space on either side...?

Just a thought.
lee

-----Original Message-----
From: Joe Youngquist [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 09, 2003 9:58 PM
To: [EMAIL PROTECTED]
Subject: parsing text

Hello list,

I've been trying to figure out a generalized method of parsing space-formatted text for output into HTML tables. The data is very likely written out using Perl Reports and Pictures. Has anyone come up with a general method?

Here are a few examples of the text that I'd format to HTML tables:

NOTE: Best to use Courier New font to keep the formatting

                    |-----------------OVERALL STATISTICS------------------|
TOTALS               O-REB D-REB TOTAL   PF   FO    A   TO   A/TO    Hi Pts
---------------------------------------------------------------------------
Lowe, Kenneth.......     0    15    15   15    0   14   11    1.3        26
Teague, David.......     6    16    22    9    0    9    4    2.2        19
Booker, Chris.......    13    21    34    8    0   10   10    1.0        20
Buckley, Melvin.....     5    17    22   11    0   10    8    1.2        20
McKnight, Brandon...     1    11    12   15    1   18   15    1.2        13
Buscher, Brett......     1     9    10   15    0    9    9    1.0        10
Kartelo, Ivan.......    22    19    41   14    0    2    7    0.3        12
Kiefer, Matt........     9    12    21   14    0    4    9    0.4         7
Parkinson, Austin...     3     5     8    4    0   20    7    2.9         8
Nwankwo, Ije........     2     2     4    2    0    2    2    1.0         2
Carroll, Matt.......     1     3     4    6    0    0    2    0.0         2
Ford, Andrew........     0     1     1    2    0    0    1    0.0         0
Garrity, Kevin......     0     1     1    0    0    0    0    0.0         0
Hartley, Chris......     1     0     1    0    0    0    1    0.0         0
Total...............    72   143   215  115    1   98   86    1.1        78
Opponents...........    72   130   202  131    -   62  103    0.6        68

TEAM STATISTICS                      PUR       OPP
--------------------------------------------------------
SCORING.......................       431       352
  Points per game.............      71.8      58.7
  Scoring margin..............     +13.2         -
FIELD GOALS-ATT...............   142-328   134-336
  Field goal pct..............      .433      .399
3 POINT FG-ATT................    36-102     25-99
  3-point FG pct..............      .353      .253
  3-pt FG made per game.......       6.0       4.2
FREE THROWS-ATT...............   111-147     59-99
  Free throw pct..............      .755      .596
REBOUNDS......................       215       202
  Rebounds per game...........      35.8      33.7
  Rebounding margin...........      +2.2         -
ASSISTS.......................        98        62
  Assists per game............      16.3      10.3
TURNOVERS.....................        86       103
  Turnovers per game..........      14.3      17.2
  Turnover margin.............      +2.8         -
  Assist/turnover ratio.......       1.1       0.6
STEALS........................        44        31
  Steals per game.............       7.3       5.2
BLOCKS........................        23        23
  Blocks per game.............       3.8       3.8
WINNING STREAK................         6         -
  Home win streak.............         3         -
ATTENDANCE....................     33118     23435
  Home games-Avg/Game.........   3-11039       0-0
  Neutral site-Avg/Game.......         -    3-7812

BY PERIOD      1st  2nd    Total
------------  ---- ----     ----
Team........   203  228  -   431
Opponents...   164  188  -   352

The goal I'm trying to reach is to build a method that, no matter what table of data is sent to it, will find where the columns of data are. It's easy to "see" where the columns are, but my attempt to tell a program how to "see" the columns has been embarrassing, to say the least.

The road I was walking down was to take each line of a table and look for spaces (skipping dashes and pipes). When one is found, look "down" the rest of the table in the character column that holds the space.
If all the way "down" the table there are spaces (or a dash or pipe), then there is likely a column boundary at this column location. Once the entire table of data has been looked at, wherever there is a change from text to spaces and back to text, one "cell" of data ends and a new "cell" starts.

So my logic is this, looking at the last example table of data:

BY PERIOD      1st  2nd    Total
------------  ---- ----     ----
Team           203  228  -   431
Opponents      164  188  -   352

Line one:
   0-9:  text (col 3 [the space between "BY" and "PERIOD"] is counted as text because "down" the table there are no other spaces)
  10-14: spaces
  15-17: text
  18-19: spaces
  20-22: text
  23-26: spaces
  27-31: text
Line two:
   0-31: spaces (by the logic that dashes are counted like a space)
Line three:
   0-4:  text
   5-14: spaces
  15-17: text
  18-19: spaces
  20-22: text
  23-28: spaces
  29-31: text
Line four:
   0-9:  text
  10-14: spaces
  15-17: text
  18-19: spaces
  20-22: text
  23-28: spaces
  29-31: text

From this I can tell the program, for each line in the table:
  from 0 to 9 grab the text,
  from 15 to 17 grab the text,
  from 20 to 22 grab the text,
  from 27 to 31 grab the text.

I would end up with (after ignoring line two and stripping leading and trailing space):

<table>
  <tr> <td>BY PERIOD</td> <td>1st</td> <td>2nd</td> <td>Total</td> </tr>
  <tr> <td>Team</td> <td>203</td> <td>228</td> <td>431</td> </tr>
  <tr> <td>Opponents</td> <td>164</td> <td>188</td> <td>352</td> </tr>
</table>

I dunno, just tossing this out to the list in the hope of a fresh perspective on the problem. Below is some code in which I'm trying to tell the program how to spot spaces down the table.

Thanks in advance for your time in reading all this.

Joe Y.

-----------------------------------Code:-----------------------------------------------
use strict;
use warnings;

my $text = "
                    |-----------------OVERALL STATISTICS------------------|
TOTALS               O-REB D-REB TOTAL   PF   FO    A   TO   A/TO    Hi Pts
---------------------------------------------------------------------------
....................     0    15    15   15    0   14   11    1.3        26
....................     6    16    22    9    0    9    4    2.2        19
....................    13    21    34    8    0   10   10    1.0        20
....................     5    17    22   11    0   10    8    1.2        20
....................     1    11    12   15    1   18   15    1.2        13
.................. ..    1     9    10   15    0    9    9    1.0        10
....................    22    19    41   14    0    2    7    0.3        12
....................     9    12    21   14    0    4    9    0.4         7
....................     3     5     8    4    0   20    7    2.9         8
....................     2     2     4    2    0    2    2    1.0         2
....................     1     3     4    6    0    0    2    0.0         2
....................     0     1     1    2    0    0    1    0.0         0
....................     0     1     1    0    0    0    0    0.0         0
....................     1     0     1    0    0    0    1    0.0         0
Total...............    72   143   215  115    1   98   86    1.1        78
Opponents...........    72   130   202  131    -   62  103    0.6        68
";

my @lines = split(/\n/, $text);

#
## Scan across the table and, for each character column, run down the rows
## checking whether every row has a space, - or | there. If so, it's likely
## that this column separates the data in the cells.
##
## If the previous column is all spaces or -'s and the current column has
## numbers, letters or decimals, then the current column is the beginning
## of a new cell.
##
## Note: a title line such as the OVERALL STATISTICS banner contains letters,
## so it can hide the separators underneath it.
#
my $lineCount = @lines;
print "\nNumber of Lines: $lineCount";

##
# Build a matrix of characters for the data, where we can find row x col values.
##
my @Matrix;
my $MaxCols = 0;
print "\nBuilding Matrix";
foreach my $line (@lines) {
    my @data = split(//, $line);
    push @Matrix, \@data;
    $MaxCols = @data if @data > $MaxCols;
}

##
# A column is a separator only if EVERY row is blank there. (The first
# version reset $space on each row, so only the last row was ever tested.)
##
my %Cells;
for (my $y = 0; $y < $MaxCols; $y++) {
    my $space = 1;
    for (my $z = 0; $z < $lineCount; $z++) {
        my $char = $Matrix[$z][$y];
        next if not defined $char;    # this row is shorter than column $y
        if ($char ne ' ' and $char ne '-' and $char ne '|') {
            $space = 0;
            last;
        }
    }
    $Cells{$y} = $space ? "|" : " ";
}

print "\n";
# Hash keys come back in no particular order, so sort them numerically.
foreach my $key (sort { $a <=> $b } keys %Cells) {
    print "$Cells{$key}";
}
print "\n";
exit(0);

---
Incoming mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.529 / Virus Database: 324 - Release Date: 16/10/2003

_______________________________________________
Perl-Win32-Users mailing list
[EMAIL PROTECTED]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
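
The column-scanning idea from the original post can be sketched end to end as shown here. This is only a minimal sketch under stated assumptions, not the poster's code: the helper names find_columns, split_row and to_html are hypothetical, the spacing of the sample rows is assumed from the column positions listed in the post (the real report may be spaced differently), and cell text is not HTML-escaped.

```perl
use strict;
use warnings;

# Characters treated as blank, following the rule in the post:
# space, dash and pipe.
my $BLANK = qr/[ \-|]/;

# Return [start, end] pairs, one per run of character columns in which
# at least one row holds a non-blank character. (Hypothetical helper.)
sub find_columns {
    my @rows  = @_;
    my $width = 0;
    for my $row (@rows) {
        $width = length($row) if length($row) > $width;
    }

    my (@spans, $start);
    for my $col (0 .. $width) {    # one step past the end closes the last span
        my $blank = 1;
        if ($col < $width) {
            for my $row (@rows) {
                next if $col >= length($row);    # short rows count as blank
                if (substr($row, $col, 1) !~ $BLANK) {
                    $blank = 0;
                    last;
                }
            }
        }
        if (!$blank) {
            $start = $col unless defined $start;
        }
        elsif (defined $start) {
            push @spans, [$start, $col - 1];
            undef $start;
        }
    }
    return @spans;
}

# Cut one row at the given spans, trimming spaces and trailing leader dots.
sub split_row {
    my ($row, @spans) = @_;
    my @cells;
    for my $span (@spans) {
        my ($from, $to) = @$span;
        my $cell = $from < length($row)
                 ? substr($row, $from, $to - $from + 1)
                 : '';
        $cell =~ s/^\s+|\s+$//g;
        $cell =~ s/\.+$//;
        push @cells, $cell;
    }
    return @cells;
}

# Build an HTML table; rows made only of blanks (ruler lines) are skipped.
sub to_html {
    my @rows  = grep { /[^ \-|]/ } @_;
    my @spans = find_columns(@rows);
    my $html  = "<table>\n";
    for my $row (@rows) {
        my @cells = split_row($row, @spans);
        $html .= " <tr>" . join('', map { "<td>$_</td>" } @cells) . "</tr>\n";
    }
    return $html . "</table>\n";
}

# Sample rows; the exact spacing is an assumption.
my @sample = (
    'BY PERIOD      1st  2nd    Total',
    '------------  ---- ----     ----',
    'Team........   203  228  -   431',
    'Opponents...   164  188  -   352',
);

print to_html(@sample);
```

Run on the BY PERIOD sample, this gives the same four cells per row as the hand-built table in the post. It inherits the limitation already raised in the thread: headers with no space between them merge into a single column, and a title line containing letters would need to be dropped before scanning.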