At 08:32 AM 7/19/2007, you wrote:
>Mitchell A. Petersen wrote:
> > I am new to Perl, so this may be a dumb question.
> >
> > I have written a perl program that reads firms' 10Ks (their financial
> > disclosure) looking for their total assets. Some of the files are in html,
> > so I use HTML::TreeBuilder and HTML::TableContentParser. Some of the files
> > are in text, so I use a regular expression to find the row that says "total
> > assets" and then scan across to find the number. I have copied the files
> > to my hard disk to speed up the process. The program searches through each
> > file sequentially, then writes out a line of output for each file. The
> > program crashes when the outfile reaches 32,768 bytes. I have changed the
> > files I feed the program in case this is a problem, and it still
> > crashes at 32,768 bytes.
> >
> > I am running the perl program through a dos window (cmd.exe window) under
> > Windows Vista. If there is a smarter way to do this, I'd love to hear. The
> > output file does not contain any data until the program crashes. I included
> > the command $|++, which I thought would cause the print buffer to flush to
> > the output file -- but this doesn't seem to be working.
> >
> > If there is other info that I should add, please let me know. Thanks.
>
>First, if you aren't putting a newline out every line - add that.
>Make sure you're closing each input file after scanning it.

I was already writing out a newline every time, but I wasn't closing the
input file after each use. I added that -- but it didn't solve the problem.


>If that doesn't help, create a complete program snippet that fails as
>you describe (you may not need to read the files if you can reproduce
>it without parsing the files - just write the output as you currently
>are using some static data and see if that fails for you).  Assuming
>you can reproduce the error, post that snippet.

Processing the text files is not the problem -- when I read only those, the
program doesn't crash. It is the HTML files that are causing the problem.
The program snippet that processes the HTML files is:


use warnings;
use strict;
use LWP::Simple;
use HTML::TreeBuilder;
use HTML::TableContentParser;

my ($asset_s,$asset_s2,@col_asset,@column,$column,@rows,$row,$total,$yes);
         @col_asset = undef;
         @column = undef;
         @rows = undef;
         $asset_s = 0;
         $asset_s2 = 0;
         $total = 0;
         open (WRITE1, ">\\res\\edgar\\match\\gcu_unchecked3_junk.csv");
         my $old_fh = select(WRITE1);
                 $| = 1;
                 select($old_fh);
         unless (open (READ2, "d:\\res\\edgar\\10k\\2178_0000002178-06-000013.txt")) {
                 next;
                 }
         my $doc = join '', <READ2>;

while ($total <= 3000) {
         my $root = HTML::TreeBuilder->new;
         $root->parse($doc);
         $root->eof(); 

         my @tables = undef;
         @tables = $root->find_by_tag_name('TABLE');
         foreach my $table (@tables) {
                 if (($table->as_text_trimmed =~ /total asset/is) && ($table->as_text_trimmed =~ /(\d|,){4,12}/is)) {
                         @rows = $table->find_by_tag_name('tr');
                         foreach $row (@rows) {
                                 if ($row->as_text_trimmed =~ /^total asset/i) {
                                         @column = $row->find_by_tag_name('td');
                                         foreach $column (@column) {
                                                 if ($column->as_text_trimmed =~ m/((\d|,|\.){4,12})/) {
                                                         $yes = $column->as_text_trimmed;
                                                         push (@col_asset, "$yes");
                                                         }
                                                 }
                                         $asset_s = $col_asset[1];
                                         $asset_s2 = $col_asset[-1];
                                         last;
                                         }
                                 }
                         $asset_s =~ s/(,|$| |=)//g;
                         $asset_s2 =~ s/(,|$| |=)//g;
                         last;
                         }
                 }
         print WRITE1 "$asset_s,$asset_s2\n";
         $total++
         }

         close(READ2);
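
One aside on the flushing: the select / $| = 1 / select lines above are meant
to turn on autoflush for WRITE1 only. As far as I understand it, the same
thing can be written with IO::Handle's autoflush method on a lexical
filehandle. This is only a sketch and not what the program above runs: the
file name junk_autoflush.csv and the two numbers are made up for illustration.

```perl
use strict;
use warnings;
use IO::Handle;

# Hypothetical output file name, for illustration only.
my $out_file = 'junk_autoflush.csv';

open(my $out, '>', $out_file) or die "Cannot open $out_file: $!";

# Equivalent in effect to: my $old = select($out); $| = 1; select($old);
$out->autoflush(1);

# Made-up values standing in for $asset_s and $asset_s2.
print {$out} "12345,67890\n";

# With autoflush on, the line should already be on disk before close().
my $size = -s $out_file;
print "bytes on disk before close: $size\n";

close($out) or die "Cannot close $out_file: $!";
unlink $out_file;    # clean up the illustration file
```

If the reported size is greater than zero before close(), the flush itself is
working, and the empty output file would have to come from somewhere else.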

Thanks for the advice.
Mitchell


_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs