At 08:32 AM 7/19/2007, you wrote:
>Mitchell A. Petersen wrote:
> > I am new to perl, so this may be a dumb question.
> >
> > I have written a Perl program that reads firms' 10-Ks (their financial
> > disclosures) looking for their total assets. Some of the files are in HTML,
> > so I use HTML::TreeBuilder and HTML::TableContentParser. Some of the files
> > are in text, so I use a regular expression to find the row that says
> > "total assets" and then scan across to find the number. I have copied the
> > files to my hard disk to speed up the process. The program searches through
> > each file sequentially, then writes out a line of output for each file. The
> > program crashes when the output file reaches 32,768 bytes. I have changed
> > the files I feed the program in case they are the problem, and it still
> > crashes at 32,768 bytes.
> >
> > I am running the Perl program in a DOS window (cmd.exe) under Windows
> > Vista. If there is a smarter way to do this, I'd love to hear it. The
> > output file does not contain any data until the program crashes. I included
> > the command $|++, which I thought would cause the print buffer to flush to
> > the output file, but this doesn't seem to be working.
> >
> > If there is other info that I should add, please let me know. Thanks.
>
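For reference, here is a minimal sketch of the kind of text-file scan described above. It is not the actual program: the sample lines, the subroutine name total_assets_from_lines, and the exact patterns are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample lines; real 10-K text layouts vary widely.
my @lines = (
    "Total current assets            1,234,567      1,100,000",
    "Total assets                    9,876,543      8,765,432",
);

# Return every number-like token on the first "total assets" row.
sub total_assets_from_lines {
    my @text = @_;
    for my $line (@text) {
        next unless $line =~ /^\s*total\s+assets\b/i;
        # Runs of 4 to 12 digits, commas, or periods.
        my @nums = $line =~ /([\d,.]{4,12})/g;
        s/,//g for @nums;    # strip thousands separators
        return @nums;
    }
    return;
}

my @found = total_assets_from_lines(@lines);
print join(',', @found), "\n";
```

Anchoring the pattern at the start of the row keeps "Total current assets" from matching, which a bare /total asset/ test would not.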
>First, if you aren't putting a newline out every line - add that.
>Make sure you're closing each input file after scanning it.
I was writing out a newline every time, but was not closing the input file
after each use. I added the close, but it didn't solve the problem.
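To make the output side concrete, here is a small sketch of writing one line per record with autoflush set on the output handle itself. It uses static placeholder data, and a temporary file stands in for the real CSV path (both are assumptions). Note that $| only affects the currently selected handle, which is STDOUT unless select() has been called.

```perl
use strict;
use warnings;
use IO::Handle;               # provides the autoflush method on file handles
use File::Temp qw(tempfile);

# A temporary file stands in for the real output path.
my ($out, $out_name) = tempfile(UNLINK => 1);
$out->autoflush(1);           # flush after every print to this handle

# Hypothetical placeholder data instead of parsed filings.
for my $n (1 .. 3) {
    print {$out} "$n,$n\n" or die "write failed: $!";
}
close($out) or die "close failed: $!";

# Read the file back to confirm what was written.
open(my $in, '<', $out_name) or die "cannot reopen $out_name: $!";
my @written = <$in>;
close($in);
print scalar(@written), " lines written\n";
```

Checking the return value of print and close at least reports a failed write instead of losing it silently.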
>If that doesn't help, create a complete program snippet that fails as
>you describe (you may not need to read the files if you can reproduce
>it without parsing the files - just write the output as you currently
>are using some static data and see if that fails for you). Assuming
>you can reproduce the error, post that snippet.
Processing the text files is not the problem: when I read only those, the
program doesn't crash. It is the HTML files that cause the problem.
The program snippet that processes the HTML files is:
use warnings;
use strict;
use LWP::Simple;
use HTML::TreeBuilder;
use HTML::TableContentParser;

my ($asset_s,$asset_s2,@col_asset,@column,$column,@rows,$row,$total,$yes);
@col_asset = undef;
@column = undef;
@rows = undef;
$asset_s = 0;
$asset_s2 = 0;
$total = 0;

open (WRITE1, ">\\res\\edgar\\match\\gcu_unchecked3_junk.csv");
my $old_fh = select(WRITE1);
$| = 1;
select($old_fh);

unless (open (READ2, "d:\\res\\edgar\\10k\\2178_0000002178-06-000013.txt")) {
    next;
}
my $doc = join '', <READ2>;

while ($total <= 3000) {
    my $root = HTML::TreeBuilder->new;
    $root->parse($doc);
    $root->eof();
    my @tables = undef;
    @tables = $root->find_by_tag_name('TABLE');
    foreach my $table (@tables) {
        if (($table->as_text_trimmed =~ /total asset/is) &&
            ($table->as_text_trimmed =~ /(\d|,){4,12}/is)) {
            @rows = $table->find_by_tag_name('tr');
            foreach $row (@rows) {
                if ($row->as_text_trimmed =~ /^total asset/i) {
                    @column = $row->find_by_tag_name('td');
                    foreach $column (@column) {
                        if ($column->as_text_trimmed =~ m/((\d|,|\.){4,12})/) {
                            $yes = $column->as_text_trimmed;
                            push (@col_asset, "$yes");
                        }
                    }
                    $asset_s = $col_asset[1];
                    $asset_s2 = $col_asset[-1];
                    last;
                }
            }
            $asset_s =~ s/(,|$| |=)//g;
            $asset_s2 =~ s/(,|$| |=)//g;
            last;
        }
    }
    print WRITE1 "$asset_s,$asset_s2\n";
    $total++
}
close(READ2);
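One thing possibly worth checking, offered as an assumption rather than a confirmed diagnosis: the loop above builds a new HTML::TreeBuilder tree on every pass and never frees it. The HTML::Element documentation for the releases current at the time says a tree should be released explicitly with delete, because its parent/child links are circular references. A minimal sketch with a static, made-up document follows; the subroutine name first_and_last_total is hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical static document standing in for a filing.
my $doc = '<html><body><table>'
        . '<tr><td>Total assets</td><td>1,234,567</td><td>1,100,000</td></tr>'
        . '</table></body></html>';

# Return the first and last numeric cells of the "total asset" row.
sub first_and_last_total {
    my ($html) = @_;
    my $root = HTML::TreeBuilder->new;
    $root->parse($html);
    $root->eof;

    my @col_asset;    # starts empty for every document
    for my $row ($root->find_by_tag_name('tr')) {
        next unless $row->as_text_trimmed =~ /^total asset/i;
        for my $cell ($row->find_by_tag_name('td')) {
            my $text = $cell->as_text_trimmed;
            push @col_asset, $text if $text =~ /[\d,.]{4,12}/;
        }
        last;
    }
    $root->delete;    # release the tree before the next document
    return @col_asset ? @col_asset[0, -1] : ();
}

for my $i (1 .. 3) {
    my @pair = first_and_last_total($doc);
    s/,//g for @pair;    # strip thousands separators
    print join(',', @pair), "\n";
}
```

Declaring @col_asset inside the subroutine also keeps matches from one document from carrying over to the next, which the file-level array in the snippet above does not guarantee.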
Thanks for the advice.
Mitchell
_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs