Mitchell A. Petersen <> wrote:
> Processing the text files is not the problem -- when I read only
> these the
> program doesn't crash. It is the HTML files that are causing the
> problem.
> The program snippet that processes the HTML files is:
>
>
> use warnings;
> use strict;
Good start.
> use LWP::Simple;
I can't see where this is used.
> use HTML::TreeBuilder;
> use HTML::TableContentParser;
I can't see where this is used either.
>
> my
($asset_s,$asset_s2,@col_asset,@column,$column,@rows,$row,$total,$yes);
> @col_asset = undef;
> @column = undef;
> @rows = undef;
Did you realise that the above arrays are not empty? They contain a
single entry whose value is undef. If this is intentional, it seems
strange to me, and therefore warrants a comment explaining why.
> $asset_s = 0;
> $asset_s2 = 0;
> $total = 0;
Avoid global variables as much as possible, declare variables in the
minimum necessary scope. After looking at the rest of the code I can see
no reason for any of these variables to be global, and in some cases, no
real need for the variable. In fact one of those globals may be the
cause of your problem.
> open (WRITE1,
">\\res\\edgar\\match\\gcu_unchecked3_junk.csv");
You should always check the result of open.
> my $old_fh = select(WRITE1);
> $| = 1;
> select($old_fh);
The above is not strictly necessary, as your output lines are terminated
with "\n" which should cause a flush.
> unless (open (READ2,
> "d:\\res\\edgar\\10k\\2178_0000002178-06-000013.txt")) {
> next;
>
> }
Although you appear to be checking whether the open worked, that 'next'
means you are effectively ignoring the result of that check.
> my $doc = join '',
> <READ2>;
>
> while ($total <= 3000) {
for (1..3000) {
> my $root =
> HTML::TreeBuilder->new;
> $root->parse($doc);
> $root->eof();
Why the eof call? I checked the documentation and found out why its
necessary. I've learnt something from your post. Thanks for that.
>
> my @tables = undef;
> @tables = $root->find_by_tag_name('TABLE');
> foreach my $table (@tables) {
array variable not necessary:
foreach my $table ($root->find_by_tag_name('TABLE')) {
> if (($table->as_text_trimmed =~ /total asset/is) &&
> ($table->as_text_trimmed =~ /(\d|,){4,12}/is)) {
> @rows = $table->find_by_tag_name('tr');
> foreach $row (@rows) {
Similarly for @rows...
> if ($row->as_text_trimmed =~ /^total
> asset/i) {
> @column =
> $row->find_by_tag_name('td');
> foreach $column (@column) {
... and @column
> if
> ($column->as_text_trimmed =~ m/((\d|,|\.){4,12})/) {
> $yes =
> $column->as_text_trimmed;
> push
(@col_asset,
> "$yes");
This is could be a problem, I can't see where @col_asset is reset, so it
continues to grow in size throughout program execution. If you declared
@col_asset in the smallest necessary scope you would have avoided that.
> }
> }
> $asset_s = $col_asset[1];
> $asset_s2 = $col_asset[-1];
> last;
> }
> }
> $asset_s =~ s/(,|$|
> |=)//g;
> $asset_s2 =~ s/(,|$| |=)//g;
> last;
> }
> }
> print WRITE1 "$asset_s,$asset_s2\n";
> $total++
> }
>
> close(READ2);
The close would be better immediately after you have read the whole file
Fixing the above and a few minor style issues looks like:
-------------------------------------------------
use strict;
use warnings;
use HTML::TreeBuilder;
my $ofn = "/res/edgar/match/gcu_unchecked3_junk.csv";
my $ifn = "d:/res/edgar/10k/2178_0000002178-06-000013.txt";
open my $ofd, ">", $ofn or die "Failed to open $ofn: $!\n";
my $doc = slurp($ifn);
for (1..3000) {
my $root = HTML::TreeBuilder->new;
$root->parse($doc);
$root->eof();
OUTER_LOOP:
foreach my $table ($root->find_by_tag_name('TABLE')) {
my $txt = $table->as_text_trimmed;
if (($txt =~ /total asset/i) && ($txt =~ /[\d,]{4,12}/)) {
foreach my $row ($table->find_by_tag_name('tr')) {
if ($row->as_text_trimmed =~ /^total asset/i) {
my @col_asset;
foreach my $column ($row->find_by_tag_name('td')) {
my $txt = $column->as_text_trimmed;
push @col_asset, $txt if $txt =~
/([\d,\.]{4,12})/;
}
my @vals = map {s/[,$ =]//g} @col_asset[0,-1];
print join(",", @vals), "\n";
last OUTER_LOOP;
}
}
}
}
}
sub slurp {
my $fn = shift;
open my $fd, "<", $fn or die "Failed to open $fn: $!\n";
local $/;
my $data = <$fd>;
close $fd;
return $data;
}
-------------------------------------------------
I can't test it as I don't have any data, but it compiles.
HTH
--
Brian Raven
=========================================
Atos Euronext Market Solutions Disclaimer
=========================================
The information contained in this e-mail is confidential and solely for the
intended addressee(s). Unauthorised reproduction, disclosure, modification,
and/or distribution of this email may be unlawful.
If you have received this email in error, please notify the sender immediately
and delete it from your system. The views expressed in this message do not
necessarily reflect those of Atos Euronext Market Solutions.
Atos Euronext Market Solutions Limited - Registered in England & Wales with
registration no. 3962327. Registered office address at 25 Bank Street London
E14 5NQ United Kingdom.
Atos Euronext Market Solutions SAS - Registered in France with registration no.
425 100 294. Registered office address at 6/8 Boulevard Haussmann 75009 Paris
France.
L'information contenue dans cet e-mail est confidentielle et uniquement
destinee a la (aux) personnes a laquelle (auxquelle(s)) elle est adressee.
Toute copie, publication ou diffusion de cet email est interdite. Si cet e-mail
vous parvient par erreur, nous vous prions de bien vouloir prevenir
l'expediteur immediatement et d'effacer le e-mail et annexes jointes de votre
systeme. Le contenu de ce message electronique ne represente pas necessairement
la position ou le point de vue d'Atos Euronext Market Solutions.
Atos Euronext Market Solutions Limited Société de droit anglais, enregistrée au
Royaume Uni sous le numéro 3962327, dont le siège social se situe 25 Bank
Street E14 5NQ Londres Royaume Uni.
Atos Euronext Market Solutions SAS, société par actions simplifiée, enregistré
au registre dui commerce et des sociétés sous le numéro 425 100 294 RCS Paris
et dont le siège social se situe 6/8 Boulevard Haussmann 75009 Paris France.
=========================================
_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs