OK, I had to try the two ways again to see how much difference it made. I
created a fixed-field file of random contents, 14,500 lines long by 80
columns wide, and tried processing the lines (using substr($_, ...) to break
each line into 4 sections, substituting based on a few patterns, and
changing a couple of columns, as in the real-life example I gave earlier) to
see whether loading the entire file into an array made as much performance
difference as I had previously thought. The difference on a file that size
was too small to be worth mentioning. Either way, it processed the
14,500-line file and wrote the new contents to the new file in less than
three seconds. Granted, I am using a different OS than when I ran that test
before, but still, the difference was virtually indiscernible. Therefore,
I'll concede my point about a significant performance difference.
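For the record, the comparison was along these lines (the file name, field
widths, pattern, and column change below are stand-ins, not my exact test):

use strict;
use Benchmark qw(timethese);

my $file = 'fixed.dat';    # stand-in for the 14,500 x 80 test file

sub munge {                # the per-line work: 4-way substr split plus edits
    my $line = shift;
    chomp $line;
    my @f = (substr($line,  0, 20), substr($line, 20, 20),
             substr($line, 40, 20), substr($line, 60, 20));
    $f[1] =~ s/foo/bar/;                  # stand-in pattern substitution
    $f[3] = sprintf '%-20s', 'changed';   # stand-in column change
    return join('', @f) . "\n";
}

timethese(10, {
    line_by_line => sub {
        open IN,  $file        or die "can't open $file: $!";
        open OUT, ">$file.new" or die "can't open $file.new: $!";
        while (<IN>) { print OUT munge($_) }
        close IN; close OUT;
    },
    slurp_first => sub {
        open IN,  $file        or die "can't open $file: $!";
        my @lines = <IN>;      # whole file into the array up front
        close IN;
        open OUT, ">$file.new" or die "can't open $file.new: $!";
        print OUT munge($_) for @lines;
        close OUT;
    },
});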
Steve Howard
-----Original Message-----
From: Steve Howard [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 13, 2001 5:46 PM
To: Jos Boumans
Cc: Stephen Henderson; [EMAIL PROTECTED]
Subject: RE: use of split command
The reason I prefer to read a file into an array (when it is manageable)
and then process it is my understanding of what is occurring with IO. It
seems to bear out in the performance when I have tested the two side by
side. When you use:

while (<HANDLE>)

you are accessing the disk for one line, then processing that line, making
another IO operation if you are writing to another file, and then starting
over with another disk IO operation for the next line. Reading the file into
an array in one large chunk and then processing it gives one IO operation
for reading the file, then almost negligible processing time, and one IO
operation for writing it to another file.
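In other words, the shape I have in mind is this (the file names and the
substitution are just placeholders):

open IN, 'input.txt' or die "can't open input.txt: $!";
my @lines = <IN>;              # one read pass pulls in the whole file
close IN;

s/foo/bar/ for @lines;         # all the processing happens in memory

open OUT, '>output.txt' or die "can't open output.txt: $!";
print OUT @lines;              # one write pass puts it all back out
close OUT;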
When dealing with files that are too large to handle that way, I try to use
DBD::CSV (since I'm a DBA who works primarily on conversions, I usually get
nicely delimited files), or I read the file in chunks of about 1,000 lines
at a time to try to minimize the IO operations without killing my available
RAM.
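Roughly like this, for both fallbacks (the directory, file and table names,
chunk size, and per-line work are all placeholders):

use strict;
use DBI;

# SQL-style access to a delimited file via DBD::CSV
my $dbh = DBI->connect('dbi:CSV:f_dir=/conversion/data')
    or die 'connect failed: ' . DBI->errstr;
my $sth = $dbh->prepare('SELECT * FROM accessions');
$sth->execute;
while (my ($first, $second) = $sth->fetchrow_array) {
    # do what you need with the two columns
}
$dbh->disconnect;

# ...or plain reads in chunks of ~1,000 lines to cap memory use
open IN,  'big_file.txt'      or die "can't open big_file.txt: $!";
open OUT, '>big_file.txt.new' or die "can't open big_file.txt.new: $!";
my @chunk;
while (<IN>) {
    push @chunk, $_;
    if (@chunk == 1000 or eof IN) {   # flush on a full chunk or at EOF
        s/foo/bar/ for @chunk;        # placeholder per-line work
        print OUT @chunk;
        @chunk = ();
    }
}
close IN; close OUT;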
Your point is well taken about adding error trapping, using strict, and
declaring lexical variables - those are things I normally practice, but
things I suppose I wrongly assumed to be outside the scope of the question
asked when responding here.
Steve Howard
-----Original Message-----
From: Jos Boumans [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 13, 2001 6:55 AM
To: Steve Howard
Cc: Stephen Henderson; [EMAIL PROTECTED]
Subject: Re: use of split command
I'd like to make a few adjustments to this code, and give you some things
you might want to consider using:
open (ACCS, "C:\\Perl\\BioPerl-0.7\\Seqdata\\Accession.txt")
    or die "can't open Accessions file: ", $!;
# this will also print the error returned by the system call... useful for
# figuring out *why* the file wasn't opened
# you might want to consider putting the location in a variable, like $loc,
# and having open operate on $loc... that makes it easy if you use that file
# more often in the script.
# depending on the size of the file, you will probably want to do one of
# the following:
# Option 1 reads the entire file into a string, which will save you
# somewhere around 24 bytes per line compared to an array
# quick math says that saves you ~2.4MB of RAM on 100k lines
# putting it in a string is nice for doing s///, m// operations and for
# passing it around to subroutines etc.
*** Option 1 ***
my $in;
{ local $/; $in = <HANDLE> }        # undef $/ slurps the whole file at once
for (split /\n/, $in) {
    my ($one, $two) = split /\t/;   # assuming a tab-delimited list again
    # do something with those vars
}
# however, you might just want to lower the memory load and use:
*** Option 2 ***
while (<HANDLE>) {
    my ($one, $two) = split /\t/;   # assuming a tab-delimited list again
    # do something with those vars
}
Doing it as quoted below is not very smart, seeing that it first copies all
the info to an array, and then handles it one line at a time anyway in the
foreach loop. That has the memory load of option 1 with the functionality of
option 2... the worst of both worlds, so to speak.
The last remark I want to make is that it's always good to run 'use strict'
and force yourself to use lexicals (my'd vars).
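Just to spell out what strict buys you (a made-up two-liner):

use strict;

my $count = 0;
$cuont++;   # typo caught at compile time:
            # Global symbol "$cuont" requires explicit package name

Without strict, $cuont would silently spring into existence and the bug
would go unnoticed.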
hope this helps,
Jos Boumans
> open (ACCS, "C:\\Perl\\BioPerl-0.7\\Seqdata\\Accession.txt") or die "can't
> open Accessions file";
> @ets=<ACCS>;
>
> foreach (@ets) {
>
>     ($first, $second) = split(/\t/, $_);   # (splits current line on a tab: \t)
>
> # do what you need with the two variables
>
> }
>
> you are right, that is a very fast way to deal with files.
>
> If you have regularly delimited files, and would prefer to work with them
> using SQL-like syntax, you might look at DBD::CSV for another alternative.
>