Find regex in Start/Stop segments

2003-06-10 Thread Harry Putnam
I use a homeboy data base technique to keep info about the scripts I
write and other types of stuff too.  Here I'm just dealing with
scripts.

It's a simple format to enter key information about what a script
does.  Looks like:

# Keywords: SOME WORDS
# body
# body
# DATE
# &&

I've written various scripts to search this format in awk and shell.
Now trying it in perl.  I have several working scripts but wanted to
get some ideas from the sharp shooters here how to do this better.

My technique seems like it could be streamlined and improved quite a
lot.

The sample below just handles the basic technique and isn't completed
with all tests and etc.  Just some basic ones. But really I'm more
interested in hearing better ways to accomplish this.

The basic task is to locate a formatted segment, search its keywords
line for regex then print the segment.  Also a basic check for
misformatted segments.

Not too concerned with how the files are acquired but what comes after.

#!/usr/local/bin/perl -w

($myscript = $0) =~ s:^.*/::;
$regex = shift;
## Set Keywords start end regex for non script searching (The default)
$keyreg = '^# Keywords:';
$keyend = '^# &&$';

if (!$ARGV[0]) {
  usage();
  exit;
}
## Acquire the files in whatever way
@files = @ARGV;

## Set a marker to know when we are in a new file
$fname_for_line_cnt = '';
for (@files) {
  chomp;
  $file = $_;
  if ("$fname_for_line_cnt" eq "$file") {
    ## This shouldn't happen
    print "We're reading the same file again .. exiting\n";
    exit;
  } else {
    ## Set lineno to 0 for start of each file
    $lineno = 0;
    $fname_for_line_cnt = $file;
  }
  if (-f $file) {
    open(FH,"<$file") or die "Cannot open $file: $!";
    while (<FH>) {
      chomp;
      $lineno++;
      $line = $_;
      if (/$keyreg $regex/) {
        print "$file\n";
        $hit = "TRUE";
      }
      if ($hit) {
        print "$lineno $line\n";
      }
      if ($hit && /$keyend/) {
        ## We've hit the end of a good segment, print delimiter and null
        ## out our vars
        print "-- \n";
        $hit  = '';
        $line = '';
      }
      if ($hit && /^[^#]/ || $hit && eof) {
        ## If we see this situation it means the format is screwed up
        ## Notify user of the line number, but null out vars and proceed.
        print "$file:\n   INCOMPLETE SEGMENT ENTRY: Line <$lineno>\n --\n";
        $hit  = '';
        $line = '';
      }
    }
    close(FH);
  } else {
    next;
  }
}
sub usage {
  print <<EOM;
Purpose: Search scripts keyword segments (or any file)
Usage: \`$myscript "REGEX" file ... fileN (or glob)'
  (Where REGEX is a regex to be found in Keyword segment)
EOM
}

Re: Find regex in Start/Stop segments

2003-06-11 Thread Tassilo von Parseval
On Tue, Jun 10, 2003 at 11:49:25PM -0700 Harry Putnam wrote:

> I use a homeboy data base technique to keep info about the scripts I
> write and other types of stuff too.  Here I'm just dealing with
> scripts.
> It's a simple format to enter key information about what a script
> does.  Looks like:
> # Keywords: SOME WORDS
> # body
> # body
> # DATE
> # &&
> I've written various scripts to search this format in awk and shell.
> Now trying it in perl.  I have several working scripts but wanted to
> get some ideas from the sharp shooters here how to do this better.
> My technique seems like it could be streamlined and improved quite a
> lot.

Yes, it's a little wordy considering it's Perl.

> The sample below just handles the basic technique and isn't completed
> with all tests and etc.  Just some basic ones. But really I'm more
> interested in hearing better ways to accomplish this.
> The basic task is to locate a formatted segment, search its keywords
> line for regex then print the segment.  Also a basic check for
> misformatted segments.
> Not too concerned with how the files are acquired but what comes after.
> ^^
> #!/usr/local/bin/perl -w
> ($myscript = $0) =~ s:^.*/::; 

You are allowed to manipulate $0, too. The new value of $0 is the one
that eventually shows up in your process table (unless you are
using Perl 5.8.0, where this does not work due to a bug).

> $regex = shift;
> ## Set Keywords start end regex for non script searching (The default)
> $keyreg = '^# Keywords:';
> $keyend = '^# &&$';
> if (!$ARGV[0]) {
>   usage();
>   exit;
> }
> ## Acquire the files in whatever way
> @files = @ARGV;
> ## Set a marker to know when we are in a new file
> $fname_for_line_cnt = '';
> for (@files) { 
>   chomp;

I don't think that the entries in @ARGV contain newlines at the end.
Actually I know they don't. :-)

>   $file = $_;
>   if ("$fname_for_line_cnt" eq "$file") {

There is no reason to put those variables into quotes.

>     ## This shouldn't happen
>     print "We're reading the same file again .. exiting\n";
>     exit;

That is better solved using a hash. Fill all the files into a hash (as
keys) and iterate over the keys. That way, it's guaranteed you only
inspect each file once.
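
A minimal sketch of that idea, for anyone following along. The list
@args is a made-up stand-in for @ARGV. Note that keys() returns names in
no particular order, so the second variant (a %seen hash with grep) is
shown for the case where the original order matters.

```perl
use strict;
use warnings;

# Hypothetical argument list standing in for @ARGV; 'c.pl' appears twice.
my @args = ('c.pl', 'a.pl', 'c.pl', 'b.pl');

# Hash slice: every name becomes a key, so duplicates collapse.
my %files;
@files{@args} = ();
my @unique = sort keys %files;    # sorted only to get a stable order

# Alternative that keeps the order in which names were first seen.
my %seen;
my @ordered = grep { !$seen{$_}++ } @args;

print "@unique\n";
print "@ordered\n";
```

Either way each file is visited once, without comparing against the
previous file name by hand.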

>   } else {
> ## Set lineno to 0 for start of each file
> $lineno = 0;
> $fname_for_line_cnt = $file;
>   }
>   if (-f $file) {
>     open(FH,"<$file") or die "Cannot open $file: $!";
>     while (<FH>) {
>       chomp;
>       $lineno++;

You don't have to keep track of the line numbers yourself. Perl offers
the special variable $. for that.
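
A small sketch of $. in use. The in-memory string $text is only a
stand-in for a real file, chosen so the example runs on its own.

```perl
use strict;
use warnings;

# In-memory text standing in for a real file (opened via a scalar reference).
my $text = "first\nsecond\nthird\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @numbered;
while (my $line = <$fh>) {
    chomp $line;
    push @numbered, "$. $line";   # $. is the line number of the last handle read
}
close $fh;                        # an explicit close resets $.

print "$_\n" for @numbered;
```

When looping over several files, closing each handle before opening the
next one is what makes the count start from 1 again.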

>       $line = $_;
>       if (/$keyreg $regex/) {
>         print "$file\n";
>         $hit = "TRUE";
>       }
>       if ($hit) {
>         print "$lineno $line\n";
>       }
>       if ($hit && /$keyend/) {
>         ## We've hit the end of a good segment, print delimiter and null
>         ## out our vars
>         print "-- \n";
>         $hit  = '';
>         $line = '';
>       }
>       if ($hit && /^[^#]/ || $hit && eof) {
>         ## If we see this situation it means the format is screwed up
>         ## Notify user of the line number, but null out vars and proceed.
>         print "$file:\n   INCOMPLETE SEGMENT ENTRY: Line <$lineno>\n --\n";
>         $hit  = '';
>         $line = '';
>       }
>     }
>     close(FH);
>   } else {
>     next;
>   }
> }
> sub usage {
>   print <<EOM;
> Purpose: Search scripts keyword segments (or any file)
> Usage: \`$myscript "REGEX" file ... fileN (or glob)'
>   (Where REGEX is a regex to be found in Keyword segment)
> EOM
> }

I'd probably write it like that:

#!/usr/local/bin/perl -w
use strict;

$0 =~ s:.*/::;

my $regex = shift;
$regex = qr/^# Keywords: $regex/;   # could improve performance a little

my %files;
@files{ @ARGV } = ();   # a hash-slice: see 'perldoc perldata'

usage(), exit if ! @ARGV;

for my $file (keys %files) {
    next if ! -f $file;
    open FILE, "<", $file or die "Error opening $file: $!";

    my $hit;
    while (<FILE>) {
        chomp;                      # the "\n" is added back when printing
        $hit++ if /$regex/o;        # start of record
        print "$. $_\n" if $hit;    # $. is the line number
        $hit-- if /^# &&$/;         # end of record
        print "$file:\n\tINCOMPLETE SEGMENT ENTRY: Line <$.>\n--\n"
            and $hit--
                if $hit && !/^#/ or $hit && eof;
    }
    close FILE;                     # an explicit close resets $.
}

sub usage {
    ...
}

I didn't test it, but it should produce the same result as your script
and do so considerably more quickly. Please subtract any possible
syntax errors or logical flaws from the script before running it. ;-)
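
For reference, a minimal self-contained sketch of the same start/stop
scan that can be run without any files on disk. It is not the code from
either post: the function name scan_segments, the sample text and the
file name demo.pl are made up for illustration, and the input is read
from an in-memory string. As in both scripts, the pattern is anchored
directly after "# Keywords: ".

```perl
use strict;
use warnings;

# Scan one file's text for "# Keywords:" segments whose keyword line
# matches $pattern. Returns the lines that would be printed.
sub scan_segments {
    my ($name, $pattern, $text) = @_;
    my $start = qr/^# Keywords: $pattern/;
    my @out;
    my $hit = 0;

    open my $fh, '<', \$text or die "Cannot open $name: $!";
    while (my $line = <$fh>) {
        chomp $line;
        if ($hit && $line !~ /^#/) {
            # A non-comment line before "# &&": the segment was never closed.
            push @out, "$name: INCOMPLETE SEGMENT ENTRY: Line <$.>", '--';
            $hit = 0;
            next;
        }
        if (!$hit && $line =~ $start) {
            push @out, $name;          # announce the file once per segment
            $hit = 1;
        }
        next unless $hit;
        push @out, "$. $line";         # $. is the current line number
        if ($line =~ /^# &&$/) {       # end of a well-formed segment
            push @out, '--';
            $hit = 0;
        }
    }
    # Still inside a segment when the file ended.
    push @out, "$name: INCOMPLETE SEGMENT ENTRY: Line <$.>", '--' if $hit;
    close $fh;
    return @out;
}

# Hypothetical sample: one complete segment, one that is never closed.
my $sample = <<'END';
# Keywords: backup rsync
# nightly backup wrapper
# &&
print "code\n";
# Keywords: backup tar
# never closed
print "more code\n";
END

my @result = scan_segments('demo.pl', 'backup', $sample);
print "$_\n" for @result;
```

Returning the lines instead of printing them directly is only a
convenience for checking the result; printing inside the loop would work
just as well.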


Re: Find regex in Start/Stop segments

2003-06-11 Thread Harry Putnam
Tassilo von Parseval <[EMAIL PROTECTED]> writes:

> You don't have to keep track of the line numbers yourself. Perl offers
> the special variable $. for that.

An awkism I guess, a holdover from awk use.
Thanks for the tips.

> I'd probably write it like that:

Quite a lot shorter... and to the point.

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Find regex in Start/Stop segments

2003-06-11 Thread John W. Krahn
Tassilo Von Parseval wrote:
> On Tue, Jun 10, 2003 at 11:49:25PM -0700 Harry Putnam wrote:
> >
> > ## Set a marker to know when we are in a new file
> > $fname_for_line_cnt = '';
> > for (@files) {
> >   chomp;
> I don't think that the entries in @ARGV contain newlines at the end.
> Actually I know they don't. :-)

Well it could happen but you would really have to want it that way.

$ perl -le'for(@ARGV){print length; chomp; print length}' one 'two
' three

And of course, if you have a file name with an actual newline in it then
you don't want to remove it.  :-)
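
The same check written out as a small script, as a sketch. The list
@args is a hypothetical stand-in for @ARGV, with the second entry ending
in a real newline, like the quoted 'two' argument above.

```perl
use strict;
use warnings;

# Hypothetical arguments; the second one ends in a real newline.
my @args = ('one', "two\n", 'three');

my @report;
for my $arg (@args) {
    my $before  = length($arg);
    # chomp works on a copy here and returns the number of characters removed.
    my $removed = chomp(my $copy = $arg);
    push @report, sprintf('%d -> %d (removed %d)', $before, length($copy), $removed);
}
print "$_\n" for @report;
```

Only the middle argument changes length, which is exactly the case where
chomp would silently turn one file name into a different one.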

use Perl;


Re: Find regex in Start/Stop segments

2003-06-13 Thread R. Joseph Newton
Tassilo von Parseval wrote:

> >   chomp;
> I don't think that the entries in @ARGV contain newlines at the end.
> Actually I know they don't. :-)

That's good, and that is why chomp is an excellent choice for this context:
the OP may not know, or be sure of, that fact.  The chomp function is
custom-designed for cases of uncertainty, and is perfectly safe in cases where
there is no tail-junk to remove.  Please don't discourage its use.



Re: Find regex in Start/Stop segments

2003-06-13 Thread Tassilo von Parseval
On Fri, Jun 13, 2003 at 03:09:20PM -0700 R. Joseph Newton wrote:

> Tassilo von Parseval wrote:
> > >   chomp;
> >
> > I don't think that the entries in @ARGV contain newlines at the end.
> > Actually I know they don't. :-)
> >
> That's good, and that is why chomp is an excellent choice for this context:
> the OP may not know, or be sure of, that fact.  The chomp function is
> custom-designed for cases of uncertainty, and is perfectly safe in cases where
> there is no tail-junk to remove.  Please don't discourage its use.

I was not discouraging its use. I was rather pointing out that @ARGV
does (usually) not contain trailing newlines. chomp() should be used
when - conceptually - there could be something to remove. In case of
filenames however you either don't have anything to remove or you don't
want to remove it. That way this chomp() could even be wrong (as John
remarked in his follow-up).



Re: Find regex in Start/Stop segments

2003-06-14 Thread Harry Putnam
Tassilo von Parseval <[EMAIL PROTECTED]> writes:

>> That's good, and that is why chomp is an excellent choice for this
>> context: the OP may not know, or be sure of, that fact.
>> The chomp function is custom-designed for cases of uncertainty, and
>> is perfectly safe in cases where there is no tail-junk to remove.
>> Please don't discourage its use.

> I was not discouraging its use. I was rather pointing out that @ARGV
> does (usually) not contain trailing newlines. chomp() should be used
> when - conceptually - there could be something to remove. In case of
> filenames however you either don't have anything to remove or you
> don't want to remove it. That way this chomp() could even be wrong
> (as John remarked in his follow-up).

I'm the OP,  so for clarity here:  Why chomp? Well, it was really a
typo.  It was supposed to say `chop'.

During the night some homeless guy slipped in and created a bunch of
filenames with control chars in them, like this.

`Harry is a jerk^MHarry is a saint'

So you can see why `chop' was called for.  I need to learn to type
better... hehe.
