Re: There has to be a way to do this

2003-06-23 Thread Jeff 'japhy' Pinyan
On Jun 23, [EMAIL PROTECTED] said:

>SRED. SREDNE
>SEV.  SEVERN

># Match it at beginning of line
>$cgname =~ s/^SRED\.(?=[\W\s\-\d]+)/SREDNE:/g ;

Three things: the + quantifier on the [...] isn't needed inside a
look-ahead, you don't need to put \s and - in a character class you've
already put \W in, and the /g modifier is totally worthless here... there's
only ONE beginning of the line!

  $cgname =~ s/^SRED\.(?=[\W\d])/SREDNE:/;
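
To check the claim about the redundant class members, here is a minimal
sketch that compares the original class with the trimmed one over every
ASCII character. The 0..127 range is an assumption made for the demo; it is
not something the original data is known to be limited to.

```perl
use strict;
use warnings;

# Collect every ASCII code point on which the two classes disagree.
my @differ = grep {
    my $c = chr($_);
    ($c =~ /[\W\s\-\d]/ ? 1 : 0) != ($c =~ /[\W\d]/ ? 1 : 0);
} 0 .. 127;

print @differ
    ? "classes differ\n"
    : "classes agree on all ASCII characters\n";
```

Whitespace and the hyphen are already non-word characters, so \W covers
them and only \d adds anything new.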

># Match it within the line
>$cgname =~ s/[\W\s\-]+SRED\.(?=[\W\s\-\d]+)/:SREDNE:/g ;

I have a feeling you want to use \b instead of [\W\s-].  It's cleaner and
doesn't actually absorb a character.

  $cgname =~ s/\bSRED\.(?=[\W\d])/:SREDNE:/g;
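
A small sketch of what "doesn't absorb a character" means in practice. The
sample name "BIG SRED. 12" is made up for illustration.

```perl
use strict;
use warnings;

my $consuming = "BIG SRED. 12";
my $boundary  = "BIG SRED. 12";

# The original pattern swallows the space in front of SRED.
$consuming =~ s/[\W\s\-]+SRED\.(?=[\W\d])/:SREDNE:/g;

# \b is zero-width, so the space in front of SRED. survives.
$boundary =~ s/\bSRED\.(?=[\W\d])/:SREDNE:/g;

print "$consuming\n";   # BIG:SREDNE: 12
print "$boundary\n";    # BIG :SREDNE: 12
```

The two results are not identical, so if the later matching step depends on
the separators being removed, that would need to be handled separately.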

># Match it at end of line
>$cgname =~ s/[\W\s\-]+SRED\.$/:SREDNE:/g ;

Again, use \b, but there's no need for /g here.

  $cgname =~ s/\bSRED\.$/:SREDNE:/;

># Match if it begins & ends line
>$cgname =~ s/^SRED\.$/:SREDNE:/g ;

Ah, here's an interesting case.  This is actually already handled by my
modifications.  The problem is that you were using

  /[\W\s\-]+SRED\.$/

but if the string is "SRED.", there is nothing in front of SRED for
[\W\s\-]+ to match, so the substitution fails.  That's why using a word
boundary (\b) is smarter.  Also, we can change the look-aheads to go from
positive to negative.

Instead of saying "and I am followed by a non-letter", why not say "and I
am NOT followed by a letter"?

  $cgname =~ s/^SRED\.(?![A-Za-z])/SREDNE:/; # front
  $cgname =~ s/\bSRED\.(?![A-Za-z])/:SREDNE:/g;  # middle
  $cgname =~ s/\bSRED\.$/:SREDNE:/;  # end
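
The difference between the positive and the negative look-ahead shows up at
the end of the string. A minimal sketch, using three made-up names:

```perl
use strict;
use warnings;

my %result;
for my $name ("SRED.", "SRED. 7", "SRED.X") {
    # Positive look-ahead needs a character to look at;
    # negative look-ahead is also satisfied by end-of-string.
    my $pos = $name =~ /^SRED\.(?=[\W\d])/   ? 1 : 0;
    my $neg = $name =~ /^SRED\.(?![A-Za-z])/ ? 1 : 0;
    $result{$name} = [ $pos, $neg ];
    printf "%-10s positive: %d  negative: %d\n", qq("$name"), $pos, $neg;
}
```

One side effect worth checking: with the negative form, the "front" rule now
fires on a bare "SRED." and yields "SREDNE:" before the "end" rule gets a
chance to yield ":SREDNE:".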

If you're worried about hardcoding the letter set (A-Za-z), then you can
use this character class instead:  [^\W\d_].  It means "match anything
that's not:  a non-word character, a digit, or an underscore".  It's a
sneaky way of matching anything that would be matched by \w WITHOUT
matching \d or _.

  $cgname =~ s/^SRED\.(?![^\W\d_])/SREDNE:/; # front
  $cgname =~ s/\bSRED\.(?![^\W\d_])/:SREDNE:/g;  # middle
  $cgname =~ s/\bSRED\.$/:SREDNE:/;  # end
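
To see what the double-negative class picks out, this sketch lists every
ASCII character it matches. Restricting the demo to 0..127 is an assumption
for brevity.

```perl
use strict;
use warnings;

# Keep only the characters matched by [^\W\d_].
my $letters = join '', grep { /^[^\W\d_]$/ } map { chr($_) } 0 .. 127;

print "$letters\n";
```

On ASCII input this comes out as exactly the upper- and lower-case letters,
the same set as [A-Za-z].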

>Right now I'm generating the regexes in a standalone script, then inserting
>the output code into the subroutine that processes names into a "matchable"
>form.
>
>What I'd like to be able to do is take a *set* of abbreviation
>"dictionaries," concatenate them together and dynamically generate the
>regex code in the routine that is going to execute it.

So you want to take the dictionary files, and use them to create a
function that does all the regexes on its input?
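
If that is the goal, one possible shape is a closure over precompiled
patterns. This is only a sketch: make_expander is a hypothetical name, the
dictionary layout (abbreviation, whitespace, long form) is assumed from the
two sample lines, and the separate "end" rule is dropped because the
negative look-ahead in the "middle" rule already succeeds at end-of-string.

```perl
use strict;
use warnings;

# Hypothetical helper: turn dictionary lines into one expander function.
sub make_expander {
    my @lines = @_;
    my @rules;
    for my $line (@lines) {
        my ($abbr, $full) = split ' ', $line;
        next unless defined $full;          # skip blank or malformed lines
        my $q = quotemeta $abbr;            # escape the trailing dot
        push @rules, [ qr/^$q(?![^\W\d_])/, qr/\b$q(?![^\W\d_])/, $full ];
    }
    return sub {
        my ($name) = @_;
        for my $rule (@rules) {
            my ($front, $middle, $full) = @$rule;
            $name =~ s/$front/${full}:/;        # at the start of the name
            $name =~ s/$middle/:${full}:/g;     # anywhere else, including the end
        }
        return $name;
    };
}

my $expand = make_expander("SRED. SREDNE", "SEV.  SEVERN");
print $expand->("SRED. 12"),   "\n";   # SREDNE: 12
print $expand->("BIG SEV. 3"), "\n";   # BIG :SEVERN: 3
```

The patterns are compiled once with qr// when the dictionary is read, so the
per-name work is just running them.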

-- 
Jeff "japhy" Pinyan  [EMAIL PROTECTED]  http://www.pobox.com/~japhy/
RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
 what does y/// stand for?   why, yansliterate of course.
[  I'm looking for programming work.  If you like my work, let me know.  ]


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: There has to be a way to do this

2003-06-23 Thread scott . e . robinson

I don't have code to do what I want, but here are the pieces I'm trying to
string together:

Abbreviation dictionary consists of a file like this:

SRED. SREDNE
SEV.  SEVERN
etc.

Each abbreviation is turned into four regexes, like this (doubtless they
could be made more efficient, but they work well enough at present):

# Sred. = SREDNE
# Match it at beginning of line
$cgname =~ s/^SRED\.(?=[\W\s\-\d]+)/SREDNE:/g ;
# Match it within the line
$cgname =~ s/[\W\s\-]+SRED\.(?=[\W\s\-\d]+)/:SREDNE:/g ;
# Match it at end of line
$cgname =~ s/[\W\s\-]+SRED\.$/:SREDNE:/g ;
# Match if it begins & ends line
$cgname =~ s/^SRED\.$/:SREDNE:/g ;

# Sev.  = SEVERN
# Match it at beginning of line
$cgname =~ s/^SEV\.(?=[\W\s\-\d]+)/SEVERN:/g ;
# Match it within the line
$cgname =~ s/[\W\s\-]+SEV\.(?=[\W\s\-\d]+)/:SEVERN:/g ;
# Match it at end of line
$cgname =~ s/[\W\s\-]+SEV\.$/:SEVERN:/g ;
# Match if it begins & ends line
$cgname =~ s/^SEV\.$/:SEVERN:/g ;

etc.

Right now I'm generating the regexes in a standalone script, then inserting
the output code into the subroutine that processes names into a "matchable"
form.

What I'd like to be able to do is take a *set* of abbreviation
"dictionaries," concatenate them together and dynamically generate the
regex code in the routine that is going to execute it.

Thanks,

Scott

Scott E. Robinson
SWAT Team
UTC Onsite User Support
RR-690 -- 281-654-5169
EMB-2813N -- 713-656-3629


"David Kirol" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
cc:
Subject: Re: There has to be a way to do this
06/20/03 08:38 PM

Scott,
 Sounds like a fun problem. Can you post some code and an (abbreviated) set
of example data?
David

"Scott E Robinson" <[EMAIL PROTECTED]> wrote in message
news:<[EMAIL PROTECTED]>...
> I'm still working on the well-name matching program that I've brought up
> here before.  I've received invaluable help to solve the toughest questions
> in its development, for which I'm very grateful.
>
> Now I'm trying to automate some steps which were previously manual in the
> process, to make it more end-user-friendly.  There has to be a way to do
> this with Perl.
>
> The script uses a "dictionary" of abbreviations to aid its matching.  The
> abbreviations are implemented as a series of substitutions with the "s"
> operator.  I have a Perl script which builds the substitution statements
> from a tab-delimited list of abbreviations and their equivalent long forms.
> I then manually insert these statements into the subroutine that uses them.
>
> I kept the abbreviation translation hardcoded into the subroutine for
> performance reasons (this thing compares 14,000 unknown well names against
> 680,000 match candidates).  Is there a way in Perl to read the abbreviation
> dictionary (the tab-delimited list), generate the code, insert it into the
> right subroutine, and start executing the program, all in one script?
> (Maybe you can tell me that the performance hit from using variables in the
> substitution statements is negligible, and if so, I'd be happy to go that
> route.)
>
> Thanks in advance,
>
> Scott
>
> Scott E. Robinson
> Data SWAT Team
> UTC Onsite User Support
> RR-690 -- 281-654-5169
> EMB-2813N -- 713-656-3629
>









Re: There has to be a way to do this

2003-06-21 Thread Peter Scott
In article <[EMAIL PROTECTED]>,
 [EMAIL PROTECTED] (Scott E Robinson) writes:
>Is there a way in Perl to read the abbreviation
>dictionary (the tab-delimited list), generate the code, insert it into the
>right subroutine, and start executing the program, all in one script?

perldoc -f eval

Also there is a good discussion on dynamically generating regex
matching code in "Effective Perl Programming" by Joseph Hall 
(Addison-Wesley).  Doubtless there are free on-line equivalents
but references escape me for the moment.
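
As a rough illustration of the eval route, the sketch below builds the
subroutine source as text and compiles it once. It assumes a trusted,
tab-delimited dictionary whose long forms contain no "/" or "$"; the
@dictionary contents and the single \b-based rule are placeholders, not the
poster's actual rule set.

```perl
use strict;
use warnings;

# Assumed dictionary format: abbreviation, a tab, then the long form.
my @dictionary = ("SRED.\tSREDNE", "SEV.\tSEVERN");

# Build the body of the subroutine as text, one substitution per entry.
my $body = '';
for my $entry (@dictionary) {
    my ($abbr, $full) = split /\t/, $entry;
    my $q = quotemeta $abbr;    # "SRED." becomes "SRED\."
    $body .= "    \$name =~ s/\\b${q}(?![A-Za-z])/:${full}:/g;\n";
}

# Wrap the generated statements in an anonymous sub and compile it once.
my $code   = "sub {\n    my (\$name) = \@_;\n${body}    return \$name;\n}";
my $expand = eval $code or die "generated code failed to compile: $@";

print $expand->("BIG SRED. 12"), "\n";   # BIG :SREDNE: 12
```

Because the string is evaluated as Perl, this should only be fed dictionary
files you control.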

-- 
Peter Scott
http://www.perldebugged.com




There has to be a way to do this

2003-06-20 Thread scott . e . robinson
I'm still working on the well-name matching program that I've brought up
here before.  I've received invaluable help to solve the toughest questions
in its development, for which I'm very grateful.

Now I'm trying to automate some steps which were previously manual in the
process, to make it more end-user-friendly.  There has to be a way to do
this with Perl.

The script uses a "dictionary" of abbreviations to aid its matching.  The
abbreviations are implemented as a series of substitutions with the "s"
operator.  I have a Perl script which builds the substitution statements
from a tab-delimited list of abbreviations and their equivalent long forms.
I then manually insert these statements into the subroutine that uses them.

I kept the abbreviation translation hardcoded into the subroutine for
performance reasons (this thing compares 14,000 unknown well names against
680,000 match candidates).  Is there a way in Perl to read the abbreviation
dictionary (the tab-delimited list), generate the code, insert it into the
right subroutine, and start executing the program, all in one script?
(Maybe you can tell me that the performance hit from using variables in the
substitution statements is negligible, and if so, I'd be happy to go that
route.)
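
One way to put a number on that performance question is the core Benchmark
module. The sketch below compares a hardcoded substitution with one that
uses a pattern precompiled via qr//; the sample name, the single rule, and
the iteration count are arbitrary choices for illustration, and the timings
will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $abbr = quotemeta "SRED.";
my $rx   = qr/\b$abbr(?![A-Za-z])/;   # compiled once, reused on every call

# Same substitution, written literally and via the precompiled pattern.
sub hardcoded { my $n = shift; $n =~ s/\bSRED\.(?![A-Za-z])/:SREDNE:/g; return $n }
sub compiled  { my $n = shift; $n =~ s/$rx/:SREDNE:/g;                  return $n }

# Run each version 200,000 times and print a comparison table.
cmpthese(200_000, {
    hardcoded => sub { hardcoded("BIG SRED. 12") },
    compiled  => sub { compiled("BIG SRED. 12") },
});
```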

Thanks in advance,

Scott

Scott E. Robinson
Data SWAT Team
UTC Onsite User Support
RR-690 -- 281-654-5169
EMB-2813N -- 713-656-3629

