Re: There has to be a way to do this

2003-06-23 Thread scott . e . robinson

I don't have code to do what I want, but here are the pieces I'm trying to
string together:

Abbreviation dictionary consists of a file like this:

SRED. SREDNE
SEV.  SEVERN
etc.

Each abbreviation is turned into four regexes, like this (doubtless they
could be made more efficient, but they work well enough at present):

# Sred. = SREDNE
# Match it at beginning of line
$cgname =~ s/^SRED\.(?=[\W\s\-\d]+)/SREDNE:/g ;
# Match it within the line
$cgname =~ s/[\W\s\-]+SRED\.(?=[\W\s\-\d]+)/:SREDNE:/g ;
# Match it at end of line
$cgname =~ s/[\W\s\-]+SRED\.$/:SREDNE:/g ;
# Match if it begins and ends line
$cgname =~ s/^SRED\.$/:SREDNE:/g ;

# Sev.  = SEVERN
# Match it at beginning of line
$cgname =~ s/^SEV\.(?=[\W\s\-\d]+)/SEVERN:/g ;
# Match it within the line
$cgname =~ s/[\W\s\-]+SEV\.(?=[\W\s\-\d]+)/:SEVERN:/g ;
# Match it at end of line
$cgname =~ s/[\W\s\-]+SEV\.$/:SEVERN:/g ;
# Match if it begins and ends line
$cgname =~ s/^SEV\.$/:SEVERN:/g ;

etc.

Right now I'm generating the regexes in a standalone script, then inserting
the output code into the subroutine that processes names into a matchable
form.

What I'd like to be able to do is take a *set* of abbreviation
dictionaries, concatenate them together and dynamically generate the
regex code in the routine that is going to execute it.
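
The "concatenate a set of dictionaries" step could be sketched roughly as below. This is only an illustration: load_abbreviations is a hypothetical helper name, the lines are assumed to be "abbreviation, whitespace or tab, long form", and letting the first definition win when two dictionaries disagree is an assumption rather than a stated requirement.

```perl
use strict;
use warnings;

# Hypothetical helper: turn dictionary lines from any number of files into
# one ordered list of [abbreviation, long form] pairs.
sub load_abbreviations {
    my @lines = @_;                       # copy, so chomp never touches the caller's data
    my (@pairs, %seen);
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;   # skip blank lines and comments
        my ($abbr, $long) = split ' ', $line, 2;   # split on the first run of whitespace
        next unless defined $long;        # ignore lines with no long form
        $long =~ s/\s+$//;                # trim trailing whitespace
        next if $seen{$abbr}++;           # assumption: the first definition wins
        push @pairs, [ $abbr, $long ];
    }
    return @pairs;
}

# The last entry is a made-up duplicate, included only to show the rule above.
my @pairs = load_abbreviations(
    "SRED. SREDNE\n", "SEV.  SEVERN\n", "\n", "SEV.\tSEVERNAYA\n",
);
print "$_->[0] => $_->[1]\n" for @pairs;
```

With real files, the lines from each dictionary would be read with open and the readline operator and passed in together; the order of the files then decides which duplicate survives.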

Thanks,

Scott

Scott E. Robinson
SWAT Team
UTC Onsite User Support
RR-690 -- 281-654-5169
EMB-2813N -- 713-656-3629


   

David Kirol [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Re: There has to be a way to do this
06/20/03 08:38 PM

Scott,
 Sounds like a fun problem. Can you post some code and an (abbreviated) set
 of example data?
David

Scott E Robinson [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...
 I'm still working on the well-name matching program that I've brought up
 here before.  I've received invaluable help to solve the toughest questions
 in its development, for which I'm very grateful.

 Now I'm trying to automate some steps which were previously manual in the
 process, to make it more end-user-friendly.  There has to be a way to do
 this with Perl.

 The script uses a dictionary of abbreviations to aid its matching.  The
 abbreviations are implemented as a series of substitutions with the s
 operator.  I have a Perl script which builds the substitution statements
 from a tab-delimited list of abbreviations and their equivalent long forms.
 I then manually insert these statements into the subroutine that uses them.

 I kept the abbreviation translation hardcoded into the subroutine for
 performance reasons (this thing compares 14,000 unknown well names against
 680,000 match candidates).  Is there a way in Perl to read the abbreviation
 dictionary (the tab-delimited list), generate the code, insert it into the
 right subroutine, and start executing the program, all in one script?
 (Maybe you can tell me that the performance hit from using variables in the
 substitution statements is negligible, and if so, I'd be happy to go that
 route.)

 Thanks in advance,

 Scott

 Scott E. Robinson
 Data SWAT Team
 UTC Onsite User Support
 RR-690 -- 281-654-5169
 EMB-2813N -- 713-656-3629







-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: There has to be a way to do this

2003-06-23 Thread Jeff 'japhy' Pinyan
On Jun 23, [EMAIL PROTECTED] said:

SRED. SREDNE
SEV.  SEVERN

# Match it at beginning of line
$cgname =~ s/^SRED\.(?=[\W\s\-\d]+)/SREDNE:/g ;

Three things -- the + quantifier on the [...] isn't needed inside a
look-ahead, you don't need to put \s and - in a character class you've
already put \W in (both are non-word characters), and the /g modifier is
totally worthless here... there's only ONE beginning of the line!

  $cgname =~ s/^SRED\.(?=[\W\d])/SREDNE:/;
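
A quick way to check that the simplified look-ahead behaves like the original is to run both patterns over a few inputs. The sample names below are invented for illustration and are not from the real well list.

```perl
use strict;
use warnings;

my $old = qr/^SRED\.(?=[\W\s\-\d]+)/;   # original look-ahead
my $new = qr/^SRED\.(?=[\W\d])/;        # simplified look-ahead

# Made-up sample names; each one is tried against both patterns.
my @samples = ('SRED. UST-1', 'SRED.-2', 'SRED.5', 'SRED.A', 'SRED.');
for my $name (@samples) {
    my $o = $name =~ $old ? 1 : 0;
    my $n = $name =~ $new ? 1 : 0;
    printf "%-12s old=%d new=%d\n", $name, $o, $n;
}
```

The two patterns agree on every sample. Neither matches a bare 'SRED.', because a positive look-ahead needs at least one character after the dot.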

# Match it within the line
$cgname =~ s/[\W\s\-]+SRED\.(?=[\W\s\-\d]+)/:SREDNE:/g ;

I have a feeling you want to use \b instead of [\W\s-].  It's cleaner and
doesn't actually absorb a character.

  $cgname =~ s/\bSRED\.(?=[\W\d])/:SREDNE:/g;

# Match it at end of line
$cgname =~ s/[\W\s\-]+SRED\.$/:SREDNE:/g ;

Again, use \b, but there's no need for /g here.

  $cgname =~ s/\bSRED\.$/:SREDNE:/;

# Match if it begins and ends line
$cgname =~ s/^SRED\.$/:SREDNE:/g ;

Ah, here's an interesting case.  This is actually already handled by my
modifications.  The problem is that you were using

  /[\W\s\-]+SRED\.$/

but if the string is just "SRED.", then [\W\s\-]+ has nothing to match.
So that's why using a word boundary (\b) is smarter.  Also, we can change
the look-aheads to go from positive to negative.

Instead of saying "and I am followed by a non-letter", why not say "and I
am NOT followed by a letter"?

  $cgname =~ s/^SRED\.(?![A-Za-z])/SREDNE:/; # front
  $cgname =~ s/\bSRED\.(?![A-Za-z])/:SREDNE:/g;  # middle
  $cgname =~ s/\bSRED\.$/:SREDNE:/;  # end

If you're worried about hardcoding the letter set (A-Za-z), then you can
use this character class instead:  [^\W\d_].  It means "match anything
that's not a non-word character, a digit, or an underscore".  It's a
sneaky way of matching anything that would be matched by \w WITHOUT
matching \d or _.

  $cgname =~ s/^SRED\.(?![^\W\d_])/SREDNE:/; # front
  $cgname =~ s/\bSRED\.(?![^\W\d_])/:SREDNE:/g;  # middle
  $cgname =~ s/\bSRED\.$/:SREDNE:/;  # end
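
To see what these three substitutions do together, they can be wrapped in a small function and run over a few names. expand_sred is a hypothetical wrapper name and the sample names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the three substitutions above.
sub expand_sred {
    my ($cgname) = @_;
    $cgname =~ s/^SRED\.(?![^\W\d_])/SREDNE:/;     # front
    $cgname =~ s/\bSRED\.(?![^\W\d_])/:SREDNE:/g;  # middle
    $cgname =~ s/\bSRED\.$/:SREDNE:/;              # end
    return $cgname;
}

# Made-up sample names, not taken from the real data set.
for my $name ('SRED.', 'SRED. UST 1', 'UST SRED. 5', 'UST SRED.', 'SRED.A') {
    printf "%-12s -> %s\n", $name, expand_sred($name);
}
# SRED.        -> SREDNE:
# SRED. UST 1  -> SREDNE: UST 1
# UST SRED. 5  -> UST :SREDNE: 5
# UST SRED.    -> UST :SREDNE:
# SRED.A       -> SRED.A
```

One difference from the original four rules: a name that is just "SRED." comes out as "SREDNE:" with no leading colon, where the original fourth rule produced ":SREDNE:". If the leading colon matters to the matching step, the front rule would need adjusting.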

Right now I'm generating the regexes in a standalone script, then inserting
the output code into the subroutine that processes names into a matchable
form.

What I'd like to be able to do is take a *set* of abbreviation
dictionaries, concatenate them together and dynamically generate the
regex code in the routine that is going to execute it.

So you want to take the dictionary files, and use them to create a
function that does all the regexes on its input?
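
If that is the goal, one possible shape is a closure built from precompiled qr// patterns. This is a sketch under assumptions, not a tested drop-in for the real program: make_expander is a hypothetical name, and the rules are the front/middle/end substitutions discussed above.

```perl
use strict;
use warnings;

# Hypothetical builder: takes [abbreviation, long form] pairs and returns a
# code reference that applies the front/middle/end rules for each pair.
sub make_expander {
    my @pairs = @_;
    my @rules = map {
        my ($abbr, $long) = @$_;
        my $q = quotemeta $abbr;          # "SRED." becomes "SRED\."
        # Compile each pattern once, when the expander is built.
        [ qr/^$q(?![^\W\d_])/, qr/\b$q(?![^\W\d_])/, qr/\b$q$/, $long ];
    } @pairs;

    return sub {
        my ($cgname) = @_;
        for my $rule (@rules) {
            my ($front, $middle, $end, $long) = @$rule;
            $cgname =~ s/$front/${long}:/;      # front
            $cgname =~ s/$middle/:${long}:/g;   # middle
            $cgname =~ s/$end/:${long}:/;       # end
        }
        return $cgname;
    };
}

my $expand = make_expander([ 'SRED.', 'SREDNE' ], [ 'SEV.', 'SEVERN' ]);
print $expand->('SEV. SRED. 12'), "\n";   # prints "SEVERN: :SREDNE: 12"
```

Because the patterns are compiled when the expander is built, each call should avoid recompiling them. Whether that is fast enough for 14,000 names against 680,000 candidates would need to be measured, for example with the Benchmark module.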

-- 
Jeff japhy Pinyan  [EMAIL PROTECTED]  http://www.pobox.com/~japhy/
RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
stu what does y/// stand for?  tenderpuss why, yansliterate of course.
[  I'm looking for programming work.  If you like my work, let me know.  ]





Re: There has to be a way to do this

2003-06-21 Thread Peter Scott
In article [EMAIL PROTECTED],
 [EMAIL PROTECTED] (Scott E Robinson) writes:
Is there a way in Perl to read the abbreviation
dictionary (the tab-delimited list), generate the code, insert it into the
right subroutine, and start executing the program, all in one script?

perldoc -f eval

Also there is a good discussion on dynamically generating regex
matching code in Effective Perl Programming by Joseph Hall 
(Addison-Wesley).  Doubtless there are free on-line equivalents
but references escape me for the moment.
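
As a rough illustration of the eval route (a sketch, not code from the book mentioned above): the substitution statements are written out as Perl source text and compiled once with a string eval, which gives the same hardcoded statements as the hand-inserted version. The rule template reuses the front/middle/end patterns from elsewhere in this thread, and the check on the long form is an added assumption to keep dictionary text from being executed as code.

```perl
use strict;
use warnings;

my @pairs = ( [ 'SRED.', 'SREDNE' ], [ 'SEV.', 'SEVERN' ] );

# One block of generated statements per abbreviation.
# %1$s is the escaped abbreviation, %2$s is the long form.
my $template = <<'END_RULE';
    $cgname =~ s/^%1$s(?![^\W\d_])/%2$s:/;
    $cgname =~ s/\b%1$s(?![^\W\d_])/:%2$s:/g;
    $cgname =~ s/\b%1$s$/:%2$s:/;
END_RULE

my $body = '';
for my $pair (@pairs) {
    my ($abbr, $long) = @$pair;
    # Assumption: long forms are plain words; anything else is rejected
    # rather than pasted into code that is about to be compiled.
    die "unexpected characters in long form '$long'"
        unless $long =~ /\A[\w ]+\z/;
    $body .= sprintf $template, quotemeta($abbr), $long;
}

my $source = "sub {\n    my (\$cgname) = \@_;\n"
           . $body
           . "    return \$cgname;\n}\n";

# Compile the generated source once; eval returns the anonymous sub.
my $expand = eval $source
    or die "generated code failed to compile: $@";

print $expand->('UST SEV. 3'), "\n";   # prints "UST :SEVERN: 3"
```

Printing $source before the eval is a handy way to compare the generated statements with the ones previously pasted in by hand.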

-- 
Peter Scott
http://www.perldebugged.com




There has to be a way to do this

2003-06-20 Thread scott . e . robinson
I'm still working on the well-name matching program that I've brought up
here before.  I've received invaluable help to solve the toughest questions
in its development, for which I'm very grateful.

Now I'm trying to automate some steps which were previously manual in the
process, to make it more end-user-friendly.  There has to be a way to do
this with Perl.

The script uses a dictionary of abbreviations to aid its matching.  The
abbreviations are implemented as a series of substitutions with the s
operator.  I have a Perl script which builds the substitution statements
from a tab-delimited list of abbreviations and their equivalent long forms.
I then manually insert these statements into the subroutine that uses them.

I kept the abbreviation translation hardcoded into the subroutine for
performance reasons (this thing compares 14,000 unknown well names against
680,000 match candidates).  Is there a way in Perl to read the abbreviation
dictionary (the tab-delimited list), generate the code, insert it into the
right subroutine, and start executing the program, all in one script?
(Maybe you can tell me that the performance hit from using variables in the
substitution statements is negligible, and if so, I'd be happy to go that
route.)

Thanks in advance,

Scott

Scott E. Robinson
Data SWAT Team
UTC Onsite User Support
RR-690 -- 281-654-5169
EMB-2813N -- 713-656-3629

