RE: pulling out "a","an", "the" from beginning of strings

2004-08-25 Thread Bob Showalter
John W. Krahn wrote:
> Bob Showalter wrote:
> > Jose Alves de Castro wrote:
> > 
> > > On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> > > 
> > > > I need to pull out articles "a", "an", and "the" from the
> > > > beginning of title strings so that they sort properly in MySQL.
> > > > What is the best way to accomplish that if I have a single
> > > > $scalar with the whole title in it?
> > > 
> > > I would go with substitutions:
> > > 
> > > $scalar =~ s/^(?:a|an|the)//i;
> > 
> > Two problems:
> > 
> > 1. This doesn't remove just the whole words; it removes parts of
> > words as well. i.e. "Analyzing Widgets" would become "alyzing
> > Widgets" 
> 
> Actually it would become "nalyzing Widgets" because 'a' is the first
> alternative.  :-)

Smarty pants :~) 

My brain said "longest, leftmost", but the short-circuiting behavior is
clearly documented in perldoc perlre. If the alternation is written as
(an?|the), the "an" is matched.
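
A small sketch of the point above, for anyone who wants to try it. The variable names are made up for illustration; only the two patterns come from this thread.

```perl
use strict;
use warnings;

# Perl tries the alternatives left to right and keeps the first one that
# lets the whole pattern succeed; it does not look for the longest one.
my $title = 'Analyzing Widgets';

# 'a' is the first alternative, so only the "A" is removed.
(my $first_wins = $title) =~ s/^(?:a|an|the)//i;

# With an?, the greedy optional "n" extends the match to "An".
(my $an_matched = $title) =~ s/^(?:an?|the)//i;

print "$first_wins\n";   # nalyzing Widgets
print "$an_matched\n";   # alyzing Widgets
```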

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 




RE: pulling out "a","an", "the" from beginning of strings

2004-08-25 Thread Bob Showalter
Errin Larsen wrote:
> Hey,
> 
> Ok, looking through this ... I'm confused.
> 
> << SNIP >>
> 
> > > 
> > > Perhaps:
> > > 
> > >    $scalar =~ s/^(a|an|the)\s*\b//i;
> > > 
> > > would work better.
> 
> << SNIP >>
> 
> Is this capturing into $1 the a|an|the (yes)

Yes, but that's only a side effect. I'm not doing anything with $1.

> and the rest of the title
> into $2 (no?).

No.

>  After doing so, will it reverse the two ( i.e.
> s/^(a|an|the)\s+(.*)\b/$2, $1/i )?  

No.

> Also, what is the "\b"?

A word boundary assertion. See perldoc perlre.

>  it seems
> that the trailing "i" is for ignoring case; is that correct?

Yes.

It's not concerned with capturing anything; it's just matching a pattern and
then replacing the text matched with an empty string. The parens are used to
delimit the alternation a|an|the.

What I'm trying to match is:

   ^   beginning of the string, followed by
   (a|an|the)  one of these sequences, followed by
   \s* any amount of whitespace, followed by
   \b  a word boundary (see perldoc perlre)

The \s* is there so the whitespace following the leading word ("a", "an", or
"the") will be removed along with the word. The \b ensures that the end of
what we match is either at the start of a new word or at the end of the
string.

If I left off the \b, it would match the "a" in "acme", since \s* can match
the zero-length string between the "a" and the "c". With \b in there, the
match fails, because \b will not match at the "c", since it's not a word
boundary.
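
To see the effect of the \b, here is a minimal sketch. The helper names strip_no_boundary and strip_with_boundary are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: same pattern but without the \b assertion.
sub strip_no_boundary {
    my ($t) = @_;
    $t =~ s/^(a|an|the)\s*//i;
    return $t;
}

# Hypothetical helper: the pattern as posted, with \b.
sub strip_with_boundary {
    my ($t) = @_;
    $t =~ s/^(a|an|the)\s*\b//i;
    return $t;
}

for my $title ('acme', 'A Hard Day') {
    printf "%-12s without \\b: '%s'   with \\b: '%s'\n",
        "'$title'", strip_no_boundary($title), strip_with_boundary($title);
}
```

Without \b, "acme" loses its first letter; with \b it is left alone, while "A Hard Day" becomes "Hard Day" either way.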

An alternative to \s*\b would be \s+ (i.e. match at least one whitespace
char). However, this won't match a single word title like "the", because \s+
doesn't match at the end of the string, while \s*\b does. (How such a title
should be handled is up to the OP; if it should be left alone, then \s+
would be appropriate.)
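
The difference between the two endings shows up only for a title that consists of nothing but the article. A rough sketch, with hypothetical helper names strip_plus and strip_star_b:

```perl
use strict;
use warnings;

# Hypothetical helper: requires at least one whitespace char after the article.
sub strip_plus {
    my ($t) = @_;
    $t =~ s/^(a|an|the)\s+//i;
    return $t;
}

# Hypothetical helper: optional whitespace followed by a word boundary.
sub strip_star_b {
    my ($t) = @_;
    $t =~ s/^(a|an|the)\s*\b//i;
    return $t;
}

for my $title ('the', 'The Widget Primer') {
    print "'$title': \\s+ gives '", strip_plus($title),
          "', \\s*\\b gives '", strip_star_b($title), "'\n";
}
```

With \s+ the one-word title "the" is left alone; with \s*\b it becomes the empty string. Both turn "The Widget Primer" into "Widget Primer".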

HTH

 




Re: pulling out "a","an", "the" from beginning of strings

2004-08-24 Thread Chris Devers
On Tue, 24 Aug 2004, Errin Larsen wrote:
> > Perhaps:
> >
> >    $scalar =~ s/^(a|an|the)\s*\b//i;
> >
> > would work better.
>
> << SNIP >>
>
> Is this capturing into $1 the a|an|the (yes) and the rest of the title
> into $2 (no?).

There is only one pair of parentheses, so only $1 is captured.

I still think it's prudent to capture everything else as $2, and then 
substitute such that the article captured in $1 is now the suffix:

$scalar =~ s/^(a|an|the)\s*\b(.*)/$2, $1/i;
Which turns "A Hard Day's Night" into "Hard Day's Night, A". If you ever 
need the original string back -- the person who started this thread was 
trying to get this info into a MySQL database in such a way that it 
would sort by the first significant word and not the article that 
precedes it -- then you can reverse the change with something like:

   $scalar =~ s/(.*), \b(a|an|the)$/$2 $1/i;
and "Hard Day's Night, A" should once again be "A Hard Day's Night".
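
A quick round-trip sketch of the two substitutions above. The sub names article_to_end and article_to_front are hypothetical wrappers added for illustration; the patterns are the ones from this message.

```perl
use strict;
use warnings;

# Move a leading article to the end: "A Foo" -> "Foo, A".
sub article_to_end {
    my ($t) = @_;
    $t =~ s/^(a|an|the)\s*\b(.*)/$2, $1/i;
    return $t;
}

# Undo it: "Foo, A" -> "A Foo".
sub article_to_front {
    my ($t) = @_;
    $t =~ s/(.*), \b(a|an|the)$/$2 $1/i;
    return $t;
}

my $stored   = article_to_end("A Hard Day's Night");
my $restored = article_to_front($stored);

print "$stored\n";     # Hard Day's Night, A
print "$restored\n";   # A Hard Day's Night
```

Note that the reverse step cannot tell a moved article from a title that really ends in ", A" or ", The", so it is only safe on strings that went through the first substitution.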

> After doing so, will it reverse the two ( i.e.
> s/^(a|an|the)\s+(.*)\b/$2, $1/i )?

Your version should; the one above won't.

> Also, what is the "\b"?

Word boundary.

> it seems that the trailing "i" is for ignoring case; is that correct?

Yes.

> Just need some help with RE!!

`perldoc perlre`

Or the wonderful -- really! -- book, _Mastering Regular Expressions_.

--
Chris Devers  [EMAIL PROTECTED]
http://devers.homeip.net:8080/blog/
np: 'Colt 45'
 by
 from 'Television Theme Songs'
 



Re: pulling out "a","an", "the" from beginning of strings

2004-08-24 Thread John W. Krahn
Bob Showalter wrote:
> Jose Alves de Castro wrote:
> > On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> > > I need to pull out articles "a", "an", and "the" from the beginning
> > > of title strings so that they sort properly in MySQL.  What is the
> > > best way to accomplish that if I have a single $scalar with the
> > > whole title in it?
> >
> > I would go with substitutions:
> >
> > $scalar =~ s/^(?:a|an|the)//i;
>
> Two problems:
>
> 1. This doesn't remove just the whole words; it removes parts of words as
> well. i.e. "Analyzing Widgets" would become "alyzing Widgets"

Actually it would become "nalyzing Widgets" because 'a' is the first
alternative.  :-)

John
--
use Perl;
program
fulfillment
 



Re: pulling out "a","an", "the" from beginning of strings

2004-08-24 Thread Errin Larsen
Hey,

Ok, looking through this ... I'm confused.  

<< SNIP >>

> >
> > Perhaps:
> >
> >    $scalar =~ s/^(a|an|the)\s*\b//i;
> >
> > would work better.

<< SNIP >>

Is this capturing into $1 the a|an|the (yes) and the rest of the title
into $2 (no?).  After doing so, will it reverse the two ( i.e.
s/^(a|an|the)\s+(.*)\b/$2, $1/i )?  Also, what is the "\b"?  it seems
that the trailing "i" is for ignoring case; is that correct?

Just need some help with RE!!

thanks,

--Errin

 




RE: pulling out "a","an", "the" from beginning of strings

2004-08-24 Thread Jose Alves de Castro
On Tue, 2004-08-24 at 16:19, Bob Showalter wrote:
> Jose Alves de Castro wrote:
> > On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> > > I need to pull out articles "a", "an", and "the" from the beginning
> > > of title strings so that they sort properly in MySQL.  What is the
> > > best way to accomplish that if I have a single $scalar with the
> > > whole title in it? 
> > 
> > I would go with substitutions:
> > 
> > $scalar =~ s/^(?:a|an|the)//i;
> 
> Two problems:
> 
> 1. This doesn't remove just the whole words; it removes parts of words as
> well. i.e. "Analyzing Widgets" would become "alyzing Widgets"
> 
> 2. It doesn't remove whitespace after the word, so "The Widget Primer"
> becomes " Widget Primer", which won't sort with the w's, due to the leading
> blank.
> 
> Perhaps:
> 
>    $scalar =~ s/^(a|an|the)\s*\b//i;
> 
> would work better.

You're absolutely right. I think this is a sign that I need to go out,
eat and drink something, breathe some fresh air, etc.

-- 
José Alves de Castro <[EMAIL PROTECTED]>
  http://natura.di.uminho.pt/~jac


signature.asc
Description: This is a digitally signed message part


RE: pulling out "a","an", "the" from beginning of strings

2004-08-24 Thread Bob Showalter
Jose Alves de Castro wrote:
> On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> > I need to pull out articles "a", "an", and "the" from the beginning
> > of title strings so that they sort properly in MySQL.  What is the
> > best way to accomplish that if I have a single $scalar with the
> > whole title in it? 
> 
> I would go with substitutions:
> 
> $scalar =~ s/^(?:a|an|the)//i;

Two problems:

1. This doesn't remove just the whole words; it removes parts of words as
well. i.e. "Analyzing Widgets" would become "alyzing Widgets"

2. It doesn't remove whitespace after the word, so "The Widget Primer"
becomes " Widget Primer", which won't sort with the w's, due to the leading
blank.

Perhaps:

   $scalar =~ s/^(a|an|the)\s*\b//i;

would work better.
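
As a rough illustration of how this helps with the original sorting goal, here is a Perl-side sketch. The sample titles and the sort_key helper are made up for this example, and sorting in Perl stands in for whatever MySQL would do with the stripped titles.

```perl
use strict;
use warnings;

my @titles = (
    'The Widget Primer',
    'Analyzing Widgets',
    'A Tale of Two Cities',
    'An Acme Catalog',
);

# Hypothetical helper: strip a leading article and lowercase the rest,
# so the result can be used as a comparison key.
sub sort_key {
    my ($t) = @_;
    $t =~ s/^(a|an|the)\s*\b//i;
    return lc $t;
}

my @sorted = sort { sort_key($a) cmp sort_key($b) } @titles;
print "$_\n" for @sorted;
```

This prints "An Acme Catalog", "Analyzing Widgets", "A Tale of Two Cities", "The Widget Primer", in that order; "Analyzing Widgets" keeps its leading "An" because of the \b.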

 




Re: pulling out "a","an", "the" from beginning of strings

2004-08-24 Thread Jose Alves de Castro
On Tue, 2004-08-24 at 15:39, Chris Devers wrote:
> On Tue, 24 Aug 2004, Jose Alves de Castro wrote:
> 
> > On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> >> I need to pull out articles "a", "an", and "the" from the beginning of
> >> title strings so that they sort properly in MySQL.  What is the best way
> >> to accomplish that if I have a single $scalar with the whole title in it?
> >
> > I would go with substitutions:
> >
> > $scalar =~ s/^(?:a|an|the)//i;
> 
> Why not save the data for later by moving the article to the end?
> 
>  $scalar =~ s/^(a|an|the)\s+(.*)/$2, $1/i;
> 
> That way, "A Tale of Two Cities" should become "Tale of Two Cities, A", 
> and if you have to reconstitute the original title later, you haven't 
> thrown anything away...

I second this :-)

> -- 
> Chris Devers
-- 
José Alves de Castro <[EMAIL PROTECTED]>
  http://natura.di.uminho.pt/~jac




Re: pulling out "a","an", "the" from beginning of strings

2004-08-24 Thread Chris Devers
On Tue, 24 Aug 2004, Jose Alves de Castro wrote:
> On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> > I need to pull out articles "a", "an", and "the" from the beginning of
> > title strings so that they sort properly in MySQL.  What is the best way
> > to accomplish that if I have a single $scalar with the whole title in it?
>
> I would go with substitutions:
>
> $scalar =~ s/^(?:a|an|the)//i;

Why not save the data for later by moving the article to the end?

    $scalar =~ s/^(a|an|the)\s+(.*)/$2, $1/i;

That way, "A Tale of Two Cities" should become "Tale of Two Cities, A",
and if you have to reconstitute the original title later, you haven't
thrown anything away...


--
Chris Devers
 



Re: pulling out "a","an", "the" from beginning of strings

2004-08-24 Thread Tim McGeary
Jose Alves de Castro wrote:
> On Tue, 2004-08-24 at 15:16, Tim McGeary wrote:
> > Jose Alves de Castro wrote:
> > > On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> > > > I need to pull out articles "a", "an", and "the" from the beginning of
> > > > title strings so that they sort properly in MySQL.  What is the best way
> > > > to accomplish that if I have a single $scalar with the whole title in it?
> > >
> > > I would go with substitutions:
> > >
> > > $scalar =~ s/^(?:a|an|the)//i;
> >
> > So that I am understanding this process, what does each part mean?  I
> > assume that the ^ means beginning of the variable... is that correct?
> > What about "(?:" ?
>
> The ^ means the beginning of the string in $scalar, indeed.
>
> As for the rest, I decided to group "a", "an" and "the" with brackets,
> or otherwise the regex would have been /^a|^an|^the/
>
> Regarding the ?: , that's just so variable $1 doesn't end up with
> whatever was removed, as there was no need for that.
>
> Search for "Non-capturing groupings" under perldoc perlretut, if you
> need more information

Great!  Thank you very much!  :)
Tim
 



Re: pulling out "a","an", "the" from beginning of strings

2004-08-24 Thread Jose Alves de Castro
On Tue, 2004-08-24 at 15:16, Tim McGeary wrote:
> Jose Alves de Castro wrote:
> > On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> > 
> >>I need to pull out articles "a", "an", and "the" from the beginning of 
> >>title strings so that they sort properly in MySQL.  What is the best way 
> >>to accomplish that if I have a single $scalar with the whole title in it?
> > 
> > 
> > I would go with substitutions:
> > 
> > $scalar =~ s/^(?:a|an|the)//i;
> 
> So that I am understanding this process, what does each part mean?  I 
> assume that the ^ means beginning of the variable... is that correct? 
> What about "(?:" ?

The ^ means the beginning of the string in $scalar, indeed.

As for the rest, I decided to group "a", "an" and "the" with brackets,
or otherwise the regex would have been /^a|^an|^the/

Regarding the ?: , that's just so variable $1 doesn't end up with
whatever was removed, as there was no need for that.

Search for "Non-capturing groupings" under perldoc perlretut, if you
need more information
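
A short sketch of what the grouping and the ?: change in practice. The variable names are invented for the example, and the second group (.*) is added here only so there is something to capture.

```perl
use strict;
use warnings;

my $title = 'The Widget Primer';

# Non-capturing group: the article is matched but not stored,
# so the first (and only) capture is the rest of the title.
my @non_capturing = $title =~ /^(?:a|an|the)\s+(.*)/i;

# Capturing group: the article lands in $1 and the rest in $2.
my @capturing = $title =~ /^(a|an|the)\s+(.*)/i;

print scalar(@non_capturing), " capture: @non_capturing\n";
print scalar(@capturing), " captures: ", join(' | ', @capturing), "\n";
```

In list context the match returns the captures, so the first array holds just "Widget Primer" and the second holds "The" and "Widget Primer". This is also why a replacement that uses both $1 and $2 needs the article group to be a capturing one.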

> tyia,
> Tim

HTH, :-)

jac

-- 
José Alves de Castro <[EMAIL PROTECTED]>
  http://natura.di.uminho.pt/~jac




Re: pulling out "a","an", "the" from beginning of strings

2004-08-24 Thread Tim McGeary
Jose Alves de Castro wrote:
> On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> > I need to pull out articles "a", "an", and "the" from the beginning of
> > title strings so that they sort properly in MySQL.  What is the best way
> > to accomplish that if I have a single $scalar with the whole title in it?
>
> I would go with substitutions:
>
> $scalar =~ s/^(?:a|an|the)//i;

So that I am understanding this process, what does each part mean?  I
assume that the ^ means beginning of the variable... is that correct?
What about "(?:" ?

tyia,
Tim
 



Re: pulling out "a","an", "the" from beginning of strings

2004-08-24 Thread Jose Alves de Castro
On Tue, 2004-08-24 at 15:04, Tim McGeary wrote:
> I need to pull out articles "a", "an", and "the" from the beginning of 
> title strings so that they sort properly in MySQL.  What is the best way 
> to accomplish that if I have a single $scalar with the whole title in it?

I would go with substitutions:

$scalar =~ s/^(?:a|an|the)//i;

> Thanks,
> Tim
> 
> -- 
> Tim McGeary
> [EMAIL PROTECTED]
-- 
José Alves de Castro <[EMAIL PROTECTED]>
  http://natura.di.uminho.pt/~jac




pulling out "a","an", "the" from beginning of strings

2004-08-24 Thread Tim McGeary
I need to pull out articles "a", "an", and "the" from the beginning of 
title strings so that they sort properly in MySQL.  What is the best way 
to accomplish that if I have a single $scalar with the whole title in it?

Thanks,
Tim
--
Tim McGeary
[EMAIL PROTECTED]
