Re: OT: grep and regex patterns

2016-07-15 Thread Jon LaBadie
On Thu, Jul 14, 2016 at 04:02:02PM +1000, c...@zip.com.au wrote:
> On 13Jul2016 22:03, Mike Wright  wrote:
> > OK, thanks everybody.
> > 
> > Had to use egrep. This works:
> > 
> > PATTERN='https?://[^/]*\.in(/.*)*'
> > egrep $PATTERN file.of.links > links.in
> 
> You need quotes around $PATTERN when you use it, thus:
> 
>  egrep "$PATTERN" file.of.links > links.in
> 
...
> > Covers cases with https and where nothing follows the .in
> 
> Your:
> 
>  (/.*)*
> 
> is better written:
> 
>  (/.*)?
> 
I'll mention one of my pet peeves (kinda like RLS's UUOC* award).

In a simple pattern matching grep RE, a subexpression that a repetition
operator, '*' (and in egrep '?'), allows to match nothing is useless at
the start or end of the pattern and may reduce grep's efficiency.

  grep '.*abc'
  grep   'abc.*'
  grep '.*abc.*'
  grep   'abc'

Will all match the same set of lines.

Note, I'm not referring to some situations where anchors are used
(ex. 'abc[0-9]*$'), or sed substitutions, or even grep with the
'-o' option to output the matched portion.
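
A quick way to convince yourself, as a minimal sketch (the sample lines
and variable names are made up for illustration):

```shell
# Hypothetical sample data; any text file would do.
sample=$(mktemp)
printf '%s\n' 'xxabcxx' 'abc' 'no match here' 'ab c' 'prefix abc' > "$sample"

# The four forms, each quoted so the shell leaves the * alone.
r1=$(grep '.*abc'   "$sample")
r2=$(grep 'abc.*'   "$sample")
r3=$(grep '.*abc.*' "$sample")
r4=$(grep 'abc'     "$sample")

# With -o only the matched text is printed, so here the padding matters.
o1=$(grep -o '.*abc.*' "$sample")
o2=$(grep -o 'abc'     "$sample")

rm -f "$sample"

[ "$r1" = "$r4" ] && [ "$r2" = "$r4" ] && [ "$r3" = "$r4" ] &&
    echo 'all four forms select the same lines'
```

The -o comparison is there only to show the exception already noted:
once the matched portion itself is output, the leading and trailing .*
do change the result.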

Jon

 * Randal L. Schwartz's Useless Use Of Cat
-- 
Jon H. LaBadie  jo...@jgcomp.com
--
users mailing list
users@lists.fedoraproject.org
To unsubscribe or change subscription options:
https://lists.fedoraproject.org/admin/lists/users@lists.fedoraproject.org
Fedora Code of Conduct: http://fedoraproject.org/code-of-conduct
Guidelines: http://fedoraproject.org/wiki/Mailing_list_guidelines
Have a question? Ask away: http://ask.fedoraproject.org


[SOLVED] Re: OT: grep and regex patterns

2016-07-14 Thread Mike Wright

On 07/13/2016 11:02 PM, c...@zip.com.au wrote:

On 13Jul2016 22:03, Mike Wright  wrote:

OK, thanks everybody.

Had to use egrep. This works:

PATTERN='https?://[^/]*\.in(/.*)*'
egrep $PATTERN file.of.links > links.in


You need quotes around $PATTERN when you use it, thus:

  egrep "$PATTERN" file.of.links > links.in


Arrrgh!  I'd sloppily lost the double quotes during a cut and paste. 
They've been restored.



You may be getting away with it here, but another pattern may well be
broken up by the shell on whitespace. Not to mention globbing (unquoted
asterisks and question marks, etc).


Covers cases with https and where nothing follows the .in


Your:

  (/.*)*

is better written:

  (/.*)?

i.e. it is there or it is not. As it happens the "*" form you used will
be matched as efficiently in this case, but there are plenty of patterns
where using "*" instead of something more constrained can lead to
exponential cost as the regexp engine tries many many more combinations
as it attempts to match. Always write these things as
pickily/conservatively as possible.


Makes sense.  Exponential earnings = good, costs = bad ;)


The other nit is that you should use $lowercase variable names in the
shell instead of $UPPERCASE names for script local variables which you
do not intend to export. This is a good practice thing, but quite
important for reasons I can explain at length if requested.


Duly noted.  I got bit by that yesterday when I stepped on a system 
level variable.



Re: OT: grep and regex patterns

2016-07-14 Thread cs

On 13Jul2016 22:03, Mike Wright  wrote:

OK, thanks everybody.

Had to use egrep. This works:

PATTERN='https?://[^/]*\.in(/.*)*'
egrep $PATTERN file.of.links > links.in


You need quotes around $PATTERN when you use it, thus:

 egrep "$PATTERN" file.of.links > links.in

You may be getting away with it here, but another pattern may well be broken up
by the shell on whitespace. Not to mention globbing (unquoted asterisks and
question marks, etc).
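
To make the splitting visible, here is a minimal sketch. The pattern
value and the count_args helper are hypothetical, chosen only to show
how many arguments a command would receive:

```shell
# Hypothetical pattern containing spaces and a glob character.
pattern='foo bar *'

# Helper that reports how many arguments it was given.
count_args() { echo "$#"; }

# Quoted: the whole value arrives as a single argument.
quoted_count=$(count_args "$pattern")

# Unquoted: the shell splits on whitespace and then tries to expand the *
# against the current directory, so the count is 3 or more.
unquoted_count=$(count_args $pattern)

echo "quoted: $quoted_count, unquoted: $unquoted_count"
```

With grep, the extra words would be taken as file names rather than as
part of the pattern.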



Covers cases with https and where nothing follows the .in


Your:

 (/.*)*

is better written:

 (/.*)?

i.e. it is there or it is not. As it happens the "*" form you used will be 
matched as efficiently in this case, but there are plenty of patterns where 
using "*" instead of something more constrained can lead to exponential cost as 
the regexp engine tries many many more combinations as it attempts to match.  
Always write these things as pickily/conservatively as possible.
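
For this particular pattern the two forms select the same lines, which
can be checked with a minimal sketch (the sample URLs are invented for
illustration, and grep -E is used as the equivalent of egrep):

```shell
# Hypothetical sample links.
sample=$(mktemp)
printf '%s\n' \
    'http://example.in' \
    'https://example.in/path/page' \
    'http://example.com/index.in' > "$sample"

# Zero-or-more repetitions of the path group.
star=$(grep -E 'https?://[^/]*\.in(/.*)*' "$sample")

# Zero-or-one: the path is there or it is not.
opt=$(grep -E 'https?://[^/]*\.in(/.*)?' "$sample")

rm -f "$sample"

[ "$star" = "$opt" ] && echo 'both forms select the same lines'
```

This only shows equivalence on the sample data; it says nothing about
matching cost.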


The other nit is that you should use $lowercase variable names in the shell 
instead of $UPPERCASE names for script local variables which you do not intend 
to export. This is a good practice thing, but quite important for reasons I can 
explain at length if requested.
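
One commonly cited reason, sketched below, is that a name which is
already exported stays exported when a script assigns to it, so child
processes see the new value. This is an illustration only, not
necessarily the full list of reasons meant above; SYSTEM_SETTING is a
made-up stand-in for something like PATH, HOME or LANG:

```shell
# Stand-in for a variable the system already exports.
export SYSTEM_SETTING='original value'

# Script-local use of the same uppercase name: because the name is
# already exported, every child process now sees the new value.
SYSTEM_SETTING='my scratch data'
child_sees=$(sh -c 'printf "%s" "$SYSTEM_SETTING"')

# A lowercase name is not exported unless you ask for it, so child
# processes never see it.
system_setting='my scratch data'
child_sees_lower=$(sh -c 'printf "%s" "${system_setting-not set}"')

echo "uppercase: $child_sees / lowercase: $child_sees_lower"
```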


Cheers,
Cameron Simpson 
--
users mailing list
users@lists.fedoraproject.org
To unsubscribe or change subscription options:
https://lists.fedoraproject.org/admin/lists/users@lists.fedoraproject.org
Fedora Code of Conduct: http://fedoraproject.org/code-of-conduct
Guidelines: http://fedoraproject.org/wiki/Mailing_list_guidelines
Have a question? Ask away: http://ask.fedoraproject.org


Re: OT: grep and regex patterns

2016-07-13 Thread Mike Wright

On 07/13/2016 08:46 PM, Chris Adams wrote:

Once upon a time, Jon LaBadie  said:

On Thu, Jul 14, 2016 at 02:09:27AM +, Christopher wrote:

On Wed, Jul 13, 2016, 21:11 Chris Adams  wrote:


Once upon a time, Mike Wright  said:

Putting all that together, I'd recommend:

   PATTERN='https?://[^/]*\.in/'
   grep "$PATTERN" file.of.links > links.in


Minor nit, egrep is needed for the '?' or grep -E.


Oops, yep.  I'm usually writing perl (and occasionally using grep -P) so
I forgot.


OK, thanks everybody.

Had to use egrep. This works:

PATTERN='https?://[^/]*\.in(/.*)*'
egrep $PATTERN file.of.links > links.in

Covers cases with https and where nothing follows the .in
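
One thing worth checking with this pattern: because nothing is required
after the .in, it also accepts hosts whose last label merely begins with
"in", such as .info. A minimal sketch with made-up sample URLs; the
stricter pattern is only one possible tightening and assumes one URL per
line with nothing after it:

```shell
# Hypothetical sample links, one per line.
sample=$(mktemp)
printf '%s\n' \
    'http://example.in' \
    'https://example.in/page' \
    'http://example.info/' > "$sample"

# The pattern as posted: the trailing group is optional, so .info slips in.
pattern='https?://[^/]*\.in(/.*)*'
loose=$(grep -Ec "$pattern" "$sample")

# Assumed tightening: .in must be followed by a slash or the end of line.
strict_pattern='https?://[^/]*\.in(/|$)'
strict=$(grep -Ec "$strict_pattern" "$sample")

rm -f "$sample"
echo "loose: $loose lines, strict: $strict lines"
```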


Re: OT: grep and regex patterns

2016-07-13 Thread Chris Adams
Once upon a time, Jon LaBadie  said:
> On Thu, Jul 14, 2016 at 02:09:27AM +, Christopher wrote:
> > On Wed, Jul 13, 2016, 21:11 Chris Adams  wrote:
> > 
> > > Once upon a time, Mike Wright  said:
> > >
> > > Putting all that together, I'd recommend:
> > >
> > >   PATTERN='https?://[^/]*\.in/'
> > >   grep "$PATTERN" file.of.links > links.in
> 
> Minor nit, egrep is needed for the '?' or grep -E.

Oops, yep.  I'm usually writing perl (and occasionally using grep -P) so
I forgot.
-- 
Chris Adams 


Re: OT: grep and regex patterns

2016-07-13 Thread Jon LaBadie
On Thu, Jul 14, 2016 at 02:09:27AM +, Christopher wrote:
> On Wed, Jul 13, 2016, 21:11 Chris Adams  wrote:
> 
> > Once upon a time, Mike Wright  said:
> >
> > Putting all that together, I'd recommend:
> >
> >   PATTERN='https?://[^/]*\.in/'
> >   grep "$PATTERN" file.of.links > links.in

Minor nit, egrep is needed for the '?' or grep -E.
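
A minimal sketch of the difference, using two made-up URLs. In a basic
regex the '?' is an ordinary character, so plain grep looks for a
literal question mark:

```shell
# Hypothetical sample links.
sample=$(mktemp)
printf '%s\n' 'http://a.in/' 'https://b.in/' > "$sample"

# Basic regex: '?' is literal, so neither line matches.
# grep exits non-zero when nothing matches, hence the "|| true".
bre=$(grep -c 'https?://[^/]*\.in/' "$sample" || true)

# Extended regex: '?' makes the preceding 's' optional.
ere=$(grep -Ec 'https?://[^/]*\.in/' "$sample")

rm -f "$sample"
echo "grep: $bre matching lines, grep -E: $ere matching lines"
```
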

> >
> > or just:
> >
> >   grep 'https?://[^/]*\.in/' file.of.links > links.in
> >
> > Only potential oddity would be if you have URLs with non-standard ports
> > specified (like "https://foo.in:8080/"); to match that, you could use
> > egrep instead (extended regex):
> >
> >   egrep 'https://[^/]*\.in(:[0-9]+)?/' file.of.links > links.in
> 
> 
> One extra change I'd make, to make it more obvious you are checking for a
> literal dot and not intending to escape, use [.] instead of \.
> 
> So,
> 
> egrep 'https://[^/]*[.]in(:[0-9]+)?/' file.of.links > links.in


-- 
Jon H. LaBadie  jo...@jgcomp.com


Re: OT: grep and regex patterns

2016-07-13 Thread Christopher
On Wed, Jul 13, 2016, 21:11 Chris Adams  wrote:

> Once upon a time, Mike Wright  said:
>
> Putting all that together, I'd recommend:
>
>   PATTERN='https?://[^/]*\.in/'
>   grep "$PATTERN" file.of.links > links.in
>
> or just:
>
>   grep 'https?://[^/]*\.in/' file.of.links > links.in
>
> Only potential oddity would be if you have URLs with non-standard ports
> specified (like "https://foo.in:8080/"); to match that, you could use
> egrep instead (extended regex):
>
>   egrep 'https://[^/]*\.in(:[0-9]+)?/' file.of.links > links.in


One extra change I'd make, to make it more obvious you are checking for a
literal dot and not intending to escape, use [.] instead of \.

So,

egrep 'https://[^/]*[.]in(:[0-9]+)?/' file.of.links > links.in
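
Both spellings select the same lines; a minimal sketch with made-up
sample URLs, plus the unescaped dot for contrast:

```shell
# Hypothetical sample links.
sample=$(mktemp)
printf '%s\n' \
    'https://foo.in/' \
    'https://foo.in:8080/' \
    'https://fooXin/' \
    'https://begin/' > "$sample"

# Backslash-escaped dot and bracketed dot: both mean a literal '.'.
backslash=$(grep -E 'https://[^/]*\.in(:[0-9]+)?/' "$sample")
bracket=$(grep -E 'https://[^/]*[.]in(:[0-9]+)?/' "$sample")

# Unescaped dot matches any character, so every sample line is counted.
unescaped=$(grep -Ec 'https://[^/]*.in(:[0-9]+)?/' "$sample")

rm -f "$sample"
[ "$backslash" = "$bracket" ] && echo 'same lines either way'
```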


Re: OT: grep and regex patterns

2016-07-13 Thread Chris Adams
Once upon a time, Mike Wright  said:
> PATTERN="^.*http:\/\/.*\.in.*$"
> grep $PATTERN < file.of.links >links.in

Several issues I see:

- it appears you are using a shell variable to pass the pattern; inside
  double quotes the shell only treats \ specially before $, `, ", \ and
  a newline, so the \/ and \. sequences are passed through untouched -
  grep sees "^.*http:\/\/.*\.in.*$", though single quotes would make
  that intent plainer

- you'll match any URL with ".in" in it anywhere, not just in the
  hostname portion

- "^.*" and ".*$" are essentially useless, because they match anything
  at the start and end of the line respectively (which, since by default
  a pattern isn't anchored to the start/end, is not needed)

- you don't need to escape the / (so // is fine instead of the "leaning
  toothpicks" of "\/\/")

- if you are going to use a variable to set the pattern, you need to use
  double quotes around it when it is used

- may not be a problem for your case, but you won't match HTTPS URLs

- minor nit: grep can read the file named on its command line, so the
  "<" input redirection is superfluous
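
The hostname point can be seen with a minimal sketch (the sample URLs
are made up for illustration):

```shell
# Hypothetical sample links.
sample=$(mktemp)
printf '%s\n' \
    'http://example.in/' \
    'http://example.com/index.in.html' \
    'http://example.info/' > "$sample"

# ".*" may run past the hostname, so ".in" is found anywhere in the URL.
loose=$(grep -c 'http://.*\.in' "$sample")

# "[^/]*" cannot cross a slash, and the trailing "/" ends the hostname.
tight=$(grep 'http://[^/]*\.in/' "$sample")

rm -f "$sample"
echo "loose pattern: $loose lines; tight pattern: $tight"
```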

Putting all that together, I'd recommend:

  PATTERN='https?://[^/]*\.in/'
  grep "$PATTERN" file.of.links > links.in

or just:

  grep 'https?://[^/]*\.in/' file.of.links > links.in

Only potential oddity would be if you have URLs with non-standard ports
specified (like "https://foo.in:8080/"); to match that, you could use
egrep instead (extended regex):

  egrep 'https://[^/]*\.in(:[0-9]+)?/' file.of.links > links.in

-- 
Chris Adams 


OT: grep and regex patterns

2016-07-13 Thread Mike Wright

Hi all,

There is a file with a collection of links from global sites.  While 
trying to sort the links into categories I'm catching the wrong things.


e.g. find links from India:

PATTERN="^.*http:\/\/.*\.in.*$"
grep $PATTERN < file.of.links >links.in

Even though the "." in front of "in" is escaped with "\" PATTERN is 
catching any text containing "in".  This is obviously not what is intended.
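
For anyone reproducing this, a minimal sketch for checking what the
variable actually holds after the double-quoted assignment. It assumes a
POSIX-style shell such as bash or dash; the lowercase variable names are
illustrative:

```shell
# The assignment as posted, with double quotes.
pattern="^.*http:\/\/.*\.in.*$"

# The same text in single quotes, where nothing at all is interpreted.
literal='^.*http:\/\/.*\.in.*$'

# Print the value exactly as a command would receive it when quoted.
held=$(printf '%s' "$pattern")
printf '%s\n' "$held"

[ "$held" = "$literal" ] && echo 'backslashes survive the double quotes'
```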


Any one out there point me to the error of my ways?

TIA,
Mike Wright