Re: OT: grep and regex patterns
On Thu, Jul 14, 2016 at 04:02:02PM +1000, c...@zip.com.au wrote:
> On 13Jul2016 22:03, Mike Wright wrote:
> > OK, thanks everybody.
> >
> > Had to use egrep.  This works:
> >
> > PATTERN='https?://[^/]*\.in(/.*)*'
> > egrep $PATTERN file.of.links > links.in
>
> You need quotes around $PATTERN when you use it, thus:
>
>   egrep "$PATTERN" file.of.links > links.in
> ...
> > Covers cases with https and where nothing follows the .in
>
> Your:
>
>   (/.*)*
>
> is better written:
>
>   (/.*)?

I'll mention one of my pet peeves (kinda like RLS's UUOC* award).

In a simple pattern-matching grep RE, an element that is allowed to
match zero times, i.e. one followed by '*' (or, in egrep, by '?'), is
useless at the start or end of the pattern and may reduce grep's
efficiency.

  grep '.*abc'
  grep 'abc.*'
  grep '.*abc.*'
  grep 'abc'

will all match the same set of lines.

Note, I'm not referring to some situations where anchors are used
(ex. 'abc[0-9]*$'), or sed substitutions, or even grep with the '-o'
option to output the matched portion.

Jon

* Randal L. Schwartz's Useless Use Of Cat
--
Jon H. LaBadie                 jo...@jgcomp.com
--
users mailing list
users@lists.fedoraproject.org
To unsubscribe or change subscription options:
https://lists.fedoraproject.org/admin/lists/users@lists.fedoraproject.org
Fedora Code of Conduct: http://fedoraproject.org/code-of-conduct
Guidelines: http://fedoraproject.org/wiki/Mailing_list_guidelines
Have a question? Ask away: http://ask.fedoraproject.org
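A small sketch of the point about leading and trailing '.*', assuming a made-up sample file (the file name and its contents are hypothetical, not from the thread). It counts the lines each of the four patterns selects, then uses -o to show the one case where the padding changes the output.

```shell
#!/bin/sh
# Hypothetical sample data, made up for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'xabcx' 'abc' 'no match here' '123abc' > "$tmp/sample.txt"

# Count the matching lines for each pattern and collapse identical counts.
counts=$(for re in '.*abc' 'abc.*' '.*abc.*' 'abc'; do
    grep -c -- "$re" "$tmp/sample.txt"
done | sort -u)
printf 'distinct line counts: %s\n' "$counts"

# With -o only the matched portion is printed, so here the .* does matter.
plain=$(grep -o 'abc' "$tmp/sample.txt" | head -n 1)
padded=$(grep -o '.*abc.*' "$tmp/sample.txt" | head -n 1)
printf 'plain=%s padded=%s\n' "$plain" "$padded"

rm -rf "$tmp"
```

All four patterns select the same three lines; only the -o output differs.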
[SOLVED] Re: OT: grep and regex patterns
On 07/13/2016 11:02 PM, c...@zip.com.au wrote:
> On 13Jul2016 22:03, Mike Wright wrote:
> > OK, thanks everybody.
> >
> > Had to use egrep.  This works:
> >
> > PATTERN='https?://[^/]*\.in(/.*)*'
> > egrep $PATTERN file.of.links > links.in
>
> You need quotes around $PATTERN when you use it, thus:
>
>   egrep "$PATTERN" file.of.links > links.in

Arrrgh!  I'd sloppily lost the double quotes during a cut and paste.
They've been restored.

> You may be getting away with it here, but another pattern may well be
> broken up by the shell on whitespace.  Not to mention globbing
> (unquoted asterisks and question marks, etc).
>
> > Covers cases with https and where nothing follows the .in
>
> Your:
>
>   (/.*)*
>
> is better written:
>
>   (/.*)?
>
> i.e. it is there or it is not.  As it happens the "*" form you used
> will be matched as efficiently in this case, but there are plenty of
> patterns where using "*" instead of something more constrained can
> lead to exponential cost as the regexp engine tries many many more
> combinations as it attempts to match.  Always write these things as
> pickily/conservatively as possible.

Makes sense.  Exponential earnings = good, costs = bad ;)

> The other nit is that you should use $lowercase variable names in the
> shell instead of $UPPERCASE names for script local variables which you
> do not intend to export.  This is a good practice thing, but quite
> important for reasons I can explain at length if requested.

Duly noted.  I got bit by that yesterday when I stepped on a system
level variable.
Re: OT: grep and regex patterns
On 13Jul2016 22:03, Mike Wright wrote:
> OK, thanks everybody.
>
> Had to use egrep.  This works:
>
> PATTERN='https?://[^/]*\.in(/.*)*'
> egrep $PATTERN file.of.links > links.in

You need quotes around $PATTERN when you use it, thus:

  egrep "$PATTERN" file.of.links > links.in

You may be getting away with it here, but another pattern may well be
broken up by the shell on whitespace.  Not to mention globbing
(unquoted asterisks and question marks, etc).

> Covers cases with https and where nothing follows the .in

Your:

  (/.*)*

is better written:

  (/.*)?

i.e. it is there or it is not.  As it happens the "*" form you used
will be matched as efficiently in this case, but there are plenty of
patterns where using "*" instead of something more constrained can
lead to exponential cost as the regexp engine tries many many more
combinations as it attempts to match.  Always write these things as
pickily/conservatively as possible.

The other nit is that you should use $lowercase variable names in the
shell instead of $UPPERCASE names for script local variables which you
do not intend to export.  This is a good practice thing, but quite
important for reasons I can explain at length if requested.

Cheers,
Cameron Simpson
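To make the word-splitting risk concrete, here is a minimal sketch. The count_args helper and the pattern containing a space are both hypothetical, chosen only so the difference is visible; the variable names are lowercase, following the advice above.

```shell
#!/bin/sh
# count_args is a hypothetical helper: it prints how many arguments it got.
count_args() { echo "$#"; }

# Hypothetical pattern with a space in it (not from the thread).
pattern='in (dia|donesia)'

unquoted=$(count_args $pattern)    # the shell splits this on whitespace
quoted=$(count_args "$pattern")    # the pattern arrives as a single word
printf 'unquoted=%s quoted=%s\n' "$unquoted" "$quoted"
```

With the unquoted form, grep would take the first word as the pattern and treat the second as a file name.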
Re: OT: grep and regex patterns
On 07/13/2016 08:46 PM, Chris Adams wrote:
> Once upon a time, Jon LaBadie said:
> > On Thu, Jul 14, 2016 at 02:09:27AM +, Christopher wrote:
> > > On Wed, Jul 13, 2016, 21:11 Chris Adams wrote:
> > > > Once upon a time, Mike Wright said:
> > > >
> > > > Putting all that together, I'd recommend:
> > > >
> > > > PATTERN='https?://[^/]*\.in/'
> > > > grep "$PATTERN" file.of.links > links.in
> >
> > Minor nit, egrep is needed for the '?' or grep -E.
>
> Oops, yep.  I'm usually writing perl (and occasionally using grep -P)
> so I forgot.

OK, thanks everybody.

Had to use egrep.  This works:

PATTERN='https?://[^/]*\.in(/.*)*'
egrep $PATTERN file.of.links > links.in

Covers cases with https and where nothing follows the .in
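One caveat worth checking with this pattern: because everything after \.in is optional and nothing anchors the end of the hostname, a host that merely contains ".in" can still match. The sketch below assumes one URL per line and uses made-up sample URLs. The strict_pattern variant is a suggestion of mine, not something proposed in the thread.

```shell
#!/bin/sh
# Hypothetical sample links, one URL per line (an assumption).
tmp=$(mktemp -d)
printf '%s\n' \
    'http://example.in' \
    'https://example.in/page' \
    'http://www.infoworld.example/' \
    'http://example.com/index.html' > "$tmp/file.of.links"

# The pattern from the message above; grep -E is the same as egrep.
pattern='https?://[^/]*\.in(/.*)*'
loose_count=$(grep -E -c -- "$pattern" "$tmp/file.of.links")

# Assumed tightening: after ".in" require an optional port, then "/" or
# the end of the line, so "www.infoworld..." no longer qualifies.
strict_pattern='https?://[^/]*\.in(:[0-9]+)?(/|$)'
strict_count=$(grep -E -c -- "$strict_pattern" "$tmp/file.of.links")

printf 'loose=%s strict=%s\n' "$loose_count" "$strict_count"
rm -rf "$tmp"
```

The loose pattern also picks up the www.infoworld.example line, while the stricter one keeps only the two .in hosts.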
Re: OT: grep and regex patterns
Once upon a time, Jon LaBadie said:
> On Thu, Jul 14, 2016 at 02:09:27AM +, Christopher wrote:
> > On Wed, Jul 13, 2016, 21:11 Chris Adams wrote:
> > > Once upon a time, Mike Wright said:
> > >
> > > Putting all that together, I'd recommend:
> > >
> > > PATTERN='https?://[^/]*\.in/'
> > > grep "$PATTERN" file.of.links > links.in
>
> Minor nit, egrep is needed for the '?' or grep -E.

Oops, yep.  I'm usually writing perl (and occasionally using grep -P)
so I forgot.
--
Chris Adams
Re: OT: grep and regex patterns
On Thu, Jul 14, 2016 at 02:09:27AM +, Christopher wrote:
> On Wed, Jul 13, 2016, 21:11 Chris Adams wrote:
>
> > Once upon a time, Mike Wright said:
> >
> > Putting all that together, I'd recommend:
> >
> > PATTERN='https?://[^/]*\.in/'
> > grep "$PATTERN" file.of.links > links.in

Minor nit, egrep is needed for the '?' or grep -E.

> > or just:
> >
> > grep 'https?://[^/]*\.in/' file.of.links > links.in
> >
> > Only potential oddity would be if you have URLs with non-standard ports
> > specified (like "https://foo.in:8080/"); to match that, you could use
> > egrep instead (extended regex):
> >
> > egrep 'https://[^/]*\.in(:[0-9]+)?/' file.of.links > links.in
>
> One extra change I'd make, to make it more obvious you are checking for a
> literal dot and not intending to escape, use [.] instead of \.
>
> So,
>
> egrep 'https://[^/]*[.]in(:[0-9]+)?/' file.of.links > links.in

--
Jon H. LaBadie                 jo...@jgcomp.com
Re: OT: grep and regex patterns
On Wed, Jul 13, 2016, 21:11 Chris Adams wrote:

> Once upon a time, Mike Wright said:
>
> Putting all that together, I'd recommend:
>
> PATTERN='https?://[^/]*\.in/'
> grep "$PATTERN" file.of.links > links.in
>
> or just:
>
> grep 'https?://[^/]*\.in/' file.of.links > links.in
>
> Only potential oddity would be if you have URLs with non-standard ports
> specified (like "https://foo.in:8080/"); to match that, you could use
> egrep instead (extended regex):
>
> egrep 'https://[^/]*\.in(:[0-9]+)?/' file.of.links > links.in

One extra change I'd make, to make it more obvious you are checking for
a literal dot and not intending to escape, use [.] instead of \.

So,

egrep 'https://[^/]*[.]in(:[0-9]+)?/' file.of.links > links.in
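A quick check of the two spellings of the literal dot, plus the optional port, against made-up sample URLs (the third line is deliberately not a .in host). grep -E is used here in place of egrep, since recent GNU grep releases print an obsolescence warning for egrep.

```shell
#!/bin/sh
# Hypothetical sample links; "fooXin" is there to catch an unescaped dot.
tmp=$(mktemp -d)
printf '%s\n' \
    'https://foo.in:8080/' \
    'https://foo.in/' \
    'https://fooXin/' > "$tmp/file.of.links"

# Both spellings of the literal dot select the same lines.
with_backslash=$(grep -E -c 'https://[^/]*\.in(:[0-9]+)?/' "$tmp/file.of.links")
with_bracket=$(grep -E -c 'https://[^/]*[.]in(:[0-9]+)?/' "$tmp/file.of.links")

# For contrast: a bare dot matches any character, including the "X".
unescaped=$(grep -E -c 'https://[^/]*.in(:[0-9]+)?/' "$tmp/file.of.links")

printf 'backslash=%s bracket=%s unescaped=%s\n' \
    "$with_backslash" "$with_bracket" "$unescaped"
rm -rf "$tmp"
```

The bracket form has the side benefit that it contains no backslash, so it reads the same inside single quotes and double quotes.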
Re: OT: grep and regex patterns
Once upon a time, Mike Wright said:
> PATTERN="^.*http:\/\/.*\.in.*$"
> grep $PATTERN < file.of.links >links.in

Several issues I see:

- it appears you are using a shell variable to pass the pattern; inside
  double quotes the shell keeps a \ unless it precedes $, `, ", \ or a
  newline, so the backslashes do reach grep here and it sees
  "^.*http:\/\/.*\.in.*$", but single quotes are the safer habit for a
  regex because the shell then leaves every character alone

- you'll match any URL with ".in" in it anywhere after the "http://"
  (a host starting "www.in...", say, or a path containing ".in"), not
  just at the end of the hostname portion

- "^.*" and ".*$" are essentially useless, because they match anything
  at the start and end of the line respectively (which, since by
  default a pattern isn't anchored to the start/end, is not needed)

- you don't need to escape the / (so // is fine instead of the "leaning
  toothpicks" of "\/\/")

- if you are going to use a variable to set the pattern, you need to
  use double quotes around it when it is used

- may not be a problem for your case, but you won't match HTTPS URLs

- minor nit: grep can read from a file named on its command line, so
  the shell redirection is superfluous

Putting all that together, I'd recommend:

PATTERN='https?://[^/]*\.in/'
grep "$PATTERN" file.of.links > links.in

or just:

grep 'https?://[^/]*\.in/' file.of.links > links.in

Only potential oddity would be if you have URLs with non-standard ports
specified (like "https://foo.in:8080/"); to match that, you could use
egrep instead (extended regex):

egrep 'https://[^/]*\.in(:[0-9]+)?/' file.of.links > links.in
--
Chris Adams
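To see what grep actually receives from the double-quoted assignment, and why ".*\.in" catches more than intended, here is a small sketch using a made-up URL. It assumes a POSIX shell, where a backslash inside double quotes is kept unless it precedes $, `, ", \ or a newline.

```shell
#!/bin/sh
# The same pattern text, once in double quotes and once in single quotes.
double_quoted="^.*http:\/\/.*\.in.*$"
single_quoted='^.*http:\/\/.*\.in.*$'
printf 'double: %s\nsingle: %s\n' "$double_quoted" "$single_quoted"

# Hypothetical URL whose hostname merely starts with "in" after a dot.
url='http://www.infoworld.example/'

# ".*\.in" is satisfied by the ".in" inside "www.infoworld".
loose_hits=$(printf '%s\n' "$url" | grep -c 'http://.*\.in')

# Requiring "/" straight after ".in" confines the match to the host's end.
host_hits=$(printf '%s\n' "$url" | grep -E -c 'https?://[^/]*\.in/')

printf 'loose=%s host=%s\n' "$loose_hits" "$host_hits"
```

Both assignments hold identical text, so the over-matching comes from the pattern itself, not from the quoting.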
OT: grep and regex patterns
Hi all,

There is a file with a collection of links from global sites.  While
trying to sort the links into categories I'm catching the wrong things.

e.g. find links from India:

PATTERN="^.*http:\/\/.*\.in.*$"
grep $PATTERN < file.of.links >links.in

Even though the "." in front of "in" is escaped with "\", PATTERN is
catching any text containing "in".  This is obviously not what is
intended.

Can anyone out there point me to the error of my ways?

TIA,
Mike Wright