Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls
2013/5/29 Matijn Woudt tijn...@gmail.com

> On Wed, May 29, 2013 at 10:51 PM, Sebastian Krebs krebs@gmail.com wrote:
>
>> 2013/5/29 Matijn Woudt tijn...@gmail.com
>>
>>> On Wed, May 29, 2013 at 6:08 PM, Sean Greenslade zootboys...@gmail.com wrote:
>>>
>>>> On Wed, May 29, 2013 at 9:57 AM, Jonesy gm...@jonz.net wrote:
>>>>
>>>>> On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:
>>>>>
>>>>>> I'm adding some minification to our cache.class.php and am running into an edge case that is causing me grief. I want to remove all comments of the // variety, HOWEVER I don't want to remove URLs...
>>>>>
>>>>> KISS.
>>>>>
>>>>> To make it simple, straightforward, and understandable next year when I have to re-read what I've written: I'd change all :// to QqQ -- or any unlikely text string. Then I'd do whatever needs to be done to the // occurrences. Finally, I'd change all QqQ back to ://.
>>>>>
>>>>> Jonesy
>>>>
>>>> Wow. This is just a spectacularly bad suggestion.
>>>>
>>>> First off, this task is probably a bit beyond the capabilities of a regex. Yes, you may be able to come up with something that works 99% of the time, but this is really a job for a parser of some sort. I'm sorry I don't have any suggestions on exactly where to go with that, however I'm sure Google can be of assistance.
>>>>
>>>> The main problem is that regex doesn't understand context. It just blindly finds patterns. A parser understands context, and can figure out which //'s are comments and which are something else. As a bonus, it can probably understand other forms of comments like /* */, which regex would completely die on.
>>>
>>> It is possible to write a whole parser as a single regex, albeit a terribly long and complex one.
>>
>> No, it isn't.
>
> It's better if you throw some smart words on the screen if you want to convince someone. Just thinking about it, it makes sense, as a true regular expression can only describe a regular language, and I don't think any programming language is a regular language.
>
> But we have PHP PCRE with extensions like recursive patterns [1] and back references [2], which can describe much more than just a regular language. And I do believe it would be able to handle it. Too bad it would probably take months to complete a regular expression like this.

Then you should start as soon as possible, so that you don't discover it is wrong only when it is too late. I am not going to start explaining this again, because it becomes a waste of time. You call it smart words on the screen, I call it advice.

> - Matijn
>
> [1] http://php.net/manual/en/regexp.reference.recursive.php
> [2] http://php.net/manual/en/regexp.reference.back-references.php

--
github.com/KingCrunch
Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls
On Wed, May 29, 2013 at 10:20 AM, Matijn Woudt tijn...@gmail.com wrote:

> It is possible to write a whole parser as a single regex, albeit a terribly long and complex one.

While regular expressions are often used in the lexer--the part that scans the input stream and breaks it up into meaningful tokens like

    { keyword: function }
    { operator: + }
    { identifier: $foo }

that form the building blocks of the language--they aren't combined into a single expression. Instead, a lexer generator is used to build a state machine that switches the active expressions to check based on the previous tokens and context. Each expression recognizes a different type of token, and many times these aren't even regular expressions.

The second stage--combining tokens based on the rules of the grammar--is more complex and beyond the abilities of regular expressions. There are plenty of books on the subject and tools [1] to build the pieces, such as Lex, Yacc, Flex, and Bison. Someone even asked this question on Stack Overflow [2] a few years ago. And I'm sure if you look you can find someone who did a master's thesis proving that regular expressions cannot handle a context-free grammar.

And finally I leave you with Jeff Atwood's article about (not) parsing HTML with regex. [3]

Peace,
David

[1] http://dinosaur.compilertools.net/
[2] http://stackoverflow.com/questions/3487089/are-regular-expressions-used-to-build-parsers
[3] http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
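The state-machine idea David describes can be sketched by hand for the narrow task in this thread. The following is only an illustration in JavaScript, not anything posted on the list: the function name stripLineComments is hypothetical, and it assumes the input contains no regex literals and no template-literal ${...} nesting, both of which a real lexer would have to handle.

```javascript
// Minimal scanner sketch: removes // line comments from JavaScript-like
// source while leaving string literals (and URLs inside them) untouched.
// Assumption: no regex literals and no ${...} nesting in template literals.
function stripLineComments(source) {
  let out = '';
  let i = 0;
  let quote = null; // quote character of the string we are inside, if any

  while (i < source.length) {
    const ch = source[i];
    const next = source[i + 1];

    if (quote !== null) {
      // Inside a string: copy verbatim, honouring backslash escapes.
      out += ch;
      if (ch === '\\' && i + 1 < source.length) {
        out += next;
        i += 2;
        continue;
      }
      if (ch === quote) quote = null;
      i += 1;
    } else if (ch === '"' || ch === "'" || ch === '`') {
      // Entering a string literal.
      quote = ch;
      out += ch;
      i += 1;
    } else if (ch === '/' && next === '*') {
      // Block comment: copy through the closing */ unchanged.
      const end = source.indexOf('*/', i + 2);
      const stop = end === -1 ? source.length : end + 2;
      out += source.slice(i, stop);
      i = stop;
    } else if (ch === '/' && next === '/') {
      // Line comment: skip to (but keep) the end-of-line character.
      while (i < source.length && source[i] !== '\n') i += 1;
    } else {
      out += ch;
      i += 1;
    }
  }
  return out;
}

console.log(stripLineComments('var u = "http://example.com"; // trailing comment'));
```

The scanner decides what a // means from the state it is in (code, string, or block comment) rather than from the characters around it, which is the "context" the thread keeps referring to.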
[PHP] Re: need some regex help to strip out // comments but not http:// urls
On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:

> I'm adding some minification to our cache.class.php and am running into an edge case that is causing me grief. I want to remove all comments of the // variety, HOWEVER I don't want to remove URLs...

KISS.

To make it simple, straightforward, and understandable next year when I have to re-read what I've written: I'd change all :// to QqQ -- or any unlikely text string. Then I'd do whatever needs to be done to the // occurrences. Finally, I'd change all QqQ back to ://.

Jonesy

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls
On Wed, May 29, 2013 at 9:57 AM, Jonesy gm...@jonz.net wrote:

> On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:
>
>> I'm adding some minification to our cache.class.php and am running into an edge case that is causing me grief. I want to remove all comments of the // variety, HOWEVER I don't want to remove URLs...
>
> KISS.
>
> To make it simple, straightforward, and understandable next year when I have to re-read what I've written: I'd change all :// to QqQ -- or any unlikely text string. Then I'd do whatever needs to be done to the // occurrences. Finally, I'd change all QqQ back to ://.
>
> Jonesy

Wow. This is just a spectacularly bad suggestion.

First off, this task is probably a bit beyond the capabilities of a regex. Yes, you may be able to come up with something that works 99% of the time, but this is really a job for a parser of some sort. I'm sorry I don't have any suggestions on exactly where to go with that, however I'm sure Google can be of assistance.

The main problem is that regex doesn't understand context. It just blindly finds patterns. A parser understands context, and can figure out which //'s are comments and which are something else. As a bonus, it can probably understand other forms of comments like /* */, which regex would completely die on.

Blindly replacing a string with any unlikely text string is just bad. I don't care how unlikely your text string is, it _will_ eventually show up in a page. It may take 5 years, but it'll happen. And when it does, this little hack will blow up spectacularly.

I'm sorry to rain on your parade, but this is not KISS. This may seem simple, but the submarine bugs it introduces will be a nightmare to track down, and then you'll be in the same boat that you are in right now. Don't do that to yourself. Do it right the first time.

--
--Zootboy

Sent from some sort of computing device.
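The collision Sean warns about is easy to reproduce. A minimal sketch in JavaScript: the helper name stripWithPlaceholder is hypothetical, and the regex in step 2 is only an assumption about what "whatever needs to be done to the // occurrences" might look like.

```javascript
// Hypothetical helper reproducing the placeholder round-trip from the
// QqQ suggestion. Step 2 is an assumed, deliberately simple comment stripper.
function stripWithPlaceholder(text) {
  return text
    .split('://').join('QqQ')     // 1. hide URL separators
    .replace(/\/\/.*$/gm, '')     // 2. strip // comments to end of line
    .split('QqQ').join('://');    // 3. restore URL separators
}

// Works for the case it was designed for: the URL survives, the comment goes.
console.log(stripWithPlaceholder('a = "http://x.y"; // c'));

// Breaks as soon as the input already contains the placeholder:
// the literal QqQ in the string is rewritten to :// in step 3.
console.log(stripWithPlaceholder('var id = "QqQ-42"; // note'));
```

The second call silently corrupts data that never had anything to do with URLs, which is the kind of submarine bug described above.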
Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls
On Wed, May 29, 2013 at 6:08 PM, Sean Greenslade zootboys...@gmail.com wrote:

> On Wed, May 29, 2013 at 9:57 AM, Jonesy gm...@jonz.net wrote:
>
>> On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:
>>
>>> I'm adding some minification to our cache.class.php and am running into an edge case that is causing me grief. I want to remove all comments of the // variety, HOWEVER I don't want to remove URLs...
>>
>> KISS.
>>
>> To make it simple, straightforward, and understandable next year when I have to re-read what I've written: I'd change all :// to QqQ -- or any unlikely text string. Then I'd do whatever needs to be done to the // occurrences. Finally, I'd change all QqQ back to ://.
>>
>> Jonesy
>
> Wow. This is just a spectacularly bad suggestion.
>
> First off, this task is probably a bit beyond the capabilities of a regex. Yes, you may be able to come up with something that works 99% of the time, but this is really a job for a parser of some sort. I'm sorry I don't have any suggestions on exactly where to go with that, however I'm sure Google can be of assistance.
>
> The main problem is that regex doesn't understand context. It just blindly finds patterns. A parser understands context, and can figure out which //'s are comments and which are something else. As a bonus, it can probably understand other forms of comments like /* */, which regex would completely die on.

It is possible to write a whole parser as a single regex, albeit a terribly long and complex one.

That said, there's no other simple approach that would work either. For example, in JavaScript you could do the following:

    var http = 5;
    switch(value) {
        case http:// Http case here! (this would not be deleted)
            // Do something
    }

But most likely you wouldn't care about that.

- Matijn
Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls
> It is possible to write a whole parser as a single regex, albeit a terribly long and complex one.
>
> That said, there's no other simple approach that would work either. For example, in JavaScript you could do the following:
>
>     var http = 5;
>     switch(value) {
>         case http:// Http case here! (this would not be deleted)
>             // Do something
>     }
>
> But most likely you wouldn't care about that.
>
> - Matijn

I would have to disagree. There are things that regex just can't grok at a fundamental level. Things like nested brackets (e.g. the standard blocking syntax of C, JavaScript, PHP, etc.). It's not a parser, and despite all the little lookahead/lookbehind tricks that enhanced regex can do, it can't at a fundamental level _interpret_ the text it sees. This task involves interpreting what the text you're looking for actually means, and should therefore be handled by something that can interpret.

Also, (I haven't tested it, but) I don't think that example you gave would work. Without any sort of quoting around the http://, I would assume the JS interpreter would take that double slash as a comment starter. Do tell me if I'm wrong, though.

--
--Zootboy

Sent from some sort of computing device.
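The nested-bracket point can be made concrete: checking that braces balance needs a counter that can grow without bound, which is exactly what a classical regular expression lacks. A minimal sketch in JavaScript follows. The function name bracesBalanced is hypothetical, and for brevity it ignores braces that appear inside strings or comments.

```javascript
// Depth-counter sketch: returns true when every { has a matching }.
// Assumption: braces inside string literals or comments are not special-cased.
function bracesBalanced(text) {
  let depth = 0;
  for (const ch of text) {
    if (ch === '{') {
      depth += 1;
    } else if (ch === '}') {
      depth -= 1;
      if (depth < 0) return false; // closing brace with nothing open
    }
  }
  return depth === 0;
}

console.log(bracesBalanced('switch (value) { case 5: { run(); } }'));
console.log(bracesBalanced('switch (value) { case 5: { run(); }'));
```

Engines with recursion extensions, such as the PCRE recursive patterns mentioned elsewhere in this thread, effectively add that counter back, which is why they go beyond regular languages.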
Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls
On Wed, May 29, 2013 at 7:27 PM, Sean Greenslade zootboys...@gmail.com wrote:

>> It is possible to write a whole parser as a single regex, albeit a terribly long and complex one.
>>
>> That said, there's no other simple approach that would work either. For example, in JavaScript you could do the following:
>>
>>     var http = 5;
>>     switch(value) {
>>         case http:// Http case here! (this would not be deleted)
>>             // Do something
>>     }
>>
>> But most likely you wouldn't care about that.
>>
>> - Matijn
>
> I would have to disagree. There are things that regex just can't grok at a fundamental level. Things like nested brackets (e.g. the standard blocking syntax of C, JavaScript, PHP, etc.). It's not a parser, and despite all the little lookahead/lookbehind tricks that enhanced regex can do, it can't at a fundamental level _interpret_ the text it sees. This task involves interpreting what the text you're looking for actually means, and should therefore be handled by something that can interpret.

I think it should be possible, but as I said, very very complex. Let's not try it ;)

> Also, (I haven't tested it, but) I don't think that example you gave would work. Without any sort of quoting around the http://, I would assume the JS interpreter would take that double slash as a comment starter. Do tell me if I'm wrong, though.

Which is exactly what I meant. Because http is a variable set to 5, it is a valid case statement, equivalent to:

    switch(value) {
        case 5: // Http case here! (this would not be deleted)
            // Do something
    }

But any regex given above would treat the first version as an http URL and would not strip the // and everything after it, whereas in this modified version it would strip the comments.

- Matijn
Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls
On Wed, May 29, 2013 at 1:33 PM, Matijn Woudt tijn...@gmail.com wrote:

> On Wed, May 29, 2013 at 7:27 PM, Sean Greenslade zootboys...@gmail.com wrote:
>
>>> It is possible to write a whole parser as a single regex, albeit a terribly long and complex one.
>>>
>>> That said, there's no other simple approach that would work either. For example, in JavaScript you could do the following:
>>>
>>>     var http = 5;
>>>     switch(value) {
>>>         case http:// Http case here! (this would not be deleted)
>>>             // Do something
>>>     }
>>>
>>> But most likely you wouldn't care about that.

SNIP

> I think it should be possible, but as I said, very very complex. Let's not try it ;)
>
>> Also, (I haven't tested it, but) I don't think that example you gave would work. Without any sort of quoting around the http://, I would assume the JS interpreter would take that double slash as a comment starter. Do tell me if I'm wrong, though.
>
> Which is exactly what I meant. Because http is a variable set to 5, it is a valid case statement, equivalent to:
>
>     switch(value) {
>         case 5: // Http case here! (this would not be deleted)
>             // Do something
>     }
>
> But any regex given above would treat the first version as an http URL and would not strip the // and everything after it, whereas in this modified version it would strip the comments.
>
> - Matijn

Sorry, I slightly misinterpreted what that code was intending to do. Regardless, it is still something that should be done by an interpreter. So this is another edge case where regexes would more than likely break down, but an interpreter should (I do say should) do The Right Thing.

--
--Zootboy

Sent from some sort of computing device.
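Matijn's edge case can be checked against a typical "skip // when preceded by a colon" pattern. The sketch below is in JavaScript. The pattern and the name stripNaive are assumptions for illustration, since the regexes actually proposed earlier in the thread are not reproduced in these messages.

```javascript
// Naive approach (assumed for illustration): treat // as a comment start
// unless the character immediately before it is a colon.
function stripNaive(text) {
  return text.replace(/(?<!:)\/\/.*$/gm, '');
}

// The URL inside the string is preserved and the trailing comment is removed.
console.log(stripNaive('var u = "http://example.com"; // comment'));

// Matijn's case: "http" is a variable used as a case label, so everything
// after "case http:" is a real comment. The colon rule leaves it in place.
console.log(stripNaive('case http:// Http case here!'));
```

Telling these two inputs apart requires knowing whether the scanner is inside a string literal, which is the kind of context a pattern applied to raw text does not have.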
Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls
2013/5/29 Matijn Woudt tijn...@gmail.com

> On Wed, May 29, 2013 at 6:08 PM, Sean Greenslade zootboys...@gmail.com wrote:
>
>> On Wed, May 29, 2013 at 9:57 AM, Jonesy gm...@jonz.net wrote:
>>
>>> On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:
>>>
>>>> I'm adding some minification to our cache.class.php and am running into an edge case that is causing me grief. I want to remove all comments of the // variety, HOWEVER I don't want to remove URLs...
>>>
>>> KISS.
>>>
>>> To make it simple, straightforward, and understandable next year when I have to re-read what I've written: I'd change all :// to QqQ -- or any unlikely text string. Then I'd do whatever needs to be done to the // occurrences. Finally, I'd change all QqQ back to ://.
>>>
>>> Jonesy
>>
>> Wow. This is just a spectacularly bad suggestion.
>>
>> First off, this task is probably a bit beyond the capabilities of a regex. Yes, you may be able to come up with something that works 99% of the time, but this is really a job for a parser of some sort. I'm sorry I don't have any suggestions on exactly where to go with that, however I'm sure Google can be of assistance.
>>
>> The main problem is that regex doesn't understand context. It just blindly finds patterns. A parser understands context, and can figure out which //'s are comments and which are something else. As a bonus, it can probably understand other forms of comments like /* */, which regex would completely die on.
>
> It is possible to write a whole parser as a single regex, albeit a terribly long and complex one.

No, it isn't.

> That said, there's no other simple approach that would work either. For example, in JavaScript you could do the following:
>
>     var http = 5;
>     switch(value) {
>         case http:// Http case here! (this would not be deleted)
>             // Do something
>     }
>
> But most likely you wouldn't care about that.
>
> - Matijn

--
github.com/KingCrunch
Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls
On Wed, May 29, 2013 at 10:51 PM, Sebastian Krebs krebs@gmail.com wrote:

> 2013/5/29 Matijn Woudt tijn...@gmail.com
>
>> On Wed, May 29, 2013 at 6:08 PM, Sean Greenslade zootboys...@gmail.com wrote:
>>
>>> On Wed, May 29, 2013 at 9:57 AM, Jonesy gm...@jonz.net wrote:
>>>
>>>> On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:
>>>>
>>>>> I'm adding some minification to our cache.class.php and am running into an edge case that is causing me grief. I want to remove all comments of the // variety, HOWEVER I don't want to remove URLs...
>>>>
>>>> KISS.
>>>>
>>>> To make it simple, straightforward, and understandable next year when I have to re-read what I've written: I'd change all :// to QqQ -- or any unlikely text string. Then I'd do whatever needs to be done to the // occurrences. Finally, I'd change all QqQ back to ://.
>>>>
>>>> Jonesy
>>>
>>> Wow. This is just a spectacularly bad suggestion.
>>>
>>> First off, this task is probably a bit beyond the capabilities of a regex. Yes, you may be able to come up with something that works 99% of the time, but this is really a job for a parser of some sort. I'm sorry I don't have any suggestions on exactly where to go with that, however I'm sure Google can be of assistance.
>>>
>>> The main problem is that regex doesn't understand context. It just blindly finds patterns. A parser understands context, and can figure out which //'s are comments and which are something else. As a bonus, it can probably understand other forms of comments like /* */, which regex would completely die on.
>>
>> It is possible to write a whole parser as a single regex, albeit a terribly long and complex one.
>
> No, it isn't.

It's better if you throw some smart words on the screen if you want to convince someone. Just thinking about it, it makes sense, as a true regular expression can only describe a regular language, and I don't think any programming language is a regular language.

But we have PHP PCRE with extensions like recursive patterns [1] and back references [2], which can describe much more than just a regular language. And I do believe it would be able to handle it. Too bad it would probably take months to complete a regular expression like this.

- Matijn

[1] http://php.net/manual/en/regexp.reference.recursive.php
[2] http://php.net/manual/en/regexp.reference.back-references.php
Re: [PHP] Re: need some regex help to strip out // comments but not http:// urls
Matijn Woudt tijn...@gmail.com wrote:

> On Wed, May 29, 2013 at 10:51 PM, Sebastian Krebs krebs@gmail.com wrote:
>
>> 2013/5/29 Matijn Woudt tijn...@gmail.com
>>
>>> On Wed, May 29, 2013 at 6:08 PM, Sean Greenslade zootboys...@gmail.com wrote:
>>>
>>>> On Wed, May 29, 2013 at 9:57 AM, Jonesy gm...@jonz.net wrote:
>>>>
>>>>> On Tue, 28 May 2013 14:17:06 -0700, Daevid Vincent wrote:
>>>>>
>>>>>> I'm adding some minification to our cache.class.php and am running into an edge case that is causing me grief. I want to remove all comments of the // variety, HOWEVER I don't want to remove URLs...
>>>>>
>>>>> KISS.
>>>>>
>>>>> To make it simple, straightforward, and understandable next year when I have to re-read what I've written: I'd change all :// to QqQ -- or any unlikely text string. Then I'd do whatever needs to be done to the // occurrences. Finally, I'd change all QqQ back to ://.
>>>>>
>>>>> Jonesy
>>>>
>>>> Wow. This is just a spectacularly bad suggestion.
>>>>
>>>> First off, this task is probably a bit beyond the capabilities of a regex. Yes, you may be able to come up with something that works 99% of the time, but this is really a job for a parser of some sort. I'm sorry I don't have any suggestions on exactly where to go with that, however I'm sure Google can be of assistance.
>>>>
>>>> The main problem is that regex doesn't understand context. It just blindly finds patterns. A parser understands context, and can figure out which //'s are comments and which are something else. As a bonus, it can probably understand other forms of comments like /* */, which regex would completely die on.
>>>
>>> It is possible to write a whole parser as a single regex, albeit a terribly long and complex one.
>>
>> No, it isn't.
>
> It's better if you throw some smart words on the screen if you want to convince someone. Just thinking about it, it makes sense, as a true regular expression can only describe a regular language, and I don't think any programming language is a regular language.
>
> But we have PHP PCRE with extensions like recursive patterns [1] and back references [2], which can describe much more than just a regular language. And I do believe it would be able to handle it. Too bad it would probably take months to complete a regular expression like this.
>
> - Matijn
>
> [1] http://php.net/manual/en/regexp.reference.recursive.php
> [2] http://php.net/manual/en/regexp.reference.back-references.php

Sometimes when all you know is regex, everything looks like a nail...

Thanks,
Ash