siegfried wrote:
> I need to search large amounts of source code and grep is not doing the job.
> The problem is that I keep matching stuff in the comments of the
> C++/Java/Perl/Groovy/Javascript source code.
>
> Can someone give me some hints on where I might start on rewriting grep in
> perl so that it ignores the contents of /* and */ comments?

Instead of rewriting grep, consider writing a comment filter. Have it read from standard input and write to standard output; pipe the file that you want to grep into it, and pipe its output into grep.

As for the filter itself: there's probably an elegant way to use regexes to trim out the comments; the 'perldoc -q comments' suggestion made by Chas shows one possibility. However, that approach generally involves slurping the entire file into the perl script, applying the regex to the whole thing, and then spitting the result out again. From what I understand, this generally isn't very good form, since memory use grows with the size of the file and nothing is output until the whole file has been read.

A messier approach that has the benefit of being less memory-intensive and of producing output with less of a delay would be to write a contextual switchboard: read the input stream a little at a time; decide if a given block of characters is code, quote, or comment; and send it on to the output stream if it is code or quote. (The reason for distinguishing between code and quote is that '/*' and '*/' don't denote comments when they appear within quotes.)

The key to this approach is the context. Start the program in 'code' context, and start reading the input stream a line at a time. For a given line, search for the first instance of characters that would denote the beginning of a comment or quote. If you don't find anything, send the whole line to the output stream; if you find something, send everything before it to the output stream, switch to quote or comment context as appropriate, and examine the rest of the line under the new context.

Under quote context, do exactly the same as in code context, except that what you're looking for is the earliest character combination that will end the quote. In particular, you are _not_ looking for the start of a comment. When you find the end of the quote, you switch back to code context for the rest of the line.

Comment context works almost exactly like quote context, except that you're looking exclusively for the end of the comment, and you throw away everything prior to it instead of sending it to the output stream. For the purpose of maintaining line counts, send a newline character to the output stream every time you hit the end of a line while in comment context.

(Alternatively, you might consider writing the script to dump the comment contents to stderr, which could be useful if you ever want to grep the contents of the comments instead of the code. If you do this, be sure to send a newline to stderr whenever you end a line in code or quote context, so that the line counts there stay in step as well.)

-- 
Jonathan "Dataweaver" Lang
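
For what it's worth, here is a rough sketch of the switchboard described above. It is written in Python only to keep it short; the same structure should carry over to Perl more or less line for line. The names (strip_comments, OPENER, main) are made up for illustration, and the sketch rests on a few assumptions of mine that the description above doesn't settle: quotes are '...' or "...", a backslash escapes the next character inside a quote, and a quote never runs past the end of its line. It handles only /* ... */ comments, since that is what was asked about; '//' and '#' comments are left alone.

```python
import re
import sys

# Anything that can end 'code' context: a comment opener or a quote character.
OPENER = re.compile(r"/\*|[\"']")


def strip_comments(lines):
    """Yield each input line with the contents of /* ... */ comments removed.

    One output line is produced per input line, so line counts are preserved.
    """
    context = "code"    # one of "code", "quote", "comment"
    quote_char = ""     # the character that opened the current quote
    for line in lines:
        body = line.rstrip("\n")
        out = []
        pos = 0
        while pos < len(body):
            if context == "code":
                m = OPENER.search(body, pos)
                if m is None:
                    # Nothing special on the rest of the line: pass it through.
                    out.append(body[pos:])
                    pos = len(body)
                elif m.group() == "/*":
                    # Keep the code before the comment, drop the opener itself.
                    out.append(body[pos:m.start()])
                    context = "comment"
                    pos = m.end()
                else:
                    # Keep everything up to and including the opening quote.
                    out.append(body[pos:m.end()])
                    context, quote_char = "quote", m.group()
                    pos = m.end()
            elif context == "quote":
                # Look for the closing quote, skipping backslash escapes.
                end = pos
                while end < len(body) and body[end] != quote_char:
                    end += 2 if body[end] == "\\" else 1
                if end < len(body):
                    out.append(body[pos:end + 1])
                    context = "code"
                    pos = end + 1
                else:
                    out.append(body[pos:])
                    pos = len(body)
            else:
                # Comment context: discard text up to and including '*/'.
                end = body.find("*/", pos)
                if end == -1:
                    pos = len(body)
                else:
                    context = "code"
                    pos = end + 2
        if context == "quote":
            # Assumption: an unterminated quote ends with its line.
            context = "code"
        yield "".join(out) + "\n"


def main():
    # Filter mode: read standard input, write standard output.
    sys.stdout.writelines(strip_comments(sys.stdin))
```

If you save this under some name of your choosing, say strip_comments.py (hypothetical), and add a call to main() at the bottom, it can sit in a pipeline in front of grep: feed the source file to its standard input and pipe its output into grep as usual. Because a newline is still emitted for lines that fall entirely inside a comment, the line numbers reported by 'grep -n' should match the original file.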