Re: How to reinvent grep in perl?

Jonathan Lang Wed, 03 Oct 2007 14:00:53 -0700

siegfried wrote:
> I need to search large amounts of source code and grep is not doing the job.
> The problem is that I keep matching stuff in the comments of the
> C++/Java/Perl/Groovy/Javascript source code.
>
> Can someone give me some hints on where I might start on rewriting grep in
> perl so that it ignores the contents of /* and */ comments?


Instead of rewriting grep, consider writing a comment filter.  Have it
read from standard input and write to standard output; pipe the file
that you want to grep into it, and pipe its output into grep.

As for the filter itself: there's probably an elegant way to use regexes
to trim out the comments; the 'perldoc -q comments' suggestion made by
Chas shows one possibility.  However, that approach generally involves
slurping the entire file into the perl script, applying the regex to
the whole thing, and then spitting the result out again.  From what I
understand, this isn't very good form for large inputs: memory use
grows with the size of the file, and nothing is written until the
whole file has been read.
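
For reference, here is a minimal sketch of that slurp-and-substitute
approach.  It is not the perlfaq code itself, just a simplified
variant; the name strip_comments is made up for illustration, and it
assumes that only /* ... */ comments matter and that quotes use
backslash escapes.  Quoted strings are matched first so that comment
markers inside them are left alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove /* ... */ comments from a whole file's text.
# Assumption: quotes are "..." or '...' with backslash escapes.
sub strip_comments {
    my ($text) = @_;
    $text =~ s{
        (                            # capture group: a quoted string, kept
            " (?: \\. | [^"\\] )* "
          | ' (?: \\. | [^'\\] )* '
        )
      | /\* .*? \*/                  # a block comment, dropped
    }{ defined $1 ? $1 : '' }gsex;
    return $text;
}

unless (caller) {
    local $/;                        # slurp mode: read all input at once
    my $source = <STDIN>;
    print strip_comments($source) if defined $source;
}
```

Note that this version deletes the newlines inside multi-line comments
too, so line numbers reported by grep -n will not match the original
file.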

A messier approach that has the benefit of being less memory-intensive
and of producing output with less of a delay would be to write a
contextual switchboard: read the input stream a little at a time;
decide if a given block of characters is code, quote, or comment; and
send it on to the output stream if it is code or quote.  (The reason
for distinguishing between code and quote is that '/*' and '*/' don't
denote comments when they appear within quotes.)

The key to this approach is the context.  Start the program in 'code'
context, and start reading the input stream a line at a time.  For a
given line, search for the first instance of characters that would
denote the beginning of a comment or quote.  If you don't find
anything, send the whole line to the output stream; if you find
something, send everything before it to the output stream, switch to
quote or comment context as appropriate, and examine the rest of the
line under the new context.

Under quote context, do exactly the same as in code context, except
that what you're looking for is the earliest character combination
that will end the quote.  In particular, you are _not_ looking for the
start of a comment.  When you find the end of the quote, you switch
back to code context for the rest of the line.

Comment context works almost exactly like quote context, except that
you're looking exclusively for the end of the comment, and you throw
away everything prior to it instead of sending it to the output
stream.  For the purpose of maintaining line counts, send a newline
character to the output stream every time you hit the end of a line
while in comment context.  (Alternatively, you might consider writing
the script to dump the comment contents to stderr, which could be
useful if you ever want to grep the contents of the comments instead
of the code.  If you do this, be sure to send a newline to stderr
whenever you end a line in code or quote context.)
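
Putting the three contexts together, a rough sketch of the switchboard
might look like the following.  Treat it as one possible reading of
the description above rather than a finished tool: make_filter is a
made-up name, only /* ... */ comments are recognized, quotes are
assumed to be '...' or "..." with backslash escapes, and the stderr
variant is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that remembers its context
# between calls.  Feed it one input line; it returns the text to emit.
sub make_filter {
    my $context = 'code';   # one of 'code', 'quote', 'comment'
    my $quote   = '';       # the character that opened the current quote

    return sub {
        my ($line) = @_;
        my $out = '';
        pos($line) = 0;

        while (pos($line) < length $line) {
            if ($context eq 'code') {
                # Emit everything up to the next comment or quote opener.
                if ($line =~ /\G(.*?)(\/\*|["'])/gcs) {
                    $out .= $1;
                    if ($2 eq '/*') { $context = 'comment' }
                    else { $context = 'quote'; $quote = $2; $out .= $2 }
                }
                else { $out .= substr($line, pos($line)); last }
            }
            elsif ($context eq 'quote') {
                # Emit up to the closing quote, stepping over escapes.
                if ($line =~ /\G((?:\\.|[^\\])*?)\Q$quote\E/gcs) {
                    $out .= $1 . $quote;
                    $context = 'code';
                }
                else { $out .= substr($line, pos($line)); last }
            }
            else {
                # Comment context: discard text up to the closing marker.
                if ($line =~ /\G.*?\*\//gcs) { $context = 'code' }
                else {
                    # Keep the newline so line numbers match the input.
                    $out .= "\n" if $line =~ /\n\z/;
                    last;
                }
            }
        }
        return $out;
    };
}

unless (caller) {
    my $filter = make_filter();
    print $filter->($_) while <STDIN>;
}
```

Assuming the script is saved as stripcomments.pl (a hypothetical
name), it would be used as
    perl stripcomments.pl < Foo.java | grep -n pattern
and, because every input line yields exactly one output line, the line
numbers that grep reports still match the original file.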

-- 
Jonathan "Dataweaver" Lang
