Justin Pryzby <pry...@telsasoft.com> wrote:

> Resending to -hackers as I realized this isn't a documentation issue so not
> appropriate or apparently interesting to readers of -doc.
> 
> Inspired by David's patch [0], find attached fixing words duplicated, across
> line boundaries.
> 
> I should probably just call the algorithm proprietary, but if you really
> wanted to know, I've suffered again through sed's backslashes.
> 
> time find . -name '*.c' -o -name '*.h' |xargs sed -srn '/\/\*/!d; :l; 
> /\*\//!{N; b l}; s/\n[[:space:]]*\*/\n/g; 
> /(\<[[:alpha:]]{1,})\>\n[[:space:]]*\<\1\>/!d; s//>>&<</; p'
> 
> Alternately:
> time for f in `find . -name '*.c' -o -name '*.h'`; do x=`<"$f" sed -rn 
> '/\/\*/!d; :l; /\*\//!{N; b l}; s/\n[[:space:]]*\*/\n/g; 
> /(\<[[:alpha:]]{1,})\>\n[[:space:]]*\<\1\>/!d; s//>>&<</; p'`; [ -n "$x" ] && 
> echo "$f:" && echo "$x"; done |less

Alternatively you could have used awk, since it can maintain variables across
lines. Here is a script I used to find such duplicates in a single file
(just for fun; I know your findings have already been processed):

# awk has no NULL; an uninitialized variable compares equal to "".
BEGIN { prev_line_last_token = "" }
{
    if (NF > 1 && $1 == "*" && length(prev_line_last_token) > 0)
    {
        if ($2 == prev_line_last_token &&
            # Characters used in ASCII charts are not duplicate words.
            $2 != "|" && $2 != "}")
            # Found a duplicate.
            printf("%s:%s, duplicate token: %s\n", FILENAME, FNR, $2);
    }

    if (NF > 1 && ($1 == "*" || $1 == "/*"))
        prev_line_last_token = $NF;
    else
    {
        # Empty line or not a comment line. Start a new search.
        prev_line_last_token = "";
    }
}
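The script above can be run over a source tree with `awk -f`. A minimal
sketch of doing so, using hypothetical file names (dupwords.awk, sample.c)
and a toy input with "the" duplicated across a comment-line boundary:

```shell
# Save the awk script to a file (hypothetical name).
cat > dupwords.awk <<'AWK'
BEGIN { prev_line_last_token = "" }
{
    if (NF > 1 && $1 == "*" && length(prev_line_last_token) > 0)
    {
        if ($2 == prev_line_last_token &&
            # Characters used in ASCII charts are not duplicate words.
            $2 != "|" && $2 != "}")
            # Found a duplicate.
            printf("%s:%s, duplicate token: %s\n", FILENAME, FNR, $2);
    }

    if (NF > 1 && ($1 == "*" || $1 == "/*"))
        prev_line_last_token = $NF;
    else
        # Empty line or not a comment line. Start a new search.
        prev_line_last_token = "";
}
AWK

# A sample source file with "the the" split across comment lines.
cat > sample.c <<'EOF'
/* This comment repeats the
 * the word across a line boundary.
 */
EOF

awk -f dupwords.awk sample.c
# prints: sample.c:2, duplicate token: the
```

For a whole tree, something like `find . -name '*.c' -o -name '*.h' |
xargs awk -f dupwords.awk` would apply it to every file, since FILENAME
and FNR reset per input file.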

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
