Re: [PATCH v2] checkpatch: fix false positives in REPEATED_WORD warning

Aditya Thu, 22 Oct 2020 12:15:12 -0700

On 22/10/20 9:40 pm, Joe Perches wrote:
> On Thu, 2020-10-22 at 20:20 +0530, Aditya Srivastava wrote:
>> Presence of hexadecimal address or symbol results in false warning
>> message by checkpatch.pl.
> []
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
>> @@ -3051,7 +3051,10 @@ sub process {
>>              }
>>  
>>  # check for repeated words separated by a single space
>> -            if ($rawline =~ /^\+/ || $in_commit_log) {
>> +# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
>> +            if (($rawline =~ /^\+/ || $in_commit_log) &&
>> +            $rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {
> 
> Alignment and use \b before and after the regex please.


If we use \b before the regex, after it, or both, it no longer matches
patterns such as:
+   -rw-r--r--. 1 root root 112K Mar 20 12:16 selinux-policy-3.14.4-48.fc31.noarch.rpm

This is probably because '-' is a non-word character, so \b cannot match
between the preceding space and the leading '-'.
I have not observed any drawbacks from leaving \b out, though.
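
A minimal sketch of the \b behaviour described above. The two regexes are
the ones from the hunk; the directory line and all variable names are
hypothetical and only there for comparison:

```perl
use strict;
use warnings;

# Regexes from the patch hunk, without and with \b anchors.
my $plain   = qr/[bcCdDlMnpPs\?-][rwxsStT-]{9}/;
my $bounded = qr/\b[bcCdDlMnpPs\?-][rwxsStT-]{9}\b/;

# Sample from this mail: a regular file, so the mode string starts with '-'.
my $file_line = '+   -rw-r--r--. 1 root root 112K Mar 20 12:16 selinux-policy-3.14.4-48.fc31.noarch.rpm';
# Hypothetical sample: a directory entry, so the mode string starts with 'd'.
my $dir_line  = '+   drwxr-xr-x 2 root root 4096 Mar 20 12:16 tmp';

my %result = (
	file_plain   => ($file_line =~ $plain)   ? 1 : 0,
	file_bounded => ($file_line =~ $bounded) ? 1 : 0,
	dir_plain    => ($dir_line  =~ $plain)   ? 1 : 0,
	dir_bounded  => ($dir_line  =~ $bounded) ? 1 : 0,
);

printf "%-12s %d\n", $_, $result{$_} for sort keys %result;
```

With these inputs only the '-rw-r--r--.' line stops matching once \b is
added: there is no word boundary between a space and '-', nor between '-'
and '.'.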

> 
>               if (($rawline =~ /^\+/ || $in_commit_log) &&
>                   $rawline !~ /\b[bcCdDlMnpPs\?-][rwxsStT-]{9}\b/) {
>> @@ -3065,6 +3068,34 @@ sub process {
>>                              next if ($first ne $second);
>>                              next if ($first eq 'long');
>>  
>> +                            # avoid repeating hex occurrences like 'ff ff fe 09 ...'
>> +                            if ($first =~ /\b[0-9a-f]{2,}/) {
>> +                                    # if such sequence occurs more than 4, it is most probably part of some of code
>> +                                    next if ((scalar @hex_seq)>4);
>> +                                    # for hex occurrences which are less than 4
>> +                                    # get first hex word in the line
>> +                                    if ($rawline =~ /\b[0-9a-f]{2,} /) {
>> +                                            my $post_hex_seq = $';
>> +
>> +                                            # set suffieciently high default values to avoid ignoring or counting in absence of another
>> +                                            my $non_hex_char_pos = 1000;
>> +                                            my $special_chars_pos = 500;
>> +
>> +                                            if ($post_hex_seq =~ /[g-z]+/) {
>> +                                                    # first non hex character in post_hex_seq
>> +                                                    $non_hex_char_pos = $-[0];
>> +                                            }
>> +                                            if($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
>> +                                                    # first occurrence of 2 or more special chars
>> +                                                    $special_chars_pos = $-[0];
>> +                                            }
> 
> What does all this code actually avoid?
> 
> 

Sir, this warning occurs for several variations of hex output, e.g.:
1) 00 c0 06 16 00 00 ff ff 00 93 1c 18 00 00 ff ff  ................
2) ffffffff ffffffff 00000000 c070058c
3)     f5a:       48 c7 44 24 78 ff ff    movq $0xffffffffffffffff,0x78(%rsp)
4) +    fe fe
5) +    fe fe   - ? end marker ?
6) Code: ff ff 48 (...)

So I first check whether the repeated word matches /\b[0-9a-f]{2,}/. If it
does, and it occurs in a sequence of more than 4 such words (i.e. 5 or
more), then it is most probably part of hexadecimal code. This is
implemented here:

+                               if ($first =~ /\b[0-9a-f]{2,}/) {
+                                       # if such sequence occurs more than 4, it is most probably part of some of code
+                                       next if ((scalar @hex_seq)>4);

This addresses the warnings for cases like examples (1), (2) and (3).

But this still does not cover examples (4), (5) and (6). One could argue
that we could replace:

+                                       next if ((scalar @hex_seq)>4);

with (scalar @hex_seq)>2 or (scalar @hex_seq)>3,

but then we would miss genuine warnings such as:

7) +     * sets this to -1, the slack value will be calculated to be be halfway
8) + * @seg: index of packet segment whose raw fields are to be be extracted
9) The data in destination buffer is expected to be be parsed in big
10) +    *   1. New session or device can'be be created - session sysfs files

Here I observed that in hex output, a run of at least 2 special
(non-alphanumeric) characters appears before any non-hex letter, e.g. in
(5). Such runs are also very rare in ordinary English text, which helps
in our case.

This is implemented here:

>> +                            # avoid repeating hex occurrences like 'ff ff fe 09 ...'
>> +                            if ($first =~ /\b[0-9a-f]{2,}/) {
>> +                                    # if such sequence occurs more than 4, it is most probably part of some of code
>> +                                    next if ((scalar @hex_seq)>4);
>> +                                    # for hex occurrences which are less than 4
>> +                                    # get first hex word in the line
>> +                                    if ($rawline =~ /\b[0-9a-f]{2,} /) {
>> +                                            my $post_hex_seq = $';
>> +
>> +                                            # set suffieciently high default values to avoid ignoring or counting in absence of another
>> +                                            my $non_hex_char_pos = 1000;
>> +                                            my $special_chars_pos = 500;
>> +
>> +                                            if ($post_hex_seq =~ /[g-z]+/) {
>> +                                                    # first non hex character in post_hex_seq
>> +                                                    $non_hex_char_pos = $-[0];
>> +                                            }
>> +                                            if($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
>> +                                                    # first occurrence of 2 or more special chars
>> +                                                    $special_chars_pos = $-[0];
>> +                                            }

I have used these two lines for cases like example (4):
+                                               my $non_hex_char_pos = 1000;
+                                               my $special_chars_pos = 500;

Here there are no non-hex characters, so the default values give the
desired result.
I have also set the defaults high so that, if only one of the two
patterns occurs in a line, the result is not affected, as it could be
with lower default values.
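
To make the position heuristic concrete, here is a rough standalone
sketch. The two regexes and the defaults are copied from the hunk above.
The helper name looks_like_hex_dump and the final comparison
($special_chars_pos < $non_hex_char_pos) are assumptions, since the part
of the patch that uses the two positions is not quoted in this mail:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of checkpatch.pl: returns 1 if the text
# after the first hex-like word looks like hex output, 0 otherwise.
sub looks_like_hex_dump {
	my ($rawline) = @_;

	# get first hex word in the line (same regex as in the hunk)
	return 0 unless $rawline =~ /\b[0-9a-f]{2,} /;
	my $post_hex_seq = $';

	# defaults used when a pattern does not occur at all
	my $non_hex_char_pos = 1000;
	my $special_chars_pos = 500;

	if ($post_hex_seq =~ /[g-z]+/) {
		# offset of the first non-hex letter
		$non_hex_char_pos = $-[0];
	}
	if ($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
		# offset of the first run of 2 or more special chars
		$special_chars_pos = $-[0];
	}

	# ASSUMPTION: treat the line as hex output when the special-char run
	# comes before any non-hex letter. This comparison is not shown in
	# the quoted hunk.
	return $special_chars_pos < $non_hex_char_pos ? 1 : 0;
}

my @samples = (
	'+    fe fe',                                  # example (4)
	'+    fe fe   - ? end marker ?',               # example (5)
	'Code: ff ff 48 (...)',                        # example (6)
	'+     * sets this to -1, the slack value will be calculated to be be halfway',   # example (7)
	"+    *   1. New session or device can'be be created - session sysfs files",      # example (10)
);

printf "%d  %s\n", looks_like_hex_dump($_), $_ for @samples;
```

Under that assumed comparison, examples (4), (5) and (6) are classified
as hex output, while (7) and (10) are not and would still be reported.
For (4) neither pattern occurs, so the defaults (500 < 1000) decide the
result.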


Thanks
Aditya
