Re: [PATCH v2] checkpatch: look for common misspellings

2014-09-12 Thread Masanari Iida
On Fri, Sep 12, 2014 at 1:45 PM, Joe Perches  wrote:
> On Fri, 2014-09-12 at 13:09 +0900, Masanari Iida wrote:
>> Test with "reseting" case,  codespell found 21, grep found 26.
>
> Hello Masanari.
>
> How did codespell find any uses of reseting?
> What version of codespell are you using?
> (I tested with 1.7)
>
> Looking at the git tree for codespell,
> https://github.com/lucasdemarchi/codespell.git
> the dictionary there doesn't have reseting.
>
Joe,

First of all, I use codespell 1.4 scripts with my original dictionary
based on 1.4.
So I believe "reseting" was added by me some time ago.

> If I add reseting->resetting to the dictionary,
> then codespell finds the same 31 uses that
> git grep -i does.
>
My codespell 1.4 is case sensitive.
That's why we saw slightly different results.

Masanari
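
The effect of case sensitivity on the counts can be reproduced on a few made-up lines. This is only a sketch: the sample text below is hypothetical and is not taken from the kernel tree or from codespell, and plain grep stands in for both tools.

```shell
#!/bin/sh
# Hypothetical sample lines; not taken from the kernel tree.
sample() {
	printf '%s\n' \
		'/* reseting the device */' \
		'/* Reseting the queue */' \
		'#define RESETING_FLAG 1' \
		'/* resetting is spelled correctly here */'
}

# Count lines containing "reseting", case-sensitively and then with -i.
sensitive=$(sample | grep -c 'reseting')
insensitive=$(sample | grep -ci 'reseting')

echo "case-sensitive:   $sensitive"
echo "case-insensitive: $insensitive"
```

A case-sensitive search misses the capitalized and all-caps spellings, which is consistent with a case-sensitive codespell reporting fewer hits than git grep -i.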
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] checkpatch: look for common misspellings

2014-09-11 Thread Joe Perches
On Fri, 2014-09-12 at 13:09 +0900, Masanari Iida wrote:
> Test with "reseting" case,  codespell found 21, grep found 26.

Hello Masanari.

How did codespell find any uses of reseting?
What version of codespell are you using?
(I tested with 1.7)

Looking at the git tree for codespell,
https://github.com/lucasdemarchi/codespell.git
the dictionary there doesn't have reseting.

If I add reseting->resetting to the dictionary,
then codespell finds the same 31 uses that
git grep -i does.




Re: [PATCH v2] checkpatch: look for common misspellings

2014-09-11 Thread Masanari Iida
Talking about codespell,  it detected 76 "informations" in 3.17-rc4.
" grep -R informations * |wc -l"  found 120 typos.
Test with "occured",  codespell found 46,  grep found 110.
Test with "reseting" case,  codespell found 21, grep found 26.

So I expect about half of the incoming typos will be detected by the tool,
and be fixed.
Masanari
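
The "about half" estimate can be checked against the three pairs of counts quoted above. A small sketch, assuming the grep count is treated as the total for each word; the helper name pct is made up for this illustration.

```shell
#!/bin/sh
# Print found/total as a whole-number percentage.
pct() {
	awk -v found="$1" -v total="$2" 'BEGIN { printf "%.0f%%\n", 100 * found / total }'
}

echo "informations: $(pct 76 120)"
echo "occured:      $(pct 46 110)"
echo "reseting:     $(pct 21 26)"
# Pooled over the three words: (76 + 46 + 21) / (120 + 110 + 26).
echo "pooled:       $(pct 143 256)"
```

The per-word ratios vary widely, but the pooled figure lands a little above one half.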


Re: [PATCH v2] checkpatch: look for common misspellings

2014-09-11 Thread Joe Perches
On Thu, 2014-09-11 at 09:19 +0200, Geert Uytterhoeven wrote:
> On Thu, Sep 11, 2014 at 12:52 AM, Andrew Morton
>  wrote:
> > On Mon, 8 Sep 2014 11:15:24 -0700 Kees Cook  wrote:
> >> Check for misspellings, based on Debian's lintian list. Several false
> >> positives were removed, and several additional words added that were
[]
> > I have a feeling this is going to be a rat hole and that
> > scripts/spelling.txt will grow to consume the planet.  Oh well, whatev.
> 
> What about making checkpatch use the codespell dictionary if codespell
> is installed?
> 
> Codespell is in Ubuntu 14.04LTS (but not in 12.04LTS).

I'm a little concerned about false positives if that's
done, but it seems simple enough.

Maybe both of:

codespell:  /usr/share/codespell/dictionary.txt
lintian:/usr/share/lintian/data/spelling/corrections
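
One way to picture the "use it if installed" idea is to probe for the two files and keep whichever are readable. This is a sketch only: the helper name available_dicts is hypothetical, and nothing here is part of checkpatch.

```shell
#!/bin/sh
# Print each candidate dictionary path that exists and is readable.
available_dicts() {
	for dict in "$@"; do
		[ -r "$dict" ] && printf '%s\n' "$dict"
	done
	return 0
}

# The two paths mentioned above; either may be absent on a given system.
available_dicts \
	/usr/share/codespell/dictionary.txt \
	/usr/share/lintian/data/spelling/corrections
```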





Re: [PATCH v2] checkpatch: look for common misspellings

2014-09-11 Thread Kees Cook
On Thu, Sep 11, 2014 at 12:19 AM, Geert Uytterhoeven
 wrote:
> On Thu, Sep 11, 2014 at 12:52 AM, Andrew Morton
>  wrote:
>> On Mon, 8 Sep 2014 11:15:24 -0700 Kees Cook  wrote:
>>
>>> Check for misspellings, based on Debian's lintian list. Several false
>>> positives were removed, and several additional words added that were
>>> common in the kernel:
>>>
>>>   backword backwords
>>>   invalide valide
>>>   recieves
>>>   singed unsinged
>>>
>>> While going back and fixing existing spelling mistakes isn't a high
>>> priority, it'd be nice to try to catch them before they hit the tree.
>>
>> I have a feeling this is going to be a rat hole and that
>> scripts/spelling.txt will grow to consume the planet.  Oh well, whatev.
>
> What about making checkpatch use the codespell dictionary if codespell
> is installed?
>
> Codespell is in Ubuntu 14.04LTS (but not in 12.04LTS).

It's probably not a bad idea, but given the level of pruning that's
been needed already to keep down the false positive rate, I'm nervous
about a larger "general" corpus.

-Kees

-- 
Kees Cook
Chrome OS Security


Re: [PATCH v2] checkpatch: look for common misspellings

2014-09-11 Thread Geert Uytterhoeven
On Thu, Sep 11, 2014 at 12:52 AM, Andrew Morton
 wrote:
> On Mon, 8 Sep 2014 11:15:24 -0700 Kees Cook  wrote:
>
>> Check for misspellings, based on Debian's lintian list. Several false
>> positives were removed, and several additional words added that were
>> common in the kernel:
>>
>>   backword backwords
>>   invalide valide
>>   recieves
>>   singed unsinged
>>
>> While going back and fixing existing spelling mistakes isn't a high
>> priority, it'd be nice to try to catch them before they hit the tree.
>
> I have a feeling this is going to be a rat hole and that
> scripts/spelling.txt will grow to consume the planet.  Oh well, whatev.

What about making checkpatch use the codespell dictionary if codespell
is installed?

Codespell is in Ubuntu 14.04LTS (but not in 12.04LTS).

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH v2] checkpatch: look for common misspellings

2014-09-10 Thread Joe Perches
On Wed, 2014-09-10 at 15:52 -0700, Andrew Morton wrote:
> Have a kernel joke:
[]
> @@ -553,6 +553,7 @@ jeffies||jiffies
> +kubys|linus

Gimmu Smftre///




Re: [PATCH v2] checkpatch: look for common misspellings

2014-09-10 Thread Andrew Morton
On Mon, 8 Sep 2014 11:15:24 -0700 Kees Cook  wrote:

> Check for misspellings, based on Debian's lintian list. Several false
> positives were removed, and several additional words added that were
> common in the kernel:
> 
>   backword backwords
>   invalide valide
>   recieves
>   singed unsinged
> 
> While going back and fixing existing spelling mistakes isn't a high
> priority, it'd be nice to try to catch them before they hit the tree.

I have a feeling this is going to be a rat hole and that
scripts/spelling.txt will grow to consume the planet.  Oh well, whatev.

Have a kernel joke:

--- a/scripts/spelling.txt~checkpatch-look-for-common-misspellings-fix
+++ a/scripts/spelling.txt
@@ -553,6 +553,7 @@ jeffies||jiffies
 juse||just
 jus||just
 kown||known
+kubys|linus
 langage||language
 langauage||language
 langauge||language
_



Re: [PATCH v2] checkpatch: look for common misspellings

2014-09-10 Thread Joe Perches
On Wed, 2014-09-10 at 13:37 +0900, Masanari Iida wrote:
> Hello Joe, Kees,

Hello Masanari-san.

> Sorry for the late reply.
> I was on holiday when the version 1 patch discussions were posted.

No worries, holidays are far more important
than patches like this...

These patches are simple niceties, not fixes
for bugs, so review and acceptance timing is
not urgent.

> I am using codespell ( https://github.com/lucasdemarchi/codespell/ ).
> The codespell has its own typo dictionary.
> The dictionary format is
> 
> typo->good   (1 candidate)
> typo->good1,good2,  (multiple candidates)
> typo->good, comment  (1 candidate with special remark)
> 
> It's similar to your  typo||good  format.
> 
> The license of codespell is GPLv2, according to the COPYING file in the tarball.
> 
> Comparing the number of typo entries in each dictionary:
> Your dictionary :  1033
> codespell-1.4 : 4261
> codespell-1.4 + my additions : 5245
> Your dictionary + codespell-1.4 + my additions, duplicates removed :  5742
> 
> The latest version of codespell is 1.7.
> My dictionary is based on codespell-1.4, so the numbers above are for 1.4.
> 
> I can provide my typo samples under the GPLv2 license.

Thanks.

Any additions you have to the dictionary would be
gladly welcomed.

Using a common format for the dictionary and any
suggested corrections would be good too.

Maybe the dictionary and code should be changed to
use the codespell format.  It seems a bit more
flexible than the lintian form.

I do not know if one project is more active than
the other, but perhaps that should be the deciding
factor.  Or maybe just Kees' preference...

Merging all these together might not be a good
solution though.

Right now, the checkpatch spelling code uses word
boundaries that include an underscore.

checkpatch spelling tests are done on 4 segments of
a #define like "PREFIX_PREFERED_SEG_ABC" finding the
misspelling of PREFERED.

Some sifting of the dictionary is still necessary to
eliminate some common prefixes to avoid too many false
positives.

For example, "ths" was dropped because it's a prefix
used by several modules even though it's a somewhat
frequent typo.
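
The word-boundary behaviour described above can be tried outside checkpatch. The sketch below uses grep with a boundary class of "anything that is not a letter or '@'", modelled on the regex in the patch; the helper name has_typo is hypothetical and the sample lines are made up.

```shell
#!/bin/sh
# Succeed if $2 contains the word $1 delimited by anything other than a
# letter or '@', so '_' counts as a boundary (as in the patch's regex).
has_typo() {
	printf '%s\n' "$2" | LC_ALL=C grep -Eiq "(^|[^A-Za-z@])$1(\$|[^A-Za-z@])"
}

line='#define PREFIX_PREFERED_SEG_ABC 1'

if has_typo prefered "$line"; then
	echo "underscore-as-boundary: match"
fi

# grep -w treats '_' as part of a word, so the same typo is missed.
if ! printf '%s\n' "$line" | grep -wiq prefered; then
	echo "grep -w: no match"
fi
```

The same boundary rule is what makes a short entry such as "ths" noisy: has_typo ths 'static int ths_init(void)' also succeeds, even though the identifier is not a typo.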




Re: [PATCH v2] checkpatch: look for common misspellings

2014-09-09 Thread Masanari Iida
Hello Joe, Kees,

Sorry for the late reply.
I was on holiday when the version 1 patch discussions were posted.

I am using codespell ( https://github.com/lucasdemarchi/codespell/ ).
The codespell has its own typo dictionary.
The dictionary format is

typo->good   (1 candidate)
typo->good1,good2,  (multiple candidates)
typo->good, comment  (1 candidate with special remark)

It's similar to your  typo||good  format.
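
Because the two formats are so close, converting between them is mostly a matter of swapping the separator. A rough sketch, assuming that only the first candidate is kept and that any remark after a comma is dropped; the function name and the sample entries are made up for illustration and are not taken from either dictionary.

```shell
#!/bin/sh
# Convert codespell-style "typo->good[,more...]" lines to "typo||good",
# keeping only the first candidate. Lines without "->" are skipped.
codespell_to_checkpatch() {
	awk -F'->' 'NF == 2 {
		split($2, candidates, ",")
		print $1 "||" candidates[1]
	}'
}

# Hypothetical entries, one for each of the three forms listed above.
printf '%s\n' \
	'occured->occurred' \
	'valide->valid,validate,' \
	'reseting->resetting, sample remark' |
	codespell_to_checkpatch
```

Dropping the extra candidates loses information, so a real merge would need to decide how to handle entries with more than one suggestion.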

The license of codespell is GPLv2, according to the COPYING file in the tarball.

Comparing the number of typo entries in each dictionary:
Your dictionary :  1033
codespell-1.4 : 4261
codespell-1.4 + my additions : 5245
Your dictionary + codespell-1.4 + my additions, duplicates removed :  5742

The latest version of codespell is 1.7.
My dictionary is based on codespell-1.4, so the numbers above are for 1.4.

I can provide my typo samples under the GPLv2 license.

Masanari


Re: [PATCH v2] checkpatch: look for common misspellings

2014-09-08 Thread Joe Perches
On Mon, 2014-09-08 at 11:15 -0700, Kees Cook wrote:
> Check for misspellings, based on Debian's lintian list. Several false
> positives were removed, and several additional words added that were
> common in the kernel:
> 
>   backword backwords
>   invalide valide
>   recieves
>   singed unsinged
> 
> While going back and fixing existing spelling mistakes isn't a high
> priority, it'd be nice to try to catch them before they hit the tree.

Seems sensible enough.

Acked-by: Joe Perches 




[PATCH v2] checkpatch: look for common misspellings

2014-09-08 Thread Kees Cook
Check for misspellings, based on Debian's lintian list. Several false
positives were removed, and several additional words added that were
common in the kernel:

backword backwords
invalide valide
recieves
singed unsinged

While going back and fixing existing spelling mistakes isn't a high
priority, it'd be nice to try to catch them before they hit the tree.

In the 13830 commits between 3.15 and 3.16, the script would have noticed
560 spelling mistakes. The top 25 are shown here:

$ git log --pretty=oneline v3.15..v3.16 | wc -l
13830
$ git log --format='%H' v3.15..v3.16 | \
   while read commit ; do \
 echo "commit $commit" ; \
 git log --format=email --stat -p -1 $commit | \
   ./scripts/checkpatch.pl --types=typo_spelling --no-summary - ; \
   done | tee spell_v3.15..v3.16.txt | grep "may be misspelled" | \
   awk '{print $2}' | tr A-Z a-z | sort | uniq -c | sort -rn
 21 'seperate'
 17 'endianess'
 15 'sucess'
 13 'noticable'
 11 'occured'
 11 'accomodate'
 10 'interrup'
  9 'prefered'
  8 'unecessary'
  8 'explicitely'
  7 'supress'
  7 'overriden'
  7 'immediatly'
  7 'funtion'
  7 'defult'
  7 'childs'
  6 'succesful'
  6 'splitted'
  6 'specifc'
  6 'reseting'
  6 'recieve'
  6 'changable'
  5 'tmis'
  5 'singed'
  5 'preceeding'
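
The tallying stage of the pipeline above can be exercised on its own, without a kernel tree. This is a sketch on hypothetical checkpatch-style output; the function name tally_typos and the sample warnings are made up, and a final awk step is added only to trim the padding that uniq -c emits.

```shell
#!/bin/sh
# Tally misspelled words from checkpatch-style warnings, most frequent first.
# Output lines are "<count> <word>"; awk trims the padding uniq -c adds.
tally_typos() {
	grep "may be misspelled" |
		awk '{print $2}' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn |
		awk '{print $1, $2}'
}

# Hypothetical checkpatch output for illustration.
printf '%s\n' \
	"WARNING: 'seperate' may be misspelled - perhaps 'separate'?" \
	"WARNING: 'Seperate' may be misspelled - perhaps 'Separate'?" \
	"WARNING: 'occured' may be misspelled - perhaps 'occurred'?" \
	"total: 0 errors, 3 warnings, 10 lines checked" |
	tally_typos
```

Lowercasing before sort and uniq -c is what folds 'Seperate' and 'seperate' into a single entry.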

Thanks to Joe Perches for rewrites, suggestions, additional misspelling
entries, and testing.

Signed-off-by: Kees Cook 
---
v2:
- Joe Perches made several improvements, including:
  - relocated test to catch commit messages
  - handle alternative capitalizations
  - catch all mistakes in a line
  - additional misspelling fix entries
---
 scripts/checkpatch.pl |   44 ++-
 scripts/spelling.txt  | 1042 +
 2 files changed, 1085 insertions(+), 1 deletion(-)
 create mode 100644 scripts/spelling.txt

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index b385bcbbf2f5..d0ac3d30d93e 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -9,7 +9,8 @@ use strict;
 use POSIX;
 
 my $P = $0;
-$P =~ s@.*/@@g;
+$P =~ s@(.*)/@@g;
+my $D = $1;
 
 my $V = '0.32';
 
@@ -43,6 +44,7 @@ my $configuration_file = ".checkpatch.conf";
 my $max_line_length = 80;
 my $ignore_perl_version = 0;
 my $minimum_perl_version = 5.10.0;
+my $spelling_file = "$D/spelling.txt";
 
 sub help {
my ($exitcode) = @_;
@@ -429,6 +431,29 @@ our $allowed_asm_includes = qr{(?x:
 )};
 # memory.h: ARM has a custom one
 
+# Load common spelling mistakes and build regular expression list.
+my $misspellings;
+my @spelling_list;
+my %spelling_fix;
+open(my $spelling, '<', $spelling_file)
+or die "$P: Can't open $spelling_file for reading: $!\n";
+while (<$spelling>) {
+   my $line = $_;
+
+   $line =~ s/\s*\n?$//g;
+   $line =~ s/^\s*//g;
+
+   next if ($line =~ m/^\s*#/);
+   next if ($line =~ m/^\s*$/);
+
+   my ($suspect, $fix) = split(/\|\|/, $line);
+
+   push(@spelling_list, $suspect);
+   $spelling_fix{$suspect} = $fix;
+}
+close($spelling);
+$misspellings = join("|", @spelling_list);
+
 sub build_types {
my $mods = "(?x:  \n" . join("|\n  ", @modifierList) . "\n)";
my $all = "(?x:  \n" . join("|\n  ", @typeList) . "\n)";
@@ -2212,6 +2237,23 @@ sub process {
"8-bit UTF-8 used in possible commit log\n" . $herecurr);
}
 
+# Check for various typo / spelling mistakes
+   if ($in_commit_log || $line =~ /^\+/) {
+   while ($rawline =~ /(?:^|[^a-z@])($misspellings)(?:$|[^a-z@])/gi) {
+   my $typo = $1;
+   my $typo_fix = $spelling_fix{lc($typo)};
+   $typo_fix = ucfirst($typo_fix) if ($typo =~ /^[A-Z]/);
+   $typo_fix = uc($typo_fix) if ($typo =~ /^[A-Z]+$/);
+   my $msg_type = \&WARN;
+   $msg_type = \&CHK if ($file);
+   if (&{$msg_type}("TYPO_SPELLING",
+"'$typo' may be misspelled - perhaps '$typo_fix'?\n" . $herecurr) &&
+   $fix) {
+   $fixed[$fixlinenr] =~ s/(^|[^A-Za-z@])($typo)($|[^A-Za-z@])/$1$typo_fix$3/;
+   }
+   }
+   }
+
 # ignore non-hunk lines and lines being removed
next if (!$hunk_line || $line =~ /^-/);
 
diff --git a/scripts/spelling.txt b/scripts/spelling.txt
new file mode 100644
index ..fc7fd52b5e03
--- /dev/null
+++ b/scripts/spelling.txt
@@ -0,0 +1,1042 @@
+# Originally from Debian's Lintian tool. Various false positives have been
+# removed, and various additions have been made as they've been discovered
+# in the kernel source.
+#
+# License: GPLv2
+#
+# The