Re: [PATCH] Fix overeager spelling corrections (PR c/82967)

2018-09-14 Thread Jeff Law
On 8/24/18 9:24 AM, David Malcolm wrote:
> This patch tunes class best_match's cutoff for rejecting meaningless
> spelling suggestions.
> 
> Previously, we allowed an edit distance of up to half of the length of the
> longer of the goal string and closest candidate strings, rounded down.
> 
> With this patch, we now allow only up to a third - with some tuning of
> rounding (and for very short strings), to ensure that:
> (a) everything that worked before still works (with the removal of a
> couple of cases that shouldn't), and that
> (b) the new threshold is always at least as conservative as the old
> threshold and thus shouldn't offer new nonsensical suggestions (with
> the possible exception of cases where transposition has helped; see
> r261521 aka Damerau-Levenshtein; PR other/69968).
> 
> In particular, all of the bogus suggestions from PR c/82967 are now
> no longer offered.
> 
> Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu;
> adds 4 PASS results to gcc.sum.
> 
> OK for trunk?
> 
> OK for backporting to gcc 8? (with the caveat that gcc 8's
> edit distance doesn't do transpositions)
> 
> gcc/ChangeLog:
>   PR c/82967
>   * spellcheck.c (get_edit_distance_cutoff): New function.
>   (selftest::test_edit_distance_unit_test_oneway): Rename to...
>   (selftest::test_get_edit_distance_one_way): ...this.
>   (selftest::test_get_edit_distance_unit): Rename to...
>   (selftest::test_get_edit_distance_both_ways): ...this.
>   (selftest::test_edit_distances): Move tests to this new function,
>   and test some more pairs of strings.  Update for above renaming.
>   (selftest::get_old_cutoff): New function.
>   (selftest::test_get_edit_distance_cutoff): New function.
>   (selftest::assert_suggested_for): New function.
>   (ASSERT_SUGGESTED_FOR): New macro.
>   (selftest::assert_not_suggested_for): New function.
>   (ASSERT_NOT_SUGGESTED_FOR): New macro.
>   (selftest::test_suggestions): New function.
>   (selftest::spellcheck_c_tests): Move test_get_edit_distance_unit
>   tests to selftest::test_edit_distances and call it.  Add calls to
>   selftest::test_get_edit_distance_cutoff and
>   selftest::test_suggestions.
>   * spellcheck.h (get_edit_distance_cutoff): New function declaration.
>   (best_match::consider): Replace hard-coded cutoff calculation with
>   a call to...
>   (best_match::get_cutoff): New declaration.
>   (best_match::get_best_meaningful_candidate): Likewise.
> 
> gcc/testsuite/ChangeLog:
>   PR c/82967
>   * c-c++-common/attributes-1.c: Remove bogus suggestion from
>   dg-prune-output.
>   * gcc.dg/diagnostic-token-ranges.c (undeclared_identifier): Remove
>   bogus suggestion.
>   * gcc.dg/spellcheck-identifiers-4.c: New test.
ISTM that while not technically listed as a maintainer for this stuff,
you're by far the most well versed in the implementation.
Rubber-stamped with an OK.

jeff


[PATCH] Fix overeager spelling corrections (PR c/82967)

2018-08-24 Thread David Malcolm
This patch tunes class best_match's cutoff for rejecting meaningless
spelling suggestions.

Previously, we allowed an edit distance of up to half of the length of the
longer of the goal string and closest candidate strings, rounded down.

With this patch, we now allow only up to a third - with some tuning of
rounding (and for very short strings), to ensure that:
(a) everything that worked before still works (with the removal of a
couple of cases that shouldn't), and that
(b) the new threshold is always at least as conservative as the old
threshold and thus shouldn't offer new nonsensical suggestions (with
the possible exception of cases where transposition has helped; see
r261521 aka Damerau-Levenshtein; PR other/69968).

In particular, all of the bogus suggestions from PR c/82967 are now
no longer offered.

Successfully bootstrapped & regrtested on x86_64-pc-linux-gnu;
adds 4 PASS results to gcc.sum.

OK for trunk?

OK for backporting to gcc 8? (with the caveat that gcc 8's
edit distance doesn't do transpositions)

gcc/ChangeLog:
PR c/82967
* spellcheck.c (get_edit_distance_cutoff): New function.
(selftest::test_edit_distance_unit_test_oneway): Rename to...
(selftest::test_get_edit_distance_one_way): ...this.
(selftest::test_get_edit_distance_unit): Rename to...
(selftest::test_get_edit_distance_both_ways): ...this.
(selftest::test_edit_distances): Move tests to this new function,
and test some more pairs of strings.  Update for above renaming.
(selftest::get_old_cutoff): New function.
(selftest::test_get_edit_distance_cutoff): New function.
(selftest::assert_suggested_for): New function.
(ASSERT_SUGGESTED_FOR): New macro.
(selftest::assert_not_suggested_for): New function.
(ASSERT_NOT_SUGGESTED_FOR): New macro.
(selftest::test_suggestions): New function.
(selftest::spellcheck_c_tests): Move test_get_edit_distance_unit
tests to selftest::test_edit_distances and call it.  Add calls to
selftest::test_get_edit_distance_cutoff and
selftest::test_suggestions.
* spellcheck.h (get_edit_distance_cutoff): New function declaration.
(best_match::consider): Replace hard-coded cutoff calculation with
a call to...
(best_match::get_cutoff): New declaration.
(best_match::get_best_meaningful_candidate): Likewise.

gcc/testsuite/ChangeLog:
PR c/82967
* c-c++-common/attributes-1.c: Remove bogus suggestion from
dg-prune-output.
* gcc.dg/diagnostic-token-ranges.c (undeclared_identifier): Remove
bogus suggestion.
* gcc.dg/spellcheck-identifiers-4.c: New test.
---
 gcc/spellcheck.c| 231 
 gcc/spellcheck.h|  19 +-
 gcc/testsuite/c-c++-common/attributes-1.c   |   2 +-
 gcc/testsuite/gcc.dg/diagnostic-token-ranges.c  |   3 +-
 gcc/testsuite/gcc.dg/spellcheck-identifiers-4.c |  10 +
 5 files changed, 224 insertions(+), 41 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/spellcheck-identifiers-4.c

diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c
index e2a8b92..690e6fa 100644
--- a/gcc/spellcheck.c
+++ b/gcc/spellcheck.c
@@ -162,6 +162,36 @@ find_closest_string (const char *target,
   return bm.get_best_meaningful_candidate ();
 }
 
+/* Generate the maximum edit distance for which we consider a suggestion
+   to be meaningful, given a goal of length GOAL_LEN and a candidate of
+   length CANDIDATE_LEN.
+
+   This is a third of the the length of the candidate or of the goal,
+   whichever is bigger.  */
+
+edit_distance_t
+get_edit_distance_cutoff (size_t goal_len, size_t candidate_len)
+{
+  size_t max_length = MAX (goal_len, candidate_len);
+  size_t min_length = MIN (goal_len, candidate_len);
+
+  gcc_assert (max_length >= min_length);
+
+  /* Special case: don't offer suggestions for a pair of
+ length == 1 strings (or empty strings).  */
+  if (max_length <= 1)
+return 0;
+
+  /* If the lengths are close, then round down.  */
+  if (max_length - min_length <= 1)
+/* ...but allow an edit distance of at least 1.  */
+return MAX (max_length / 3, 1);
+
+  /* Otherwise, round up (thus giving a little extra leeway to some cases
+ involving insertions/deletions).  */
+  return (max_length + 2) / 3;
+}
+
 #if CHECKING_P
 
 namespace selftest {
@@ -171,8 +201,8 @@ namespace selftest {
 /* Verify that get_edit_distance (A, B) equals the expected value.  */
 
 static void
-test_edit_distance_unit_test_oneway (const char *a, const char *b,
-   edit_distance_t expected)
+test_get_edit_distance_one_way (const char *a, const char *b,
+   edit_distance_t expected)
 {
   edit_distance_t actual = get_edit_distance (a, b);
   ASSERT_EQ (actual, expected);
@@ -185,11 +215,169 @@ test_edit_distance_unit_test_oneway (const char *a, 
const