Re: [PATCH 1/12] implement -Wformat-diag to detect quoting and spelling issues in GCC diagnostics

2019-05-16 Thread Martin Liška
On 5/16/19 12:36 AM, Martin Sebor wrote:
> On 5/15/19 8:50 AM, Martin Sebor wrote:
>> On 5/15/19 5:40 AM, Martin Liška wrote:
>>> On 5/14/19 11:31 PM, Martin Sebor wrote:
 The attached patch implements the -Wformat-diag warning to help find
 quoting, spelling, and other formatting issues in diagnostics issued
 by GCC.
>>>
>>> Just a general comment about this part. Wouldn't it make sense to use regex
>>> for some of the string patterns that you address in the patch?
>>
>> I'm not sure.  Are you wondering about the variants of type
>> names like "signed int" and "unsigned int" and the contractions?
>>
>> The plain format checker scans and advances one word at a time
>> to be efficient.  It does linearly loop over the black-listed
>> words and operators and that could be made more efficient by
>> using a binary search.  Let me do that.  I would expect
>> introducing regexp to the checking to slow it down but
>> I suppose I could try it and see if you think it's worth it.
> 
> I tried both approaches for the operators and keywords:

Hi.

Thanks for considering. It was just an idea I did.

> 
> 1) using bsearch
> 2) using regex
> 
> I finished (1) but did only a proof of concept of (2).  Neither
> made a noticeable difference in my (admittedly very simple)
> benchmarks but both have drawbacks:
> 
> (1) means that the entries in the tables of known operators and
> keywords have to be sorted.  For operators, the sort order is
> different between ASCII and EBCDIC, and would mean duplicating
> the operator table and making sure it stays sorted in each
> encoding after a change.  That seems like a pain in the butt
> for little gain.
> 
> In my prototype of (2), the regular expression quickly started
> to get complicated and hard to read as I tried to handle some
> of the same special cases as I'm handling now.  It might be
> possible to redo the whole thing using regex but it would mean
> starting from scratch.  I'm not convinced it's worth the effort
> (in fact, I would worry about the regular expressions becoming
> too complex to easily extend).
> 
> So I'm inclined to leave things as they are.  I did make a few
> simplifications to the code in the process of this experiment
> and beefed up the test a little bit so I attach an updated
> revision.

Works for me. I'm going to test your patchset on ppc64le and I'll
prepare patch for the target.

Martin

> 
> Martin



Re: [PATCH 1/12] implement -Wformat-diag to detect quoting and spelling issues in GCC diagnostics

2019-05-15 Thread Martin Sebor

On 5/15/19 8:50 AM, Martin Sebor wrote:

On 5/15/19 5:40 AM, Martin Liška wrote:

On 5/14/19 11:31 PM, Martin Sebor wrote:

The attached patch implements the -Wformat-diag warning to help find
quoting, spelling, and other formatting issues in diagnostics issued
by GCC.


Just a general comment about this part. Wouldn't it make sense to use 
regex

for some of the string patterns that you address in the patch?


I'm not sure.  Are you wondering about the variants of type
names like "signed int" and "unsigned int" and the contractions?

The plain format checker scans and advances one word at a time
to be efficient.  It does linearly loop over the black-listed
words and operators and that could be made more efficient by
using a binary search.  Let me do that.  I would expect
introducing regexp to the checking to slow it down but
I suppose I could try it and see if you think it's worth it.


I tried both approaches for the operators and keywords:

1) using bsearch
2) using regex

I finished (1) but did only a proof of concept of (2).  Neither
made a noticeable difference in my (admittedly very simple)
benchmarks but both have drawbacks:

(1) means that the entries in the tables of known operators and
keywords have to be sorted.  For operators, the sort order is
different between ASCII and EBCDIC, and would mean duplicating
the operator table and making sure it stays sorted in each
encoding after a change.  That seems like a pain in the butt
for little gain.

In my prototype of (2), the regular expression quickly started
to get complicated and hard to read as I tried to handle some
of the same special cases as I'm handling now.  It might be
possible to redo the whole thing using regex but it would mean
starting from scratch.  I'm not convinced it's worth the effort
(in fact, I would worry about the regular expressions becoming
too complex to easily extend).

So I'm inclined to leave things as they are.  I did make a few
simplifications to the code in the process of this experiment
and beefed up the test a little bit so I attach an updated
revision.

Martin
gcc/c-family/ChangeLog:

	* c-common.h (GCC_DIAG_STYLE): Adjust.
	 (GCC_DIAG_RAW_STYLE): New macro.
	c-format.c (function_format_info::format_type): Adjust type.
	(function_format_info::is_raw): New member.
	(decode_format_type): Adjust signature.  Handle "raw" diag attributes.
	(decode_format_attr): Adjust call to decode_format_type.
	Avoid a redundant call to convert_format_name_to_system_name.
	Avoid abbreviating the word "arguments" in a diagnostic.
	(format_warning_substr): New function.
	(avoid_dollar_number): Quote dollar sign in a diagnostic.
	(finish_dollar_format_checking): Same.
	(check_format_info): Same.
	(struct baltoks_t, c_opers, cxx_opers): New.
	(c_keywords, cxx_keywords): New.
	(maybe_diag_unbalanced_tokens, check_tokens, check_plain): New
	functions.
	(check_format_info_main): Call check_plain.  Use baltoks_t.  Call
	maybe_diag_unbalanced_tokens.
	(handle_format_attribute): Spell out the word "arguments" in
	a diagnostic.
	* c.opt (-Wformat-diag): New option.

gcc/testsuite/ChangeLog:

	* gcc.dg/format/gcc_diag-11.c: New test.

diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h
index 1cf2cae6395..9d067028416 100644
--- a/gcc/c-family/c-common.h
+++ b/gcc/c-family/c-common.h
@@ -39,7 +39,10 @@ framework extensions, you must include this file before diagnostic-core.h \
 never after.
 #endif
 #ifndef GCC_DIAG_STYLE
-#define GCC_DIAG_STYLE __gcc_cdiag__
+#  define GCC_DIAG_STYLE   __gcc_cdiag__
+#endif
+#ifndef GCC_DIAG_RAW_STYLE
+#  define GCC_DIAG_RAW_STYLE   __gcc_cdiag_raw__
 #endif
 #include "diagnostic-core.h"
 
diff --git a/gcc/c-family/c-format.c b/gcc/c-family/c-format.c
index a7f76c1c01d..696d3c8ccb6 100644
--- a/gcc/c-family/c-format.c
+++ b/gcc/c-family/c-format.c
@@ -52,7 +52,13 @@ enum format_type { printf_format_type, asm_fprintf_format_type,
 
 struct function_format_info
 {
-  int format_type;			/* type of format (printf, scanf, etc.) */
+  enum format_type format_type;		/* type of format (printf, scanf, etc.) */
+  /* IS_RAW is relevant only for GCC diagnostic format functions.
+ It is set for "raw" formatting functions like pp_printf that
+ are not intended to produce complete diagnostics according to
+ GCC guidelines, and clear for others like error and warning
+ whose format string is checked for proper quoting and spelling.  */
+  bool is_raw;
   unsigned HOST_WIDE_INT format_num;	/* number of format argument */
   unsigned HOST_WIDE_INT first_arg_num;	/* number of first arg (zero for varargs) */
 };
@@ -65,7 +71,7 @@ static GTY(()) tree locus;
 
 static bool decode_format_attr (const_tree, tree, tree, function_format_info *,
 bool);
-static int decode_format_type (const char *);
+static format_type decode_format_type (const char *, bool * = NULL);
 
 static bool check_format_string (const_tree argument,
  unsigned HOST_WIDE_INT format_num,
@@ -111,6 +117,32 @@ 

Re: [PATCH 1/12] implement -Wformat-diag to detect quoting and spelling issues in GCC diagnostics

2019-05-15 Thread Martin Sebor

On 5/15/19 5:40 AM, Martin Liška wrote:

On 5/14/19 11:31 PM, Martin Sebor wrote:

The attached patch implements the -Wformat-diag warning to help find
quoting, spelling, and other formatting issues in diagnostics issued
by GCC.


Just a general comment about this part. Wouldn't it make sense to use regex
for some of the string patterns that you address in the patch?


I'm not sure.  Are you wondering about the variants of type
names like "signed int" and "unsigned int" and the contractions?

The plain format checker scans and advances one word at a time
to be efficient.  It does linearly loop over the black-listed
words and operators and that could be made more efficient by
using a binary search.  Let me do that.  I would expect
introducing regexp to the checking to slow it down but
I suppose I could try it and see if you think it's worth it.

Martin


Re: [PATCH 1/12] implement -Wformat-diag to detect quoting and spelling issues in GCC diagnostics

2019-05-15 Thread Martin Liška
On 5/14/19 11:31 PM, Martin Sebor wrote:
> The attached patch implements the -Wformat-diag warning to help find
> quoting, spelling, and other formatting issues in diagnostics issued
> by GCC.

Just a general comment about this part. Wouldn't it make sense to use regex
for some of the string patterns that you address in the patch?

Martin


[PATCH 1/12] implement -Wformat-diag to detect quoting and spelling issues in GCC diagnostics

2019-05-14 Thread Martin Sebor

The attached patch implements the -Wformat-diag warning to help find
quoting, spelling, and other formatting issues in diagnostics issued
by GCC.

Martin
gcc/c-family/ChangeLog:

	* c-common.h (GCC_DIAG_STYLE): Adjust.
	 (GCC_DIAG_RAW_STYLE): New macro.
	c-format.c (function_format_info::format_type): Adjust type.
	(function_format_info::is_raw): New member.
	(decode_format_type): Adjust signature.  Handle "raw" diag attributes.
	(decode_format_attr): Adjust call to decode_format_type.
	Avoid a redundant call to convert_format_name_to_system_name.
	Avoid abbreviating the word "arguments" in a diagnostic.
	(format_warning_substr): New function.
	(avoid_dollar_number): Quote dollar sign in a diagnostic.
	(finish_dollar_format_checking): Same.
	(check_format_info): Same.
	(struct baltoks_t): New.
	(maybe_diag_unbalanced_tokens, check_tokens, check_plain): New
	functions.
	(check_format_info_main): Call check_plain.  Use baltoks_t.  Call
	maybe_diag_unbalanced_tokens.
	(handle_format_attribute): Spell out the word "arguments" in
	a diagnostic.
	* c.opt (-Wformat-diag): New option.

gcc/testsuite/ChangeLog:

	* gcc.dg/format/gcc_diag-11.c: New test.

diff --git a/gcc/c-family/c-common.h b/gcc/c-family/c-common.h
index 1cf2cae6395..9d067028416 100644
--- a/gcc/c-family/c-common.h
+++ b/gcc/c-family/c-common.h
@@ -39,7 +39,10 @@ framework extensions, you must include this file before diagnostic-core.h \
 never after.
 #endif
 #ifndef GCC_DIAG_STYLE
-#define GCC_DIAG_STYLE __gcc_cdiag__
+#  define GCC_DIAG_STYLE   __gcc_cdiag__
+#endif
+#ifndef GCC_DIAG_RAW_STYLE
+#  define GCC_DIAG_RAW_STYLE   __gcc_cdiag_raw__
 #endif
 #include "diagnostic-core.h"
 
diff --git a/gcc/c-family/c-format.c b/gcc/c-family/c-format.c
index a7f76c1c01d..372ab124661 100644
--- a/gcc/c-family/c-format.c
+++ b/gcc/c-family/c-format.c
@@ -52,7 +52,13 @@ enum format_type { printf_format_type, asm_fprintf_format_type,
 
 struct function_format_info
 {
-  int format_type;			/* type of format (printf, scanf, etc.) */
+  enum format_type format_type;		/* type of format (printf, scanf, etc.) */
+  /* IS_RAW is relevant only for GCC diagnostic format functions.
+ It is set for "raw" formatting functions like pp_printf that
+ are not intended to produce complete diagnostics according to
+ GCC guidelines, and clear for others like error and warning
+ whose format string is checked for proper quoting and spelling.  */
+  bool is_raw;
   unsigned HOST_WIDE_INT format_num;	/* number of format argument */
   unsigned HOST_WIDE_INT first_arg_num;	/* number of first arg (zero for varargs) */
 };
@@ -65,7 +71,7 @@ static GTY(()) tree locus;
 
 static bool decode_format_attr (const_tree, tree, tree, function_format_info *,
 bool);
-static int decode_format_type (const char *);
+static format_type decode_format_type (const char *, bool * = NULL);
 
 static bool check_format_string (const_tree argument,
  unsigned HOST_WIDE_INT format_num,
@@ -111,6 +117,32 @@ format_warning_at_char (location_t fmt_string_loc, tree format_string_cst,
   return warned;
 }
 
+
+/* Emit a warning as per format_warning_va, but construct the substring_loc
+   for the substring at offset (POS1, POS2 - 1) within a string constant
+   FORMAT_STRING_CST at FMT_STRING_LOC.  */
+
+ATTRIBUTE_GCC_DIAG (6,7)
+static bool
+format_warning_substr (location_t fmt_string_loc, tree format_string_cst,
+		   int pos1, int pos2, int opt, const char *gmsgid, ...)
+{
+  va_list ap;
+  va_start (ap, gmsgid);
+  tree string_type = TREE_TYPE (format_string_cst);
+
+  pos2 -= 1;
+
+  substring_loc fmt_loc (fmt_string_loc, string_type, pos1, pos1, pos2);
+  format_string_diagnostic_t diag (fmt_loc, NULL, UNKNOWN_LOCATION, NULL,
+   NULL);
+  bool warned = diag.emit_warning_va (opt, gmsgid, );
+  va_end (ap);
+
+  return warned;
+}
+
+
 /* Check that we have a pointer to a string suitable for use as a format.
The default is to check for a char type.
For objective-c dialects, this is extended to include references to string
@@ -320,10 +352,8 @@ decode_format_attr (const_tree fntype, tree atname, tree args,
 {
   const char *p = IDENTIFIER_POINTER (format_type_id);
 
-  p = convert_format_name_to_system_name (p);
+  info->format_type = decode_format_type (p, >is_raw);
 
-  info->format_type = decode_format_type (p);
-  
   if (!c_dialect_objc ()
 	   && info->format_type == gcc_objc_string_format_type)
 	{
@@ -359,7 +389,7 @@ decode_format_attr (const_tree fntype, tree atname, tree args,
   if (info->first_arg_num != 0 && info->first_arg_num <= info->format_num)
 {
   gcc_assert (!validated_p);
-  error ("format string argument follows the args to be formatted");
+  error ("format string argument follows the arguments to be formatted");
   return false;
 }
 
@@ -1067,27 +1097,55 @@ static void format_type_warning (const substring_loc _loc,
  char conversion_char);
 
 /* Decode a format type from a string,