Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On 08/04/2016 08:27 AM, David Malcolm wrote: As for test coverage, v2 and v3 of the kit add over a thousand lines of selftest code that heavily exercise string lexing, using the line_table_case machinery to run the tests with various interesting boundary conditions with line_table (e.g. near LINE_MAP_MAX_LOCATION_WITH_PACKED_RANGES). In terms of test coverage of the fallbacks, patch 2 of v3 of the kit directly exercises the substr_loc.get_range in gcc.dg/plugin/diagnostic_plugin_test_string_literals.c via gcc.dg/plugin/diagnostic-test-string-literals-1.c, and some of the tests there cover the failures, via: error_at (strloc, "unable to read substring range: %s", err); which we wouldn't do in a normal diagnostic (but which is appropriate for testing the machinery itself). Patch 3 of the v3 kit adds a format_warning_va function to c-format.c which is responsible for dealing with failures: https://gcc.gnu.org/ml/gcc-patches/2016-08/msg00204.html THanks for pointing this out. I hadn't started looking at the meat of the on-demand locations until this morning. Looking at patch 3, there's a fair amount of end-to-end testing in gcc.dg/format/diagnostic-ranges.c but it looks like I forgot to add an end-to-end test there of failure due to stringification; I can add one. Is the rest of the v3 patch kit reviewable? Absolutely. I wasn't trying to imply that it wasn't -- in fact most of it is self-approvable stuff and I've only got a couple questions about the rest. Jeff
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On Wed, 2016-08-03 at 09:59 -0600, Jeff Law wrote: > On 07/29/2016 03:42 PM, Joseph Myers wrote: > > On Tue, 26 Jul 2016, David Malcolm wrote: > > > > > This patch implements precise tracking of source locations for > > > the > > > individual chars within string literals, so that we can e.g. > > > underline > > > specific ranges in -Wformat diagnostics. It handles macros, > > > concatenated tokens, escaped characters etc. > > > > What if the string literal results from stringizing other tokens > > (which > > might have arisen in turn from macro expansion, including expansion > > of > > built-in macros not just those defined in source files, etc.)? > > "You don't > > get precise locations" would be a fine answer for such cases - > > provided > > there is good testsuite coverage of them to show they don't crash > > the > > compiler or underline nonsensical characters. > I think losing precise locations in some circumstances would be fine > as > well -- as long as we understand the limitations. In v3 of the patch, this fails gracefully. > And, yes, crashing or underlining nonsensical characters would be > bad, The API in input.c is get_source_range_for_substring, which returns an error message (intended for us, rather than end-users); it is wrapped by this method in c-common.c: /* Attempt to determine the source range of the substring. If successful, return NULL and write the source range to *OUT_RANGE. Otherwise return an error message. Error messages are intended for GCC developers (to help debugging) rather than for end-users. */ const char * substring_loc::get_range (source_range *out_range) const > so it'd be obviously good to test some of that to ensure the > fallbacks > work as expected. As for test coverage, v2 and v3 of the kit add over a thousand lines of selftest code that heavily exercise string lexing, using the line_table_case machinery to run the tests with various interesting boundary conditions with line_table (e.g. near LINE_MAP_MAX_LOCATION_WITH_PACKED_RANGES). In terms of test coverage of the fallbacks, patch 2 of v3 of the kit directly exercises the substr_loc.get_range in gcc.dg/plugin/diagnostic_plugin_test_string_literals.c via gcc.dg/plugin/diagnostic-test-string-literals-1.c, and some of the tests there cover the failures, via: error_at (strloc, "unable to read substring range: %s", err); which we wouldn't do in a normal diagnostic (but which is appropriate for testing the machinery itself). Patch 3 of the v3 kit adds a format_warning_va function to c-format.c which is responsible for dealing with failures: https://gcc.gnu.org/ml/gcc-patches/2016-08/msg00204.html Sadly the comment got a bit mangled by git in that patch due to the proximity to the deleted function location_column_from_byte_offset; here's an inline copy (after patch 4, which adds param CORRECTED_SUBSTRING for doing fix-it hints for bad format strings): /* Emit a warning governed by option OPT, using GMSGID as the format string and AP as its arguments. Attempt to obtain precise location information within a string literal from FMT_LOC. Case 1: if substring location is available, and is within the range of the format string itself, the primary location of the diagnostic is the substring range obtained from FMT_LOC, with the caret at the *end* of the substring range. For example: test.c:90:10: warning: problem with '%i' here [-Wformat=] printf ("hello %i", msg); ~^ Case 2: if the substring location is available, but is not within the range of the format string, the primary location is that of the format string, and an note is emitted showing the substring location. For example: test.c:90:10: warning: problem with '%i' here [-Wformat=] printf("hello " INT_FMT " world", msg); ^ test.c:19: note: format string is defined here #define INT_FMT "%i" ~^ Case 3: if precise substring information is unavailable, the primary location is that of the whole string passed to FMT_LOC's constructor. For example: test.c:90:10: warning: problem with '%i' here [-Wformat=] printf(fmt, msg); ^~~ For each of cases 1-3, if param_range is non-NULL, then it is used as a secondary range within the warning. For example, here it is used with case 1: test.c:90:16: warning: '%s' here but arg 2 has 'long' type [-Wformat=] printf ("foo %s bar", long_i + long_j); ~^ ~~~ and here with case 2: test.c:90:16: warning: '%s' here but arg 2 has 'long' type [-Wformat=] printf ("foo " STR_FMT " bar", long_i + long_j); ^ ~~~ test.c:89:16: note: format string is defined here #define STR_FMT "%s" ~^ and with case 3: test.c:90:10: warning: '%i' here, but arg 2 is "const char *' [-Wformat=]
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On 07/29/2016 03:42 PM, Joseph Myers wrote: On Tue, 26 Jul 2016, David Malcolm wrote: This patch implements precise tracking of source locations for the individual chars within string literals, so that we can e.g. underline specific ranges in -Wformat diagnostics. It handles macros, concatenated tokens, escaped characters etc. What if the string literal results from stringizing other tokens (which might have arisen in turn from macro expansion, including expansion of built-in macros not just those defined in source files, etc.)? "You don't get precise locations" would be a fine answer for such cases - provided there is good testsuite coverage of them to show they don't crash the compiler or underline nonsensical characters. I think losing precise locations in some circumstances would be fine as well -- as long as we understand the limitations. And, yes, crashing or underlining nonsensical characters would be bad, so it'd be obviously good to test some of that to ensure the fallbacks work as expected. jeff
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On 07/29/2016 11:27 AM, David Malcolm wrote: On Fri, 2016-07-29 at 17:53 +0100, Manuel López-Ibáñez wrote: On 29 July 2016 at 16:25, David Malcolm wrote: FWIW, it appears that clang uses the on-demand approach; the relevant code appears to be StringLiteral::getLocationOfByte: http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008 As far as I know, llvm doesn't do language diagnostics from the middle-end/LTO. Thus, they do not have those problems. If you really want to have middle-end diagnostics from LTO, I can make the on-demand approach work. I can also do the stored-location approach, but it would mean rewriting all the patches again, I think, would be less efficient. I would prefer the on-demand approach. Who is empowered to make a decision here? ISTM we've got a bit of a deadlock here with the two intertwined patches. I'm wondering if we can move both forward, perhaps without the higher quality diagnostics for Martin's work initially. Then iterate on what's in-tree to add the higher quality diagnostics, then figure out how to deal with some of the issues we have in the LTO space. Martin's model of running early or late depending on flags is, IMHO, the right approach. And more generally its a good solution for other problems in this space. With that in mind, finding a way to get at the diagnostics framework from within the middle end and eventually LTO is, IMHO, important. Given that the diagnostics are the uncommon case, I would strongly prefer an on-demand approach rather than recording a ton of stuff in the front-end for the unlikely case that we're going to want a diagnostic in the middle-end or LTO. Jeff
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On Thu, 28 Jul 2016, Martin Sebor wrote: > like it as well. So perhaps the problem to solve is how to teach > LTO to talk to the front end. One way to do it would be to build > the front ends as shared libraries. I think building front ends as shared libraries would run into different platforms (e.g. Windows) having very different conceptual models for shared libraries, especially when you get into shared libraries depending on symbols from the main executable (you might need to make all the language-independent parts of the compiler into a shared library as well). But a useful starting point could be to eliminate all cases where different front ends define external functions / variables with the same name (which would also enable statically linking multiple front ends together, to do such things without depending on shared libraries at all). -- Joseph S. Myers jos...@codesourcery.com
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On 29 July 2016 at 18:27, David Malcolm wrote: > On Fri, 2016-07-29 at 17:53 +0100, Manuel López-Ibáñez wrote: >> On 29 July 2016 at 16:25, David Malcolm wrote: >> > >> > FWIW, it appears that clang uses the on-demand approach; the >> > relevant >> > code appears to be StringLiteral::getLocationOfByte: >> > http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008 >> >> As far as I know, llvm doesn't do language diagnostics from the >> middle-end/LTO. Thus, they do not have those problems. > > If you really want to have middle-end diagnostics from LTO, I can make > the on-demand approach work. Personally, I'm happy with having this work only on the FEs. I haven't had time to look at what Martin is doing, so he may prefer otherwise. In any case, making it work from LTO could be done as a follow-up, no? > I can also do the stored-location approach, but it would mean rewriting > all the patches again, I think, would be less efficient. Agreed, FWIW. > I would prefer the on-demand approach. > > Who is empowered to make a decision here? I thought you were the diagnostics maintainer ;-) Cheers, Manuel.
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On Fri, 2016-07-29 at 21:42 +, Joseph Myers wrote: > On Tue, 26 Jul 2016, David Malcolm wrote: > > > This patch implements precise tracking of source locations for the > > individual chars within string literals, so that we can e.g. > > underline > > specific ranges in -Wformat diagnostics. It handles macros, > > concatenated tokens, escaped characters etc. > > What if the string literal results from stringizing other tokens > (which > might have arisen in turn from macro expansion, including expansion > of > built-in macros not just those defined in source files, etc.)? "You > don't > get precise locations" would be a fine answer for such cases - > provided > there is good testsuite coverage of them to show they don't crash the > compiler or underline nonsensical characters. Good question. I briefly tested it just now, and it happens to fail gracefully. I'll add proper test coverage for this. > > + return "range starts after > > LINE_MAP_MAX_LOCATION_WITH_COLS"; > > Where do these strings get used? Hopefully not in diagnostics for > users, > as they aren't written in user terms, and any diagnostic string like > that > would need to be marked up to be extracted for translation. Quoting from the comment for get_source_range_for_substring: Return NULL if successful, or an error message if any errors occurred. Error messages are intended for GCC developers (to help debugging) rather than for end-users. and various functions in the patch follow this pattern (maybe I need to add this to more comments?) I initially had these functions return bool, but found that a const char * was much more useful when debugging failures. (In the testsuite I do happen to use it in a diagnostic, but that's in a plugin, and is purely intended for verifying that various cases are hitting various error paths - analogous to looking for messages in a dumpfile). Thanks Dave
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On Tue, 26 Jul 2016, David Malcolm wrote: > This patch implements precise tracking of source locations for the > individual chars within string literals, so that we can e.g. underline > specific ranges in -Wformat diagnostics. It handles macros, > concatenated tokens, escaped characters etc. What if the string literal results from stringizing other tokens (which might have arisen in turn from macro expansion, including expansion of built-in macros not just those defined in source files, etc.)? "You don't get precise locations" would be a fine answer for such cases - provided there is good testsuite coverage of them to show they don't crash the compiler or underline nonsensical characters. > + return "range starts after LINE_MAP_MAX_LOCATION_WITH_COLS"; Where do these strings get used? Hopefully not in diagnostics for users, as they aren't written in user terms, and any diagnostic string like that would need to be marked up to be extracted for translation. -- Joseph S. Myers jos...@codesourcery.com
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On Fri, 2016-07-29 at 17:53 +0100, Manuel López-Ibáñez wrote: > On 29 July 2016 at 16:25, David Malcolm wrote: > > > > FWIW, it appears that clang uses the on-demand approach; the > > relevant > > code appears to be StringLiteral::getLocationOfByte: > > http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008 > > As far as I know, llvm doesn't do language diagnostics from the > middle-end/LTO. Thus, they do not have those problems. If you really want to have middle-end diagnostics from LTO, I can make the on-demand approach work. I can also do the stored-location approach, but it would mean rewriting all the patches again, I think, would be less efficient. I would prefer the on-demand approach. Who is empowered to make a decision here?
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On 29 July 2016 at 16:25, David Malcolm wrote: > > FWIW, it appears that clang uses the on-demand approach; the relevant > code appears to be StringLiteral::getLocationOfByte: > http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008 As far as I know, llvm doesn't do language diagnostics from the middle-end/LTO. Thus, they do not have those problems. Cheers, Manuel.
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On Fri, 2016-07-29 at 10:46 -0400, David Malcolm wrote: > On Fri, 2016-07-29 at 08:22 -0600, Martin Sebor wrote: > > > Currently all that we need from the C family of frontends is the > > > cpp_reader and the string concatenation records. I think we can > > > reconstruct the cpp_reader if we have the options, though > > > presumably > > > that's per TU, so to support all this we'd need to capture e.g. > > > the > > > per > > > -TU encoding information in the LTO records, for the case where > > > one > > > TU > > > is UTF-8 encoded source to UTF-8 execution, and another TU is > > > EBCDIC > > > -encoded source to UCS-4 execution (or whatever). And there's an > > > issue > > > if different TUs compiled the same header with different encoding > > > options. > > > > > > Or... we could not bother. This is a Quality of Implementation > > > thing, > > > for improving diagnostics, and in each case, the diagnostic is > > > required > > > to cope with substring location information not being available > > > (and > > > the code I posted in patch 2 of the kit makes it trivial to > > > handle > > > that > > > case from a diagnostic). So we could simply have LTO use the > > > fallback mode. > > > > > > There are two high-level approaches I've tried: > > > > > > (a) capture the substring location information in the > > > lexer/parser > > > in > > > the frontend as it runs, and store it somehow. > > > > > > (b) regenerate it "on-demand" when a diagnostic needs it. > > > > > > Approach (b) is inherently going to be prone to the LTO issues > > > you > > > describe, but it avoids adding to the CPU cycles/memory > > > consumption > > > for > > > the common case of not needing the information. [1] > > > > > > Is approach (b) acceptable? > > > > If (b) means potentially reduced quality of the location ranges > > in the -Wformat-length pass (e.g., with funky C++ format strings) > > then I don't think that's enough of a problem to worry about, at > > least not for this warning. > > > > If it means not being able to use the solution you're working > > on in the middle end at all (unless I misunderstood that doesn't > > seem to be what you're implying, but just to be sure) then that > > would seem like a serious shortcoming. I would continue to use > > the code I copied from c-format.c (assuming that will still work), > > but as more warnings are implemented in later passes it would > > lead to duplicating code or reinventing the wheel just to get > > around the limitation (or simply worse quality diagnostics). > > It'll work fine for the middle-end within cc1 and cc1plus. > > I'm specifically referring to LTO here, and it would be fixable from > LTO if we can encode information about the TU encoding options into > the > LTO data stream, and capture the string concatenation records there > too > (but that would be followup work). FWIW, it appears that clang uses the on-demand approach; the relevant code appears to be StringLiteral::getLocationOfByte: http://clang.llvm.org/doxygen/Expr_8cpp_source.html#l01008 > > > Martin > > > > > > > > Thanks > > > Dave > > > > > > [1] with the exception of the string concatenation records, but I > > > believe those are tiny > > > > >
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On Fri, 2016-07-29 at 08:22 -0600, Martin Sebor wrote: > > Currently all that we need from the C family of frontends is the > > cpp_reader and the string concatenation records. I think we can > > reconstruct the cpp_reader if we have the options, though > > presumably > > that's per TU, so to support all this we'd need to capture e.g. the > > per > > -TU encoding information in the LTO records, for the case where one > > TU > > is UTF-8 encoded source to UTF-8 execution, and another TU is > > EBCDIC > > -encoded source to UCS-4 execution (or whatever). And there's an > > issue > > if different TUs compiled the same header with different encoding > > options. > > > > Or... we could not bother. This is a Quality of Implementation > > thing, > > for improving diagnostics, and in each case, the diagnostic is > > required > > to cope with substring location information not being available > > (and > > the code I posted in patch 2 of the kit makes it trivial to handle > > that > > case from a diagnostic). So we could simply have LTO use the > > fallback mode. > > > > There are two high-level approaches I've tried: > > > > (a) capture the substring location information in the lexer/parser > > in > > the frontend as it runs, and store it somehow. > > > > (b) regenerate it "on-demand" when a diagnostic needs it. > > > > Approach (b) is inherently going to be prone to the LTO issues you > > describe, but it avoids adding to the CPU cycles/memory consumption > > for > > the common case of not needing the information. [1] > > > > Is approach (b) acceptable? > > If (b) means potentially reduced quality of the location ranges > in the -Wformat-length pass (e.g., with funky C++ format strings) > then I don't think that's enough of a problem to worry about, at > least not for this warning. > > If it means not being able to use the solution you're working > on in the middle end at all (unless I misunderstood that doesn't > seem to be what you're implying, but just to be sure) then that > would seem like a serious shortcoming. I would continue to use > the code I copied from c-format.c (assuming that will still work), > but as more warnings are implemented in later passes it would > lead to duplicating code or reinventing the wheel just to get > around the limitation (or simply worse quality diagnostics). It'll work fine for the middle-end within cc1 and cc1plus. I'm specifically referring to LTO here, and it would be fixable from LTO if we can encode information about the TU encoding options into the LTO data stream, and capture the string concatenation records there too (but that would be followup work). > Martin > > > > > Thanks > > Dave > > > > [1] with the exception of the string concatenation records, but I > > believe those are tiny > > >
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
Currently all that we need from the C family of frontends is the cpp_reader and the string concatenation records. I think we can reconstruct the cpp_reader if we have the options, though presumably that's per TU, so to support all this we'd need to capture e.g. the per -TU encoding information in the LTO records, for the case where one TU is UTF-8 encoded source to UTF-8 execution, and another TU is EBCDIC -encoded source to UCS-4 execution (or whatever). And there's an issue if different TUs compiled the same header with different encoding options. Or... we could not bother. This is a Quality of Implementation thing, for improving diagnostics, and in each case, the diagnostic is required to cope with substring location information not being available (and the code I posted in patch 2 of the kit makes it trivial to handle that case from a diagnostic). So we could simply have LTO use the fallback mode. There are two high-level approaches I've tried: (a) capture the substring location information in the lexer/parser in the frontend as it runs, and store it somehow. (b) regenerate it "on-demand" when a diagnostic needs it. Approach (b) is inherently going to be prone to the LTO issues you describe, but it avoids adding to the CPU cycles/memory consumption for the common case of not needing the information. [1] Is approach (b) acceptable? If (b) means potentially reduced quality of the location ranges in the -Wformat-length pass (e.g., with funky C++ format strings) then I don't think that's enough of a problem to worry about, at least not for this warning. If it means not being able to use the solution you're working on in the middle end at all (unless I misunderstood that doesn't seem to be what you're implying, but just to be sure) then that would seem like a serious shortcoming. I would continue to use the code I copied from c-format.c (assuming that will still work), but as more warnings are implemented in later passes it would lead to duplicating code or reinventing the wheel just to get around the limitation (or simply worse quality diagnostics). Martin Thanks Dave [1] with the exception of the string concatenation records, but I believe those are tiny
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On Thu, 2016-07-28 at 15:16 -0600, Martin Sebor wrote: > On 07/28/2016 02:38 PM, Martin Sebor wrote: > > On 07/28/2016 02:12 PM, David Malcolm wrote: > > > On Wed, 2016-07-27 at 23:41 +0100, Manuel López-Ibáñez wrote: > > > > On 27 July 2016 at 15:30, David Malcolm > > > > wrote: > > > > > > Perhaps it could live for now in c-format.c, since it is > > > > > > the only > > > > > > place using it? > > > > > > > > > > Martin Sebor [CC-ed] wants to use it from the middle-end: > > > > >https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html > > > > > so it's unclear to me that c-format.c would be a better > > > > > location. > > > > > > > > Fine. He will have to figure out how to get a cpp_reader from > > > > the > > > > middle-end, though. > > > > > > It seems to me that on-demand reconstruction of source locations > > > for > > > STRING_CST nodes is inherently frontend-specific: unless we have > > > the > > > frontend record the information in some fe-independent way (which > > > I > > > assume we *don't* want to do, for space-efficiency), we need to > > > be able > > > to effectively re-run part of the frontend. > > > > > > So maybe this needs to be a langhook; the c-family can use the > > > global > > > cpp_reader * there, and everything else can return a "not > > > supported" > > > code if a diagnostic requests substring location information (and > > > the > > > diagnostic needs to be able to cope with that). > > > > The problem with the lanhook approach, as I learned from my first > > -Wformat-length attempt, is that it doesn't make the front end > > implementation available to LTO. So passes that run late enough > > with LTO (like the latest version of the -Wformat-length pass > > does) would not be bale to make use of it. > > I'm sorry, I didn't mean to sound like I was dismissing the idea. > I agree that string processing is language and front-end specific. > Having the middle end call back into the front-end also seems like > the right thing to do, not just to make this case work, but others > like it as well. So perhaps the problem to solve is how to teach > LTO to talk to the front end. One way to do it would be to build > the front ends as shared libraries. Turning frontends into shared libraries as a prerequisite would seem to be imposing a significant burden on the patch. Currently all that we need from the C family of frontends is the cpp_reader and the string concatenation records. I think we can reconstruct the cpp_reader if we have the options, though presumably that's per TU, so to support all this we'd need to capture e.g. the per -TU encoding information in the LTO records, for the case where one TU is UTF-8 encoded source to UTF-8 execution, and another TU is EBCDIC -encoded source to UCS-4 execution (or whatever). And there's an issue if different TUs compiled the same header with different encoding options. Or... we could not bother. This is a Quality of Implementation thing, for improving diagnostics, and in each case, the diagnostic is required to cope with substring location information not being available (and the code I posted in patch 2 of the kit makes it trivial to handle that case from a diagnostic). So we could simply have LTO use the fallback mode. There are two high-level approaches I've tried: (a) capture the substring location information in the lexer/parser in the frontend as it runs, and store it somehow. (b) regenerate it "on-demand" when a diagnostic needs it. Approach (b) is inherently going to be prone to the LTO issues you describe, but it avoids adding to the CPU cycles/memory consumption for the common case of not needing the information. [1] Is approach (b) acceptable? Thanks Dave [1] with the exception of the string concatenation records, but I believe those are tiny
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On 07/28/2016 02:38 PM, Martin Sebor wrote: On 07/28/2016 02:12 PM, David Malcolm wrote: On Wed, 2016-07-27 at 23:41 +0100, Manuel López-Ibáñez wrote: On 27 July 2016 at 15:30, David Malcolm wrote: Perhaps it could live for now in c-format.c, since it is the only place using it? Martin Sebor [CC-ed] wants to use it from the middle-end: https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html so it's unclear to me that c-format.c would be a better location. Fine. He will have to figure out how to get a cpp_reader from the middle-end, though. It seems to me that on-demand reconstruction of source locations for STRING_CST nodes is inherently frontend-specific: unless we have the frontend record the information in some fe-independent way (which I assume we *don't* want to do, for space-efficiency), we need to be able to effectively re-run part of the frontend. So maybe this needs to be a langhook; the c-family can use the global cpp_reader * there, and everything else can return a "not supported" code if a diagnostic requests substring location information (and the diagnostic needs to be able to cope with that). The problem with the lanhook approach, as I learned from my first -Wformat-length attempt, is that it doesn't make the front end implementation available to LTO. So passes that run late enough with LTO (like the latest version of the -Wformat-length pass does) would not be bale to make use of it. I'm sorry, I didn't mean to sound like I was dismissing the idea. I agree that string processing is language and front-end specific. Having the middle end call back into the front-end also seems like the right thing to do, not just to make this case work, but others like it as well. So perhaps the problem to solve is how to teach LTO to talk to the front end. One way to do it would be to build the front ends as shared libraries. Martin
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On 07/28/2016 02:12 PM, David Malcolm wrote: On Wed, 2016-07-27 at 23:41 +0100, Manuel López-Ibáñez wrote: On 27 July 2016 at 15:30, David Malcolm wrote: Perhaps it could live for now in c-format.c, since it is the only place using it? Martin Sebor [CC-ed] wants to use it from the middle-end: https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html so it's unclear to me that c-format.c would be a better location. Fine. He will have to figure out how to get a cpp_reader from the middle-end, though. It seems to me that on-demand reconstruction of source locations for STRING_CST nodes is inherently frontend-specific: unless we have the frontend record the information in some fe-independent way (which I assume we *don't* want to do, for space-efficiency), we need to be able to effectively re-run part of the frontend. So maybe this needs to be a langhook; the c-family can use the global cpp_reader * there, and everything else can return a "not supported" code if a diagnostic requests substring location information (and the diagnostic needs to be able to cope with that). The problem with the lanhook approach, as I learned from my first -Wformat-length attempt, is that it doesn't make the front end implementation available to LTO. So passes that run late enough with LTO (like the latest version of the -Wformat-length pass does) would not be bale to make use of it. Martin
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On Wed, 2016-07-27 at 23:41 +0100, Manuel López-Ibáñez wrote: > On 27 July 2016 at 15:30, David Malcolm wrote: > > > Perhaps it could live for now in c-format.c, since it is the only > > > place using it? > > > > Martin Sebor [CC-ed] wants to use it from the middle-end: > > https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html > > so it's unclear to me that c-format.c would be a better location. > > Fine. He will have to figure out how to get a cpp_reader from the > middle-end, though. It seems to me that on-demand reconstruction of source locations for STRING_CST nodes is inherently frontend-specific: unless we have the frontend record the information in some fe-independent way (which I assume we *don't* want to do, for space-efficiency), we need to be able to effectively re-run part of the frontend. So maybe this needs to be a langhook; the c-family can use the global cpp_reader * there, and everything else can return a "not supported" code if a diagnostic requests substring location information (and the diagnostic needs to be able to cope with that). > > There are various places it could live; but getting it working took > > a > > lot of effort to achieve - the currently proposed mixture of > > libcpp, > > input.c and c-format.c for the locations of the various pieces > > works > > (for example, auto_vec isn't available in libcpp). > > I don't doubt it. I tried to do something similar in the past and I > failed, this is why I ended up with the poor approximation that was > in > place until now. This is a significant step forward. Thanks (for the current implementation, and for the kind words). > Is libcpp still C? When would be the time to move it to C++ already > and start using common utilities? libcpp is very much C++: I converted the linemap types to use inheritance as part of gcc 6 (and it helped a lot when implementing the range-tracking stuff). > Also, moving vec.h, sbitmap, etc to their own directory/library so > that they can be used by other parts of the compiler (hey! maybe even > by other parts of the toolchain?) is desirable. Richard has said in > the past that he supports such moves. Did I understand correctly > Richard? FWIW, I'd want the selftest framework there too; part of the reason things are in input.c rather than libcpp in the current patch is that selftests aren't yet available from libcpp (and reworking that seems orthogonal). > > Given that both Martin and I have candidate patches that are > > touching > > the same area, I'd prefer to focus on getting this code in to > > trunk, > > rather than rewrite it out-of-tree, so that we can at least have > > the > > improvement to location-handling for Wformat. Once the code is in > > the > > tree, it should be easier to figure out how to access it from the > > middle-end. > > Sure, I think this version is fine. I'm a big proponent of > step-by-step, even if the steps are only approximations to the > optimal > solution :) > It may be enough to motivate someone else more capable to improve > over > my poor approximations ;-) :) > > > [*] In an ideal world, we would have a language-agnostic > > > diagnostics > > > library > > > that would include line-map and that would be used by libcpp and > > > the > > > rest of > > > GCC, so that we can remove all the error-routines in libcpp and > > > the > > > awkward > > > glue code that ties it into diagnostics.c., > > > > Agreed, though that may have to wait until gcc 8 at this point. > > (Given that the proposed diagnostics library would use line maps, > > and > > would be used by libcpp, would it make sense to move the > > diagnostics > > into libcpp itself? Diagnostics would seem to be intimately > > related to > > location-tracking) > > I don't think so. There is nothing in diagnostic.* pretty-print.* > input.* line-map.* that requires libcpp (and only two mentions of > tree > that could be easily abstracted out). This was a deliberate design > goal of Gabriel and followed by most of us later working on > diagnostics. Of course, cpp may make use of the new library, but also > other parts of the toolchain (GAS?). The main obstacle I faced when > trying to do this move was the build machinery to make both libcpp > and > gcc build and statically link with this new library. > > Once that move is done, the main abstraction challenge to remove the > glue is that libcpp has its own flags for options and diagnostics > that > are independent from those of gcc (see c_cpp_error in c-common.c). It > would be great if libcpp used the common flags, but then one would > have to figure out a way to reorder things so that the diagnostic > library, libcpp and gcc can use (or avoid being dependent on) the > same > flags. Thanks. Dave
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On 27 July 2016 at 15:30, David Malcolm wrote: >> Perhaps it could live for now in c-format.c, since it is the only >> place using it? > > Martin Sebor [CC-ed] wants to use it from the middle-end: > https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html > so it's unclear to me that c-format.c would be a better location. Fine. He will have to figure out how to get a cpp_reader from the middle-end, though. > There are various places it could live; but getting it working took a > lot of effort to achieve - the currently proposed mixture of libcpp, > input.c and c-format.c for the locations of the various pieces works > (for example, auto_vec isn't available in libcpp). I don't doubt it. I tried to do something similar in the past and I failed, this is why I ended up with the poor approximation that was in place until now. This is a significant step forward. Is libcpp still C? When would be the time to move it to C++ already and start using common utilities? Also, moving vec.h, sbitmap, etc to their own directory/library so that they can be used by other parts of the compiler (hey! maybe even by other parts of the toolchain?) is desirable. Richard has said in the past that he supports such moves. Did I understand correctly Richard? > Given that both Martin and I have candidate patches that are touching > the same area, I'd prefer to focus on getting this code in to trunk, > rather than rewrite it out-of-tree, so that we can at least have the > improvement to location-handling for Wformat. Once the code is in the > tree, it should be easier to figure out how to access it from the > middle-end. Sure, I think this version is fine. I'm a big proponent of step-by-step, even if the steps are only approximations to the optimal solution :) It may be enough to motivate someone else more capable to improve over my poor approximations ;-) >> [*] In an ideal world, we would have a language-agnostic diagnostics >> library >> that would include line-map and that would be used by libcpp and the >> rest of >> GCC, so that we can remove all the error-routines in libcpp and the >> awkward >> glue code that ties it into diagnostics.c., > > Agreed, though that may have to wait until gcc 8 at this point. > (Given that the proposed diagnostics library would use line maps, and > would be used by libcpp, would it make sense to move the diagnostics > into libcpp itself? Diagnostics would seem to be intimately related to > location-tracking) I don't think so. There is nothing in diagnostic.* pretty-print.* input.* line-map.* that requires libcpp (and only two mentions of tree that could be easily abstracted out). This was a deliberate design goal of Gabriel and followed by most of us later working on diagnostics. Of course, cpp may make use of the new library, but also other parts of the toolchain (GAS?). The main obstacle I faced when trying to do this move was the build machinery to make both libcpp and gcc build and statically link with this new library. Once that move is done, the main abstraction challenge to remove the glue is that libcpp has its own flags for options and diagnostics that are independent from those of gcc (see c_cpp_error in c-common.c). It would be great if libcpp used the common flags, but then one would have to figure out a way to reorder things so that the diagnostic library, libcpp and gcc can use (or avoid being dependent on) the same flags. Cheers, Manuel.
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On Tue, 2016-07-26 at 19:05 +0100, Manuel López-Ibáñez wrote: > On 26/07/16 18:11, David Malcolm wrote: > > > gcc/ChangeLog: > > * gcc.c (cpp_options): Rename string to... > > (cpp_options_): ...this, to avoid clashing with struct in > > cpplib.h. > > It seems to me that you need this because now gcc.c includes > cpplib.h via > input.h, which seems wrong. > > input.h was FE-independent (it depends on line-map.h but it is an > accident of > history that line-map.h is in libcpp since it doesn't depend on > anything from > libcpp [*]). Note that input.h is included in coretypes.h, so this > means that > now cpplib.h is included almost everywhere! [**] > > There is the following in coretypes.h: > > /* Provide forward struct declaration so that we don't have to > include > all of cpplib.h whenever a random prototype includes a pointer. > Note that the cpp_reader and cpp_token typedefs remain part of > cpplib.h. */ > > struct cpp_reader; > struct cpp_token; > > precisely to avoid including cpplib.h. > > > If I understand correctly, cpplib.h is needed in input.h because of > this > declaration: > > +extern const char *get_source_range_for_substring (cpp_reader > *pfile, > +string_concat_db > *concats, > +location_t > strloc, > +enum cpp_ttype > type, > +int start_idx, > int end_idx, > +source_range > *out_range); > > > Does this really need to be in input.h ? It seems something that > only C-family > languages will be able to use. Note that you need a reader to use > this > function, and for that, you need to already include cpplib.h. Fair point; the attached modification to patch 1 compiles cleanly, and moves it to a new header. > Perhaps it could live for now in c-format.c, since it is the only > place using it? Martin Sebor [CC-ed] wants to use it from the middle-end: https://gcc.gnu.org/ml/gcc-patches/2016-07/msg01088.html so it's unclear to me that c-format.c would be a better location. There are various places it could live; but getting it working took a lot of effort to achieve - the currently proposed mixture of libcpp, input.c and c-format.c for the locations of the various pieces works (for example, auto_vec isn't available in libcpp). Given that both Martin and I have candidate patches that are touching the same area, I'd prefer to focus on getting this code in to trunk, rather than rewrite it out-of-tree, so that we can at least have the improvement to location-handling for Wformat. Once the code is in the tree, it should be easier to figure out how to access it from the middle-end. > Cheers, > > Manuel. > > [*] In an ideal world, we would have a language-agnostic diagnostics > library > that would include line-map and that would be used by libcpp and the > rest of > GCC, so that we can remove all the error-routines in libcpp and the > awkward > glue code that ties it into diagnostics.c., Agreed, though that may have to wait until gcc 8 at this point. (Given that the proposed diagnostics library would use line maps, and would be used by libcpp, would it make sense to move the diagnostics into libcpp itself? Diagnostics would seem to be intimately related to location-tracking) > [**] And it seems that we are slowly undoing all the work that was > done by > Andrew MacLeod to clean up the .h web and remove dependencies > (https://gcc.gnu.org/wiki/rearch). > > From 09824cb27c0e817b29de1c7eb9b53c603116f13e Mon Sep 17 00:00:00 2001 From: David Malcolm Date: Wed, 27 Jul 2016 10:33:52 -0400 Subject: [PATCH] Avoid including cpplib.h from input.h gcc/c-family/ChangeLog: * c-common.c: Include "substring-locations.h". gcc/ChangeLog: * input.h: Don't include cpplib.h. (get_source_range_for_substring): Move to... * substring-locations.h: New header. --- gcc/c-family/c-common.c | 1 + gcc/input.h | 8 gcc/substring-locations.h | 30 ++ 3 files changed, 31 insertions(+), 8 deletions(-) create mode 100644 gcc/substring-locations.h diff --git a/gcc/c-family/c-common.c b/gcc/c-family/c-common.c index f4ffc0e..c4843db 100644 --- a/gcc/c-family/c-common.c +++ b/gcc/c-family/c-common.c @@ -45,6 +45,7 @@ along with GCC; see the file COPYING3. If not see #include "tree-iterator.h" #include "opts.h" #include "gimplify.h" +#include "substring-locations.h" cpp_reader *parse_in; /* Declared in c-pragma.h. */ diff --git a/gcc/input.h b/gcc/input.h index 24d9115..c17e440 100644 --- a/gcc/input.h +++ b/gcc/input.h @@ -22,7 +22,6 @@ along with GCC; see the file COPYING3. If not see #define GCC_INPUT_H #include "line-map.h" -#include extern GTY(()) struct line_maps *line_table; @@ -131,11 +130,4 @@ class GTY(()) string_concat_db hash_map
Re: [PATCH 1/3] (v2) On-demand locations within string-literals
On 26/07/16 18:11, David Malcolm wrote: gcc/ChangeLog: * gcc.c (cpp_options): Rename string to... (cpp_options_): ...this, to avoid clashing with struct in cpplib.h. It seems to me that you need this because now gcc.c includes cpplib.h via input.h, which seems wrong. input.h was FE-independent (it depends on line-map.h but it is an accident of history that line-map.h is in libcpp since it doesn't depend on anything from libcpp [*]). Note that input.h is included in coretypes.h, so this means that now cpplib.h is included almost everywhere! [**] There is the following in coretypes.h: /* Provide forward struct declaration so that we don't have to include all of cpplib.h whenever a random prototype includes a pointer. Note that the cpp_reader and cpp_token typedefs remain part of cpplib.h. */ struct cpp_reader; struct cpp_token; precisely to avoid including cpplib.h. If I understand correctly, cpplib.h is needed in input.h because of this declaration: +extern const char *get_source_range_for_substring (cpp_reader *pfile, + string_concat_db *concats, + location_t strloc, + enum cpp_ttype type, + int start_idx, int end_idx, + source_range *out_range); Does this really need to be in input.h ? It seems something that only C-family languages will be able to use. Note that you need a reader to use this function, and for that, you need to already include cpplib.h. Perhaps it could live for now in c-format.c, since it is the only place using it? Cheers, Manuel. [*] In an ideal world, we would have a language-agnostic diagnostics library that would include line-map and that would be used by libcpp and the rest of GCC, so that we can remove all the error-routines in libcpp and the awkward glue code that ties it into diagnostics.c. [**] And it seems that we are slowly undoing all the work that was done by Andrew MacLeod to clean up the .h web and remove dependencies (https://gcc.gnu.org/wiki/rearch).
[PATCH 1/3] (v2) On-demand locations within string-literals
This is an updated version of: "[PATCH] RFC: On-demand locations within string-literals" https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00441.html Changes in v2: - Tweaks to substring location selftests - Many more selftests (EBCDIC, the various wide string types, etc) - Clean up conditions in charset.c; require source == execution charset to have substring locations - Make string_concat_db field private - Return error messages rather than bool - Fix source_range for charset.c:convert_escape - Introduce class substring_loc - Handle bad input locations more gracefully - Ensure that we can read substring information for a token which starts in one linemap and ends in another (seen in gcc.dg/cpp/pr69985.c) This patch implements precise tracking of source locations for the individual chars within string literals, so that we can e.g. underline specific ranges in -Wformat diagnostics. It handles macros, concatenated tokens, escaped characters etc. The idea is to replace the limited implementation of this we currently have in c-format.c (see r223470 [1]). Doing so happens in patch 2 of the kit; this patch just provides the infrastructure to do so. As before the patch implements a new mode within libcpp's string literal lexer. It's disabled during the regular lexer, but it's available through a low-level interface in input.{c|h} which can rerun the libcpp code and capture the per-char source_ranges for when we need to issue a diagnostic. It also now adds a higher-level interface in c-common.h: class substring_loc. As before, to handle concatentation the patch adds some extra data storage: every time a string concatenation happens in c-lex.c: it stores the locations of the component tokens in a hash_map, keyed by the spelling location of the start first token (see class string_concat_db in input.h). Hence it's only storing extra data for string concatenations, not for simple string literals. As before, this doesn't support the C++ frontend yet, but it doesn't regress the status quo for c-format.c from C++. I have a patch for the C++ FE that records string concatenation information to the lexer, but given that it's not used yet, I didn't add that in this patch, as the data would be redundant. This version of the patch properly handles encodings (and adds a lot of test coverage for this to input.c). It makes the simplifying restriction that precise source location information is only available if source charset == execution charset, as discussed on this list, failing gracefully when this isn't the case. I believe I can self-approve the changes to input.c, input.h, libcpp, and the testsuite; the remaining changes needing approval are those to c-family, to gcc.c, and to selftest.h. Successfully bootstrapped®rtested in conjunction with the rest of the patch kit on x86_64-pc-linux-gnu. Successful selftest run for stage 1 on powerpc-ibm-aix7.1.3.0 (gcc111) in conjunction with the rest of the patch kit. config-list.mk test run is in progress. OK for trunk if it passes testing? (by itself) [1] https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=d5a2ddc76a109258297ff345957c35cb50116c94#patch2 gcc/c-family/ChangeLog: * c-common.c (get_cpp_ttype_from_string_type): New function. (g_string_concat_db): New global. (substring_loc::get_range): New method. * c-common.h (g_string_concat_db): New declaration. (class substring_loc): New class. * c-lex.c (lex_string): When concatenating strings, capture the locations of all tokens using a new obstack, and record the concatenation locations within g_string_concat_db. * c-opts.c (c_common_init_options): Construct g_string_concat_db on the ggc-heap. gcc/ChangeLog: * gcc.c (cpp_options): Rename string to... (cpp_options_): ...this, to avoid clashing with struct in cpplib.h. (static_specs): Update initialize for above renaming * input.c (string_concat::string_concat): New constructor. (string_concat_db::string_concat_db): New constructor. (string_concat_db::record_string_concatenation): New method. (string_concat_db::get_string_concatenation): New method. (string_concat_db::get_key_loc): New method. (class auto_cpp_string_vec): New class. (get_substring_ranges_for_loc): New function. (get_source_range_for_substring): New function. (get_num_source_ranges_for_substring): New function. (class selftest::lexer_test_options): New class. (struct selftest::lexer_test): New struct. (class selftest::ebcdic_execution_charset): New class. (selftest::ebcdic_execution_charset::s_singleton): New variable. (selftest::lexer_test::lexer_test): New constructor. (selftest::lexer_test::~lexer_test): New destructor. (selftest::lexer_test::get_token): New method. (selftest::assert_char_at_range): New function. (ASSERT_CHAR_AT_RANG