Hi Mark,
On Wed, Feb 19, 2020 at 10:40 AM Mark Dilger
<[email protected]> wrote:
> This would only make sense to me if the string held in $_ had already been
> checked for safety, but Catalog.pm does very little to verify that the string
> is safe to eval. The assumption here, so far as I can infer, is that we
> don’t embed anything dangerous in our .dat files, so this should be ok. That
> may be true for the moment, but I can imagine a day when we start embedding
> perl functions as quoted text inside a data file, or shell commands as text
> which look enough like perl for eval() to be able to execute them.
> Developers who edit these .dat files and mess up the quoting, and then rerun
> ‘make’ to get the new .c and .h files generated, may not like the side
> effects. Perhaps I’m being overly paranoid….
The use case for that seems slim. However, at a brief glance your
module seems more robust in other ways.
> Rather than add more code generation logic based on the design of Catalog.pm,
> I wrote a perl based data file parser that parses .dat files and returns
> vivified perl data, as Catalog.pm does, but with much stricter parsing logic
> to make certain nothing dangerous gets eval()ed. I put the new module in
> DataFile.pm.
> [...]
> The new parser is more flexible about the structure of the data, which seems
> good to me for making it easier to add or modify data files in the future.
> The new parser does not yet have a means of hacking up the data to add
> autogenerated fields and such that Catalog.pm does, but I think a more clean
> break between parsing and autovivifying fields would be good anyway.
Separation of concerns sounds like a good idea, but I've not fully
thought it through. One advantage: it might be nicer to have
indexing.dat and toasting.dat than to dig that info out of C macros.
That became evident while recently experimenting with generating
catcache control data.
As for the patch, I have not done a full review, but I have some
comments based on a light read-through:
utils/Makefile:
+# location of commandtag.dat
+headerdir = $(top_srcdir)/src/include/utils
This variable name is too generic for what the comment says it is. A
better abstraction, if we want one, would be the full path to the
commandtag input file. The other script invocations in this Makefile
don't do it this way, but that's a separate patch.
+# location to write generated headers
+sourcedir = $(top_srcdir)/src/backend/utils
Calling the output the source is bound to confuse people. The comment
implies all generated headers, not just the ones introduced by the
patch. I would just output to the current directory (i.e. have an
--output option with a default empty string). Also, if we want to
output somewhere else, I would imagine it'd be under the top builddir,
not srcdir.
+$(PERL) -I $(top_srcdir)/src/include/utils $<
--headerdir=$(headerdir) --sourcedir=$(sourcedir)
--inputfile=$(headerdir)/commandtag.dat
1. headerdir is entirely unused by the script
2. We can default to working dir for the output as mentioned above
3. -I $(top_srcdir)/src/include/utils is supposed to point to the dir
containing DataFile.pm, but since gencommandtag.pl has "use lib..."
it's probably not needed here. I don't recall why we keep the "-I"
elsewhere. (ditto in Solution.pm)
I'm thinking it would look something like this:
+$(PERL) $< --inputfile=$(top_srcdir)/src/include/utils/commandtag.dat
--
utils/misc/Makefile
+all: distprep
+
# Note: guc-file.c is not deleted by 'make clean',
# since we want to ship it in distribution tarballs.
clean:
@rm -f lex.yy.c
+
+maintainer-clean: clean
Seems non-functional.
--
DataFile.pm
I haven't studied this in detail, but I'll suggest that if this is
meant to have wider application, maybe it should live in src/tools ?
I'm not familiar with using different IO routines depending on the OS
-- what's the benefit of that?
--
gencommandtag.pl
slurp_without_comments() is unused.
sanity_check_data() seems longer than the main body of the script
(minus header boilerplate), and I wonder if we can pare it down some.
For one, I have difficulty imagining anyone would accidentally type an
unprintable or non-ascii character in a command tag and somehow not
realize it. For another, duplicating checks that were done earlier
seems like a maintenance headache.
dataerror() is defined near the top, but other functions are defined
at the bottom.
+# Generate all output internally before outputting anything, to avoid
+# partially overwriting generated files under error conditions
My personal preference is that having this as a design requirement
sacrifices readability for unclear gain, especially since a "chunk"
also includes things like header boilerplate. That said, the script is
also short enough that it doesn't make a huge difference either way.
Speaking of boilerplate, it's better for readability to separate that
from actual code such as:
typedef enum CommandTag
{
#define FIRST_COMMANDTAG COMMANDTAG_$sorted[0]->{taglabel})
--
tcop/dest.c
+ * We no longer display LastOid, but to preserve the wire protocol,
+ * we write InvalidOid where the LastOid used to be written. For
+ * efficiency in the snprintf(), hard-code InvalidOid as zero.
Hmm, is hard-coding zero going to make any difference here?
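For what it's worth, both strategies produce byte-identical tags, so the only possible gain is skipping one %u conversion inside snprintf(). A quick sketch of the equivalence (shell printf standing in for the C call):

```shell
# InvalidOid is 0; formatting it explicitly vs. hard-coding "0" in the
# format string yields the same completion tag on the wire.
hardcoded=$(printf 'INSERT 0 %u' 5)
formatted=$(printf 'INSERT %u %u' 0 5)
[ "$hardcoded" = "$formatted" ] && echo "identical: $hardcoded"
# prints: identical: INSERT 0 5
```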
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services