Hi,
It's a known issue that strings to and from UDF functions are a pain
(https://dev.mysql.com/blog-archive/a-tale-of-udfs-with-character-sets/,
amongst others).
Most recently I ran into an issue trying to create a UDF to assist with
IPv6 manipulations (in its current, not-working-as-expected form at
https://github.com/jkroonza/uls-mysql/blob/master/src/uls_inet6.c).
The two UDF functions there are intended to find the "network" (first)
and "broadcast" (last) address of a range, with the intention to add
functions to perform other manipulations (like adding some offset to an
IPv6 address, or even adding two IPv6 addresses together; eg, to find
the "next" network for 2c0f:f720::/40 you can add 0:0:100:: to
2c0f:f720::).
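To make the manipulation concrete, here's a minimal standalone sketch
(not the uls-mysql code itself; the helper name is mine) of how the
"network" address falls out of the 16-byte binary form:

```c
#include <arpa/inet.h>
#include <string.h>

/* Illustrative sketch: compute the "network" (first) address by
 * zeroing every bit beyond the prefix length in the 16-byte binary
 * form of an IPv6 address. Assumes prefix <= 128. */
void inet6_network(unsigned char addr[16], unsigned prefix)
{
    for (unsigned i = 0; i < 16; ++i) {
        unsigned keep = prefix >= 8 ? 8 : prefix; /* bits kept in this byte */
        addr[i] &= (unsigned char)(0xFF00 >> keep);
        prefix -= keep;
    }
}
```

Masking ::ffff:192.168.1.123 with prefix 120 this way yields
::ffff:192.168.1.0, matching the query output further down.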
The referenced code works great when you pass values from an inet6
column to the function (they get transformed into textual presentation
before being passed to the UDF), eg:
MariaDB [test]> select uls_inet6_network_address('::ffff:192.168.1.123',
120);
+--------------------------------------------------------+
| uls_inet6_network_address('::ffff:192.168.1.123', 120) |
+--------------------------------------------------------+
| ::ffff:192.168.1.0 |
+--------------------------------------------------------+
1 row in set (0.001 sec)
However, given this:
MariaDB [test]> create table t(t inet6 not null, primary key(t));
Query OK, 0 rows affected (0.024 sec)
MariaDB [test]> insert into t values('::ffff:192.168.1.15');
Query OK, 1 row affected (0.014 sec)
MariaDB [test]> select * from t where
uls_inet6_network_address('::ffff:192.168.1.123', 120) < t;
Empty set, 1 warning (0.009 sec)
MariaDB [test]> show warnings;
+---------+------+---------------------------------------------+
| Level | Code | Message |
+---------+------+---------------------------------------------+
| Warning | 1292 | Incorrect inet6 value: '::ffff:192.168.1.0' |
+---------+------+---------------------------------------------+
1 row in set (0.000 sec)
I started to dig, to find this:
MariaDB [test]> select
charset(uls_inet6_network_address('::ffff:192.168.1.123', 120));
+-----------------------------------------------------------------+
| charset(uls_inet6_network_address('::ffff:192.168.1.123', 120)) |
+-----------------------------------------------------------------+
| binary |
+-----------------------------------------------------------------+
1 row in set (0.000 sec)
Then I played a bit and found that if I rig it so that the returned
string has 16 characters, I don't get an error; eg:
MariaDB [test]> select * from t where
uls_inet6_network_address('::ff:f:ffff:ffff', 120) > t;
+---------------------+
| t |
+---------------------+
| ::ffff:192.168.1.15 |
+---------------------+
1 row in set (0.001 sec)
Based on this I'm inferring that:
1. Contextually the difference between BINARY() and CHAR() types is the
character set (ie, binary is implemented as a kind of character set for
strings, or perhaps strings are binaries with a specific character set).
2. The inet6 column type is, I now realize, simply BINARY(16) with a
transform applied to convert strings to and from the BINARY() format (I
think the technical term in the code is "fixed binary storage").
3. When string (character) data gets sent *to* an inet6 column, a
"character set conversion" is performed; as is the case with UDFs, the
value is already BINARY, so no conversion is actually performed, and
since the BINARY length is *not* 16 bytes, we get the above warning,
and thus comparisons behave as if against a NULL value, and are always
false.
I've been contemplating possible "solutions" to the problem without
breaking backwards compatibility, and it's tricky. The below aims at a
simple way of specifying the return "character set" of a string return
in more detail. And possibly even looking at the character sets for
parameters.
MySQL has started down a path (as per the tale above) whereby it uses
some other mechanism to give indications about character sets. I don't
like that particular mechanism, but it could work; it looks like it
only handles the character set though, in which case, for my example
above, I could just force both input and output to "latin1" so that
MySQL/MariaDB is forced to convert between that and INET6 again. I
think we can do better. I'm not convinced their change will permit
backwards compatibility - ie, any UDF that uses those functions will
require them and cannot opt to operate in a degraded mode. I'm not
sure that's even a desirable thing to have though ... so perhaps
copying what they're doing here is the way to go.
My ideas:
Option 1. Simply permit specifying a more precise string type during
function creation.
For example, instead of:
MariaDB [uls]> CREATE FUNCTION uls_inet6_last_address RETURNS STRING
SONAME 'uls_inet6.so';
Do this:
MariaDB [uls]> CREATE FUNCTION uls_inet6_last_address RETURNS INET6
SONAME 'uls_inet6.so';
And this is permitted because INET6 is a specialization of BINARY, which
is really what MySQL passes to/from UDF functions anyway.
The downside is that this will only work for returns. It would,
however, work for my use case, since from INET6 => UDF the decoder is
called anyway, so I get the text presentation on function entry, and
now I just need to return the binary form of INET6. If sent back to
the client, it gets converted by the server to the textual
presentation.
This eliminates at least *one* of the two pointless conversions to/from
binary format.
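To make that concrete, a small sketch (the helper name is mine, and
RETURNS INET6 is of course hypothetical at this point) of handing back
the raw 16-byte form via inet_pton:

```c
#include <arpa/inet.h>

/* Sketch for option 1: instead of formatting the result as text, the
 * UDF returns the raw 16-byte form - exactly what the INET6 column
 * stores internally. Returns the length the UDF would report, or -1
 * on bad input. */
int inet6_to_wire(const char *text, unsigned char out[16])
{
    return inet_pton(AF_INET6, text, out) == 1 ? 16 : -1;
}
```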
Option 2. A mechanism to pass the character set/binary type upon entry
and exit.
What I'm thinking here is a potentially non-backwards-compatible
change. Or we could use the *extension pointer somehow. I'm still
having some trouble navigating the codebase to figure out exactly how
the void *extension pointers in the UDF_ARGS and UDF_INIT structs are
used, but I'm thinking they can be (ab)used to pass an additional
structure, which may perhaps also be how option 3 is achieved?
To be more precise, the current UDF_ARGS looks like:
typedef struct UDF_ARGS {
  unsigned int arg_count;           /**< Number of arguments */
  enum Item_result *arg_type;       /**< Pointer to item_results */
  char **args;                      /**< Pointer to argument */
  unsigned long *lengths;           /**< Length of string arguments */
  char *maybe_null;                 /**< Set to 1 for all maybe_null args */
  char **attributes;                /**< Pointer to attribute name */
  unsigned long *attribute_lengths; /**< Length of attribute arguments */
  void *extension;
} UDF_ARGS;
I can't find any source files that reference both the UDF_ARGS type
and this extension field. As such I'm hoping I can safely assume that
in current implementations it will always be NULL.
If so it becomes possible to turn this into a pointer-sized version
field, or struct size field, ie something like:
union {
void *extension;
size_t argsize; /* one alternative */
long api_version; /* another alternative */
};
For argsize, this way it becomes possible to check that the field is
>= the sizeof(UDF_ARGS) against which the UDF was compiled. This
*feels* flakey, but should be fine. If a user wants more fine-grained
compile-time backwards compat, the user can check that all fields
(beyond those listed) are positioned inside the structure.
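A sketch of that check (the layout is hypothetical, and the existing
fields are elided; an old server's NULL extension simply reads back as
an argsize of 0):

```c
#include <stddef.h>

/* Hypothetical layout for the union idea: the server writes
 * sizeof(UDF_ARGS) where old servers left a NULL extension pointer. */
typedef struct UDF_ARGS_SKETCH {
    unsigned int arg_count;
    /* ... other existing fields elided ... */
    union {
        void *extension; /* NULL on an old server */
        size_t argsize;  /* sizeof of the server's UDF_ARGS otherwise */
    };
} UDF_ARGS_SKETCH;

/* The UDF checks that the server's struct is at least as large as the
 * one it was compiled against. */
int server_args_compatible(const UDF_ARGS_SKETCH *args)
{
    return args->argsize >= sizeof(UDF_ARGS_SKETCH);
}
```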
Another option is simply to have extension be a pointer to some
UDF2_ARGS struct, which is really the UDF_ARGS itself expanded, ie
something like:

typedef struct UDF2_ARGS {
    /* ... everything from above, except void *extension becomes: */
    struct UDF2_ARGS *extension;   /* points back to this struct itself */
    size_t argsize;                /* or long api_version */
    bool udf2_supported;           /* set to false by the server; if the
                                      UDF supports this, set to true */
    const char **string_arg_types; /* in _init this is passed as the raw
                                      type, eg, INET6 if the input is an
                                      INET6 field; can be set in _init to
                                      have the server cast the type, ie,
                                      perform some conversion */
    /* ... additional fields */
} UDF2_ARGS;
So in my example, if udf2_supported comes back as false after _init,
then current behaviour is maintained: INET6 is converted to string,
and return strings are assumed to be BINARY. If, however,
udf2_supported comes back true, my code could set
string_arg_types[0] = "inet6" - which the engine knows is a BINARY(16)
type - and use the INET6 conversion code to convert to and from
BINARY(16) as needed.
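To illustrate, a sketch of what an opting-in _init could look like
(all names are hypothetical, mirroring the UDF2_ARGS sketch above, and
trimmed down so it stands on its own):

```c
#include <stdbool.h>

/* Trimmed-down stand-in for the hypothetical UDF2_ARGS above. */
typedef struct UDF2_ARGS_SKETCH {
    unsigned int arg_count;
    bool udf2_supported;           /* server presets this to false */
    const char **string_arg_types; /* raw type name per argument */
} UDF2_ARGS_SKETCH;

/* _init opts in and asks the server to pass argument 0 in its raw
 * inet6 (BINARY(16)) form instead of the textual presentation. */
bool uls_inet6_init_sketch(UDF2_ARGS_SKETCH *args)
{
    if (args->arg_count < 1)
        return false;
    args->udf2_supported = true;          /* opt in to the v2 contract */
    args->string_arg_types[0] = "inet6";  /* request raw BINARY(16) */
    return true;
}
```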
Option 3. Some other, more formal UDF API versioning/API definition
scheme. Call this UDFv2, and make it extensible so that features can
be "negotiated", or so that the UDF itself can even "fail to load" in
cases where it requires support from the running server that's not
provided.
The idea here would be that a symbol like "mariadb_udf_init" is
exported and passed some structure that defines the capabilities of
the server it's running on; the module can then decline to load, or
proceed but limit itself based on the functionality available.
I'm guessing option 2 is also a way of achieving this, and may be
sufficient. I do however like the idea of having {func}_load and
{func}_unload calls (some of our PBKDF functions would also benefit
from this, in that we could load the openssl algorithms once at
startup instead of every time in _init). There may be other
complications though, but still, this could be one suggestion for v2.
I'm happy to write a more formal specification based on the informal
description above, and I'm inclined to attempt option 2. I just
honestly have no idea what's all REALLY involved, and where to find
the various bits and pieces of code. So I guess the initial steps
would be:
1. Add a _load function which is invoked at *load* time, passing
server and version information. Not sure what else could be useful;
the intention is to initialise the UDF for use as needed. UDFs that
want to maintain backwards compatibility would need to track whether
this was indeed called, and if not, act accordingly - or, if they
can't be backwards compatible without it, error out in _init.
Suggestions as to parameters to pass would be great; I suggest we use
a struct again with a size field as its first field to ensure that we
can safely add extra fields at a later stage. Eg:
typedef struct UDF_LOADINFO {
    size_t load_size;
    const char *server;
    struct {
        uint16_t major, minor, patch; /* 10, 6, 17 - for example, what
                                         I've got running currently */
        const char *extra; /* "MariaDB-log", the stuff after the dash
                              from SELECT version(); */
    } version;
} UDF_LOADINFO;
And so we will call "bool (*func_load)(const UDF_LOADINFO*)" IFF it exists.
2. Add an _unload function, "void (*func_unload)()", the purpose of
which is to clean up after _load.
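To illustrate the pair, a standalone sketch (the hooks and the struct
are part of the proposal, not an existing API; the struct is
re-declared here so the sketch compiles on its own):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the UDF_LOADINFO proposed above. */
typedef struct UDF_LOADINFO_SKETCH {
    size_t load_size;
    const char *server;
    struct { uint16_t major, minor, patch; const char *extra; } version;
} UDF_LOADINFO_SKETCH;

static bool load_called = false;

/* Called once at load time IFF exported; the load_size check lets the
 * struct grow in later server versions without breaking old UDFs. */
bool uls_inet6_load(const UDF_LOADINFO_SKETCH *info)
{
    if (info->load_size < sizeof(UDF_LOADINFO_SKETCH))
        return false;   /* server older than what we were built for */
    load_called = true; /* eg load OpenSSL algorithms once here */
    return true;
}

/* Cleans up whatever _load set up. */
void uls_inet6_unload(void)
{
    load_called = false;
}
```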
Note: It's possible to use __attribute__((constructor)) and destructor
functions, but my experience is that they're messy, and if you have
multiple functions in a UDF library you can't know which ones you're
being loaded for.
3. Expand for UDF2_ARGS as explained in option 2 above, as well as
UDF2_INIT.
4. Ability to flag functions as *deterministic* (same return for same
arguments) rather than merely const or not const. This way SQL can
optimize calls in the same manner as for stored functions. I'm not
going to dig into that, but happy to add the flag.
5. Document and flag this as UDFv2 as and when it hits formal release.
Kind regards,
Jaco
_______________________________________________
developers mailing list -- [email protected]
To unsubscribe send an email to [email protected]