Re: [HACKERS] custom function for converting human readable sizes to bytes

Vitaly Burovoy Thu, 21 Jan 2016 22:03:46 -0800

On 1/20/16, Pavel Stehule <[email protected]> wrote:
> ...
> New version is attached
>
> Regards
> Pavel


Hello!

1. Why the function is marked as VOLATILE? It depends only on input
value, it does change nothing in the DB.
I guess it should be IMMUTABLE for efficient caching its result.

2.
>       text            *arg = PG_GETARG_TEXT_PP(0);
...
>       str = text_to_cstring(arg);
>       
Hmm... My question was _why_ you get TEXT and convert it instead of
getting an argument of type cstring (not about usage of
VARSIZE_ANY_EXHDR).
It was enough to give me a link[1] and leave the usage of
VARSIZE_ANY_EXHDR as is.



I think all the other claims below are mostly cosmetic. Main behavior
is OK (considering its usage will not be in heavy queries).

===
Documentation:
Besides currently added row in a table I think it is worth to add
detailed comment after a block "<function>pg_size_pretty</> can be
used" similar to (but the best way is to get advice for a correct
phrase):
"pg_size_bytes can be used to get a size in bytes as bigint from a
human-readable format for a comparison purposes (it is the opposite to
pg_size_pretty function). The format is numeric with an optional size
unit: kB, MB, GB or TB."
(and delete unit sizes from the table row).

===
Tests:
Since there was a letter with an explanation why units bigger than
"TB" can't be used[2], it is a good reason to add that size units
("PB", "EB", "ZB") in tests with a note to update the documentation
part when that unit are supported.

===
Code style:
1. When you declare pointers, you must align _names_ by the left
border, i.e. asterisks must be at one position to the left from the
aligned names, i.e. one to three spaces instead of TAB before them.

2.
> int           multiplier;
One more TAB to align it with the name at the next line.

===
Code:

> /* allow whitespace between integer and unit */
May be "numeric" instead of "integer"?

> "\"%s\" is not in expected format"
It is not clear what format is expected.

> /* ignore plus symbol */
It seems it is not ignored, but copied... ;-)

> to ger hintstr
s/ger/get/
> Elsewhere is used space trimmed buffer.
I'd write it as "Otherwise trimmed value is used."
I'm not good at English, but that full block looks little odd for me.
I tried to reword it in the attachment single_buffer.c, but I don't
think it is enough.

> We allow ending spaces.
"trailing"?

> but this message can be little bit no intuitive - better text is "is not a 
> valid number"
It was a block with a comment "don't allow empty string", i.e. absence
of a number...

> Automatic memory deallocation doesn't cover all possible situations where
> the function can be used - for example DirectFunctionCall - so explicit
> deallocation can descrease a memory requirements when you call these
> functions from C.
DirectFunctionCall uses a memory context of the caller. For example,
you don't free "num" allocated by numeric_in and numeric_mul, also
mul_num allocated by int8_numeric...
I'd ask experienced hackers for importance of freeing (or leaving)
"str" and "buffer".

> postgres=# select pg_size_bytes('+912+ kB');
> ERROR:  invalid unit: "+ kB"
> This is difficult - you have to divide string to two parts and first world is 
> number, second world is unit.
Your code looks like a duplicated (not by lines, but by behavior).
numeric_in has similar checks, let it do them for you. The goal of
your function is just split mantissa and size unit, and if the last
one is found, turn it into an exponent.
See my example in the attachment check_mantissa_by_numeric_in.c. There
is just searching non-space char that can't be part of numeric. Then
all before it passes to numeric_in and let it check anything it
should. I even left first spaces to show full numeric part to a user
if an error occurs (for any case).
The second attachment single_buffer.c is an attempt to avoid
allocating the second buffer (the first is "str" allocated by
text_to_cstring) and copying into it. I don't think it gives a big
improvement, but nevertheless.

===
[1] http://www.postgresql.org/message-id/[email protected]
[2] 
http://www.postgresql.org/message-id/CAB7nPqS6Wob4WnZb=dhb3o0pc-nx1v3xjszkn3z9kbexgcq...@mail.gmail.com

-- 
Best regards,
Vitaly Burovoy

/*
 * Convert human readable size to long int.
 *
 * Due to support for decimal values and case insensitivity of units
 * a function parse_int cannot be used.
 */
Datum
pg_size_bytes(PG_FUNCTION_ARGS)
{
	text	   *arg = PG_GETARG_TEXT_PP(0);
	char	   *str,
			   *strptr;
	char	   *buffer,
			   *bufptr;
	Numeric		num;
	int64		result;
	size_t		len;

	str = text_to_cstring(arg);
	strptr = str;

	/* Skip leading spaces */
	while (isspace((unsigned char) *strptr))
		strptr++;

	bufptr = strptr;
	for(;*strptr != '\0'; strptr++)
	{
		if (isdigit((unsigned char) *strptr) ||
			isspace((unsigned char) *strptr) ||
			*strptr == '.' || *strptr == '+' || *strptr == '-')
			continue;

		break;
	}

	/* don't allow empty string */
	if (bufptr == strptr)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("\"%s\" is not a number", str)));

	/* prepare buffer for parts of parsed input string */
	buffer = (char *) palloc(VARSIZE_ANY_EXHDR(arg) + 1);

	len = strptr - str;
	memcpy(buffer, str, len);
	buffer[len] = '\0';

	num = DatumGetNumeric(DirectFunctionCall3(numeric_in,
							CStringGetDatum(buffer), 0, -1));

	/* Handle possible unit */
	if (*strptr != '\0')
	{
		int			multiplier;
		Numeric		mul_num;
		const char *hintmsg;
		const char *unitstr;

		bufptr = buffer;

		/*
		 * unitstr holds complete string related to unit part. Used
		 * as part of error message. The parse_memory_unit is called
		 * with this value, when we detected invalid format, and we
		 * would to emit error result to ger hintstr. Elsewhere is
		 * used space trimmed buffer.
		 */
		unitstr = strptr;

		/* copy chars to buffer and stop on first space or end string */
		while (*strptr != '\0' && !isspace((unsigned char) *strptr))
			*bufptr++ = *strptr++;

		*bufptr = '\0';

		/* We allow ending spaces. Skip all spaces. */
		while (isspace((unsigned char) *strptr))
			strptr++;

		/* Use buffer as unitstr, when we didn't find more words. */
		if (*strptr == '\0')
			unitstr = buffer;

		/* Still, the unit can be invalid: too long, unknown */
		if (!parse_memory_unit(unitstr, &multiplier, &hintmsg))
			ereport(ERROR,
					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
					 errmsg("invalid unit: \"%s\"", unitstr),
					 errhint("%s", _(hintmsg))));

		/*
		 * Now, the multiplier is in KB units. It should be multiplied by 1024
		 * before usage.
		 */
		mul_num = DatumGetNumeric(DirectFunctionCall1(int8_numeric,
						    Int64GetDatum(multiplier * 1024L)));

		num = DatumGetNumeric(DirectFunctionCall2(numeric_mul,
							    NumericGetDatum(mul_num),
							    NumericGetDatum(num)));
	}

	result = DatumGetInt64(DirectFunctionCall1(numeric_int8, NumericGetDatum(num)));

	PG_RETURN_INT64(result);
}

/*
 * Convert human readable size to long int.
 *
 * Due to support for decimal values and case insensitivity of units
 * a function parse_int cannot be used.
 */
Datum
pg_size_bytes(PG_FUNCTION_ARGS)
{
	text	   *arg = PG_GETARG_TEXT_PP(0);
	char	   *str,
			   *strptr;
	Numeric		num;
	int64		result;
	char		savedchar;
	char	   *tmpptr;

	str = text_to_cstring(arg);
	strptr = str;

	/* Skip leading spaces */
	while (isspace((unsigned char) *strptr))
		strptr++;

	/* The first non-space char to check for emptiness later */
	tmpptr = strptr;
	for(;*strptr != '\0'; strptr++)
	{
		if (isdigit((unsigned char) *strptr) ||
			isspace((unsigned char) *strptr) ||
			*strptr == '.' || *strptr == '+' || *strptr == '-')
			continue;

		break;
	}

	/* don't allow empty string */
	if (tmpptr == strptr)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("\"%s\" is not a number", str)));

	/*
	 * Save the first char after possible number with space chars after it
	 * and set end-of-string mark to get cstring for "numeric_in".
	 * If there is no unit part in the arg then savedchar is '\0'.
	 */
	savedchar = *strptr;
	*strptr = '\0';

	num = DatumGetNumeric(DirectFunctionCall3(numeric_in,
							CStringGetDatum(str), 0, -1));

	if (savedchar != '\0')
	{
		int			multiplier;
		Numeric		mul_num;
		const char *hintmsg;
		const char *unitstr;

		/*
		 * unitstr points to a complete string related to a unit part. It is
		 * used as a part of an error message. The parse_memory_unit is called
		 * with this value, even if invalid format is detected to get hintstr
		 * for the error message. Otherwise trimmed value is used.
		 */
		unitstr = strptr;
		*strptr = savedchar;  /* Restore value which was replaced by '\0' */

		while (*strptr != '\0' && !isspace((unsigned char) *strptr))
			strptr++;

		tmpptr = strptr;  /* The first space char to set there '\0' later */
		while (isspace((unsigned char) *strptr))
			strptr++;

		/* If there are only spaces after the unit part, trim them. */
		if (*strptr == '\0')
			*tmpptr = '\0';

		/* Still, the unit can be invalid: too long, unknown */
		if (!parse_memory_unit(unitstr, &multiplier, &hintmsg))
			ereport(ERROR,
					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
					 errmsg("invalid unit: \"%s\"", unitstr),
					 errhint("%s", _(hintmsg))));

		/*
		 * Now, the multiplier is in KB units. It should be multiplied by 1024
		 * before usage.
		 */
		mul_num = DatumGetNumeric(DirectFunctionCall1(int8_numeric,
						    Int64GetDatum(multiplier * 1024L)));

		num = DatumGetNumeric(DirectFunctionCall2(numeric_mul,
							    NumericGetDatum(mul_num),
							    NumericGetDatum(num)));
	}

	result = DatumGetInt64(DirectFunctionCall1(numeric_int8, NumericGetDatum(num)));

	PG_RETURN_INT64(result);
}

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] custom function for converting human readable sizes to bytes

Reply via email to