[Geany-devel] [CODE REVIEW] Changes to encodings

Matthew Brush Sun, 15 May 2011 20:49:22 -0700

Hi all,

I was wondering if anyone had time to review some changes I've been working on?


There are two commits involved (patches attached and linked below):

1) Allow embedded NUL chars in files (without truncating).
2) Add more regex patterns for guessing encoding.

The first one is fairly large and affects several files and functions. One thing I did that should change: in a few places I added a 'gsize tmp_size' variable to pass to the encodings_convert_to_utf8* functions where a simple NULL size parameter would have sufficed. In those cases the string is known to be NUL-terminated, so the parameter isn't required. Unfortunately I made another commit after this one, so the fix will come afterwards (though it's not such a big deal).

Also, I'm not completely satisfied with the name of the sci_add_text2() function added to the sciwrappers files, but there was already a function named sci_add_text(), which truncates the text at the first NUL, and the new function is a more direct wrapper of the SCI_ADDTEXT message.

These changes break the plugin API and would require two small changes to GeanyVC in geany-plugins (geanyvc.c:480 and geanyvc.c:498) as well as one to GeanyGenDoc (ggd-plugin.c:343). I hope it's obvious why this breakage was required.
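To illustrate the size-parameter convention described above, here is a minimal sketch in plain C. The function dup_with_size() is a hypothetical stand-in, not code from the patch: it only copies bytes, but it follows the same rule that a NULL size pointer means "NUL-terminated input", while a non-NULL pointer carries the input length in and the output length out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the encodings_convert_to_utf8*() calling
 * convention: size == NULL means buf is NUL-terminated; otherwise *size
 * is the input length on entry and the output length on return. */
static char *dup_with_size(const char *buf, size_t *size)
{
	size_t in_len;
	char *out;

	assert(buf != NULL);

	in_len = (size == NULL) ? strlen(buf) : *size;

	out = malloc(in_len + 1);
	if (out == NULL)
		return NULL;
	memcpy(out, buf, in_len);	/* copies embedded NULs too */
	out[in_len] = '\0';

	if (size != NULL)
		*size = in_len;	/* report the output size to the caller */
	return out;
}
```

With this shape, a caller holding a known NUL-terminated string can simply pass NULL and needs no temporary size variable.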

The following bug reports are related to the first set of changes: [1][2][3][4]. A few of the comments seem to suggest that supporting text files containing NUL control bytes is not desired, but IMO it's nice to be able to show these anyway, especially since Scintilla has a nice way of dealing with them. As long as the file gets marked read-only, displaying as much as possible is better than truncating at the first NUL byte.
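As a rough sketch of why an explicit length matters here (plain C, with a hypothetical helper name that is not part of the patch): strlen() stops at the first NUL, whereas a length-bounded scan with memchr() sees the whole buffer and never reads past the given size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the first `size` bytes of buf contain
 * a NUL byte, 0 otherwise. Only `size` bytes are examined, so buf does not
 * need to be NUL-terminated. */
static int has_embedded_nul(const char *buf, size_t size)
{
	assert(buf != NULL);
	return memchr(buf, '\0', size) != NULL;
}
```

For the five-byte buffer "ab\0cd", strlen() reports 2, while has_embedded_nul() with size 5 reports that a NUL is present; a check along these lines could be what decides whether a file gets marked read-only.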

The second one is a simple refactoring of the code around the regex-based encoding detection, adding a few more patterns to detect a few more types of files. The actual pattern strings especially need review since, quite frankly, I suck at regexps. One thing I noticed when looking at the bug report here [5], after creating these patches/commits, is that <?xml ... encoding="" ?> seems to be already supported by the CODING regex, so you can ignore the new XML one I added if that truly is the case (I'll remove it from the proper patch I send in the future). Like I said, my regex skills aren't great :)
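The observation about the CODING regex can be checked with a small standalone sketch using POSIX <regex.h>. The helper guess_charset() is hypothetical and not from the patch, and compiling with REG_EXTENDED | REG_ICASE is an assumption about how the patterns are compiled; the pattern string itself is the CODING one from the patch below.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd copy of the charset captured by
 * the CODING pattern, or NULL if the text does not match. */
static char *guess_charset(const char *text)
{
	static const char *pattern =
		"coding[\t ]*[:=][\t ]*\"?([a-z0-9-]+)\"?[\t ]*";
	regex_t preg;
	regmatch_t m[2];
	char *charset = NULL;

	assert(text != NULL);

	/* Assumption: extended syntax, case-insensitive matching */
	if (regcomp(&preg, pattern, REG_EXTENDED | REG_ICASE) != 0)
		return NULL;

	/* m[1] is the first capture group, i.e. the charset name */
	if (regexec(&preg, text, 2, m, 0) == 0 && m[1].rm_so >= 0)
	{
		size_t len = (size_t) (m[1].rm_eo - m[1].rm_so);

		charset = malloc(len + 1);
		if (charset != NULL)
		{
			memcpy(charset, text + m[1].rm_so, len);
			charset[len] = '\0';
		}
	}
	regfree(&preg);
	return charset;
}
```

Because "coding" is a substring of "encoding", this pattern appears to pick up the charset from an XML declaration as well as from a "coding: utf-8" style comment.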

If you want to use the GitHub UI to review the commits, see [6][7]; otherwise, patches are attached. If you have a GitHub account, you can leave comments on the commits and/or specific lines of the commits.

Any feedback would be much appreciated, as I plan to submit these changes as proper patches once I've ensured they are good enough.


In the meantime, I will continue testing...

Cheers,
Matthew Brush

[1] https://sourceforge.net/tracker/index.php?func=detail&aid=3232135&group_id=153444&atid=787791
[2] https://sourceforge.net/tracker/index.php?func=detail&aid=2893180&group_id=153444&atid=787791
[3] https://sourceforge.net/tracker/index.php?func=detail&aid=3232135&group_id=153444&atid=787791
[4] https://sourceforge.net/tracker/index.php?func=detail&aid=2695290&group_id=153444&atid=787791
[5] https://sourceforge.net/tracker/index.php?func=detail&aid=3183506&group_id=153444&atid=787791
[6] https://github.com/codebrainz/geany/commit/81c89cce736c88f235761ce5765f2361fe2bedc8
[7] https://github.com/codebrainz/geany/commit/5dec6db43b498378be3496d4e8949683544c1044

From 81c89cce736c88f235761ce5765f2361fe2bedc8 Mon Sep 17 00:00:00 2001
From: Matthew Brush <[email protected]>
Date: Sun, 15 May 2011 17:48:42 -0700
Subject: [PATCH 1/2] Make encoding not care about embedded NUL control chars.

Avoid using strlen() in encodings.c and never assume a string
is NUL terminated.  As an added bonus, this probably improves
speed somewhat.

Make size argument to encodings_convert_to_utf8*() functions
take a pointer to a gsize so that we can store the size after
conversion.

Add a new function to validate utf8 with embedded NUL chars
which is a wrapper around g_utf8_validate().

Remove unused members from BufferData struct.

Add new function to directly wrap SCI_ADDTEXT by supporting a
length argument so that non-NUL-terminated text can be added
to the Scintilla widget.

Update other code that uses the modifications.
---
 src/document.c    |   14 +++---
 src/encodings.c   |  154 ++++++++++++++++++++++++++++++----------------------
 src/encodings.h   |    4 +-
 src/plugindata.h  |    4 +-
 src/sciwrappers.c |    9 +++
 src/sciwrappers.h |    1 +
 src/socket.c      |    5 +-
 src/symbols.c     |    6 ++-
 8 files changed, 117 insertions(+), 80 deletions(-)

diff --git a/src/document.c b/src/document.c
index 971936b..123862c 100644
--- a/src/document.c
+++ b/src/document.c
@@ -819,9 +819,9 @@ GeanyDocument *document_open_file(const gchar *locale_filename, gboolean readonl
 
 typedef struct
 {
-	gchar		*data;	/* null-terminated file data */
-	gsize		 len;	/* string length of data */
-	gchar		*enc;
+	gchar		*data;	/* file data */
+	gsize		 len;	/* size of data */
+	gchar		*enc;	/* charset name */
 	gboolean	 bom;
 	time_t		 mtime;	/* modification time, read by stat::st_mtime */
 	gboolean	 readonly;
@@ -850,14 +850,13 @@ static gboolean load_text_file(const gchar *locale_filename, const gchar *displa
 
 	filedata->mtime = st.st_mtime;
 
-	if (! g_file_get_contents(locale_filename, &filedata->data, NULL, &err))
+	if (! g_file_get_contents(locale_filename, &filedata->data, &filedata->len, &err))
 	{
 		ui_set_statusbar(TRUE, "%s", err->message);
 		g_error_free(err);
 		return FALSE;
 	}
 
-	filedata->len = (gsize) st.st_size;
 	if (! encodings_convert_to_utf8_auto(&filedata->data, &filedata->len, forced_enc,
 				&filedata->enc, &filedata->bom, &filedata->readonly))
 	{
@@ -879,7 +878,7 @@ static gboolean load_text_file(const gchar *locale_filename, const gchar *displa
 	if (filedata->readonly)
 	{
 		const gchar *warn_msg = _(
-			"The file \"%s\" could not be opened properly and has been truncated. " \
+			"The file \"%s\" could not be opened properly and may have been truncated. " \
 			"This can occur if the file contains a NULL byte. " \
 			"Be aware that saving it can cause data loss.\nThe file was set to read-only.");
 
@@ -1200,7 +1199,8 @@ GeanyDocument *document_open_file_full(GeanyDocument *doc, const gchar *filename
 
 		/* add the text to the ScintillaObject */
 		sci_set_readonly(doc->editor->sci, FALSE);	/* to allow replacing text */
-		sci_set_text(doc->editor->sci, filedata.data);	/* NULL terminated data */
+		sci_clear_all(doc->editor->sci);
+		sci_add_text2(doc->editor->sci, filedata.data, filedata.len);
 		queue_colourise(doc);	/* Ensure the document gets colourised. */
 
 		/* detect & set line endings */
diff --git a/src/encodings.c b/src/encodings.c
index cebc822..4af12ef 100644
--- a/src/encodings.c
+++ b/src/encodings.c
@@ -507,38 +507,76 @@ void encodings_init(void)
 }
 
 
+/* Use g_utf8_validate() but allow NUL control chars embedded in the string */
+static gboolean encodings_utf8_validate_with_nulls(const gchar *buffer, gsize size)
+{
+	const gchar *ptr, *end;
+	gsize test_size;
+
+	g_return_val_if_fail(buffer != NULL, FALSE);
+
+	if (size == -1)
+		return g_utf8_validate(buffer, size, NULL);
+
+	ptr = buffer;
+	test_size = size;
+
+	while (test_size > 0 && !g_utf8_validate(ptr, test_size, &end))
+	{
+		if (*end == '\0')
+		{
+			ptr = ++end;
+			test_size = size - (ptr - buffer);
+		}
+		else
+			return FALSE;
+	}
+
+	return TRUE;
+}
+
+
 /**
  *  Tries to convert @a buffer into UTF-8 encoding from the encoding specified with @a charset.
  *  If @a fast is not set, additional checks to validate the converted string are performed.
  *
  *  @param buffer The input string to convert.
- *  @param size The length of the string, or -1 if the string is nul-terminated.
+ *  @param size Pointer to the length of the string, pointing to an initial value of -1 if
+ * 		@a buffer is a nul-terminated string.  Will be updated with the converted size of the buffer.
  *  @param charset The charset to be used for conversion.
  *  @param fast @c TRUE to only convert the input and skip extended checks on the converted string.
  *
  *  @return If the conversion was successful, a newly allocated nul-terminated string,
  *    which must be freed with @c g_free(). Otherwise @c NULL.
  **/
-gchar *encodings_convert_to_utf8_from_charset(const gchar *buffer, gsize size,
+gchar *encodings_convert_to_utf8_from_charset(const gchar *buffer, gsize *size,
 											  const gchar *charset, gboolean fast)
 {
 	gchar *utf8_content = NULL;
 	GError *conv_error = NULL;
 	gchar* converted_contents = NULL;
-	gsize bytes_written;
+	gsize input_bytes, output_bytes;
 
 	g_return_val_if_fail(buffer != NULL, NULL);
 	g_return_val_if_fail(charset != NULL, NULL);
 
-	converted_contents = g_convert(buffer, size, "UTF-8", charset, NULL,
-								   &bytes_written, &conv_error);
+	if (size == NULL)
+		input_bytes = -1;
+	else
+		input_bytes = *size;
+
+	converted_contents = g_convert_with_fallback(buffer, input_bytes, "UTF-8",
+							charset, NULL, NULL, &output_bytes, &conv_error);
 
 	if (fast)
 	{
 		utf8_content = converted_contents;
+		if (size != NULL)
+			*size = output_bytes;
 		if (conv_error != NULL) g_error_free(conv_error);
 	}
-	else if (conv_error != NULL || ! g_utf8_validate(converted_contents, bytes_written, NULL))
+	else if (conv_error != NULL ||
+		! encodings_utf8_validate_with_nulls(converted_contents, output_bytes))
 	{
 		if (conv_error != NULL)
 		{
@@ -556,6 +594,8 @@ gchar *encodings_convert_to_utf8_from_charset(const gchar *buffer, gsize size,
 	{
 		geany_debug("Converted from %s to UTF-8.", charset);
 		utf8_content = converted_contents;
+		if (size != NULL)
+			*size = output_bytes;
 	}
 
 	return utf8_content;
@@ -577,7 +617,7 @@ static gchar *encodings_check_regexes(const gchar *buffer, gsize size)
 }
 
 
-static gchar *encodings_convert_to_utf8_with_suggestion(const gchar *buffer, gsize size,
+static gchar *encodings_convert_to_utf8_with_suggestion(const gchar *buffer, gsize *size,
 		const gchar *suggested_charset, gchar **used_encoding)
 {
 	const gchar *locale_charset = NULL;
@@ -586,11 +626,12 @@ static gchar *encodings_convert_to_utf8_with_suggestion(const gchar *buffer, gsi
 	gboolean check_suggestion = suggested_charset != NULL;
 	gboolean check_locale = FALSE;
 	gint i, preferred_charset;
+	gsize buffer_size;
 
-	if ((gint)size == -1)
-	{
-		size = strlen(buffer);
-	}
+	if (size == NULL)
+		buffer_size = -1;
+	else
+		buffer_size = *size;
 
 	/* current locale is not UTF-8, we have to check this charset */
 	check_locale = ! g_get_charset(&locale_charset);
@@ -644,11 +685,13 @@ static gchar *encodings_convert_to_utf8_with_suggestion(const gchar *buffer, gsi
 			continue;
 
 		geany_debug("Trying to convert %" G_GSIZE_FORMAT " bytes of data from %s into UTF-8.",
-			size, charset);
-		utf8_content = encodings_convert_to_utf8_from_charset(buffer, size, charset, FALSE);
+			buffer_size, charset);
+		utf8_content = encodings_convert_to_utf8_from_charset(buffer, &buffer_size, charset, FALSE);
 
 		if (G_LIKELY(utf8_content != NULL))
 		{
+			if (size != NULL)
+				*size = buffer_size;
 			if (used_encoding != NULL)
 			{
 				if (G_UNLIKELY(*used_encoding != NULL))
@@ -671,22 +714,32 @@ static gchar *encodings_convert_to_utf8_with_suggestion(const gchar *buffer, gsi
  *  @a used_encoding.
  *
  *  @param buffer the input string to convert.
- *  @param size the length of the string, or -1 if the string is nul-terminated.
+ *  @param size Pointer to the length of the string, pointing to an initial value of -1 if
+ * 		@a buffer is a nul-terminated string.  Will be updated with the converted size of the buffer.
  *  @param used_encoding return location of the detected encoding of the input string, or @c NULL.
  *
  *  @return If the conversion was successful, a newly allocated nul-terminated string,
  *    which must be freed with @c g_free(). Otherwise @c NULL.
  **/
-gchar *encodings_convert_to_utf8(const gchar *buffer, gsize size, gchar **used_encoding)
+gchar *encodings_convert_to_utf8(const gchar *buffer, gsize *size, gchar **used_encoding)
 {
 	gchar *regex_charset;
 	gchar *utf8;
+	gsize buffer_size;
+
+	if (size == NULL)
+		buffer_size = -1;
+	else
+		buffer_size = *size;
 
 	/* first try to read the encoding from the file content */
-	regex_charset = encodings_check_regexes(buffer, size);
-	utf8 = encodings_convert_to_utf8_with_suggestion(buffer, size, regex_charset, used_encoding);
+	regex_charset = encodings_check_regexes(buffer, buffer_size);
+	utf8 = encodings_convert_to_utf8_with_suggestion(buffer, &buffer_size, regex_charset, used_encoding);
 	g_free(regex_charset);
 
+	if (size != NULL)
+		*size = buffer_size;
+
 	return utf8;
 }
 
@@ -763,10 +816,8 @@ typedef struct
 {
 	gchar		*data;	/* null-terminated data */
 	gsize		 size;	/* actual data size */
-	gsize		 len;	/* string length of data */
 	gchar		*enc;
 	gboolean	 bom;
-	gboolean	 partial;
 } BufferData;
 
 
@@ -776,27 +827,13 @@ handle_forced_encoding(BufferData *buffer, const gchar *forced_enc)
 {
 	GeanyEncodingIndex enc_idx;
 
-	if (utils_str_equal(forced_enc, "UTF-8"))
-	{
-		if (! g_utf8_validate(buffer->data, buffer->len, NULL))
-		{
-			return FALSE;
-		}
-	}
+	gchar *converted_text = encodings_convert_to_utf8_from_charset(
+										buffer->data, &buffer->size, forced_enc, FALSE);
+	if (converted_text == NULL)
+		return FALSE;
 	else
-	{
-		gchar *converted_text = encodings_convert_to_utf8_from_charset(
-										buffer->data, buffer->size, forced_enc, FALSE);
-		if (converted_text == NULL)
-		{
-			return FALSE;
-		}
-		else
-		{
-			setptr(buffer->data, converted_text);
-			buffer->len = strlen(converted_text);
-		}
-	}
+		setptr(buffer->data, converted_text);
+
 	enc_idx = encodings_scan_unicode_bom(buffer->data, buffer->size, NULL);
 	buffer->bom = (enc_idx == GEANY_ENCODING_UTF_8);
 	buffer->enc = g_strdup(forced_enc);
@@ -828,11 +865,10 @@ handle_encoding(BufferData *buffer, GeanyEncodingIndex enc_idx)
 			if (enc_idx != GEANY_ENCODING_UTF_8) /* the BOM indicated something else than UTF-8 */
 			{
 				gchar *converted_text = encodings_convert_to_utf8_from_charset(
-										buffer->data, buffer->size, buffer->enc, FALSE);
+										buffer->data, &buffer->size, buffer->enc, FALSE);
 				if (converted_text != NULL)
 				{
 					setptr(buffer->data, converted_text);
-					buffer->len = strlen(converted_text);
 				}
 				else
 				{
@@ -850,7 +886,7 @@ handle_encoding(BufferData *buffer, GeanyEncodingIndex enc_idx)
 
 			/* try UTF-8 first */
 			if (encodings_get_idx_from_charset(regex_charset) == GEANY_ENCODING_UTF_8 &&
-				(buffer->size == buffer->len) && g_utf8_validate(buffer->data, buffer->len, NULL))
+				encodings_utf8_validate_with_nulls(buffer->data, buffer->size))
 			{
 				buffer->enc = g_strdup("UTF-8");
 			}
@@ -858,7 +894,7 @@ handle_encoding(BufferData *buffer, GeanyEncodingIndex enc_idx)
 			{
 				/* detect the encoding */
 				gchar *converted_text = encodings_convert_to_utf8_with_suggestion(buffer->data,
-					buffer->size, regex_charset, &buffer->enc);
+					&buffer->size, regex_charset, &buffer->enc);
 
 				if (converted_text == NULL)
 				{
@@ -866,7 +902,6 @@ handle_encoding(BufferData *buffer, GeanyEncodingIndex enc_idx)
 					return FALSE;
 				}
 				setptr(buffer->data, converted_text);
-				buffer->len = strlen(converted_text);
 			}
 			g_free(regex_charset);
 		}
@@ -881,13 +916,15 @@ handle_bom(BufferData *buffer)
 	guint bom_len;
 
 	encodings_scan_unicode_bom(buffer->data, buffer->size, &bom_len);
-	g_return_if_fail(bom_len != 0);
 
-	/* use filedata->len here because the contents are already converted into UTF-8 */
-	buffer->len -= bom_len;
+	if (bom_len == 0)
+		return;
+
+	/* use filedata->size here because the contents are already converted into UTF-8 */
+	buffer->size -= bom_len;
 	/* overwrite the BOM with the remainder of the file contents, plus the NULL terminator. */
-	g_memmove(buffer->data, buffer->data + bom_len, buffer->len + 1);
-	buffer->data = g_realloc(buffer->data, buffer->len + 1);
+	g_memmove(buffer->data, buffer->data + bom_len, buffer->size);
+	buffer->data = g_realloc(buffer->data, buffer->size);
 }
 
 
@@ -900,21 +937,11 @@ static gboolean handle_buffer(BufferData *buffer, const gchar *forced_enc)
 	 * if we have a BOM */
 	tmp_enc_idx = encodings_scan_unicode_bom(buffer->data, buffer->size, NULL);
 
-	/* check whether the size of the loaded data is equal to the size of the file in the
-	 * filesystem file size may be 0 to allow opening files in /proc/ which have typically a
-	 * file size of 0 bytes */
-	if (buffer->len != buffer->size && buffer->size != 0 && (
-		tmp_enc_idx == GEANY_ENCODING_UTF_8 || /* tmp_enc_idx can be UTF-7/8/16/32, UCS and None */
-		tmp_enc_idx == GEANY_ENCODING_UTF_7))  /* filter UTF-7/8 where no NULL bytes are allowed */
-	{
-		buffer->partial = TRUE;
-	}
-
 	/* Determine character encoding and convert to UTF-8 */
 	if (forced_enc != NULL)
 	{
 		/* the encoding should be ignored(requested by user), so open the file "as it is" */
-		if (utils_str_equal(forced_enc, encodings[GEANY_ENCODING_NONE].charset))
+		if (encodings_charset_equals(forced_enc, encodings[GEANY_ENCODING_NONE].charset))
 		{
 			buffer->bom = FALSE;
 			buffer->enc = g_strdup(encodings[GEANY_ENCODING_NONE].charset);
@@ -957,22 +984,19 @@ gboolean encodings_convert_to_utf8_auto(gchar **buf, gsize *size, const gchar *f
 
 	buffer.data = *buf;
 	buffer.size = *size;
-	/* use strlen to check for null chars */
-	buffer.len = strlen(buffer.data);
 	buffer.enc = NULL;
 	buffer.bom = FALSE;
-	buffer.partial = FALSE;
 
 	if (! handle_buffer(&buffer, forced_enc))
 		return FALSE;
 
-	*size = buffer.len;
+	*size = buffer.size;
 	if (used_encoding)
 		*used_encoding = buffer.enc;
 	if (has_bom)
 		*has_bom = buffer.bom;
 	if (partial)
-		*partial = buffer.partial;
+		*partial = strlen(buffer.data) != buffer.size;
 
 	*buf = buffer.data;
 	return TRUE;
diff --git a/src/encodings.h b/src/encodings.h
index 5fc90e2..a905096 100644
--- a/src/encodings.h
+++ b/src/encodings.h
@@ -81,11 +81,11 @@ void encodings_select_radio_item(const gchar *charset);
 void encodings_init(void);
 void encodings_finalize(void);
 
-gchar *encodings_convert_to_utf8(const gchar *buffer, gsize size, gchar **used_encoding);
+gchar *encodings_convert_to_utf8(const gchar *buffer, gsize *size, gchar **used_encoding);
 
 /* Converts a string from the given charset to UTF-8.
  * If fast is set, no further checks are performed. */
-gchar *encodings_convert_to_utf8_from_charset(const gchar *buffer, gsize size,
+gchar *encodings_convert_to_utf8_from_charset(const gchar *buffer, gsize *size,
 											  const gchar *charset, gboolean fast);
 
 gboolean encodings_is_unicode_charset(const gchar *string);
diff --git a/src/plugindata.h b/src/plugindata.h
index 64c3680..974a225 100644
--- a/src/plugindata.h
+++ b/src/plugindata.h
@@ -523,8 +523,8 @@ MsgWinFuncs;
 /* See encodings.h */
 typedef struct EncodingFuncs
 {
-	gchar*			(*encodings_convert_to_utf8) (const gchar *buffer, gsize size, gchar **used_encoding);
-	gchar* 			(*encodings_convert_to_utf8_from_charset) (const gchar *buffer, gsize size,
+	gchar*			(*encodings_convert_to_utf8) (const gchar *buffer, gsize *size, gchar **used_encoding);
+	gchar* 			(*encodings_convert_to_utf8_from_charset) (const gchar *buffer, gsize *size,
 													 const gchar *charset, gboolean fast);
 	const gchar*	(*encodings_get_charset_from_index) (gint idx);
 }
diff --git a/src/sciwrappers.c b/src/sciwrappers.c
index b34fefa..2e1448f 100644
--- a/src/sciwrappers.c
+++ b/src/sciwrappers.c
@@ -180,6 +180,15 @@ void sci_add_text(ScintillaObject *sci, const gchar *text)
 }
 
 
+void sci_add_text2(ScintillaObject *sci, const gchar *text, gsize length)
+{
+	g_return_if_fail(sci != NULL);
+	g_return_if_fail(text != NULL);
+
+	SSM(sci, SCI_ADDTEXT, length, (sptr_t)text);
+}
+
+
 /** Sets all text.
  * @param sci Scintilla widget.
  * @param text Text. */
diff --git a/src/sciwrappers.h b/src/sciwrappers.h
index 3195fb8..f2addb7 100644
--- a/src/sciwrappers.h
+++ b/src/sciwrappers.h
@@ -35,6 +35,7 @@ void				sci_set_mark_long_lines		(ScintillaObject *sci,	gint type, gint column,
 
 void 				sci_set_text				(ScintillaObject *sci,  const gchar *text);
 void 				sci_add_text				(ScintillaObject *sci,  const gchar *text);
+void 				sci_add_text2				(ScintillaObject *sci, 	const gchar *text, gsize length);
 gboolean			sci_can_redo				(ScintillaObject *sci);
 gboolean			sci_can_undo				(ScintillaObject *sci);
 gboolean			sci_has_selection			(ScintillaObject *sci);
diff --git a/src/socket.c b/src/socket.c
index 52cb0a0..6ad31af 100644
--- a/src/socket.c
+++ b/src/socket.c
@@ -547,10 +547,11 @@ static void socket_init_win32(void)
 static void handle_input_filename(const gchar *buf)
 {
 	gchar *utf8_filename, *locale_filename;
+	gsize tmp_size = -1;
 
 	/* we never know how the input is encoded, so do the best auto detection we can */
-	if (! g_utf8_validate(buf, -1, NULL))
-		utf8_filename = encodings_convert_to_utf8(buf, (gsize) -1, NULL);
+	if (! g_utf8_validate(buf, tmp_size, NULL))
+		utf8_filename = encodings_convert_to_utf8(buf, &tmp_size, NULL);
 	else
 		utf8_filename = g_strdup(buf);
 
diff --git a/src/symbols.c b/src/symbols.c
index fa82dbd..d23c9fd 100644
--- a/src/symbols.c
+++ b/src/symbols.c
@@ -1015,6 +1015,7 @@ static const gchar *get_symbol_name(GeanyDocument *doc, const TMTag *tag, gboole
 	const gchar *scope = tag->atts.entry.scope;
 	static GString *buffer = NULL;	/* buffer will be small so we can keep it for reuse */
 	gboolean doc_is_utf8 = FALSE;
+	gsize tmp_size = -1;
 
 	/* encodings_convert_to_utf8_from_charset() fails with charset "None", so skip conversion
 	 * for None at this point completely */
@@ -1024,7 +1025,7 @@ static const gchar *get_symbol_name(GeanyDocument *doc, const TMTag *tag, gboole
 
 	if (! doc_is_utf8)
 		utf8_name = encodings_convert_to_utf8_from_charset(tag->name,
-			(gsize) -1, doc->encoding, TRUE);
+			&tmp_size, doc->encoding, TRUE);
 	else
 		utf8_name = tag->name;
 
@@ -1059,6 +1060,7 @@ static const gchar *get_symbol_name(GeanyDocument *doc, const TMTag *tag, gboole
 static gchar *get_symbol_tooltip(GeanyDocument *doc, const TMTag *tag)
 {
 	gchar *utf8_name = editor_get_calltip_text(doc->editor, tag);
+	gsize tmp_size = -1;
 
 	/* encodings_convert_to_utf8_from_charset() fails with charset "None", so skip conversion
 	 * for None at this point completely */
@@ -1067,7 +1069,7 @@ static gchar *get_symbol_tooltip(GeanyDocument *doc, const TMTag *tag)
 		! utils_str_equal(doc->encoding, "None"))
 	{
 		setptr(utf8_name,
-			encodings_convert_to_utf8_from_charset(utf8_name, (gsize) -1, doc->encoding, TRUE));
+			encodings_convert_to_utf8_from_charset(utf8_name, &tmp_size, doc->encoding, TRUE));
 	}
 
 	if (utf8_name != NULL)
-- 
1.7.1

From 5dec6db43b498378be3496d4e8949683544c1044 Mon Sep 17 00:00:00 2001
From: Matthew Brush <[email protected]>
Date: Sun, 15 May 2011 19:39:15 -0700
Subject: [PATCH 2/2] Add regex patterns for XML, CSS, and HTTP/Mail headers.

Refactor regex patterns code to make it easier to add more
patterns.
---
 src/encodings.c |   45 ++++++++++++++++++++++++++++++++++-----------
 1 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/src/encodings.c b/src/encodings.c
index 4af12ef..b7cf7be 100644
--- a/src/encodings.c
+++ b/src/encodings.c
@@ -50,13 +50,34 @@
 # include "gnuregex.h"
 #endif
 
-/* <meta http-equiv="content-type" content="text/html; charset=UTF-8" /> */
-#define PATTERN_HTMLMETA "<meta[ \t\n\r\f]+http-equiv[ \t\n\r\f]*=[ \t\n\r\f]*\"?content-type\"?[ \t\n\r\f]+content[ \t\n\r\f]*=[ \t\n\r\f]*\"text/x?html;[ \t\n\r\f]*charset=([a-z0-9_-]+)\"[ \t\n\r\f]*/?>"
-/* " geany_encoding=utf-8 " or " coding: utf-8 " */
-#define PATTERN_CODING "coding[\t ]*[:=][\t ]*\"?([a-z0-9-]+)\"?[\t ]*"
+
+enum
+{
+	ENCODINGS_PATTERN_HTML_META,
+	ENCODINGS_PATTERN_XML_ENCODING,
+	ENCODINGS_PATTERN_CSS_ENCODING,
+	ENCODINGS_PATTERN_HEADERS,
+	ENCODINGS_PATTERN_CODING,
+	ENCODINGS_PATTERN_MAX
+};
+
+
+static const char *regex_patterns[ENCODINGS_PATTERN_MAX] = {
+	/* <meta http-equiv="content-type" content="text/html; charset=UTF-8" /> */
+	"<meta[ \t\n\r\f]+http-equiv[ \t\n\r\f]*=[ \t\n\r\f]*\"?content-type\"?[ \t\n\r\f]+content[ \t\n\r\f]*=[ \t\n\r\f]*\"text/x?html;[ \t\n\r\f]*charset=([a-z0-9_-]+)\"[ \t\n\r\f]*/?>",
+	/* <?xml version="1.0" encoding="iso-8859-1"?> */
+	"<\?xml.*?encoding[ \t\n\r\f]*=[ \t\n\r\f]*\"?([^\" \t\n\r\f]+)\"?.*?\?>",
+	/* @charset "utf-8"; */
+	"@charset[ \t\n\r\f]+\"?([^\" \t\n\r\f]+)\"?;",
+	/* Content-Type: text/html; charset=UTF-8 */
+	"Content-Type:[^;]+;[ \t\n\r\f]*charset=[ \t\n\r\f]*\"?([^\" \t\n\r\f]+)\"?",
+	/* " geany_encoding=utf-8 " or " coding: utf-8 " */
+	"coding[\t ]*[:=][\t ]*\"?([a-z0-9-]+)\"?[\t ]*"
+};
+
 
 /* precompiled regexps */
-static regex_t pregs[2];
+static regex_t pregs[ENCODINGS_PATTERN_MAX];
 static gboolean pregs_loaded = FALSE;
 
 
@@ -388,9 +409,8 @@ void encodings_finalize(void)
 {
 	if (pregs_loaded)
 	{
-		guint i, len;
-		len = G_N_ELEMENTS(pregs);
-		for (i = 0; i < len; i++)
+		guint i;
+		for (i = 0; i < ENCODINGS_PATTERN_MAX; i++)
 		{
 			regfree(&pregs[i]);
 		}
@@ -413,8 +433,11 @@ void encodings_init(void)
 
 	if (! pregs_loaded)
 	{
-		regex_compile(&pregs[0], PATTERN_HTMLMETA);
-		regex_compile(&pregs[1], PATTERN_CODING);
+		for (i = 0; i < ENCODINGS_PATTERN_MAX; i++)
+		{
+			regex_compile(&pregs[i], regex_patterns[i]);
+		}
+
 		pregs_loaded = TRUE;
 	}
 
@@ -606,7 +629,7 @@ static gchar *encodings_check_regexes(const gchar *buffer, gsize size)
 {
 	guint i;
 
-	for (i = 0; i < G_N_ELEMENTS(pregs); i++)
+	for (i = 0; i < ENCODINGS_PATTERN_MAX; i++)
 	{
 		gchar *charset;
 
-- 
1.7.1

_______________________________________________
Geany-devel mailing list
[email protected]
https://lists.uvena.de/cgi-bin/mailman/listinfo/geany-devel