In r12843 I commited utils/nsurl.{c|h} which contain new URL handling
functions.

    http://source.netsurf-browser.org/?view=revision&revision=12843


Current url module
==================

The way the url_ code works is inefficient.  The same urls get parsed,
chopped up into components, used for something, and then the components
are thrown away, before the url is parsed and chopped up again, over and
over.

Comparison of urls is frequent and currently slow (involves parsing,
chopping up), and more so when the urls are the same or the same at the
start (common).

Clients need to know about normalised and unnormalised urls, because all
of the url modules require normalised urls as input (except url_normalise
and the relative part passed to url_join).  Meanwhile url_join outputs
urls unnormalised.

All URLs, whether entered by user or result of joining, need to be
normalised, and this involves a slow regular expression.  Much slower than
parsing the url manually.


Generally this is slow when there are a lot of images on a page, all from
the same domain.


New nsurl module
================

URLs are opaque 'nsurl' objects to everything outside nsurl.c. 
Internally, they are stored as a set of interned components.

This makes comparison much faster.

There is no need to reparse and chop up a URL.

URL normalisation happens transparently when the nsurl objects are created
either via nsurl_create or nsurl_join.

Joining URLs requires less work, since the base URL is already in the
appropriate format.


Possible improvements
=====================

Currently to get a url as a string, you have to call nsurl_get which
allocates space for and builds a new url string from components.  Perhaps
we could keep a copy of the full url inside the nsurl and return a pointer
to it.

If we do that, we can only guarantee a trailing '/0' if the URL was
requested up to and including the fragment.  Also, it would have to act at
is currently does if the URL is required with bits missing, such as the
password.  On the other hand, I expect getting the full url is the most
common case anyway.

Probably need to look at how nsurl object copying works.  I think perhaps
making a nsurl_ref and replacing nsurl_destroy with nsurl_unref would be
best.

There's currently no way for the outside world to get at, say, the
interned lwc_string for 'host'.  Not sure if there will be any requirement
for that.


Testing of nsurl module
=======================

Attached is the code I've used to test it.  It compares the output of the
existing url_ functions to the output of the new nsurl_ functions and
asserts if there's a difference.

At the moment one benign thing causes it to trip up: output of url_join is
not always normalised, but the output of nsurl_join is.

The nsurl_join code is also tested on all the examples of URL joining in
the RFC (which it passes) in url_fini.

-- 

Michael Drake (tlsa)                  http://www.netsurf-browser.org/
Index: Makefile.sources
===================================================================
--- Makefile.sources    (revision 12828)
+++ Makefile.sources    (working copy)
@@ -16,7 +16,7 @@
        hubbub_binding.c imagemap.c layout.c list.c search.c table.c    \
        textinput.c textplain.c
 
-S_UTILS := base64.c filename.c hashtable.c locale.c messages.c                 
\
+S_UTILS := base64.c filename.c hashtable.c locale.c messages.c nsurl.c \
        talloc.c url.c utf8.c utils.c useragent.c filepath.c log.c
 
 S_HTTP := challenge.c generics.c primitives.c parameter.c              \
Index: utils/url.c
===================================================================
--- utils/url.c (revision 12828)
+++ utils/url.c (working copy)
@@ -34,6 +34,7 @@
 #include "curl/curl.h"
 #include "utils/config.h"
 #include "utils/log.h"
+#include "utils/nsurl.h"
 #include "utils/url.h"
 #include "utils/utils.h"
 
@@ -77,6 +78,121 @@
 }
 
 
+void url_fini(void)
+{
+       nsurl *base;
+       nsurl *joined;
+       struct test_pairs {
+               const char* test;
+               const char* res;
+       };
+
+       struct test_pairs tests[] = {
+               /* Normal Examples rfc3986 5.4.1 */
+               { "g:h",                "g:h" },
+               { "g",                  "http://a/b/c/g"; },
+               { "./g",                "http://a/b/c/g"; },
+               { "g/",                 "http://a/b/c/g/"; },
+               { "/g",                 "http://a/g"; },
+               { "//g",                "http://g"; /* [1] */ "/" },
+               { "?y",                 "http://a/b/c/d;p?y"; },
+               { "g?y",                "http://a/b/c/g?y"; },
+               { "#s",                 "http://a/b/c/d;p?q#s"; },
+               { "g#s",                "http://a/b/c/g#s"; },
+               { "g?y#s",              "http://a/b/c/g?y#s"; },
+               { ";x",                 "http://a/b/c/;x"; },
+               { "g;x",                "http://a/b/c/g;x"; },
+               { "g;x?y#s",            "http://a/b/c/g;x?y#s"; },
+               { "",                   "http://a/b/c/d;p?q"; },
+               { ".",                  "http://a/b/c/"; },
+               { "./",                 "http://a/b/c/"; },
+               { "..",                 "http://a/b/"; },
+               { "../",                "http://a/b/"; },
+               { "../g",               "http://a/b/g"; },
+               { "../..",              "http://a/"; },
+               { "../../",             "http://a/"; },
+               { "../../g",            "http://a/g"; },
+
+               /* Abnormal Examples rfc3986 5.4.2 */
+               { "../../../g",         "http://a/g"; },
+               { "../../../../g",      "http://a/g"; },
+
+               { "/./g",               "http://a/g"; },
+               { "/../g",              "http://a/g"; },
+               { "g.",                 "http://a/b/c/g."; },
+               { ".g",                 "http://a/b/c/.g"; },
+               { "g..",                "http://a/b/c/g.."; },
+               { "..g",                "http://a/b/c/..g"; },
+
+               { "./../g",             "http://a/b/g"; },
+               { "./g/.",              "http://a/b/c/g/"; },
+               { "g/./h",              "http://a/b/c/g/h"; },
+               { "g/../h",             "http://a/b/c/h"; },
+               { "g;x=1/./y",          "http://a/b/c/g;x=1/y"; },
+               { "g;x=1/../y",         "http://a/b/c/y"; },
+
+               { "g?y/./x",            "http://a/b/c/g?y/./x"; },
+               { "g?y/../x",           "http://a/b/c/g?y/../x"; },
+               { "g#s/./x",            "http://a/b/c/g#s/./x"; },
+               { "g#s/../x",           "http://a/b/c/g#s/../x"; },
+
+               { "http:g",             "http:g" /* [2] */ },
+               /* [1] Extra slash beyond rfc3986 5.4.1 example, since we're
+                *     testing normalisation in addition to joining */
+               /* [2] Using the strict parsers option */
+               { NULL,                 NULL }
+       };
+       int index = 0;
+       char *string;
+       size_t len;
+
+       /* Create base URL */
+       if (nsurl_create("http://a/b/c/d;p?q";, &base) != NSERROR_OK) {
+               LOG(("Failed to create base URL."));
+       } else {
+               if (nsurl_get(base, NSURL_WITH_FRAGMENT, &string, &len) !=
+                               NSERROR_OK) {
+                       LOG(("Failed to get string"));
+               } else {
+                       LOG(("Testing nsurl_join with base %s", string));
+                       free(string);
+               }
+
+               do {
+                       if (nsurl_join(base, tests[index].test,
+                                       &joined) != NSERROR_OK) {
+                               LOG(("Failed to join test URL."));
+                       } else {
+                               if (nsurl_get(joined, NSURL_WITH_FRAGMENT,
+                                               &string, &len) !=
+                                               NSERROR_OK) {
+                                       LOG(("Failed to get string"));
+                               } else {
+                                       if (strcmp(tests[index].res,
+                                                       string) == 0) {
+                                               LOG(("\tPASS: \"%s\"\t--> %s",
+                                                       tests[index].test,
+                                                       string));
+                                       } else {
+                                               LOG(("\tFAIL: \"%s\"\t--> %s",
+                                                       tests[index].test,
+                                                       string));
+                                               LOG(("\t\tExpecting: %s",
+                                                       tests[index].res));
+                                               assert(0);
+                                       }
+                                       free(string);
+                               }
+                               nsurl_destroy(joined);
+                       }
+
+               } while (tests[++index].test != NULL);
+
+               nsurl_destroy(base);
+       }
+}
+
+
 /**
  * Check whether a host string is an IP address.  It should support and
  * detect IPv4 addresses (all of dotted-quad or subsets, decimal or
@@ -192,7 +308,24 @@
        char* norm;
        bool http = false;
        regmatch_t match[10];
+       nsurl *test_url;
+       char *temp_string;
 
+       /* Get result via NSURL */
+       if (nsurl_create(url, &test_url) == NSERROR_OK) {
+               size_t temp_string_len;
+               if (nsurl_get(test_url, NSURL_WITH_FRAGMENT, &temp_string,
+                               &temp_string_len) == NSERROR_OK) {
+               } else {
+                       LOG(("nsurl_get() failed."));
+                       assert(0);
+               }
+               nsurl_destroy(test_url);
+       } else {
+               LOG(("nusurl_create() failed."));
+               assert(0);
+       }
+
        *result = NULL;
 
        /* skip past any leading whitespace (likely if URL was copy-pasted) */
@@ -360,6 +493,15 @@
                len -= 2;
        }
 
+       /* Compare to NSURL result */
+       if (strcmp(*result, temp_string) != 0) {
+               LOG(("url result mismatch:"));
+               LOG(("  URL: %s", *result));
+               LOG(("NSURL: %s", temp_string));
+               assert(0);
+       }
+       free(temp_string);
+
        /* norm and *result point to same memory, so just return ok */
        return URL_FUNC_OK;
 }
@@ -386,7 +528,33 @@
        char *merge_path = NULL, *split_point;
        char *input, *output, *start = NULL;
        int len, buf_len;
+       nsurl *test_base_url;
+       nsurl *test_join_url;
+       char *temp_string;
 
+       /* Get result via NSURL */
+       if (nsurl_create(base, &test_base_url) == NSERROR_OK) {
+               if (nsurl_join(test_base_url, rel, &test_join_url) ==
+                               NSERROR_OK) {
+                       size_t temp_string_len;
+                       if (nsurl_get(test_join_url, NSURL_WITH_FRAGMENT,
+                                       &temp_string,
+                                       &temp_string_len) == NSERROR_OK) {
+                       } else {
+                               LOG(("nsurl_get() failed."));
+                               assert(0);
+                       }
+                       nsurl_destroy(test_join_url);
+               } else {
+                       LOG(("nsurl_join() failed."));
+                       assert(0);
+               }
+               nsurl_destroy(test_base_url);
+       } else {
+               LOG(("nusurl_create() failed."));
+               assert(0);
+       }
+
        (*result) = 0;
 
        assert(base);
@@ -560,6 +728,16 @@
        free(merge_path);
        url_destroy_components((struct url_components *) base_ptr);
        url_destroy_components((struct url_components *) rel_ptr);
+
+       /* Compare to NSURL result */
+       if (strcmp(*result, temp_string) != 0) {
+               LOG(("url result mismatch:"));
+               LOG(("  URL: %s", *result));
+               LOG(("NSURL: %s", temp_string));
+               assert(0);
+       }
+       free(temp_string);
+
        return status;
 }
 
@@ -1090,9 +1268,33 @@
        url_func_result status;
        struct url_components c1, c2;
        bool res = true;
+       nsurl *t1;
+       nsurl *t2;
+       bool nsurl_res;
 
        assert(url1 && url2 && result);
 
+       /* Get result via NSURL */
+       if (nsurl_create(url1, &t1) == NSERROR_OK) {
+               if (nsurl_create(url2, &t2) == NSERROR_OK) {
+                       if (nsurl_compare(t1, t2,
+                                       nofrag ? NSURL_COMPLETE :
+                                       NSURL_WITH_FRAGMENT,
+                                       &nsurl_res) != NSERROR_OK) {
+                               LOG(("nsurl_compare() failed."));
+                               assert(0);
+                       }
+                       nsurl_destroy(t2);
+               } else {
+                       LOG(("nusurl_create() failed."));
+                       assert(0);
+               }
+               nsurl_destroy(t1);
+       } else {
+               LOG(("nusurl_create() failed."));
+               assert(0);
+       }
+
        /* Decompose URLs */
        status = url_get_components(url1, &c1);
        if (status != URL_FUNC_OK) {
@@ -1141,6 +1343,12 @@
        url_destroy_components(&c2);
        url_destroy_components(&c1);
 
+       if (res != nsurl_res) {
+               LOG(("URL comparison result mismatch.  Input:"));
+               LOG(("url1: %s", url1));
+               LOG(("url2: %s", url2));
+       }
+
        return URL_FUNC_OK;
 }
 
Index: utils/url.h
===================================================================
--- utils/url.h (revision 12828)
+++ utils/url.h (working copy)
@@ -44,6 +44,7 @@
 };
 
 void url_init(void);
+void url_fini(void);
 bool url_host_is_ip_address(const char *host);
 url_func_result url_normalize(const char *url, char **result);
 url_func_result url_join(const char *rel, const char *base, char **result);
Index: desktop/netsurf.c
===================================================================
--- desktop/netsurf.c   (revision 12828)
+++ desktop/netsurf.c   (working copy)
@@ -282,6 +282,8 @@
        LOG(("Destroying System colours"));
        gui_system_colour_finalize();
 
+       url_fini();
+
        LOG(("Remaining lwc strings:"));
        lwc_iterate_strings(netsurf_lwc_iterator, NULL);

Reply via email to