Re: [sqlite] extracting domain names from website addresses efficiently
Hello Peter, Monday, December 11, 2017, 9:16:27 AM, you wrote: PDS> This seems like a job for regular expressions. PDS> PDS> ___ PDS> sqlite-users mailing list PDS> sqlite-users@mailinglists.sqlite.org PDS> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users I'm with you. This seems like using the wrong tool for the job. Sure it can work but it's probably not the most efficient way. -- Tegmailto:t...@djii.com ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] extracting domain names from website addresses efficiently
Very interesting. I must be very slow, because at first I did no get what this was about, although you fed it me on a spoon. This is actually very straightforward. I find this easier to understand than recursive CTEs. Thank you, Peter and IgorT, for helping me on my way to understand this powerful feature. Still some way to go, but that is part of the fun. email signature Klaus Maas On 2017-12-12 00:22, petern wrote: Klaus. The CTE manual with good examples is at https://www.sqlite.org/lang_with.html IgorT posted some good stuff about your problem using CTE. FYI. TRIGGER is also recursive. Could be more efficient if you have to store them anyway: CREATE TABLE domain(d TEXT); CREATE TRIGGER domain_after_ins AFTER INSERT ON domain WHEN instr(NEW.d,'.') BEGIN INSERT INTO domain VALUES(substr(NEW.d,instr(NEW.d,'.')+1)); END; PRAGMA recursive_triggers=1; INSERT INTO domain VALUES ('a.b.c.d'); INSERT INTO domain VALUES('e.f.g.1.2'); SELECT * FROM domain; d -- a.b.c.d b.c.d c.d d e.f.g.1.2 f.g.1.2 g.1.2 1.2 2 Peter On Mon, Dec 11, 2017 at 9:22 AM, Klaus Maaswrote: Thank you, Igor, for your solution. I expected that I would be pointed to a recursive approach. Your solution is quite elegant in my opinion. My problem, however, is that I have not yet understood recursive CTE's. Could you give me some pointers for good resources to learn about them? Well, your code example is helping a lot in understanding them, although I am still a long way off producing something similar myself without assistance. There is 1 item in the last where condition of which I do not understand the purpose: (instr(subdomain, '.') = 0 and subdomain = long) This means that anything in the form of 'ftp://test/' would output the string between the two delimiters (:// and /), in this case 'test'. But that is not a domain name in the format domain.tld. (I am working under the assumption that table links is cleaned up contains valid links only with the protocol and :// prepended.) Or am I missing something?. Klaus On 2017-12-11 14:59, Igor Tandetnik wrote: Something like this (could likely be simplified further, but this should give the idea): with recursive subdomains as ( SELECT substr(link, instr(link, '://')+3, instr(substr(link, instr(link, '://')+3), '/')-1) AS long, substr(link, instr(link, '://')+3, instr(substr(link, instr(link, '://')+3), '/')-1) as subdomain FROM links union all select long, substr(subdomain, instr(subdomain, '.') + 1) from subdomains where instr(subdomain, '.') > 0) select * from subdomains where (instr(subdomain, '.') = 0 and subdomain = long) OR (instr(subdomain, '.') > 0 and instr(substr(subdomain, instr(subdomain, '.') + 1), '.')=0); The main point is to recursively build a table of all suffixes, then select just the suffixes you want. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] extracting domain names from website addresses efficiently
Thank you for your help! On 2017-12-11 22:43, Igor Tandetnik wrote: On 12/11/2017 12:22 PM, Klaus Maas wrote: There is 1 item in the last where condition of which I do not understand the purpose: (instr(subdomain, '.') = 0 and subdomain = long) That's a zero-dot case, when the original domain is simply 'com', say. Added for completeness. This means that anything in the form of 'ftp://test/' would output the string between the two delimiters (:// and /), in this case 'test'. But that is not a domain name in the format domain.tld. Feel free to adjust to taste. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] extracting domain names from website addresses efficiently
Klaus. The CTE manual with good examples is at https://www.sqlite.org/lang_with.html IgorT posted some good stuff about your problem using CTE. FYI. TRIGGER is also recursive. Could be more efficient if you have to store them anyway: CREATE TABLE domain(d TEXT); CREATE TRIGGER domain_after_ins AFTER INSERT ON domain WHEN instr(NEW.d,'.') BEGIN INSERT INTO domain VALUES(substr(NEW.d,instr(NEW.d,'.')+1)); END; PRAGMA recursive_triggers=1; INSERT INTO domain VALUES ('a.b.c.d'); INSERT INTO domain VALUES('e.f.g.1.2'); SELECT * FROM domain; d -- a.b.c.d b.c.d c.d d e.f.g.1.2 f.g.1.2 g.1.2 1.2 2 Peter On Mon, Dec 11, 2017 at 9:22 AM, Klaus Maaswrote: > Thank you, Igor, for your solution. > I expected that I would be pointed to a recursive approach. > Your solution is quite elegant in my opinion. > My problem, however, is that I have not yet understood recursive CTE's. > Could you give me some pointers for good resources to learn about them? > > Well, your code example is helping a lot in understanding them, although I > am still a long way off producing something similar myself without > assistance. > > There is 1 item in the last where condition of which I do not understand > the purpose: > (instr(subdomain, '.') = 0 and subdomain = long) > This means that anything in the form of 'ftp://test/' would output the > string between the two delimiters (:// and /), in this case 'test'. > But that is not a domain name in the format domain.tld. > (I am working under the assumption that table links is cleaned up contains > valid links only with the protocol and :// prepended.) > Or am I missing something?. > > Klaus > > > On 2017-12-11 14:59, Igor Tandetnik wrote: > >> Something like this (could likely be simplified further, but this should >> give the idea): >> >> with recursive subdomains as ( >> SELECT substr(link, instr(link, '://')+3, instr(substr(link, >> instr(link, '://')+3), '/')-1) AS long, >> substr(link, instr(link, '://')+3, instr(substr(link, >> instr(link, '://')+3), '/')-1) as subdomain >> FROM links >> union all >> select long, substr(subdomain, instr(subdomain, '.') + 1) >> from subdomains >> where instr(subdomain, '.') > 0) >> select * from subdomains >> where (instr(subdomain, '.') = 0 and subdomain = long) OR >> (instr(subdomain, '.') > 0 and instr(substr(subdomain, >> instr(subdomain, '.') + 1), '.')=0); >> >> The main point is to recursively build a table of all suffixes, then >> select just the suffixes you want. >> > > ___ > sqlite-users mailing list > sqlite-users@mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users > ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] extracting domain names from website addresses efficiently
On 12/11/2017 12:22 PM, Klaus Maas wrote: There is 1 item in the last where condition of which I do not understand the purpose: (instr(subdomain, '.') = 0 and subdomain = long) That's a zero-dot case, when the original domain is simply 'com', say. Added for completeness. This means that anything in the form of 'ftp://test/' would output the string between the two delimiters (:// and /), in this case 'test'. But that is not a domain name in the format domain.tld. Feel free to adjust to taste. -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] extracting domain names from website addresses efficiently
Thank you, Igor, for your solution. I expected that I would be pointed to a recursive approach. Your solution is quite elegant in my opinion. My problem, however, is that I have not yet understood recursive CTE's. Could you give me some pointers for good resources to learn about them? Well, your code example is helping a lot in understanding them, although I am still a long way off producing something similar myself without assistance. There is 1 item in the last where condition of which I do not understand the purpose: (instr(subdomain, '.') = 0 and subdomain = long) This means that anything in the form of 'ftp://test/' would output the string between the two delimiters (:// and /), in this case 'test'. But that is not a domain name in the format domain.tld. (I am working under the assumption that table links is cleaned up contains valid links only with the protocol and :// prepended.) Or am I missing something?. Klaus On 2017-12-11 14:59, Igor Tandetnik wrote: Something like this (could likely be simplified further, but this should give the idea): with recursive subdomains as ( SELECT substr(link, instr(link, '://')+3, instr(substr(link, instr(link, '://')+3), '/')-1) AS long, substr(link, instr(link, '://')+3, instr(substr(link, instr(link, '://')+3), '/')-1) as subdomain FROM links union all select long, substr(subdomain, instr(subdomain, '.') + 1) from subdomains where instr(subdomain, '.') > 0) select * from subdomains where (instr(subdomain, '.') = 0 and subdomain = long) OR (instr(subdomain, '.') > 0 and instr(substr(subdomain, instr(subdomain, '.') + 1), '.')=0); The main point is to recursively build a table of all suffixes, then select just the suffixes you want. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] extracting domain names from website addresses efficiently
This seems like a job for regular expressions. ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] extracting domain names from website addresses efficiently
Something like this (could likely be simplified further, but this should give the idea): with recursive subdomains as ( SELECT substr(link, instr(link, '://')+3, instr(substr(link, instr(link, '://')+3), '/')-1) AS long, substr(link, instr(link, '://')+3, instr(substr(link, instr(link, '://')+3), '/')-1) as subdomain FROM links union all select long, substr(subdomain, instr(subdomain, '.') + 1) from subdomains where instr(subdomain, '.') > 0) select * from subdomains where (instr(subdomain, '.') = 0 and subdomain = long) OR (instr(subdomain, '.') > 0 and instr(substr(subdomain, instr(subdomain, '.') + 1), '.')=0); The main point is to recursively build a table of all suffixes, then select just the suffixes you want. -- Igor Tandetnik ___ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users