Re: [sqlite] extracting domain names from website addresses efficiently

2017-12-12 Thread Teg
Hello Peter,

Monday, December 11, 2017, 9:16:27 AM, you wrote:

PDS> This seems like a job for regular expressions.
PDS>  

PDS> ___
PDS> sqlite-users mailing list
PDS> sqlite-users@mailinglists.sqlite.org
PDS> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

I'm  with  you. This seems like using the wrong tool for the job. Sure
it can work but it's probably not the most efficient way. 

-- 
 Tegmailto:t...@djii.com

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] extracting domain names from website addresses efficiently

2017-12-12 Thread Klaus Maas

Very interesting.

I must be very slow, because at first I did no get what this was about, 
although you fed it me on a spoon.


This is actually very straightforward. I find this easier to understand 
than recursive CTEs.


Thank you, Peter and IgorT, for helping me on my way to understand this 
powerful feature.


Still some way to go, but that is  part of the fun.


email signature Klaus Maas

On 2017-12-12 00:22, petern wrote:

Klaus.  The CTE manual with good examples is at
https://www.sqlite.org/lang_with.html
IgorT posted some good stuff about your problem using CTE.

FYI. TRIGGER is also recursive.  Could be more efficient if you have to
store them anyway:

CREATE TABLE domain(d TEXT);
CREATE TRIGGER domain_after_ins AFTER INSERT ON domain WHEN instr(NEW.d,'.')
   BEGIN INSERT INTO domain VALUES(substr(NEW.d,instr(NEW.d,'.')+1)); END;

PRAGMA recursive_triggers=1;
INSERT INTO domain VALUES ('a.b.c.d');
INSERT INTO domain VALUES('e.f.g.1.2');
SELECT * FROM domain;
d
--
a.b.c.d
b.c.d
c.d
d
e.f.g.1.2
f.g.1.2
g.1.2
1.2
2

Peter







On Mon, Dec 11, 2017 at 9:22 AM, Klaus Maas  wrote:


Thank you, Igor, for your solution.
I expected that I would be pointed to a recursive approach.
Your solution is quite elegant in my opinion.
My problem, however, is that I have not yet understood recursive CTE's.
Could you give me some pointers for good resources to learn about them?

Well, your code example is helping a lot in understanding them, although I
am still a long way off producing something similar myself without
assistance.

There is 1 item in the last where condition of which I do not understand
the purpose:
(instr(subdomain, '.') = 0 and subdomain = long)
This means that anything in the form of 'ftp://test/' would output the
string between the two delimiters (:// and /), in this case 'test'.
But that is not a domain name in the format domain.tld.
(I am working under the assumption that table links is cleaned up contains
valid links only with the protocol and :// prepended.)
Or am I missing something?.

Klaus


On 2017-12-11 14:59, Igor Tandetnik wrote:


Something like this (could likely be simplified further, but this should
give the idea):

with recursive subdomains as (
   SELECT substr(link, instr(link, '://')+3, instr(substr(link,
instr(link, '://')+3), '/')-1) AS long,
  substr(link, instr(link, '://')+3, instr(substr(link,
instr(link, '://')+3), '/')-1) as subdomain
   FROM links
union all
   select long, substr(subdomain, instr(subdomain, '.') + 1)
   from subdomains
   where instr(subdomain, '.') > 0)
select * from subdomains
where (instr(subdomain, '.') = 0 and subdomain = long) OR
   (instr(subdomain, '.') > 0 and instr(substr(subdomain,
instr(subdomain, '.') + 1), '.')=0);

The main point is to recursively build a table of all suffixes, then
select just the suffixes you want.


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] extracting domain names from website addresses efficiently

2017-12-11 Thread Klaus Maas

Thank you for your help!



On 2017-12-11 22:43, Igor Tandetnik wrote:

On 12/11/2017 12:22 PM, Klaus Maas wrote:
There is 1 item in the last where condition of which I do not 
understand the purpose:

(instr(subdomain, '.') = 0 and subdomain = long)


That's a zero-dot case, when the original domain is simply 'com', say. 
Added for completeness.


This means that anything in the form of 'ftp://test/' would output 
the string between the two delimiters (:// and /), in this case 'test'.

But that is not a domain name in the format domain.tld.


Feel free to adjust to taste.


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] extracting domain names from website addresses efficiently

2017-12-11 Thread petern
Klaus.  The CTE manual with good examples is at
https://www.sqlite.org/lang_with.html
IgorT posted some good stuff about your problem using CTE.

FYI. TRIGGER is also recursive.  Could be more efficient if you have to
store them anyway:

CREATE TABLE domain(d TEXT);
CREATE TRIGGER domain_after_ins AFTER INSERT ON domain WHEN instr(NEW.d,'.')
  BEGIN INSERT INTO domain VALUES(substr(NEW.d,instr(NEW.d,'.')+1)); END;

PRAGMA recursive_triggers=1;
INSERT INTO domain VALUES ('a.b.c.d');
INSERT INTO domain VALUES('e.f.g.1.2');
SELECT * FROM domain;
d
--
a.b.c.d
b.c.d
c.d
d
e.f.g.1.2
f.g.1.2
g.1.2
1.2
2

Peter







On Mon, Dec 11, 2017 at 9:22 AM, Klaus Maas  wrote:

> Thank you, Igor, for your solution.
> I expected that I would be pointed to a recursive approach.
> Your solution is quite elegant in my opinion.
> My problem, however, is that I have not yet understood recursive CTE's.
> Could you give me some pointers for good resources to learn about them?
>
> Well, your code example is helping a lot in understanding them, although I
> am still a long way off producing something similar myself without
> assistance.
>
> There is 1 item in the last where condition of which I do not understand
> the purpose:
> (instr(subdomain, '.') = 0 and subdomain = long)
> This means that anything in the form of 'ftp://test/' would output the
> string between the two delimiters (:// and /), in this case 'test'.
> But that is not a domain name in the format domain.tld.
> (I am working under the assumption that table links is cleaned up contains
> valid links only with the protocol and :// prepended.)
> Or am I missing something?.
>
> Klaus
>
> 
> On 2017-12-11 14:59, Igor Tandetnik wrote:
>
>> Something like this (could likely be simplified further, but this should
>> give the idea):
>>
>> with recursive subdomains as (
>>   SELECT substr(link, instr(link, '://')+3, instr(substr(link,
>> instr(link, '://')+3), '/')-1) AS long,
>>  substr(link, instr(link, '://')+3, instr(substr(link,
>> instr(link, '://')+3), '/')-1) as subdomain
>>   FROM links
>> union all
>>   select long, substr(subdomain, instr(subdomain, '.') + 1)
>>   from subdomains
>>   where instr(subdomain, '.') > 0)
>> select * from subdomains
>> where (instr(subdomain, '.') = 0 and subdomain = long) OR
>>   (instr(subdomain, '.') > 0 and instr(substr(subdomain,
>> instr(subdomain, '.') + 1), '.')=0);
>>
>> The main point is to recursively build a table of all suffixes, then
>> select just the suffixes you want.
>>
>
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] extracting domain names from website addresses efficiently

2017-12-11 Thread Igor Tandetnik

On 12/11/2017 12:22 PM, Klaus Maas wrote:

There is 1 item in the last where condition of which I do not understand the 
purpose:
(instr(subdomain, '.') = 0 and subdomain = long)


That's a zero-dot case, when the original domain is simply 'com', say. Added 
for completeness.


This means that anything in the form of 'ftp://test/' would output the string 
between the two delimiters (:// and /), in this case 'test'.
But that is not a domain name in the format domain.tld.


Feel free to adjust to taste.
--
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] extracting domain names from website addresses efficiently

2017-12-11 Thread Klaus Maas

Thank you, Igor, for your solution.
I expected that I would be pointed to a recursive approach.
Your solution is quite elegant in my opinion.
My problem, however, is that I have not yet understood recursive CTE's.
Could you give me some pointers for good resources to learn about them?

Well, your code example is helping a lot in understanding them, although 
I am still a long way off producing something similar myself without 
assistance.


There is 1 item in the last where condition of which I do not understand 
the purpose:

(instr(subdomain, '.') = 0 and subdomain = long)
This means that anything in the form of 'ftp://test/' would output the 
string between the two delimiters (:// and /), in this case 'test'.

But that is not a domain name in the format domain.tld.
(I am working under the assumption that table links is cleaned up 
contains valid links only with the protocol and :// prepended.)

Or am I missing something?.

Klaus


On 2017-12-11 14:59, Igor Tandetnik wrote:
Something like this (could likely be simplified further, but this 
should give the idea):


with recursive subdomains as (
  SELECT substr(link, instr(link, '://')+3, instr(substr(link, 
instr(link, '://')+3), '/')-1) AS long,
 substr(link, instr(link, '://')+3, instr(substr(link, 
instr(link, '://')+3), '/')-1) as subdomain

  FROM links
union all
  select long, substr(subdomain, instr(subdomain, '.') + 1)
  from subdomains
  where instr(subdomain, '.') > 0)
select * from subdomains
where (instr(subdomain, '.') = 0 and subdomain = long) OR
  (instr(subdomain, '.') > 0 and instr(substr(subdomain, 
instr(subdomain, '.') + 1), '.')=0);


The main point is to recursively build a table of all suffixes, then 
select just the suffixes you want.


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] extracting domain names from website addresses efficiently

2017-12-11 Thread Peter Da Silva
This seems like a job for regular expressions.
 

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] extracting domain names from website addresses efficiently

2017-12-11 Thread Igor Tandetnik

Something like this (could likely be simplified further, but this should give 
the idea):

with recursive subdomains as (
  SELECT substr(link, instr(link, '://')+3, instr(substr(link, instr(link, 
'://')+3), '/')-1) AS long,
 substr(link, instr(link, '://')+3, instr(substr(link, instr(link, 
'://')+3), '/')-1) as subdomain
  FROM links
union all
  select long, substr(subdomain, instr(subdomain, '.') + 1)
  from subdomains
  where instr(subdomain, '.') > 0)
select * from subdomains
where (instr(subdomain, '.') = 0 and subdomain = long) OR
  (instr(subdomain, '.') > 0 and instr(substr(subdomain, instr(subdomain, 
'.') + 1), '.')=0);

The main point is to recursively build a table of all suffixes, then select 
just the suffixes you want.
--
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users