Re: How to use full-text search URL parser to filter query results by domain name?

2019-04-06 Thread hamann . w
>> I am trying to understand how to use the full-text search parser for
>> URLS and hostnames to filter results from a text field containing URLS
>> based on domain, and also how to index text columns for fast
>> lookup/matching based on domain.
>>
>> I have a PostgreSQL database containing documents and links downloaded
>> by a web crawler, with the following tables:
>>
>>     pages
>>     --
>>     id:  Integer (primary key)
>>     url: String  (unique)
>>     title:   String
>>     text:    String
>>     html:    String
>>     last_visit:  DateTime
>>     word_pos:    TSVECTOR
>>
>>     links
>>     --
>>     id Integer (primary key)
>>     source:    String
>>     target:    String
>>     link_text: String
>>     UNIQUE(source,target)
>>
>>     crawls
>>     -
>>     id: Integer (primary key)
>>     query:  String
>>
>>     crawl_results
>>     -
>>     id:   Integer (primary key)
>>     score:    Integer (constraint 0<=score<=1)
>>     crawl_id: Integer (foreign key, crawls.id)
>>     page_id:  Integer (foreign key, pages.id)
>>
>> The `source` and `target` fields in the `links` table contain URLs. I am
>> running the following query to extract scored links from the top-ranking
>> search results, for pages that haven't been fetched yet:
>>
>>     WITH top_results AS
>>     (SELECT page_id, score FROM crawl_results
>>     WHERE crawl_id=$1
>>     ORDER BY score LIMIT 100)
>>     SELECT top_results.score, l.target
>>     FROM top_results
>>     JOIN pages p ON top_results.page_id=p.id
>>     JOIN links l on p.url=l.source
>>     WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)
>>
>> However, *I would like to filter these results so that only one row is
>> returned for a given domain (the one with the lowest score)*. So for
>> instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8,
>> 'http://www.foo.com/zor'), I only want the first because it has same
>> domain `foo.com` and has the lower score.
>>
>> I was able to find documentation for the builtin full text search
>> parsers, which can parse URLS and extract the hostname. For instance, I
>> can extract the hostname from a URL as follows:
Hi,

I have no real idea about solving the complete problem, and would probably try
something with a temp table first.
For extracting the hostname from a url you could use

select regexp_replace(url, '^https?://([^/]+).*$', '\1')

instead of the fulltext parser (regexp_replace() takes the string first, then
the pattern, then the replacement).
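
Or, skipping the temp table, something along these lines might work as an
untested sketch, using DISTINCT ON to keep the lowest-scoring row per host.
The regexp is an assumption (it takes everything between "://" and the first
"/"), so it groups by the full hostname (www.foo.com) rather than the
registrable domain (foo.com):

    -- Untested sketch: one row per extracted host, lowest score first
    WITH top_results AS (
        SELECT page_id, score
        FROM crawl_results
        WHERE crawl_id = $1
        ORDER BY score
        LIMIT 100
    )
    SELECT DISTINCT ON (host) score, target
    FROM (
        SELECT tr.score,
               l.target,
               regexp_replace(l.target, '^https?://([^/]+).*$', '\1') AS host
        FROM top_results tr
        JOIN pages p ON tr.page_id = p.id
        JOIN links l ON p.url = l.source
        WHERE NOT EXISTS (SELECT 1 FROM pages pp WHERE pp.url = l.target)
    ) AS sub
    ORDER BY host, score;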

Best regards
Wolfgang








How to use full-text search URL parser to filter query results by domain name?

2019-04-06 Thread Jess Wren
I am trying to understand how to use the full-text search parser for
URLs and hostnames to filter results from a text field containing URLs
based on domain, and also how to index text columns for fast
lookup/matching based on domain.

I have a PostgreSQL database containing documents and links downloaded
by a web crawler, with the following tables:

    pages
    --
    id:  Integer (primary key)
    url: String  (unique)
    title:   String
    text:    String
    html:    String
    last_visit:  DateTime
    word_pos:    TSVECTOR

    links
    --
    id Integer (primary key)
    source:    String
    target:    String
    link_text: String
    UNIQUE(source,target)

    crawls
    -
    id: Integer (primary key)
    query:  String

    crawl_results
    -
    id:   Integer (primary key)
    score:    Integer (constraint 0<=score<=1)
    crawl_id: Integer (foreign key, crawls.id)
    page_id:  Integer (foreign key, pages.id)

The `source` and `target` fields in the `links` table contain URLs. I am
running the following query to extract scored links from the top-ranking
search results, for pages that haven't been fetched yet:

    WITH top_results AS
    (SELECT page_id, score FROM crawl_results
    WHERE crawl_id=$1
    ORDER BY score LIMIT 100)
    SELECT top_results.score, l.target
    FROM top_results
    JOIN pages p ON top_results.page_id=p.id
    JOIN links l on p.url=l.source
    WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)


However, *I would like to filter these results so that only one row is
returned for a given domain (the one with the lowest score)*. So for
instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8,
'http://www.foo.com/zor'), I only want the first, because it has the same
domain `foo.com` and the lower score.

I was able to find documentation for the builtin full text search
parsers, which can parse URLs and extract the hostname. For instance, I
can extract the hostname from a URL as follows:

    SELECT token FROM ts_parse('default', 'http://www.foo.com') WHERE tokid = 6;

        token
    -------------
     www.foo.com
    (1 row)



However, I can't figure out how I would integrate this into the above
query to filter out duplicate domains from the results. And because this
appears in the docs under "testing and debugging text search", I don't
know if this use of `ts_parse()` is even related to how the URL parser is
intended to be used in practice.

How would I use the "host" parser in my query above to return one row
per domain? Also, how would I appropriately index the "links" table for
"host" and "url" token lookup?

Thanks!



Re: pg_upgrade --jobs

2019-04-06 Thread senor
Thanks Tom for the explanation. I assumed it was my ignorance of how the schema 
was handled that was making this look like a problem that had already been 
solved and I was missing something.

I fully expected the "You're Doing It Wrong" part. That is out of my control 
but not beyond my influence.

I suspect I know the answer to this but have to ask. Using a simplified example 
where there are 100K sets of 4 tables, each representing the output of a single 
job, are there any shortcuts to upgrading that would circumvent exporting the 
entire schema? I'm sure a different DB design would be better but that's not 
what I'm working with.

Thanks


From: Ron 
Sent: Saturday, April 6, 2019 4:57 PM
To: pgsql-general@lists.postgresql.org
Subject: Re: pg_upgrade --jobs

On 4/6/19 6:50 PM, Tom Lane wrote:

senor  writes:


[snip]

The --link option to pg_upgrade would be so much more useful if it
weren't still bound to serially dumping the schemas of half a million
tables.



To be perfectly blunt, if you've got a database with half a million
tables, You're Doing It Wrong.

Heavy (really heavy) partitioning?

--
Angular momentum makes the world go 'round.




Re: pg_upgrade --jobs

2019-04-06 Thread Ron

On 4/6/19 6:50 PM, Tom Lane wrote:

senor  writes:

[snip]

The --link option to pg_upgrade would be so much more useful if it
weren't still bound to serially dumping the schemas of half a million
tables.

To be perfectly blunt, if you've got a database with half a million
tables, You're Doing It Wrong.


Heavy (*really heavy*) partitioning?

--
Angular momentum makes the world go 'round.


Re: pg_upgrade --jobs

2019-04-06 Thread Tom Lane
senor  writes:
> Is the limitation simply the state of development to date or is there
> something about dumping the schemas that conflicts with paralleling?

At minimum, it'd take a complete redesign of pg_dump's output format,
and I'm not even very sure what such a redesign would look like.  All
the schema information goes into a single file that has to be written
serially.  Trying to make it be one file per table definition wouldn't
really fix much: somewhere there has to be a "table of contents", plus
where are you going to put the dependency info that shows what ordering
is required for restore?

> The --link option to pg_upgrade would be so much more useful if it
> weren't still bound to serially dumping the schemas of half a million
> tables.

To be perfectly blunt, if you've got a database with half a million
tables, You're Doing It Wrong.

regards, tom lane




Re: pg_upgrade --jobs

2019-04-06 Thread senor
Thanks Tom. I suppose "pg_dump can only parallelize data dumping" answers my 
original question as "expected behavior" but I would like to understand the 
reason better.

My knowledge of Postgres and other DBMSs is at casual admin level with the 
occasional deep dive on specific errors or analysis. I'm not averse to getting 
into the code. Before my OP I searched for reasons that the schema-only option 
would prevent pg_dump from being able to run multiple jobs and didn't find 
anything that I understood to confirm either way.

Is the limitation simply the state of development to date or is there something 
about dumping the schemas that conflicts with paralleling? I'm willing to do 
some studying if provided links to relevant articles.

The --link option to pg_upgrade would be so much more useful if it weren't 
still bound to serially dumping the schemas of half a million tables. As 
already mentioned, if there is an alternate process that mimics pg_upgrade but 
allows for paralleling, I'm open to that.

Thanks all


From: Tom Lane 
Sent: Saturday, April 6, 2019 3:02 PM
To: senor
Cc: pgsql-general@lists.postgresql.org
Subject: Re: pg_upgrade --jobs

senor  writes:
> Since pg_upgrade is in control of how it is calling pg_dump, is there a 
> reason pg_upgrade cannot use the directory output format when calling 
> pg_dump? Is the schema-only operation incompatible?

Well, there's no point in it.  pg_dump can only parallelize data dumping,
and there's none to be done in the --schema-only case that pg_upgrade
uses.

Also, since pg_upgrade *does* use parallelism across multiple pg_dump
calls (if you've got multiple databases in the cluster), it'd be a bit
problematic to have another layer of parallelism below that, if it did
indeed do anything.  You don't want "--jobs=10" to suddenly turn into
100 sessions.

regards, tom lane




Re: pg_upgrade --jobs

2019-04-06 Thread Tom Lane
senor  writes:
> Since pg_upgrade is in control of how it is calling pg_dump, is there a 
> reason pg_upgrade cannot use the directory output format when calling 
> pg_dump? Is the schema-only operation incompatible?

Well, there's no point in it.  pg_dump can only parallelize data dumping,
and there's none to be done in the --schema-only case that pg_upgrade
uses.

Also, since pg_upgrade *does* use parallelism across multiple pg_dump
calls (if you've got multiple databases in the cluster), it'd be a bit
problematic to have another layer of parallelism below that, if it did
indeed do anything.  You don't want "--jobs=10" to suddenly turn into
100 sessions.

regards, tom lane




Re: pg_upgrade --jobs

2019-04-06 Thread senor
Thank you for responding. I did see that note and should have included that as 
part of my question.

Since pg_upgrade is in control of how it is calling pg_dump, is there a reason 
pg_upgrade cannot use the directory output format when calling pg_dump? Is the 
schema-only operation incompatible?


From: Adrian Klaver 
Sent: Saturday, April 6, 2019 1:52 PM
To: senor; pgsql-general@lists.postgresql.org
Subject: Re: pg_upgrade --jobs

On 4/6/19 11:44 AM, senor wrote:
> The pg_upgrade --jobs option is not passed as an argument when it calls 
> pg_dump. I haven't found anything in docs or forums mentioning a reason for 
> not supporting under certain circumstances other than possibly for pre-9.2. 
> The pg_upgrade docs page states that it allows multiple CPUs to be used for 
> dump and reload of schemas. Some databases I'm upgrading have 500,000+ tables 
> and running with a single process is greatly increasing the upgrade time.
>
> I am also using the --link option.
> I have tried "--jobs 20", "--jobs=20", placing this option first and last and 
> many other variations.
> I am upgrading 9.2.4 to 9.6.12 on CentOS 6.
> Varying hardware but all with 32+ CPU cores.
>
> su - postgres -c "/usr/pgsql-9.6/bin/pg_upgrade --jobs=20 --link \
> --old-bindir=/usr/pgsql-9.2/bin/ \
> --new-bindir=/usr/pgsql-9.6/bin/ \
> --old-datadir=/var/lib/pgsql/9.2/data/ \
> --new-datadir=/var/lib/pgsql/9.6/data/"
>
> I feel like there's a simple reason I've missed but this seems pretty
> straightforward.

https://www.postgresql.org/docs/9.6/app-pgdump.html

"--jobs=njobs

 Run the dump in parallel by dumping njobs tables simultaneously.
This option reduces the time of the dump but it also increases the load
on the database server. You can only use this option with the directory
output format because this is the only output format where multiple
processes can write their data at the same time."


> A secondary plan would be to find instructions for doing the same as 
> "pg_upgrade --link" manually so I can run "pg_dump --jobs 20".
> Any assist is appreciated.
> Thanks,
> S. Cervesa
>
>


--
Adrian Klaver
adrian.kla...@aklaver.com




Re: pg_upgrade --jobs

2019-04-06 Thread Adrian Klaver

On 4/6/19 11:44 AM, senor wrote:

The pg_upgrade --jobs option is not passed as an argument when it calls 
pg_dump. I haven't found anything in docs or forums mentioning a reason for not 
supporting under certain circumstances other than possibly for pre-9.2. The 
pg_upgrade docs page states that it allows multiple CPUs to be used for dump 
and reload of schemas. Some databases I'm upgrading have 500,000+ tables and 
running with a single process is greatly increasing the upgrade time.

I am also using the --link option.
I have tried "--jobs 20", "--jobs=20", placing this option first and last and 
many other variations.
I am upgrading 9.2.4 to 9.6.12 on CentOS 6.
Varying hardware but all with 32+ CPU cores.

su - postgres -c "/usr/pgsql-9.6/bin/pg_upgrade --jobs=20 --link \
--old-bindir=/usr/pgsql-9.2/bin/ \
--new-bindir=/usr/pgsql-9.6/bin/ \
--old-datadir=/var/lib/pgsql/9.2/data/ \
--new-datadir=/var/lib/pgsql/9.6/data/"

I feel like there's a simple reason I've missed but this seems pretty
straightforward.


https://www.postgresql.org/docs/9.6/app-pgdump.html

"--jobs=njobs

Run the dump in parallel by dumping njobs tables simultaneously. 
This option reduces the time of the dump but it also increases the load 
on the database server. You can only use this option with the directory
output format because this is the only output format where multiple
processes can write their data at the same time."
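
For illustration, a directory-format parallel dump and restore looks roughly
like this (paths and database name are placeholders):

    # Sketch only: parallel dump to a directory, then parallel restore
    pg_dump -Fd -j 20 -f /path/to/dumpdir mydb
    pg_restore -j 20 -d mydb /path/to/dumpdir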



A secondary plan would be to find instructions for doing the same as "pg_upgrade --link" 
manually so I can run "pg_dump --jobs 20".
Any assist is appreciated.
Thanks,
S. Cervesa





--
Adrian Klaver
adrian.kla...@aklaver.com




pg_upgrade --jobs

2019-04-06 Thread senor
The pg_upgrade --jobs option is not passed as an argument when it calls 
pg_dump. I haven't found anything in docs or forums mentioning a reason for not 
supporting under certain circumstances other than possibly for pre-9.2. The 
pg_upgrade docs page states that it allows multiple CPUs to be used for dump 
and reload of schemas. Some databases I'm upgrading have 500,000+ tables and 
running with a single process is greatly increasing the upgrade time.

I am also using the --link option. 
I have tried "--jobs 20", "--jobs=20", placing this option first and last and 
many other variations.
I am upgrading 9.2.4 to 9.6.12 on CentOS 6.
Varying hardware but all with 32+ CPU cores.

su - postgres -c "/usr/pgsql-9.6/bin/pg_upgrade --jobs=20 --link \
   --old-bindir=/usr/pgsql-9.2/bin/ \
   --new-bindir=/usr/pgsql-9.6/bin/ \
   --old-datadir=/var/lib/pgsql/9.2/data/ \
   --new-datadir=/var/lib/pgsql/9.6/data/" 

I feel like there's a simple reason I've missed but this seems pretty
straightforward.
A secondary plan would be to find instructions for doing the same as 
"pg_upgrade --link" manually so I can run "pg_dump --jobs 20". 
Any assist is appreciated. 
Thanks,
S. Cervesa



Re: 9.6.11- could not truncate directory "pg_serial": apparent wraparound

2019-04-06 Thread Pavel Suderevsky
Guys, still need your help.

Previous night:
*2019-04-05 00:35:04 UTC  LOG:  could not truncate directory "pg_serial":
apparent wraparound*
*2019-04-05 00:40:04 UTC  LOG:  could not truncate directory "pg_serial":
apparent wraparound*
(2 checkpoints)

It turned out that I have some problem with performance related to predicate
locking on this platform.
A lot of long prepared statements waiting on the *SerializableXactHashLock* and
*predicate_lock_manager* wait events, followed by high CPU usage, occurred
between 00:30 and 00:45. During this period there were 55k pred locks
granted at peak and 30k on average. Probably because of the high CPU usage some
statements were spending a lot of time in the bind/parse steps.
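
Figures like these can be gathered with queries along these lines (a sketch;
SIRead predicate locks show up in pg_locks, and wait events in
pg_stat_activity on 9.6):

    -- Sketch: count granted SIRead predicate locks
    SELECT count(*) FROM pg_locks WHERE mode = 'SIReadLock' AND granted;

    -- Sketch: current wait events across sessions
    SELECT wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
    WHERE wait_event IS NOT NULL
    GROUP BY wait_event_type, wait_event
    ORDER BY count(*) DESC;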

If you could advise me on what might cause the *"pg_serial": apparent
wraparound* messages, I would have a better chance of handling all the
performance issues.

Thank you!
--
Pavel Suderevsky
E: psuderev...@gmail.com

пн, 11 мар. 2019 г. в 19:09, Pavel Suderevsky :

> Hi,
>
> PG: 9.6.11
> OS: CentOS
> Env: AWS EC2
>
> I've faced the following exceptions in PostgreSQL server log:
> > could not truncate directory "pg_serial": apparent wraparound
> Sometimes it repeats every 5 min and the longest period was 40 min.
>
> In fact, I can't find any suspicious events happening during those periods.
> pg_wait_sampling didn't catch any events, and no long queries (more than 60s),
> autovacuum workers or transactions in "idle in transaction" state were in
> action at this time.
>
> The only related I could find in archive is:
> https://www.postgresql.org/message-id/CACjxUsON4Vya3a6r%3DubwmN-4qTDDfZjuwSzjnL1QjdUc8_gzLw%40mail.gmail.com
> >You should not see the errors you are reporting nor
> >the warning I mentioned unless a serializable transaction remains
> >active long enough for about 1 billion transaction IDs to be
> >consumed.
>
> The database age is currently just 18.5 million transactions.
>
> Server has two standbys (sync and async), hot_standby_feedback is off.
>
> Please advise what I can do to find the reason for these exceptions.
>


SIGTERM/SIGINT master/slave behavior

2019-04-06 Thread rihad
Hi. When the master server receives a smart shutdown request (SIGTERM), does
it exit after making sure it has sent all received writes to the slave
server(s), or does it exit leaving the slave in an inconsistent state? What
about a fast shutdown (SIGINT)? I know that it asks current requests to
terminate immediately in that case, but what about the replication
connection? Is it safe to send SIGINT when intending to do a manual failover?



https://www.postgresql.org/docs/10/server-shutdown.html
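
My plan for a manual failover would be to verify on the standby, after the
primary has stopped, that everything received has also been replayed before
promoting (a sketch, assuming PostgreSQL 10 as in the linked docs):

    -- Sketch: run on the standby; the two LSNs should match before promotion
    SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();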


Thanks.