Re: [GENERAL] is (not) distinct from

2017-03-07 Thread Adrian Klaver

On 03/07/2017 01:22 AM, Johann Spies wrote:

Thanks (again Adrian) and Tom.

The situation was that I had a table with 731million records which I
wanted to copy into a partitioned one where there was a unique
constraint on the fields used in my query.

The "backup" table was the single one.

While inserting into the partitioned table from the backup one, several
(about 12) records failed to insert.

I wanted to find out which records were involved and found that some had
"'" characters in the values which broke some of the functions used to
do some calculations..

As there were fields that might have null values I have tried the "is
not distinct from".

Both sides of the query had primary keys and I did not use group by.
That was why I used "distinct".


Would it not be easier to use a LEFT JOIN between the original 
table(backup) and the new table:


https://www.postgresql.org/docs/9.6/static/sql-select.html

join_type

"LEFT OUTER JOIN returns all rows in the qualified Cartesian product 
(i.e., all combined rows that pass its join condition), plus one copy of 
each row in the left-hand table for which there was no right-hand row 
that passed the join condition. This left-hand row is extended to the 
full width of the joined table by inserting null values for the 
right-hand columns. Note that only the JOIN clause's own condition is 
considered while deciding which rows have matches. Outer conditions are 
applied afterwards."


So something like:

SELECT
a.pk
FROM
original_table AS a
LEFT JOIN--The OUTER is not required
new_table AS b
ON
a.pk = b.pk
WHERE
b.pk IS NULL

That would show all the rows in the original table that where not 
transferred over.




Anyhow in the end, I made some progress with a modified query:

where
s.citing_article = A.citing_article
   and
  s.cited_article !=  A.cited_article
   and
   s.pubyear is  distinct from A.pubyear
   and
   s.year_cited is distinct from A.year_cited
   and
   s.cited_author is distinct from A.cited_author
   and
regexp_replace(s.cited_title,  $$'$$, $$''$$,'g') is distinct from
regexp_replace(A.cited_title,  $$'$$, $$''$$,'g')
   and
regexp_replace(s.cited_work,  $$'$$, $$''$$,'g') is distinct
from regexp_replace(A.cited_work,   $$'$$, $$''$$,'g')
   and
s.doi is distinct from A.doi

Regards.
Johann

--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)



--
Adrian Klaver
adrian.kla...@aklaver.com


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] is (not) distinct from

2017-03-07 Thread Johann Spies
Thanks (again Adrian) and Tom.

The situation was that I had a table with 731million records which I wanted
to copy into a partitioned one where there was a unique constraint on the
fields used in my query.

The "backup" table was the single one.

While inserting into the partitioned table from the backup one, several
(about 12) records failed to insert.

I wanted to find out which records were involved and found that some had
"'" characters in the values which broke some of the functions used to do
some calculations..

As there were fields that might have null values I have tried the "is not
distinct from".

Both sides of the query had primary keys and I did not use group by.  That
was why I used "distinct".

Anyhow in the end, I made some progress with a modified query:

where
s.citing_article = A.citing_article
   and
  s.cited_article !=  A.cited_article
   and
   s.pubyear is  distinct from A.pubyear
   and
   s.year_cited is distinct from A.year_cited
   and
   s.cited_author is distinct from A.cited_author
   and
regexp_replace(s.cited_title,  $$'$$, $$''$$,'g') is distinct from
regexp_replace(A.cited_title,  $$'$$, $$''$$,'g')
   and
regexp_replace(s.cited_work,  $$'$$, $$''$$,'g') is distinct from
regexp_replace(A.cited_work,   $$'$$, $$''$$,'g')
   and
s.doi is distinct from A.doi

Regards.
Johann

-- 
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)


Re: [GENERAL] is (not) distinct from

2017-03-01 Thread Tom Lane
Adrian Klaver  writes:
> Where I am going with this, is that it is not clear to me how you are 
> matching the two sets of records to determine whether they are different 
> or not.

He's not.  The query is forming the cartesian product of the two tables
and then dropping join rows where the tables match ... but every B row is
going to have multiple A rows where it doesn't match, and those join rows
will all survive the WHERE.  Then "select distinct" gets rid of the
duplicates, and since nothing from A is presented in the result, it's not
very obvious what's happening.

This is a great example of "select distinct" being used as a band-aid
over a fundamental misunderstanding of SQL.  It's good advice to never use
"distinct" unless you know exactly why your query is generating duplicate
rows in the first place.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] is (not) distinct from

2017-03-01 Thread Adrian Klaver

On 03/01/2017 12:15 AM, Johann Spies wrote:

On 28 February 2017 at 17:06, Adrian Klaver mailto:adrian.kla...@aklaver.com>> wrote:


I have not worked through all this but at first glance I suspect:

select distinct b.* from b ...

is distinct from ...

constitutes a double negative.

What happens if you eliminate the first distinct?



Thanks Adrian,

The dynamics of the data has changed because of data updates so an exact
comparison is not possible.

Other tests now confirm that the 28 records are identical in both tables.
The results then become more confusing:

If I remove the first distinct
and use "is distinct from"

I get 756 rows

and when I use "is not distinct from"

I get 28.

In the first (756) case when I use "group by" the result of the first
query is exactly the same as the second one.


To be clear you are looking for records in citation that are different 
from citationbackup over a subset(Are there more fields?) of 8 fields, 
correct?


What do those 8 fields represent?

Is citationbackup really a backup of citation?

Is there a Primary Key on either/both tables?

What are you grouping by?

Where I am going with this, is that it is not clear to me how you are 
matching the two sets of records to determine whether they are different 
or not. Your result that yields 756 rows indicates that the comparison 
is not an apples to apples comparison, but a comparison of two 
'shuffled' sets. Adding the group by seems to sort that out. So some 
idea of what constitutes a difference and how you determine which 
records from each table you want to match would be helpful. If you could 
show the table schema and some sample data it would be even better.




Regards
Johann
--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)



--
Adrian Klaver
adrian.kla...@aklaver.com


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] is (not) distinct from

2017-03-01 Thread Johann Spies
On 28 February 2017 at 17:06, Adrian Klaver 
wrote:

>
> I have not worked through all this but at first glance I suspect:
>
> select distinct b.* from b ...
>
> is distinct from ...
>
> constitutes a double negative.
>
> What happens if you eliminate the first distinct?
>
>
>
> Thanks Adrian,

The dynamics of the data has changed because of data updates so an exact
comparison is not possible.

Other tests now confirm that the 28 records are identical in both tables.
The results then become more confusing:

If I remove the first distinct
and use "is distinct from"

I get 756 rows

and when I use "is not distinct from"

I get 28.

In the first (756) case when I use "group by" the result of the first query
is exactly the same as the second one.

Regards
Johann
-- 
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)


Re: [GENERAL] is (not) distinct from

2017-02-28 Thread Adrian Klaver

On 02/28/2017 12:08 AM, Johann Spies wrote:

When I query table a I get 18 rows.
The same query on table b results in 28 rows.

Both tables have the same structure.

When I export the results to csv-files and do a diff it confirms that
all 18 rows from a are also in b. Table b has 10 new rows.

When I combine these queries and use "is (not) distinct from"
I get strange results:

with a as (select citing_article, cited_article, pubyear, year_cited,
cited_author, cited_title, cited_work, doi
from wos_2017_1.citation
where citing_article='abcdefg'
order by 3,4,5,6,8),
b as (
select citing_article, cited_article, pubyear, year_cited, cited_author,
cited_title, cited_work, doi
from wos_2017_1.citationbackup
where citing_article='abcdefg'
order by 3,4,5,6,8)
select distinct b.* from b , a
where
( B.citing_article,
 B.cited_article,
 B.pubyear,
 B.year_cited,
 B.cited_author,
 B.cited_title,
 B.cited_work,
B.doi)
 is distinct from
(A.citing_article,
 A.cited_article,
 A.pubyear,
 A.year_cited,
 A.cited_author,
 A.cited_title,
 A.cited_work, A.doi)

The result of this query is 28 rows - thus that of b.
I expected this to be 10.

If I change the query to "is not distinct from" it results in 18 rows
which is what I would have expected.


I have not worked through all this but at first glance I suspect:

select distinct b.* from b ...

is distinct from ...

constitutes a double negative.

What happens if you eliminate the first distinct?




Regards
Johann.
--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)



--
Adrian Klaver
adrian.kla...@aklaver.com


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] is (not) distinct from

2017-02-28 Thread Johann Spies
When I query table a I get 18 rows.
The same query on table b results in 28 rows.

Both tables have the same structure.

When I export the results to csv-files and do a diff it confirms that all
18 rows from a are also in b. Table b has 10 new rows.

When I combine these queries and use "is (not) distinct from"
I get strange results:

with a as (select citing_article, cited_article, pubyear, year_cited,
cited_author, cited_title, cited_work, doi
from wos_2017_1.citation
where citing_article='abcdefg'
order by 3,4,5,6,8),
b as (
select citing_article, cited_article, pubyear, year_cited, cited_author,
cited_title, cited_work, doi
from wos_2017_1.citationbackup
where citing_article='abcdefg'
order by 3,4,5,6,8)
select distinct b.* from b , a
where
( B.citing_article,
 B.cited_article,
 B.pubyear,
 B.year_cited,
 B.cited_author,
 B.cited_title,
 B.cited_work,
B.doi)
 is distinct from
(A.citing_article,
 A.cited_article,
 A.pubyear,
 A.year_cited,
 A.cited_author,
 A.cited_title,
 A.cited_work, A.doi)

The result of this query is 28 rows - thus that of b.
I expected this to be 10.

If I change the query to "is not distinct from" it results in 18 rows
which is what I would have expected.

Regards
Johann.
-- 
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)


Re: [GENERAL] is not distinct from any(...)

2008-09-19 Thread Tom Lane
"Robert Haas" <[EMAIL PROTECTED]> writes:
> That doesn't seem to work, because IS NOT DISTINCT FROM is not an
> operator.

Yah :-(

> So then I tried creating an operator === (anyelement,
> anyelement) that just does IS NOT DISTINCT FROM and writing:
> select 1 === any(array[1]);
> which got me:
> ERROR:  could not find array type for data type anyelement

FWIW, it seems to work in 8.3.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] is not distinct from any(...)

2008-09-19 Thread Robert Haas
I'm trying to write a SQL statement to determine whether a value is an
an array, but I want the comparison to be done using IS NOT DISTINCT
FROM rather than =.

My first thought was that instead of writing:

SELECT value = ANY(array)

...I could simply write:

SELECT value IS NOT DISTINCT FROM ANY(array)

That doesn't seem to work, because IS NOT DISTINCT FROM is not an
operator.  So then I tried creating an operator === (anyelement,
anyelement) that just does IS NOT DISTINCT FROM and writing:

select 1 === any(array[1]);

which got me:

ERROR:  could not find array type for data type anyelement

Grr... any suggestions?

...Robert

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general