Re: identify duplicates

Chip Scheide via 4D_Tech Wed, 01 Nov 2017 08:09:58 -0700

One of the things others have mentioned, but have not really expounded 
upon...
What constitutes a duplicate?
For an individual field this can be complex, for multiple fields...

For some data types (numerics) a duplicate is easy to define
2 = 2 = 2 = 2 (integer/long integer)

For others (text) it is not simple.
A = A
But does 
A = a ?
or
A = <any extended character set of A>? (such as A[or a] + umlaut) 

Does
Mike MacKenzie = MacKenzie, Mike ?

If you have 3 fields which you are looking to locate duplication it 
becomes more of an issue (assuming each field is text).
It also could be an issue depending on how the 3 fields are 
concatenated.
ex.  Name + longint ID + street address

How is the name constructed? is it consistent across all data (Mike 
MacKenzie vs MacKenzie Mike vs Michael MacKenzie)
How is the longint ID converted to text?  String(ID), or 
String(ID;<some zero padding format>), is the padding long enough?
How is the street Name constructed? Number Street Name Designation (123 
East West Ave)
Is the street name construct consistent across all data? (123 East West 
Ave vs 123 East West vs 123 East West Avenue)

Assuming (a BIG assumption), that the data is 'clean', i.e. all the 
data is formatted similarly) a couple of approaches:
- might be to use the function DISTINCT VALUES (with counting option On 
- if available in your version of 4D) on all 3 fields independently 
then use one of the arrays to loop over the data looking for duplicate 
values in the other fields.

- if you can modify the data (add a field to the table), and the amount 
of data is not too large, or you have very little execution time 
constraint, you could add a field which would hold the concatenated 3 
field values, then use DISTINCT VALUES.

- I have not really worked with this aspect of 4D, so I am not sure, 
but I know that it is possible to build a composite index (2 [or more?] 
fields to make a single index. Whether you can use the composite index 
with DISTINCT VALUES I am not sure.

I'm sure there are ways in which the approach can be made efficient, 
but it will definitely depend on the data, and what constitutes a 
duplicate.

On Tue, 31 Oct 2017 14:40:50 -0600, David Witton via 4D_Tech wrote:
> I'm looking for a strategy to identify duplicate records in a table - that
> is, records for which 3 fields are identical across two or more records -
> or in another case, where a single field is not unique.
> 
> Does anyone have a suggestion on how to proceed?
> 
> -- 
> David Witton
> **********************************************************************
> 4D Internet Users Group (4D iNUG)
> FAQ:  http://lists.4d.com/faqnug.html
> Archive:  http://lists.4d.com/archives.html
> Options: http://lists.4d.com/mailman/options/4d_tech
> Unsub:  mailto:4d_tech-unsubscr...@lists.4d.com
> **********************************************************************
---------------
Gas is for washing parts
Alcohol is for drinkin'
Nitromethane is for racing 
**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:4d_tech-unsubscr...@lists.4d.com
**********************************************************************

Re: identify duplicates

Reply via email to