Addshore added a comment.

When discussing with people here in the office having the relation to a second table as in T148988#2744793 seems odd, as said above, normalization only changes the title in maybe 1% of cases (well, much less)... Hence the attempt to put the normalization in a table of their own, that would have a much smaller row count than compared with the titles table.

I think we will aim for a modified first version (If of course there are no strong oppositions):

The sites table will be added to make the dbname vs interwiki prefix lookups easy.

The sites table could either use the dbname as the primary key or use and int ID. this would then of course mean the site field in the titles table would either be a string (the dbname) or an int (the id)
In the case of an int ID DELETES would require a lookup in the sites table, as would inserts. SELECTS would also join the table.

group has been dropped from the table, the reasoning behind this is right now we only have the usecase for a single group, and if we every have more groups they can simply be configured to use a different database / tables.

CREATE TABLE IF NOT EXISTS cognate_sites (
  cs_dbname VARBINARY(32) PRIMARY KEY NOT NULL,
  cs_interwiki VARBINARY(32) NOT NULL
);

The titles table would contain the same basic fields.

title_key would be a normalized version of the title.
title_source would be the origional title in cases that it differs from title_key. NULL if they are the same.

CREATE TABLE IF NOT EXISTS cognate_titles (
  cgti_site VARBINARY(32) NOT NULL,
  cgti_namespace INT NOT NULL,
  cgti_title_source VARBINARY(255),
  cgti_title_key VARBINARY(255) NOT NULL,
  );

Example data can be seen below:

cgti_site, cgti_namespace, cgti_title_source, cgti_title_key
1,0,NULL,Berlin
2,0,NULL,Berlin
1,0,Ellipsis…, Ellipsis...
2,0,NULL, Ellipsis...

SELECTS here would thus always select using namespace, and title_key, joining on the sites table.

A final question here is regarding the keys for the table.

  1. A strictly unique key (possible PK) in the table would be a combination of all 4 fields, Do you see this as being too much? Would the id vs string for site make a difference here?
  2. Another option could be a PK on the site, namespace and title_key (but this could run into an edge case where a single site contains 2 different titles that get normalized to the same key.

TASK DETAIL
https://phabricator.wikimedia.org/T148988

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: jcrespo, Addshore
Cc: hoo, Aklapper, jcrespo, Addshore, Marostegui, Minhnv-2809, D3r1ck01, Izno, Luke081515, Wikidata-bugs, aude, Darkdadaah, Mbch331, Jay8g, Krenair
_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

Reply via email to