dcausse added a comment.

  At a glance, I suspect that you might now get duplicated QIDs in:

    sa_and_sasc_ids = (
        df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
        .where(col("predicate") == P31_DIRECT_URL)
        .where(col("object").isin(sa_and_sasc_qids))
        .alias("sa_and_sasc_ids")
    )

  This could be explained by entities being tagged with multiple entries found in `sa_and_sasc_qids`: such an entity matches the `isin` filter once per matching entry, so its subject appears on several rows. What happens if you apply a `distinct` here?

    sa_and_sasc_ids = (
        df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
        .where(col("predicate") == P31_DIRECT_URL)
        .where(col("object").isin(sa_and_sasc_qids))
        .distinct()
        .alias("sa_and_sasc_ids")
    )

TASK DETAIL
  https://phabricator.wikimedia.org/T342123

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE, dcausse
Cc: dcausse, Lydia_Pintscher, dr0ptp4kt, Aklapper, Manuel, Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
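
For illustration only, the duplication suspected in the comment above can be reproduced without Spark. The following is a minimal plain-Python sketch, not the actual job: the triples, the placeholder class IDs, and the helper `matching_subjects` are all hypothetical, and the value assigned to `P31_DIRECT_URL` is an assumption about how that constant is defined. It mimics the select/where chain on a tiny list of (subject, predicate, object) rows and shows why a subject tagged with two of the listed classes comes back twice until a deduplication step (the role `distinct` would play) is applied.

```python
# Hypothetical stand-in for df_wikidata_rdf: (subject, predicate, object) rows.
P31_DIRECT_URL = "<http://www.wikidata.org/prop/direct/P31>"  # assumed value
sa_and_sasc_qids = ["Q_article", "Q_chapter"]  # placeholder class IDs, not real QIDs

triples = [
    ("Q1", P31_DIRECT_URL, "Q_article"),
    ("Q1", P31_DIRECT_URL, "Q_chapter"),  # Q1 is tagged with both listed classes
    ("Q2", P31_DIRECT_URL, "Q_article"),
    ("Q3", P31_DIRECT_URL, "Q_other"),    # not in the list, filtered out
]


def matching_subjects(rows, classes):
    """Mimic select(subject).where(predicate == P31).where(object.isin(classes))."""
    wanted = set(classes)
    return [s for (s, p, o) in rows if p == P31_DIRECT_URL and o in wanted]


# Without deduplication, Q1 is returned once per matching class.
subjects = matching_subjects(triples, sa_and_sasc_qids)

# Deduplicating plays the role that .distinct() would play on the DataFrame.
distinct_subjects = sorted(set(subjects))

print(subjects)
print(distinct_subjects)
```

If the real data behaves like this toy example, any later join against `sa_and_sasc_ids` would multiply rows for entities such as `Q1`, which is consistent with the suspicion that the duplicates come from this step.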
_______________________________________________
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org