dcausse added a comment.

  At a glance, I suspect that you might now get duplicated QIDs in:
  
    sa_and_sasc_ids = (
        df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
        .where(col("predicate") == P31_DIRECT_URL)
        .where(col("object").isin(sa_and_sasc_qids))
        .alias("sa_and_sasc_ids")
    )
  
  This could be explained by entities being tagged with multiple entries found
  in `sa_and_sasc_qids`: each matching triple then produces its own row for the
  same subject.
  What happens if you apply a `distinct` here:
  
    sa_and_sasc_ids = (
        df_wikidata_rdf.select(col("subject").alias("sa_and_sasc_qids"))
        .where(col("predicate") == P31_DIRECT_URL)
        .where(col("object").isin(sa_and_sasc_qids))
        .distinct()
        .alias("sa_and_sasc_ids")
    )
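
To illustrate the suspicion, here is a minimal plain-Python sketch (no Spark needed) of how the same subject can appear twice after the filter. The triples, the placeholder value for `P31_DIRECT_URL`, and the class QIDs are made up for illustration and are not taken from the real dataset; the list comprehension only mimics the two `where` clauses and `set` stands in for `distinct`.

```python
# Hypothetical placeholders; the real constant and QID list live in the job.
P31_DIRECT_URL = "P31_DIRECT_URL_PLACEHOLDER"
sa_and_sasc_qids = ["Q_CLASS_A", "Q_CLASS_B"]

# Made-up (subject, predicate, object) triples.
triples = [
    ("Q1", P31_DIRECT_URL, "Q_CLASS_A"),   # Q1 is tagged with class A ...
    ("Q1", P31_DIRECT_URL, "Q_CLASS_B"),   # ... and also with class B
    ("Q2", P31_DIRECT_URL, "Q_CLASS_A"),
    ("Q3", P31_DIRECT_URL, "Q_OTHER"),     # class not in the list
    ("Q2", "OTHER_PREDICATE", "Q_CLASS_B"),  # wrong predicate
]

# Same filter as the two .where() calls: one output row per matching triple.
matched = [
    subject
    for (subject, predicate, obj) in triples
    if predicate == P31_DIRECT_URL and obj in sa_and_sasc_qids
]

# Rough equivalent of .distinct(): keep each subject once.
deduplicated = sorted(set(matched))

print(matched)       # ['Q1', 'Q1', 'Q2']
print(deduplicated)  # ['Q1', 'Q2']
```

If the real data behaves like this, comparing the row count before and after the `distinct` should show whether duplicates are the cause.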

TASK DETAIL
  https://phabricator.wikimedia.org/T342123

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE, dcausse
Cc: dcausse, Lydia_Pintscher, dr0ptp4kt, Aklapper, Manuel, 
Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, 
ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
_______________________________________________
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org
