AndrewTavis_WMDE added a comment.
As far as the Spark UDF issues are concerned, let me sketch out the process here, as it's in a separate notebook from the main one just linked. The general goal is to explore using UDFs to easily derive data via the `claims` column of `wmf.wikidata_entity`. We can easily find out how many scholarly articles we have via the `discovery.wikibase_rdf` table, as in the example notebook I linked on `people.wikimedia.org`, but the goal was then to do something similar via `wmf.wikidata_entity.claims` so that I have a claims exploration example to work from later :)

I've made major progress on this this morning, but some new questions have come up. The initial problem I was facing was that I thought UDFs could reference local functions, which doesn't seem to be the case. I had a function that called itself recursively, and because of this it needed to be its own function, but calling it in the UDF returned `null`. As soon as I defined the recursive function within the UDF itself, it was fine :)

Now onto something that's very much confusing me. Say we have the following query referencing `wmf.wikidata_entity` for Q1895685 (Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid) <https://www.wikidata.org/wiki/Q1895685>:

```
sa_preview_query = """
SELECT *
FROM wmf.wikidata_entity
WHERE snapshot = '2023-07-24'
    AND id = 'Q1895685'
"""
```

Getting a single row from this table for testing purposes is faster with Presto, but that ended up causing the next problem in trying to make a UDF... The base version of the function I had written worked on the table I'd gotten from Presto, but it wasn't working when I used it in Spark. The reason is that if I first use Presto with the query above, then the `claims` output is a list of lists, but if I use Spark purely, then the output is a dictionary:

```
df_sa_preview = wmf.presto.run(
    commands=sa_preview_query
)  # claims = [[Q...]...]
```
```
df_sa_preview_spark = (
    spark.table("wmf.wikidata_entity")
    .where("snapshot = '2023-07-24'")
    .where("id = 'Q1895685'")
    .alias("df_sa_preview_spark")
)  # claims = [{Q...}...]
```

This is teaching me to always test UDFs solely within a Spark context, but I'm confused as to why the column outputs are different. Given that Spark is returning the more traditional JSON/dictionary structure, I'm assuming that the change is happening in how Presto outputs data?

TASK DETAIL
https://phabricator.wikimedia.org/T342111
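To illustrate the recursion workaround mentioned above in plain Python terms (no Spark needed): the recursive helper has to live inside the function that becomes the UDF body, so the UDF never closes over anything from the notebook's local scope. The function name `extract_values` and the claim shape here are hypothetical, just for illustration:

```python
# Plain-Python analogue of the UDF workaround: the recursive helper is
# defined *inside* the outer function (the would-be UDF body), so nothing
# from the enclosing notebook scope is referenced at execution time.
# `extract_values` and the claim structure are hypothetical examples.

def extract_values(claim):
    """Collect all leaf values from an arbitrarily nested claim."""

    def _collect(node):  # recursive helper, local to the "UDF"
        if isinstance(node, dict):
            values = []
            for child in node.values():
                values.extend(_collect(child))
            return values
        if isinstance(node, (list, tuple)):
            values = []
            for child in node:
                values.extend(_collect(child))
            return values
        return [node]  # leaf value

    return _collect(claim)


print(extract_values({"mainsnak": {"datavalue": {"value": ["Q13442814", 5]}}}))
# → ['Q13442814', 5]
```

The same structure wrapped with `pyspark.sql.functions.udf` should avoid the `null` results, since the helper is serialized along with the UDF body itself.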
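On the Presto-vs-Spark difference: since the same `claims` column comes back as positional nested lists from Presto but as dict-like structs from Spark, a function meant to survive both contexts would have to normalize its input first. A minimal sketch, assuming the Presto lists simply flatten the struct in field order (the field names `id` and `mainSnak` are made up, not the real `wmf.wikidata_entity` schema):

```python
# Hypothetical normalizer: accept either the Presto-style positional-list
# form or the Spark-style dict form of a claim, and return the dict form.
# The field names ("id", "mainSnak") are illustrative only.

def normalize_claim(claim, field_names=("id", "mainSnak")):
    if isinstance(claim, dict):
        return claim  # already the Spark-style dict structure
    if isinstance(claim, (list, tuple)):
        # Presto flattened the struct into a positional list; zip it
        # back together with the assumed field order.
        return dict(zip(field_names, claim))
    raise TypeError(f"Unexpected claim type: {type(claim).__name__}")


print(normalize_claim(["Q1895685$abc", {"datatype": "wikibase-item"}]))
# → {'id': 'Q1895685$abc', 'mainSnak': {'datatype': 'wikibase-item'}}
```

That said, the positional form depends on knowing the field order, which is exactly why testing the UDF solely within a Spark context seems like the safer habit.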