AndrewTavis_WMDE added a comment.
As far as the Spark UDF issues are concerned, let me sketch out the process here, as it's in a separate notebook from the main one just linked. The general goal is to explore using UDFs to easily derive data via the `claims` column of `wmf.wikidata_entity`. We can easily find out how many scholarly articles we have via the `discovery.wikibase_rdf` table, as in the example notebook I linked on `people.wikimedia.org`, but the goal was then to do something similar via `wmf.wikidata_entity.claims` so that I have a claims exploration example to work from later :)

I've made major progress on this this morning, but some new questions have come up. The initial problem I was facing was that I thought UDFs could reference local functions, which doesn't seem to be the case. I had a function that called itself recursively, and because of this it needed to be its own function, but calling it in the UDF returned `null`. As soon as I defined the recursive function within the UDF itself, it was fine :)

Now onto something that's very much confusing me. Say we have the following query referencing `wmf.wikidata_entity` for Q1895685 (Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid) <https://www.wikidata.org/wiki/Q1895685>:

```
sa_preview_query = """
SELECT *
FROM wmf.wikidata_entity
WHERE snapshot = '2023-07-24'
    AND id = 'Q1895685'
"""
```

Getting a single row from this table for testing purposes is faster with Presto, but that ended up causing the next problem in trying to make a UDF... The base version of the function I had written worked on the table I'd gotten from Presto, but it wasn't working when I used it in Spark. The reason is that if I first use Presto with the query above, then the `claims` output is a list of lists, but if I use Spark purely, then the output is a dictionary:

```
df_sa_preview = wmf.presto.run(
    commands=sa_preview_query
)  # claims = [[Q...]...]
```
```
df_sa_preview_spark = (
    spark.table("wmf.wikidata_entity")
    .where("snapshot = '2023-07-24'")
    .where("id = 'Q1895685'")
    .alias("df_sa_preview_spark")
)  # claims = [{Q...}...]
```

This is teaching me to always test UDFs solely within a Spark context, but I'm confused as to why the column outputs are different. Given that Spark is returning the more traditional JSON/dictionary structure, I'm assuming that the change is happening in how Presto outputs data?

TASK DETAIL
https://phabricator.wikimedia.org/T342111
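To illustrate the recursion workaround mentioned above in plain Python terms (no Spark needed): the recursive helper has to live inside the function that becomes the UDF body, so the UDF never closes over anything from the notebook's local scope. The function name `extract_values` and the claim shape here are hypothetical, just for illustration:

```python
# Plain-Python analogue of the UDF workaround: the recursive helper is
# defined *inside* the outer function (the would-be UDF body), so nothing
# from the enclosing notebook scope is referenced at execution time.
# `extract_values` and the claim structure are hypothetical examples.

def extract_values(claim):
    """Collect all leaf values from an arbitrarily nested claim."""

    def _collect(node):  # recursive helper, local to the "UDF"
        if isinstance(node, dict):
            values = []
            for child in node.values():
                values.extend(_collect(child))
            return values
        if isinstance(node, (list, tuple)):
            values = []
            for child in node:
                values.extend(_collect(child))
            return values
        return [node]  # leaf value

    return _collect(claim)


print(extract_values({"mainsnak": {"datavalue": {"value": ["Q13442814", 5]}}}))
# → ['Q13442814', 5]
```

The same structure wrapped with `pyspark.sql.functions.udf` should avoid the `null` results, since the helper is serialized along with the UDF body itself.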
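On the Presto-vs-Spark difference: since the same `claims` column comes back as positional nested lists from Presto but as dict-like structs from Spark, a function meant to survive both contexts would have to normalize its input first. A minimal sketch, assuming the Presto lists simply flatten the struct in field order (the field names `id` and `mainSnak` are made up, not the real `wmf.wikidata_entity` schema):

```python
# Hypothetical normalizer: accept either the Presto-style positional-list
# form or the Spark-style dict form of a claim, and return the dict form.
# The field names ("id", "mainSnak") are illustrative only.

def normalize_claim(claim, field_names=("id", "mainSnak")):
    if isinstance(claim, dict):
        return claim  # already the Spark-style dict structure
    if isinstance(claim, (list, tuple)):
        # Presto flattened the struct into a positional list; zip it
        # back together with the assumed field order.
        return dict(zip(field_names, claim))
    raise TypeError(f"Unexpected claim type: {type(claim).__name__}")


print(normalize_claim(["Q1895685$abc", {"datatype": "wikibase-item"}]))
# → {'id': 'Q1895685$abc', 'mainSnak': {'datatype': 'wikibase-item'}}
```

That said, the positional form depends on knowing the field order, which is exactly why testing the UDF solely within a Spark context seems like the safer habit.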