GoranSMilovanovic added a comment.

  @Lydia_Pintscher @Lea_WMDE @WMDE-leszek
  
  The data that you are looking for are **extremely** difficult to obtain.
  
  The only way that works - or at least the only one that I was able to 
discover - is to parse the revisions from the Mediawiki wikitext history 
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/Mediawiki_wikitext_history>
 table in the Data Lake, which holds "//... the 
full-historical-revision wikitext history of WMF's wikis, as provided through 
monthly XML Dumps//". However, the data there are not structured, so I will 
have to parse the revision text with regular expressions to determine when the 
constraints specified in the ticket description are met.
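
  As a minimal sketch of what regex-based parsing of revision text could look 
like: the pattern, the field names and the sample revisions below are 
hypothetical illustrations, not the actual constraints from the ticket.

```python
import re

# Hypothetical pattern: a property reference such as "property": "P31".
# The real constraints are defined in the ticket description, not here.
PATTERN = re.compile(r'"property"\s*:\s*"(P\d+)"')


def revision_matches(text):
    """Return True if the revision text contains at least one match."""
    return PATTERN.search(text) is not None


# Made-up sample rows standing in for revisions read from the table.
revisions = [
    {"revision_id": 1, "text": '{"mainsnak": {"property": "P31"}}'},
    {"revision_id": 2, "text": "plain text without any property reference"},
]

# Keep only the IDs of revisions whose text satisfies the pattern.
matching_ids = [r["revision_id"] for r in revisions if revision_matches(r["text"])]
print(matching_ids)
```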
  
  To add a further layer of complexity, some useful regex functions (e.g. 
regexp_extract_all 
<https://spark.apache.org/docs/latest/api/sql/#regexp_extract_all>) are not 
available in the version of Apache Spark currently running on our Analytics 
Cluster. That means that I need to work partly in the Analytics Cluster 
(PySpark to extract the data with some basic filtering) and partly on the 
Analytics Clients (Python or R to process the data against the definitions of 
the constraints that you have specified). At this point, even figuring out the 
correct repartitioning of the dataset, so that it can be stored efficiently in 
HDFS and then processed in-memory on the Analytics Clients, is turning out to 
be very complicated.
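
  For the client-side half of that split, a rough Python stand-in for 
regexp_extract_all might look like the sketch below. The helper name simply 
mirrors the Spark function and the sample string is made up; note also that 
Python's re syntax differs from the Java regex syntax that Spark uses, so 
patterns may need adjusting.

```python
import re


def regexp_extract_all(text, pattern, idx=1):
    """Return group `idx` of every match of `pattern` in `text`.

    Illustrative helper only: it imitates the argument order of Spark SQL's
    regexp_extract_all (idx=0 is the whole match, idx=1 the first group).
    """
    return [m.group(idx) for m in re.finditer(pattern, text)]


# Made-up sample text standing in for one extracted revision.
sample = "P31 appears twice: P31 and then P279"
print(regexp_extract_all(sample, r"(P\d+)"))
```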
  
  That being said: I am very much focused on this, but I cannot promise that 
it will be finished as soon as I had expected.

TASK DETAIL
  https://phabricator.wikimedia.org/T278698

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: GoranSMilovanovic
Cc: WMDE-leszek, Aklapper, GoranSMilovanovic, Lea_WMDE, Lydia_Pintscher, 
Invadibot, maantietaja, Akuckartz, Nandana, Lahi, Gq86, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
