[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-05-23 Thread AndrewTavis_WMDE
AndrewTavis_WMDE changed the task status from "Open" to "Stalled".

TASK DETAIL
  https://phabricator.wikimedia.org/T362849

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: JAllemandou, mpopov, AndrewTavis_WMDE, Manuel, Aklapper, 
Danny_Benjafield_WMDE, S8321414, Astuthiodit_1, karapayneWMDE, Invadibot, 
maantietaja, ItamarWMDE, Akuckartz, Dringsim, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, KimKelting, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-05-17 Thread AndrewTavis_WMDE
AndrewTavis_WMDE updated the task description.



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-05-17 Thread AndrewTavis_WMDE
AndrewTavis_WMDE added a comment.


  Note that MR#700 has been opened with the work for this :)



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-05-15 Thread AndrewTavis_WMDE
AndrewTavis_WMDE updated the task description.



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-26 Thread AndrewTavis_WMDE
AndrewTavis_WMDE updated the task description.



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-26 Thread AndrewTavis_WMDE
AndrewTavis_WMDE added a comment.


  See T362849_wd_item_sitelink_segments.ipynb for the work to derive the segments :)



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-26 Thread AndrewTavis_WMDE
AndrewTavis_WMDE updated the task description.



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-26 Thread AndrewTavis_WMDE
AndrewTavis_WMDE added a comment.


  Ok, so the new numbers after the change in scope for the max `2024-04-15` 
snapshot are:
  
items_with_sitelinks: 32,231,861
items_items_with_sitelinks_link_to: 2,980,388
all_other_items: 72,910,679
  
  For documentation, the numbers for the original Population B definition for 
the min `2024-02-26` snapshot were:
  
items_with_sitelinks: 31,978,738
linked_to_items_with_sitelinks: 75,221,879
all_other_items: 242,565
  
  Status on the rest of this:
  
  - The weekly DAG is written and also includes an export to the 
published datasets repo
    - I've also included the work for T361203 in this
  - We need to confirm the numbers above and the method that generates them
  - I'll then rewrite the DAG job that runs the query
  - Then for testing, I'll need the table `wmde.wd_item_sitelink_segments_weekly` 
to be made in HDFS by an admin, and then we can go into production
  - This should all be done by the Tuesday/Wednesday evening after I'm back in a 
few weeks, depending on folks' availability
  - I'll make a new task for the historic data generation process, which will 
depend on T363451 
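  
  As a quick sanity check on the two sets of numbers above, the per-snapshot totals and shares can be recomputed. This is only a sketch: it assumes the three segments in each snapshot are disjoint and together cover all items, which still needs to be confirmed along with the method. The helper name `summarize` is made up for illustration.

```python
# Counts copied from the comment above. Assumption: within each snapshot the
# three segments are disjoint and together cover every item.
snapshots = {
    "2024-04-15": {
        "items_with_sitelinks": 32_231_861,
        "items_items_with_sitelinks_link_to": 2_980_388,
        "all_other_items": 72_910_679,
    },
    "2024-02-26": {
        "items_with_sitelinks": 31_978_738,
        "linked_to_items_with_sitelinks": 75_221_879,
        "all_other_items": 242_565,
    },
}


def summarize(counts):
    """Return the total item count and each segment's share in percent."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
    return total, shares


totals = {snap: summarize(counts)[0] for snap, counts in snapshots.items()}

for snap, counts in snapshots.items():
    total, shares = summarize(counts)
    print(snap, f"{total:,}", shares)
```

  The first segment barely moves between the two snapshots, so the large shift between the second and third segments comes from the change in the Population B definition rather than from the data itself.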



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-25 Thread Manuel
Manuel updated the task description.



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-25 Thread Manuel
Manuel updated the task description.



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-25 Thread AndrewTavis_WMDE
AndrewTavis_WMDE added a comment.


  See https://phabricator.wikimedia.org/T363451 for the task about bringing 
back the partition (hopefully via another job). I added a bit about whether we 
might want to turn this job on when WMDE needs historical data. Let me know 
what you all think on that :)



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-24 Thread Manuel
Manuel added a comment.


  About the missing revision history:
  
  - Did I understand correctly that we do not have any kind of complete edit 
history for Wikidata on the data lake? If so, we will need to find a solution 
for this, as my assumption is that we will need this kind of information for 
other use cases as well. What you found out about potential solutions will be 
helpful. Still, if this needs to be newly implemented in any case, the first 
step should be finding out what exactly we need. Could you create a separate 
task for this project?
  - It would help to understand what options this currently leaves us to access 
historic revisions: To my knowledge, the dumps for Wikidata stopped including a 
complete page history at some point. Does this leave us with only the live API 
and MariaDB?



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-24 Thread Manuel
Manuel updated the task description.



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-24 Thread Manuel
Manuel added a comment.


  Thank you for digging into this:
  
  > I'll begin work on a DAG based on wmf.wikidata_entity
  
  Sounds good to me! I changed the description accordingly.
  
  > Are we fine with a weekly DAG?
  
  Sure!



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-23 Thread AndrewTavis_WMDE
AndrewTavis_WMDE added a comment.


  Another note on this is: if we don't expect to be needing a Wikidata 
partition of `wmf.mediawiki_wikitext_history` for other tasks, then we could 
work directly from the XML dump for the data backdate. We wouldn't be able to 
leverage PySpark for the querying though, so I worry about how long all of this 
would take...



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-23 Thread AndrewTavis_WMDE
AndrewTavis_WMDE added a subscriber: JAllemandou.
AndrewTavis_WMDE added a comment.


  Thanks for all of the information, @mpopov!
  
  I talked this over in my bi-weekly with @JAllemandou, and would like to bring 
some further context to this particular situation :)
  
  The go-to table for this would be wmf.wikidata_entity for the following reasons:
  
  - It has the `sitelinks` column for Population A above
  - It has the `claims` column for Population B above
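  
  To make the two populations concrete, here is a minimal sketch of the segmentation on toy data. It is plain Python rather than the actual DAG query, and the record shape (`sitelinks` as a list of wiki names, `claim_targets` as a list of item IDs that the item's claims point to) is a simplified, hypothetical stand-in for the real `sitelinks` and `claims` columns. It also assumes one reading of Population B: items without sitelinks that are linked to from items with sitelinks.

```python
def segment_items(entities):
    """Split items into populations A, B and C.

    entities: dict mapping item ID -> {"sitelinks": [...], "claim_targets": [...]}
    (a hypothetical, simplified shape, not the real table schema).
    """
    # A: items with at least one sitelink
    a = {qid for qid, e in entities.items() if e["sitelinks"]}
    # Everything that items in A point to through their claims
    linked = {target for qid in a for target in entities[qid]["claim_targets"]}
    # B: linked-to items that exist and are not already in A
    b = {qid for qid in linked if qid in entities and qid not in a}
    # C: all remaining items
    c = set(entities) - a - b
    return a, b, c


toy = {
    "Q1": {"sitelinks": ["enwiki"], "claim_targets": ["Q2", "Q3"]},
    "Q2": {"sitelinks": [], "claim_targets": []},
    "Q3": {"sitelinks": ["dewiki"], "claim_targets": []},
    "Q4": {"sitelinks": [], "claim_targets": ["Q2"]},
}

a, b, c = segment_items(toy)
print(sorted(a), sorted(b), sorted(c))
```

  In this toy set, Q4 links to Q2 but has no sitelink itself, so it does not pull anything into B and ends up in C.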
  
  It thus has everything we need for the given task for future data. One change 
to the output for this though would be the frequency of the DAG, as 
`wmf.wikidata_entity` is a weekly data dump, so it'd make sense to do a weekly 
DAG. If we still want to do a monthly job, then the best option would be to do 
a DAG that runs on the first Monday of every month (in the docs for 
`wmf.wikidata_entity` it mentions the `2020-01-20` snapshot, which was a 
Monday).
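  
  For the monthly option, the run dates can be derived with the standard library. A small sketch follows; `first_monday` is a made-up helper name and not something from the DAG code.

```python
from datetime import date, timedelta


def first_monday(year, month):
    """Return the date of the first Monday of the given month."""
    first = date(year, month, 1)
    # date.weekday() returns 0 for Monday, so this is the number of days
    # from the 1st of the month to the first Monday (0 if the 1st is a Monday)
    offset = (7 - first.weekday()) % 7
    return first + timedelta(days=offset)


# The snapshot mentioned in the docs falls on a Monday
print(date(2020, 1, 20).weekday() == 0)
print(first_monday(2024, 5))
```

  A monthly job would then pick the `wmf.wikidata_entity` snapshot whose date matches `first_monday(year, month)`.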
  
  Now we get to the question of the historical data... This is a situation that 
cannot be solved at this time given the current makeup of the Data Lake. As 
mentioned on Mattermost: we currently do not have Wikidata as a partition 
within wmf.mediawiki_wikitext_history, so we do not have historical versions of 
Wikidata items with which we'd be able to rebuild the history. The assumption 
we're making on this is that the legacy version of these metrics was made using 
`wmf.mediawiki_wikitext_history` at a time when Wikidata was still an available 
partition. The change removing Wikidata from the 
`wmf.mediawiki_wikitext_history` dump process was made in `2024-02` - see 
T357859, where ~12 of 25 days of the dump generation were for the Wikidata XML 
dump. This was slowing down metrics delivery for WMF Movements Insights.
  
  Steps forward on this:
  
  - I'll begin work on a DAG based on `wmf.wikidata_entity`, as even if we do 
get a Wikidata partition within `wmf.mediawiki_wikitext_history`, it would not 
be used for recent data updates
    - Are we fine with a weekly DAG?
  - A decision needs to be made on whether WMDE is requesting that Wikidata data 
again be an output of the `wmf.mediawiki_wikitext_history` snapshot creation 
process
    - The preferred solution here would be to not revert the changes from 
T357859, but rather to make a new job that adds a new partition to the table 
via the Wikidata XML dump
    - The reason for this is to ensure that WMF Movements Insights can maintain 
the current speed of delivery
    - @JAllemandou has said that bringing the Wikidata partition back is fine 
if we need it (again, preferably in the above way)
  - If the request is being made, a new task should be made for it
  - We'd then do what I'd argue would be a separate task whereby the new 
`wmf.mediawiki_wikitext_history` Wikidata partition would be used to recompute 
the historical populations above
  
  Let me know what thoughts are on the above!



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-23 Thread AndrewTavis_WMDE
AndrewTavis_WMDE updated the task description.



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-23 Thread AndrewTavis_WMDE
AndrewTavis_WMDE claimed this task.
AndrewTavis_WMDE updated the task description.



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-22 Thread Manuel
Manuel added a comment.


  Hi @mpopov, thank you for your input!
  
  This confirms what I mentioned already, @AndrewTavis_WMDE: For a similar 
metric our legacy systems were set up to re-compute the entire history with 
each new snapshot. This would be the easiest solution in this case as well. To 
save computing resources, we could also use the newest snapshots only to add 
the most recent new data points. As a result, the historic data (where no 
snapshots are available anymore) would follow a slightly different definition, 
but this seems ok here for now.



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-19 Thread mpopov
mpopov added subscribers: AndrewTavis_WMDE, mpopov.
mpopov added a comment.


  @AndrewTavis_WMDE asked me for some thoughts/suggestions here :)
  
  I started typing out a DM reply but decided some of this stuff would be good 
to have on public record.
  
  > it's not normal that snapshots go back a decade plus, so I'm a bit confused 
on this
  
  The way that MediaWiki and Wikidata snapshots work – and have to work, due to 
the nature of the data – is they are snapshots in time of EVERYTHING at the 
time of the snapshot generation. This is why even `wmf.edits_hourly` (or 
whatever that table is called) can contain counts of edits made in April even 
though the latest snapshot is '2024-04' – it's indiscriminate of timestamps 
associated with any of the data.
  
  I think 3-4 snapshots back is probably a good number of snapshots to keep 
because it does enable us to investigate odd discrepancies between snapshots 
(T355182) – beyond the state change problem. The challenge with this data that 
you may have come across is that the state of things (whether an edit got 
deleted or reverted, whether a user is labelled as a bot or not) changes over 
time, so the same edit made years ago or the same user can be categorized 
differently from snapshot to snapshot.
  
  Ultimately, **any metric that is calculated from data which can change state 
is going to be subject to drift when a static measurement is stored anywhere.** 
We actually run into this problem with the key result for FY23-24 Wiki 
Experiences Objective 1.1 (Superset dashboard) that aims to increase the 
number of unreverted (and undeleted) mobile contributions to articles on 
Wikipedia by 10%.
  
  Throughout March 2024 – when the '2024-02' snapshot was used – the metric for 
the KR was at 4.7%. Then, when the '2024-03' snapshot was generated (at the 
beginning of April), the February value of that metric changed to 4.4% – 
because the state of the edits made in February changed. The dashboard uses the 
most recently available snapshot and has no memory about the values of the 
metric based on previous snapshots. If we were to store a value in a 
spreadsheet or a report and then 1+ snapshots later compare the dashboard to 
the spreadsheet/report, there will be a discrepancy.
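  
  One way to make this drift visible is to keep every measurement keyed by the snapshot it came from, instead of only the latest value. The sketch below uses the two numbers from the example above; the `observations` layout and the `drift` helper are illustrative assumptions, not how the dashboard works.

```python
# (snapshot used, month measured) -> metric value in percent.
# Values are the two from the example above.
observations = {
    ("2024-02", "2024-02"): 4.7,
    ("2024-03", "2024-02"): 4.4,
}


def drift(observations, month, old_snapshot, new_snapshot):
    """Change in the metric for one month between two snapshots (in points)."""
    old = observations[(old_snapshot, month)]
    new = observations[(new_snapshot, month)]
    return round(new - old, 2)


february_drift = drift(observations, "2024-02", "2024-02", "2024-03")
print(february_drift)
```

  A report that stored only the 4.7 would disagree with the dashboard one snapshot later, while a store keyed like this keeps both values and the difference between them.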
  
  There's no getting around it – it's natural and folks who work with or look 
at these metrics need to become comfortable with that concept. There are some 
things we can do to improve stability (decrease snapshot-to-snapshot 
variability) of the metric, but it won't address the problem entirely. Like, we 
could (and should) impose "not reverted within first 48 hours" as opposed to 
currently "not reverted at the time of the snapshot" but deletion of edits and 
also whether a user is considered a real editor or a bot, well, those are going 
to change snapshot-to-snapshot and dealing with those would be extremely 
painful.
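  
  The difference between the two revert definitions can be sketched as follows. The field names `made_at` and `reverted_at` are hypothetical, and the fixed-window version only gives a final answer for edits that are already at least 48 hours old when the snapshot is taken.

```python
from datetime import datetime, timedelta


def unreverted_at_snapshot(edit, snapshot_time):
    """Current definition: not reverted as of the snapshot."""
    reverted_at = edit["reverted_at"]
    return reverted_at is None or reverted_at > snapshot_time


def unreverted_within_window(edit, window=timedelta(hours=48)):
    """Proposed definition: not reverted within a fixed window after the edit."""
    reverted_at = edit["reverted_at"]
    return reverted_at is None or reverted_at - edit["made_at"] > window


# An edit that was reverted about five weeks after it was made
edit = {
    "made_at": datetime(2024, 2, 10, 12, 0),
    "reverted_at": datetime(2024, 3, 20, 9, 0),
}

march_snapshot = datetime(2024, 3, 1)
april_snapshot = datetime(2024, 4, 1)

print(unreverted_at_snapshot(edit, march_snapshot))  # counted
print(unreverted_at_snapshot(edit, april_snapshot))  # no longer counted
print(unreverted_within_window(edit))                # same answer for both snapshots
```

  The late revert flips the snapshot-based result between the two snapshots, while the windowed result stays the same. Deletions and bot flags would still change from snapshot to snapshot, as noted above.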
  
  I won't evaluate the listed metrics but I will recommend asking yourselves 
the following for each metric:
  
  - Can we backfill this? Can we re-compute the history of this metric given a 
snapshot?
  - Are we comfortable re-computing the entire history of this metric with each 
new snapshot?
  - Will we be reporting this metric anywhere else and would it be a problem if 
what we reported in the past and what we report in the future differ?
  - Are we comfortable calculating the value of the metric only once and 
storing that somewhere that we call "source of truth" for measurements of this 
metric going forward?
    - For example, you calculate the value of metric A for April 2024 (using 
the March 2024 snapshot) and hold on to that value because once the March 2024 
snapshot is deleted, any re-calculation of metric A for April 2024 using a 
later snapshot will result in a different value.



[Wikidata-bugs] [Maniphest] T362849: [Analytics] Segments of Wikidata's data over time

2024-04-18 Thread Manuel
Manuel created this task.
Manuel added projects: Wikidata, Wikidata Analytics (Kanban).
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  Purpose
  -------
  
  As Wikidata Product Managers, we would like to understand how different 
segments of Wikidata's data developed over time, so we can inform our 
projections.
  
  Scope
  -----
  
  - How did the number of Items of the following types develop over time?
    - A) Items that contain a sitelink to one of the Wikimedia projects (e.g. 
about a notable person)
    - B) Items that are connected to A (e.g. about the non-notable father of 
that person)
    - C) All other Items
  
  Desired output
  --------------
  
  - Monthly stats of the number of Items in category A, B and C
  
  Acceptance criteria
  -------------------
  
  [ ] Current numbers for A, B and C to verify the approach
  [ ] Historic monthly data for A, B, and C
  [ ] Monthly data process to capture new data
  [ ] Public output of data in the form of a table (and ideally diagram)
  
  ---
  
  **Information below this point is filled out by the Wikidata Analytics team.**
  
  General Planning
  ----------------
  
  Information is filled out by the analytics product manager.
  
  Assignee Planning
  -----------------
  
  Information is filled out by the assignee of this task.
  
  Estimation
  ----------
  
  Estimate: 
  Actual:
  
  Sub Tasks
  ---------
  
  Full breakdown of the steps to complete this task:
  
  [ ] subtask
  
  Data to be used
  ---------------
  
  See Analytics/Data_Lake for the breakdown of the data lake databases and 
tables.
  
  The following tables will be referenced in this task:
  
  - link_to_table
  
  Notes and Questions
  -------------------
  
  Things that came up during the completion of this task, questions to be 
answered and follow up tasks:
  
  - Note
