JAllemandou added a comment.
No objection :) I'd have gone for option 1 as it seems the easiest to
maintain, but I agree, it means installing some stuff on the blazegraph
machines.
TASK DETAIL
https://phabricator.wikimedia.org/T349069
EMAIL PREFERENCES
https://phabricator.wikimedia.org
JAllemandou added a comment.
I would suggest using the `hdfs-rsync` tool to do this - it requires some
setting up with puppet, but it is helpful, as it copies only new files from
folders (see
https://github.com/wikimedia/operations-puppet/blob/1c4d67ff19372832484f7551dc49836be5806024
JAllemandou added a comment.
> However, my assumption is that when only filtering for agent_type !=
'spider' the population will still include a lot of non-UI hits.
The `agent_type` field can currently take 3 values: `spider`, `automated` and
`user`. The `spider` one is used when u
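For illustration, a minimal plain-Python sketch (with made-up rows, not the real event schema) of how filtering `!= 'spider'` still keeps `automated` hits, while `= 'user'` restricts to UI-like traffic:

```python
# Hypothetical rows standing in for events carrying the agent_type field.
rows = [
    {"id": 1, "agent_type": "user"},
    {"id": 2, "agent_type": "spider"},
    {"id": 3, "agent_type": "automated"},
]

# Excluding spiders still keeps 'automated' hits...
non_spider = [r for r in rows if r["agent_type"] != "spider"]
# ...whereas keeping only 'user' restricts to UI-like traffic.
ui_only = [r for r in rows if r["agent_type"] == "user"]

print(len(non_spider), len(ui_only))  # 2 1
```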
JAllemandou added a comment.
In T342416#9101868 <https://phabricator.wikimedia.org/T342416#9101868>,
@EBernhardson wrote:
> These are both generated by spark. The rdf is being imported by a scala
application while the cirrus dump is imported by pyspark, but they should both
JAllemandou added a comment.
In T342416#9091146 <https://phabricator.wikimedia.org/T342416#9091146>,
@EBernhardson wrote:
> I looked into these, the attached patch should fix it but it leaves an open
question (@JAllemandou):
>
> The `core-site.xml`, along with pupp
JAllemandou added a comment.
We met this morning with @AndrewTavis_WMDE and @Manuel - Thank you folks for
the great meeting.
The detailed Meeting notes are here:
https://docs.google.com/document/d/1REsolXnZf2KqApL0p-DE8X4eWXI_zxHgrCe3k1hcZnw
From the job list in previous comment
JAllemandou added a comment.
Hi @AndrewTavis_WMDE,
I've done some investigation, and here is what I have: Goran has 11 CRON jobs
running from various hosts on our system (1 on `stat1004`, 2 on `stat1007`, 7 on
`stat1008`).
- `WDCM_Sqoop_Clients` runs on `stat1004` weekly - It doesn't
JAllemandou added a comment.
In T334951#8952583 <https://phabricator.wikimedia.org/T334951#8952583>,
@AndrewTavis_WMDE wrote:
> - If the answer to the above question of permanently losing some data
that's being produced by Concepts Monitor and other WMDE jobs is no, then
JAllemandou added a comment.
In T334951#8946790 <https://phabricator.wikimedia.org/T334951#8946790>,
@AndrewTavis_WMDE wrote:
> I'll async with him now and see if we can come to a decision sooner than
that, but you all will have the answer by Wednesday at the latest
JAllemandou added a comment.
Hi Folks - What is the status on this one?
I'd like Data-Engineering to announce the deprecation of Spark2 at the end of
this month, but not without knowing how we plan on tackling your job :)
Here are the 2 possible solutions I can think of:
- Stopping
JAllemandou added a comment.
In T303831#8237323 <https://phabricator.wikimedia.org/T303831#8237323>,
@EBernhardson wrote:
> data cleanup looks to now have run successfully
Thanks a lot @EBernhardson for finalizing on this :)
JAllemandou added a comment.
In T303831#8175252 <https://phabricator.wikimedia.org/T303831#8175252>,
@EBernhardson wrote:
> @JAllemandou The one remaining piece of this ticket is cleaning up the
historical data, per T303831#8081172
<https://phabricator.wikimedia.org/T30
JAllemandou added a comment.
Thanks a lot @EBernhardson for the help on finishing this!
TASK DETAIL
https://phabricator.wikimedia.org/T303831
JAllemandou closed subtask T299059: Write an Airflow job converting commons
structured data dump to Hive as Resolved.
TASK DETAIL
https://phabricator.wikimedia.org/T258834
JAllemandou closed this task as "Resolved".
TASK DETAIL
https://phabricator.wikimedia.org/T299059
JAllemandou closed this task as "Resolved".
TASK DETAIL
https://phabricator.wikimedia.org/T258834
JAllemandou closed subtask T258834: Create a Commons equivalent of the
wikidata_entity table in the Data Lake as Resolved.
TASK DETAIL
https://phabricator.wikimedia.org/T252443
JAllemandou added a comment.
Thank you for letting us know :)
TASK DETAIL
https://phabricator.wikimedia.org/T300240
JAllemandou created this task.
JAllemandou added projects: Product-Analytics, Structured-Data-Backlog,
Wikidata-Query-Service, Wikidata, Data-Engineering, Discovery-Search (Current
work), Patch-For-Review, Data-Engineering-Kanban.
Restricted Application removed a project: Patch-For-Review.
JAllemandou added a comment.
Code is ready:
- Import `commons-mediainfo` json dumps to HDFS
(https://gerrit.wikimedia.org/r/738874)
- Update spark transformation job to work with both wikidata and commons
dumps (https://gerrit.wikimedia.org/r/739129)
- Update `wikidata_entity` table
JAllemandou added a project: Data-Engineering-Kanban.
TASK DETAIL
https://phabricator.wikimedia.org/T258834
JAllemandou updated the task description.
TASK DETAIL
https://phabricator.wikimedia.org/T291205
JAllemandou created this task.
JAllemandou added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
It is interesting to understand how properties are used by different content
subgraphs (for instance humans, scholarly articles etc
JAllemandou added a comment.
Why not add other prefixes, if it's as simple as adding the prefix to the
AQS list - I think there'll be more gotchas.
Let's try @AKhatun_WMF :)
TASK DETAIL
https://phabricator.wikimedia.org/T285465
JAllemandou closed this task as "Resolved".
JAllemandou added a comment.
The analysis is documented here:
https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Basic_Analysis.
Thanks @AKhatun_WMF :)
TASK DETAIL
https://phabricator.wikimedia.org/T282139
JAllemandou added subscribers: MPhamWMF, Gehel.
JAllemandou added a comment.
Thanks @AKhatun_WMF for the analysis.
@dcausse , @Gehel and @MPhamWMF - Do you think it's worth trying to make our
parser able to process queries with the `mwapi` prefix (it represents 10%
of all requests
JAllemandou added a subtask: T285465: Document and analyze the number of
parsing errors for parsed WDQS queries.
TASK DETAIL
https://phabricator.wikimedia.org/T280640
JAllemandou added a parent task: T280640: Refine WDQS queries analysis.
TASK DETAIL
https://phabricator.wikimedia.org/T285465
JAllemandou created this task.
JAllemandou added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
We wish, for the month of June 2021:
- Report the number of parsing errors when generating parsed queries
information
- Provide
JAllemandou added a comment.
The problem I see with using a generic class in the `QueryElem` object is the
conversion to parquet. I don't think it'll work out of the box, meaning we'd
have to devise our own conversion. Let's brainstorm ideas on this, possibly in
a meeting to make it faster
JAllemandou created this task.
JAllemandou added projects: Wikidata-Query-Service, Wikidata, Patch-For-Review,
Discovery-Search (Current work).
Restricted Application removed a project: Patch-For-Review.
TASK DESCRIPTION
This task is related to T273854 <https://phabricator.wikimedia.
JAllemandou updated the task description.
TASK DETAIL
https://phabricator.wikimedia.org/T273854
JAllemandou added a parent task: T280640: Refine WDQS queries analysis.
TASK DETAIL
https://phabricator.wikimedia.org/T273854
JAllemandou removed JAllemandou as the assignee of this task.
JAllemandou added a subscriber: dcausse.
JAllemandou updated the task description.
TASK DETAIL
https://phabricator.wikimedia.org/T273854
JAllemandou added a subtask: T273854: Automate regular WDQS query parsing and
data-extraction.
TASK DETAIL
https://phabricator.wikimedia.org/T280640
JAllemandou created this task.
JAllemandou added projects: Wikidata-Query-Service, Wikidata, Patch-For-Review,
Discovery-Search (Current work).
Restricted Application removed a project: Patch-For-Review.
TASK DESCRIPTION
Augment query-analysis QueryInfo with a list of
operators+nodes+paths
JAllemandou created this task.
JAllemandou added projects: Wikidata-Query-Service, Wikidata, Patch-For-Review,
Discovery-Search (Current work).
Restricted Application removed a project: Patch-For-Review.
TASK DESCRIPTION
The job should process data hourly.
Expected parameters to be passed
JAllemandou closed subtask T282129: Test triple-analysis functions over a large
dataset with Spark as Resolved.
TASK DETAIL
https://phabricator.wikimedia.org/T280640
JAllemandou closed this task as "Resolved".
TASK DETAIL
https://phabricator.wikimedia.org/T282129
JAllemandou added a comment.
Closing this task :) Thanks for the great work @AKhatun_WMF
TASK DETAIL
https://phabricator.wikimedia.org/T282129
JAllemandou closed this task as "Resolved".
JAllemandou added a comment.
Great! Thanks for that :) Closing the ticket.
TASK DETAIL
https://phabricator.wikimedia.org/T282130
JAllemandou closed subtask T282130: Provide a way to save extracted
query-information in parquet format as Resolved.
TASK DETAIL
https://phabricator.wikimedia.org/T280640
JAllemandou added a comment.
@AKhatun_WMF That's great! Could you please provide some info on the expected
data size in parquet (for daily data, for instance)? Many thanks.
TASK DETAIL
https://phabricator.wikimedia.org/T282130
JAllemandou created this task.
JAllemandou added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
As a way to get familiar with the data, please provide quantitative
information over the dataset using spark in a notebook (probably using
JAllemandou added a subtask: T282130: Provide a way to save extracted
query-information in parquet format.
TASK DETAIL
https://phabricator.wikimedia.org/T280640
JAllemandou added a parent task: T280640: Refine WDQS queries analysis.
TASK DETAIL
https://phabricator.wikimedia.org/T282130
JAllemandou created this task.
JAllemandou added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
Being able to save the information in Parquet will be very useful, as it
allows us to automatically process the queries as they flow in (hourly
JAllemandou added a subtask: T282129: Test triple-analysis functions over a
large dataset with Spark.
TASK DETAIL
https://phabricator.wikimedia.org/T280640
JAllemandou added a parent task: T280640: Refine WDQS queries analysis.
TASK DETAIL
https://phabricator.wikimedia.org/T282129
JAllemandou created this task.
JAllemandou added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
Once ready locally with unit-tests, apply the triple-analysis method to
bigger data in spark (a day).
JAllemandou added a subtask: T282127: Add unit-tests to WDQS analysis toolkit.
TASK DETAIL
https://phabricator.wikimedia.org/T280640
JAllemandou created this task.
JAllemandou added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
Extract a set of queries to be used as unit-tests (10 queries) from the
events.
This should facilitate making sure the code is doing what
JAllemandou added a parent task: T280640: Refine WDQS queries analysis.
TASK DETAIL
https://phabricator.wikimedia.org/T282127
JAllemandou created this task.
JAllemandou added projects: Wikidata, Dumps-Generation, Analytics.
Restricted Application added a project: wdwb-tech.
TASK DESCRIPTION
Analytics loads the wikidata all-json dumps weekly onto the hadoop cluster, and
we have received an alert for dumps not being available
JAllemandou created this task.
JAllemandou added a project: Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
Restricted Application added a project: Wikidata.
TASK DESCRIPTION
The current analysis parses queries and extracts:
- Operators (list, and map
JAllemandou added a subscriber: dcausse.
JAllemandou added a comment.
Info: There is already a job in the cluster doing the `TTL -> RDF` conversion.
The TTL dumps are imported weekly, and converted to blazegraph RDF once
available.
The job is maintained by the Search Platform team (p
JAllemandou claimed this task.
TASK DETAIL
https://phabricator.wikimedia.org/T273854
JAllemandou created this task.
JAllemandou added projects: Analytics, Wikidata-Query-Service.
Restricted Application added a subscriber: Aklapper.
Restricted Application added a project: Wikidata.
TASK DESCRIPTION
This task is about running regular query-parsing jobs for WDQS and storing
JAllemandou added a comment.
Ah! I realize I have not updated that task. The analysis can be found here:
https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Queries_Analysis
@CBogen : I'll let you handle the definition of done, and whether this task
should be closed or not :)
JAllemandou added a comment.
Planned deadline was the end of last month. I've run into various issues
preventing me from achieving it. I have started the actual work today (I'd
given it thought but hadn't coded) and wish to present results before the end
of the month.
JAllemandou added a comment.
Some more info on this aspect: I did a quick analysis over September queries
today and found that my assumption that long queries were made by users from
the UI was wrong.
First, total numbers of requests and sum of query-time split by queries taking
more
JAllemandou added a comment.
I continued my analysis today, looking at the top-100 parsed user-agents from
both the queries-with-referer subset and the queries-without-referer subset,
over the month of September.
See https://phabricator.wikimedia.org/P12933
- The queries-with-referer have
JAllemandou added a comment.
Heya - I'm sorry I completely missed the ping :S
Quick analysis:
spark.sql("""
    SELECT
      (http.request_headers['referer'] IS NOT NULL) AS defined_referer,
      count(1) AS c
    FROM event.wdqs_external_sparql_query
    WHERE year = 2020 AND month = 9
    GROUP BY (http.request_headers['referer'] IS NOT NULL)
""").show()
JAllemandou added a comment.
In terms of logging size, it probably depends on the result type: in the case
of descriptions or other text-heavy fields, this could get bigger if a high
`LIMIT`, or none at all, is set on the number of returned rows. We should set a
limit :)
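As a sketch of that suggestion, here is a hypothetical helper (the name `ensure_limit` and the naive token check are my own, not from any WMF codebase) that appends a `LIMIT` when a query has none:

```python
def ensure_limit(sparql: str, max_rows: int = 1000) -> str:
    """Append a LIMIT clause when the query has none.

    Naive sketch: only looks for a bare LIMIT token, so it can be fooled
    by e.g. LIMIT appearing inside a string literal.
    """
    if "limit" in sparql.lower().split():
        return sparql
    return f"{sparql.rstrip()} LIMIT {max_rows}"

q = "SELECT ?item ?desc WHERE { ?item schema:description ?desc }"
print(ensure_limit(q))  # original query with " LIMIT 1000" appended
```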
JAllemandou added a comment.
It will make it a lot easier to analyze than having to build the 'in-flight'
view of queries!
TASK DETAIL
https://phabricator.wikimedia.org/T261937
JAllemandou added a comment.
@GoranSMilovanovic I have indeed done some analysis using the Apache Jena
parser to extract the algebraic representation of queries. Not yet at the level
of completion I'd like, though. I'll be on holidays until August 15th starting
tonight - let's discuss when I come back
JAllemandou added a comment.
@GoranSMilovanovic I finally published a wiki page with most of the results I
found: https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Traffic_Analysis
Sorry for the delay ...
TASK DETAIL
https://phabricator.wikimedia.org/T248308
JAllemandou added a comment.
SELECT
  http.request_headers['user-agent'],
  user_agent_map,
  count(1) as c
FROM event.wdqs_external_sparql_query
WHERE year = 2020 and month = 5 and day = 1
GROUP BY
  http.request_headers['user-agent'],
  user_agent_map
JAllemandou added a comment.
> First step: analyze the frequency distribution of the user_agent field
(string) from wmf.webrequest where queries are SPARQL.
I suggest you use events instead of webrequest:
`event.wdqs_internal_sparql_query` and `event.wdqs_external_sparql_query`.
JAllemandou closed this task as "Resolved".
JAllemandou updated the task description.
TASK DETAIL
https://phabricator.wikimedia.org/T249319
JAllemandou added a comment.
An idea: how about sending the update stream back to kafka and making THAT
one's retention higher?
Moving retention to 30 days for revision-create would keep a lot of data that
wouldn't be necessary (about half of the data), while keeping only the
updates
JAllemandou added a comment.
Patch needs to be deployed before the dashboard shows data.
TASK DETAIL
https://phabricator.wikimedia.org/T236895
JAllemandou added a comment.
Events using `isBlank` since the beginning of the year are now stored here:
`/user/joal/wdqs_queries/2020_use_isBlank/wdqs_use_is_blank_202002.json`.
There are ~56k events stored in json format in a single file to facilitate
analysis.
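Since each event is one JSON object per line, loading the file is straightforward (assuming a local copy, e.g. after `hdfs dfs -get`; the inline sample below is made up):

```python
import io
import json

# Stand-in for the real file: one JSON event per line (made-up content).
sample = io.StringIO('{"id": 1, "query": "ASK {}"}\n{"id": 2, "query": "SELECT ?x"}\n')
events = [json.loads(line) for line in sample if line.strip()]
print(len(events))  # 2
```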
JAllemandou added a comment.
As I was working on getting a better idea of the queries, I got some results
relatively easily:
Since beginning of year:
- Internal cluster: No request using `isBlank()`, 481202298 requests total
- External cluster: 54669 requests using `isBlank
JAllemandou renamed this task from "Copy Wikidata dumps to HDFS" to "Copy
Wikidata dumps to HDFS + parquet".
TASK DETAIL
https://phabricator.wikimedia.org/T209655
JAllemandou added a subtask: T243832: Fix hdfs-rsync `prune-empty-dirs` feature.
TASK DETAIL
https://phabricator.wikimedia.org/T209655
JAllemandou claimed this task.
JAllemandou added a project: Analytics-Kanban.
JAllemandou set the point value for this task to "5".
TASK DETAIL
https://phabricator.wikimedia.org/T209655
JAllemandou added a project: Analytics-Kanban.
TASK DETAIL
https://phabricator.wikimedia.org/T236895
JAllemandou added a comment.
The patch merged by @Nuria had a bug. I commented on the already-merged patch
with a solution. For the moment the job is not started.
TASK DETAIL
https://phabricator.wikimedia.org/T236895
JAllemandou added a comment.
Chiming in: I suggest using Spark for investigations - given the size of the
dataset, parallel computation should help. This means another hop for the data:
--> stat1004 --> HDFS. Please ping me if you want/need help :)
JAllemandou added a subscriber: Groceryheist.
JAllemandou added a comment.
New dataset available @GoranSMilovanovic. Pinging @Groceryheist as I also
generated the items per page.
hdfs dfs -ls /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet | tail -1
drwxr-xr-x - analytics
JAllemandou added a project: Analytics-Kanban.
TASK DETAIL
https://phabricator.wikimedia.org/T239471
JAllemandou added a comment.
Does this being closed mean we can access data on kafka?
TASK DETAIL
https://phabricator.wikimedia.org/T101013
JAllemandou added a comment.
I think this problem could be related to T226730 (which prevents most
`Special:XXX` pages from being flagged as pageviews).
TASK DETAIL
https://phabricator.wikimedia.org/T236895
JAllemandou added a comment.
This is done @GoranSMilovanovic.
Raw data is here
`/user/joal/wmf/data/raw/mediawiki/wikidata/all_jsondumps/20190902` and parquet
data is here `/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190902`
TASK DETAIL
https://phabricator.wikimedia.org/T209655
JAllemandou added a comment.
@GoranSMilovanovic : You're welcome :) At some point I'll manage to have that
productionized ;)
TASK DETAIL
https://phabricator.wikimedia.org/T209655
JAllemandou added a comment.
A lot trickier :)
We have the `wmf_raw.mediawiki_private_cu_changes` table in hive, allowing us
to compute geo-editors (editors-by-country, aggregated). This table only
contains 3 months of data for PII-removal reasons. It's probably not enough for
what you're
JAllemandou added a comment.
Hi @Lea_WMDE and @GoranSMilovanovic - I think the answer to your problem is in
this month's snapshot, with the `revision_tags` field of mediawiki_history:
spark.sql("""
SELECT
substr(event_timestamp, 0, 4) as year,
JAllemandou added a comment.
The analytics hadoop cluster could also be of use here: the task can easily
take advantage of parallelization.
TASK DETAIL
https://phabricator.wikimedia.org/T94019
JAllemandou added a comment.
Community has spoken, we'll find workarounds - thanks a lot @ArielGlenn for
helping drive this :)
TASK DETAIL
https://phabricator.wikimedia.org/T216160
JAllemandou added a comment.
Some queries are computed using hadoop for wikidata (see
https://github.com/wikimedia/analytics-refinery/tree/master/oozie/wikidata). If
SQL over recent-changes works for you, that's great :)
TASK DETAIL
https://phabricator.wikimedia.org/T218901
JAllemandou added a comment.
Reading about this - Would delayed data be interesting? This information is
accessible in hadoop :)
TASK DETAIL
https://phabricator.wikimedia.org/T218901
JAllemandou added a comment.
Most of the complicated things already exist for this to work (an equivalent of
rsync for HDFS, a spark job converting wikidata json dumps to parquet).
I wanted T216160 <https://phabricator.wikimedia.org/T216160> to be
settled before
JAllemandou added a comment.
Hey @GoranSMilovanovic - I don't have a good understanding of what you're
after, but having read about the pairs and contingency table above, maybe this
Spark function could be helpful:
https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql
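The link above is truncated, so which function is meant is my assumption (possibly `DataFrameStatFunctions.crosstab`); in plain Python the underlying idea, a contingency table as per-pair counts, looks like:

```python
from collections import Counter

# Made-up (property, value) pairs; a contingency table is a count per pair.
pairs = [("P31", "Q5"), ("P31", "Q5"), ("P31", "Q13442814"), ("P279", "Q5")]
table = Counter(pairs)
print(table[("P31", "Q5")])  # 2
```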
JAllemandou added a comment.
In T216160#5020236 <https://phabricator.wikimedia.org/T216160#5020236>,
@ArielGlenn wrote:
> By Friday I'll have done that; by next Wednesday let's make a decision,
barring any huge obstacles.
Awesome, thanks @ArielGlenn :)
JAllemandou added a comment.
Exact analysis run on 2018-12-06:
val df =
spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001")
val base_rdd = df.select("labels", "descriptions", "aliases").rdd
val strings
JAllemandou added a comment.
Hi @Isaac
Sorry for the issue. I correcte the query above (last query, join criteria:
`AND ws.sitelink.title = title_namespace_localized` --> `AND
REPLACE(ws.sitelink.title, ' ', '_') = title_namespace_localized`
We were not joining correctly on ti
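A tiny sketch (with made-up titles) of why the original join never matched, and what the `REPLACE` normalization fixes:

```python
# Sitelink titles use spaces while page titles use underscores
# (example values, not real data).
sitelink_title = "Douglas Adams"
page_title = "Douglas_Adams"

assert sitelink_title != page_title                    # naive equality join misses
assert sitelink_title.replace(" ", "_") == page_title  # normalized join matches
```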
JAllemandou added a comment.
We're on the same page @diego :)
I can precompute the table described in ii) if needed, and will surely do it
once we have the wikidata-dump productionized - let me know if you need it
before
TASK DETAIL
https://phabricator.wikimedia.org/T215616
JAllemandou added a comment.
I can't speak about failures and restarts as I don't know much about the dumps-generation process; @ArielGlenn would be the person who knows best.
As for the dates, the main reason we ask for the change is date consistency by month, mimicking what exists for the xml
JAllemandou added a comment.
Thanks @Isaac for reformulating the question I tried to explain above :)
> @diego: Can you confirm there is value for you in having revisions tied to wikidata-items regardless of when the link happened?
TASK DETAIL
https://phabricator.wikimedia.org/T215616