Esc3300 added a comment.
Shouldn't users opt-in to this?
TASK DETAIL
https://phabricator.wikimedia.org/T143819
EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: Esc3300
Cc: Esc3300, JAllemandou, mpopov, mforns, PokestarFan, Nuria, Lydia_Pintscher, mkroetzsch, leila,
Smalyshev added a comment.
@Esc3300 Which users? WDQS does not track users, only queries. The log does contain query IP but the data processing will remove it, as well as any other PII.
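A minimal sketch of the kind of sanitization described here, combined with the hourly timestamp rounding and session ID removal raised later in the thread. The field names (`ip`, `session_id`, `timestamp`, `query`) are hypothetical; the actual WDQS log schema and processing job are not given in this thread.

```python
from datetime import datetime

# Hypothetical field names: the real WDQS log schema is not shown in this thread.
PII_FIELDS = {"ip", "session_id"}


def sanitize(record: dict) -> dict:
    """Return a copy of a log record without PII fields and with the
    timestamp truncated to the start of its hour."""
    clean = {key: value for key, value in record.items() if key not in PII_FIELDS}
    ts = datetime.fromisoformat(clean["timestamp"])
    clean["timestamp"] = ts.replace(minute=0, second=0, microsecond=0).isoformat()
    return clean
```

Truncating to the hour rather than dropping the timestamp keeps hourly query statistics possible while making it harder to link a query back to a specific request.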
Smalyshev added a comment.
Thinking about it, I don't think we would ever need more than hourly resolution for anything related to queries (we can get hit stats from the usual stats places, I assume). I also thought about dataset #1 as more short-lived. But I am not that insistent on session ID thin
Nuria added a comment.
@Smalyshev We like to default to public if possible, the more eyes on the data the more useful it can be.
mforns added a comment.
@Nuria @Smalyshev
So probably if we round timestamp and remove sessionId your proposal for dataset #1 is safe to keep long term (cc @mforns for anything I might be missing)
I think it depends highly on how drastically we sanitize the potentially identifying fields (user a
Smalyshev added a comment.
I made a more formal full description of which data I'd like to be in the public dataset, so people don't have to read through all the comments here: https://www.wikidata.org/wiki/User:Smalyshev_(WMF)/Publishing_query_data
Please review and comment if you see anything mi
JAllemandou added a comment.
@Nuria, @Smalyshev: Given that all wikidata-query tagged rows belong in misc, which is super small, I have no objection to running jobs either hourly or daily.
Smalyshev added a comment.
We have the logs, but they are not publicly accessible. See https://meta.wikimedia.org/wiki/Discovery/Data_access_guidelines#Request_logs for access guidelines.
I9606 added a comment.
Hi folks. It sounds like there is a reasonably clear pattern for access. I have a student who could execute this project starting Sept. 19 if the barriers were cleared. Anything I can provide to move this along? Thanks!
debt added a comment.
Hi @I9606 - we have an NDA process that your student would need to go through before we can go too much further with this being done in a volunteer capacity.
The link is here for the main page: https://meta.wikimedia.org/wiki/Non-disclosure_agreements and your student would ne
I9606 added a comment.
OK. Do we just sign and mail that in or is there a specific contact person we should be in touch with?
leila added a comment.
@I9606 I imagine that what you are interested in will be one of the early outputs of the research documented at https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries . If that is the case, we should wait for the result of that research to gradually start com
I9606 added a comment.
Assuming that we can gain access to the output of that work and that it allows us to explore subject-matter specific aspects of the data, then yes, it sounds like it would be a great foundation for what we want to do.
I notice that this project started in May this year and t
leila added a comment.
@I9606 that specific project proposal was initiated in May 2016. The access to data was granted only in September 2016. Timelines will be updated once we know more. :)
AndrewSu added a comment.
Would it be possible for our team to get access to these log files so that we can perform our analyses that are related to, but distinct from, the ones that @mkroetzsch is doing? We are happy to coordinate with Markus so that there is no duplication of effort. But, I sus
mkroetzsch added a comment.
@AndrewSu As I just replied to Benjamin Good in this matter, it is a bit too early for this, since we only have the basic technical access as of very recently. We have not had a chance to extract any community shareable data sets yet, and it is clear that it will require
AndrewSu added a comment.
@mkroetzsch Thank you for the info. We look forward to coordinating more when/if you see fit in the future.
Since our project is not dependent on Markus' work, and since I don't believe that our work will negatively impact Markus' project, I propose we treat our request
leila added a comment.
@AndrewSu please read https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations to learn about how we start formal collaborations (which is a prerequisite for accessing the data). If you are interested, please attach a proposal to this Phabricator task, ping me
Lydia_Pintscher added a comment.
From my side the team around @I9606 and @AndrewSu has useful things to contribute on this topic and it'd be great if their request can be granted.
leila added a comment.
@AndrewSu Lydia and I had some off-list discussions and we thought it's a good idea that I leave a bit more information for you here:
Please don't spend days on the proposal if you decide to submit it. This is supposed to be a 1-2 page proposal that will help us understand
AndrewSu added a comment.
Thank you @leila for the guidance on the process and next steps -- very helpful! @I9606 and I will touch base to see how we want to proceed/prioritize from our end...
Smalyshev added a comment.
Hmm not sure how to implement this yet, as we do not track which items were in query results (might be possible from GUI, though expensive, and probably not possible from API) but may be possible to analyze e.g. property usage in queries. Anybody in Analytics interested i
AndrewSu added a comment.
Just want to add a note that if someone on the WMF side was interested in building the infrastructure to compute these usage metrics, the "Gene Wiki" team would be very willing collaborators in evaluating and refining the metrics. We have been working hard loading biomedi
Nuria added a comment.
If @Smalyshev thinks this would be a good idea and can develop the instrumentation for the metrics and own the metric definition (together with "gene wiki") we can help on the project as needed, seems to me that things like these could be computed with the infrastructure we h
Nuria added a comment.
As far as I understand, you need to publish not only queries to the service but also query results (is this correct, @Smalyshev?); analyzing those will produce the metric counts @AndrewSu and @leila are interested in. This requires a schema definition of what a query result is (i
Smalyshev added a comment.
It may be hard to capture query results, given that we don't have any mechanism of tracking them now. We do have logs for queries themselves, so that's what I would start with...
@AndrewSu if you have any suggestions about the metrics that would be very helpful. Please a
AndrewSu added a comment.
My initial thought is that there will be two types of metrics. First, we want to look at statement-level metrics. For all the statements that our team has loaded into Wikidata, we have been referencing specific resources that assert that statement. For example, see the
Smalyshev added a comment.
Those statements might be part of the output of the SPARQL query, or they might simply be structural intermediates.
We don't currently have tools to capture statistics about the output of the query, let alone intermediaries. We could, however (with some work) capture usa
Nuria added a comment.
@Smalyshev @AndrewSu please take a look at other metric definitions we have. Once you decide on a metric definition, please be so kind as to document it in beta: https://meta.wikimedia.org/wiki/Research:Standard_metrics#Newly_registered_user
This helps a lot to quantify what
AndrewSu added a comment.
We could, however (with some work) capture usage of certain property, or item, or property-item combination, in the original query. Would that be useful?
Property usage: I think there is some small-ish subset of properties that are very closely tied to a single data pr
Nuria added a comment.
To incentivize them to contribute, we have to give them even better metrics of community usage/impact that they can give to funders
Understood, as I said we are willing to help in any way we can, seems like a great objective. My main point is that if we come up with a metric
AndrewSu added a comment.
In T143819#3350566, @Nuria wrote:
To incentivize them to contribute, we have to give them even better metrics of community usage/impact that they can give to funders
Understood, as I said we are willing to help in any way we can, seems like a great objective. My main poi