Re: [Wikidata] dcatap namespace in WDQS
Hi!

> As part of our Wikidata Query Service setup, we maintain the namespace
> serving DCAT-AP (DCAT Application Profile) data[1]. (If you don't know
> what I'm talking about you can safely ignore the rest of the message).

Following up on this discussion and the feedback received, I have decided to move the dcatap namespace to a separate endpoint - https://dcatap.wmflabs.org/. I've updated the manual to reflect this[1]. The old setup is still working, but we will be disabling updates, and eventually disable the namespace itself. So while it can still be used for now, if you plan to use it (logs suggest there's virtually no usage now, but that can change of course), please use the endpoint above.

[1] https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#DCAT-AP

-- Stas Malyshev smalys...@wikimedia.org

___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
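For anyone updating their tools, a minimal sketch of how a client might query the new endpoint. This assumes it exposes a standard SPARQL HTTP interface accepting a GET `query` parameter; the `/sparql` path and the `format=json` parameter are assumptions, not confirmed by the announcement - check the linked manual for the exact interface.

```python
# Minimal sketch of querying a SPARQL endpoint over HTTP.
# ASSUMPTION: the endpoint accepts GET requests with a "query"
# parameter and returns SPARQL JSON results; the "/sparql" path
# below is a guess, not confirmed by the announcement.
from urllib.parse import urlencode

DCATAP_ENDPOINT = "https://dcatap.wmflabs.org/sparql"  # assumed path

def build_sparql_request(endpoint, query):
    """Return the full GET URL for a SPARQL query with JSON results."""
    params = urlencode({"query": query, "format": "json"})
    return f"{endpoint}?{params}"

url = build_sparql_request(
    DCATAP_ENDPOINT,
    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5",
)
# To actually run it (network access required), something like:
#   import urllib.request
#   req = urllib.request.Request(
#       url, headers={"User-Agent": "my-tool/1.0 (me@example.org)"})
#   with urllib.request.urlopen(req) as r:
#       print(r.read())
```

Note the User-Agent header in the commented-out request - per the policy discussed elsewhere in this thread, clients should always identify themselves.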
[Wikidata] dcatap namespace in WDQS
Hi! As part of our Wikidata Query Service setup, we maintain the namespace serving DCAT-AP (DCAT Application Profile) data[1]. (If you don't know what I'm talking about you can safely ignore the rest of the message.)

A recent check showed that this namespace is virtually unused - over the last two months, only 3 queries per month were served from that namespace, and all of them came from WMF servers (not sure whether it's a tool or somebody querying manually; I did not dig further). So I wonder if it makes sense to continue maintaining this namespace. While it does not require very significant effort - it's mostly automated - it does need occasional attention when maintenance is performed, and some scripts and configurations become slightly more complex because of it. No big deal if somebody is using it, that's what the service is for, but if it is completely unused, there is no point in spending even minimal effort on it, at least on the main production servers (of course, it'd be possible to set up a simple SPARQL server in labs with the same data).

In any case, the dcatap RDF data will remain available at https://dumps.wikimedia.org/wikidatawiki/entities/dcatap.rdf - no change is planned there - but if the namespace is phased out, the data will no longer be queryable using WDQS. One could still download it and, since it's a very small dataset, use any tool that can read RDF to parse it and work with it.

I'd like to hear from anybody interested in this whether they are using this namespace or plan to use it, and what for. Please either answer here or, even better, in the task[2] on Phabricator.

[1] https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#DCAT-AP
[2] https://phabricator.wikimedia.org/T228297

-- Stas Malyshev smalys...@wikimedia.org
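As the message says, the dataset is small enough to process locally with any RDF-capable tool. As a crude standard-library illustration, RDF/XML can be walked as plain XML; the sample document below is made up for the demo, not taken from the real dump, and a real tool would more likely use a proper RDF library such as rdflib, which understands the RDF data model rather than just the XML syntax.

```python
# Crude sketch: treat the DCAT-AP RDF/XML dump as plain XML and pull
# out dataset titles. SAMPLE is a hypothetical stand-in for the real
# dcatap.rdf content.
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#"
         xmlns:dct="http://purl.org/dc/terms/">
  <dcat:Dataset rdf:about="https://example.org/wikidata-dump">
    <dct:title>Wikidata entity dump</dct:title>
  </dcat:Dataset>
</rdf:RDF>"""

root = ET.fromstring(SAMPLE)
DCAT = "{http://www.w3.org/ns/dcat#}"
DCT = "{http://purl.org/dc/terms/}"

datasets = root.findall(f"{DCAT}Dataset")
titles = [d.findtext(f"{DCT}title") for d in datasets]
print(titles)  # → ['Wikidata entity dump']
```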
Re: [Wikidata] Wikidata Query Service User-Agent requirements for script users
Hi!

> Forgive my ignorance. I don't know much about infrastructure of WDQS and
> how it works. I just want to mention how application servers do it. In
> appservers, there are dedicated nodes both for apache and the replica
> database. So if a bot overdo things in Wikipedia (which happens quite a
> lot), users won't feel anything but the other bots take the hit. Routing
> based on UA seems hard though while it's easy in mediawiki (if you hit
> api.php, we assume it's a bot).

We have two clusters - public and internal, with the latter serving only Wikimedia tasks and thus isolated from outside traffic. However, we do not have a practical way right now to separate bot and non-bot traffic, and I don't think we currently have resources for another cluster.

> Routing based on UA seems hard though while it's easy in mediawiki

I don't think our current LB setup can route based on user agent. There could be a gateway that does that, but given that we don't have resources for another cluster, it's not too useful to spend time on developing something like that for now.

Even if we did separate browser and bot traffic, we'd still have the same problem on the bot cluster - most bots are benign and low-traffic, and we want to do our best to enable them to function smoothly. But for this to work, we need ways to weed out outliers that consume too many resources. In a way, the bucketing policy is a sort of version of what you described - if you use proper identification, you are judged on your own traffic. If you use generic identification, you are bucketed with other generic agents, and thus may be denied if that bucket is full. This is not the best final solution, but experience so far shows it has reduced the incidence of problems. Further ideas on how to improve it are of course welcome.

-- Stas Malyshev smalys...@wikimedia.org
[Wikidata] Wikidata Query Service User-Agent requirements for script users
Hello all! Here is (at last!) an update on what we are doing to protect the stability of the Wikidata Query Service.

For 4 years we have been offering Wikidata users the Query Service, a powerful tool that allows anyone to query the content of Wikidata, without any identification needed. This means that anyone can use the service from a script and make heavy or very frequent requests. However, this freedom has led to the service being overloaded by too large a volume of queries, causing the issues or lag that you may have noticed.

A reminder about the context: we have had a number of incidents where the public WDQS endpoint was overloaded by bot traffic. We don't think that any of that activity was intentionally malicious, but rather that the bot authors most probably don't understand the cost of their queries and the impact they have on our infrastructure. We've recently seen more distributed bots, coming from multiple IPs at cloud providers. This kind of pattern makes it harder and harder to filter or throttle an individual bot. The impact has ranged from increased update lag to full service interruption.

What we have been doing: while we would love to allow anyone to run any query they want at any time, we're not able to sustain that load, and we need to be more aggressive in how we throttle clients. We want to be fair to our users and allow everyone to use the service productively. We also want the service to be available to the casual user and provide up-to-date access to the live Wikidata data. And while we would love to throttle only abusive bots, to be able to do that we need to be able to identify them. We have two main means of identifying bots: 1) their user agent and IP address, and 2) the pattern of their queries. Identifying patterns in queries is done manually, by a person inspecting the logs. It takes time and can only be done after the fact. We can only start our identification process once the service is already overloaded. This is not going to scale.
IP addresses are starting to be problematic. We see bots running on cloud providers, spreading their workloads across multiple instances with multiple IP addresses.

We are left with user agents. But here we have a problem again. To block only abusive bots, we would need those bots to use a clearly identifiable user agent, so that we can throttle or block them and contact the author to work together on a solution. It is unlikely that an intentionally abusive bot will voluntarily provide a way to be blocked. So we need to be more aggressive about bots which are using a generic user agent. We are not blocking those, but we are limiting the number of requests coming from generic user agents. This is a large bucket, with a lot of bots in this same category of "generic user agent". Sadly, this is also the bucket that contains many small bots that generate only a very reasonable load. And so we are also impacting the bots that play fair.

At the moment, if your bot is affected by our restrictions, configure a custom user agent that identifies you; this should be sufficient to give you enough bandwidth. If you are still running into issues, please contact us; we'll find a solution together.

What's coming next: first, it is unlikely that we will be able to remove the current restrictions in the short term. We're sorry for that, but the alternative - the service being unresponsive or severely lagged for everyone - is worse. We are exploring a number of alternatives: adding authentication to the service, and allowing higher quotas to bots that authenticate; creating an asynchronous queue, which could allow running more expensive queries, but with longer deadlines. And we are in the process of hiring another engineer to work on these ideas.

Thanks for your patience!

WDQS Team
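Configuring a custom user agent, as requested above, can be as simple as setting one header. A sketch follows; the tool name, version, and contact address are placeholders to replace with your own, and the `/sparql` request shown in the comment follows the usual WDQS conventions.

```python
# Sketch of a policy-compliant User-Agent for WDQS requests: a tool
# name and version plus a way to contact the operator. All names
# below are placeholders.
WDQS = "https://query.wikidata.org/sparql"

def wdqs_headers(tool, version, contact):
    """Build headers identifying the client, so operators can reach you
    instead of throttling the generic-agent bucket you'd otherwise share."""
    return {
        "User-Agent": f"{tool}/{version} ({contact}) python-urllib",
        "Accept": "application/sparql-results+json",
    }

headers = wdqs_headers(
    "my-research-bot", "0.1", "https://example.org/bot; bot@example.org")
print(headers["User-Agent"])
# A real request would then look like:
#   from urllib.parse import urlencode
#   import urllib.request
#   req = urllib.request.Request(
#       WDQS + "?" + urlencode({"query": "SELECT ?x WHERE { ?x ?p ?o } LIMIT 1"}),
#       headers=headers)
#   urllib.request.urlopen(req)
```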
Re: [Wikidata] Significant change of Wikidata dump size
Hi! On 6/25/19 11:17 PM, Ariel Glenn WMF wrote:

> I think the issue is with the 0624 json dumps, which do seem a lot
> smaller than previous weeks' runs.

Ah, true, I didn't realize that. I think this may be because of that dumpJson.php issue, which is now fixed. Maybe rerun the dump?

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Significant change of Wikidata dump size
Hi!

> Which script, please, and which dump? (The conversation was not
> forwarded so I don't have the context.)

Sorry, the original complaint was:

> I apologize if I missed something, but why the current JSON dump size is ~25GB while a week ago it was ~58GB? (see https://dumps.wikimedia.org/wikidatawiki/entities/20190617/)

But looking at it now, I see wikidata-20190617-all.json.gz is comparable with last week's, so it looks like it's fine now?

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Significant change of Wikidata dump size
Hi!

> Follow-up: according to my processing script, this dump contains
> only 30280591 entries, while the main page is still advertising 57M+
> data items.
> Isn't it a bug in the dump process?

There was a problem with the dump script (since fixed), so the dump may indeed be broken. CCing Ariel to take a look. It probably needs to be re-run, or we can just wait for the next one.

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Result format change for WDQS JSON query output
Hi!

> from 2014, so I will research which form is more correct. But for now I
> would recommend to update the tools to recognize that these literals now
> may have type. If I discover that the standards or accepted practices
> recommend otherwise, I'll update further. You can also watch
> https://phabricator.wikimedia.org/T225996 for final resolution of this.

I surveyed the existing practices of SPARQL endpoints and tools, and it looks like the accepted practice is to omit the datatypes for such literals even within the context of RDF 1.1. Example: https://issues.apache.org/jira/browse/JENA-1077

I will adjust the code in Blazegraph accordingly, so WDQS will comply with this practice (i.e. the result format will be as it was before). This will be implemented in the coming days. Sorry again for the disruption.

-- Stas Malyshev smalys...@wikimedia.org
[Wikidata] Result format change for WDQS JSON query output
Hi! Due to an upgrade to a more current version of the Sesame toolkit, the format of the JSON output of the Wikidata Query Service has changed slightly[1]. The change is that plain literals (ones that do not have an explicit data type, like "string" or "string"@de) now have a "datatype" field. Language literals will have the type http://www.w3.org/1999/02/22-rdf-syntax-ns#langString and non-language ones http://www.w3.org/2001/XMLSchema#string. This is in accordance with the RDF 1.1 standard [2], where all literals have a data type (even though for these types it is implicit).

I apologize for not noting this in advance - though I knew this change in the standard had happened, I did not foresee that it would also carry over to the JSON output format. I am not sure yet which output form is actually correct, since the standards seem to be conflicting, maybe due to the fact that the JSON results standard hasn't been updated since 2013 while RDF 1.1 is from 2014, so I will research which form is more correct. But for now I would recommend updating your tools to recognize that these literals may now have a type. If I discover that the standards or accepted practices recommend otherwise, I'll update further. You can also watch https://phabricator.wikimedia.org/T225996 for the final resolution of this.

[1] https://phabricator.wikimedia.org/T225996
[2] https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal

-- Stas Malyshev smalys...@wikimedia.org
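One way for a tool to cope with the change described above is to treat the "datatype" field as optional, so the same code works with both the old and the new output forms. A sketch (the helper name is mine, not part of any WDQS client library):

```python
# Defensive handling of the WDQS JSON results change: plain and
# language-tagged literals may now carry an explicit "datatype" field.
# Treating it as optional keeps a tool working with either output form.
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"

def literal_value(binding):
    """Extract the value of a literal binding, ignoring the implicit
    datatypes that RDF 1.1 assigns to plain and language literals."""
    if binding.get("type") != "literal":
        raise ValueError("not a literal binding")
    dt = binding.get("datatype")
    if dt in (None, XSD_STRING, RDF_LANGSTRING):
        return binding["value"]       # plain or language-tagged string
    return binding["value"], dt       # typed literal: keep the datatype

# Both output forms yield the same value:
old_form = {"type": "literal", "value": "string"}
new_form = {"type": "literal", "value": "string", "datatype": XSD_STRING}
assert literal_value(old_form) == literal_value(new_form) == "string"
```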
[Wikidata] Planned filename change for Wikidata RDF entity dumps
Hi! As outlined in https://phabricator.wikimedia.org/T226153, we are planning to change the filename scheme for the Wikidata RDF entity dumps by removing the "-BETA" suffix from the filename. The Wikidata RDF ontology is not beta anymore and the dumps have been working stably for a while now, so it's time to drop the beta mark from the name.

It may take a week or two for the change to propagate and be applied to the dumps, but if your tools depend on the exact naming, please prepare them for the eventual change in the name. Note that links like https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.gz will still point to the right files, and if all you care about is downloading the latest dump, using these links is always recommended. We will send another message once the change has been implemented and deployed.

Thanks,

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Overload of query.wikidata.org (Guillaume Lederrey)
Hi! On 6/18/19 2:29 PM, Tim Finin wrote:

> I've been using wdtaxonomy
> <https://wdtaxonomy.readthedocs.io/en/latest/> happily for many months
> on my macbook. Starting yesterday, every call I make (e.g., "wdtaxonomy
> -c Q5") produces an immediate "SPARQL request failed" message.

Could you provide more details - which query is sent and what is the full response (including the HTTP code)?

> Might these requests be blocked now because of the new WDQS policies?

One thing I can think of is that this tool does not send a proper User-Agent header. According to https://meta.wikimedia.org/wiki/User-Agent_policy, all clients should identify themselves with a valid user agent. We've started enforcing this recently, so maybe this tool has that issue. If not, please provide the data above.

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

> The documented limits about FDB states that it to support up to 100TB of
> data
> <https://apple.github.io/foundationdb/known-limitations.html#database-size>.
> That is 100x times more
> than what WDQS needs at the moment.

"Support" is such a multi-faceted word. It can mean "it works very well with such an amount of data and is faster than the alternatives", or "it is guaranteed not to break up to this number but breaks after it", or "it would work, given massive amounts of memory, super-fast hardware, and a very specific set of queries, but you'd really have to take an effort to make it work", and everything in between. The devil is always in the details, which this seemingly simple word "supports" is rife with.

> I am offering my full-time services, it is up to you decide what will
> happen.

I wish you luck with the grant, though I personally think that expecting to have, in 6 months, a production-ready service that can replace WDQS is a bit too optimistic. I might be completely wrong on this of course. If you just plan to load the Wikidata data set and evaluate the queries to ensure they are fast and produce proper results on the setup you propose, then it can be done. Good luck!

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Overload of query.wikidata.org
Hi!

> We are currently dealing with a bot overloading the Wikidata Query
> Service. This bot does not look actively malicious, but does create
> enough load to disrupt the service. As a stop gap measure, we had to
> deny access to all bots using python-request user agent.
>
> As a reminder, any bot should use a user agent that allows to identify
> it [1]. If you have trouble accessing WDQS, please check that you are
> following those guidelines.

To add to this, we have had this trouble because two events that WDQS currently does not deal well with have coincided:

1. An edit bot that edited at 200+ edits per minute. This is too much; over 60/m is really almost always too much. Also, if your bot makes multiple changes (e.g. adds multiple statements), it would be good to consider doing them in one call instead of several, since WDQS currently performs an update on each change separately, and this may be expensive. We're looking into various improvements to this, but that is the current state.

2. Several bots have been flooding the query endpoint with requests. Recently there has been a growth in bots that a) completely ignore both the regular limits and the throttling hints, b) do not have a proper identifying user agent, and c) use distributed hosts, so our throttling system has trouble dealing with them automatically. We intend to crack down more and more on such clients, because they look a lot like a DDOS and ruin the service experience for everyone. I will probably write down more detailed rules a bit later, but for now see https://www.mediawiki.org/wiki/Wikidata_Query_Service/Implementation#Usage_constraints - and additionally, having a distinct User-Agent if you're running a bot is a good idea.
And for people who think it's a good idea to launch a max-requests-I-can-stuff-into-the-pipe bot, put it on several Amazon machines so that throttling has a hard time detecting it, and then, once throttling does detect it, neglect to check for a week that all the bot is doing is fetching 403s from the service and wasting everybody's time - please think again. If you want to do something non-trivial querying WDQS and the limits get in the way, please talk to us (and if you know somebody who isn't reading this list but is considering writing a bot interfacing with WDQS - please educate them and refer them to us for help; we really prefer to help rather than to ban). Otherwise, we'd be forced to put more limitations in place that will affect everyone.

-- Stas Malyshev smalys...@wikimedia.org
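A bot that wants to honor the throttling hints mentioned above could do something like the sketch below. It assumes the service signals throttling with an HTTP 429/503 status and a Retry-After header carrying a delay in seconds, which is the common HTTP convention; check the WDQS documentation for the exact signals it emits, and note that the HTTP-date form of Retry-After is not handled here.

```python
# Sketch of honoring server throttling hints: compute how long to wait
# before the next request. ASSUMES Retry-After carries seconds (the
# HTTP-date variant is not handled in this toy version).
def retry_delay(status, headers, default=60.0):
    """Seconds to wait before the next request, honoring Retry-After."""
    if status not in (429, 503):
        return 0.0
    value = headers.get("Retry-After")
    try:
        return float(value)
    except (TypeError, ValueError):
        return default  # header missing or not numeric

assert retry_delay(200, {}) == 0.0
assert retry_delay(429, {"Retry-After": "5"}) == 5.0
assert retry_delay(429, {}) == 60.0
```

Note that a 403, as described in the message above, is not a "retry later" signal - it means the client has been denied and should stop and contact the operators, not keep hammering the endpoint.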
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

> Data living in an RDBMS engine distinct from Virtuoso is handled via the
> engines Virtual Database module i.e., you can build powerful RDF Views
> over ODBC- or JDBC- accessible data using Virtuoso. These view also have
> the option of being materialized etc..

Yes, but the way the data is stored now is as a JSON blob within a text field in MySQL. I do not see how an RDF View over ODBC would help here - of course Virtuoso would be able to fetch the JSON text for a single item, but then what? We'd need to run queries across millions of items, and fetching and parsing JSON for every one of them every time is infeasible. Not to mention this JSON is not an accurate representation of the RDF data model. So I don't think it is worth spending time in this direction... I just don't see how any query engine could work with that storage.

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

> It handles data locality across a shared nothing cluster just fine i.e.,
> you can interact with any node in a Virtuoso cluster and experience
> identical behavior (everyone node looks like single node in the eyes of
> the operator).

Does this mean no sharding, i.e. each server stores the full DB? This is the model we're using currently, but given the growth of the data it may not be sustainable on current hardware. I see in your tables that Uniprot has about 30B triples, but I wonder what the update loads there look like. Our main issue is that the hardware we have now is showing its limits when there are a lot of updates in parallel with significant query load. So I wonder if the "single server holds everything" model is sustainable in the long term.

> There are live instances of Virtuoso that demonstrate its capabilities.
> If you want to explore shared-nothing cluster capabilities then our live
> LOD Cloud cache is the place to start [1][2][3]. If you want to see the
> single-server open source edition that you have DBpedia, DBpedia-Live,
> Uniprot and many other nodes in the LOD Cloud to choose from. All of
> these instance are highly connected.

Again, the question is not so much "can you load 7bn triples into Virtuoso" - we know we can. What we want to figure out is whether, given the specific query/update patterns we have now, it is going to give us significantly better performance, allowing us to support our projected growth. And also, possibly, whether Virtuoso has ways to make our update workflow more optimal - e.g. right now, if one triple changes in a Wikidata item, we are essentially downloading and updating the whole item (not exactly, since triples that stay the same are preserved, but it requires a lot of data transfer to express that in SPARQL). Would there be ways to update things more efficiently?
> Virtuoso handles both shared-nothing clusters and replication i.e., you
> can have a cluster configuration used in conjunction with a replication
> topology if your solution requires that.

Replication could certainly be useful, I think, if it's faster to update a single server and then replicate than to simultaneously update all servers (which is what is happening now).

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

> Unlike, most sites we do have our own custom frontend in front of
> virtuoso. We did this to allow more styling, as well as being flexible
> and change implementations at our whim. e.g. we double parse the SPARQL
> queries and even rewrite some to be friendlier. I suggest you do the
> same no matter which DB you use in the end, and we would be willing to
> open source ours (it is in Java, and uses RDF4J and some ugly JSPX but
> it works, if not to use at least as an inspiration). We did this to
> avoid being locked into endpoint specific features.

It would be interesting to know more about this, if it is open source. Is there any more information about it online?

> Pragmatically, while WDS is a Graph database, the queries are actually
> very relational. And none of the standard graph algorithms are used. To

If you mean algorithms like A* or PageRank, then yes, they are not used much (likely also because SPARQL has no standard support for any of these), though Blazegraph implements some of them as custom services.

> be honest RDF is actually a relational system which means that
> relational techniques are very good at answering them. The sole issue is
> recursive queries (e.g. rdfs:subClassOf+) in which the virtuoso
> implementation is adequate but not great.

Yes, path queries are pretty popular on WDQS too, especially given that many relationships, like administrative/territorial placement or ownership, are hierarchical and transitive, which often requires path queries.

> This is why recovering physical schemata from RDF data is such a
> powerful optimization technique [1]. i.e. you tend to do joins not
> traversals. This is not always true but I strongly suspect it will hold
> for the vast majority of the Wikidata Query Service case.

It would be interesting to see if we can apply anything from the article. Thanks for the link!
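What a recursive path query such as `?x wdt:P279+ ?class` (e.g. `SELECT ?class WHERE { wd:Q515 wdt:P279+ ?class }` for the ancestors of "city") computes is a transitive closure over the subclass-of edges. A toy illustration of that semantics on a made-up hierarchy (the real query, of course, runs server-side on WDQS):

```python
# Toy transitive closure, illustrating what a SPARQL property path
# like wdt:P279+ computes. The edge set below is hypothetical.
from collections import deque

EDGES = {  # child -> parents (made-up subclass-of relation)
    "city": ["human settlement"],
    "human settlement": ["geographic location"],
    "geographic location": ["entity"],
}

def ancestors(node, edges):
    """All nodes reachable by following one or more edges (i.e. P279+)."""
    seen, queue = set(), deque(edges.get(node, []))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.add(parent)
            queue.extend(edges.get(parent, []))
    return seen

print(sorted(ancestors("city", EDGES)))
# → ['entity', 'geographic location', 'human settlement']
```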
-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

>> So there needs to be some smarter solution, one that we'd unlike to
> develop inhouse
>
> Big cat, small fish. As wikidata continue to grow, it will have specific
> needs.
> Needs that are unlikely to be solved by off-the-shelf solutions.

Here I think it's a good place to remind everyone that we're not Google, and developing a new database engine inhouse is probably a bit beyond our resources and budgets. Fitting an existing solution to our goals - sure, but developing something new on that scale is probably not going to happen.

> FoundationDB and WiredTiger are respectively used at Apple (among other
> companies)
> and MongoDB since 3.2 all over-the-world. WiredTiger is also used at Amazon.

I believe they are, but I think for our particular goals we have to limit ourselves to solutions that are a proven good match for our case.

>> We also have a plan on improving the throughput of Blazegraph, which
> we're working on now.
>
> What is the phabricator ticket? Please.

You can see the WDQS task board here: https://phabricator.wikimedia.org/tag/wikidata-query-service/

> That will be vendor lock-in for wikidata and wikimedia along all the
> poor souls that try to interop with it.

Since Virtuoso uses standard SPARQL, it won't be too much of a vendor lock-in, though of course the standard does not cover everything, so some corners are different in all SPARQL engines. This is why even migration between SPARQL engines, even excluding operational aspects, is non-trivial. Of course, migration to any non-SPARQL engine would be an order of magnitude more disruptive, so right now we do not seriously consider doing that.

> It has two backends: MMAP and rocksdb.

Sure, but I was talking about the data model - ArangoDB sees the data as a set of documents. The RDF approach is a bit different.

> ArangoDB is a multi-model database, it support:

As I already mentioned, there's a difference between "you can do it" and "you can do it efficiently".
Graphs are simple creatures and can be modeled on many backends - KV, document, relational, column store, whatever you have. The tricky part starts when you need to run millions of queries on a 10B-triple database. If your backend is not optimal for that task, it's not going to perform.

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

> thanks for the elaboration. I can understand the background much better.
> I have to admit, that I am also not a real expert, but very close to the
> real experts like Vidal and Rahm who are co-authors of the SWJ paper or
> the OpenLink devs.

If you know anybody at OpenLink who would be interested in trying to evaluate such a thing (i.e. how Wikidata could be hosted on Virtuoso) and provide support for this project, it would be interesting to discuss it. While the open-source question is still a barrier and in general the requirements are different, at least discussing it and maybe getting some numbers might be useful.

Thanks,

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

> Yes, sharding is what you need, I think, instead of replication. This is
> the technique where data is repartitioned into more manageable chunks
> across servers.

Agreed - if we are to get any solution that is not constrained by the hardware limits of a single server, we cannot avoid looking at sharding.

> Here is a good explanation of it:
>
> http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF

Thanks, very interesting article. I'd certainly like to know how this works with a database on the order of 10 bln. triples and queries both accessing and updating random subsets of them. Updates are not covered very thoroughly there - this is, I suspect, because many databases of that size do not have an update workload as active (and non-append) as ours. Maybe they still manage to solve it; if so, I'd very much like to know about it.

> Just a note here: Virtuoso is also a full RDMS, so you could probably
> keep wikibase db in the same cluster and fix the asynchronicity. That is

Given how the original data is stored (a JSON blob inside a MySQL table), it would not be very useful. In general, the graph data model and the Wikitext data model on top of which Wikidata is built are very, very different, and expecting the same storage to serve both - at least without very major and deep refactoring of the code on both sides - is not currently very realistic. And of course moving any of the wiki production databases to Virtuoso would be a non-starter. Given that the original Wikidata database stays on MySQL - which I think is a reasonable assumption - there would need to be a data migration pipeline for data to come from MySQL to whatever the WDQS NG storage is.
> also true for any mappers like Sparqlify:
> http://aksw.org/Projects/Sparqlify.html However, these shift the
> problem, then you need a sharded/repartitioned relational database

Yes, relational-RDF bridges are known, but my experience is that they usually are not very performant (the difference between "you can do it" and "you can do it fast" is sometimes very significant), and in our case it would be useless anyway, as the Wikidata data is not really stored in a relational database per se - it's stored in a JSON blob opaquely saved in a relational database structure that knows nothing about Wikidata. Yes, it's not the ideal structure for optimal performance of Wikidata itself, but I do not foresee this changing, at least in any short term. Again, we could of course have a data export pipeline to whatever storage format we want - essentially we already have one - but the concept of having a single data store is probably not realistic, at least within foreseeable timeframes. We use a separate data store for search (ElasticSearch) and will probably have to have a separate one for queries, whatever the mechanism would be.

Thanks,

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
special arrangement. Since this arrangement will probably not include open-sourcing the enterprise part of Virtuoso, it should deliver a very significant, I dare say enormous, advantage for us to consider running it in production. It may be possible that just the OS version is also clearly superior to the point that it is worth migrating, but this needs to be established by evaluation.

> - I recently heard a presentation from Arango-DB and they had a good
> cluster concept as well, although I don't know anybody who tried it. The
> slides seemed to make sense.

We considered ArangoDB in the past, and it turned out we couldn't use it efficiently on the scales we need (that could be our fault, of course). They also use their own proprietary language for querying, which might be worth it if they delivered a clear win on all other aspects, but that does not seem to be the case.

Also, ArangoDB seems to be a document database inside. This is not what our current data model is. While it is possible to model Wikidata in this way, again, changing the data model from RDF/SPARQL to a different one is an enormous shift, which can only be justified by an equally enormous improvement in some other areas, which is currently not apparent. This project seems to be still very young. While I would be very interested if somebody took it on themselves to model Wikidata in terms of ArangoDB documents, load the whole data set, and see what the resulting performance would be, I am not sure it would be wise for us to invest our team's - currently very limited - resources into that.

Thanks,

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] searching for Wikidata items
Hi! > Yes, the api is > at https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=Bush There's also https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Bush&language=en&format=json This is what completion search in Wikidata is using. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
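As a sketch of how one might build such a request from a script (the parameter names match the public MediaWiki action API; the helper itself is just illustrative, and no network call is made):

```python
from urllib.parse import urlencode

def entity_search_url(term, language="en"):
    """Build a wbsearchentities request URL -- the API behind
    Wikidata's completion search. Only constructs the URL."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)

url = entity_search_url("Bush")
```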
Re: [Wikidata] Where did label filtering break recently and how?
Hi! > and if I enable any of the FILTER lines, it returns 0 results. > What changed / Why ? Thanks for reporting, I'll check into it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Are we ready for our future
Hi! > WQS data doesn't have versions, it doesn't have to be in one space and > can easily be separated. The whole point of LOD is to decentralize your > data. But I understand that Wikidata/WQS is currently designend as a > centralized closed shop service for several reasons granted. True, WDQS does not have versions. But each time an edit is made, we now have to download and work through the whole 2M... It wasn't a problem when we were dealing with regular-sized entities, but the current system is certainly not good for such giant ones. As for decentralizing, WDQS supports federation, but for obvious reasons federated queries are slower and less efficient. That said, if there were a separate store for this kind of data, it might work, as cross-querying against other Wikidata data wouldn't be very frequent. But this is something that the Wikidata community needs to figure out how to do. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Are we ready for our future
Hi! > For the technical guys, consider our growth and plan for at least one > year. When the impression exists that the current architecture will not > scale beyond two years, start a project to future proof Wikidata. We may also want to consider whether Wikidata is actually the best store for all kinds of data. Let's consider an example: https://www.wikidata.org/w/index.php?title=Q57009452 This is an entity that is almost 2M in size, with almost 3000 statements, and each edit to it produces another 2M data structure. And its dump, albeit slightly smaller, is still 780K and will need to be updated on each edit. Our database is obviously not optimized for such entities, and they won't perform very well. We have 21 million scientific articles in the DB, and if even 2% of them were like this, that's almost a terabyte of data (multiplied by the number of revisions) and billions of statements. While I am not against storing this as such, I do wonder if it's sustainable to keep this kind of data together with other Wikidata data in a single database. After all, each query that you run - even if not related to those 21 million in any way - will still have to run within the same enormous database and be hosted on the same hardware. This is especially important for services like Wikidata Query Service, where all data (at least currently) occupies a shared space and cannot be easily separated. Any thoughts on this? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
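The sizing claim above can be sanity-checked with quick arithmetic (2% of 21 million entities at roughly 2 MB each):

```python
articles = 21_000_000          # scientific articles in the DB
share = 0.02                   # "even 2% of them"
entity_bytes = 2 * 1024**2     # ~2 MB per giant entity, per the example

total_tb = articles * share * entity_bytes / 1024**4
# comes out around 0.8 TB per copy of the data, before multiplying by
# the number of revisions -- "almost a terabyte", as the message says
```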
[Wikidata-tech] RDF export for SDC
Hi! I started looking into how to produce an RDF dump of MediaInfo entities, and I've encountered some roadblocks that I am not sure how to get around. I would like to hear suggestions on this, here or on Phabricator directly, or on IRC: 1. https://phabricator.wikimedia.org/T99 Basically, right now when we are enumerating entities for certain types, we just look at pages from the namespace related to the entity types and assume the page title is parseable directly into an entity ID. However, with slot entities like MediaInfo this is not the case. So, we need a generic service there that would take a page and a set of entity types, and figure out: a. Which of those entity types are "regular" entities with dedicated page IDs and which ones live in slots b. For the regular entities, do $this->entityIdParser->parse( $row->page_title ) as before c. For slot entities, check that the slot is present and, if so, produce the entity ID specific to this slot. Preferably this is also done without separate DB access (may not be easy), since SqlEntityIdPager needs to have good performance. I am not sure whether there's an API that does that. EntityByLinkedTitleLookup comes very close and even has a hook that does the right thing, but it does DB access even for local IDs for Wikidata (can be fixed) and does not support batching. Any other suggestions on how the above can be properly done? There's also the complication that the mapping from pages to slots is no longer one-to-one, so a fetch() operation can return not just $limit but anywhere from 0 to (number of slots)*$limit entity IDs. Probably not a huge deal, but it might need some careful handling. 2. https://phabricator.wikimedia.org/T222306 The entities in SDC are not local entities - e.g. if I am looking at https://commons.wikimedia.org/wiki/Special:EntityData/M9103972.json P180 and Q83043 do not come from Commons, they come from Wikidata. 
However, they do not have prefixes, which means the RDF builder thinks they are local and assigns them Commons-based namespaces, which is obviously wrong, since they are Wikidata entities. While Commons has a bunch of redirects set up, RDF identifies data by literal URL and has no idea about redirects, so querying the data would be problematic if the Wikidata dataset is combined with the Commons dataset. It would, for example, make it next to impossible to run federated queries between Wikidata and Commons, as the two stores would use different URIs for Wikidata entities. Additionally, the current RDF generation process assumes the wd: prefix always belongs to the local wiki, so on Commons wd: is <https://commons.wikimedia.org/entity/> but on Wikidata it's of course the Wikidata URL. This may be very confusing to people. If wd: means different things on Commons and Wikidata, then federated queries may be confusing, as it'd be unclear which wd: means what where. Ideally, we'd not use the wd: prefix for Commons at all, but this goes against the assumption hardcoded in RdfVocabulary that local wiki entities are wd:. So again, I am not sure what's the best way to treat this situation, since I am not sure how the federation model in SDC is supposed to work - the code suggests there should be some kind of prefixes for entity IDs, but SDC does not seem to use any. Any suggestions about the above are welcome. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
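The dispatch described in point (1) could be sketched as follows (in Python rather than the actual PHP, and with entirely hypothetical names - this is not the real Wikibase API):

```python
def entity_ids_for_page(page_title, present_slots, entity_types, slot_map,
                        parse_id):
    """For 'regular' entity types, parse the page title into an ID;
    for slot-based types (like MediaInfo), yield an ID only if the
    slot is actually present on the page."""
    ids = []
    for etype in entity_types:
        if etype in slot_map:                 # entity lives in a slot
            slot = slot_map[etype]
            if slot in present_slots:
                ids.append(f"{etype}:{page_title}@{slot}")
        else:                                  # regular, page-backed entity
            ids.append(parse_id(page_title))
    return ids

# A page whose mediainfo slot is empty yields no slot-based IDs at all,
# which is why fetch() can return anywhere from 0 to slots*limit IDs.
empty = entity_ids_for_page("File:X.jpg", set(), ["mediainfo"],
                            {"mediainfo": "mediainfo"}, str)
full = entity_ids_for_page("File:X.jpg", {"mediainfo"}, ["mediainfo"],
                           {"mediainfo": "mediainfo"}, str)
```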
Re: [Wikidata] Request
Hi! >> I am facing a problem where I can’t get enough data for my project. So is >> there anything that can be done to extend the limit of queries as they >> timeout ? If you have queries that take longer than the timeout permits, the options usually are: 1. Working with Wikidata dumps, as mentioned before 2. Looking into optimizing your query - maybe the timeout happens because your query is too slow. Check out https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization and https://www.wikidata.org/wiki/Wikidata:Request_a_query . 3. Downloading the information in smaller chunks using LIMIT/OFFSET clauses. Note that this doesn't speed up the query itself. 4. Using the LDF server: https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Linked_Data_Fragments_endpoint Depending on what data you need, one of these options will probably work. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
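Option 3 can be sketched like this (an illustrative helper, not an official client):

```python
def paged_queries(base_query, page_size=1000, pages=3):
    """Yield SPARQL strings fetching results chunk by chunk. Each chunk
    still evaluates the full query server-side, so this only helps when
    individual pages finish within the timeout."""
    for page in range(pages):
        yield f"{base_query}\nLIMIT {page_size}\nOFFSET {page * page_size}"

chunks = list(paged_queries(
    "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 }", page_size=500, pages=2))
```

Note that without a stable ORDER BY clause in the base query, OFFSET-based pagination may return overlapping or missing rows between pages, so add one in practice.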
[Wikidata] Fwd: [rdf4j-users] SPARQL 1.2 Community Group
Hi! There is a discussion going on in W3C SPARQL 1.2 Community Group about the improvements in SPARQL language. May be interesting to people that are using SPARQL and those that may have some ideas of how to improve it. -- Forwarded message - From: *Andy Seaborne* mailto:a...@seaborne.org>> Date: Fri, Mar 29, 2019 at 7:31 AM Subject: [rdf4j-users] SPARQL 1.2 Community Group To: mailto:rdf4j-us...@googlegroups.com>> SPARQL 1.2 Community Group starts up: http://www.w3.org/community/sparql-12/ It will document features found as extensions and capture common needs from the user community. -- You received this message because you are subscribed to the Google Groups "RDF4J Users" group. ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] On the use of "prop/direct-normalized" in RDF dumps
Hi! > rg --search-zip -F "http://www.wikidata.org/prop/direct-normalized" > wikidata_latest-truthy.nt.bz2 | pv > wikidata-extids.txt > > But I get as a result a little less than 29.5 million lines. Pubmed and > DOI, which alone account for about 33 million statements, are not included. Could you provide specific properties and preferably also some Q-ids for which you expected to find direct-normalized props but didn't? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Fwd: [Wikimedia-l] Developing instructional material for Wikidata Query Service
Hi! > In recent years, Wikimedia Israel has developed online instructional > materials, such as the Wikipedia courseware and the guide for creating > encyclopedic content. We plan to use our experience in this field, and in > collaboration with Wikimedia Deutschland, we intend to develop a website > with a step-by-step tutorial to learn how to use the Wikidata Query > Service. The instructional material will be available in three languages > (Hebrew, Arabic and English) but it will be possible to add the same > instructions in other languages. We are quite confident that having a > tutorial that explains and teaches the Query Service will help expand > Wikidata to new audiences worldwide. This sounds great, thank you! -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] WikibaseCirrusSearch extension
Hi! I've been working for a while now on splitting the code that does searching - and more specifically, searching using ElasticSearch/CirrusSearch - out of the Wikibase extension code and into a separate extension (see https://phabricator.wikimedia.org/T190022). If you don't know what I'm talking about here (or are not interested in this topic), you can safely skip the rest of this message. The WikibaseCirrusSearch extension is meant to hold all the code related to ElasticSearch and CirrusSearch extension integration for Wikibase, so the main Wikibase repo does not have any Elastic-specific code. This means that if you have your own Wikibase install, you'll need (after the migration is done) to install WikibaseCirrusSearch to get search functionality like we have on Wikidata now. There will also be a change in configurations - I'll make a migration document and announce it separately. We're now working on deploying and testing it on Beta/testwiki, after which we'll start migrating production to running the code in this extension for search, after which the search code in the Wikibase repo itself will be removed. You can track the progress in the Phabricator task mentioned above. Since the code migration is at a pretty advanced stage now, I'd like to ask that if you make any changes to any code under repo/includes/Search or repo/config in the Wikibase repo, or any tests or configs related to those, you inform me (by adding me to patch reviewers/CC, by email, or by any other reasonable means) so that these changes won't be lost in the migration. I'll be looking through the latest patches for anything related periodically, but I might miss things. WikibaseLexeme code that relates to search will also be migrated to a separate extension (WikibaseLexemeCirrusSearch); that work will be starting soon. So the request above applies to the search parts of the WikibaseLexeme code as well. If you have any questions/comments, please feel free to ask me, on the lists or on IRC. 
Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] [discovery] Data corruption on 2 Wikidata Query Service servers
Hi! > We are having some issues with 2 of the Wikidata Query Service > servers. So far, the issue looks like data corruption, probably > related to an issue in Blazegraph itself (the database engine behind > Wikidata Query Service). The issue prevents updates to the data, but > reads are unaffected as far as we can tell. The incident report for this issue is here: https://wikitech.wikimedia.org/wiki/Incident_documentation/20190110-WDQS It will be updated if we have any new developments or new information. As of now, all servers are working normally. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Fwd: Querying Wikidata
Hi! > Thank's for your reply. > All the failing queries was on the following model > > SELECT distinct ?candidate ?label WHERE { > SERVICE wikibase:mwapi { > bd:serviceParam wikibase:api "EntitySearch" . > bd:serviceParam wikibase:endpoint "www.wikidata.org > <http://www.wikidata.org>" . > bd:serviceParam mwapi:search "Musée Cernuschi" . > bd:serviceParam mwapi:language "fr" . > bd:serviceParam wikibase:limit 5 . > ?candidate wikibase:apiOutputItem mwapi:item . > } > > ?candidate wdt:P17 wd:Q142 . > > SERVICE wikibase:mwapi { > bd:serviceParam wikibase:api "EntitySearch" . > bd:serviceParam wikibase:endpoint "www.wikidata.org > <http://www.wikidata.org>" . > bd:serviceParam mwapi:search "Paris" . > bd:serviceParam mwapi:language "fr" . > bd:serviceParam wikibase:limit 5 . > ?city wikibase:apiOutputItem mwapi:item . > } > ?candidate wdt:P131 ?city . > > ?candidate rdfs:label ?label; > filter(lang(?label)="fr") > } Could you describe in a bit more detail what you're trying to do here? Doing two service calls is not a pattern one would commonly use... It can be slow if query optimizer misunderstands such query, too. I feel I'd have a bit more insight if I understood what you are trying to achieve with this query. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Fwd: Querying Wikidata
Hi! > Is there a good mean to query the sparql wdqs service of wikidata? > I've tried some python code to do it, with relative success. > Success, because some requests gives the expected result. > Relative, because the same query sometimes gives an empty response > either from my code or directly in the WDQS interface, where it's > possible to see that a sparql query sometimes gives an empty response, > sometimes the expected reponse without message or status to know that > the response is erroneous. > (demo is difficult, because it seems to depend of the load of the wdqs > service) Looks like you're running some heavy queries. So the question would be, which queries are those and how often do you run them? > I've found the following info: > * > https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits, > which suggest a possible error HTTP code |429, which I never receive| 429 means you're calling the service too fast or too frequently. If you are just running a single query, you never get 429. > * https://phabricator.wikimedia.org/T179879, which suggest a possible > connexion with OAuth, but such possibility is never documented in the > official documentation https://phabricator.wikimedia.org/T179879 is an open task, thus it's not implemented yet. > None of them gives a practical method to get a response and trust it. Any method that uses HTTP access to the SPARQL endpoint would give you the same result, which depends on the query. So I'd suggest providing some info about the queries and the specific issues you're having, and then we can see if it's possible to improve things. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
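Since heavy or frequent callers can hit HTTP 429, a client-side retry loop helps; here's a minimal sketch (the request callable is simulated, and a real client should also honor the Retry-After response header):

```python
import time

def run_with_backoff(do_request, max_tries=5, base_delay=0.01):
    """Retry a request callable while it signals HTTP 429 ("too many
    requests"). `do_request` returns a (status, body) pair."""
    delay = base_delay
    status, body = None, None
    for _ in range(max_tries):
        status, body = do_request()
        if status != 429:
            break
        time.sleep(delay)
        delay *= 2  # exponential backoff between attempts
    return status, body

# Simulated endpoint: rate-limited once, then succeeds.
responses = iter([(429, ""), (200, "ok")])
status, body = run_with_backoff(lambda: next(responses))
```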
Re: [Wikidata] Slow response and incomplete result for RecentChange API in wikidata.org
Hi! > Also given that it uses oresscores, we recently fixed some performance > issues caused by it. Do you still have issues with it? Yes, the issues I have listed still happen. My API calls do not use ORES. E.g. see: https://logstash.wikimedia.org/goto/63db4ce68fb5da3cdc7828150de10c59 -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Slow response and incomplete result for RecentChange API in wikidata.org
Hi! > Can you please check and let us know if you are still experiencing the > problem? We have a task https://phabricator.wikimedia.org/T202764 which I suspect describes the same issue. It is still open, and even though WDQS is running on Kafka in production and thus is not affected by it, I see it every time I run it on Labs (where Kafka stream is not available). So I think the issues with RC API on wikidata are still alive. There's also a parallel issue of https://phabricator.wikimedia.org/T207718 with RDF fetching, which also still happens. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikibase as a decentralized perspective for Wikidata
Hi! > I don't think this would cause a confusion, because the lexicographical > project is really a separate project that just happens to reside on the > same Wikidata domain. Essentially you did internally what we are asking No, the difference here is that L items are not the same as Q items - e.g. L items do not have sitelinks, and do have lemmas and senses. The data structure is different. If you use a different data structure than Q items - i.e., no labels, descriptions, sitelinks, etc. - then you should use a different letter. But if it's the same structure, just for a different domain - then it should be Q. > Most other sites that link to Wikidata only care about just one of those > projects. E.g. OSM would have very little interest in lexical data, so > it is OK if "L" prefix would be used in OSM and in WD because it won't > be as confusing to the users as reusing the Q. No, that would be confusing. If OSM wants its own data type, because the Q item does not fit - e.g. OSM doesn't want descriptions and sitelinks - then it should use a separate letter, like MediaInfo uses M. But using L would not be smart, since then this data would not integrate well with lexicographical data. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] BlazeGraph/wikibase:label performance
Hi! > But of course the original query should normally be streaming and not > depend on any such smartness to push LIMIT inwards. You are correct, but this may be a consequence of how Blazegraph treats services. I'll try to look into it - it is possible that it doesn't do streaming correctly there. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata-tech] [BREAKING] Planned RDF ontology prefix change
Hi! > We are planning to change the prefix and associated URIs in RDF > representation for Wikidata from: > > PREFIX wikibase: <http://wikiba.se/ontology-beta#> > > to: > > PREFIX wikibase: <http://wikiba.se/ontology#> The change has been implemented now, and RDF data is generated without the beta prefix. Please tell me if you notice any problems or have any questions. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata] Wikidata considered unable to support hierarchical search in Structured Data for Commons
Hi! > data on Commons. I also think that I understand your statement above. > What I'm not understanding is how Daniel's proposal to "start using the > ontology as an ontology on wikimedia projects, and thus expose the fact > that the ontology is broken." isn't a proposal to add poor quality > information from Wikidata onto Wikipedia and, in the process, give > Wikipedians more problems to fix. Can you or Daniel explain this? While I cannot pretend to have expert knowledge and do not purport to interpret what Daniel meant, I think here we must remember that Wikipedia, while of course of huge importance, is not the only Wikimedia project, so "start using it on Wikimedia projects" does not necessarily mean "start using it on Wikipedia", much less "start adding bad information to Wikipedia" (there are other ways to use the data, including imperfect ontologies - e.g. for search, for bot guidance, for quality assurance and editor support, and many other ways). I am not prescribing a specific scenario here, just reminding that "using the ontology on wikimedia projects" can mean a wide variety of things. > Separately, someone wrote to me off list to make the point that > Wikipedians who are active in non-English Wikipedias also wouldn't > appreciate having their workloads increased by having a large quantity > poor-quality information added to their edition of Wikipedia. I think I am sure that would be a bad thing. But I don't think anything we are discussing here would lead to that happening. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata considered unable to support hierarchical search in Structured Data for Commons
Hi! > Cparle wants to make sure that people searching for "clarinet" also get > shown images of "piccolo clarinet" etc. > > To make this possible, where an image has been tagged "basset horn" he > is therefore looking to add "clarinet" as an additional keyword, so that > if somebody types "clarinet" into the search box, one of the images > retrieved by ElasticSearch will be the basset horn one. Generally, if the image is tagged with "basset horn" and the user query is "clarinet", we can do one of the following: 1. Index the whole upstream hierarchy for "basset horn" (presumably we would have to cut off when it gets too deep or too abstract) and then match directly when searching. 2. Expand the hierarchy downstream from "clarinet" and then match against the search index. 3. Have some manual or automatic process that ensures that both "clarinet" and "basset horn" are indexed (not necessarily at once) and rely on it to discover the matches. The problem with (1) is that if the hierarchy changes, we will have to do a huge number of updates, which might overwhelm the system, and most of these updates would not even be for things people search for - but we have no way to know that. The problem with (2) is that downstream hierarchies explode very fast, and if you search for "clarinet" and there are thousands of descendants in these hierarchies, we can't search for all of them, so you may never get a chance to find the basset horn. Also, of course, querying big downstream hierarchies takes time too, which means a performance hit. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
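Option (1), index-time expansion of the upstream hierarchy with a depth cut-off, might look roughly like this (toy graph with assumed labels instead of real Q-ids):

```python
from collections import deque

# Toy subclass-of graph: child -> parents (illustrative example only).
SUBCLASS_OF = {
    "basset horn": ["clarinet"],
    "piccolo clarinet": ["clarinet"],
    "clarinet": ["woodwind instrument"],
    "woodwind instrument": ["musical instrument"],
}

def ancestors(item, max_depth=10):
    """Collect everything upstream of `item` via breadth-first search,
    with a depth cut-off -- the index-time expansion of option (1)."""
    seen, queue = set(), deque([(item, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for parent in SUBCLASS_OF.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen

index_terms = ancestors("basset horn")  # indexed alongside the tag itself
```

The update cost noted for (1) shows up exactly here: if an edge near the top of the graph changes, every already-indexed item whose ancestor set included it has to be re-expanded and re-indexed.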
Re: [Wikidata] Wikidata considered unable to support hierarchical search in Structured Data for Commons
Hi! > possibility to find more results by letting the search engine traverse > the "more-general-than" links stored in Wikidata. People have discovered > cases where some of these links are not correct (surprise! it's a wiki > ;-), and the suggestion was that such glitches would be fixed with > higher priority if there would be an application relying on it. But even The main problem I see here is not that some links are incorrect - which may have bad effects, but it's not the most important issue. The most important one, IMHO, is that there's no way to figure out in any scalable and scriptable way what "more-general-than" means for any particular case. It's different for each type of objects and often inconsistent within the same class (e.g. see the confusion over whether "dog" is an animal, a name of the animal, a name of the taxon, etc.). It's not that navigating the hierarchy would lead us astray - we're not even there yet to have this problem, because we don't even have a good way to navigate it. Using only instance-of/subclass-of seems not to be that useful, because a lot of interesting things are not represented in this way - e.g. finding out that Donna Strickland (Q56855591) is a woman (Q467) is impossible using only this hierarchy. We could special-case a bunch of those, but given how diverse Wikidata is, I don't think this will ever cover any significant part of the hierarchy unless we find a non-ad-hoc method of doing this. This also makes it particularly hard to do something like "let's start using it and fix the issues as we discover them", because the main issue here is that we don't have a way to start with anything useful beyond a tiny subset of classes that we can special-case manually. We can't launch a rocket and figure out how to build the engine later - having a working engine is a prerequisite to launching the rocket! 
There are also significant technical challenges in this - indexing a dynamically changing hierarchy is very problematic, and with our approach to ontology anything can be a class, so we'd have to constantly update the hierarchy. But this is more of a technical challenge, which will come after we have some solution for the above. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] [BREAKING] Planned RDF ontology prefix change
Hi! > If you're making the change, maybe worth going to https: as it'll be > painful to do later? Please see https://phabricator.wikimedia.org/T153563 where it was discussed. In general, there's no reason to use https for ontology URIs, as ontology URIs do not have any data in them and accessing them would not be very useful. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata-tech] [BREAKING] Planned RDF ontology prefix change
Hi! We are planning to change the prefix and associated URIs in RDF representation for Wikidata from: PREFIX wikibase: <http://wikiba.se/ontology-beta#> to: PREFIX wikibase: <http://wikiba.se/ontology#> If you are using Wikidata Query Service, you do not have to do anything, as WDQS already is using the new definition. However, if you consume RDF exports from Wikidata or RDF dumps directly, you will need to change your clients to expect the new URI scheme for Wikibase ontology. Also, if you're using Wikibase extension in your project, please be aware that the RDF URIs generated by it will use this prefix after the change. This is defined in repo/includes/Rdf/RdfVocabulary.php around line 175: self::NS_ONTOLOGY => self::ONTOLOGY_BASE_URI . "#", The new data will have schema:softwareVersion "1.0.0" triple on the dataset node[1], which will allow your software to distinguish the new data format from the old one. The task tracking the change is https://phabricator.wikimedia.org/T112127. I will make another announcement when the change is merged and deployed and the data produced by Wikidata is going to change. Please contact me (or comment in the task) if you have any questions or concerns. [1] https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Header -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
[Wikidata] [BREAKING] Planned RDF ontology prefix change
Hi! We are planning to change the prefix and associated URIs in RDF representation for Wikidata from: PREFIX wikibase: <http://wikiba.se/ontology-beta#> to: PREFIX wikibase: <http://wikiba.se/ontology#> If you are using Wikidata Query Service, you do not have to do anything, as WDQS already is using the new definition. However, if you consume RDF exports from Wikidata or RDF dumps directly, you will need to change your clients to expect the new URI scheme for Wikibase ontology. Also, if you're using Wikibase extension in your project, please be aware that the RDF URIs generated by it will use this prefix after the change. This is defined in repo/includes/Rdf/RdfVocabulary.php around line 175: self::NS_ONTOLOGY => self::ONTOLOGY_BASE_URI . "#", The new data will have schema:softwareVersion "1.0.0" triple on the dataset node[1], which will allow your software to distinguish the new data format from the old one. The task tracking the change is https://phabricator.wikimedia.org/T112127. I will make another announcement when the change is merged and deployed and the data produced by Wikidata is going to change. Please contact me (or comment in the task) if you have any questions or concerns. [1] https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Header -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
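For dump consumers, the switch amounts to a URI rewrite plus a check for the version triple described above; a minimal string-level sketch (no RDF library assumed):

```python
OLD_PREFIX = "http://wikiba.se/ontology-beta#"
NEW_PREFIX = "http://wikiba.se/ontology#"

def is_post_change_dump(header_lines):
    """Detect the new format via the schema:softwareVersion "1.0.0"
    triple that the announcement says is added to the dataset node."""
    return any("softwareVersion" in line and '"1.0.0"' in line
               for line in header_lines)

def migrate_uri(uri):
    """Rewrite an old-style ontology URI to the new prefix."""
    return uri.replace(OLD_PREFIX, NEW_PREFIX)

new_uri = migrate_uri("http://wikiba.se/ontology-beta#Statement")
```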
Re: [Wikidata] Wikidata considered unable to support hierarchical search in Structured Data for Commons
Hi! > Apparently the Wikidata hierarchies were simply too complicated, too > unpredictable, and too arbitrary and inconsistent in their design across > different subject areas to be readily assimilated (before one even > starts on the density of bugs and glitches that then undermine them). The main problem is that there is no standard way (or even defined small number of ways) to get the hierarchy that is relevant for "depicts" from current Wikidata data. It may even be that for a specific type or class the hierarchy is well defined, but the sheer number of different ways it is done in different areas is overwhelming and ill-suited for automatic processing. Of course things like "is "cat" a common name of an animal or a taxon and which one of these will be used in depicts" adds complexity too. One way of solving it is to create a special hierarchy for "depicts" purposes that would serve this particular use case. Another way is to amend existing hierarchies and meta-hierarchies so that there would be an algorithmic way of navigating them in a common case. This is something that would be nice to hear about from people that are experienced in ontology creation and maintenance. > to be chosen that then need to be applied consistently? Is this > something the community can do, or is some more active direction going > to need to be applied? I think this is very much something that the community can do. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Mapping Wikidata to other ontologies
Hi! > That's one way of linking up, but another way is using equivalent > property ( https://www.wikidata.org/wiki/Property:P1628 ) and equivalent > class ( https://www.wikidata.org/wiki/Property:P1709 ). See for example It is technically possible to add values for P1628 into the RDF export. However, the following questions arise: 1. Are we ready to claim these are exact equivalents? Sometimes semantic meanings differ, and some properties have class requirements - e.g. http://schema.org/illustrator expects the value to be of class Person, but of course a Wikidata item would not have that class. Same for the subject - it is expected to be of class Book, but won't be. This may confuse some systems. Is that OK? 2. How do we deal with multiple ontologies with the same meanings? E.g. https://www.wikidata.org/wiki/Property:P21 has 4 equivalent properties. There might be more. Do we want to generate them all? Why are there two properties for the same FOAF ontology - is that right? 3. If you change P1628, that does not automatically make all items with the relevant predicate update. You need an extensive update process - which currently does not exist, and which for a popular property may require significant resources to complete; some properties have millions of uses. Using P1709 is even more tricky, since the Wikidata ontology (provided we call what we have an ontology, which may also not be acceptable to some) is rather different from traditional semantic ontologies, and we do not really enforce any of the rules with regard to classes, property domains/ranges, etc., and have frequent and numerous exceptions to those. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
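If question 1 were answered positively, generating such mappings could be as simple as the following sketch (the property/URI pairs are illustrative assumptions, not a vetted list; a real exporter would read them from P1628 statements):

```python
# Hypothetical P1628 values, hard-coded here for illustration only.
EQUIVALENTS = {
    "P21": ["http://xmlns.com/foaf/0.1/gender"],
    "P18": ["http://schema.org/image"],
}

def equivalence_triples(equivalents):
    """Yield (subject, predicate, object) triples declaring each Wikidata
    property equivalent to its external counterparts."""
    owl_eq = "http://www.w3.org/2002/07/owl#equivalentProperty"
    for pid, uris in equivalents.items():
        for uri in uris:
            yield (f"http://www.wikidata.org/entity/{pid}", owl_eq, uri)

triples = list(equivalence_triples(EQUIVALENTS))
```

Question 3 above remains the hard part: these triples would have to be regenerated and re-propagated whenever a P1628 statement changes.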
Re: [Wikidata] Stemming in Search
Hi! > The general Search box was what I was using (top right corner of interface) > > I typed in the following: > > Readers Digest > > and expected to see > Reader's Digest Q371820 > > but it did not appear. > > Today I just checked again the same scenario, as I typed this email, > > and now it does appear. Yes, this is how it should work. There were no changes lately, AFAIK, but it is possible that you hit some glitch or maintenance during your previous search. If that happens again, please tell me when and with which search string/URL, and I'll try to investigate. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Stemming in Search
Hi! > When will stemming be supported in Search ? In general, I think it already should be, for fields and contexts that use appropriate analyzers, but I'd like to hear more details: 1. Which search? 2. What are you looking for, i.e. what is the search string? 3. What do you expect to find? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata SPARQL query logs available
Hi! On 8/23/18 2:07 PM, Daniel Mietchen wrote: > On Thu, Aug 23, 2018 at 10:44 PM wrote: >> I was wondering why our research section was number 8. Then I recalled >> our dashboard running from >> "http://people.compute.dtu.dk/faan/cognitivesystemswikidata1.html". It >> updates around every 3 minutes all day long... > > Such automated queries should not be in the organic query file that I looked > at. If it's a browser page and the underlying code does not set a distinctive user agent, I think they will be. It'd be hard to identify such cases otherwise (cc'ing Markus in case he knows more on the topic). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
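[Editorial note: for dashboard authors reading along, the easiest way to make such automated traffic identifiable in query logs is a distinctive User-Agent header. A minimal stdlib-only sketch; the tool name and contact address are made-up placeholders:]

```python
# Sketch: tag automated WDQS requests with a distinctive User-Agent so
# they can be told apart from organic browser traffic in query logs.
# "MyDashboardBot" and the contact details are placeholders.
import urllib.parse
import urllib.request

SPARQL = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
url = ("https://query.wikidata.org/sparql?format=json&query="
       + urllib.parse.quote(SPARQL))

req = urllib.request.Request(url, headers={
    # Identify the tool and a way to reach its operator.
    "User-Agent": "MyDashboardBot/1.0 (https://example.org; me@example.org)",
})

# urllib.request.urlopen(req) would send the request with that header;
# here we only construct it and verify the header is set.
print(req.get_header("User-agent"))
```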
Re: [Wikidata] Wikidata SPARQL query logs available
Hi! > I just ran Max' one-liner over one of the dump files, and it worked > smoothly. Not sure where the best place would be to store such things, > so I simply put it in my sandbox for now: > https://www.wikidata.org/w/index.php?title=User:Daniel_Mietchen/sandbox=732396160 If you think it's a dataset others may want to reuse, tabular data on Commons may be a venue: https://www.mediawiki.org/wiki/Help:Tabular_Data -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Indexing everything (was Re: Indexing all item properties in ElasticSearch)
Hi! > This is a bit tangential to the topic, but isn’t that basically what > schema.org was developed for? (I’m not sure if that’s still its primary > purpose, but as far as I know it was started by a group of search > engines to develop a unified format websites could use to make their > semantics more accessible to those search engines.) There are a number of schemas, like Dublin Core, that try to address issues like that. However, none is even close to what we're talking about - covering several thousands properties that change all the time. They have very basic things covered, but AFAIK not much beyond. And I think those vocabularies still do not solve our problem with updating labels in multiple languages and keeping them in sync. That said, this would be quite offtopic for *this* thread, but still if anybody has any ideas on how to present Wikidata content better to search engines using well-known metadata vocabularies, I think it would be a very welcome effort. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Indexing all item properties in ElasticSearch
Hi! > I tried searching for a few DOIs today which are string properties > (i.e. 10.1371/JOURNAL.PCBI.1002947) and didn't get any results. Statements are indexed, but you have to use haswbstatement with a specific property to look for them. > Is this the phabricator task for > this: https://phabricator.wikimedia.org/T163642 ? This is the task for making strings searchable _without_ the haswbstatement keyword. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
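[Editorial note: to make the distinction concrete, here is a sketch of that DOI lookup in its haswbstatement form against the on-wiki search API. P356 is the DOI property; the search term string is the part that matters, the rest is ordinary MediaWiki API plumbing:]

```python
# Sketch: a DOI is findable today via haswbstatement:<property>=<value>,
# not via plain fulltext search. This builds the search API request for
# the DOI mentioned above.
import urllib.parse

def haswbstatement_search(pid: str, value: str) -> str:
    """Build a MediaWiki search API URL for a statement lookup."""
    term = f"haswbstatement:{pid}={value}"
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "search",
        "srsearch": term,
        "format": "json",
    })
    return "https://www.wikidata.org/w/api.php?" + params

print(haswbstatement_search("P356", "10.1371/JOURNAL.PCBI.1002947"))
```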
Re: [Wikidata] Indexing everything (was Re: Indexing all item properties in ElasticSearch)
Hi! > https://everypageispageone.com/2011/07/13/search-vs-query/ ). Currently > our query service is a very strong and complete service, but Wikidata > search is very poor. Let's take Blade Runner. I don't think it's *very* poor anymore, but it certainly can be better. > In my ideal world, everything I see as a human gets indexed into the > search engine preferably in a per language index. For example for Dutch Err... The problem is that what you see as a human and what a search engine uses for lookups are very different things. While for text articles it is similar, for structured data it's quite different, and treating structured data the same way as text is not going to produce good results, partially because most search algorithms make assumptions that come from the text world, partially because we'd be ignoring useful clues present in structured data. > something like a text_nl field with the, label, description, aliases, > statements and references in there. So index *everything* and never see There are such fields, but it makes no sense to put references there, because there's no such thing as a "Dutch reference". References do not change with language. > a Qnumber or Pnumber in there (extra incentive for people to add labels > in their language). Probably also everything duplicated in the text That presents a problem. While you see "instance of": "human", the data is P31:Q5. We can, of course, put "instance of": "human" in the index. But what if the label for Q5 changes? Now we have to re-index 10 million records. And while we're doing it, what if another label for such an item changes again? We'd have to start another million-size reindex. In a week, we'd have a backlog of hopeless size, or would require processing power that we just don't have. Note also that ElasticSearch doesn't really do document updates - it just writes a new document. 
So frequent updates to the same document are not its optimal scenario, and we're talking about propagating each label edit to each item that is linked to that one. I'm afraid that would explode on us very quickly. The problem is not indexing labels, the problem is keeping them up-to-date on 50 million interlinked items. When displaying, it's easy - you don't need to worry until you show it, and most items are shown only rarely. Even then you see a label out of date now and then. But with search, you can't update the label on use - when you want to use it (i.e. look up), it should already be up-to-date, otherwise it's useless. > As for implementation: We already have the logic to serialize our json > to the RDF format. Maybe also add a serialization format for this that > is easy to ingest by search engines? I don't know any such special format, do you? We of course have JSON updates to ElasticSearch, but as I noted before, updates are the problem there, not format. RDF of course also does not carry denormalized data, so we also update only entries that need updating, and fetch labels on use. We cannot do that for the search index. I don't think format is the problem here. > . Making it easier to index not only for our own search would be a nice > added benefit. Sure, but experience has shown that the strategy of "dump everything into one huge text" works very poorly in Wikidata. That's why we implemented a specialized search that knows how the structured data works. If the search sucks less now than it did before, that's the reason. > How feasible is this? Do we already have one or multiple tasks for this > on Phabricator? Phabricator has gotten a bit unclear when it comes to > Wikidata search, I think because of misunderstanding between people what > the goal of the task is. Might be worthwhile spending some time on > structuring that. Wikidata search tasks would be under "Wikidata" + "Discovery-Search". 
There are multiple tasks for it, but if you want to add any, please feel welcome to browse and add. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Indexing all item properties in ElasticSearch
Hi! > * I would really like dates (mainly, born/died), especially if they work > for "greater units", that is, I search for a year and get an item back, > even though the statement is month- or day-precise This is something I've been thinking about for a while, mainly because the way we index dates now does not serve some important use cases. Even in the Query Service we treat dates as fixed instants on the time scale, whereas some dates are not instants but intervals (which is captured in the Wikidata precision field, but we are currently not paying any attention to it); in fact, many of the dates we use are more interval-like than instant-like in nature. This makes searching for "somebody who was born in 1820" possible but laborious (you need to construct the intervals manually) and inefficient, since we can't just look up by year. There are certainly improvements possible in this area; I'm not yet sure how to do it though. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
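[Editorial note: to illustrate the "possible but laborious" point above, matching a day-precise birth date against a bare year currently means expanding the year into an explicit interval by hand. A sketch of the query one has to construct today (P569 is date of birth, P31/Q5 restricts to humans):]

```python
# Sketch: to find people born "in 1820" the year must be expanded into
# an explicit [start, end) interval, because dates are indexed as
# instants rather than intervals.

def born_in_year_query(year: int) -> str:
    """Build a WDQS query for people born in a given calendar year."""
    start = f"{year}-01-01T00:00:00Z"
    end = f"{year + 1}-01-01T00:00:00Z"
    return (
        "SELECT ?person WHERE {\n"
        "  ?person wdt:P31 wd:Q5 ;\n"
        "          wdt:P569 ?dob .\n"
        f'  FILTER("{start}"^^xsd:dateTime <= ?dob && ?dob < "{end}"^^xsd:dateTime)\n'
        "}"
    )

print(born_in_year_query(1820))
```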
Re: [Wikidata] Indexing all item properties in ElasticSearch
Hi! > I could definitely see a usecase for 1) and maybe for 2). For example, > let's say i remember that one movie that Rutger Hauer played in, just > searching for 'movie rutger hauer' gives back nothing: > > https://www.wikidata.org/w/index.php?search=movie+rutger+hauer > > While Wikipedia gives back quite a nice list of options: > > https://en.wikipedia.org/w/index.php?search=movie+rutger+hauer Well, this is not going to change with the work we're discussing. The reason you don't get anything from Wikidata is because "movie" and "rutger hauer" are labels from different documents and ElasticSearch does not do joins. We only index each document in itself, and possibly some additional data, but indexing labels from other documents is now beyond what we're doing. We could certainly discuss it but that would be separate (and much bigger) discussion. > If we would index item properties as well, you could get back Blade > Runner (Q184843) which has Rutger Hauer as one of its 'cast member' > values. You could, but not by asking something like "movie rutger hauer", at least not without a lot of additional work. Indexing "cast member" would get you a step closer, but only a tiny step and there are a number of other steps to take before that can work. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch
Hi! > The top 1000 > is: > https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaBX_74YkY/edit?usp=sharing This one is pretty interesting, how do I extract this data? It may be useful independently of what we're discussing here. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch
Hi! > I think we already index way more than P31 and P279. Oh yes, all the string properties. > So I think that the increase is smaller than what you anticipate. > What I'd try to avoid in general is indexing terms that have only doc > since they are pretty useless. For unique string properties, that would be a frequent occurrence. But I am not sure why it's useless - won't it be a legit use case to look up something by external ID? > I think we should investigate what kind of data we may have here, and at > least for statement_keywords I would not index data that contain random > text (esp. natural language) since they are prone to be unique and > impossible to search. Yes, we definitely should not do that. I tried to exclude such properties but if you notice more of them, let's add them to exclusion config. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Indexing all item properties in ElasticSearch
Hi! > * I would really like dates (mainly, born/died), especially if they work > for "greater units", that is, I search for a year and get an item back, > even though the statament is month- or day-precise What would be the use case for this? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] Indexing all item properties in ElasticSearch
Hi! Today we are indexing in ElasticSearch almost all string properties (except a few) and selected item properties (P31 and P279). We've been asked to extend this set and index more item properties (https://phabricator.wikimedia.org/T199884). We did not do it from the start because we did not want to add too much data to the index at once, and wanted to see how the index behaves. To evaluate what this change would mean, some statistics: all usage of item properties in statements amounts to about 231 million uses (according to the sqid tool database). Of those, about 50M uses are "instance of", which we are already indexing. Another 98M uses belong to two properties - published in (P1433) and cites (P2860) - leaving about 86M for the rest of the properties. So, if we index all the item properties except P2860 and P1433, we'll be a little more than doubling the amount of data we're storing for this field, which seems OK. But if we index those too, we'll essentially be quadrupling it - which may be OK too, but is a bigger jump and one that may potentially cause some issues. So, we have two questions: 1. Do we want to enable indexing for all item properties? Note that if you just want to find items with certain statement values, the Wikidata Query Service matches this use case best. It's only in combination with actual fulltext search that on-wiki search is better. 2. Do we need to index P2860 and P1433 at all, and if so, would it be OK if we omit indexing them for now? Would be glad to hear thoughts on the matter. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
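[Editorial note: the usage counts above come from the sqid tool's database; the same kind of figure can in principle be checked in WDQS for a single property, though for very heavily used properties (P2860, P1433) such a query will likely hit the timeout, so treat this as a sketch only:]

```python
# Sketch: count how many statements use a given item property. For the
# heaviest properties this will likely exceed the WDQS query timeout;
# the sqid statistics referenced above avoid that problem.

def usage_count_query(pid: str) -> str:
    """Build a WDQS query counting uses of a property in statements."""
    return (
        "SELECT (COUNT(*) AS ?uses) WHERE {\n"
        f"  ?item wdt:{pid} ?value .\n"
        "}"
    )

print(usage_count_query("P1433"))
```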
Re: [Wikidata] UniProt license change to CC-BY 4.0 could we be added to the federatable sparql endpoints
Hi! On 7/19/18 1:07 AM, Jerven Tjalling Bolleman wrote: > Dear WikiData community, > > I am very happy to announce that all UniProt datasets are now available > under CC-BY 4.0 > > https://www.uniprot.org/news/2018/07/18/release > https://www.uniprot.org/help/license > > > https://www.sib.swiss/about-us/news/1186-encouraging-knowledge-reuse-to-foster-innovation > > > Could our sparql endpoint https://sparql.uniprot.org/sparql please be > added to the > the endpoints that are usable with SPARQL 1.1. SERVICE clauses at > https://query.wikidata.org/sparql. Thank you! I will take care of it in the next update (next week). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] On traceability and reliability of data we publish [was Re: [Wikimedia-l] Solve legal uncertainty of Wikidata]
Hi! > I agree this is misconception that a copyright license make any direct > change to data reliability. But attribution requirement does somewhat > indirectly have an impact on it, as it legally enforce traceability. While true, I don't think it's of much practical use if traceability is what you are seriously interested in. Imagine Wikidata were CC-BY: each piece of data you use from Wikidata would then have to be marked as "coming from Wikidata.Org". What have you gained? Wikidata is huge, and this mark doesn't even tell you which item it is from, while being completely satisfactory legally. It is even more useless for actually ensuring the data is correct or tracing its provenance to primary sources - you'd still have to find the item and check the references manually (or automatically, maybe), just as you could do with CC0. A CC-BY license would not have added very much on the Wikidata side. Meanwhile, of course, even with CC0 nothing prevents you from importing Wikidata data in such a way that each piece of data still carries the mark "coming from Wikidata". While it is not a legal requirement with CC0, nothing in CC0 prevents it. If this matches your provenance needs, there's nothing stopping you from doing it, and the legal requirements of CC-BY do not improve things for you in any way - they would just force people who do *not* need this to do it anyway. > That is I strongly disagree with the following assertion: "a license > that requires BY sucks so hard for data [because] attribution > requirements grow very quickly". To my mind it is equivalent to say that I think this assertion (that attribution requirements grow) is factually true. Each piece of data from a CC-BY data set needs to carry attribution. If your needs require combining several data sets, each of them needs to carry attribution. This attribution should be carried through all data processing pipelines. 
You may be OK with this growth, but as I just explained above, these requirements, while being onerous for people that don't need tracing each piece of data, are still unsatisfactory in many cases for those that do. So having CC-BY would be both onerous and useless. > we will throw away traceability because it is subjectively judged too > large a burden, without providing any start of evidence that it indeed > can't be managed, at least with Wikimedia current ressources. It's not Wikimedia that will be shouldering the burden, it's every user of Wikimedia data sets. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata-tech] lexeme fulltext search display
Hi! > I can reimplement it manually, but I would be largely duplicating what > HtmlPageLinkRendererBeginHookHandler is supposed to do. The problem > seems to be that it is not doing it right. When the code works on the > link like /wiki/Lexeme:L2#L2-F1, it does this: > > $entityId = $foreignEntityId ?: > $this->entityIdLookup->getEntityIdForTitle( $target ); > > Which produces back LexemeId instead of Form ID. It can't return Lexeme > ID since lexeme does not have content model, and getEntityIdForTitle > uses content model to get from Title to ID. So, I could duplicate all > this code but I don't particularly like it. Could we fix > HtmlPageLinkRendererBeginHookHandler instead maybe? Also, looks like Form actually doesn't have link-formatter-callback and its own link formatter code. So I wonder if there's an existing facility to format links to Forms? Leszek, do you have any information on this? Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata-tech] lexeme fulltext search display
Hi! > You can use an EntityTitleLookup to get the Title object for an EntityId. In > case of a Form, it will point to the appropriate section. You can use the OK, I see it's just adding form id as a fragment, so it's easy I guess. > LinkRenderer service to make a link. Or you use an EntityIdHtmlLinkFormatter, > which should do the right thing. You can get one from a > OutputFormatValueFormatterFactory. I can reimplement it manually, but I would be largely duplicating what HtmlPageLinkRendererBeginHookHandler is supposed to do. The problem seems to be that it is not doing it right. When the code works on the link like /wiki/Lexeme:L2#L2-F1, it does this: $entityId = $foreignEntityId ?: $this->entityIdLookup->getEntityIdForTitle( $target ); Which produces back LexemeId instead of Form ID. It can't return Lexeme ID since lexeme does not have content model, and getEntityIdForTitle uses content model to get from Title to ID. So, I could duplicate all this code but I don't particularly like it. Could we fix HtmlPageLinkRendererBeginHookHandler instead maybe? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata-tech] lexeme fulltext search display
Hi! >> color/colour (L123) >> colors: plural for color (L123): English noun > > I'd rather have this: > > colors/colours (L123-F2) > plural of color (L123): English noun This part is a bit trickier since the title is still L123, so the system now is generating the link for L123. I could override that, but I see two questions: 1. What the link will be pointing to? I haven't found the code to generate the link to specific Form. I could write a new one but if it'd sit outside main classes it may be a fragile design. 2. This means overriding standard linking code and possibly reimplementing part of it (depending on whether this code supports generating Form link instead of Lexeme) - may again be a bit fragile. Unless I find standard means to do it. > Note that in place of "plural", you may have something like "3rd person, > singular, past, conjunctive", derived from multiple Q-ids. Yes, of course. > Again, I don't think any highlighting is needed. Not strictly speaking needed, but might be nice. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
[Wikidata-tech] lexeme fulltext search display
Hi! I am working now on Lexeme fulltext search. One of the unclear moments I have encountered is how to display Lexemes as search results. I am basing this on the assumption that we want to match both Lemmas and Forms (please tell me if I'm wrong). Having the match, I plan to display a Lemma match like this: title (LN) Synthetic description e.g. color/colour (L123) English noun Meaning, the first line with the link would be the standard lexeme link generated by the Lexeme code (which also deals with multiple lemmas), and the description line is a generated description of the Lexeme - just like in completion search. The problem here, however, is that since the link is generated by the Lexeme code, which has no idea about search, we can not properly highlight it. This can probably be solved with some trickery, e.g. locating search matches inside the generated string and highlighting them, but first I'd like to ensure this is how it should look. More tricky is displaying the Form (representation) match. I could display the same as above here, but I feel this might be confusing. Another option is to display Form data, e.g. for "colors": color/colour (L123) colors: plural for color (L123): English noun The description line features the matched Form's representation and a synthetic description for this Form. Right now the matched part is not highlighted - because it will otherwise always be highlighted, as it is taken from the match itself - so I am not sure whether it should be or not. So, does this display look like what we want to produce for Lexemes? Is there something that needs to be changed or improved? Would like to hear some feedback. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata-tech] Wikidata full text search
Hi! > Would there be any drawback with the following steps as way forward > and possibility to learn more as we go? > 1. We return results for the Lexeme namespace only when people > explicitly select it If you mean "it and only it" (as opposed to Lexemes + any other namespace), then yes, this is doable and this is probably what I am going to start with. However, a lot of people - as I observed with several community members - tend to use the "All" option and expect it to work. > 2. We get feedback > 3. We go the "Best possible query" route when people select all namespaces > 4. We get feedback > 5. We go the "Best possible query" route for all searches if feedback > indicates this is useful (I don't know at this point) I am not sure which mode is best for Wikidata now; there are at least several plausible ways to go by default for Special:Search: 1. Search in Items only 2. Search in Items + Properties 3. Search in Items + Properties + Lexemes 4. Search in Items + Lexemes 5. Any of the above plus some of the article spaces (i.e. Wikidata or Help) This requires mixed search to work (except for 1 and 2), but that is a separate decision. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
[Wikidata-tech] Wikidata full text search
Hi! While working on fulltext search for Lexemes, I have encountered a question which I think needs to be discussed and resolved. The question is how fulltext search should work when dealing with different content models, and what search should do by default and in specialized cases. The main challenge in Wikidata is that we are dealing with substantially different content models - articles, Items (including Properties, because while being formally a different type, they are similar enough to Items for search to ignore the difference) and Lexemes organize their data in different ways, and should be searched using different specialized queries. This is currently unique to Wikidata, but SDC might eventually have the same challenge to deal with. I've described the challenges and questions in more detail here: https://www.wikidata.org/wiki/User:Smalyshev_(WMF)/Wikidata_search#Fulltext_search I'd like to first hear some feedback about what the expectations for the combined search are - what is expected to work, how it is expected to work, what the defaults are, and what the use cases for these are. I have outlined some solutions that were proposed on wiki; if you have any comments, please feel welcome to respond either here or on wiki. The TL;DR version is that searching across different data models is hard, and we will need to sacrifice something to make it work. We need to figure out and decide which of these sacrifices are acceptable and what is enabled/disabled by default. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata] WDQS with use of automated requests
Hi! On 5/15/18 3:27 PM, Justin Maltais wrote: > Hi, > > I am looking for the most efficient way of getting the following > information out of WDQS: > > * One language only (e.g. fr.wikipedia.org) > * All instances of human (e.g. of the abstraction: wd:Q9916|Dwight > David > Let's say we have a list of all sovereign states (Q16, Q30, Q142, ...) > and all letters of the requested language (French: a, b, c, ...) , we > can automate requests and get a lot of results. Unfortunately, it's > costly and not efficient. It takes about a day to succeed. The first thing I would like to ask is please don't do that again. This created a significant load on the server; the script completely ignored the throttling headers we sent, and in the future we will ban such clients for extended periods of time, to prevent harm to the service. If your client cannot abide by 429/Retry-After headers, please do not run it in an automated, repeated fashion until it either can handle them properly, or inserts delays long enough that you can be sure you are not launching an avalanche of heavy requests and crowding out other users. If something takes too long, that's a good moment to ask for help, not to put it in a loop that would hit the server repeatedly for days. If you need to deal with a massive data set that needs to be processed, I would suggest trying the following strategy: 1. Load the primary key data - like the list of all humans, if that's what you need - into your own storage. You can use either the LDF server or parse the dump directly for Q5 (maybe with Wikidata Toolkit?). For some scenarios even a direct query would be fine, but for Q5 it would probably be too much. 2. Split this data set into palatable batches - like 100 items per batch or so; you can experiment with that - it's fine to cause a couple of timeouts if it's not an automated script doing it 20 times a second for a long time. 
Once you have a sane batch size, run the query that needs to fetch other data, using a VALUES clause to substitute the primary key data. Watch for 429 responses - if you're getting them, insert delays or lower the batch size, or ask for help again if that doesn't work. Alternatively, segmenting the records by some other criteria may work too, but I don't think a filter like STRSTARTS(?personLabel, "D") is going to be effective - I don't think the Blazegraph query optimizer is smart enough to convert this to an index lookup, and without that, this just slows things down by introducing more checks in the query. And even if it did, there are a lot of labels starting with "D", so that probably won't be too useful for speeding it up. Having said that, I am curious - what exactly are you doing with this data set? Why do you need a list of all humans - how is this list going to be used? Knowing that may help to devise a better specialized strategy for achieving the same. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
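[Editorial note: the batching strategy suggested above can be sketched as follows. Batch size, the follow-up query, and the tool name are placeholders to tune; the VALUES injection and the 429/Retry-After handling are the parts that matter:]

```python
# Sketch of the suggested strategy: split a locally stored list of item
# IDs into batches, inject each batch via a VALUES clause, and back off
# whenever the server answers 429 with a Retry-After header.
import time
import urllib.parse
import urllib.request
from urllib.error import HTTPError

ENDPOINT = "https://query.wikidata.org/sparql"

def chunks(items, size):
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def values_query(qids):
    """Build a query fetching labels for one batch of items."""
    values = " ".join(f"wd:{q}" for q in qids)
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )

def run(query, user_agent="MyBatchTool/0.1 (me@example.org)"):
    """Send one query, honoring Retry-After on throttling (429)."""
    url = ENDPOINT + "?format=json&query=" + urllib.parse.quote(query)
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    while True:
        try:
            return urllib.request.urlopen(req).read()
        except HTTPError as e:
            if e.code != 429:
                raise
            # Server asked us to slow down: wait as long as it says.
            time.sleep(int(e.headers.get("Retry-After", "60")))

# Example batch construction (no network traffic in this sketch):
batch = next(chunks(["Q42", "Q1339", "Q307"], 100))
print(values_query(batch))
```

In a real run, `run(values_query(batch))` would be called once per batch, possibly with a small fixed delay between batches in addition to the Retry-After handling.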
Re: [Wikidata] [Wikitech-l] GSoC 2018 Introduction: Prssanna Desai
Hi! > Greetings, > I'm Prssanna Desai, an undergraduate student from NMIMS University, Mumbai, > India and I've been selected for GSoC '18. > > *My Project:* *Improve Data Explorer for query.wikidata.org > <http://query.wikidata.org>* Welcome! Thanks for participating and helping to make the Query Service better! -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikiata and the LOD cloud
Hi! > you should read your own emails. In fact it is quite easy to join the > LOD cloud diagram. > > The most important step is to follow the instructions on the page: > http://lod-cloud.net under how to contribute and then add the metadata. I may not be reading it right or may be misunderstanding something, but I have tried a few times to locate up-to-date working instructions for doing this, and it always ended up going nowhere - the instructions turned out to be out of date, or the new process was not working yet, or something else. It would be very nice and very helpful if you could point out specifically where on that page the step-by-step instructions are that can be followed to resolve this issue. > Do you really think John McCrae added a line in the code that says "if > (dataset==wikidata) skip; " ? I don't think anybody thinks that. And I think most people there think it would be nice to have Wikidata added to LOD. It sounds like you know how to do it - could you please share more specific information about it? > You just need to add it like everybody else in LOD, DBpedia also created > its entry and updates it now and then. The same accounts for > http://lov.okfn.org Somebody from Wikidata needs to upload the Wikidata > properties as OWL. If nobody does it, it will not be in there. Could you share more information about lov.okfn.org? Going there produces a 502, and it's not mentioned anywhere on lod-cloud.net. Where is it documented, what exactly is the process, and what do you mean by "upload the Wikidata properties as OWL"? More detailed information would be hugely helpful. Thanks in advance, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Use Repology to update software package data
Hi! > i want to inform you that Repology has your "repository" of software > versions included and can list problems or outdated versions that way. What does this list actually include? Is this the list of software and versions present in Wikidata as items? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] CC-BY-SA
Hi! > No. There is no such thing as "category namespace" in Wikidata. There You are correct. I was talking about category namespace in Wikidata Query Service. It is documented here: https://www.mediawiki.org/wiki/Wikidata_query_service/Categories -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Election data
Hi! > Something I wish was available is the voting record, at least at a > country/state level. Knowing the politician's time in office is a great > start, but how that person voted is what really makes democracy work. I think Ballotpedia has this data. E.g.: https://ballotpedia.org/Marco_Rubio Not sure however if it's structured or available in API form. It also has state level politicians, e.g.: https://ballotpedia.org/Bill_Monning - but it seems it's even harder to parse there. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] CC-BY-SA
Hi! > Just checking so let's say for: > https://www.wikidata.org/wiki/Q2201 > Narrative location (NY City) is from English Wikipedia, then it's CC BY SA? No, it's CC0 since it's Wikidata data. Facts as such are not copyrightable, so the fact that the particular movie is set in NYC is not subject to any license. Specific arrangement (collection) of facts can be copyrighted and licensed though and this specific one is Wikidata, which is licensed under CC0. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] CC-BY-SA
Hi! > Are all data that can be fetched via SPARQL CC0-licensed? Data in the main namespace, without the use of federation - yes. Federated data - https://query.wikidata.org/copyright.html, federated endpoints can have different licenses, either CC0-like or CC-BY-SA like (I don't think we accepted any that have anything stricter than that) Category namespace - since it comes from Wikipedias, it's CC-BY-SA I assume. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] stats on WD edits and WDQS uptime
Hi! > Second, are there any stats on the uptime of the WDQS SPARQL endpoint? I am not entirely sure how you define "uptime" here? If you try to access query.wikidata.org, it'd be very close to 100%. That said, we had a couple of incidents where one or more servers failed, causing some queries to get stuck or be rejected, see https://wikitech.wikimedia.org/wiki/Incident_documentation/20171018-wdqs and https://wikitech.wikimedia.org/wiki/Incident_documentation/20171130-wdqs These do not take the whole service down, so I am not sure how they qualify uptime-wise. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata fulltext search prototype
Hi! > I guess its using an older index from a few weeks ago ? Doesn't seem to > have the latest properties that have landed, but that's ok if the ES > index isn't current yet and your just experimenting and getting feedback. Yes, exactly. The Wikidata index is big, and we cannot use the main index since we're experimenting on it, so we make a copy and use that. Of course, the copy gets out of date :) This one is a couple of weeks old. > > http://wikidata-wdsearch.wmflabs.org/w/index.php?search=partition=Special:Search=advanced=1=1=25rdek6vt4n1ekkk5ht0ew0vv > > Didn't see > https://www.wikidata.org/wiki/Property:P4653 Yes, too recent :) -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata fulltext search prototype
Hi! > Where can I learn about the internals of this jewel? (which search > engine, what metrics are used to rank items, and so on). Thanks for your kind words. You can track it here: https://phabricator.wikimedia.org/T125500 and associated tasks like this one: https://phabricator.wikimedia.org/T178851 which contain links to the patches. The search runs on the same ElasticSearch we use for search on other sites, but the prototype has specific code to deal with Wikidata specific data structure and the fact that it is, unlike most other Wikimedia sites, multilingual by design. The rankings are hand-tuned now and kind of hard to read right now (we're working on improving this), they are contained here: https://phabricator.wikimedia.org/diffusion/EWBA/browse/master/repo/config/ and specific functions we're using here: https://phabricator.wikimedia.org/diffusion/EWBA/browse/master/repo/config/ElasticSearchRescoreFunctions.php;4c6aa54e56c68ebd3543b23c88f52ae6f176a079$25 Basically it's a combination of match score (how well the string matches the query), incoming link count, sitelink count and special boosts like demoting the disambiguation pages. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
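The hand-tuned combination described above (match score, incoming link count, sitelink count, and a demotion for disambiguation pages) can be sketched roughly as follows. This is an illustrative Python sketch, not the actual PHP rescore code; the function name and all weights are invented for illustration - the real, tuned profiles live in the ElasticSearchRescoreFunctions.php file linked above.

```python
import math

def rescore(match_score, incoming_links, sitelinks, is_disambig):
    # Hypothetical weights; the real hand-tuned values live in the
    # Wikibase rescore profile configuration.
    score = match_score
    score *= 1.0 + 0.4 * math.log1p(incoming_links)  # incoming link count boost
    score *= 1.0 + 0.2 * math.log1p(sitelinks)       # sitelink count boost
    if is_disambig:
        score *= 0.1  # demote disambiguation pages
    return score
```

The log scaling is one common way to keep very popular items from completely drowning out good text matches; the real profiles may use different functions entirely.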
[Wikidata] Wikidata fulltext search prototype
Hi! The Search Platform team would like to present a prototype test site of the new and improved Wikidata fulltext search: http://wikidata-wdsearch.wmflabs.org/wiki/Special:Search Please try your favorite searches on it and report whether it looks good and which problems you notice. Important to note for this prototype: - The data in the search is imported from the Wikidata index but not updated from it after import, so it may be slightly out of date - The search is in English by default but you can try other languages by using the uselang parameter, e.g.: http://wikidata-wdsearch.wmflabs.org/w/index.php?search=Wien=Special:Search=advanced=1=1=de Note that since it's a test site, this is probably the best way to test non-English searches, as logins etc. may not work there properly. - Search will work properly only for the main & property namespaces (0 and 120). What kinds of problems are we looking for? - Ranking and retrieval problems, i.e. result X appears too low or too high in a specific search, or does not appear at all (please tell us the specific search query and expected result) - UI problems - i.e. the ranking is fine but the highlighting or label or description is broken or looks bad, or the result that should be highlighted is not highlighted Of course, if some search result worked spectacularly better for you, it would be nice to know too :) What should work? Any search in Special:Search in the main namespace and Property namespace should produce sensible results. Searches without advanced syntax should have better results than before, and searches with advanced syntax (+, -, *, quotes, etc.) should work no worse than before. Please note that this is a test wiki, so nothing but search is expected to work, including clicking on other links, editing, browsing to other pages, etc. It is also a test site, so short disruptions are possible when we update or change things or fix bugs reported by you :) How to provide feedback? 
Several ways are possible: - Reply to this list or personally to me if you prefer - On-wiki message on my talk page: https://www.wikidata.org/wiki/User_talk:Smalyshev_(WMF) - Talk to us on IRC: #wikimedia-discovery Thanks! -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata HDT dump
Hi! > Somebody pointed me to the following issue: > https://phabricator.wikimedia.org/T179681 Unfortunately I'm not able > to log in there with the "Phabricator" so I cannot edit the issue > directly. I'm sending this email instead. Thank you, I've updated the task with references to your comments. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] RDF: All vs Truthy
Hi! > Can somebody please explain (in simple terms) what's the difference > between "all" and "truthy" RDF dumps? I've read the explanation > available on the wiki [1] but I still don't get it. Technically "truthy" is the set of statements with best non-deprecated rank for the property. Semantically, it is the value you most likely expect as the answer to a simple question "what is X of Y", like "what is the population of London" or "who is the wife of the US president?" > If I'm just a user of the data, because I want to retrieve > information about a particular item and link items with other > graphs... what am I missing/leaving-out by using "truthy" instead of > "all"? Historical data - i.e. current population vs. all historic population figures, current spouse vs.all previous marriages, current head of state vs. list of all people occupying the office. Some other data, possibly, such as official name vs. alias (provided that is expressed as a property), commonly accepted value vs. alternative possibilities, etc. > A practical example would be appreciated since it will clarify > things, I suppose. Current (as in, latest/best available for now) population of London would be found as "truthy" value (wdt), all other population figures - e.g. historical figures - will be under "all" (p/ps/psv). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
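The "best non-deprecated rank" rule described above can be sketched in a few lines of Python. The `truthy` helper and the dict layout are hypothetical, for illustration only - this is not Wikibase code:

```python
def truthy(statements):
    # "Truthy" = statements with the best non-deprecated rank:
    # if any statement is preferred, only the preferred ones are truthy;
    # otherwise all normal-rank ones are. Deprecated is never truthy.
    non_deprecated = [s for s in statements if s["rank"] != "deprecated"]
    if any(s["rank"] == "preferred" for s in non_deprecated):
        return [s for s in non_deprecated if s["rank"] == "preferred"]
    return non_deprecated
```

So for the population of London, a preferred current figure would be the only truthy (wdt:) value, while historical normal-rank figures remain reachable only through p:/ps:.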
[Wikidata-tech] Tracking internal uses of Wikidata Query Service
Hi! We are seeing more use of the Wikidata Query Service by Wikimedia projects. Which is excellent news, but the somewhat worse news is that the maintainers of WDQS do not have a good idea of what these services are, what their needs are, and so on. So, we have decided we want to start tracking internal uses of Wikidata Query Service. To that point, if you run any functionality on Wikimedia sites (Wikipedias, Wikidata, etc., anything with a wikimedia domain) that uses queries to the Wikidata Query Service, please go to: https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage and add your project there. That applies both if your project runs queries by itself in the background, and if it uses queries as part of a user interaction scenario. We do not include labs tools currently unless it is absolutely vital infrastructure (i.e. if it went down, would it substantially degrade the main site functionality or make some features unusable?) If you still feel we should know about a certain labs tool, please leave a note on the talk page. What's in it for you? We want to know this in order to better understand the scope of internal usage and as preparation for T178492 (creating an internal WDQS setup) - with the goal of providing internal users a more robust and more flexible service. We also want it to ensure we do not break anything important when we do maintenance, and so that we know who to talk to if some queries do not work as expected and we want to fix them. What do we want to know? - We'd like to have a general description of the functionality (i.e., what the service is for) - How to recognize queries run by it - user agent? source host? specific query pattern? some other mark? Ideally it should be possible to recognize them somehow - What kind of queries does it run (no need to list every possible one of course, but if there are typical cases it'd help to see them)? - How often do the queries run - is it periodic, and what is the expected/statistical usage if it's a user-driven tool? 
- Where can we see the code behind it, and who maintains it? - Feel free to add any other information about anything you think would be useful for us to know. What was that page again? https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage Thanks in advance, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata] Do you use the Wikidata entity dump dcatap.rdf?
Hi! > How about adding the RDF to query.wikidata.org so we can get a current > list? We could probably load the rdf we have now into Blazegraph relatively easily. Updating may be a bit tricky (should we delete historical items?) but it's possible to figure it out. I'll look into it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata-tech] Wikidata fulltext search results output
Hi! > When showing labels from fallback languages we do have little language > indicators in other places. I believe we should have this here as Makes sense. I'll look into how to get those. Is the language code OK, or do we need the full language name (uk vs. Ukrainian)? One thing to note here is that secondary languages have no order - i.e. if you look in German, and there's no matching German label, but there are 10 other language labels all the same (happens a lot for names & places), then which language will be selected is anybody's guess. We could add a rule that says "look at English as secondary first", in theory, but I am not sure whether we should - after all, besides having the most labels (and us speaking it :), there's not much special about it. > I'm slightly leaning toward showing both. OK. > I'd say in this case we could get rid of the word/byte count. To get a > good glimpse of the quality of the item I'd say we'd want to show > count of statements (excluding identifier statements), identifiers and > sitelinks. OK, I'll try to make this happen. >> 5. Display format for Wikidata and for other wikipedia sites is different: >> Wikipedia: >> >> Title >> Snippet >> >> Wikidata: >> >> Title: Description >> >> I.e. Wikipedia puts title on a separate line, while Wikidata keeps it on >> the same line, separated by colon. Is there any reason for this >> difference? Do we want to go back to the common format? > > Not sure if we had a reason tbh. OK then, I'll feel free to shuffle things around :) Having more freedom in the title line is good because we can then display both label & aliases. Thanks! -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
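The fallback behavior described above - display language first, then its fallback chain, then an effectively arbitrary secondary language - could look roughly like this. The function name and data layout are invented for illustration; this is not the actual Wikibase language-fallback code:

```python
def pick_label(labels, display_lang, fallback_chain=()):
    # Try the display language, then its fallback chain, in order.
    for lang in (display_lang, *fallback_chain):
        if lang in labels:
            return lang, labels[lang]
    # No ordered fallback matched: any remaining language may win,
    # which is why "which language will be selected is anybody's guess".
    if labels:
        lang = next(iter(labels))
        return lang, labels[lang]
    return None, None
```

Returning the language together with the label is what would let the UI show the little language indicator discussed above.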
Re: [Wikidata] Wikidata HDT dump
Hi! > OK. I wonder though, if it would be possible to setup a regular HDT > dump alongside the already regular dumps. Looking at the dumps page, > https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a > new dump is generated once a week more or less. So if a HDT dump > could True, the dumps run weekly. "More or less" situation can arise only if one of the dumps fail (either due to a bug or some sort of external force majeure). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata HDT dump
Hi! > The first part of the Turtle data stream seems to contain syntax errors > for some of the XSD decimal literals. The first one appears on line 13,291: > > ```text/turtle > <http://www.wikidata.org/value/ec714e2ba0fd71ec7256d3f7f7606c35> > <http://wikiba.se/ontology-beta#geoPrecision> > "1.0E-6"^^<http://www.w3.org/2001/XMLSchema#decimal> . I've added https://phabricator.wikimedia.org/T179228 to handle this. geoPrecision is a float value, and assigning the decimal type to it is a mistake. I'll review other properties to see whether we have more of this. Thanks for bringing it to my attention! -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
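For reference, the reason `"1.0E-6"^^xsd:decimal` is a syntax error: the xsd:decimal lexical space allows no exponent notation, while xsd:double does. A rough sketch of the two lexical spaces (simplified from XML Schema Part 2; the regexes and names here are illustrative, not taken from any validator library):

```python
import re

# xsd:decimal - optional sign, digits, optional fraction; no exponent.
DECIMAL = re.compile(r'^[+-]?(\d+(\.\d*)?|\.\d+)$')
# xsd:double - additionally allows exponent notation and INF/NaN specials.
DOUBLE = re.compile(r'^(-?INF|NaN|[+-]?(\d+(\.\d*)?|\.\d+)([Ee][+-]?\d+)?)$')

def valid_decimal(lexical):
    return bool(DECIMAL.match(lexical))
```

So "1.0E-6" is fine as xsd:double but would have to be written as "0.000001" to be a valid xsd:decimal.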
Re: [Wikidata] Wikidata HDT dump
Hi! > The first part of the Turtle data stream seems to contain syntax errors > for some of the XSD decimal literals. The first one appears on line 13,291: > > ```text/turtle > <http://www.wikidata.org/value/ec714e2ba0fd71ec7256d3f7f7606c35> > <http://wikiba.se/ontology-beta#geoPrecision> > "1.0E-6"^^<http://www.w3.org/2001/XMLSchema#decimal> . > ``` Could you submit a phabricator task (phabricator.wikimedia.org) about this? If it's against the standard it certainly should not be encoded like that. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata HDT dump
Hi! > I will look into the size of the jnl file but should that not be > located where the blazegraph is running from the sparql endpoint or > is this a special flavour? Was also thinking of looking into a gitlab > runner which occasionally could generate a HDT file from the ttl dump > if our server can handle it but for this an md5 sum file would be > preferable or should a timestamp be sufficient? Publishing the jnl file for Blazegraph may not be as useful as one would think, because a jnl file is specific to a particular vocabulary and certain other settings - i.e., unless you run the same WDQS code (which customizes some of these) of the same version, you won't be able to use the same file. Of course, since the WDQS code is open source, it may be good enough, so in general publishing such a file may be possible. Currently, it's about 300G uncompressed. No idea how much compressed. Loading it takes a couple of days on a reasonably powerful machine, more on labs ones (I haven't tried to load the full dump on labs for a while, since labs VMs are too weak for that). In general, I'd say it takes about 100M per million triples. Less if the triples reuse URIs, probably more if they contain a ton of text data. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
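The rule of thumb above ("about 100M per million of triples") as a back-of-the-envelope helper; the function name and the MiB reading of "100M" are assumptions, and real journal sizes vary with URI reuse and the amount of literal text:

```python
def estimate_jnl_bytes(triples, bytes_per_million=100 * 1024**2):
    # ~100M of Blazegraph journal per million triples, per the
    # estimate quoted above. Purely a rough sizing aid.
    return triples / 1_000_000 * bytes_per_million
```

For example, a dataset on the order of 3 billion triples comes out around 300,000 MiB, roughly consistent with the ~300G uncompressed figure mentioned above.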
Re: [Wikidata] Wikidata prefix search is now Elastic
Hi! > Thanks a lot Stas for this present. > Could you please share any pointers on how to integrate it into other > tools? It's the same API as before, wbsearchentities. If you need additional profiles - i.e., different scoring/filtering, talk to me and/or file phab task and we can look into it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] Wikidata prefix search is now Elastic
Hi! Wikidata’s birthday is still a few days away, but since there are no deployments on Sundays we’ll get started with an early present ;-) The Wikidata and Search Platform teams are happy to announce that Wikidata prefix search (aka the wbsearchentities API, aka the thing you use when you type into that box on the top right or any time you edit an item or property and use the selector widget) is now using a new and improved ElasticSearch backend. You should not see any changes except for relevancy and ranking improvements. Specifically improved are: - better language support (matches along the fallback chain and can also match in any language, with a lower score) - flexibility - we can now use Elasticsearch rescore profiles which can be tuned to take advantage of any fields we index for both matching and boosting, including link counts, statement counts, label counts, (some) statement values, etc. More improvements are coming soon in this area, e.g. scoring disambig pages lower, scoring units higher in the proper context, etc. - optimization - we no longer need to store all search data in both DB tables and Elastic indexes; all the data needed for search and retrieval of the results is stored in the Elastic index and retrieved in a single query. - maintainability - since it is now part of the general Wikimedia search ecosystem, it can be maintained together with the rest of the search mechanisms, using the same infrastructure, monitoring, etc. Please tell us if you have any suggestions or comments, or experience any problems with it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata-tech] Wikidata fulltext search results output
Hi! > while you are at it, some things would be very useful to be search-able > (maybe some are already by now): > * "primary" (not references/qualifiers) years, for birth/death/flourit etc. > * "primary" string/monolingual values (title, taxon name, etc.) > * "primary" IDs, e.g. VIAF (might cause confusion with years, so maybe > only add numerical IDs if 5+ digits?) We have the code to index statements already, and we're already indexing P31 and P279. We could index more properties. We don't have syntax or any other way though to actually use those in search - yet, except for boosting (see https://gerrit.wikimedia.org/r/#/c/384632/). We're looking at which properties to add (nominations welcome, probably in the form of phab ticket?) - since adding them requires full reindex of wikidata (couple of days) we probably don't want to add them one by one but want to collect a set and then do it in one hit. We also do not have syntax for searching (as in match, instead of boost) by statement values, but it should not be hard - we just need to design proper syntax and implement it (syntaxes are now pluggable, so should not be too big of a problem). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
[Wikidata-tech] Wikidata fulltext search results output
Hi! As I am working on improving Wikidata fulltext search[1], I'd like to talk about the search results page. Right now the search results page for Wikidata is less than ideal; here are the issues I see with it: - No match highlighting - Meaningless data, like word count (anybody care to guess what it is counting? Anybody ever used it?) and byte count (more useful than word count, but not by much) - Obviously, search quality is not super high, but that should be improved with proper description indexing While working on improving the situation, I would like to solicit opinions on a set of questions about what the search results page should look like. Namely: 1. If the match is made on a label/description that does not match the current display language, we could opt for: a) Displaying the description that matched, highlighted. Optionally maybe display the language of the match (in the display language?) b) Displaying the description in the display language, un-highlighted. Which option is preferable? 2. What do we do if the match is on an alias? Do we display the matching alias, the original label, or both? The question above also applies if the match is on another language's alias. 3. It looks clear to me that word count is useless. Is byte count useful and does it need to be kept? 4. Do we want to display any other parameters of the entity? E.g. we have in the index: statement_count, sitelink_count, label_count, incoming_links, etc. Do we want to display any? 5. Display format for Wikidata and for other wikipedia sites is different: Wikipedia: Title Snippet Wikidata: Title: Description I.e. Wikipedia puts the title on a separate line, while Wikidata keeps it on the same line, separated by a colon. Is there any reason for this difference? Do we want to go back to the common format? Also if you have any other things/ideas/comments about what fulltext search output for Wikidata should look like, please tell me. 
I am sending this to wikidata-tech and discovery team list only for now, since it's still work in progress and half-baked, we could open this for wider discussion later if necessary. [1] https://phabricator.wikimedia.org/T178851 Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata] Turning Lists to Wikidata
Hi! > when you say "wikidata is not well suited for lists data", you refer > to wikibase or WDQS here? Wikidata is not good for storing list data, or any serial data. WDQS can produce all kinds of amazing lists via queries, but it's not a primary data storage. In general, it could store series data, but since it's based off Wikidata and feeds from it, that creates certain issues when the data is not very suitable for Wikidata. > the data:Bea.gov/GDP by state.tab above is certainly a good > representation for efficient delivery (via json) and display of data. > but inefficient for further data sharing without URIs. The question of querying data like "GDP by state.tab" is an interesting one. I'm not sure whether a triple store would be a good medium, but maybe it could be... Needs some research on the idea. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata-tech] query with incomplete result set
Hi! On 10/3/17 4:49 PM, Marco Neumann wrote: > thank you Lucas and Stas, this works for me. > > so it would be fair to say that p:P39 by-passes the semantics of > wdt:P39 with ranking*. for my own understanding why is a wdt property > called a direct property**? Because wdt: links directly to the value, while p: links to a statement (and ps: links from the statement to the value). But that's not the only property of wdt: - another is that it links to the "truthy" (current, best, etc.) value - the one that has the best rank for this property (hence the "t" letter). This may be what you want or not, depending on general semantics and your particular case. For many properties, ranks do not play a significant role, since these properties do not change with time and do not have temporally limited statements. For those, using wdt: is always OK. For some, like positions, offices, relationships between humans, etc., the values can have temporal limits, and if you want the best/current one, you use wdt:, otherwise you use p:/ps:. If you still want to account for rank while using p:/ps:, there are rank triples and the wikibase:BestRank class (see https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Statement_representation). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata-tech] query with incomplete result set
Hi! On 10/3/17 4:02 PM, Marco Neumann wrote: > why doesn't the following query produce > http://www.wikidata.org/entity/Q17905 in the result set? The query asks for wdt:P39 wd:Q1939555, however current preferred value for P39 there is Q29576752. When the item has preferred value, only this value shows up in wdt. If you want all values, use something like: https://query.wikidata.org/#SELECT%20%3FMdB%20%3FMdBLabel%20WHERE%20%7B%0ASERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%3FMdB%20p%3AP39%20%3FxRefNode.%20%0A%3FxRefNode%20pq%3AP2937%20wd%3AQ30579723%3B%0A%20%20%20ps%3AP39%20wd%3AQ1939555.%0A%7D Or change "preferred" status on Q17905:P39. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata] Do you use the Wikidata entity dump dcatap.rdf?
Hi! > is anyone using the Wikidata entity dump dcatap.rdf at > https://dumps.wikimedia.org/wikidatawiki/entities/dcatap.rdf? > > It is very rarely used and is thus causing us a (probably) undue > maintenance burden, because of which we plan to remove it. What's the issue with it? I don't use it, but it seems to be part of a standard for dataset descriptions, so I wonder if the issues can be fixed. I don't know too much about it, but from the description it seems to be very automatable. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] Categories in RDF/WDQS
Hi! I'd like to announce that the category tree of certain wikis is now available as RDF dump and in Wikidata Query Service. More documentation is at: https://www.mediawiki.org/wiki/Wikidata_query_service/Categories which I will summarize shortly below. The dumps are located at https://dumps.wikimedia.org/other/categoriesrdf/. You can use these dumps any way you wish, data format is described at the link above[1]. The same dump is loaded into "categories" namespace in WDQS, which can be queried by https://query.wikidata.org/bigdata/namespace/categories/sparql?query=SPARQL. Sorry, no GUI support yet (probably will happen later). See example in the docs[2]. These datasets are not updated automatically yet, so they'll be up to date roughly for the date of the latest dump. Hopefully soon it will be automated and then the datasets will be updated daily. The list of currently supported wikis is here: https://noc.wikimedia.org/conf/categories-rdf.dblist - these are basically all 1M+ wikis and couple more that I added for various reasons. If you have a good candidate wiki to add, please tell me or write on the talk page for the document above. Please note this is only the first step for the project, so there might still be some rough edges. I am announcing it early since I think it would be useful for people to look at the dumps and SPARQL endpoint and see if something is missing or does not work properly, and share ideas on how it can be used. We plan eventually to use it for search improvement[3] - this work is still in progress. As always, we welcome any comments and suggestions. [1] https://www.mediawiki.org/wiki/Wikidata_query_service/Categories#Data_format [2] https://www.mediawiki.org/wiki/Wikidata_query_service/Categories#Accessing_the_data [3] https://phabricator.wikimedia.org/T165982 Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
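Since there is no GUI support yet, querying the categories namespace means building the request URL by hand, as in the `sparql?query=SPARQL` form mentioned above. A minimal sketch, assuming standard SPARQL-over-GET; the helper name is invented, and the actual network call is left out to keep the sketch offline:

```python
from urllib.parse import urlencode

CATEGORY_ENDPOINT = "https://query.wikidata.org/bigdata/namespace/categories/sparql"

def category_query_url(sparql):
    # SPARQL protocol GET request: the query text goes URL-encoded
    # in the "query" parameter.
    return CATEGORY_ENDPOINT + "?" + urlencode({"query": sparql})
```

The resulting URL can then be fetched with urllib.request or any HTTP client.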
[Wikidata] CHANGE: mwapi: prefix in WDQS changes underlying URI
Hi! In order to fix a compatibility issue (https://phabricator.wikimedia.org/T174930) and make the URI cleaner, the URI underlying the mwapi: prefix in WDQS queries will be changed to https://www.mediawiki.org/ontology#API/. This URI is not used anywhere in the data, only to designate parameters for MWAPI services[1]. Since this is not a change in data but only in service prefixes/URIs, there should not be any impact except for queries that use the full URI instead of the mwapi: prefix for calling the services - such queries will have to be updated. There's no real reason to do that, and I recommend always following the examples in the manual[1] instead, but out of an abundance of caution I am announcing this change anyway. The change will likely be deployed next Monday, in the usual WDQS deployment window[2]. [1] https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual/MWAPI [2] https://wikitech.wikimedia.org/wiki/Deployments -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Coordinate precision in Wikidata, RDF & query service
Hi! > The reason why we save the actual value with more digits than the > precision (and why we keep the precision as an explicit value at all) is > because the value could be entered and displayed either as decimal > digits or in minutes and seconds. So internally one would save 20' as > 0.3, but the precision is still just 2. This allows to roundtrip. > > I hope that makes any sense? Yes, for primary data storage (though roundtripping via limited-precision doubles is not ideal, but I guess good enough for now). But for secondary data/query interface, I am not sure 0.3 is that useful. What would one do with it, especially in SPARQL? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Coordinate precision in Wikidata, RDF & query service
Hi! > I am not sure I understand the issue and what the suggestion is to solve > it. If we decide to arbitrarily reduce the possible range for the Well, there are actually several issues right now. 1. Our RDF output produces coordinates with more digits than the specified precision of the actual value. 2. Our precision values as specified in wikibase:geoPrecision seem to make little sense. 3. We may represent the same coordinates for objects located in the same place as different ones, because the precision values are kind of chaotic. 4. We may have data different from other databases because our coordinate is over-precise. (1) is probably the easiest to fix. (2) is a bit harder, and I am still not sure how wikibase:geoPrecision is used, if at all. (3) and (4) are less important, but it would be nice to improve them, and maybe they will be mostly fixed once (1) and (2) are fixed. But before approaching the fix, I wanted to understand what the expectations for precision are and whether there can or should be some limits. Technically, it doesn't matter too much - except that some formulae for distances do not work well for high precisions because of the limited accuracy of 64-bit doubles, but there are ways around that. So technically we can keep 9 digits or however many we need, if we wanted to. I just wanted to see if we should. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
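One possible fix for issue (1) above - not emitting more digits than the precision justifies - is to snap each coordinate component to the stated precision before output. A hypothetical sketch, not the actual Wikibase RDF code; the function name is invented:

```python
def apply_precision(value, precision):
    # Snap a coordinate component to its stated precision, so e.g.
    # precision 0.01 yields at most two meaningful decimal degrees.
    # A falsy/zero precision leaves the value untouched.
    if not precision:
        return value
    return round(value / precision) * precision
```

This also helps with issues (3) and (4): two over-precise coordinates for the same place snap to the same output value once both are rounded to their precision.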