Re: [Wikidata] dcatap namespace in WDQS
Hi!

> As part of our Wikidata Query Service setup, we maintain the namespace
> serving DCAT-AP (DCAT Application Profile) data[1]. (If you don't know
> what I'm talking about you can safely ignore the rest of the message).

Following up on this discussion and the feedback received, I have decided to move the dcatap namespace to a separate endpoint - https://dcatap.wmflabs.org/. I've updated the manual to reflect this[1]. The old setup is still working, but we will be disabling updates, and eventually disable the namespace itself. So while it can still be used for now, if you plan to use it (logs suggest there's virtually no usage now, but that can change of course), please use the endpoint above.

[1] https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#DCAT-AP

-- Stas Malyshev smalys...@wikimedia.org

___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
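For anyone updating their tools, a minimal sketch of how a client might query the new endpoint. This assumes it exposes a standard SPARQL HTTP interface accepting a GET `query` parameter; the `/sparql` path and the `format=json` parameter are assumptions, not confirmed by the announcement - check the linked manual for the exact interface.

```python
# Minimal sketch of querying a SPARQL endpoint over HTTP.
# ASSUMPTION: the endpoint accepts GET requests with a "query"
# parameter and returns SPARQL JSON results; the "/sparql" path
# below is a guess, not confirmed by the announcement.
from urllib.parse import urlencode

DCATAP_ENDPOINT = "https://dcatap.wmflabs.org/sparql"  # assumed path

def build_sparql_request(endpoint, query):
    """Return the full GET URL for a SPARQL query with JSON results."""
    params = urlencode({"query": query, "format": "json"})
    return f"{endpoint}?{params}"

url = build_sparql_request(
    DCATAP_ENDPOINT,
    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5",
)
# To actually run it (network access required), something like:
#   import urllib.request
#   req = urllib.request.Request(
#       url, headers={"User-Agent": "my-tool/1.0 (me@example.org)"})
#   with urllib.request.urlopen(req) as r:
#       print(r.read())
```

Note the User-Agent header in the commented-out request - per the policy discussed elsewhere in this thread, clients should always identify themselves.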
[Wikidata] dcatap namespace in WDQS
Hi! As part of our Wikidata Query Service setup, we maintain the namespace serving DCAT-AP (DCAT Application Profile) data[1]. (If you don't know what I'm talking about you can safely ignore the rest of the message.)

A recent check showed that this namespace is virtually unused - over the last two months, only 3 queries per month were served from that namespace, and all of them came from WMF servers (not sure whether it's a tool or somebody querying manually; I did not dig further). So I wonder if it makes sense to continue maintaining this namespace. While it does not require very significant effort - it's mostly automated - it does need occasional attention when maintenance is performed, and some scripts and configurations become slightly more complex because of it. No big deal if somebody is using it, that's what the service is for, but if it is completely unused, there is no point in spending even minimal effort on it, at least on the main production servers (of course, it'd be possible to set up a simple SPARQL server in labs with the same data).

In any case, the dcatap RDF data will remain available at https://dumps.wikimedia.org/wikidatawiki/entities/dcatap.rdf - no change is planned there - but if the namespace is phased out, the data will no longer be queryable using WDQS. One could still download it and, since it's a very small dataset, use any tool that can read RDF to parse it and work with it.

I'd like to hear from anybody interested in this whether they are using this namespace or plan to use it, and what for. Please either answer here or, even better, in the task[2] on Phabricator.

[1] https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#DCAT-AP
[2] https://phabricator.wikimedia.org/T228297

-- Stas Malyshev smalys...@wikimedia.org
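As the message says, the dataset is small enough to process locally with any RDF-capable tool. As a crude standard-library illustration, RDF/XML can be walked as plain XML; the sample document below is made up for the demo, not taken from the real dump, and a real tool would more likely use a proper RDF library such as rdflib, which understands the RDF data model rather than just the XML syntax.

```python
# Crude sketch: treat the DCAT-AP RDF/XML dump as plain XML and pull
# out dataset titles. SAMPLE is a hypothetical stand-in for the real
# dcatap.rdf content.
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#"
         xmlns:dct="http://purl.org/dc/terms/">
  <dcat:Dataset rdf:about="https://example.org/wikidata-dump">
    <dct:title>Wikidata entity dump</dct:title>
  </dcat:Dataset>
</rdf:RDF>"""

root = ET.fromstring(SAMPLE)
DCAT = "{http://www.w3.org/ns/dcat#}"
DCT = "{http://purl.org/dc/terms/}"

datasets = root.findall(f"{DCAT}Dataset")
titles = [d.findtext(f"{DCT}title") for d in datasets]
print(titles)  # → ['Wikidata entity dump']
```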
Re: [Wikidata] Wikidata Query Service User-Agent requirements for script users
Hi!

> Forgive my ignorance. I don't know much about infrastructure of WDQS and
> how it works. I just want to mention how application servers do it. In
> appservers, there are dedicated nodes both for apache and the replica
> database. So if a bot overdo things in Wikipedia (which happens quite a
> lot), users won't feel anything but the other bots take the hit. Routing
> based on UA seems hard though while it's easy in mediawiki (if you hit
> api.php, we assume it's a bot).

We have two clusters - public and internal, with the latter serving only Wikimedia tasks and thus isolated from outside traffic. However, we do not have a practical way right now to separate bot and non-bot traffic, and I don't think we currently have resources for another cluster.

> Routing based on UA seems hard though while it's easy in mediawiki

I don't think our current LB setup can route based on user agent. There could be a gateway that does that, but given that we don't have resources for another cluster, it's not too useful to spend time on developing something like that for now.

Even if we did separate browser and bot traffic, we'd still have the same problem on the bot cluster - most bots are benign and low-traffic, and we want to do our best to enable them to function smoothly. But for this to work, we need ways to weed out outliers that consume too many resources. In a way, the bucketing policy is a sort of version of what you described - if you use proper identification, you are judged on your own traffic. If you use generic identification, you are bucketed with other generic agents, and thus may be denied if that bucket is full. This is not the best final solution, but experience so far shows it has reduced the incidence of problems. Further ideas on how to improve it are of course welcome.

-- Stas Malyshev smalys...@wikimedia.org
[Wikidata] Wikidata Query Service User-Agent requirements for script users
Hello all! Here is (at last!) an update on what we are doing to protect the stability of the Wikidata Query Service.

For 4 years we have been offering Wikidata users the Query Service, a powerful tool that allows anyone to query the content of Wikidata, without any identification needed. This means that anyone can use the service from a script and make heavy or very frequent requests. However, this freedom has led to the service being overloaded by too large a volume of queries, causing the issues or lag that you may have noticed.

A reminder about the context: we have had a number of incidents where the public WDQS endpoint was overloaded by bot traffic. We don't think that any of that activity was intentionally malicious, but rather that the bot authors most probably don't understand the cost of their queries and the impact they have on our infrastructure. We've recently seen more distributed bots, coming from multiple IPs at cloud providers. This kind of pattern makes it harder and harder to filter or throttle an individual bot. The impact has ranged from increased update lag to full service interruption.

What we have been doing: while we would love to allow anyone to run any query they want at any time, we're not able to sustain that load, and we need to be more aggressive in how we throttle clients. We want to be fair to our users and allow everyone to use the service productively. We also want the service to be available to the casual user and provide up-to-date access to the live Wikidata data. And while we would love to throttle only abusive bots, to be able to do that we need to be able to identify them. We have two main means of identifying bots: 1) their user agent and IP address, and 2) the pattern of their queries. Identifying patterns in queries is done manually, by a person inspecting the logs. It takes time and can only be done after the fact. We can only start our identification process once the service is already overloaded. This is not going to scale.
IP addresses are starting to be problematic. We see bots running on cloud providers, spreading their workloads across multiple instances with multiple IP addresses.

We are left with user agents. But here we have a problem again. To block only abusive bots, we would need those bots to use a clearly identifiable user agent, so that we can throttle or block them and contact the author to work together on a solution. It is unlikely that an intentionally abusive bot will voluntarily provide a way to be blocked. So we need to be more aggressive about bots which are using a generic user agent. We are not blocking those, but we are limiting the number of requests coming from generic user agents. This is a large bucket, with a lot of bots in this same category of "generic user agent". Sadly, this is also the bucket that contains many small bots that generate only a very reasonable load. And so we are also impacting the bots that play fair.

At the moment, if your bot is affected by our restrictions, configure a custom user agent that identifies you; this should be sufficient to give you enough bandwidth. If you are still running into issues, please contact us; we'll find a solution together.

What's coming next: first, it is unlikely that we will be able to remove the current restrictions in the short term. We're sorry for that, but the alternative - the service being unresponsive or severely lagged for everyone - is worse. We are exploring a number of alternatives: adding authentication to the service, and allowing higher quotas to bots that authenticate; creating an asynchronous queue, which could allow running more expensive queries, but with longer deadlines. And we are in the process of hiring another engineer to work on these ideas.

Thanks for your patience!

WDQS Team
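Configuring a custom user agent, as requested above, can be as simple as setting one header. A sketch follows; the tool name, version, and contact address are placeholders to replace with your own, and the `/sparql` request shown in the comment follows the usual WDQS conventions.

```python
# Sketch of a policy-compliant User-Agent for WDQS requests: a tool
# name and version plus a way to contact the operator. All names
# below are placeholders.
WDQS = "https://query.wikidata.org/sparql"

def wdqs_headers(tool, version, contact):
    """Build headers identifying the client, so operators can reach you
    instead of throttling the generic-agent bucket you'd otherwise share."""
    return {
        "User-Agent": f"{tool}/{version} ({contact}) python-urllib",
        "Accept": "application/sparql-results+json",
    }

headers = wdqs_headers(
    "my-research-bot", "0.1", "https://example.org/bot; bot@example.org")
print(headers["User-Agent"])
# A real request would then look like:
#   from urllib.parse import urlencode
#   import urllib.request
#   req = urllib.request.Request(
#       WDQS + "?" + urlencode({"query": "SELECT ?x WHERE { ?x ?p ?o } LIMIT 1"}),
#       headers=headers)
#   urllib.request.urlopen(req)
```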
Re: [Wikidata] Significant change of Wikidata dump size
Hi! On 6/25/19 11:17 PM, Ariel Glenn WMF wrote:

> I think the issue is with the 0624 json dumps, which do seem a lot
> smaller than previous weeks' runs.

Ah, true, I didn't realize that. I think this may be because of that dumpJson.php issue, which is now fixed. Maybe rerun the dump?

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Significant change of Wikidata dump size
Hi!

> Which script, please, and which dump? (The conversation was not
> forwarded so I don't have the context.)

Sorry, the original complaint was:

> I apologize if I missed something, but why the current JSON dump size is ~25GB while a week ago it was ~58GB? (see https://dumps.wikimedia.org/wikidatawiki/entities/20190617/)

But looking at it now, I see wikidata-20190617-all.json.gz is comparable with last week's, so it looks like it's fine now?

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Significant change of Wikidata dump size
Hi!

> Follow-up: according to my processing script, this dump contains
> only 30280591 entries, while the main page is still advertising 57M+
> data items.
> Isn't it a bug in the dump process?

There was a problem with the dump script (since fixed), so the dump may indeed be broken. CCing Ariel to take a look. It probably needs to be re-run, or we can just wait for the next one.

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Result format change for WDQS JSON query output
Hi!

> from 2014, so I will research which form is more correct. But for now I
> would recommend to update the tools to recognize that these literals now
> may have type. If I discover that the standards or accepted practices
> recommend otherwise, I'll update further. You can also watch
> https://phabricator.wikimedia.org/T225996 for final resolution of this.

I surveyed the existing practices of SPARQL endpoints and tools, and it looks like the accepted practice is to omit the datatypes for such literals even within the context of RDF 1.1. Example: https://issues.apache.org/jira/browse/JENA-1077

I will adjust the code in Blazegraph accordingly, so WDQS will comply with this practice (i.e. the result format will be as it was before). This will be implemented in the coming days. Sorry again for the disruption.

-- Stas Malyshev smalys...@wikimedia.org
[Wikidata] Result format change for WDQS JSON query output
Hi! Due to an upgrade to a more current version of the Sesame toolkit, the format of the JSON output of the Wikidata Query Service has changed slightly[1]. The change is that plain literals (ones that do not have an explicit data type, like "string" or "string"@de) now have a "datatype" field. Language literals will have the type http://www.w3.org/1999/02/22-rdf-syntax-ns#langString and non-language ones http://www.w3.org/2001/XMLSchema#string. This is in accordance with the RDF 1.1 standard [2], where all literals have a data type (even though for these types it is implicit).

I apologize for not noting this in advance - though I knew this change in the standard had happened, I did not foresee that it would also carry over to the JSON output format. I am not sure yet which output form is actually correct, since the standards seem to be conflicting, maybe due to the fact that the JSON results standard hasn't been updated since 2013 while RDF 1.1 is from 2014, so I will research which form is more correct. But for now I would recommend updating your tools to recognize that these literals may now have a type. If I discover that the standards or accepted practices recommend otherwise, I'll update further. You can also watch https://phabricator.wikimedia.org/T225996 for the final resolution of this.

[1] https://phabricator.wikimedia.org/T225996
[2] https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal

-- Stas Malyshev smalys...@wikimedia.org
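One way for a tool to cope with the change described above is to treat the "datatype" field as optional, so the same code works with both the old and the new output forms. A sketch (the helper name is mine, not part of any WDQS client library):

```python
# Defensive handling of the WDQS JSON results change: plain and
# language-tagged literals may now carry an explicit "datatype" field.
# Treating it as optional keeps a tool working with either output form.
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"

def literal_value(binding):
    """Extract the value of a literal binding, ignoring the implicit
    datatypes that RDF 1.1 assigns to plain and language literals."""
    if binding.get("type") != "literal":
        raise ValueError("not a literal binding")
    dt = binding.get("datatype")
    if dt in (None, XSD_STRING, RDF_LANGSTRING):
        return binding["value"]       # plain or language-tagged string
    return binding["value"], dt       # typed literal: keep the datatype

# Both output forms yield the same value:
old_form = {"type": "literal", "value": "string"}
new_form = {"type": "literal", "value": "string", "datatype": XSD_STRING}
assert literal_value(old_form) == literal_value(new_form) == "string"
```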
[Wikidata] Planned filename change for Wikidata RDF entity dumps
Hi! As outlined in https://phabricator.wikimedia.org/T226153, we are planning to change the filename scheme for the Wikidata RDF entity dumps by removing the "-BETA" suffix from the filename. The Wikidata RDF ontology is not beta anymore and the dumps have been working stably for a while now, so it's time to drop the beta mark from the name.

It may take a week or two for the change to propagate and be applied to the dumps, but if your tools depend on the exact naming, please prepare them for the eventual change in the name. Note that links like https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.gz will still point to the right files, and if all you care about is downloading the latest dump, using these links is always recommended. We will send another message once the change has been implemented and deployed.

Thanks,

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Overload of query.wikidata.org (Guillaume Lederrey)
Hi! On 6/18/19 2:29 PM, Tim Finin wrote:

> I've been using wdtaxonomy
> <https://wdtaxonomy.readthedocs.io/en/latest/> happily for many months
> on my macbook. Starting yesterday, every call I make (e.g., "wdtaxonomy
> -c Q5") produces an immediate "SPARQL request failed" message.

Could you provide more details - which query is sent and what is the full response (including the HTTP code)?

> Might these requests be blocked now because of the new WDQS policies?

One thing I can think of is that this tool does not send a proper User-Agent header. According to https://meta.wikimedia.org/wiki/User-Agent_policy, all clients should identify themselves with a valid user agent. We've started enforcing this recently, so maybe this tool has that issue. If not, please provide the data above.

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

> The documented limits about FDB states that it to support up to 100TB of
> data
> <https://apple.github.io/foundationdb/known-limitations.html#database-size>.
> That is 100x times more
> than what WDQS needs at the moment.

"Support" is such a multi-faceted word. It can mean "it works very well with such an amount of data and is faster than the alternatives", or "it is guaranteed not to break up to this number but breaks after it", or "it would work, given massive amounts of memory, super-fast hardware, and a very specific set of queries, but you'd really have to take an effort to make it work", and everything in between. The devil is always in the details, which this seemingly simple word "supports" is rife with.

> I am offering my full-time services, it is up to you decide what will
> happen.

I wish you luck with the grant, though I personally think that expecting to have, in 6 months, a production-ready service that can replace WDQS is a bit too optimistic. I might be completely wrong on this of course. If you just plan to load the Wikidata data set and evaluate the queries to ensure they are fast and produce proper results on the setup you propose, then it can be done. Good luck!

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Overload of query.wikidata.org
Hi!

> We are currently dealing with a bot overloading the Wikidata Query
> Service. This bot does not look actively malicious, but does create
> enough load to disrupt the service. As a stop gap measure, we had to
> deny access to all bots using python-request user agent.
>
> As a reminder, any bot should use a user agent that allows to identify
> it [1]. If you have trouble accessing WDQS, please check that you are
> following those guidelines.

To add to this, we have had this trouble because two events that WDQS currently does not deal well with have coincided:

1. An edit bot that edited at 200+ edits per minute. This is too much; over 60/m is really almost always too much. Also, if your bot makes multiple changes (e.g. adds multiple statements), it would be good to consider doing them in one call instead of several, since WDQS currently performs an update on each change separately, and this may be expensive. We're looking into various improvements to this, but that is the current state.

2. Several bots have been flooding the query endpoint with requests. Recently there has been a growth in bots that a) completely ignore both the regular limits and the throttling hints, b) do not have a proper identifying user agent, and c) use distributed hosts, so our throttling system has trouble dealing with them automatically. We intend to crack down more and more on such clients, because they look a lot like a DDOS and ruin the service experience for everyone. I will probably write down more detailed rules a bit later, but for now see https://www.mediawiki.org/wiki/Wikidata_Query_Service/Implementation#Usage_constraints - and additionally, having a distinct User-Agent if you're running a bot is a good idea.
And for people who think it's a good idea to launch a max-requests-I-can-stuff-into-the-pipe bot, put it on several Amazon machines so that throttling has a hard time detecting it, and then, once throttling does detect it, neglect to check for a week that all the bot is doing is fetching 403s from the service and wasting everybody's time - please think again. If you want to do something non-trivial querying WDQS and the limits get in the way, please talk to us (and if you know somebody who isn't reading this list but is considering writing a bot interfacing with WDQS - please educate them and refer them to us for help; we really prefer to help rather than to ban). Otherwise, we'd be forced to put more limitations in place that will affect everyone.

-- Stas Malyshev smalys...@wikimedia.org
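A bot that wants to honor the throttling hints mentioned above could do something like the sketch below. It assumes the service signals throttling with an HTTP 429/503 status and a Retry-After header carrying a delay in seconds, which is the common HTTP convention; check the WDQS documentation for the exact signals it emits, and note that the HTTP-date form of Retry-After is not handled here.

```python
# Sketch of honoring server throttling hints: compute how long to wait
# before the next request. ASSUMES Retry-After carries seconds (the
# HTTP-date variant is not handled in this toy version).
def retry_delay(status, headers, default=60.0):
    """Seconds to wait before the next request, honoring Retry-After."""
    if status not in (429, 503):
        return 0.0
    value = headers.get("Retry-After")
    try:
        return float(value)
    except (TypeError, ValueError):
        return default  # header missing or not numeric

assert retry_delay(200, {}) == 0.0
assert retry_delay(429, {"Retry-After": "5"}) == 5.0
assert retry_delay(429, {}) == 60.0
```

Note that a 403, as described in the message above, is not a "retry later" signal - it means the client has been denied and should stop and contact the operators, not keep hammering the endpoint.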
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

> Data living in an RDBMS engine distinct from Virtuoso is handled via the
> engines Virtual Database module i.e., you can build powerful RDF Views
> over ODBC- or JDBC- accessible data using Virtuoso. These view also have
> the option of being materialized etc..

Yes, but the way the data is stored now is as a JSON blob within a text field in MySQL. I do not see how an RDF View over ODBC would help here - of course Virtuoso would be able to fetch the JSON text for a single item, but then what? We'd need to run queries across millions of items, and fetching and parsing JSON for every one of them every time is infeasible. Not to mention this JSON is not an accurate representation of the RDF data model. So I don't think it is worth spending time in this direction... I just don't see how any query engine could work with that storage.

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

> It handles data locality across a shared nothing cluster just fine i.e.,
> you can interact with any node in a Virtuoso cluster and experience
> identical behavior (everyone node looks like single node in the eyes of
> the operator).

Does this mean no sharding, i.e. each server stores the full DB? This is the model we're using currently, but given the growth of the data it may not be sustainable on current hardware. I see in your tables that Uniprot has about 30B triples, but I wonder what the update loads there look like. Our main issue is that the hardware we have now is showing its limits when there are a lot of updates in parallel with significant query load. So I wonder if the "single server holds everything" model is sustainable in the long term.

> There are live instances of Virtuoso that demonstrate its capabilities.
> If you want to explore shared-nothing cluster capabilities then our live
> LOD Cloud cache is the place to start [1][2][3]. If you want to see the
> single-server open source edition that you have DBpedia, DBpedia-Live,
> Uniprot and many other nodes in the LOD Cloud to choose from. All of
> these instance are highly connected.

Again, the question is not so much "can you load 7bn triples into Virtuoso" - we know we can. What we want to figure out is whether, given the specific query/update patterns we have now, it is going to give us significantly better performance, allowing us to support our projected growth. And also, possibly, whether Virtuoso has ways to make our update workflow more optimal - e.g. right now, if one triple changes in a Wikidata item, we are essentially downloading and updating the whole item (not exactly, since triples that stay the same are preserved, but it requires a lot of data transfer to express that in SPARQL). Would there be ways to update things more efficiently?
> Virtuoso handles both shared-nothing clusters and replication i.e., you
> can have a cluster configuration used in conjunction with a replication
> topology if your solution requires that.

Replication could certainly be useful, I think, if it's faster to update a single server and then replicate than to simultaneously update all servers (which is what is happening now).

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

> Unlike, most sites we do have our own custom frontend in front of
> virtuoso. We did this to allow more styling, as well as being flexible
> and change implementations at our whim. e.g. we double parse the SPARQL
> queries and even rewrite some to be friendlier. I suggest you do the
> same no matter which DB you use in the end, and we would be willing to
> open source ours (it is in Java, and uses RDF4J and some ugly JSPX but
> it works, if not to use at least as an inspiration). We did this to
> avoid being locked into endpoint specific features.

It would be interesting to know more about this, if it is open source. Is there any more information about it online?

> Pragmatically, while WDS is a Graph database, the queries are actually
> very relational. And none of the standard graph algorithms are used. To

If you mean algorithms like A* or PageRank, then yes, they are not used much (likely also because SPARQL has no standard support for any of these), though Blazegraph implements some of them as custom services.

> be honest RDF is actually a relational system which means that
> relational techniques are very good at answering them. The sole issue is
> recursive queries (e.g. rdfs:subClassOf+) in which the virtuoso
> implementation is adequate but not great.

Yes, path queries are pretty popular on WDQS too, especially given that many relationships, like administrative/territorial placement or ownership, are hierarchical and transitive, which often requires path queries.

> This is why recovering physical schemata from RDF data is such a
> powerful optimization technique [1]. i.e. you tend to do joins not
> traversals. This is not always true but I strongly suspect it will hold
> for the vast majority of the Wikidata Query Service case.

It would be interesting to see if we can apply anything from the article. Thanks for the link!
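What a recursive path query such as `?x wdt:P279+ ?class` (e.g. `SELECT ?class WHERE { wd:Q515 wdt:P279+ ?class }` for the ancestors of "city") computes is a transitive closure over the subclass-of edges. A toy illustration of that semantics on a made-up hierarchy (the real query, of course, runs server-side on WDQS):

```python
# Toy transitive closure, illustrating what a SPARQL property path
# like wdt:P279+ computes. The edge set below is hypothetical.
from collections import deque

EDGES = {  # child -> parents (made-up subclass-of relation)
    "city": ["human settlement"],
    "human settlement": ["geographic location"],
    "geographic location": ["entity"],
}

def ancestors(node, edges):
    """All nodes reachable by following one or more edges (i.e. P279+)."""
    seen, queue = set(), deque(edges.get(node, []))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.add(parent)
            queue.extend(edges.get(parent, []))
    return seen

print(sorted(ancestors("city", EDGES)))
# → ['entity', 'geographic location', 'human settlement']
```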
-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

>> So there needs to be some smarter solution, one that we'd unlike to
> develop inhouse
>
> Big cat, small fish. As wikidata continue to grow, it will have specific
> needs.
> Needs that are unlikely to be solved by off-the-shelf solutions.

Here I think it's a good place to remind everyone that we're not Google, and developing a new database engine inhouse is probably a bit beyond our resources and budgets. Fitting an existing solution to our goals - sure, but developing something new on that scale is probably not going to happen.

> FoundationDB and WiredTiger are respectively used at Apple (among other
> companies)
> and MongoDB since 3.2 all over-the-world. WiredTiger is also used at Amazon.

I believe they are, but I think for our particular goals we have to limit ourselves to solutions that are a proven good match for our case.

>> We also have a plan on improving the throughput of Blazegraph, which
> we're working on now.
>
> What is the phabricator ticket? Please.

You can see the WDQS task board here: https://phabricator.wikimedia.org/tag/wikidata-query-service/

> That will be vendor lock-in for wikidata and wikimedia along all the
> poor souls that try to interop with it.

Since Virtuoso uses standard SPARQL, it won't be too much of a vendor lock-in, though of course the standard does not cover everything, so some corners are different in all SPARQL engines. This is why even migration between SPARQL engines, even excluding operational aspects, is non-trivial. Of course, migration to any non-SPARQL engine would be an order of magnitude more disruptive, so right now we do not seriously consider doing that.

> It has two backends: MMAP and rocksdb.

Sure, but I was talking about the data model - ArangoDB sees the data as a set of documents. The RDF approach is a bit different.

> ArangoDB is a multi-model database, it support:

As I already mentioned, there's a difference between "you can do it" and "you can do it efficiently".
Graphs are simple creatures and can be modeled on many backends - KV, document, relational, column store, whatever you have. The tricky part starts when you need to run millions of queries on a 10B-triple database. If your backend is not optimal for that task, it's not going to perform.

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

> thanks for the elaboration. I can understand the background much better.
> I have to admit, that I am also not a real expert, but very close to the
> real experts like Vidal and Rahm who are co-authors of the SWJ paper or
> the OpenLink devs.

If you know anybody at OpenLink who would be interested in trying to evaluate such a thing (i.e. how Wikidata could be hosted on Virtuoso) and provide support for this project, it would be interesting to discuss it. While the open-source question is still a barrier and in general the requirements are different, at least discussing it and maybe getting some numbers might be useful.

Thanks,

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
Hi!

> Yes, sharding is what you need, I think, instead of replication. This is
> the technique where data is repartitioned into more manageable chunks
> across servers.

Agreed - if we are to get any solution that is not constrained by the hardware limits of a single server, we cannot avoid looking at sharding.

> Here is a good explanation of it:
>
> http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF

Thanks, very interesting article. I'd certainly like to know how this works with a database on the order of 10 bln. triples and queries both accessing and updating random subsets of them. Updates are not covered very thoroughly there - this is, I suspect, because many databases of that size do not have an update workload as active (and non-append) as ours. Maybe they still manage to solve it; if so, I'd very much like to know about it.

> Just a note here: Virtuoso is also a full RDMS, so you could probably
> keep wikibase db in the same cluster and fix the asynchronicity. That is

Given how the original data is stored (a JSON blob inside a MySQL table), it would not be very useful. In general, the graph data model and the Wikitext data model on top of which Wikidata is built are very, very different, and expecting the same storage to serve both - at least without very major and deep refactoring of the code on both sides - is not currently very realistic. And of course moving any of the wiki production databases to Virtuoso would be a non-starter. Given that the original Wikidata database stays on MySQL - which I think is a reasonable assumption - there would need to be a data migration pipeline for data to come from MySQL to whatever the WDQS NG storage is.
> also true for any mappers like Sparqlify:
> http://aksw.org/Projects/Sparqlify.html However, these shift the
> problem, then you need a sharded/repartitioned relational database

Yes, relational-RDF bridges are known, but my experience is that they usually are not very performant (the difference between "you can do it" and "you can do it fast" is sometimes very significant), and in our case it would be useless anyway, as the Wikidata data is not really stored in a relational database per se - it's stored in a JSON blob opaquely saved in a relational database structure that knows nothing about Wikidata. Yes, it's not the ideal structure for optimal performance of Wikidata itself, but I do not foresee this changing, at least in any short term. Again, we could of course have a data export pipeline to whatever storage format we want - essentially we already have one - but the concept of having a single data store is probably not realistic, at least within foreseeable timeframes. We use a separate data store for search (ElasticSearch) and will probably have to have a separate one for queries, whatever the mechanism would be.

Thanks,

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] Scaling Wikidata Query Service
special arrangement. Since this arrangement will probably not include open-sourcing the enterprise part of Virtuoso, it should deliver a very significant, I dare say enormous, advantage for us to consider running it in production. It may be possible that just the OS version is also clearly superior to the point that it is worth migrating, but this needs to be established by evaluation.

> - I recently heard a presentation from Arango-DB and they had a good
> cluster concept as well, although I don't know anybody who tried it. The
> slides seemed to make sense.

We considered ArangoDB in the past, and it turned out we couldn't use it efficiently on the scales we need (that could be our fault, of course). They also use their own proprietary language for querying, which might be worth it if they delivered a clear win on all other aspects, but that does not seem to be the case.

Also, ArangoDB seems to be a document database inside. This is not what our current data model is. While it is possible to model Wikidata in this way, again, changing the data model from RDF/SPARQL to a different one is an enormous shift, which can only be justified by an equally enormous improvement in some other areas, which is currently not apparent. This project seems to be still very young. While I would be very interested if somebody took it on themselves to model Wikidata in terms of ArangoDB documents, load the whole data set, and see what the resulting performance would be, I am not sure it would be wise for us to invest our team's - currently very limited - resources into that.

Thanks,

-- Stas Malyshev smalys...@wikimedia.org
Re: [Wikidata] searching for Wikidata items
Hi! > Yes, the api is > at https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=Bush There's also https://www.wikidata.org/w/api.php?action=wbsearchentities&search=Bush&language=en&format=json This is what completion search in Wikidata is using. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
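As a sketch of how one might build such a request from a script (the parameter names match the public MediaWiki action API; the helper itself is just illustrative, and no network call is made):

```python
from urllib.parse import urlencode

def entity_search_url(term, language="en"):
    """Build a wbsearchentities request URL -- the API behind
    Wikidata's completion search. Only constructs the URL."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)

url = entity_search_url("Bush")
```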
Re: [Wikidata] Where did label filtering break recently and how?
Hi! > and if I enable any of the FILTER lines, it returns 0 results. > What changed / Why ? Thanks for reporting, I'll check into it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Are we ready for our future
Hi! > WQS data doesn't have versions, it doesn't have to be in one space and > can easily be separated. The whole point of LOD is to decentralize your > data. But I understand that Wikidata/WQS is currently designend as a > centralized closed shop service for several reasons granted. True, WDQS does not have versions. But each time an edit is made, we now have to download and work through the whole 2M... It wasn't a problem when we were dealing with regular-sized entities, but the current system is certainly not good for such giant ones. As for decentralizing, WDQS supports federation, but for obvious reasons federated queries are slower and less efficient. That said, if there were a separate store for this kind of data, it might work, as cross-querying against other Wikidata data wouldn't be very frequent. But this is something that the Wikidata community needs to figure out how to do. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Are we ready for our future
Hi! > For the technical guys, consider our growth and plan for at least one > year. When the impression exists that the current architecture will not > scale beyond two years, start a project to future proof Wikidata. We may also want to consider whether Wikidata is actually the best store for all kinds of data. Let's consider an example: https://www.wikidata.org/w/index.php?title=Q57009452 This is an entity that is almost 2M in size, with almost 3000 statements, and each edit to it produces another 2M data structure. And its dump, albeit slightly smaller, is still 780K and will need to be updated on each edit. Our database is obviously not optimized for such entities, and they won't perform very well. We have 21 million scientific articles in the DB, and if even 2% of them were like this, that's almost a terabyte of data (multiplied by the number of revisions) and billions of statements. While I am not against storing this as such, I do wonder if it's sustainable to keep this kind of data together with other Wikidata data in a single database. After all, each query that you run - even if not related to those 21 million in any way - will still have to run within the same enormous database and be hosted on the same hardware. This is especially important for services like Wikidata Query Service, where all data (at least currently) occupies a shared space and cannot be easily separated. Any thoughts on this? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
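The sizing claim above can be sanity-checked with quick arithmetic (2% of 21 million entities at roughly 2 MB each):

```python
articles = 21_000_000          # scientific articles in the DB
share = 0.02                   # "even 2% of them"
entity_bytes = 2 * 1024**2     # ~2 MB per giant entity, per the example

total_tb = articles * share * entity_bytes / 1024**4
# comes out around 0.8 TB per copy of the data, before multiplying by
# the number of revisions -- "almost a terabyte", as the message says
```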
[Wikidata-tech] RDF export for SDC
Hi! I started looking into how to produce an RDF dump of MediaInfo entities, and I've encountered some roadblocks that I am not sure how to get around. I would like to hear suggestions on this, here or on Phabricator directly, or on IRC: 1. https://phabricator.wikimedia.org/T99 Basically, right now when we are enumerating entities for certain types, we just look at pages from the namespace related to the entity types and assume the page title is parseable directly into an entity ID. However, with slot entities like MediaInfo this is not the case. So, we need a generic service there that would take a page and a set of entity types, and figure out: a. Which of those entity types are "regular" entities with dedicated page IDs and which ones live in slots b. For the regular entities, do $this->entityIdParser->parse( $row->page_title ) as before c. For slot entities, check that the slot is present and, if so, produce the entity ID specific to this slot. Preferably this is also done without separate DB access (may not be easy), since SqlEntityIdPager needs to have good performance. I am not sure whether there's an API that does that. EntityByLinkedTitleLookup comes very close and even has a hook that does the right thing, but it does DB access even for local IDs for Wikidata (can be fixed) and does not support batching. Any other suggestions on how the above can be properly done? There's also the complication that the mapping from pages to slots is no longer one-to-one, so a fetch() operation can return not just $limit but anywhere from 0 to (number of slots)*$limit entity IDs. Probably not a huge deal, but it might need some careful handling. 2. https://phabricator.wikimedia.org/T222306 The entities in SDC are not local entities - e.g. if I am looking at https://commons.wikimedia.org/wiki/Special:EntityData/M9103972.json P180 and Q83043 do not come from Commons, they come from Wikidata. 
However, they do not have prefixes, which means the RDF builder thinks they are local and assigns them Commons-based namespaces, which is obviously wrong, since they are Wikidata entities. While Commons has a bunch of redirects set up, RDF identifies data by literal URL and has no idea about redirects, so querying the data would be problematic if the Wikidata dataset is combined with the Commons dataset. It would, for example, make it next to impossible to run federated queries between Wikidata and Commons, as the two stores would use different URIs for Wikidata entities. Additionally, the current RDF generation process assumes the wd: prefix always belongs to the local wiki, so on Commons wd: is <https://commons.wikimedia.org/entity/> but on Wikidata it's of course the Wikidata URL. This may be very confusing to people. If wd: means different things on Commons and Wikidata, then federated queries may be confusing, as it'd be unclear which wd: means what where. Ideally, we'd not use the wd: prefix for Commons at all, but this goes against the assumption hardcoded in RdfVocabulary that local wiki entities are wd:. So again, I am not sure what's the best way to treat this situation, since I am not sure how the federation model in SDC is supposed to work - the code suggests there should be some kind of prefixes for entity IDs, but SDC does not seem to use any. Any suggestions about the above are welcome. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
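The dispatch described in point (1) could be sketched as follows (in Python rather than the actual PHP, and with entirely hypothetical names - this is not the real Wikibase API):

```python
def entity_ids_for_page(page_title, present_slots, entity_types, slot_map,
                        parse_id):
    """For 'regular' entity types, parse the page title into an ID;
    for slot-based types (like MediaInfo), yield an ID only if the
    slot is actually present on the page."""
    ids = []
    for etype in entity_types:
        if etype in slot_map:                 # entity lives in a slot
            slot = slot_map[etype]
            if slot in present_slots:
                ids.append(f"{etype}:{page_title}@{slot}")
        else:                                  # regular, page-backed entity
            ids.append(parse_id(page_title))
    return ids

# A page whose mediainfo slot is empty yields no slot-based IDs at all,
# which is why fetch() can return anywhere from 0 to slots*limit IDs.
empty = entity_ids_for_page("File:X.jpg", set(), ["mediainfo"],
                            {"mediainfo": "mediainfo"}, str)
full = entity_ids_for_page("File:X.jpg", {"mediainfo"}, ["mediainfo"],
                           {"mediainfo": "mediainfo"}, str)
```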
Re: [Wikidata] Request
Hi! >> I am facing a problem where I can’t get enough data for my project. So is >> there anything that can be done to extend the limit of queries as they >> timeout ? If you have queries that take longer than the timeout permits, the options usually are: 1. Working with Wikidata dumps, as mentioned before 2. Looking into optimizing your query - maybe the timeout happens because your query is too slow. Check out https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization and https://www.wikidata.org/wiki/Wikidata:Request_a_query . 3. Downloading the information in smaller chunks using LIMIT/OFFSET clauses. Note that this doesn't speed up the query itself. 4. Using the LDF server: https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Linked_Data_Fragments_endpoint Depending on what data you need, one of these options will probably work. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
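Option 3 can be sketched like this (an illustrative helper, not an official client):

```python
def paged_queries(base_query, page_size=1000, pages=3):
    """Yield SPARQL strings fetching results chunk by chunk. Each chunk
    still evaluates the full query server-side, so this only helps when
    individual pages finish within the timeout."""
    for page in range(pages):
        yield f"{base_query}\nLIMIT {page_size}\nOFFSET {page * page_size}"

chunks = list(paged_queries(
    "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 }", page_size=500, pages=2))
```

Note that without a stable ORDER BY clause in the base query, OFFSET-based pagination may return overlapping or missing rows between pages, so add one in practice.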
[Wikidata] Fwd: [rdf4j-users] SPARQL 1.2 Community Group
Hi! There is a discussion going on in W3C SPARQL 1.2 Community Group about the improvements in SPARQL language. May be interesting to people that are using SPARQL and those that may have some ideas of how to improve it. -- Forwarded message - From: *Andy Seaborne* mailto:a...@seaborne.org>> Date: Fri, Mar 29, 2019 at 7:31 AM Subject: [rdf4j-users] SPARQL 1.2 Community Group To: mailto:rdf4j-us...@googlegroups.com>> SPARQL 1.2 Community Group starts up: http://www.w3.org/community/sparql-12/ It will document features found as extensions and capture common needs from the user community. -- You received this message because you are subscribed to the Google Groups "RDF4J Users" group. ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] On the use of "prop/direct-normalized" in RDF dumps
Hi! > rg --search-zip -F "http://www.wikidata.org/prop/direct-normalized" > wikidata_latest-truthy.nt.bz2 | pv > wikidata-extids.txt > > But I get as a result a little less than 29.5 million lines. Pubmed and > DOI, which alone account for about 33 million statements, are not included. Could you provide specific properties and preferably also some Q-ids for which you expected to find direct-normalized props but didn't? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Fwd: [Wikimedia-l] Developing instructional material for Wikidata Query Service
Hi! > In recent years, Wikimedia Israel has developed online instructional > materials, such as the Wikipedia courseware and the guide for creating > encyclopedic content. We plan to use our experience in this field, and in > collaboration with Wikimedia Deutschland, we intend to develop a website > with a step-by-step tutorial to learn how to use the Wikidata Query > Service. The instructional material will be available in three languages > (Hebrew, Arabic and English) but it will be possible to add the same > instructions in other languages. We are quite confident that having a > tutorial that explains and teaches the Query Service will help expand > Wikidata to new audiences worldwide. This sounds great, thank you! -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] WikibaseCirrusSearch extension
Hi! I've been working for a while now on splitting the code that does searching - and more specifically, searching using ElasticSearch/CirrusSearch - out of the Wikibase extension code and into a separate extension (see https://phabricator.wikimedia.org/T190022). If you don't know what I'm talking about here (or are not interested in this topic), you can safely skip the rest of this message. The WikibaseCirrusSearch extension is meant to hold all the code related to ElasticSearch and CirrusSearch extension integration for Wikibase, so the main Wikibase repo does not have any Elastic-specific code. This means that if you have your own Wikibase install, you'll need (after the migration is done) to install WikibaseCirrusSearch to get search functionality like we have on Wikidata now. There will also be a change in configurations - I'll make a migration document and announce it separately. We're now working on deploying and testing it on Beta/testwiki, after which we'll start migrating production to running the code in this extension for search, after which the search code in the Wikibase repo itself will be removed. You can track the progress in the Phabricator task mentioned above. Since the code migration is at a pretty advanced stage now, I'd like to ask that if you make any changes to any code under repo/includes/Search or repo/config in the Wikibase repo, or any tests or configs related to those, you inform me (by adding me to patch reviewers/CC, by email, or by any other reasonable means) so that these changes won't be lost in the migration. I'll be looking through the latest patches for anything related periodically, but I might miss things. WikibaseLexeme code that relates to search will also be migrated to a separate extension (WikibaseLexemeCirrusSearch); that work will be starting soon. So the request above applies to the search parts of the WikibaseLexeme code as well. If you have any questions/comments, please feel free to ask me, on the lists or on IRC. 
Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] [discovery] Data corruption on 2 Wikidata Query Service servers
Hi! > We are having some issues with 2 of the Wikidata Query Service > servers. So far, the issue looks like data corruption, probably > related to an issue in Blazegraph itself (the database engine behind > Wikidata Query Service). The issue prevents updates to the data, but > reads are unaffected as far as we can tell. The incident report for this issue is here: https://wikitech.wikimedia.org/wiki/Incident_documentation/20190110-WDQS It will be updated if we have any new developments or new information. As of now, all servers are working normally. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Fwd: Querying Wikidata
Hi! > Thank's for your reply. > All the failing queries was on the following model > > SELECT distinct ?candidate ?label WHERE { > SERVICE wikibase:mwapi { > bd:serviceParam wikibase:api "EntitySearch" . > bd:serviceParam wikibase:endpoint "www.wikidata.org > <http://www.wikidata.org>" . > bd:serviceParam mwapi:search "Musée Cernuschi" . > bd:serviceParam mwapi:language "fr" . > bd:serviceParam wikibase:limit 5 . > ?candidate wikibase:apiOutputItem mwapi:item . > } > > ?candidate wdt:P17 wd:Q142 . > > SERVICE wikibase:mwapi { > bd:serviceParam wikibase:api "EntitySearch" . > bd:serviceParam wikibase:endpoint "www.wikidata.org > <http://www.wikidata.org>" . > bd:serviceParam mwapi:search "Paris" . > bd:serviceParam mwapi:language "fr" . > bd:serviceParam wikibase:limit 5 . > ?city wikibase:apiOutputItem mwapi:item . > } > ?candidate wdt:P131 ?city . > > ?candidate rdfs:label ?label; > filter(lang(?label)="fr") > } Could you describe in a bit more detail what you're trying to do here? Doing two service calls is not a pattern one would commonly use... It can be slow if query optimizer misunderstands such query, too. I feel I'd have a bit more insight if I understood what you are trying to achieve with this query. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Fwd: Querying Wikidata
Hi! > Is there a good mean to query the sparql wdqs service of wikidata? > I've tried some python code to do it, with relative success. > Success, because some requests gives the expected result. > Relative, because the same query sometimes gives an empty response > either from my code or directly in the WDQS interface, where it's > possible to see that a sparql query sometimes gives an empty response, > sometimes the expected reponse without message or status to know that > the response is erroneous. > (demo is difficult, because it seems to depend of the load of the wdqs > service) Looks like you're running some heavy queries. So the question would be, which queries are those and how often do you run them? > I've found the following info: > * > https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits, > which suggest a possible error HTTP code |429, which I never receive| 429 means you're calling the service too fast or too frequently. If you are just running a single query, you never get 429. > * https://phabricator.wikimedia.org/T179879, which suggest a possible > connexion with OAuth, but such possibility is never documented in the > official documentation https://phabricator.wikimedia.org/T179879 is an open task, thus it's not implemented yet. > None of them gives a practical method to get a response and trust it. Any method that uses HTTP access to the SPARQL endpoint would give you the same result, which depends on the query. So I'd suggest providing some info about the queries and the specific issues you're having, and then we can see if it's possible to improve things. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
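Since heavy or frequent callers can hit HTTP 429, a client-side retry loop helps; here's a minimal sketch (the request callable is simulated, and a real client should also honor the Retry-After response header):

```python
import time

def run_with_backoff(do_request, max_tries=5, base_delay=0.01):
    """Retry a request callable while it signals HTTP 429 ("too many
    requests"). `do_request` returns a (status, body) pair."""
    delay = base_delay
    status, body = None, None
    for _ in range(max_tries):
        status, body = do_request()
        if status != 429:
            break
        time.sleep(delay)
        delay *= 2  # exponential backoff between attempts
    return status, body

# Simulated endpoint: rate-limited once, then succeeds.
responses = iter([(429, ""), (200, "ok")])
status, body = run_with_backoff(lambda: next(responses))
```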
Re: [Wikidata] Slow response and incomplete result for RecentChange API in wikidata.org
Hi! > Also given that it uses oresscores, we recently fixed some performance > issues caused by it. Do you still have issues with it? Yes, the issues I have listed still happen. My API calls do not use ORES. E.g. see: https://logstash.wikimedia.org/goto/63db4ce68fb5da3cdc7828150de10c59 -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Slow response and incomplete result for RecentChange API in wikidata.org
Hi! > Can you please check and let us know if you are still experiencing the > problem? We have a task https://phabricator.wikimedia.org/T202764 which I suspect describes the same issue. It is still open, and even though WDQS is running on Kafka in production and thus is not affected by it, I see it every time I run it on Labs (where Kafka stream is not available). So I think the issues with RC API on wikidata are still alive. There's also a parallel issue of https://phabricator.wikimedia.org/T207718 with RDF fetching, which also still happens. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikibase as a decentralized perspective for Wikidata
Hi! > I don't think this would cause a confusion, because the lexicographical > project is really a separate project that just happens to reside on the > same Wikidata domain. Essentially you did internally what we are asking No, the difference here is that L items are not the same as Q items - e.g. L items do not have sitelinks, and do have lemmas and senses. The data structure is different. If you use a different data structure than Q items - i.e., no labels, descriptions, sitelinks, etc. - then you should use a different letter. But if it's the same structure, just for a different domain - then it should be Q. > Most other sites that link to Wikidata only care about just one of those > projects. E.g. OSM would have very little interest in lexical data, so > it is OK if "L" prefix would be used in OSM and in WD because it won't > be as confusing to the users as reusing the Q. No, that would be confusing. If OSM wants its own data type, because the Q item does not fit - e.g. OSM doesn't want descriptions and sitelinks - then it should use a separate letter, like MediaInfo uses M. But using L would not be smart, since then this data would not integrate well with lexicographical data. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] BlazeGraph/wikibase:label performance
Hi! > But of course the original query should normally be streaming and not > depend on any such smartness to push LIMIT inwards. You are correct, but this may be a consequence of how Blazegraph treats services. I'll try to look into it - it is possible that it doesn't do streaming correctly there. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata-tech] [BREAKING] Planned RDF ontology prefix change
Hi! > We are planning to change the prefix and associated URIs in RDF > representation for Wikidata from: > > PREFIX wikibase: <http://wikiba.se/ontology-beta#> > > to: > > PREFIX wikibase: <http://wikiba.se/ontology#> The change has been implemented now, and RDF data is generated without the beta prefix. Please tell me if you notice any problems or have any questions. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata] Wikidata considered unable to support hierarchical search in Structured Data for Commons
Hi! > data on Commons. I also think that I understand your statement above. > What I'm not understanding is how Daniel's proposal to "start using the > ontology as an ontology on wikimedia projects, and thus expose the fact > that the ontology is broken." isn't a proposal to add poor quality > information from Wikidata onto Wikipedia and, in the process, give > Wikipedians more problems to fix. Can you or Daniel explain this? While I cannot pretend to have expert knowledge and do not purport to interpret what Daniel meant, I think here we must remember that Wikipedia, while of course of huge importance, is not the only Wikimedia project, so "start using it on Wikimedia projects" does not necessarily mean "start using it on Wikipedia", much less "start adding bad information to Wikipedia" (there are other ways to use the data, including imperfect ontologies - e.g. for search, for bot guidance, for quality assurance and editor support, and many other ways). I am not prescribing a specific scenario here, just reminding that "using the ontology on wikimedia projects" can mean a wide variety of things. > Separately, someone wrote to me off list to make the point that > Wikipedians who are active in non-English Wikipedias also wouldn't > appreciate having their workloads increased by having a large quantity > poor-quality information added to their edition of Wikipedia. I think I am sure that would be a bad thing. But I don't think anything we are discussing here would lead to that happening. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata considered unable to support hierarchical search in Structured Data for Commons
Hi! > Cparle wants to make sure that people searching for "clarinet" also get > shown images of "piccolo clarinet" etc. > > To make this possible, where an image has been tagged "basset horn" he > is therefore looking to add "clarinet" as an additional keyword, so that > if somebody types "clarinet" into the search box, one of the images > retrieved by ElasticSearch will be the basset horn one. Generally, if the image is tagged with "basset horn" and the user query is "clarinet", we can do one of the following: 1. Index the whole upstream hierarchy for "basset horn" (presumably we would have to cut off when it gets too deep or too abstract) and then match directly when searching. 2. Expand the hierarchy downstream from "clarinet" and then match against the search index. 3. Have some manual or automatic process that ensures that both "clarinet" and "basset horn" are indexed (not necessarily at once) and rely on it to discover the matches. The problem with (1) is that if the hierarchy changes, we will have to do a huge number of updates, which might overwhelm the system, and most of these updates would not even be for things people search for - but we have no way to know that. The problem with (2) is that downstream hierarchies explode very fast, and if you search for "clarinet" and there are thousands of descendants in these hierarchies, we can't search for all of them, so you may never get a chance to find the basset horn. Also, of course, querying big downstream hierarchies takes time too, which means a performance hit. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
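Option (1), index-time expansion of the upstream hierarchy with a depth cut-off, might look roughly like this (toy graph with assumed labels instead of real Q-ids):

```python
from collections import deque

# Toy subclass-of graph: child -> parents (illustrative example only).
SUBCLASS_OF = {
    "basset horn": ["clarinet"],
    "piccolo clarinet": ["clarinet"],
    "clarinet": ["woodwind instrument"],
    "woodwind instrument": ["musical instrument"],
}

def ancestors(item, max_depth=10):
    """Collect everything upstream of `item` via breadth-first search,
    with a depth cut-off -- the index-time expansion of option (1)."""
    seen, queue = set(), deque([(item, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for parent in SUBCLASS_OF.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen

index_terms = ancestors("basset horn")  # indexed alongside the tag itself
```

The update cost noted for (1) shows up exactly here: if an edge near the top of the graph changes, every already-indexed item whose ancestor set included it has to be re-expanded and re-indexed.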
Re: [Wikidata] Wikidata considered unable to support hierarchical search in Structured Data for Commons
Hi! > possibility to find more results by letting the search engine traverse > the "more-general-than" links stored in Wikidata. People have discovered > cases where some of these links are not correct (surprise! it's a wiki > ;-), and the suggestion was that such glitches would be fixed with > higher priority if there would be an application relying on it. But even The main problem I see here is not that some links are incorrect - which may have bad effects, but it's not the most important issue. The most important one, IMHO, is that there's no way to figure out in any scalable and scriptable way what "more-general-than" means for any particular case. It's different for each type of objects and often inconsistent within the same class (e.g. see the confusion over whether "dog" is an animal, a name of the animal, a name of the taxon, etc.). It's not that navigating the hierarchy would lead us astray - we're not even there yet to have this problem, because we don't even have a good way to navigate it. Using only instance-of/subclass-of seems not to be that useful, because a lot of interesting things are not represented in this way - e.g. finding out that Donna Strickland (Q56855591) is a woman (Q467) is impossible using only this hierarchy. We could special-case a bunch of those, but given how diverse Wikidata is, I don't think this will ever cover any significant part of the hierarchy unless we find a non-ad-hoc method of doing this. This also makes it particularly hard to do something like "let's start using it and fix the issues as we discover them", because the main issue here is that we don't have a way to start with anything useful beyond a tiny subset of classes that we can special-case manually. We can't launch a rocket and figure out how to build the engine later - having a working engine is a prerequisite to launching the rocket! 
There are also significant technical challenges in this - indexing a dynamically changing hierarchy is very problematic, and with our approach to ontology anything can be a class, so we'd have to constantly update the hierarchy. But this is more of a technical challenge, which will come after we have some solution for the above. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] [BREAKING] Planned RDF ontology prefix change
Hi! > If you're making the change, maybe worth going to https: as it'll be > painful to do later? Please see https://phabricator.wikimedia.org/T153563 where it was discussed. In general, there's no reason to use https for ontology URIs, as ontology URIs do not have any data in them and accessing them would not be very useful. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata-tech] [BREAKING] Planned RDF ontology prefix change
Hi! We are planning to change the prefix and associated URIs in RDF representation for Wikidata from: PREFIX wikibase: <http://wikiba.se/ontology-beta#> to: PREFIX wikibase: <http://wikiba.se/ontology#> If you are using Wikidata Query Service, you do not have to do anything, as WDQS already is using the new definition. However, if you consume RDF exports from Wikidata or RDF dumps directly, you will need to change your clients to expect the new URI scheme for Wikibase ontology. Also, if you're using Wikibase extension in your project, please be aware that the RDF URIs generated by it will use this prefix after the change. This is defined in repo/includes/Rdf/RdfVocabulary.php around line 175: self::NS_ONTOLOGY => self::ONTOLOGY_BASE_URI . "#", The new data will have schema:softwareVersion "1.0.0" triple on the dataset node[1], which will allow your software to distinguish the new data format from the old one. The task tracking the change is https://phabricator.wikimedia.org/T112127. I will make another announcement when the change is merged and deployed and the data produced by Wikidata is going to change. Please contact me (or comment in the task) if you have any questions or concerns. [1] https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Header -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
[Wikidata] [BREAKING] Planned RDF ontology prefix change
Hi! We are planning to change the prefix and associated URIs in RDF representation for Wikidata from: PREFIX wikibase: <http://wikiba.se/ontology-beta#> to: PREFIX wikibase: <http://wikiba.se/ontology#> If you are using Wikidata Query Service, you do not have to do anything, as WDQS already is using the new definition. However, if you consume RDF exports from Wikidata or RDF dumps directly, you will need to change your clients to expect the new URI scheme for Wikibase ontology. Also, if you're using Wikibase extension in your project, please be aware that the RDF URIs generated by it will use this prefix after the change. This is defined in repo/includes/Rdf/RdfVocabulary.php around line 175: self::NS_ONTOLOGY => self::ONTOLOGY_BASE_URI . "#", The new data will have schema:softwareVersion "1.0.0" triple on the dataset node[1], which will allow your software to distinguish the new data format from the old one. The task tracking the change is https://phabricator.wikimedia.org/T112127. I will make another announcement when the change is merged and deployed and the data produced by Wikidata is going to change. Please contact me (or comment in the task) if you have any questions or concerns. [1] https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Header -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
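For dump consumers, the switch amounts to a URI rewrite plus a check for the version triple described above; a minimal string-level sketch (no RDF library assumed):

```python
OLD_PREFIX = "http://wikiba.se/ontology-beta#"
NEW_PREFIX = "http://wikiba.se/ontology#"

def is_post_change_dump(header_lines):
    """Detect the new format via the schema:softwareVersion "1.0.0"
    triple that the announcement says is added to the dataset node."""
    return any("softwareVersion" in line and '"1.0.0"' in line
               for line in header_lines)

def migrate_uri(uri):
    """Rewrite an old-style ontology URI to the new prefix."""
    return uri.replace(OLD_PREFIX, NEW_PREFIX)

new_uri = migrate_uri("http://wikiba.se/ontology-beta#Statement")
```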
Re: [Wikidata] Wikidata considered unable to support hierarchical search in Structured Data for Commons
Hi! > Apparently the Wikidata hierarchies were simply too complicated, too > unpredictable, and too arbitrary and inconsistent in their design across > different subject areas to be readily assimilated (before one even > starts on the density of bugs and glitches that then undermine them). The main problem is that there is no standard way (or even defined small number of ways) to get the hierarchy that is relevant for "depicts" from current Wikidata data. It may even be that for a specific type or class the hierarchy is well defined, but the sheer number of different ways it is done in different areas is overwhelming and ill-suited for automatic processing. Of course things like "is "cat" a common name of an animal or a taxon and which one of these will be used in depicts" adds complexity too. One way of solving it is to create a special hierarchy for "depicts" purposes that would serve this particular use case. Another way is to amend existing hierarchies and meta-hierarchies so that there would be an algorithmic way of navigating them in a common case. This is something that would be nice to hear about from people that are experienced in ontology creation and maintenance. > to be chosen that then need to be applied consistently? Is this > something the community can do, or is some more active direction going > to need to be applied? I think this is very much something that the community can do. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Mapping Wikidata to other ontologies
Hi! > That's one way of linking up, but another way is using equivalent > property ( https://www.wikidata.org/wiki/Property:P1628 ) and equivalent > class ( https://www.wikidata.org/wiki/Property:P1709 ). See for example It is technically possible to add values for P1628 into the RDF export. However, the following questions arise: 1. Are we ready to claim these are exact equivalents? Sometimes semantic meanings differ, and some properties have class requirements - e.g. http://schema.org/illustrator expects the value to be of class Person, but of course a Wikidata item would not have that class. Same for the subject - it is expected to be of class Book, but won't be. This may confuse some systems. Is that OK? 2. How do we deal with multiple ontologies with the same meanings? E.g. https://www.wikidata.org/wiki/Property:P21 has 4 equivalent properties. There might be more. Do we want to generate them all? Why are there two properties for the same FOAF ontology - is that right? 3. If you change P1628, that does not automatically make all items with the relevant predicate update. You need an extensive update process - which currently does not exist, and which for a popular property may require significant resources to complete; some properties have millions of uses. Using P1709 is even more tricky, since the Wikidata ontology (provided we call what we have an ontology, which may also not be acceptable to some) is rather different from traditional semantic ontologies, and we do not really enforce any of the rules with regard to classes, property domains/ranges, etc., and have frequent and numerous exceptions to those. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
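If question 1 were answered positively, generating such mappings could be as simple as the following sketch (the property/URI pairs are illustrative assumptions, not a vetted list; a real exporter would read them from P1628 statements):

```python
# Hypothetical P1628 values, hard-coded here for illustration only.
EQUIVALENTS = {
    "P21": ["http://xmlns.com/foaf/0.1/gender"],
    "P18": ["http://schema.org/image"],
}

def equivalence_triples(equivalents):
    """Yield (subject, predicate, object) triples declaring each Wikidata
    property equivalent to its external counterparts."""
    owl_eq = "http://www.w3.org/2002/07/owl#equivalentProperty"
    for pid, uris in equivalents.items():
        for uri in uris:
            yield (f"http://www.wikidata.org/entity/{pid}", owl_eq, uri)

triples = list(equivalence_triples(EQUIVALENTS))
```

Question 3 above remains the hard part: these triples would have to be regenerated and re-propagated whenever a P1628 statement changes.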
Re: [Wikidata] Stemming in Search
Hi! > The general Search box was what I was using (top right corner of interface) > > I typed in the following: > > Readers Digest > > and expected to see > Reader's Digest Q371820 > > but it did not appear. > > Today I just checked again the same scenario, as I typed this email, > > and now it does appear. Yes, this is how it should work. There were no changes lately, AFAIK, but it is possible that you hit some glitch or maintenance during your previous search. If that happens again, please tell me when and with which search string/URL, and I'll try to investigate. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Stemming in Search
Hi! > When will stemming be supported in Search ? In general, I think it already should be, for fields and contexts that use appropriate analyzers, but I'd like to hear more details: 1. Which search? 2. What are you looking for, i.e. what is the search string? 3. What do you expect to find? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata SPARQL query logs available
Hi! On 8/23/18 2:07 PM, Daniel Mietchen wrote: > On Thu, Aug 23, 2018 at 10:44 PM wrote: >> I was wondering why our research section was number 8. Then I recalled >> our dashboard running from >> "http://people.compute.dtu.dk/faan/cognitivesystemswikidata1.html". It >> updates around every 3 minutes all day long... > > Such automated queries should not be in the organic query file that I looked > at. If it's a browser page and the underlying code does not set a distinctive user agent, I think they will be. It'd be hard to identify such cases otherwise (cc'ing Markus in case he knows more on the topic). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
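[Editorial note: for dashboard authors reading along, the easiest way to make such automated traffic identifiable in query logs is a distinctive User-Agent header. A minimal stdlib-only sketch; the tool name and contact address are made-up placeholders:]

```python
# Sketch: tag automated WDQS requests with a distinctive User-Agent so
# they can be told apart from organic browser traffic in query logs.
# "MyDashboardBot" and the contact details are placeholders.
import urllib.parse
import urllib.request

SPARQL = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
url = ("https://query.wikidata.org/sparql?format=json&query="
       + urllib.parse.quote(SPARQL))

req = urllib.request.Request(url, headers={
    # Identify the tool and a way to reach its operator.
    "User-Agent": "MyDashboardBot/1.0 (https://example.org; me@example.org)",
})

# urllib.request.urlopen(req) would send the request with that header;
# here we only construct it and verify the header is set.
print(req.get_header("User-agent"))
```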
Re: [Wikidata] Wikidata SPARQL query logs available
Hi! > I just ran Max' one-liner over one of the dump files, and it worked > smoothly. Not sure where the best place would be to store such things, > so I simply put it in my sandbox for now: > https://www.wikidata.org/w/index.php?title=User:Daniel_Mietchen/sandbox=732396160 If you think it's a dataset others may want to reuse, tabular data on Commons may be a venue: https://www.mediawiki.org/wiki/Help:Tabular_Data -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Indexing everything (was Re: Indexing all item properties in ElasticSearch)
Hi! > This is a bit tangential to the topic, but isn’t that basically what > schema.org was developed for? (I’m not sure if that’s still its primary > purpose, but as far as I know it was started by a group of search > engines to develop a unified format websites could use to make their > semantics more accessible to those search engines.) There are a number of schemas, like Dublin Core, that try to address issues like that. However, none is even close to what we're talking about - covering several thousands properties that change all the time. They have very basic things covered, but AFAIK not much beyond. And I think those vocabularies still do not solve our problem with updating labels in multiple languages and keeping them in sync. That said, this would be quite offtopic for *this* thread, but still if anybody has any ideas on how to present Wikidata content better to search engines using well-known metadata vocabularies, I think it would be a very welcome effort. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Indexing all item properties in ElasticSearch
Hi! > I tried searching for a few DOIs today which are string properties > (i.e. 10.1371/JOURNAL.PCBI.1002947) and didn't get any results. Statements are indexed, but you have to use haswbstatement with a specific property to look for them. > Is this the phabricator task for > this: https://phabricator.wikimedia.org/T163642 ? This is the task for making strings searchable _without_ the haswbstatement keyword. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
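[Editorial note: to make the distinction concrete, here is a sketch of that DOI lookup in its haswbstatement form against the on-wiki search API. P356 is the DOI property; the search term string is the part that matters, the rest is ordinary MediaWiki API plumbing:]

```python
# Sketch: a DOI is findable today via haswbstatement:<property>=<value>,
# not via plain fulltext search. This builds the search API request for
# the DOI mentioned above.
import urllib.parse

def haswbstatement_search(pid: str, value: str) -> str:
    """Build a MediaWiki search API URL for a statement lookup."""
    term = f"haswbstatement:{pid}={value}"
    params = urllib.parse.urlencode({
        "action": "query",
        "list": "search",
        "srsearch": term,
        "format": "json",
    })
    return "https://www.wikidata.org/w/api.php?" + params

print(haswbstatement_search("P356", "10.1371/JOURNAL.PCBI.1002947"))
```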
Re: [Wikidata] Indexing everything (was Re: Indexing all item properties in ElasticSearch)
Hi! > https://everypageispageone.com/2011/07/13/search-vs-query/ ). Currently > our query service is a very strong and complete service, but Wikidata > search is very poor. Let's take Blade Runner. I don't think it's *very* poor anymore, but it certainly can be better. > In my ideal world, everything I see as a human gets indexed into the > search engine preferably in a per language index. For example for Dutch Err... The problem is that what you see as a human and what a search engine uses for lookups are very different things. While for text articles it is similar, for structured data it's quite different, and treating structured data the same way as text is not going to produce good results, partially because most search algorithms make assumptions that come from the text world, partially because we'd be ignoring useful clues present in structured data. > something like a text_nl field with the, label, description, aliases, > statements and references in there. So index *everything* and never see There are such fields, but it makes no sense to put references there, because there's no such thing as a "Dutch reference". References do not change with language. > a Qnumber or Pnumber in there (extra incentive for people to add labels > in their language). Probably also everything duplicated in the text That presents a problem. While you see "instance of": "human", the data is P31:Q5. We can, of course, put "instance of": "human" in the index. But what if the label for Q5 changes? Now we have to re-index 10 million records. And while we're doing it, what if another label for such an item changes again? We'd have to start another million-size reindex. In a week, we'd have a backlog of hopeless size, or would require processing power that we just don't have. Note also that ElasticSearch doesn't really do document updates - it just writes a new document. 
So frequent updates to the same document are not its optimal scenario, and we're talking about propagating each label edit to each item that is linked to that one. I'm afraid that would explode on us very quickly. The problem is not indexing labels, the problem is keeping them up-to-date on 50 million interlinked items. When displaying, it's easy - you don't need to worry until you show it, and most items are shown only rarely. Even then you see a label out of date now and then. But with search, you can't update the label on use - when you want to use it (i.e. look up), it should already be up-to-date, otherwise it's useless. > As for implementation: We already have the logic to serialize our json > to the RDF format. Maybe also add a serialization format for this that > is easy to ingest by search engines? I don't know any such special format, do you? We of course have JSON updates to ElasticSearch, but as I noted before, updates are the problem there, not format. RDF of course also does not carry denormalized data, so we also update only entries that need updating, and fetch labels on use. We cannot do that for the search index. I don't think format is the problem here. > . Making it easier to index not only for our own search would be a nice > added benefit. Sure, but experience has shown that the strategy of "dump everything into one huge text" works very poorly in Wikidata. That's why we implemented a specialized search that knows how the structured data works. If the search sucks less now than it did before, that's the reason. > How feasible is this? Do we already have one or multiple tasks for this > on Phabricator? Phabricator has gotten a bit unclear when it comes to > Wikidata search, I think because of misunderstanding between people what > the goal of the task is. Might be worthwhile spending some time on > structuring that. Wikidata search tasks would be under "Wikidata" + "Discovery-Search". 
There are multiple tasks for it, but if you want to add any, please feel welcome to browse and add. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Indexing all item properties in ElasticSearch
Hi! > * I would really like dates (mainly, born/died), especially if they work > for "greater units", that is, I search for a year and get an item back, > even though the statement is month- or day-precise This is something I've been thinking about for a while, mainly because the way we index dates now does not serve some important use cases. Even in the Query Service we treat dates as fixed instants on the time scale, whereas some dates are not instants but intervals (which is captured in the Wikidata precision field, but we are currently not paying any attention to it); in fact, many of the dates we use are more interval-like than instant-like in nature. This makes searching for "somebody who was born in 1820" possible but laborious (you need to construct the intervals manually) and inefficient, since we can't just look up by year. There are certainly improvements possible in this area; I'm not yet sure how to do it though. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
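[Editorial note: to illustrate the "possible but laborious" point above, matching a day-precise birth date against a bare year currently means expanding the year into an explicit interval by hand. A sketch of the query one has to construct today (P569 is date of birth, P31/Q5 restricts to humans):]

```python
# Sketch: to find people born "in 1820" the year must be expanded into
# an explicit [start, end) interval, because dates are indexed as
# instants rather than intervals.

def born_in_year_query(year: int) -> str:
    """Build a WDQS query for people born in a given calendar year."""
    start = f"{year}-01-01T00:00:00Z"
    end = f"{year + 1}-01-01T00:00:00Z"
    return (
        "SELECT ?person WHERE {\n"
        "  ?person wdt:P31 wd:Q5 ;\n"
        "          wdt:P569 ?dob .\n"
        f'  FILTER("{start}"^^xsd:dateTime <= ?dob && ?dob < "{end}"^^xsd:dateTime)\n'
        "}"
    )

print(born_in_year_query(1820))
```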
Re: [Wikidata] Indexing all item properties in ElasticSearch
Hi! > I could definitely see a usecase for 1) and maybe for 2). For example, > let's say i remember that one movie that Rutger Hauer played in, just > searching for 'movie rutger hauer' gives back nothing: > > https://www.wikidata.org/w/index.php?search=movie+rutger+hauer > > While Wikipedia gives back quite a nice list of options: > > https://en.wikipedia.org/w/index.php?search=movie+rutger+hauer Well, this is not going to change with the work we're discussing. The reason you don't get anything from Wikidata is because "movie" and "rutger hauer" are labels from different documents and ElasticSearch does not do joins. We only index each document in itself, and possibly some additional data, but indexing labels from other documents is now beyond what we're doing. We could certainly discuss it but that would be separate (and much bigger) discussion. > If we would index item properties as well, you could get back Blade > Runner (Q184843) which has Rutger Hauer as one of its 'cast member' > values. You could, but not by asking something like "movie rutger hauer", at least not without a lot of additional work. Indexing "cast member" would get you a step closer, but only a tiny step and there are a number of other steps to take before that can work. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch
Hi! > The top 1000 > is: > https://docs.google.com/spreadsheets/d/1E58W_t_o6vTNUAx_TG3ifW6-eZE4KJ2VGEaBX_74YkY/edit?usp=sharing This one is pretty interesting, how do I extract this data? It may be useful independently of what we're discussing here. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] [discovery-private] Indexing all item properties in ElasticSearch
Hi! > I think we already index way more than P31 and P279. Oh yes, all the string properties. > So I think that the increase is smaller than what you anticipate. > What I'd try to avoid in general is indexing terms that have only doc > since they are pretty useless. For unique string properties, that would be a frequent occurrence. But I am not sure why it's useless - won't it be a legit use case to look up something by external ID? > I think we should investigate what kind of data we may have here, and at > least for statement_keywords I would not index data that contain random > text (esp. natural language) since they are prone to be unique and > impossible to search. Yes, we definitely should not do that. I tried to exclude such properties but if you notice more of them, let's add them to exclusion config. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Indexing all item properties in ElasticSearch
Hi! > * I would really like dates (mainly, born/died), especially if they work > for "greater units", that is, I search for a year and get an item back, > even though the statament is month- or day-precise What would be the use case for this? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] Indexing all item properties in ElasticSearch
Hi! Today we are indexing in ElasticSearch almost all string properties (except a few) and selected item properties (P31 and P279). We've been asked to extend this set and index more item properties (https://phabricator.wikimedia.org/T199884). We did not do it from the start because we did not want to add too much data to the index at once, and wanted to see how the index behaves. To evaluate what this change would mean, some statistics: all usage of item properties in statements amounts to about 231 million uses (according to the sqid tool database). Of those, about 50M uses are "instance of", which we are already indexing. Another 98M uses belong to two properties - published in (P1433) and cites (P2860) - leaving about 86M for the rest of the properties. So, if we index all the item properties except P2860 and P1433, we'll be a little more than doubling the amount of data we're storing for this field, which seems OK. But if we index those too, we'll essentially be quadrupling it - which may be OK too, but is a bigger jump and one that may potentially cause some issues. So, we have two questions: 1. Do we want to enable indexing for all item properties? Note that if you just want to find items with certain statement values, the Wikidata Query Service matches this use case best. It's only in combination with actual fulltext search that on-wiki search is better. 2. Do we need to index P2860 and P1433 at all, and if so, would it be OK if we omit indexing them for now? Would be glad to hear thoughts on the matter. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
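[Editorial note: the usage counts above come from the sqid tool's database; the same kind of figure can in principle be checked in WDQS for a single property, though for very heavily used properties (P2860, P1433) such a query will likely hit the timeout, so treat this as a sketch only:]

```python
# Sketch: count how many statements use a given item property. For the
# heaviest properties this will likely exceed the WDQS query timeout;
# the sqid statistics referenced above avoid that problem.

def usage_count_query(pid: str) -> str:
    """Build a WDQS query counting uses of a property in statements."""
    return (
        "SELECT (COUNT(*) AS ?uses) WHERE {\n"
        f"  ?item wdt:{pid} ?value .\n"
        "}"
    )

print(usage_count_query("P1433"))
```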
Re: [Wikidata] UniProt license change to CC-BY 4.0 could we be added to the federatable sparql endpoints
Hi! On 7/19/18 1:07 AM, Jerven Tjalling Bolleman wrote: > Dear WikiData community, > > I am very happy to announce that all UniProt datasets are now available > under CC-BY 4.0 > > https://www.uniprot.org/news/2018/07/18/release > https://www.uniprot.org/help/license > > > https://www.sib.swiss/about-us/news/1186-encouraging-knowledge-reuse-to-foster-innovation > > > Could our sparql endpoint https://sparql.uniprot.org/sparql please be > added to the > the endpoints that are usable with SPARQL 1.1. SERVICE clauses at > https://query.wikidata.org/sparql. Thank you! I will take care of it in the next update (next week). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] On traceability and reliability of data we publish [was Re: [Wikimedia-l] Solve legal uncertainty of Wikidata]
Hi! > I agree this is misconception that a copyright license make any direct > change to data reliability. But attribution requirement does somewhat > indirectly have an impact on it, as it legally enforce traceability. While true, I don't think it's of much practical use if traceability is what you are seriously interested in. Imagine Wikidata were CC-BY: each piece of data you use from Wikidata would then have to be marked as "coming from Wikidata.Org". What have you gained? Wikidata is huge, and this mark doesn't even tell you which item it is from, while being completely satisfactory legally. It is even more useless for actually ensuring the data is correct or tracing its provenance to primary sources - you'd still have to find the item and check the references manually (or automatically, maybe), just as you could do with CC0. A CC-BY license would not have added very much on the Wikidata side. Meanwhile, of course, even with CC0 nothing prevents you from importing Wikidata data in such a way that each piece of data still carries the mark "coming from Wikidata". While it is not a legal requirement with CC0, nothing in CC0 prevents it. If this matches your provenance needs, there's nothing stopping you from doing it, and the legal requirements of CC-BY do not improve things for you in any way - they would just force people who do *not* need this to do it anyway. > That is I strongly disagree with the following assertion: "a license > that requires BY sucks so hard for data [because] attribution > requirements grow very quickly". To my mind it is equivalent to say that I think this assertion (that attribution requirements grow) is factually true. Each piece of data from a CC-BY data set needs to carry attribution. If your needs require combining several data sets, each of them needs to carry attribution. This attribution should be carried through all data processing pipelines. 
You may be OK with this growth, but as I just explained above, these requirements, while being onerous for people that don't need tracing each piece of data, are still unsatisfactory in many cases for those that do. So having CC-BY would be both onerous and useless. > we will throw away traceability because it is subjectively judged too > large a burden, without providing any start of evidence that it indeed > can't be managed, at least with Wikimedia current ressources. It's not Wikimedia that will be shouldering the burden, it's every user of Wikimedia data sets. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata-tech] lexeme fulltext search display
Hi! > I can reimplement it manually, but I would be largely duplicating what > HtmlPageLinkRendererBeginHookHandler is supposed to do. The problem > seems to be that it is not doing it right. When the code works on the > link like /wiki/Lexeme:L2#L2-F1, it does this: > > $entityId = $foreignEntityId ?: > $this->entityIdLookup->getEntityIdForTitle( $target ); > > Which produces back LexemeId instead of Form ID. It can't return Lexeme > ID since lexeme does not have content model, and getEntityIdForTitle > uses content model to get from Title to ID. So, I could duplicate all > this code but I don't particularly like it. Could we fix > HtmlPageLinkRendererBeginHookHandler instead maybe? Also, looks like Form actually doesn't have link-formatter-callback and its own link formatter code. So I wonder if there's an existing facility to format links to Forms? Leszek, do you have any information on this? Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata-tech] lexeme fulltext search display
Hi! > You can use an EntityTitleLookup to get the Title object for an EntityId. In > case of a Form, it will point to the appropriate section. You can use the OK, I see it's just adding form id as a fragment, so it's easy I guess. > LinkRenderer service to make a link. Or you use an EntityIdHtmlLinkFormatter, > which should do the right thing. You can get one from a > OutputFormatValueFormatterFactory. I can reimplement it manually, but I would be largely duplicating what HtmlPageLinkRendererBeginHookHandler is supposed to do. The problem seems to be that it is not doing it right. When the code works on the link like /wiki/Lexeme:L2#L2-F1, it does this: $entityId = $foreignEntityId ?: $this->entityIdLookup->getEntityIdForTitle( $target ); Which produces back LexemeId instead of Form ID. It can't return Lexeme ID since lexeme does not have content model, and getEntityIdForTitle uses content model to get from Title to ID. So, I could duplicate all this code but I don't particularly like it. Could we fix HtmlPageLinkRendererBeginHookHandler instead maybe? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata-tech] lexeme fulltext search display
Hi! >> color/colour (L123) >> colors: plural for color (L123): English noun > > I'd rather have this: > > colors/colours (L123-F2) > plural of color (L123): English noun This part is a bit trickier since the title is still L123, so the system now is generating the link for L123. I could override that, but I see two questions: 1. What the link will be pointing to? I haven't found the code to generate the link to specific Form. I could write a new one but if it'd sit outside main classes it may be a fragile design. 2. This means overriding standard linking code and possibly reimplementing part of it (depending on whether this code supports generating Form link instead of Lexeme) - may again be a bit fragile. Unless I find standard means to do it. > Note that in place of "plural", you may have something like "3rd person, > singular, past, conjunctive", derived from multiple Q-ids. Yes, of course. > Again, I don't think any highlighting is needed. Not strictly speaking needed, but might be nice. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
[Wikidata-tech] lexeme fulltext search display
Hi! I am working now on Lexeme fulltext search. One of the unclear moments I have encountered is how to display Lexemes as search results. I am basing this on the assumption that we want to match both Lemmas and Forms (please tell me if I'm wrong). Having the match, I plan to display a Lemma match like this: title (LN) Synthetic description e.g. color/colour (L123) English noun Meaning, the first line with the link would be the standard lexeme link generated by the Lexeme code (which also deals with multiple lemmas), and the description line is a generated description of the Lexeme - just like in completion search. The problem here, however, is that since the link is generated by the Lexeme code, which has no idea about search, we can not properly highlight it. This can probably be solved with some trickery, e.g. locating search matches inside the generated string and highlighting them, but first I'd like to ensure this is how it should look. More tricky is displaying the Form (representation) match. I could display the same as above here, but I feel this might be confusing. Another option is to display Form data, e.g. for "colors": color/colour (L123) colors: plural for color (L123): English noun The description line features the matched Form's representation and a synthetic description for this Form. Right now the matched part is not highlighted - because it will otherwise always be highlighted, as it is taken from the match itself - so I am not sure whether it should be or not. So, does this display look like what we want to produce for Lexemes? Is there something that needs to be changed or improved? Would like to hear some feedback. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata-tech] Wikidata full text search
Hi! > Would there be any drawback with the following steps as way forward > and possibility to learn more as we go? > 1. We return results for the Lexeme namespace only when people > explicitly select it If you mean "it and only it" (as opposed to Lexemes + any other namespace), then yes, this is doable and this is probably what I am going to start with. However, a lot of people - as I observed with several community members - tend to use the "All" option and expect it to work. > 2. We get feedback > 3. We go the "Best possible query" route when people select all namespaces > 4. We get feedback > 5. We go the "Best possible query" route for all searches if feedback > indicates this is useful (I don't know at this point) I am not sure which mode is best for Wikidata now; there are at least several plausible ways to go by default for Special:Search: 1. Search in Items only 2. Search in Items + Properties 3. Search in Items + Properties + Lexemes 4. Search in Items + Lexemes 5. Any of the above plus some of the article spaces (i.e. Wikidata or Help) This requires mixed search to work (except for 1 and 2), but that is a separate decision. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
[Wikidata-tech] Wikidata full text search
Hi! While working on fulltext search for Lexemes, I have encountered a question which I think needs to be discussed and resolved. The question is how fulltext search should work when dealing with different content models, and what search should do by default and in specialized cases. The main challenge in Wikidata is that we are dealing with substantially different content models - articles, Items (including Properties, because while being formally a different type, they are similar enough to Items for search to ignore the difference) and Lexemes organize their data in different ways, and should be searched using different specialized queries. This is currently unique to Wikidata, but SDC might eventually have the same challenge to deal with. I've described the challenges and questions in more detail here: https://www.wikidata.org/wiki/User:Smalyshev_(WMF)/Wikidata_search#Fulltext_search I'd like to first hear some feedback about what the expectations for the combined search are - what is expected to work, how it is expected to work, what the defaults are, and what the use cases for these are. I have outlined some solutions that were proposed on wiki; if you have any comments, please feel welcome to respond either here or on wiki. The TL;DR version is that searching across different data models is hard, and we will need to sacrifice something to make it work. We need to figure out and decide which of these sacrifices are acceptable and what is enabled/disabled by default. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata] WDQS with use of automated requests
Hi! On 5/15/18 3:27 PM, Justin Maltais wrote: > Hi, > > I am looking for the most efficient way of getting the following > information out of WDQS: > > * One language only (e.g. fr.wikipedia.org) > * All instances of human (e.g. of the abstraction: wd:Q9916|Dwight > David > Let's say we have a list of all sovereign states (Q16, Q30, Q142, ...) > and all letters of the requested language (French: a, b, c, ...) , we > can automate requests and get a lot of results. Unfortunately, it's > costly and not efficient. It takes about a day to succeed. The first thing I would like to ask is please don't do that again. This created a significant load on the server; the script completely ignored the throttling headers we sent, and in the future we will ban such clients for extended periods of time, to prevent harm to the service. If your client cannot abide by 429/Retry-After headers, please do not run it in an automated, repeated fashion until it either can handle them properly, or inserts delays long enough that you can be sure you are not launching an avalanche of heavy requests and crowding out other users. If something takes too long, that's a good moment to ask for help, not to put it in a loop that would hit the server repeatedly for days. If you need to deal with a massive data set that needs to be processed, I would suggest trying the following strategy: 1. Load the primary key data - like the list of all humans, if that's what you need - into your own storage. You can use either the LDF server or parse the dump directly for Q5 (maybe with Wikidata Toolkit?). For some scenarios even a direct query would be fine, but for Q5 it would probably be too much. 2. Split this data set into palatable batches - like 100 items per batch or so; you can experiment with that - it's fine to cause a couple of timeouts if it's not an automated script doing it 20 times a second for a long time. 
Once you have a sane batch size, run the query that needs to fetch other data, using a VALUES clause to substitute the primary key data. Watch for 429 responses - if you're getting them, insert delays or lower the batch size, or ask for help again if that doesn't work. Alternatively, segmenting the records by some other criteria may work too, but I don't think a filter like STRSTARTS(?personLabel, "D") is going to be effective - I don't think the Blazegraph query optimizer is smart enough to convert this to an index lookup, and without that, this just slows things down by introducing more checks in the query. And even if it did, there are a lot of labels starting with "D", so that probably won't be too useful for speeding it up. Having said that, I am curious - what exactly are you doing with this data set? Why do you need a list of all humans - how is this list going to be used? Knowing that may help to devise a better specialized strategy for achieving the same. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
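[Editorial note: the batching strategy suggested above can be sketched as follows. Batch size, the follow-up query, and the tool name are placeholders to tune; the VALUES injection and the 429/Retry-After handling are the parts that matter:]

```python
# Sketch of the suggested strategy: split a locally stored list of item
# IDs into batches, inject each batch via a VALUES clause, and back off
# whenever the server answers 429 with a Retry-After header.
import time
import urllib.parse
import urllib.request
from urllib.error import HTTPError

ENDPOINT = "https://query.wikidata.org/sparql"

def chunks(items, size):
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def values_query(qids):
    """Build a query fetching labels for one batch of items."""
    values = " ".join(f"wd:{q}" for q in qids)
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )

def run(query, user_agent="MyBatchTool/0.1 (me@example.org)"):
    """Send one query, honoring Retry-After on throttling (429)."""
    url = ENDPOINT + "?format=json&query=" + urllib.parse.quote(query)
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    while True:
        try:
            return urllib.request.urlopen(req).read()
        except HTTPError as e:
            if e.code != 429:
                raise
            # Server asked us to slow down: wait as long as it says.
            time.sleep(int(e.headers.get("Retry-After", "60")))

# Example batch construction (no network traffic in this sketch):
batch = next(chunks(["Q42", "Q1339", "Q307"], 100))
print(values_query(batch))
```

In a real run, `run(values_query(batch))` would be called once per batch, possibly with a small fixed delay between batches in addition to the Retry-After handling.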
Re: [Wikidata] [Wikitech-l] GSoC 2018 Introduction: Prssanna Desai
Hi! > Greetings, > I'm Prssanna Desai, an undergraduate student from NMIMS University, Mumbai, > India and I've been selected for GSoC '18. > > *My Project:* *Improve Data Explorer for query.wikidata.org > <http://query.wikidata.org>* Welcome! Thanks for participating and helping to make the Query Service better! -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikiata and the LOD cloud
Hi! > you should read your own emails. In fact it is quite easy to join the > LOD cloud diagram. > > The most important step is to follow the instructions on the page: > http://lod-cloud.net under how to contribute and then add the metadata. I may not be reading it right or may be misunderstanding something, but I have tried a few times to locate up-to-date working instructions for doing this, and it always ended up going nowhere - the instructions turned out to be out of date, or the new process was not working yet, or something else. It would be very nice and very helpful if you could point out specifically where on that page the step-by-step instructions are that can be followed to resolve this issue. > Do you really think John McCrae added a line in the code that says "if > (dataset==wikidata) skip; " ? I don't think anybody thinks that. And I think most people there think it would be nice to have Wikidata added to LOD. It sounds like you know how to do it - could you please share more specific information about it? > You just need to add it like everybody else in LOD, DBpedia also created > its entry and updates it now and then. The same accounts for > http://lov.okfn.org Somebody from Wikidata needs to upload the Wikidata > properties as OWL. If nobody does it, it will not be in there. Could you share more information about lov.okfn.org? Going there produces a 502, and it's not mentioned anywhere on lod-cloud.net. Where is it documented, what exactly is the process, and what do you mean by "upload the Wikidata properties as OWL"? More detailed information would be hugely helpful. Thanks in advance, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Use Repology to update software package data
Hi! > i want to inform you that Repology has your "repository" of software > versions included and can list problems or outdated versions that way. What does this list actually include? Is this the list of software and versions present in Wikidata as items? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] CC-BY-SA
Hi! > No. There is no such thing as "category namespace" in Wikidata. There You are correct. I was talking about category namespace in Wikidata Query Service. It is documented here: https://www.mediawiki.org/wiki/Wikidata_query_service/Categories -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Election data
Hi! > Something I wish was available is the voting record, at least at a > country/state level. Knowing the politician's time in office is a great > start, but how that person voted is what really makes democracy work. I think Ballotpedia has this data. E.g.: https://ballotpedia.org/Marco_Rubio Not sure however if it's structured or available in API form. It also has state level politicians, e.g.: https://ballotpedia.org/Bill_Monning - but it seems it's even harder to parse there. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] CC-BY-SA
Hi! > Just checking so let's say for: > https://www.wikidata.org/wiki/Q2201 > Narrative location (NY City) is from English Wikipedia, then it's CC BY SA? No, it's CC0 since it's Wikidata data. Facts as such are not copyrightable, so the fact that the particular movie is set in NYC is not subject to any license. Specific arrangement (collection) of facts can be copyrighted and licensed though and this specific one is Wikidata, which is licensed under CC0. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] CC-BY-SA
Hi! > Are all data that can be fetched via SPARQL CC0-licensed? Data in the main namespace, without the use of federation - yes. Federated data - https://query.wikidata.org/copyright.html, federated endpoints can have different licenses, either CC0-like or CC-BY-SA like (I don't think we accepted any that have anything stricter than that) Category namespace - since it comes from Wikipedias, it's CC-BY-SA I assume. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] stats on WD edits and WDQS uptime
Hi! > Second, are there any stats on the uptime of the WDQS SPARQL endpoint? I am not entirely sure how you define "uptime" here? If you try to access query.wikidata.org, it'd be very close to 100%. That said, we had a couple of incidents where one or more servers failed, causing some queries to get stuck or be rejected, see https://wikitech.wikimedia.org/wiki/Incident_documentation/20171018-wdqs and https://wikitech.wikimedia.org/wiki/Incident_documentation/20171130-wdqs These do not take the whole service down, so I am not sure how they qualify uptime-wise. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata fulltext search prototype
Hi! > I guess its using an older index from a few weeks ago ? Doesn't seem to > have the latest properties that have landed, but that's ok if the ES > index isn't current yet and your just experimenting and getting feedback. Yes, exactly. The Wikidata index is big, and we cannot use the main index since we're experimenting on it, so we make a copy and use that. Of course, the copy gets out of date :) This one is a couple of weeks old. > > http://wikidata-wdsearch.wmflabs.org/w/index.php?search=partition=Special:Search=advanced=1=1=25rdek6vt4n1ekkk5ht0ew0vv > > Didn't see > https://www.wikidata.org/wiki/Property:P4653 Yes, too recent :) -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata fulltext search prototype
Hi! > Where can I learn about the internals of this jewel? (which search > engine, what metrics are used to rank items, and so on). Thanks for your kind words. You can track it here: https://phabricator.wikimedia.org/T125500 and associated tasks like this one: https://phabricator.wikimedia.org/T178851 which contain links to the patches. The search runs on the same ElasticSearch we use for search on other sites, but the prototype has specific code to deal with Wikidata specific data structure and the fact that it is, unlike most other Wikimedia sites, multilingual by design. The rankings are hand-tuned now and kind of hard to read right now (we're working on improving this), they are contained here: https://phabricator.wikimedia.org/diffusion/EWBA/browse/master/repo/config/ and specific functions we're using here: https://phabricator.wikimedia.org/diffusion/EWBA/browse/master/repo/config/ElasticSearchRescoreFunctions.php;4c6aa54e56c68ebd3543b23c88f52ae6f176a079$25 Basically it's a combination of match score (how well the string matches the query), incoming link count, sitelink count and special boosts like demoting the disambiguation pages. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
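The hand-tuned combination described above (match score, incoming link count, sitelink count, and a demotion for disambiguation pages) can be sketched roughly as follows. This is an illustrative Python sketch, not the actual PHP rescore code; the function name and all weights are invented for illustration - the real, tuned profiles live in the ElasticSearchRescoreFunctions.php file linked above.

```python
import math

def rescore(match_score, incoming_links, sitelinks, is_disambig):
    # Hypothetical weights; the real hand-tuned values live in the
    # Wikibase rescore profile configuration.
    score = match_score
    score *= 1.0 + 0.4 * math.log1p(incoming_links)  # incoming link count boost
    score *= 1.0 + 0.2 * math.log1p(sitelinks)       # sitelink count boost
    if is_disambig:
        score *= 0.1  # demote disambiguation pages
    return score
```

The log scaling is one common way to keep very popular items from completely drowning out good text matches; the real profiles may use different functions entirely.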
[Wikidata] Wikidata fulltext search prototype
Hi! The Search Platform team would like to present a prototype test site of the new and improved Wikidata fulltext search: http://wikidata-wdsearch.wmflabs.org/wiki/Special:Search Please try your favorite searches on it and report whether it looks good and which problems you notice. Important to note for this prototype: - The data in the search is imported from the Wikidata index but not updated from it after import, so it may be slightly out of date - The search is in English by default but you can try other languages by using the uselang parameter, e.g.: http://wikidata-wdsearch.wmflabs.org/w/index.php?search=Wien=Special:Search=advanced=1=1=de Note that since it's a test site, this is probably the best way to test non-English searches, as logins etc. may not work there properly. - Search will work properly only for the main & property namespaces (0 and 120). What kinds of problems are we looking for? - Ranking and retrieval problems, i.e. result X appears too low or too high in a specific search, or does not appear at all (please tell us the specific search query and expected result) - UI problems - i.e. the ranking is fine but the highlighting or label or description is broken or looks bad, or the result that should be highlighted is not highlighted Of course, if some search result worked spectacularly better for you, it would be nice to know too :) What should work? Any search in Special:Search in the main namespace and Property namespace should produce sensible results. Searches without advanced syntax should have better results than before, and searches with advanced syntax (+, -, *, quotes, etc.) should work no worse than before. Please note that this is a test wiki, so nothing but search is expected to work, including clicking on other links, editing, browsing to other pages, etc. It is also a test site, so short disruptions are possible when we update or change things or fix bugs reported by you :) How to provide feedback? 
Several ways are possible: - Reply to this list or personally to me if you prefer - On-wiki message on my talk page: https://www.wikidata.org/wiki/User_talk:Smalyshev_(WMF) - Talk to us on IRC: #wikimedia-discovery Thanks! -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata HDT dump
Hi! > Somebody pointed me to the following issue: > https://phabricator.wikimedia.org/T179681 Unfortunately I'm not able > to log in there with the "Phabricator" so I cannot edit the issue > directly. I'm sending this email instead. Thank you, I've updated the task with references to your comments. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] RDF: All vs Truthy
Hi! > Can somebody please explain (in simple terms) what's the difference > between "all" and "truthy" RDF dumps? I've read the explanation > available on the wiki [1] but I still don't get it. Technically "truthy" is the set of statements with best non-deprecated rank for the property. Semantically, it is the value you most likely expect as the answer to a simple question "what is X of Y", like "what is the population of London" or "who is the wife of the US president?" > If I'm just a user of the data, because I want to retrieve > information about a particular item and link items with other > graphs... what am I missing/leaving-out by using "truthy" instead of > "all"? Historical data - i.e. current population vs. all historic population figures, current spouse vs.all previous marriages, current head of state vs. list of all people occupying the office. Some other data, possibly, such as official name vs. alias (provided that is expressed as a property), commonly accepted value vs. alternative possibilities, etc. > A practical example would be appreciated since it will clarify > things, I suppose. Current (as in, latest/best available for now) population of London would be found as "truthy" value (wdt), all other population figures - e.g. historical figures - will be under "all" (p/ps/psv). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
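The "best non-deprecated rank" rule described above can be sketched in a few lines of Python. The `truthy` helper and the dict layout are hypothetical, for illustration only - this is not Wikibase code:

```python
def truthy(statements):
    # "Truthy" = statements with the best non-deprecated rank:
    # if any statement is preferred, only the preferred ones are truthy;
    # otherwise all normal-rank ones are. Deprecated is never truthy.
    non_deprecated = [s for s in statements if s["rank"] != "deprecated"]
    if any(s["rank"] == "preferred" for s in non_deprecated):
        return [s for s in non_deprecated if s["rank"] == "preferred"]
    return non_deprecated
```

So for the population of London, a preferred current figure would be the only truthy (wdt:) value, while historical normal-rank figures remain reachable only through p:/ps:.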
[Wikidata-tech] Tracking internal uses of Wikidata Query Service
Hi! We are seeing more use of the Wikidata Query Service by Wikimedia projects. Which is excellent news, but the somewhat worse news is that the maintainers of WDQS do not have a good idea of what these services are, what their needs are, and so on. So, we have decided we want to start tracking internal uses of Wikidata Query Service. To that point, if you run any functionality on Wikimedia sites (Wikipedias, Wikidata, etc., anything with a wikimedia domain) that uses queries to the Wikidata Query Service, please go to: https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage and add your project there. That applies both if your project runs queries by itself in the background, and if it uses queries as part of a user interaction scenario. We do not include labs tools currently unless it is absolutely vital infrastructure (i.e. if it went down, would it substantially degrade the main site functionality or make some features unusable?) If you still feel we should know about a certain labs tool, please leave a note on the talk page. What's in it for you? We want to know this in order to better understand the scope of internal usage and as preparation for T178492 (creating an internal WDQS setup) - with the goal of providing internal users a more robust and more flexible service. We also want it to ensure we do not break anything important when we do maintenance, and so that we know who to talk to if some queries do not work as expected and we want to fix them. What do we want to know? - We'd like to have a general description of the functionality (i.e., what the service is for) - How to recognize queries run by it - user agent? source host? specific query pattern? some other mark? Ideally it should be possible to recognize them somehow - What kind of queries does it run (no need to list every possible one of course, but if there are typical cases it'd help to see them)? - How often do the queries run - is it periodic, and what is the expected/statistical usage if it's a user-driven tool? 
- Where can we see the code behind it, and who maintains it? - Feel free to add any other information about anything you think would be useful for us to know. What was that page again? https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage Thanks in advance, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata] Do you use the Wikidata entity dump dcatap.rdf?
Hi! > How about adding the RDF to query.wikidata.org so we can get a current > list? We could probably load the rdf we have now into Blazegraph relatively easily. Updating may be a bit tricky (should we delete historical items?) but it's possible to figure it out. I'll look into it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata-tech] Wikidata fulltext search results output
Hi! > When showing labels from fallback languages we do have little language > indicators in other places. I believe we should have this here as Makes sense. I'll look into how to get those. Is the language code OK, or do we need the full language name (uk vs. Ukrainian)? One thing to note here is that secondary languages have no order - i.e. if you look in German, and there's no matching German label, but there are 10 other language labels all the same (happens a lot for names & places), then which language will be selected is anybody's guess. We could add a rule that says "look at English as secondary first", in theory, but I am not sure whether we should - after all, besides having the most labels (and us speaking it :), there's not much special about it. > I'm slightly leaning toward showing both. OK. > I'd say in this case we could get rid of the word/byte count. To get a > good glimpse of the quality of the item I'd say we'd want to show > count of statements (excluding identifier statements), identifiers and > sitelinks. OK, I'll try to make this happen. >> 5. Display format for Wikidata and for other wikipedia sites is different: >> Wikipedia: >> >> Title >> Snippet >> >> Wikidata: >> >> Title: Description >> >> I.e. Wikipedia puts title on a separate line, while Wikidata keeps it on >> the same line, separated by colon. Is there any reason for this >> difference? Do we want to go back to the common format? > > Not sure if we had a reason tbh. OK then, I'll feel free to shuffle things around :) Having more freedom in the title line is good because we can then display both label & aliases. Thanks! -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
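The fallback behavior described above - display language first, then its fallback chain, then an effectively arbitrary secondary language - could look roughly like this. The function name and data layout are invented for illustration; this is not the actual Wikibase language-fallback code:

```python
def pick_label(labels, display_lang, fallback_chain=()):
    # Try the display language, then its fallback chain, in order.
    for lang in (display_lang, *fallback_chain):
        if lang in labels:
            return lang, labels[lang]
    # No ordered fallback matched: any remaining language may win,
    # which is why "which language will be selected is anybody's guess".
    if labels:
        lang = next(iter(labels))
        return lang, labels[lang]
    return None, None
```

Returning the language together with the label is what would let the UI show the little language indicator discussed above.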
Re: [Wikidata] Wikidata HDT dump
Hi! > OK. I wonder though, if it would be possible to setup a regular HDT > dump alongside the already regular dumps. Looking at the dumps page, > https://dumps.wikimedia.org/wikidatawiki/entities/, it looks like a > new dump is generated once a week more or less. So if a HDT dump > could True, the dumps run weekly. "More or less" situation can arise only if one of the dumps fail (either due to a bug or some sort of external force majeure). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata HDT dump
Hi! > The first part of the Turtle data stream seems to contain syntax errors > for some of the XSD decimal literals. The first one appears on line 13,291: > > ```text/turtle > <http://www.wikidata.org/value/ec714e2ba0fd71ec7256d3f7f7606c35> > <http://wikiba.se/ontology-beta#geoPrecision> > "1.0E-6"^^<http://www.w3.org/2001/XMLSchema#decimal> . I've added https://phabricator.wikimedia.org/T179228 to handle this. geoPrecision is a float value, and assigning the decimal type to it is a mistake. I'll review other properties to see whether we have more of this. Thanks for bringing it to my attention! -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
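For reference, the reason `"1.0E-6"^^xsd:decimal` is a syntax error: the xsd:decimal lexical space allows no exponent notation, while xsd:double does. A rough sketch of the two lexical spaces (simplified from XML Schema Part 2; the regexes and names here are illustrative, not taken from any validator library):

```python
import re

# xsd:decimal - optional sign, digits, optional fraction; no exponent.
DECIMAL = re.compile(r'^[+-]?(\d+(\.\d*)?|\.\d+)$')
# xsd:double - additionally allows exponent notation and INF/NaN specials.
DOUBLE = re.compile(r'^(-?INF|NaN|[+-]?(\d+(\.\d*)?|\.\d+)([Ee][+-]?\d+)?)$')

def valid_decimal(lexical):
    return bool(DECIMAL.match(lexical))
```

So "1.0E-6" is fine as xsd:double but would have to be written as "0.000001" to be a valid xsd:decimal.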
Re: [Wikidata] Wikidata HDT dump
Hi! > The first part of the Turtle data stream seems to contain syntax errors > for some of the XSD decimal literals. The first one appears on line 13,291: > > ```text/turtle > <http://www.wikidata.org/value/ec714e2ba0fd71ec7256d3f7f7606c35> > <http://wikiba.se/ontology-beta#geoPrecision> > "1.0E-6"^^<http://www.w3.org/2001/XMLSchema#decimal> . > ``` Could you submit a phabricator task (phabricator.wikimedia.org) about this? If it's against the standard it certainly should not be encoded like that. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata HDT dump
Hi! > I will look into the size of the jnl file but should that not be > located where the blazegraph is running from the sparql endpoint or > is this a special flavour? Was also thinking of looking into a gitlab > runner which occasionally could generate a HDT file from the ttl dump > if our server can handle it but for this an md5 sum file would be > preferable or should a timestamp be sufficient? Publishing the jnl file for Blazegraph may not be as useful as one would think, because a jnl file is specific to a particular vocabulary and certain other settings - i.e., unless you run the same WDQS code (which customizes some of these) of the same version, you won't be able to use the same file. Of course, since the WDQS code is open source, it may be good enough, so in general publishing such a file may be possible. Currently, it's about 300G uncompressed. No idea how much compressed. Loading it takes a couple of days on a reasonably powerful machine, more on labs ones (I haven't tried to load the full dump on labs for a while, since labs VMs are too weak for that). In general, I'd say it takes about 100M per million triples. Less if the triples reuse URIs, probably more if they contain a ton of text data. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
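The rule of thumb above ("about 100M per million of triples") as a back-of-the-envelope helper; the function name and the MiB reading of "100M" are assumptions, and real journal sizes vary with URI reuse and the amount of literal text:

```python
def estimate_jnl_bytes(triples, bytes_per_million=100 * 1024**2):
    # ~100M of Blazegraph journal per million triples, per the
    # estimate quoted above. Purely a rough sizing aid.
    return triples / 1_000_000 * bytes_per_million
```

For example, a dataset on the order of 3 billion triples comes out around 300,000 MiB, roughly consistent with the ~300G uncompressed figure mentioned above.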
Re: [Wikidata] Wikidata prefix search is now Elastic
Hi! > Thanks a lot Stas for this present. > Could you please share any pointers on how to integrate it into other > tools? It's the same API as before, wbsearchentities. If you need additional profiles - i.e., different scoring/filtering, talk to me and/or file phab task and we can look into it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] Wikidata prefix search is now Elastic
Hi! Wikidata’s birthday is still a few days away, but since there are no deployments on Sundays we’ll get started with an early present ;-) The Wikidata and Search Platform teams are happy to announce that Wikidata prefix search (aka the wbsearchentities API, aka the thing you use when you type into that box on the top right or any time you edit an item or property and use the selector widget) is now using a new and improved ElasticSearch backend. You should not see any changes except for relevancy and ranking improvements. Specifically improved are: - better language support (matches along the fallback chain and can also match in any language, with a lower score) - flexibility - we can now use Elasticsearch rescore profiles which can be tuned to take advantage of any fields we index for both matching and boosting, including link counts, statement counts, label counts, (some) statement values, etc. More improvements are coming soon in this area, e.g. scoring disambig pages lower, scoring units higher in the proper context, etc. - optimization - we no longer need to store all search data in both DB tables and Elastic indexes; all the data needed for search and retrieval of the results is stored in the Elastic index and retrieved in a single query. - maintainability - since it is now part of the general Wikimedia search ecosystem, it can be maintained together with the rest of the search mechanisms, using the same infrastructure, monitoring, etc. Please tell us if you have any suggestions or comments, or experience any problems with it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata-tech] Wikidata fulltext search results output
Hi! > while you are at it, some things would be very useful to be search-able > (maybe some are already by now): > * "primary" (not references/qualifiers) years, for birth/death/flourit etc. > * "primary" string/monolingual values (title, taxon name, etc.) > * "primary" IDs, e.g. VIAF (might cause confusion with years, so maybe > only add numerical IDs if 5+ digits?) We have the code to index statements already, and we're already indexing P31 and P279. We could index more properties. We don't have syntax or any other way though to actually use those in search - yet, except for boosting (see https://gerrit.wikimedia.org/r/#/c/384632/). We're looking at which properties to add (nominations welcome, probably in the form of phab ticket?) - since adding them requires full reindex of wikidata (couple of days) we probably don't want to add them one by one but want to collect a set and then do it in one hit. We also do not have syntax for searching (as in match, instead of boost) by statement values, but it should not be hard - we just need to design proper syntax and implement it (syntaxes are now pluggable, so should not be too big of a problem). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
[Wikidata-tech] Wikidata fulltext search results output
Hi! As I am working on improving Wikidata fulltext search[1], I'd like to talk about the search results page. Right now the search results page for Wikidata is less than ideal; here are the issues I see with it: - No match highlighting - Meaningless data, like word count (anybody care to guess what it is counting? Anybody ever used it?) and byte count (more useful than word count, but not by much) - Obviously, search quality is not super high, but that should be improved with proper description indexing While working on improving the situation, I would like to solicit opinions on a set of questions about what the search results page should look like. Namely: 1. If the match is made on a label/description that does not match the current display language, we could opt for: a) Displaying the description that matched, highlighted. Optionally maybe display the language of the match (in the display language?) b) Displaying the description in the display language, un-highlighted. Which option is preferable? 2. What do we do if the match is on an alias? Do we display the matching alias, the original label, or both? The question above also applies if the match is on another language's alias. 3. It looks clear to me that word count is useless. Is byte count useful and does it need to be kept? 4. Do we want to display any other parameters of the entity? E.g. we have in the index: statement_count, sitelink_count, label_count, incoming_links, etc. Do we want to display any? 5. Display format for Wikidata and for other wikipedia sites is different: Wikipedia: Title Snippet Wikidata: Title: Description I.e. Wikipedia puts the title on a separate line, while Wikidata keeps it on the same line, separated by a colon. Is there any reason for this difference? Do we want to go back to the common format? Also if you have any other things/ideas/comments about what fulltext search output for Wikidata should look like, please tell me. 
I am sending this to wikidata-tech and discovery team list only for now, since it's still work in progress and half-baked, we could open this for wider discussion later if necessary. [1] https://phabricator.wikimedia.org/T178851 Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata] Turning Lists to Wikidata
Hi! > when you say "wikidata is not well suited for lists data", you refer > to wikibase or WDQS here? Wikidata is not good for storing list data, or any serial data. WDQS can produce all kinds of amazing lists via queries, but it's not a primary data storage. In general, it could store series data, but since it's based off Wikidata and feeds from it, that creates certain issues when the data is not very suitable for Wikidata. > the data:Bea.gov/GDP by state.tab above is certainly a good > representation for efficient delivery (via json) and display of data. > but inefficient for further data sharing without URIs. The question of querying data like "GDP by state.tab" is an interesting one. I'm not sure whether a triple store would be a good medium, but maybe it could be... Needs some research on the idea. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata-tech] query with incomplete result set
Hi! On 10/3/17 4:49 PM, Marco Neumann wrote: > thank you Lucas and Stas, this works for me. > > so it would be fair to say that p:P39 by-passes the semantics of > wdt:P39 with ranking*. for my own understanding why is a wdt property > called a direct property**? Because wdt: links directly to the value, while p: links to a statement (and ps: links from the statement to the value). But that's not the only property of wdt: - another is that it links to the "truthy" (current, best, etc.) value - the one that has the best rank for this property (hence the "t" letter). This may be what you want or not, depending on general semantics and your particular case. For many properties, ranks do not play a significant role, since these properties do not change with time and do not have temporally limited statements. For those, using wdt: is always OK. For some, like positions, offices, relationships between humans, etc., the values can have temporal limits, and if you want the best/current one, you use wdt:, otherwise you use p:/ps:. If you still want to account for rank while using p:/ps:, there are rank triples and the wikibase:BestRank class (see https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Statement_representation). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata-tech] query with incomplete result set
Hi! On 10/3/17 4:02 PM, Marco Neumann wrote: > why doesn't the following query produce > http://www.wikidata.org/entity/Q17905 in the result set? The query asks for wdt:P39 wd:Q1939555, however current preferred value for P39 there is Q29576752. When the item has preferred value, only this value shows up in wdt. If you want all values, use something like: https://query.wikidata.org/#SELECT%20%3FMdB%20%3FMdBLabel%20WHERE%20%7B%0ASERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%3FMdB%20p%3AP39%20%3FxRefNode.%20%0A%3FxRefNode%20pq%3AP2937%20wd%3AQ30579723%3B%0A%20%20%20ps%3AP39%20wd%3AQ1939555.%0A%7D Or change "preferred" status on Q17905:P39. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Re: [Wikidata] Do you use the Wikidata entity dump dcatap.rdf?
Hi! > is anyone using the Wikidata entity dump dcatap.rdf at > https://dumps.wikimedia.org/wikidatawiki/entities/dcatap.rdf? > > It is very rarely used and is thus causing us a (probably) undue > maintenance burden, because of which we plan to remove it. What's the issue with it? I don't use it, but it seems to be part of a standard for dataset descriptions, so I wonder if the issues can be fixed. I don't know too much about it, but from the description it seems to be very automatable. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] Categories in RDF/WDQS
Hi! I'd like to announce that the category tree of certain wikis is now available as RDF dump and in Wikidata Query Service. More documentation is at: https://www.mediawiki.org/wiki/Wikidata_query_service/Categories which I will summarize shortly below. The dumps are located at https://dumps.wikimedia.org/other/categoriesrdf/. You can use these dumps any way you wish, data format is described at the link above[1]. The same dump is loaded into "categories" namespace in WDQS, which can be queried by https://query.wikidata.org/bigdata/namespace/categories/sparql?query=SPARQL. Sorry, no GUI support yet (probably will happen later). See example in the docs[2]. These datasets are not updated automatically yet, so they'll be up to date roughly for the date of the latest dump. Hopefully soon it will be automated and then the datasets will be updated daily. The list of currently supported wikis is here: https://noc.wikimedia.org/conf/categories-rdf.dblist - these are basically all 1M+ wikis and couple more that I added for various reasons. If you have a good candidate wiki to add, please tell me or write on the talk page for the document above. Please note this is only the first step for the project, so there might still be some rough edges. I am announcing it early since I think it would be useful for people to look at the dumps and SPARQL endpoint and see if something is missing or does not work properly, and share ideas on how it can be used. We plan eventually to use it for search improvement[3] - this work is still in progress. As always, we welcome any comments and suggestions. [1] https://www.mediawiki.org/wiki/Wikidata_query_service/Categories#Data_format [2] https://www.mediawiki.org/wiki/Wikidata_query_service/Categories#Accessing_the_data [3] https://phabricator.wikimedia.org/T165982 Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
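Since there is no GUI support yet, querying the categories namespace means building the request URL by hand, as in the `sparql?query=SPARQL` form mentioned above. A minimal sketch, assuming standard SPARQL-over-GET; the helper name is invented, and the actual network call is left out to keep the sketch offline:

```python
from urllib.parse import urlencode

CATEGORY_ENDPOINT = "https://query.wikidata.org/bigdata/namespace/categories/sparql"

def category_query_url(sparql):
    # SPARQL protocol GET request: the query text goes URL-encoded
    # in the "query" parameter.
    return CATEGORY_ENDPOINT + "?" + urlencode({"query": sparql})
```

The resulting URL can then be fetched with urllib.request or any HTTP client.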
[Wikidata] CHANGE: mwapi: prefix in WDQS changes underlying URI
Hi! In order to fix a compatibility issue (https://phabricator.wikimedia.org/T174930) and make the URI cleaner, the URI underlying the mwapi: prefix in WDQS queries will be changed to https://www.mediawiki.org/ontology#API/. This URI is not used anywhere in the data, only to designate parameters for MWAPI services[1]. Since this is not a change in data but only in service prefixes/URIs, there should not be any impact except for queries that use the full URI instead of the mwapi: prefix for calling the services - such queries will have to be updated. There's no real reason to do that, and I recommend always following the examples in the manual[1] instead, but out of an abundance of caution I am announcing this change anyway. The change will likely be deployed next Monday, in the usual WDQS deployment window[2]. [1] https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual/MWAPI [2] https://wikitech.wikimedia.org/wiki/Deployments -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Coordinate precision in Wikidata, RDF & query service
Hi! > The reason why we save the actual value with more digits than the > precision (and why we keep the precision as an explicit value at all) is > because the value could be entered and displayed either as decimal > digits or in minutes and seconds. So internally one would save 20' as > 0.3, but the precision is still just 2. This allows to roundtrip. > > I hope that makes any sense? Yes, for primary data storage (though roundtripping via limited-precision doubles is not ideal, but I guess good enough for now). But for secondary data/query interface, I am not sure 0.3 is that useful. What would one do with it, especially in SPARQL? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Coordinate precision in Wikidata, RDF & query service
Hi! > I am not sure I understand the issue and what the suggestion is to solve > it. If we decide to arbitrarily reduce the possible range for the Well, there are actually several issues right now. 1. Our RDF output produces coordinates with more digits than the specified precision of the actual value. 2. Our precision values as specified in wikibase:geoPrecision seem to make little sense. 3. We may represent the same coordinates for objects located in the same place as different ones, because the precision values are kind of chaotic. 4. We may have data different from other databases because our coordinate is over-precise. (1) is probably the easiest to fix. (2) is a bit harder, and I am still not sure how wikibase:geoPrecision is used, if at all. (3) and (4) are less important, but it would be nice to improve them, and maybe they will be mostly fixed once (1) and (2) are fixed. But before approaching the fix, I wanted to understand what the expectations for precision are and whether there can or should be some limits. Technically, it doesn't matter too much - except that some formulae for distances do not work well for high precisions because of the limited accuracy of 64-bit doubles, but there are ways around that. So technically we can keep 9 digits or however many we need, if we wanted to. I just wanted to see if we should. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
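One possible fix for issue (1) above - not emitting more digits than the precision justifies - is to snap each coordinate component to the stated precision before output. A hypothetical sketch, not the actual Wikibase RDF code; the function name is invented:

```python
def apply_precision(value, precision):
    # Snap a coordinate component to its stated precision, so e.g.
    # precision 0.01 yields at most two meaningful decimal degrees.
    # A falsy/zero precision leaves the value untouched.
    if not precision:
        return value
    return round(value / precision) * precision
```

This also helps with issues (3) and (4): two over-precise coordinates for the same place snap to the same output value once both are rounded to their precision.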