mkroetzsch added a comment.
@Lydia_Pintscher Are you asking about the discrepancy in the counts, or about
the general idea of this issue report?
I must admit that I do not get the significance of the SPARQL queries above.
The missed properties seem to exist and work as expected.
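For instance, a quick way to check that a given property is declared and usable (P1234 is a placeholder id) is an ASK query against the endpoint; this is a sketch, not the queries discussed above:

  PREFIX wd: <http://www.wikidata.org/entity/>
  PREFIX wikibase: <http://wikiba.se/ontology#>
  ASK {
    wd:P1234 wikibase:propertyType ?type .   # true iff the property is declared
  }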
mkroetzsch added a comment.
In T244341#5862287 <https://phabricator.wikimedia.org/T244341#5862287>,
@Jheald wrote:
> Please don't think or refer to the blank nodes as "unknown values".
I fully agree. The use of the word "unknown" in the UI was a mistake.
mkroetzsch added a comment.
Hi,
Using the same value for "unknown" is a very bad idea and should not be
considered. You already found out why. This highlights another general design
principle: the RDF data should encode meaning in structure in a direct way. If
two triples have
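To illustrate the point (all identifiers below are invented for the example): with a single shared "unknown" resource, unrelated statements suddenly join, while a fresh blank node per unknown value lets the structure itself carry the meaning.

  @prefix wdt: <http://www.wikidata.org/prop/direct/> .
  @prefix ex:  <http://example.org/> .

  # Bad (hypothetical design): one fixed resource for every unknown value.
  ex:Q1 wdt:P22 ex:UnknownValue .   # father of Q1 unknown
  ex:Q2 wdt:P22 ex:UnknownValue .   # father of Q2 unknown
  # Any join over the object now "shows" that Q1 and Q2 have the same father.

  # Good: a distinct blank node per unknown value.
  ex:Q1 wdt:P22 _:u1 .
  ex:Q2 wdt:P22 _:u2 .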
mkroetzsch added a comment.
CC0 seems to be fine. Using the same license as for the rest seems to be the
easiest choice for everybody.
mkroetzsch added a comment.
Well, for classes and properties, one would use owl:equivalentClass and owl:equivalentProperty rather than owl:sameAs to encode this point. But I agree that this will hardly be considered by any consumer.
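For the record, a sketch of the encoding (the ex: URIs on the right stand in for some external vocabulary):

  @prefix owl: <http://www.w3.org/2002/07/owl#> .
  @prefix wd:  <http://www.wikidata.org/entity/> .
  @prefix ex:  <http://example.org/> .

  wd:Q5   owl:equivalentClass    ex:Human .      # class-level equivalence
  wd:P571 owl:equivalentProperty ex:inception .  # property-level equivalence
  # owl:sameAs, in contrast, asserts identity of two individuals.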
mkroetzsch added a comment.
This is good news -- thanks for the careful review! The lack of specific threat models for this data was also a challenge for us, for similar reasons, but it is also a good sign that many years after the first SPARQL data releases, there is still no realistic danger
mkroetzsch added a comment.
Hi,
The code is here: https://github.com/Wikidata/QueryAnalysis
It was not written for general re-use, so it might be a bit messy in places. The code includes the public Wikidata example queries as test data that can be used without accessing any confidential
mkroetzsch added a comment.
I agree with Stas: regular data releases are desirable, but need further thought. The task is easier for our current case since we already know what is in the data. For a regular process, one has to be very careful to monitor potential future issues. By releasing
mkroetzsch added a comment.
@AndrewSu As I just replied to Benjamin Good on this matter, it is a bit too early for this, since we have only very recently obtained basic technical access. We have not had a chance to extract any community-shareable data sets yet, and it is clear
mkroetzsch added a comment.
Re parsing strings: You are skipping the first step here. The question is not
which format is better for advanced interpretation, but which format is
specified at all. Whatever your proposal is, I have not seen any //syntactic//
description of it yet
mkroetzsch added a comment.
+1 sounds like a workable design
mkroetzsch added a comment.
Re chemical markup for semantics: this is true for Wikitext, where you cannot
otherwise know that "C" is carbon. It does not apply to Wikidata, where you
already get the same information from the property used. Think of
https://phabricator.wikimedi
mkroetzsch added a comment.
I really wonder if the introduction of all kinds of specific markup languages
in Wikidata is the right way to go. We could just have a Wikitext datatype,
since it seems that Wikitext became the gold standard for all these special
data types recently. Mark-up over
mkroetzsch added a comment.
> The MathML expression includes the TeX representation, which can be used in
> LaTeX documents and also to create new statements.
That would address the conversion back from MathML to TeX. With this in place,
we could indeed use MathML in JSON and RDF, if we
mkroetzsch added a comment.
The format should be the same as in JSON. If MathML is preferred there, then
this is fine with me. If LaTeX is preferred, we can also use this. It seems
that MathML would be a more reasonable data exchange format, but Moritz was
suggesting in his emails that he does
mkroetzsch added a comment.
In https://phabricator.wikimedia.org/T99820#1820662, @daniel wrote:
> Looking at the link, it seems to me we'd (trivially) meet these requirements.
Yes, that's what I meant. :-)
> But I'm not sure about the fine details, e.g. regarding the versi
mkroetzsch added a comment.
> ...and if we consider our data dump to be an ontology, then what isn't an
> ontology?
The word "ontology" has different meanings in different contexts. Here, we only
mean the notion of "ontology" meant by the term owl:Ontology as used
mkroetzsch added a comment.
I don't want to detail every bit here, but it should be clear that one can
easily eliminate the dependency on $db in the formatter code. The Sites object
I mentioned is an example. It is *not* static in our implementation. You can
make it an interface. You can
mkroetzsch added a comment.
@daniel As long as it works for you, this is all fine by me, but in my
experience with PHP this could cost a lot of memory, which could be a problem
for the long item pages that already caused problems in the past.
> But it requires the serialization and formatt
mkroetzsch added a subscriber: mkroetzsch.
mkroetzsch added a comment.
Structurally, this would work, but it seems like a very general solution with a
lot of overhead. I am not sure that this pattern works well in PHP, where the
cost of creating additional objects is huge. I also wonder whether
mkroetzsch added a comment.
This was a suggestion we came up with when discussing during WikiCon. People
are asking for a way to edit the data they pull into infobox templates.
Clearly, doing this in place will be a long-term effort that needs a
complicated solution and many more design
mkroetzsch added a comment.
Note that this discussion is no longer just about the wdt property values
(called "truthy" above). Simple values are now used on several levels in the
RDF encoding.
In general, the same argument as for coordinates applies: if we cannot do it
right, t
mkroetzsch added a comment.
If we could distinguish type quantity properties that require a unit from those
that do not allow units, there would be another option. Then we could use a
compound value as the "simple" value for all properties with unit to simulate
the missing
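A sketch of what such a compound "simple" value could look like (the item ex:Q1 and value node ex:v1 are invented; wikibase:quantityAmount and wikibase:quantityUnit are the vocabulary used on full value nodes):

  @prefix wdt: <http://www.wikidata.org/prop/direct/> .
  @prefix wikibase: <http://wikiba.se/ontology#> .
  @prefix wd:  <http://www.wikidata.org/entity/> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
  @prefix ex:  <http://example.org/> .

  ex:Q1 wdt:P2044 ex:v1 .                            # elevation above sea level
  ex:v1 wikibase:quantityAmount "8849"^^xsd:decimal ;
        wikibase:quantityUnit   wd:Q11573 .          # metre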
mkroetzsch added a comment.
I think the discussion now lists all main ideas on how to handle this in RDF,
but most of them are not feasible because of the very general way in which
Wikibase implements unit support now. Given that there is no special RDF
datatype for units and given that we
mkroetzsch added a comment.
Including more data (within reason) will not be a problem (other than a
performance/bandwidth problem for your servers).
However, if there are further ideas and small improvements that will take time
to implement, it would be good to switch to "dump" as t
mkroetzsch added a comment.
Data on the referenced entities does not have to be included as long as one can
get this data by resolving these entities' URIs. However, some basic data
(ontology header, license information) should be in each single entity export.
mkroetzsch added a subscriber: mkroetzsch.
mkroetzsch added a comment.
On the mailing list, Stas brought up the question "which RDF" should be
delivered by the linked data URIs by default. Our dumps contain data in
multiple encodings (simple and complex), and the PHP code can create
mkroetzsch added a subscriber: mkroetzsch.
mkroetzsch added a comment.
As another useful feature, this will also allow us to have our SPARQL endpoint
monitored at http://sparqles.ai.wu.ac.at/. Basic registration should not be too
much work; please look into it (I don't want to create an account
mkroetzsch added a comment.
It seems that the Web API for wbeditentities is also returning empty lists when
creating new items (at least on test.wikidata.org). Is this the same bug or a
different component?
mkroetzsch added a comment.
If not dropped, then it should be fixed. The value "1" (a string literal) is
not correct. Units should be represented by URIs, not by literals.
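That is, roughly (a sketch; the amount shown is made up):

  @prefix wikibase: <http://wikiba.se/ontology#> .
  @prefix wd:  <http://www.wikidata.org/entity/> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

  # not:  wikibase:quantityUnit "1"        (a string literal)
  # but a URI, e.g. the item for the unit, or a dedicated "no unit" item:
  _:v wikibase:quantityAmount "8849"^^xsd:decimal ;
      wikibase:quantityUnit   wd:Q11573 .  # metre, identified by URI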
mkroetzsch added a comment.
While I did say that pretty much all URIs I know use http, I do not have any
reason to believe that https would cause problems. It may not be as extensively
tested, but in most contexts it should work fine.
A bigger issue is that some people are already using our
mkroetzsch added a comment.
In https://phabricator.wikimedia.org/T95316#1373937, @Lydia_Pintscher wrote:
> Are there any differences we're missing? Are we ok with these differences?
I will do a complete review of the updated RDF mapping in the course of the
next week. I will report back
mkroetzsch added a comment.
> we once planned a popup box with links to the various formats. It would be
> shown when you click on the Q-id in the title.
A pop-up box is a good solution if there are several options, but the Q-id is
not a good place to trigger it, since it gives no hint
mkroetzsch added a comment.
I think this is a useful change if you want Wikibase sites to be able to refer
to other Wikibase sites. In WDTK, all of our EntityId objects are external,
of course. A lesson learned for us was that it is not enough to know the base
URI in all cases. You sometimes
mkroetzsch added a comment.
A big advantage of the numbers is that you can search for values where the
precision is at least a certain value (e.g., dates with precision day or
above). This would be lost when using URIs.
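For example, assuming the numeric encoding with wikibase:timePrecision (where 11 stands for day precision), such a query is a simple range filter:

  PREFIX p:   <http://www.wikidata.org/prop/>
  PREFIX psv: <http://www.wikidata.org/prop/statement/value/>
  PREFIX wikibase: <http://wikiba.se/ontology#>

  SELECT ?item ?time WHERE {
    ?item p:P569/psv:P569 ?v .            # date of birth, full value node
    ?v wikibase:timeValue     ?time ;
       wikibase:timePrecision ?prec .
    FILTER(?prec >= 11)                   # day precision or better
  }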
mkroetzsch added a comment.
@Jc3s5h You are right that date conversion only makes sense in a certain range.
I think the software should disallow day-precision dates in prehistoric eras
(certainly everything before -1). There are no records that could possibly
justify this precision
mkroetzsch added a comment.
Sounds good.
I am not aware of any best practice re http vs. https, but all URIs I know
use http as the protocol.
mkroetzsch added a comment.
I agree with the proposal of @Smalyshev.
mkroetzsch added a comment.
@daniel Changing the base URIs does not work as a way to communicate breaking
changes to users of RDF. You can change them, but there is no way to make users
notice the change, and it will just break a few more queries. It's just not
how RDF works. Most of our
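To make the failure mode concrete (the "old" base URI below is hypothetical):

  # A query written against the pre-change URIs:
  PREFIX wdt: <http://old.wikidata.org/prop/direct/>
  SELECT ?x WHERE { ?x wdt:P31 ?class }
  # After the change, this returns an empty result -- no parse error,
  # no warning. The query just silently stops matching anything.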
mkroetzsch added a comment.
@Smalyshev You comment on my Item 1 by referring to BlazeGraph and Virtuoso.
However, my Item 1 is about reading Wikidata, not about exporting to RDF. Your
concerns about BlazeGraph compatibility are addressed by my Item 2. I hope this
clarifies this part
mkroetzsch added a comment.
@Smalyshev
Re halting the work on the query engine/producer code now: the WDTK RDF
exports are generated based on the original specification. There is no
technical issue with this, and it does not block development to do just this.
The reason we are in a blocker
mkroetzsch added a comment.
> @mkroetzsch I already listed a few of the tools that implement XSD 1.0 style
> BCE years and I read your answer as to say that you know of no tools that
> implement XSD 1.1 style BCE years.
Then you misread my answer. Almost all tools that exist today use the 2000
mkroetzsch added a comment.
@Smalyshev We really want the same thing: move on with minimal disturbance as
quickly as possible. As you rightly say, the data we generate right now is not
meant for production use but for testing. We must make sure that our production
environment will understand
mkroetzsch added a comment.
@Smalyshev P.S. Your finding of years in our Virtuoso instance is quite
peculiar given that this endpoint is based on RDF 1.0 dumps as they are
currently generated in WDTK using this code:
https://github.com/Wikidata/Wikidata-Toolkit/blob
mkroetzsch added a comment.
@Smalyshev @Lydia_Pintscher Dates without years should not be allowed by the
time datatype. They are impossible to order, almost impossible to query, and
they do not have any meaning whatsoever in combination with a preferred
calendar model. All the arguments @Denny
mkroetzsch added a comment.
> @mkroetzsch Do you know of some widely used software that implements XSD 1.1
> handling of BCE dates?
Many applications that process dates are based on ISO rather than on XSD.
Java's SimpleDateFormat class, for example, is based on ISO and thus interprets
year
mkroetzsch added a comment.
Note that all current data representation formats assume that
-01-01T00:00:00 is a valid representation:
- XML Schema 1.1: http://www.w3.org/TR/xmlschema11-2/#dateTime
- RDF 1.1: http://www.w3.org/TR/rdf11-concepts/#section-Datatypes
- OWL 2: http://www.w3.org/TR
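The point of divergence is how a negative year in such a literal is read (the concrete date below is made up for illustration, with the usual xsd: prefix):

  "-0044-03-15T00:00:00Z"^^xsd:dateTime
  # XSD 1.0 (no year zero):                -0044 denotes 44 BCE
  # XSD 1.1 (proleptic, with a year zero): -0044 denotes 45 BCE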
mkroetzsch added a comment.
> Don't see why it would be this many. It'd be like 4 additional rows per
> property:
I was referring to the labels. For some use cases, it could be convenient if
each of the property variants also had the rdfs:label of the property
item. For example, RDF
mkroetzsch added a comment.
> we don't know what year it was but it was July 4th
Ouch. Where has this been designed? Can you point to the specification of this?
@Denny, is this intended? Dates without a year are extremely hard to handle in
queries and don't work at all like the normal dates
mkroetzsch added a comment.
All RDF tools should be able to handle resources without labels (whether used
as subject, predicate, or object). But data browsers or other UIs will
simply show the URL (or an automatically created abbreviated version of it) to
the user. So instead of instance
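Query authors can account for missing labels by making them optional, e.g.:

  PREFIX wd:   <http://www.wikidata.org/entity/>
  PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  SELECT ?item ?label WHERE {
    ?item wdt:P31 wd:Q5 .
    OPTIONAL { ?item rdfs:label ?label . FILTER(lang(?label) = "en") }
  }
  # Items without an English label are still returned; ?label stays unbound.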
mkroetzsch added a subscriber: mkroetzsch.
mkroetzsch added a subscriber: Lydia_Pintscher.
mkroetzsch added a comment.
Yes, the discussion on SPARQL has converged surprisingly quickly to the view
that XSD 1.1 is both normative and intended in SPARQL 1.1 (by the way, I can
only recommend this list if you have SPARQL questions
mkroetzsch added a comment.
@Smalyshev Yes, this is what I was saying. @hoo was proposing to create a
special directory for truthy based on offline discussion in the office.
mkroetzsch added a comment.
@Smalyshev Yes, using lower-case local names for properties is a widely used
convention and we should definitely follow it for our ontology. However, I
would rather not change the case of our P1234 property ids when they occur in
property URIs, since Wikibase ids
mkroetzsch added a comment.
@daniel Changing URIs of the ontology vocabulary is silently producing wrong
results as well. I understand the problems you are trying to solve. I am just
saying that changing the URIs does not actually solve them.
@adrianheine You are right. My example was less
mkroetzsch added a comment.
@hoo Thanks for the heads up! I do have comments.
(1) I would remove the full and truthy distinction from the path and rather
make this part of the dump type (for example statements and
truthy-statements). The reason is that we have many full dumps (terms
mkroetzsch added a comment.
@Lydia_Pintscher I understand this problem, but if you put different dumps for
different times all in one directory, won't this become quite big over time and
hard to use? Maybe one should group dumps by how often they are created (and
have date-directories only
mkroetzsch added a comment.
> All of these dumps will be generated by exporting from the DB.
Why would one want to do this? The JSON dump contains all information we need
for building the other dumps, and it seems that the generation from the JSON
dump is much faster, avoids any load on the DB
mkroetzsch added a comment.
@Smalyshev
Re what does consistent mean: to be based on the same input data. All dumps
are based on Wikidata content. If they are based on the same content, they are
consistent, otherwise they are not.
Re discussing RDF dump partitioning in
https
mkroetzsch added a comment.
@JanZerebecki:
Re using the same code: that's not essential here. All we want is that the
dumps are the same. It's also not necessary to develop the code twice, since it
is already there twice anyway. It's just a question of whether we want to use a
slow method
mkroetzsch added a comment.
> is there any existing ontology we may want to use to create such links
> between entity:P1234 and v:P1234 or q:P1234? Or should we just invent our own?
We would have to make new URIs here. This depends on which/how many variants of
RDF property URIs we use: we
mkroetzsch added a comment.
> Also, it was suggested that we may want to change the fact that we use
> entity:P1234 in link Entity-Statement and give it a distinct URL. However,
> then it is not clear what would be the link between entity:P1234 and the rest
> of the data.
This is a good point
mkroetzsch added a comment.
@daniel It makes sense to use wikibase rather than wikidata, but I don't think
it matters very much at all. We should just define it sooner rather than later.
As for the versioning, I don't see how to convince you. Four more attempts:
- Try to apply your proposal
mkroetzsch added a comment.
@daniel: Have you wondered why XML Schema decided against changing their URIs?
It is by far the most disruptive thing that you could possibly do. Ontologies
don't work like software libraries where you download a new version and build
your tool against it, changing
mkroetzsch added a comment.
Hi Daniel.
Good point, I agree that this should change. A URL based on wikiba.se seems to
be the best. I don't think we need to worry about domain ownership here (why
would anybody sell this domain? Is it not WMF-owned?)
I think it is not a good idea to change
mkroetzsch created this task.
mkroetzsch added subscribers: mkroetzsch, Lydia_Pintscher.
mkroetzsch added a project: Wikibase-DataModel-Serialization.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
The XML dumps of Wikidata contain many JSON serialization errors where
mkroetzsch added a comment.
In https://phabricator.wikimedia.org/T89949#1052731, @daniel wrote:
> Nik tells me that the HA features in Virtuoso are only available in the
> closed source enterprise version. That basically means WMF is not going to
> use it in production.
Yes, I guessed
mkroetzsch added a comment.
The RDF should certainly contain information about the entity type of exported
data. This is essential to ensure that the RDF data contains all the
information that is found in the JSON (other than the ordering). As I read it,
things that are of rdf:type Item
mkroetzsch added a comment.
Our primary goal is to encode the JSON information in RDF, and possibly to
enrich this information where it makes sense in an RDF-context (e.g., by adding
links to other datasets). The JSON data includes the entity type, so it is
clear that we want to encode
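A minimal sketch of such typing, assuming classes wikibase:Item and wikibase:Property in our ontology:

  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix wikibase: <http://wikiba.se/ontology#> .
  @prefix wd:  <http://www.wikidata.org/entity/> .

  wd:Q42 rdf:type wikibase:Item .
  wd:P31 rdf:type wikibase:Property .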
mkroetzsch added a comment.
Thanks for adding Denny. Long reply, but details matter here.
I agree that there are different things one could talk about (document, real
thing). However, for now I am mainly interested in talking about the latter,
since this should be our primary concern
mkroetzsch added a comment.
Now my reply was so long that the ticket has already been closed in the
meantime :-D Anyway, those are my two (or more) cents on this topic ;-) I don't
think the paper goes into these topics very much (as they are not so much
technical as philosophical).
mkroetzsch added a comment.
I think json should be in the path somewhere. It does not have to be at the
top-level, but it would not be good if dump files of one type end up in their
own directory. The only way for tools to detect and download dumps
automatically is to look at the HTML
mkroetzsch added a comment.
I don't know about the details of the import task discussed here, but for the
record: we are happy to support this use of WDTK by helping to update our
implementation where necessary.
mkroetzsch added a comment.
In https://phabricator.wikimedia.org/T86278#969184, @Multichill wrote:
> I would like to turn it around. We should support indexing everything:
> ...
> The fact that we're not creative enough to make up queries for everything
> doesn't mean it isn't useful.
I have
mkroetzsch added a comment.
@Smalyshev My suggestion was just about the surface appearance, not about the
inner workings. I am saying that the following two phrases have the same
structure:
- Find things with a *sitelink* that *has badge* *featured*.
- Find things with a *population* that has
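In SPARQL the two phrasings also come out structurally parallel; a sketch, assuming Q17437796 as the "featured article" badge item, P1082 population, and P585 point in time (two separate queries):

  PREFIX schema: <http://schema.org/>
  PREFIX wikibase: <http://wikiba.se/ontology#>
  PREFIX wd: <http://www.wikidata.org/entity/>
  PREFIX p:  <http://www.wikidata.org/prop/>
  PREFIX pq: <http://www.wikidata.org/prop/qualifier/>

  # Things with a sitelink that has badge "featured":
  SELECT ?item WHERE {
    ?article schema:about ?item ;
             wikibase:badge wd:Q17437796 .
  }
  # Things with a population statement that has a point-in-time qualifier:
  SELECT ?item WHERE {
    ?item p:P1082 ?stmt .
    ?stmt pq:P585 ?date .
  }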
mkroetzsch added a comment.
> This is not correct, original structure can be recovered
Then I misunderstood the transformation that was proposed. My impression was
that a statement with three qualifier snaks: P1 V1, P1 V2, P2 V3 would be
stored as two
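My concern, as a sketch (all identifiers below are placeholders):

  @prefix pq: <http://www.wikidata.org/prop/qualifier/> .
  @prefix ex: <http://example.org/> .

  # One statement node with qualifier snaks P1 V1, P1 V2, P2 V3:
  ex:stmt pq:P1 ex:V1 , ex:V2 ;
          pq:P2 ex:V3 .
  # If this were instead split into two statement nodes, say {P1 V1, P2 V3}
  # and {P1 V2}, the original grouping could not be read off the triples.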
mkroetzsch added a comment.
@JanZerebecki I understand what you are saying about what indexing means
here. Makes sense to me. What you are saying about my example query sounds as
if you are planning to implement query execution manually. I hope this is not
the case and you can just give
mkroetzsch added a comment.
@Smalyshev My point is merely that sitelinks and labels //can// be handled like
statements. Since statements must be supported anyway, it would be sensible to
reuse the data structures and query expressions defined for them. I don't think
that confusion is likely