Re: [Wikidata-tech] Missing documentation of Wikibase Lexeme data model

2018-12-11 Thread Daniel Kinzler
On 11.12.18 at 10:38, Antonin Delpeuch (lists) wrote:
> One way to generate a JSON schema would be to use Wikidata-Toolkit's
> implementation, which can generate a JSON schema via Jackson. It could
> be used to validate the entire data model.

While a schema is nice, it's more important to have documentation that defines the
contract - that is, the intended semantics and guarantees.
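
For illustration, a minimal sketch of how a consumer could apply such a generated
schema, once it exists (Python, assuming the jsonschema package; the schema and
entity file names are hypothetical):

# Sketch only: validate one entity document against a generated JSON schema.
import json
import jsonschema

with open("lexeme.schema.json") as f:   # hypothetical output of Wikidata Toolkit
    schema = json.load(f)

with open("L42.json") as f:             # hypothetical entity document to check
    entity = json.load(f)

try:
    jsonschema.validate(instance=entity, schema=schema)
    print("document matches the schema")
except jsonschema.ValidationError as e:
    print("schema violation:", e.message)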

-- 
Daniel Kinzler
Principal Software Engineer, Core Platform
Wikimedia Foundation

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Missing documentation of Wikibase Lexeme data model

2018-12-11 Thread Daniel Kinzler
On 11.12.18 at 08:38, Jakob Voß wrote:
> Hi,
>
> I just noted that the official description of the Wikibase data model at
>
> https://www.mediawiki.org/wiki/Wikibase/DataModel
>
> and the description of JSON serialization lack a description of Lexemes, 
> Forms,
> and Senses.
The abstract model for Lexemes is here:
https://www.mediawiki.org/wiki/Extension:WikibaseLexeme/Data_Model

The RDF binding is here:
https://www.mediawiki.org/wiki/Extension:WikibaseLexeme/RDF_mapping

Looks like documentation for the JSON binding is indeed missing.

-- 
Daniel Kinzler
Principal Software Engineer, Core Platform
Wikimedia Foundation

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] lexeme fulltext search display

2018-06-18 Thread Daniel Kinzler
On 18.06.2018 at 19:25, Stas Malyshev wrote:
> 1. What the link will be pointing to? I haven't found the code to
> generate the link to specific Form.

You can use an EntityTitleLookup to get the Title object for an EntityId. In
case of a Form, it will point to the appropriate section. You can use the
LinkRenderer service to make a link. Or you can use an EntityIdHtmlLinkFormatter,
which should do the right thing. You can get one from an
OutputFormatValueFormatterFactory.

-- daniel

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] lexeme fulltext search display

2018-06-18 Thread Daniel Kinzler
Hi Stas!

Your proposal is pretty much what I envision.

On 14.06.2018 at 19:39, Stas Malyshev wrote:
> I plan to display Lemma match like this:
> 
> title (LN)
> Synthetic description
> 
> e.g.
> 
> color/colour (L123)
> English noun
> 
> Meaning, the first line with link would be standard lexeme link
> generated by Lexeme code (which also deals with multiple lemmas) and the
> description line is generated description of the Lexeme - just like in
> completion search.

Sounds perfect to me.

> The problem here, however, is since the link is
> generated by the Lexeme code, which has no idea about search, we can not
> properly highlight it. This can be solved with some trickery, probably,
> e.g. to locate search matches inside generated string and highlight
> them, but first I'd like to ensure this is the way it should be looking.

Do we really need the highlight? It does not seem critical to me for this use
case. Just "nice to have".

> More tricky is displaying the Form (representation) match. I could
> display here the same as above, but I feel this might be confusing.
> Another option is to display Form data, e.g. for "colors":
> 
> color/colour (L123)
> colors: plural for color (L123): English noun

I'd rather have this:

 colors/colours (L123-F2)
 plural of color (L123): English noun

Note that in place of "plural", you may have something like "3rd person,
singular, past, conjunctive", derived from multiple Q-ids.

> The description line features matched Form's representation and
> synthetic description for this form. Right now the matched part is not
> highlighted - because it will otherwise always be highlighted, as it is
> taken from the match itself, so I am not sure whether it should be or not.

Again, I don't think any highlighting is needed.

But, as you know, it's all up to Lydia to decide :)

-- daniel

-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Fastest way (API or whatever) to verify a QID

2018-05-15 Thread Daniel Kinzler
You can do this via the API, e.g.:

https://www.wikidata.org/w/api.php?action=query==json=Q1|Qx|Q1003|Q66=1

Note that this uses QIDs directly as page titles. This works on Wikidata, but may
not work on all Wikibase instances. It also does not work for PIDs: for these,
you have to prefix the Property namespace, as in Property:P31.
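
A minimal sketch of that first approach (Python with the requests package; the
query API reports unknown titles with a "missing" flag):

# Sketch only: check which of the given IDs exist as pages on wikidata.org.
import requests

ids = ["Q1", "Q1003", "Q66"]
resp = requests.get("https://www.wikidata.org/w/api.php", params={
    "action": "query",
    "titles": "|".join(ids),   # QIDs double as page titles on Wikidata
    "format": "json",
}).json()

for page in resp["query"]["pages"].values():
    print(page["title"], "missing" if "missing" in page else "exists")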

A more Wikibase-native way would be to use the wbgetentities API, as in
https://www.wikidata.org/w/api.php?action=wbgetentities=Q42|Q64=

However, this API fails when you provide a non-existent ID, without providing
any information about the other IDs. So you can quickly check whether all the IDs
you have are OK, but you may need several calls to get a list of all the bad IDs.

That's rather annoying for your use case. Feel free to file a ticket on
phabricator.wikimedia.org. Use the Wikidata tag. Thanks!

-- daniel

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Search on Wikibase/Wikidata sans CirrusSearch?

2017-12-30 Thread Daniel Kinzler
Yes, it's supposed to work, see FingerprintSearchTextGenerator and
EntityContent::getTextForSearchIndex

Am 30.12.2017 um 06:47 schrieb Stas Malyshev:
> Hi!
> 
> I wonder if anybody have run/is running Wikibase without CirrusSearch
> installed and whether the fulltext search is supposed to work in that
> configuration? The suggester/prefix search, aka wbsearchentities, works
> ok, but I can't make fulltext aka Special:Search find anything on my VM
> (which very well may be a consequence of me messing up, or some bug, or
> both :)
> So, I wonder - is it *supposed* to be working? Is anybody using it this
> way and does anybody care for such a use case?
> 
> Thanks,
> 


___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Does a rollback also roll back revision history?

2017-07-31 Thread Daniel Kinzler
On 31.07.2017 at 17:01, Eric Scott wrote:
> * Is is indeed the case that rollbacks also roll back the revision history?

No. All edits are visible in the page history, including rollback, revert,
restore, undo, etc. The only kind of edit that is not recorded is a "null edit"
- an edit that changes nothing compared to the previous version (so it's not
actually an edit). This is sometimes used to rebuild cached derived data.

> * Is there some other place we could look that records such rollbacks?

No. The page history is authoritative. It reflects all changes to the page
content. If you could find a way to trigger this kind of behavior, that would be
a HUGE bug. Let us know.

Note that for wikitext content, this doesn't mean that it contains all changes
to the visible rendering: when a transcluded template is changed, this changes
the rendering, but is not visible in the page's history (but it is instead
visible in the template's history). However, no transclusion mechanism exists
for Wikidata entities.

-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Wikibase and PostgreSQL

2017-04-10 Thread Daniel Kinzler
Hi Denis!

Sorry for the late response.

The information is in the installation requirements, see
<https://www.mediawiki.org/wiki/Extension:Wikibase_Repository#Requirements>.

Where did you expect to find it? Perhaps we can add it in some more places to
avoid confusion and frustration. In the README file, maybe?

-- daniel

On 06.03.2017 at 09:05, Denis Rykov wrote:
> Hello!
> 
> It looks like Wikibase extension is not compatible with PostgreSQL backend.
> There are many MySQL specific code in sql scripts (e.g. auto_increment, 
> varbinary).
> How about to add this information to Wikibase docs?
> //


-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] [Wikidata] Significant change: new data type for geoshapes

2017-03-29 Thread Daniel Kinzler
On 29.03.2017 at 15:19, Luca Martinelli wrote:
>> One thing to note: We currently do not export statements that use this
>> datatype to RDF. They can therefore not be queried in the Wikidata Query
>> Service. The reason is that we are still waiting for geoshapes to get stable
>> URIs. This is handled in this ticket.

This ticket: <https://phabricator.wikimedia.org/T159517>. And more generally
<https://phabricator.wikimedia.org/T161527>.

The technically inclined among you may be interested in joining the relevant RFC
discussion on IRC tonight at 21:00 UTC (2pm PDT, 23:00 CEST) in #wikimedia-office.

-- 
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-25 Thread Daniel Kinzler
On 25.11.2016 at 12:16, David Cuenca Tudela wrote:
>> If we want to avoid this complexity, we could just go by prefix. So if the
>> languages is "de", variants like "de-CH" or "de-DE_old" would be considered 
>> ok.
>> Ordering these alphabetically would put the "main" code (with no suffix) 
>> first.
>> May be ok for a start.
> 
> I find this issue potentially controversial, and I think that the community at
> large should be involved in this matter to avoid future dissatisfaction and to
> promote involvement in the decision-making.

We should absolutely discuss this with Wiktionarians. My suggestion was intended
as a baseline implementation. Details about the restrictions on which variants
are allowed on a Lexeme, or in what order they are shown, can be changed later
without breaking anything.

> In my opinion it would be more appropriate to use standardized language codes,
> and then specify the dialect with an item, as it provides greater flexibility.
> However, as mentioned before I would prefer if this topic in particular would 
> be
> discussed with wiktionarians.

Using Items to represent dialects is going to be tricky. We need ISO language
codes for use in HTML and RDF. We can somehow map between Items and ISO codes,
but that's going to be messy, especially when that mapping changes.

So it seems like we need to further discuss how to represent a Lexeme's language
and each lemma's variant. My current thinking is to represent the language as an
Item reference, and the variant as an ISO code. But you are suggesting the
opposite.

I can see why one would want items for dialects, but I currently have no good
idea for making this work with the existing technology. Further investigation is
needed.

I have filed a Phabricator task for investigating this. I suggest taking the
discussion about how to represent languages/variants/dialects/etc. there:

https://phabricator.wikimedia.org/T151626

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-25 Thread Daniel Kinzler
Thank you Denny for having an open mind! And sorry for being a nuisance ;)

I think it's very important to have controversial but constructive discussions
about these things. Data models are very hard to change even slightly once
people have started to create and use the data. We need to try hard to get it as
right as possible off the bat.

Some remarks inline below.

On 25.11.2016 at 03:32, Denny Vrandečić wrote:
> There is one thing that worries me about the multi-lemma approach, and that 
> are
> mentions of a discussion about ordering. If possible, I would suggest not to
> have ordering in every single Lexeme or even Form, but rather to use the
> following solution:
> 
> If I understand it correctly, we won't let every Lexeme have every arbitrary
> language anyway, right? Instead we will, for each language that has variants
> have somewhere in the configurations an explicit list of these variants, i.e.
> say, for English it will be US, British, etc., for Portuguese Brazilian and
> Portuguese, etc.

That approach is similar to what we are now doing for sorting Statement groups
on Items. There is a global ordering of properties defined on a wiki page. So
the community can still fight over it, but only in one place :) We can re-order
based on user preference using a Gadget.

For the multi-variant lemmas, we need to declare the Lexeme's language
separately, in addition to the language code associated with each lemma variant.
It seems like the language will probably represented as reference to a Wikidata
Item (that is, a Q-Id). That Item can be associated with an (ordered) list of
matching language codes, via Statements on the Item, or via configuration (or,
like we do for unit conversion, configuration generated from Statements on 
Items).

If we want to avoid this complexity, we could just go by prefix. So if the
language is "de", variants like "de-CH" or "de-DE_old" would be considered ok.
Ordering these alphabetically would put the "main" code (with no suffix) first.
May be ok for a start.
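
A tiny sketch of that heuristic (plain Python, just to make the intended behaviour
concrete; the acceptance rule and the ordering are exactly the ones described above):

# Sketch only: accept variants by prefix, and sort so the bare code comes first.
def accepted_variants(language, codes):
    ok = [c for c in codes if c == language or c.startswith(language + "-")]
    # Alphabetical order puts the bare "de" before "de-CH", "de-DE_old", ...
    return sorted(ok)

print(accepted_variants("de", ["de-CH", "de", "de-DE_old", "en"]))
# -> ['de', 'de-CH', 'de-DE_old']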

I'm not sure yet on what level we want to enforce the restriction on language
codes. We can do it just before saving new data (the "validation" step), or we
could treat it as a community enforced soft constraint. I'm tending towards the
former, though.

> Given that, we can in that very same place also define their ordering and 
> their
> fallbacks.

Well, all lemmas would fall back on each other, the question is just which ones
should be preferred. Simple heuristic: prefer the shortest language code. Or go
by what MediaWiki does for the UI (which is what we do for Item labels).

> The upside is that it seems that this very same solution could also be used 
> for
> languages with different scripts, like Serbian, Kazakh, and Uzbek (although it
> would not cover the problems with Chinese, but that wasn't solved previously
> either - so the situation is strictly better). (It doesn't really solve all
> problems - there is a reason why ISO treats language variants and scripts
> independently - but it improves on the vast majority of the problematic 
> cases).

Yes, it's not the only decision we have to make in this regard, but the most
fundamental one, I think.

One consequence of this is that Forms should probably also allow multiple
representations/spellings. This is for consistency with the lemma, for code
re-use, and for compatibility with Lemon.

> So, given that we drop any local ordering in the UI and API, I think that
> staying close to Lemon and choosing a TermList seems currently like the most
> promising approach to me, and I changed my mind. 

Knowing that you won't do that without a good reason, I thank you for the
compliment :)

> My previous reservations still
> hold, and it will lead to some more complexity in the implementation not only 
> of
> Wikidata but also of tools built on top of it,

The complexity of handling a multi-variant lemma is higher than that of a single
string, but any Wikibase client already needs to have the relevant code anyway, to
handle item labels. So I expect little overhead. We'll want the lemma to be
represented in a more compact way in the UI than we currently use for labels,
though.


Thank you all for your help!


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-22 Thread Daniel Kinzler
On 12.11.2016 at 00:08, Denny Vrandečić wrote:
> I am not a friend of multi-variant lemmas. I would prefer to either have
> separate Lexemes or alternative Forms. 

We have created a decision matrix to help with discussing the pros and cons of
the different approaches. PLease have a look and comment:

https://docs.google.com/spreadsheets/d/1PtGkt6E8EadCoNvZLClwUNhCxC-cjTy5TY8seFVGZMY/edit?ts=5834219d#gid=0

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Two questions about Lexeme Modeling

2016-11-21 Thread Daniel Kinzler
y is creaking and not working well, and then think about
> these issues.

Slow iteration is nice as long as you don't produce artifacts you need to stay
compatible with. I have become extremely wary of lock-in - Wikitext is the worst
lock-in I have ever seen. Some aspects of how we implemented the Wikibase model
for Wikidata also have proven to be really hard to iterate on. Iterating the
model itself is even harder, since it is bound to break all clients in a
fundamental way. We just got very annoyed comments for making two fields in
the Wikibase model optional.

Switching from single-lemma to multi-lemma would be a major breaking change,
with lots of energy burned on backwards compatibility. The opposite switch would
be much simpler (because it adds guarantees, instead of removing them).

> But until then I would prefer to keep the system as dumb and
> simple as possible.

I would prefer to keep the user-generated *data* as straightforward as
possible. That's more important to me than a simple meta-model. The complexity
of the instance data determines the maintenance burden.


On 20.11.2016 at 21:06, Philipp Cimiano wrote:
> Please look at the final spec of the lemon model:
>
>
https://www.w3.org/community/ontolex/wiki/Final_Model_Specification#Syntactic_Frames
>
> In particular, check example: synsem/example7

Ah, thank you! I think we could model this in a similar way, by referencing an
Item that represents a (type of) frame from the Sense. Whether this should be a
special field or just a Statement I'm still undecided on.

Is it correct that in the Lemon model, it's not *required* to define a syntactic
frame for a sense? Is there something like a default frame?

> 2) Such spelling variants are modelled in lemon as two different
> representations
> of the same lexical entry.
[...]
> In our understanding these are not two different forms as you mention, but two
> different spellings of the same form.

Indeed, sorry for being imprecise. And yes, if we have a multi-variant lemma, we
should also have multi-variant Forms. Our lemma corresponds to the canonical
form in Lemon, if I understand correctly.

> The preference for showing e.g. the American or English variant should be
> stated by the application that uses the lexicon.

I agree. I think Denny is concerned with putting that burden on the application.
Proper language fallback isn't trivial, and the application may be a lightweight
JS library... But I think for the naive case, it's fine to simply show
all representations.


Thank you all for your input!

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Linking RDF resources for external IDs

2016-11-14 Thread Daniel Kinzler
By the way, I'm also re-considering my original approach:

Simply replace the plain value with the resolved URI when we can. This would
*not* cause the same property to be used with literals and non-literals, since
the predicate name is derived from the property ID, and a property either
provides a URI mapping, or it doesn't.

Problems would arise during transition, making this a breaking change:

1) when introducing this feature, existing queries that compare a newly
URI-ified property to a string literal will fail.

2) when a URI mapping is added, we'd either need to immediately update all
statements that use that property, or the triple store would have some old
triples where the relevant predicates point to a literal, and some new triples
where they point to a resource.

This would avoid duplicating more predicates, and keeps the model straightforward.
But it would cause a bumpy transition.
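
To make the difference concrete, here is a small sketch of the two shapes a direct
claim could take (Python with the rdflib package; property, value and base URI are
the placeholder ones from the attached example files):

# Sketch only: the same direct claim as a plain literal vs. as a resolved URI.
from rdflib import Graph, Literal, Namespace, URIRef

WD = Namespace("http://www.wikidata.org/entity/")
WDT = Namespace("http://www.wikidata.org/prop/direct/")

g = Graph()
# Status quo: the external ID is a plain string literal.
g.add((WD.Q111, WDT.P20, Literal("asdfasdf")))
# With a URI mapping on P20, the same predicate would point to a resource instead.
g.add((WD.Q111, WDT.P20, URIRef("http://musicbrainz.org/asdfasdf/place")))

print(g.serialize(format="turtle"))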

Please let me know which approach you prefer. Have a look at the files attached
to my original message.

Thanks,
Daniel

On 09.11.2016 at 17:46, Daniel Kinzler wrote:
> Hi Stas, Markus, Denny!
> 
> For a long time now, we have been wanting to generate proper resource 
> references
> (URIs) for external identifier values, see
> <https://phabricator.wikimedia.org/T121274>.
> 
> Implementing this is complicated by the fact that "expanded" identifiers may
> occur in four different places in the data model (direct, statement, 
> qualifier,
> reference), and that we can't simply replace the old string value, we need to
> provide an additional value.
> 
> I have attached three files with snippets of three different RDF mappings:
> - Q111.ttl - the status quo, with normalized predicates declared but not used.
> - Q111.rc.ttl - modeling resource predicates separately from normalized 
> values.
> - Q111.norm.ttl - modeling resource predicates as normalized values.
> 
> The "rc" variant means more overhead, the "norm" variant may have semantic
> difficulties. Please look at the two options for the new mapping and let me 
> know
> which you like best. You can use a plain old diff between the files for a 
> first
> impression.
> 


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Two questions about Lexeme Modeling

2016-11-11 Thread Daniel Kinzler
Hi all!

There are two questions about modelling lexemes that are bothering me. One is an
old question, and one I only came across recently.

1) The question that came up for me recently is how we model the grammatical
context for senses. For instance, "to ask" can mean requesting information, or
requesting action, depending on whether we use "ask somebody about" or "ask
somebody to". Similarly, "to shit" has entirely different meanings when used
reflexively ("I shit myself").

There is no good place for this in our current model. The information could be
placed in a statement on the word Sense, but that would be kind of non-obvious,
and would not (at least not easily) allow for a concise rendering, in the way we
see it in most dictionaries ("to ask sbdy to do sthg"). The alternative would be
to treat each usage with a different grammatical context as a separate Lexeme (a
verb phrase Lexeme), so "to shit oneself" would be a separate lemma. That could
lead to a fragmentation of the content in a way that is quite unexpected to
people used to traditional dictionaries.

We could also add this information as a special field in the Sense entity, but I
don't even know what that field should contain, exactly.

Got a better idea?


2) The older question is how we handle different renderings (spellings, scripts)
of the same lexeme. In English we have "color" vs "colour", in German we have
"stop" vs "stopp" and "Maße" vs "Masse". In Serbian, we have a Roman and
Cyrillic rendering for every word. We can treat these as separate Lexemes, but
that would mean duplicating all information about them. We could have a single
Lemma, and represent the others as alternative Forms, or using statements on the
Lexeme. But that raises the question of which spelling or script should be the
"main" one, and be used in the lemma.

I would prefer to have multi-variant lemmas. They would work like the
multi-lingual labels we have now on items, but restricted to the variants of a
single language. For display, we would apply a similar language fallback
mechanism we now apply when showing labels.

2b) if we treat lemmas as multi-variant, should Forms also be multi-variant, or
should they be per-variant? Should the gloss of a Sense be multi-variant? I
currently tend towards "yes" for all of the above.


What do you think?


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Why term for lemma?

2016-11-11 Thread Daniel Kinzler
On 11.11.2016 at 14:38, Thiemo Mättig wrote:
> Tpt asked:
> 
>> why having both the Term and the MonolingualText data structures? Is it just 
>> for historical reasons (labels have been introduced before statements and so 
>> before all the DataValue system) or is there an architectural reason behind?
> 
> That's not the only reason.

Besides the code perspective that Thiemo just explained, there is also the
conceptual perspective: Terms are editorial information attached to an entity
for search and display. DataValues such as MonolingualText represent a value
within a Statement, citing an external authority. This leads to slight
differences in behavior - for instance, the set of languages available for Terms
is subtly different from the set of languages available for MonolingualText.

Anyway, the fact that the two are totally separate has historical reasons. One
viable approach for code sharing would be to have MonolingualText contain a Term
object. But that would introduce more coupling between our components. I don't
think the little bit of code that could be shared is worth the effort.


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Linking RDF resources for external IDs

2016-11-09 Thread Daniel Kinzler
Hi Stas, Markus, Denny!

For a long time now, we have been wanting to generate proper resource references
(URIs) for external identifier values, see
<https://phabricator.wikimedia.org/T121274>.

Implementing this is complicated by the fact that "expanded" identifiers may
occur in four different places in the data model (direct, statement, qualifier,
reference), and that we can't simply replace the old string value, we need to
provide an additional value.

I have attached three files with snippets of three different RDF mappings:
- Q111.ttl - the status quo, with normalized predicates declared but not used.
- Q111.rc.ttl - modeling resource predicates separately from normalized values.
- Q111.norm.ttl - modeling resource predicates as normalized values.

The "rc" variant means more overhead, the "norm" variant may have semantic
difficulties. Please look at the two options for the new mapping and let me know
which you like best. You can use a plain old diff between the files for a first
impression.

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix wikibase: <http://wikiba.se/ontology-beta#> .
@prefix wdata: <http://localhost/daniel/wikidata/index.php/Special:EntityData/> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wds: <http://www.wikidata.org/entity/statement/> .
@prefix wdref: <http://www.wikidata.org/reference/> .
@prefix wdv: <http://www.wikidata.org/value/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix wdtn: <http://www.wikidata.org/prop/direct-normalized/> .
@prefix p: <http://www.wikidata.org/prop/> .
@prefix ps: <http://www.wikidata.org/prop/statement/> .
@prefix psv: <http://www.wikidata.org/prop/statement/value/> .
@prefix psn: <http://www.wikidata.org/prop/statement/value-normalized/> .
@prefix pq: <http://www.wikidata.org/prop/qualifier/> .
@prefix pqv: <http://www.wikidata.org/prop/qualifier/value/> .
@prefix pqn: <http://www.wikidata.org/prop/qualifier/value-normalized/> .
@prefix pr: <http://www.wikidata.org/prop/reference/> .
@prefix prv: <http://www.wikidata.org/prop/reference/value/> .
@prefix prn: <http://www.wikidata.org/prop/reference/value-normalized/> .
@prefix wdno: <http://www.wikidata.org/prop/novalue/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix prov: <http://www.w3.org/ns/prov#> .

wd:Q111 a wikibase:Item ;
	rdfs:label "silver"@en ;
	skos:prefLabel "silver"@en ;
	schema:name "silver"@en ;
	wdt:P20 "asdfasdf" ;
	wdtn:P20 <http://musicbrainz.org/asdfasdf/place> .

wd:Q111 p:P20 wds:Q111-5459c580-4b6f-c306-184f-b7fa132b32d8 .

wds:Q111-5459c580-4b6f-c306-184f-b7fa132b32d8 a wikibase:Statement,
		wikibase:BestRank ;
	wikibase:rank wikibase:NormalRank ;
	ps:P20 "asdfasdf" ;
	psn:P20 <http://musicbrainz.org/asdfasdf/place> ;
	pq:P30 "qwertyqwerty" ;
	pqn:P30 <http://vocab.getty.edu/aat/qwertyqwerty> ;
	prov:wasDerivedFrom wdref:7335a5598064cd8716cc9e31d164f2803e376b99 .

wdref:7335a5598064cd8716cc9e31d164f2803e376b99 a wikibase:Reference ;
	pr:P40 "zxcvbnzxcvbn" ;
	prn:P40 <https://www.sbfi.admin.ch/ontology/occupation/zxcvbnzxcvbn> .
	
wd:P20 a wikibase:Property ;
	wikibase:propertyType <http://wikiba.se/ontology-beta#ExternalId> ;
	wikibase:directClaim wdt:P20 ;
	wikibase:directClaimNormalized wdtn:P20 ;
	wikibase:claim p:P20 ;
	wikibase:statementProperty ps:P20 ;
	wikibase:statementValue psv:P20 ;
	wikibase:statementValueNormalized psn:P20 ;
	wikibase:qualifier pq:P20 ;
	wikibase:qualifierValue pqv:P20 ;
	wikibase:qualifierValueNormalized pqn:P20 ;
	wikibase:reference pr:P20 ;
	wikibase:referenceValue prv:P20 ;
	wikibase:referenceValueNormalized prn:P20 ;
	wikibase:novalue wdno:P20 .

p:P20 a owl:ObjectProperty .

psv:P20 a owl:ObjectProperty .

pqv:P20 a owl:ObjectProperty .

prv:P20 a owl:ObjectProperty .

psn:P20 a owl:ObjectProperty .

pqn:P20 a owl:ObjectProperty .

prn:P20 a owl:ObjectProperty .

wdt:P20 a owl:DatatypeProperty .

ps:P20 a owl:DatatypeProperty .

pq:P20 a owl:DatatypeProperty .

pr:P20 a owl:DatatypeProperty .

wdtn:P20 a owl:ObjectProperty .

wdno:P20 a owl:Class ;
	owl:complementOf _:genid2 .

_:genid2 a owl:Restriction ;
	owl:onProperty wdt:P20 ;
	owl:someValuesFrom owl:Thing .

wd:P20 rdfs:label "MusicBrainz place ID"@en .

[Wikidata-tech] BREAKING CHANGE: Quantity Bounds Become Optional

2016-11-04 Thread Daniel Kinzler
Hi all!

This is an announcement for a breaking change to the Wikidata API, JSON and RDF
binding, to go live on 2016-11-15. It affects all clients that process quantity
values.


As Lydia explained in the mail she just sent to the Wikidata list, we have been
working on improving our handling of quantity values. In particular, we are
making upper- and lower bounds optional: When the uncertainty of a quantity
measurement is not explicitly known, we no longer require the bounds to somehow
be specified anyway, but allow them to be omitted.

This means that the upperBound and lowerBound fields of quantity values become
optional in all API input and output, as well as the JSON dumps and the RDF 
mapping.

Clients that import quantities should now omit the bounds if they do not have
explicit information on the uncertainty of a quantity value.

Clients that process quantity values must be prepared to process such values
without any upper and lower bound set.


That is, instead of this

"datavalue":{
  "value":{
"amount":"+700",
"unit":"1",
"upperBound":"+710",
"lowerBound":"+690"
  },
  "type":"quantity"
},


clients may now also encounter this:

"datavalue":{
  "value":{
"amount":"+700",
"unit":"1"
  },
  "type":"quantity"
},


The intended semantics is that the uncertainty is unspecified if no bounds are
present in the XML, JSON or RDF representation. If they are given, the
interpretation is as before.


For more information, see the JSON model documentation [1]. Note that quantity
bounds have been marked as optional in the documentation since August. The RDF
mapping spec [2] has been adjusted accordingly.


This change is scheduled for deployment on November 15.

Please let us know if you have any comments or objections.

-- daniel


[1] https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON
[2] https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Quantity


Relevant tickets:
* <https://phabricator.wikimedia.org/T115269>

Relevant patches:
* <https://gerrit.wikimedia.org/r/#/c/302248>
*
<https://github.com/DataValues/Number/commit/2e126eee1c0067c6c0f35b4fae0388ff11725307>

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Why term for lemma?

2016-11-02 Thread Daniel Kinzler
On 02.11.2016 at 21:53, Denny Vrandečić wrote:
> Hi,
> 
> I am not questioning or criticizing, just curious - why was it decided to
> implement lemmas as terms? I guess it is for code reuse purposes, but just
> wanted to ask.

Yes, indeed. We have code for rendering, serializing, indexing, and searching
Terms. We do not have any infrastructure for plain strings. We could also handle
it as a monolingual-text StringValue, but that offers less re-use, in particular
no search, and no batch lookup for rendering.

Also, conceptually, the lemma is rather similar to a label. And it's always *in*
a language. The only question is whether we only have one, or multiple (for
variants/scripts). But one will do for now.


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Proposed update to the stable interfaces policy

2016-09-13 Thread Daniel Kinzler
Tomorrow I plan to apply the following update to the Stable Interface Policy:

https://www.wikidata.org/wiki/Wikidata_talk:Stable_Interface_Policy#Proposed_change_to_to_the_.22Extensibility.22_section

Please comment there if you have any objections.

Thanks!

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Announcing the Wikidata Stable Interface Policy

2016-08-23 Thread Daniel Kinzler
Hello all!

After a brief period for final comments (thanks everyone for your input!), the
Stable Interface Policy is now official. You can read it here:

<https://www.wikidata.org/wiki/Wikidata:Stable_Interface_Policy>

This policy is intended to give authors of software that accesses Wikidata a
guide to what interfaces and formats they can rely on, and which things can
change without warning.

The policy is a statement of intent given by us, the Wikidata development team,
regarding the software running on the site. It does not apply to any content
maintained by the Wikidata community.

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Policy on Interface Stability: final feedback wanted

2016-08-16 Thread Daniel Kinzler
Hello all,

repeated discussions about what constitutes a breaking change have prompted us,
the Wikidata development team, to draft a policy on interface stability. The
policy is intended to clearly define what kind of change will be announced when
and where.

A draft of the policy can be found at

 <https://www.wikidata.org/wiki/Wikidata:Stable_Interface_Policy>

Please comment on the talk page.

Note that this policy is not about the content of the Wikidata site, it's a
commitment by the development team regarding the behavior of the software
running on wikidata.org. It is intended as a reference for bot authors, data
consumers, and other users of our APIs.

We plan to announce this as the development team's official policy on Monday,
August 22.

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] URL strategy

2016-06-18 Thread Daniel Kinzler
On 13.06.2016 at 12:12, Richard Light wrote:
> returns a list of person URLs.  So I'm happy.  However, I am still intrigued 
> as
> to the logic behind the redirection of the statement URL to the URL for the
> person about whom the statement is being made.

The reason is a practical one: the statement data is part of the data about that
person. It's stored and addressed as part of that person's information. We
currently do not have an API that would return only the statement data itself,
so if you dereference the statement URI, you get all the data we have on the
subject, which includes the statement.

This is formally acceptable: dereferencing the statement URI should give you the
RDF representation of that statement (and possibly more - which is the case
here). The statement URI does not resolve to the subject or the object, but
to the Statement itself, which is an RDF resource in its own right.

Perhaps the confusion arises from the fact that the SPARQL endpoint offers two
views on Statements: the "direct" or "naive" mapping (using the wds prefix) in
which a Statement is modeled as a single triple, and does not have a URI of it's
own. And the "full" or "deep" mapping, where the statement is a resource in it's
own right, and we use several triples to describe its type, value, rank,
qualifiers, references, etc.
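
For illustration, both views side by side in a single query against the public
endpoint (Python with the requests package; P1082 on Q64 is just an arbitrary
example statement, not taken from this thread):

# Sketch only: the "direct" triple vs. the full statement resource for one claim.
import requests

query = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
SELECT ?simple ?statement ?value WHERE {
  wd:Q64 wdt:P1082 ?simple .      # direct mapping: one triple, no statement URI
  wd:Q64 p:P1082 ?statement .     # full mapping: the statement is a resource ...
  ?statement ps:P1082 ?value .    # ... described by further triples
}
"""
resp = requests.get("https://query.wikidata.org/sparql",
                    params={"query": query, "format": "json"})
for row in resp.json()["results"]["bindings"]:
    print(row["statement"]["value"], row["value"]["value"])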

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] MathML is dead, long live MathML

2016-04-07 Thread Daniel Kinzler
On 07.04.2016 at 20:00, Moritz Schubotz wrote:
> Hi Daniel,
> 
> Ok. Let's discuss!

Great! But let's keep the discussion in one place. I made a mess by
cross-posting this to two lists, now it's three, it seems. Can we agree on
 as the venue of discussion? At least for the
discussion of MathML in the context of Wikimedia, that would be the best place,
I think.

-- daniel


___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] MathML is dead, long live MathML

2016-04-07 Thread Daniel Kinzler
Peter Krautzberger, maintainer of MathJax, apparently thinks that MathML has
failed as a web standard (even though it succeeded as an XML standard), and
should be removed from HTML5. Here's the link:

https://www.peterkrautzberger.org/0186/

It's quite a rant. Here's a quick TL;DR:

> It doesn’t matter whether or not MathML is a good XML language. Personally, I
> think it’s quite alright. It’s also clearly a success in the XML publishing
> world, serving an important role in standards such as JATS and BITS.
> 
> The problem is: MathML has failed on the web.

> Not a single browser vendor has stated an intent to work on the code, not a
> single browser developer has been seen on the MathWG. After 18 years, not a
> single browser vendor is willing to dedicate even a small percentage of a
> developer to MathML.

> Math layout can and should be done in CSS and SVG. Let’s improve them
> incrementally to make it simpler.
> 
> It’s possible to generate HTML+CSS or SVG that renders any MathML content –
> on the server, mind you, no client-side JS required (but of course possible).

> Since layout is practically solved (or at least achievable), we really need
> to solve the semantics. Presentation MathML is not sufficient, Content MathML
> is just not relevant.
> 
> We need to look where the web handles semantics today – that’s ARIA and HTML
> but also microdata, rdfa etc.

I think both the rendering and the semantics are well worth thinking
about. Perhaps Wikimedia should reach out to Peter Krautzberger, and discuss
some ideas of how math (and physics, and chemistry) content should be handled by
Wikipedia, Wikidata, and friends. This seems like a crossroads, and we should
have a hand in where things are going from here.

-- daniel (not a MathML expert at all)

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Caches for Special:EntityData json

2016-02-29 Thread Daniel Kinzler
Output from Special:EntityData is cached for 31 days. Looking at the code, it
seems we are not automatically purging the web caches when an entity is edited -
please file a ticket for that. I think we originally decided against it for
performance reasons (there are quite a few URLs to purge for every edit), but I
suppose we should look into that again.

You can force the cache to be purged by setting action=purge in the request.
Note that this will purge all serializations of the entity, not just the one
requested.
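
A minimal sketch of such a forced purge (Python with the requests package; whether
a plain GET with action=purge is accepted here is an assumption - switch to a POST
if the server refuses it):

# Sketch only: purge the cached serializations of one entity, then re-fetch it.
import requests

url = "https://www.wikidata.org/wiki/Special:EntityData/Q17444909.json"
requests.get(url, params={"action": "purge"})   # purges all serializations
data = requests.get(url).json()
print(sorted(data["entities"]["Q17444909"]["claims"]))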

-- daniel

On 29.02.2016 at 22:02, Markus Krötzsch wrote:
> Hi,
> 
> I found that Special:EntityData returns outdated JSON data that is not in
> agreement with the page. I have fetched the data using wget to ensure that no
> browser cache is in the way. Concretely, I have been looking at
> 
> https://www.wikidata.org/wiki/Special:EntityData/Q17444909.json
> 
> where I recently changed the P279 value from Q217594 to Q16889133. Of course,
> this might no longer be a valid example when you read this email (in case the
> cache gets updated at some point).
> 
> Is this a bug in the configuration of the HTTP (or other) cache, or is this 
> the
> desired behaviour? When will the cache be cleared?
> 
> Thanks,
> 
> Markus
> 
> ___
> Wikidata-tech mailing list
> Wikidata-tech@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Wikibase CI is broken because of a Scribunto issue

2016-02-26 Thread Daniel Kinzler
Some Jenkins jobs now fail for all changes to Wikibase. E.g.
<https://gerrit.wikimedia.org/r/#/c/270008/> and
<https://gerrit.wikimedia.org/r/#/c/270572/>. Errors I see:

11:28:52 PHP Strict standards:  Declaration of
Capiunto\Test\BasicRowTest::testLua() should be compatible with
Scribunto_LuaEngineTestBase::testLua($key, $testName, $expected) in
/mnt/jenkins-workspace/workspace/mwext-testextension-php55-composer/src/extensions/Capiunto/tests/phpunit/output/BasicRowTest.php
on line 51

11:39:14 1) LuaSandbox:
Wikibase\Client\Tests\DataAccess\Scribunto\Scribunto_LuaWikibaseEntityLibraryTest::testRegister
11:39:14 Failed asserting that LuaSandboxFunction Object () is an instance of
class "Scribunto_LuaStandaloneInterpreterFunction".

I guess some change to Scribunto broke compatibility...

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Technical information about the new "math" and "external-id" data types

2016-02-05 Thread Daniel Kinzler
As Lydia announced, we are going to deploy support for two new data types soon
(think of "data types" as "property types", as opposed to "value types"):

* The "math" type for formulas. This will use TeX syntax and is provided by the
same extension that implements  for wikitext. We plan to roll this out on
Feb 9th.

* The "external-id" type for references to external resources. We plan to roll
this out on Feb 16th. NOTE: Many of the existing properties for external
identifiers will be converted from the plain "string" data type to the new
"external-id" data type, see
<https://www.wikidata.org/wiki/User:Addshore/Identifiers>.


Both these new types will use the "string" value type. Below are two examples of
Snaks that use the new data types, in JSON:

{
  "snaktype": "value",
  "property": "P717",
  "datavalue": {
"value": "\\sin x^2 + \\cos_b x ^ 2 = e^{2 \\tfrac\\pi{i}}",
"type": "string"
  },
  "datatype": "math"
}

{
  "snaktype": "value",
  "property": "P708",
  "datavalue": {
"value": "BADWOLF",
"type": "string"
  },
  "datatype": "external-id"
}

As you can see, the only thing that is new is the value of the "datatype" field.
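
For a consumer, picking out snaks of the new types is then just a matter of looking
at that field. A tiny sketch (plain Python over the JSON structure shown above):

# Sketch only: group an entity's main snaks by their "datatype" field.
def snaks_by_datatype(entity):
    result = {}
    for statements in entity.get("claims", {}).values():
        for statement in statements:
            snak = statement["mainsnak"]
            result.setdefault(snak["datatype"], []).append(snak)
    return result

# e.g. snaks_by_datatype(entity).get("external-id", []) yields snaks like P708 above.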


Similarly, in RDF, both new data types use plain string literals for now, as you
can see from the Turtle snippet below:

wd:Q2209 a wikibase:Item ;
wdt:P717 "\\sin x^2 + \\cos_b x ^ 2 = e^{2 \\tfrac\\pi{i}}" ;
wdt:P708 "BADWOLF" .

The datatypes themselves are declared as follows:

wd:P708 a wikibase:Property ;
wikibase:propertyType wikibase:ExternalId .

wd:P717 a wikibase:Property ;
wikibase:propertyType wikibase:Math .

Accordingly, the URIs of the datatypes (not the types of the literals!) are:
<http://wikiba.se/ontology-beta#ExternalId>
<http://wikiba.se/ontology-beta#Math>


These are, for now, the only changes to the representation of Snaks. We do
however consider some additional changes for the future. To avoid confusion,
I'll put them below a big separator:


ANNOUNCEMENT ABOVE!

ROUGH PLANS BELOW!


Here are some changes concerning the math and external-id data types that we are
considering or planning for the future.

* For the Math datatype, we may want to provide a type URI for the RDF string
literal that indicates that the format is indeed TeX.
Perhaps we could use <http://purl.org/xtypes/Fragment-LaTeX>.

* For the ExternalId data type, we would like to use resource URIs for external
IDs (in "direct claims"), if possible. This would only work if we know the base
URI for the property (provided by a statement on the property definition). For
properties with no base URI set, we would still use plain string literals.

In our example above, the base URI for P708 might be
<https://tardis.net/allonzy/>. The Turtle snippet would read:

wd:Q2209 a wikibase:Item ;
  wdt:P717 "\\sin x^2 + \\cos_b x ^ 2 = e^{2 \\tfrac\\pi{i}}"
^^purl:Fragment-LaTeX;
  wdt:P708 <https://tardis.net/allonzy/BADWOLF> .

However, the full representation of the statement would still use the original
string literal:

wds:Q2209-24942a17-4791-a49d-6469-54e581eade55 a wikibase:Statement,
wikibase:BestRank ;
wikibase:rank wikibase:NormalRank ;
ps:P708 "BADWOLF" .


We would also like to provide the full URI of the external resource in JSON,
making us a good citizen of the web of linked data. We plan to do this using a
mechanism we call "derived values", which we also plan to use for other kinds of
normalization in the JSON output. The idea is to include additional data values
in the JSON representation of a Snak:

{
  "snaktype": "value",
  "property": "P708",
  "datavalue": {
    "value": "BADWOLF",
    "type": "string"
  },
  "datavalue-uri": {
    "value": "https://tardis.net/allonzy/BADWOLF",
    "type": "string"
  },
  "datatype": "external-id"
}

In some cases, such as ISBNs, we would want a URL as well as a URI:
  {
"snaktype": "value",
    "property"

[Wikidata-tech] On interface stability and forward compatibility

2016-02-05 Thread Daniel Kinzler
Hi all!

In the context of introducing the new "math" and "external-id" data types, the
question came up whether this introduction constitutes a breaking change to the
data model. The answer to this depends on whether you take the "English" or the
"German" approach to interpreting the format: According to
<https://en.wikipedia.org/wiki/Everything_which_is_not_forbidden_is_allowed>, in
England, "everything which is not forbidden is allowed", while, in Germany, the
opposite applies, so "everything which is not allowed is forbidden".

In my mind, the advantage of formats like JSON, XML and RDF is that they provide
good discovery by eyeballing, and that they use a mix-and-match approach. In
this context, I favour the English approach: anything not explicitly forbidden
in the JSON or RDF is allowed.

So I think clients should be written in a forward-compatible way: they should
handle unknown constructs or values gracefully.


In this vein, I would like to propose a few guiding principles for the design of
client libraries that consume Wikibase RDF and particularly JSON output:

* When encountering an unknown structure, such as an unexpected key in a JSON
encoded object, the consumer SHOULD skip that structure. Depending on context
and use case, a warning MAY be issued to alert the user that some part of the
data was not processed.

* When encountering a malformed structure, such as missing a required key in a
JSON encoded object, the consumer MAY skip that structure, but then a warning
MUST be issued to alert the user that some part of the data was not processed.
If the structure is not skipped, the consumer MUST fail with a fatal error.

* Clients MUST make a clear distinction between data types and value types: A Snak's
data type determines the interpretation of the value, while the type of the
Snak's data value specifies the structure of the value representation.

* Clients SHOULD be able to process a Snak about a Property of unknown data
type, as long as the value type is known. In such a case, the client SHOULD fall
back to the behaviour defined for the value type. If this is not possible, the
Snak MUST be skipped and a warning SHOULD be issued to alert the user that some
part of the data could not be interpreted.

* When encountering an unknown type of data value (value type), the client MUST
either ignore the respective Snak, or fail with a fatal error. A warning SHOULD
be issued to alert the user that some part of the data could not be processed.
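
To make the intent of these rules concrete, here is a rough sketch of a snak
consumer along these lines (plain Python; the sets of known types and the warning
mechanism are of course placeholders):

# Sketch only: forward-compatible snak handling as described above.
import warnings

KNOWN_VALUE_TYPES = {"string", "wikibase-entityid", "time", "quantity",
                     "monolingualtext", "globecoordinate"}
KNOWN_DATA_TYPES = {"string", "url", "commonsMedia", "wikibase-item", "time",
                    "quantity", "monolingualtext", "globe-coordinate"}

def process_snak(snak):
    if snak["snaktype"] != "value":
        return None                        # novalue / somevalue: nothing to decode
    value_type = snak["datavalue"]["type"]
    if value_type not in KNOWN_VALUE_TYPES:
        # Unknown value type: skip the snak and warn the user.
        warnings.warn("skipping snak with unknown value type %r" % value_type)
        return None
    if snak["datatype"] not in KNOWN_DATA_TYPES:
        # Unknown data type, known value type: fall back to value-type behaviour.
        warnings.warn("treating unknown data type %r as plain %r"
                      % (snak["datatype"], value_type))
    return snak["datavalue"]["value"]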


Do you think these guidelines are reasonable? It seems to me that adopting them
should save everyone some trouble.

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] On interface stability and forward compatibility

2016-02-05 Thread Daniel Kinzler
On 05.02.2016 at 14:55, Tom Morris wrote:
> Sounds a lot like a restatement of Postel's Law
> 
> https://en.wikipedia.org/wiki/Robustness_principle

Yes indeed: "Be conservative in what you send, be liberal in what you accept"


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] On interface stability and forward compatibility

2016-02-05 Thread Daniel Kinzler
On 05.02.2016 at 14:24, Markus Krötzsch wrote:
> I feel that this tries to evade the real issue by making formal rules about 
> what
> kind of "breaking" you have to care about. It would be better to define
> "breaking change" based on its consequences: if important services will stop
> working, then you should make sure you announce it in time so this will not
> happen. This requires you to talk to people on this list. I think the whole
> proposal below is mainly trying to give you some justification to avoid
> communication with your stakeholders. This is not the way to go.

It's a way to prevent unpleasant surprises, and avoid unnecessary work.

Talking about planned changes early on is certainly good, and we should get more
organized at this.

However, I would like to avoid having to treat *any* change like a breaking
change. Breaking changes should be communicated a lot earlier, and a lot more
carefully, than, say, additions and extensions.

I tried to write down what clients *shouldn't* rely on. As Tom pointed out,
these are really general design principles. They are not really specific to
Wikibase, except for the "data type vs. value type" thing. Any software
processing third party data should follow them.

> how should a SPARQL Web service communicate problems that occurred when
> importing the data?

By informing whoever maintains the import, by writing to a log file or sending
mail. That's the person who can fix the problem. That's who should be informed.

> Our tools rely on being able to use all data, and the easiest way to ensure
> that they will work is to announce technical changes to the JSON format well
> in advance using this list. For changes that affect a particular subset of
> widely used tools, it would also be possible to seek the feedback from the
> main contributors of these tools at design/development time.

And we do that for breaking changes. I did not expect additional data types to
cause any trouble. After all, you can still ingest the data, since the value
type is known. For a long time, our dumps didn't even mention the data type at
all.

> I am sure everybody here is trying their best to keep up with whatever
> changes you implement, but it is not always possible for all of us to
> sacrifice part of our weekend on short notice for making a new release before
> next Wednesday.


To avoid this problem in the future, I tried to spell out what guarantees we
*don't* give, so that a simple addition doesn't break things horribly.

That doesn't mean we don't plan to communicate such changes at all, or to do so
better than we did this time. We do. But this kind of thing is clearly distinct
from actual "breaking changes" in my mind.

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Last call for objections against DataModel changes.

2015-12-03 Thread Daniel Kinzler
A couple of weeks ago, I proposed to change our PHP data model bindings to allow
extra info to be attached using the concept of "facets", similar to the "role
object" and "extension object" patterns.

Code experiments showcasing this idea can be found on github:
* https://github.com/wmde/WikibaseDataModel/pull/576
* https://github.com/wmde/WikibaseDataModelSerialization/pull/174

This is the final call for objections against using this approach. The rationale
behind it can be found on <https://phabricator.wikimedia.org/T118860> and
related tickets.

Implementation details can still change later, but after nearly 3 months, we
finally need a decision on the conceptual level. If there are no substantial
objections, this will become definite on Tuesday, December 8.


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Using the Role Object Pattern to represent derived information in the data model

2015-11-17 Thread Daniel Kinzler
Hi all!

For weeks and months now, we have been discussing how best to represent "extra"
information in (or associated with) the wikibase data model. After some more
discussion and a bit of research, I think I have found what we need: The Role
Object Pattern aka Role Class Model, see
<https://en.wikipedia.org/wiki/Role_Class_Model>.

Please have a look at https://phabricator.wikimedia.org/T118860 and let me know
if you have any objections. If not, let's use this sprint to discuss the details
of the implementations, and do a task breakdown.

PS: I came across quite a few famous names during my research. Looks like
we are not the first to have this need...

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] [Wikidata] how to map other identifiers to Wikidata entity IDs

2015-11-09 Thread Daniel Kinzler
Am 09.11.2015 um 03:26 schrieb S Page:
> I think these other identifiers are all "Wikidata property representing a 
> unique
> identifier" and there are about 350 of them [2] But surprisingly, I couldn't
> find an easy way to look up a Wikidata item using these other identifiers.

We discussed some loose plans for implementing this in Cirrus when Stas was in
Berlin a few weeks ago. On Special:Search, you would ask for
property:P212:978-2-07-027437-6, and that would find the item with that ISBN.

Stas: do we have a ticket for this somewhere? All I can find are the notes in
the etherpad.

> Also, is this a temporary thing? Will Wikidata eventually have items for every
> book published, every musical recording, etc. and become a superset of all 
> those
> unique identifiers?

It's highly unlikely that Wikidata will become a superset of any and all
vocabularies in existence. Better integration of external identifiers is high
on our priority list right now. The first step will however be to properly
expose URIs for them, so we are no longer a dead end in the linked data web.

But since we need to work on Cirrus integration anyway, I expect that we will
have search-by-property soonish, too. I certainly hope so.


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] how is the datetime value with precision of one year stored

2015-08-31 Thread Daniel Kinzler
Hello Raul.

While there is indeed some inconsistency with year-precision dates (some use
01-01 for month and day, some use 00-00), I cannot reproduce the issue you
report. Looking at the JSON form of Q216, I see +2014-00-00, as expected. I
cannot find 2013 anywhere in the JSON. Am I missing something?

Here is the entire statement in JSON:

[
  {
"mainsnak": {
  "snaktype": "value",
  "property": "P1082",
  "datavalue": {
"value": {
  "amount": "+539939",
  "unit": "1",
  "upperBound": "+539940",
  "lowerBound": "+539938"
},
"type": "quantity"
  },
  "datatype": "quantity"
},
"type": "statement",
"qualifiers": {
  "P585": [
{
  "snaktype": "value",
  "property": "P585",
  "hash": "a1c4aa51810ae8ef53dd5e243264e9d977c02081",
  "datavalue": {
"value": {
  "time": "+2014-00-00T00:00:00Z",
  "timezone": 0,
  "before": 0,
  "after": 0,
  "precision": 9,
  "calendarmodel": "http:\/\/www.wikidata.org\/entity\/Q1985727"
},
"type": "time"
  },
  "datatype": "time"
}
  ]
},
"qualifiers-order": [
  "P585"
],
"id": "Q216$2a0bbe8d-4281-d178-93b0-9e6ff904ea91",
"rank": "normal",
"references": [
  {
"hash": "3c680f0b30bc470385ebab96c739ddd1c84be724",
"snaks": {
  "P854": [
{
  "snaktype": "value",
  "property": "P854",
  "datavalue": {
"value":
"http:\/\/db1.stat.gov.lt\/statbank\/selectvarval\/saveselections.asp?MainTable=M3010211=1===9116===ST===",
"type": "string"
  },
  "datatype": "url"
}
  ]
},
"snaks-order": [
  "P854"
]
  }
]
  }
]
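For consumers of this JSON, the relevant bit is the qualifier's "precision"
field: 9 means year precision, so only the year part of the timestamp is
significant, and the month/day digits (whether 00-00 or 01-01) should be
ignored. A minimal sketch of reading this from the decoded JSON:

  // $snak is the decoded P585 qualifier snak from the JSON above.
  $value = $snak['datavalue']['value'];

  // Precision 9 = year, 10 = month, 11 = day (higher means finer).
  if ( $value['precision'] <= 9 ) {
      // Only the year is meaningful here; ignore the month/day digits.
      preg_match( '/^([+-]\d+)-/', $value['time'], $m );
      $year = (int)$m[1];
  }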

Am 31.08.2015 um 19:19 schrieb Raul Kern:
> Hi,
> how is the datetime value with precision of one year stored?
> 
> For example for birt date in https://www.wikidata.org/wiki/Q299687
> fine grain value for "1700" is "1.01.1700"
> 
> 
> But for population date field in https://www.wikidata.org/wiki/Q216
> the fine grain value for "2014" is "30.11.2013"
> Which is kind of unexpected.
> 
> 
> 
> --
> Raul
> 
> ___
> Wikidata-tech mailing list
> Wikidata-tech@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
> 


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Dump requirements

2015-06-29 Thread Daniel Kinzler
There's an ongoing discussion in ops about improving the dump process, see

 https://phabricator.wikimedia.org/T88728
 https://phabricator.wikimedia.org/T93396
 https://phabricator.wikimedia.org/T17017
 
https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Backlog/Improve_dumps

I would like to join in and add our requirements and thoughts to the list, and
would like some input on that. So far I have:

Make it easier to register a new type of dump via a config change.
A dump should define:
* a script(s) to run
* output file(s)
* the dump schedule
* a short name
* brief description (wikitext or HTML? translatable?)
* required input files (maybe)
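Purely as an illustration of what such a registration could look like (the
setting name and all keys below are invented, nothing like this exists yet):

  $wgDataDumps['wikidatawiki-json'] = array(
      // hypothetical config keys, for illustration only
      'script'      => 'extensions/Wikibase/repo/maintenance/dumpJson.php',
      'output'      => array( 'wikidata-{date}-all.json.gz' ),
      'schedule'    => 'weekly',
      'name'        => 'wikidatawiki-json',
      'description' => 'All Wikidata entities, canonical JSON serialization',
      'requires'    => array(),
  );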

Make clear timelines of consistent dumps.
* drop the misleading "one dir with one timestamp for all dumps" approach
* have one timeline per dump instead
* for dumps that are guaranteed to be consistent (one generated from the other),
generate a timeline of directories with symlinks to the actual files.

Make dumps discoverable:
* There should be a machine readable overview of which dumps exist in which
versions for each project.
* This overview should be a JSON document (may even be static)
* Perhaps we also want a DCAT-AP description of our dumps

Promote stable URLs:
* The latest dump of any type should be available under a stable, predictable 
URL.
* TBD: latest URL could point to a symlink, get rewritten to the actual file,
or trigger an HTTP redirect.



Thoughts? Comments? Additions?


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-11 Thread Daniel Kinzler
Am 11.03.2015 um 10:43 schrieb Markus Krötzsch:
 I was referring to the investigations that have led to this spreadsheet:
 
 https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0

That's the backend evaluation spreadsheet. I'm not arguing against BlazeGraph as
a backend at all.

I'm questioning the outcome of the public query language evaluation as shown in
this sheet:

https://docs.google.com/a/wikimedia.de/spreadsheets/d/16bbifhuoAiO7bRQ2-0mYU5FJ9ILczC-u9oCJsPdn9IU/edit#gid=0

Have a look at the weights, and at the comments, especially Gabriel's.

-- daniel


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-11 Thread Daniel Kinzler
Am 11.03.2015 um 10:08 schrieb Markus Krötzsch:
 What I don't see is how the use of a WDQ API on top of SPARQL would make the
 overall setup any less vulnerable; it mainly introduces an additional 
 component
 on top of SPARQL, and we can have a simpler SPARQL-based filter component 
 there
 if we want, which is likely to be more effective in controlling usage. 

I disagree on both points: I believe it would be neither simpler, nor more
effective. That's pretty much the core of it.

However, I admit that this is currently a gut feeling, a concern I want to share
and discuss. It should be investigated before making a decision.

 There is a huge cost to
 designing a query API from scratch, and I would really like to avoid this.

Which is why I want to use one that already exists (WDQ), and back it by
something that already exists (SPARQL).

 Supporting WDQ on top of SPARQL would retain WDQ in its current form and still
 support standards -- 

That's exactly what I propose.

 if we want to develop an official custom API, we will give
 up on both of these benefits, and at the same time push the ETA for Wikidata
 queries far into the future.

I disagree. If, as I believe, sandboxing WDQ is simpler than sandboxing SPARQL,
using WDQ would allow us to have a public query API sooner. But whether my
belief is correct needs to be investigated, of course.

 All of this has been discussed and considered in the past. I don't see why one
 would be kicking off discussions now that question everything decided in
 meetings and telcos over the past weeks. There is absolutely no new 
 information
 compared to what has led to the consensus that we all (including Daniel) had
 reached.

The consensus as I remember it was that we should be able to expose SPARQL safely,
if we invest enough time to sandbox it. The issue of lock-in was mentioned but
not really assessed. The relative cost of sandboxing WDQ vs SPARQL, and the
impact on the ETA, were not discussed much. The ad-hoc evaluation spreadsheet
shows WDQ as second to SPARQL (ahead of MQL and ASK), mainly because SPARQL is
more powerful.

The downside of that power doesn't factor into the evaluation, nor does the
risk of lock-in. Shifting the relative weight in the spreadsheet from power to
sustainability makes WDQ come out at the top.

After the initial enthusiasm, this has made me increasingly uneasy over the last
weeks. Hence my mail to this list.


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-10 Thread Daniel Kinzler
Am 10.03.2015 um 18:22 schrieb Thomas Tanon:
 I support Magnus' point of view. WDQ is a very good proof of concept but is,
 I think, too limited to be the primary language of the Wikidata query
 system.

It can be extended. What I want is a limited domain specific language tailored
to our primary use cases. Having it largely compatible with WDQ would be great.

I did not mean to imply that we have to accept the current limitations of WDQ.
I'm arguing that we should impose sensible limitations on queries, instead of
committing to support everything that is possible with SPARQL.

 A possible solution is maybe to support two query languages as primary: 1
 WDQ, at first, in order to have something working quickly 2 A safe subset
 of SPARQL (if it is possible) that would be implemented later using the
 experience got form the deployment of the first version of the query
 system. Or, if it is not possible, an improved version of WDQ that would
 break its current limitations.

Absolutely. I'd like to avoid any commitment to keeping the SPARQL interface
stable, though. That's why I'd limit it to labs-based usage.

-- daniel

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-10 Thread Daniel Kinzler
Hi all!

After the initial enthusiasm, I have grown increasingly wary of the prospect of
exposing a SPARQL endpoint as Wikidata's canonical query interface. I decided to
share my (personal and unfinished) thoughts about this on this list, as food for
thought and a basis for discussion.

Basically, I fear that exposing SPARQL will lock us in with respect to the
backend technology we use. Once it's there, people will rely on it, and taking
it away would be very harsh. That would make it practically impossible to move
to, say, Neo4J in the future. This is even more true if we expose vendor-specific
extensions like RDR/SPARQL*.

Also, exposing SPARQL as our primary query interface probably means abruptly
discontinuing support for WDQ. It's pretty clear that the original WDQ service
is not going to be maintained once the WMF offers infrastructure for wikidata
queries. So, when SPARQL appears, WDQ would go away, and dozens of tools will
need major modifications, or would just die.


So, my proposal is to expose a WDQ-like service as our primary query interface.
This follows the general principle of having narrow interfaces to make it easy to
swap out the implementation.

But the power of SPARQL should not be lost: A (sandboxed) SPARQL endpoint could
be exposed to Labs, just like we provide access to replicated SQL databases
there: on Labs, you get raw access, with added performance and flexibility,
but no guarantees about interface stability.

In terms of development resources and timeline, exposing WDQ may actually get us
a public query endpoint more quickly: sandboxing full SPARQL may likely turn out
to be a lot harder than sandboxing the more limited set of queries WDQ allows.

Finally, why WDQ and not something else, say, MQL? Because WDQ is specifically
tailored to our domain and use case, and there already is an ecosystem of tools
that use it. We'd want to refine it a bit I suppose, but by and large, it's
pretty much exactly what we need, because it was built around the actual demand
for querying wikidata.


So far my current thoughts. Note that this is not a decision or recommendation
by the Wikidata team, just my personal take.

-- daniel


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Thoughts on (not) exposing a SPARQL endpoint

2015-03-10 Thread Daniel Kinzler
Am 10.03.2015 um 21:09 schrieb Stas Malyshev:
 People would ask us for full SPARQL as soon as they'd know we're
 running SPARQL db.

Sure. And I'd tell them: you can use SPARQL on Labs, but beware that it may
change or go away.


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Globe coordinates precision question (technical)

2015-01-12 Thread Daniel Kinzler
Am 12.01.2015 15:09, schrieb Markus Krötzsch:
 Great, this clarifies a lot for me. The other question was what to make of 
 null
 values for precision. Do they mean no precision known or something else?

IIRC, null is a bug here. Not sure how to handle that - we don't have the
original string, and we can't really guess the precision based on the float
values.

Looking at GeoCoordinateFormatter, I see this:

if ( $precision <= 0 ) {
$precision = 1 / 3600;
}

I.e. it assumes 1 arc sec if no precision is given. Not great, but not much else
we can do at this point.


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Things to get merged before the branch next week

2014-12-05 Thread Daniel Kinzler
Hey!

Here's a few performance relevant changes I think should get merged before we
branch next week:

https://gerrit.wikimedia.org/r/#/c/170961/ Determine update actions based on
usage aspects.  --- the last bit missing for usage tracking

https://gerrit.wikimedia.org/r/#/c/176650/ Use wb_terms table for label
lookup. --- should improve memory consumption a lot, and possibly also speed.

https://gerrit.wikimedia.org/r/#/c/167224/ Defer entity deserialization ---
should reduce memory footprint and improve speed of trivial operations like
checking whether something is a redirect.

Are there any other performance improvements that we should get in? I imagine
that this will be the last time we branch until the third week of January.

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Parsing Entity IDs

2014-10-17 Thread Daniel Kinzler
Am 17.10.2014 04:45, schrieb Jeroen De Dauw:
 Hey,
 
 I just noticed this commit [0], which gets rid of a pile of direct
 BasicEntityIdParser usages for performance reasons.

Yay, thanks Katie!

 Of course this also means that no new code that introduces such occurrences
 should be allowed through review, even if it contains a fix this later TODO
 (for new code there is no excuse to do it wrong).

There's no excuse to do it wrong, but there will always be things left to do
later. TODOs are a good thing; it's just bad to put them in and forget about
them (which I'm quite guilty of, I know).

-- daniel


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Wikibase changesAsJson

2014-10-13 Thread Daniel Kinzler
Yes, as far as I know, we have moved Change serialization to JSON a long time
ago, and we can and should drop support for PHP serialization there.

Double check with Katie though, she knows best what is currently deployed.

Am 12.10.2014 23:59, schrieb Jeroen De Dauw:
 Hey,
 
 I was wondering if we still used PHP serialization in our change replication
 mechanism. (We need to be very careful making changes to the objects in WB DM 
 if
 that is the case.) Looking at the code, I discovered we have a changesAsJson
 setting, presumably introduced to migrate away from the PHP serialization. Has
 such a migration happened? Can we get rid of the setting an the old PHP
 serialize code?
 
 Cheers
 
 --
 Jeroen De Dauw - http://www.bn2vs.com
 Software craftsmanship advocate
 Evil software architect at Wikimedia Germany
 ~=[,,_,,]:3
 
 
 ___
 Wikidata-tech mailing list
 Wikidata-tech@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
 


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] alternatives to memcached for caching entity objects across projects

2014-10-04 Thread Daniel Kinzler
Am 02.10.2014 17:55, schrieb Jeroen De Dauw:
 Hey,
 
 We use two CachingEntityRevisionLookup nested into each other: the 
 outer-most
 uses a HashBagOStuff to implement in-process caching, the second level 
 uses
 memcached.
 
 
 It is odd to have two different decorator instances for caching around th
 EntityRevisionLookup. I suggest to have only a single decorator for caching,
 which writes to a caching interface. Then this caching interface can have an
 implementation what uses multiple caches, and perhaps have a decorator on that
 level.

Went that way first. Didn't work out nicely. I forget why exactly.

I don't care much either way.


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] alternatives to memcached for caching entity objects across projects

2014-10-02 Thread Daniel Kinzler
Hey Ori!

Am 02.10.2014 06:45, schrieb Ori Livneh:
 I'm embarrassed to say that I don't know nearly enough about Wikidata to be 
 able
 to make a recommendation. Where would you recommend I look if I wanted to
 understand the caching architecture?

And I'm embarrassed to say that we have very little high-level documentation.
There is no document on the overall caching architecture.

The use case in question is accessing data Items (and other Entities, like
Properties) from client wikis like Wikipedia. Entities are accessed through an
EntityRevisionLookup service; CachingEntityRevisionLookup is an implementation
of EntityRevisionLookup that takes an actual EntityRevisionLookup (e.g. a
WikiPageEntityRevisionLookup) and a BagOStuff, and implements a caching layer.

We use two CachingEntityRevisionLookup nested into each other: the outer-most
uses a HashBagOStuff to implement in-process caching, the second level uses
memcached. The objects that are cached there are instances of EntityRevision,
which is a thin wrapper around an Entity (usually, an Item) plus a revision ID.
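Roughly, the wiring looks like this (constructor arguments are simplified and
partly assumed here, see CachingEntityRevisionLookup and the store setup code
for the real thing):

  // $rawLookup is the uncached lookup that reads the entity JSON from the
  // repo's page storage (a WikiPageEntityRevisionLookup).

  // Second level: shared memcached, survives across requests and servers.
  $memcachedLookup = new CachingEntityRevisionLookup( $rawLookup, wfGetMainCache() );

  // First level: per-process HashBagOStuff, avoids repeated memcached round
  // trips within a single request.
  $lookup = new CachingEntityRevisionLookup( $memcachedLookup, new HashBagOStuff() );

  $entityRevision = $lookup->getEntityRevision( new ItemId( 'Q64' ) );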

Please let me know if you have further questions!

-- daniel

PS: what do you think, where should this info go? Wikibase/docs/caching.md or
some such?


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] [Multimedia] From the MW Core Backlog....

2014-09-28 Thread Daniel Kinzler
/Wikimedia_MediaWiki_Core_Team/Backlog#Structured_license_metadata

 I'm assuming everything that he describes fits nicely into what is
 planned for Structured Data.  Assuming that's true, should I just
  copy/paste into a new card in Mingle, or a new page on mw.org or what?


 This seems to be about article text, or mainly about article text
 (articles imported from other wikis and so on).

 The plan for the structured data project is to create Wikidata properties
 for legalese, install Wikibase on Commons (and possibly other wikis which
 have local images), make that Wikibase use Wikidata properties (and
 sometimes Wikidata items as values), create a new entity type called
 mediainfo (which is like a Wikibase item, but associated with a file), 
 and
 add legal information to the mediainfo entries.

 Part of that (the Wikidata properties) could be reused for articles and
 other non-file content - the source, license etc. properties are generic
 enough. However, if we want to use this structure to attribute files, we
 would either have to make mediainfo into some more generic thing that can
 be attached to any wiki page, or abuse the langlink/badge feature to 
 serve
 a similar purpose. That is a major course correction; if we want to do
 something like that, that should be discussed (with the involvement of 
 the
 Wikidata team) as soon as possible.


 Thanks for the analysis, Gergo!  I was going to split Luis' proposal into a
 separate wiki page, but I see Nemo has linked to this page as the Canonical
 page on the topic:
 https://www.mediawiki.org/wiki/Files_and_licenses_concept

 Without a deep reading that I'm admittedly just not going to have time for,
 it's hard to tell how related the page that Nemo linked to is to the concepts
 that Luis is trying to capture.  Could someone (Nemo? Luis?) merge Luis's
 requirements into the canonical page to Luis' satisfaction, so I can delete
 most of the information from our backlog?  I'll keep the item on the MW Core
 backlog, since I don't know where else to put it, but it's probably going to
 be relatively low priority for that team.

 Multimedia team and Wikidata team, could you make sure you're considering the
 requirements that Luis brought up as you build your solution?  Even if you
 decide to punt on some of the things that aren't strictly necessary for 
 files,
 it's still good to make sure you don't paint us in a corner when if/when we 
 do
 try to do something more sophisticated for articles.

 One thing I'll note, though, before we get too complacent in thinking that
 files are somehow simpler than articles, we should consider these relatively
 common scenarios:
 *  Group photo with potentially different per-person personality rights
 *  PDF of a slide deck with many images
 *  PDF of a Wikipedia article  :-)

 Rob
 
 
 
 
 
 ___
 Wikidata-tech mailing list
 Wikidata-tech@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
 


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Removing 3rd party dependencies from WikibaseQueryEngine

2014-09-09 Thread Daniel Kinzler
Am 09.09.2014 19:20, schrieb Daniel Kinzler:
 Hi Rob, thanks for clarifying!
 
 I guess I just oversimplified what was said in our discussion. I'll try to
 summarize what you now wrote:
 
 If there is a package for dbal/symfony/whatever in Ubuntu LTS, we have a good
 chance, but no guarantee, that TechOps is fine with deploying it.

Quick update on that: If I understand correctly, the cluster is running Ubuntu
12.04, which doesn't have the packages in question, but an upgrade to 14.04 is
in the pipeline.

So, there are two things we need to know in order to make an informed decision:

1) can we use the Ubuntu LTS packages for symfony and dbal?

2) when is 14.04 going to be rolled out?

Who can answer these questions? How do we poke TechOps?

-- daniel



-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Removing 3rd party dependencies from WikibaseQueryEngine

2014-09-05 Thread Daniel Kinzler
Am 04.09.2014 20:03, schrieb Jeroen De Dauw:
 Hey,
 
 I'm also curious to if WMF is indeed not running any CLI tools on the cluster
 which happen to use Symfony Console.

As far as I know, no unreviewed 3rd party php code is running on the public
facing app servers. Anything that has a debian package is ok. Don't know about
PEAR...

-- daniel


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] How to record redirects in the database

2014-07-14 Thread Daniel Kinzler
This is about the question of how to best store entity redirects in the
database. Below I try to describe the problem and the possible solutions. Any
input welcome.


A quick primer:

Wikibase redirects are really Entity-ID aliases. They correspond to, but are not
the same as, MediaWiki's page based redirects.

If Q3 is a redirect to (an alias for) Q5, the page Item:Q3 is also a redirect to
Item:Q5. The JSON blob on Item:Q3 would store a redirect entry *instead* of an
entity. Entities never *are* redirects.

Wikibase currently stores a mapping of entity ids to page ids in the
wb_entity_per_page table (epp table for short).

MediaWiki core stores redirects as a kind of link table, with rd_from being a
page_id, and rd_to+rd_namespace being name+namespace of the redirect target.


Requirements:

* When looking up an EntityId, we need to be able to load the corresponding JSON
blob, and for that we need to find the corresponding wiki page (either by id, or
by name+namespace). We need to be able to do this cross-wiki, so we may not have
the repo's configuration (wrt namespaces, etc) available when constructing the
query.

* We need an efficient way to list all entity IDs on a wiki (without redirects).
In particular, the mechanism for listing entities must support efficient paging.

* We need an efficient way to resolve redirects in bulk, or at least, to discern
redirects from unknown/deleted entity ids.


Options:

1) No redirects in the epp table (current). This means we need to use the
name+namespace when loading the entity-or-redirect from a page, since we don't
know the page ID if it's a redirect. We also can't use core's redirect table,
because for that, we also need to know the page id first. In order to use
name+namespace for looking up page IDs for entities, client wikis would need to
know the namespace IDs used on the repo, in order to generate queries against
the repo's database.

2) Put redirects into the epp table as well, without any special marking. This
makes lookups easy, but gives us no efficient way to list all entities without
redirects. We'd need to check and skip redirects while iterating. This would add
complexity to several maintenance and upgrade scripts.

3) Put redirects into the epp table, with a marker (or target id) in a new
column. This would allow for both, simple lookup and efficient listing, but it
means adding a column (and an index) to an already large table in production. It
also means having the overhead of a column that's mostly null.

4) Put redirects into epp *and* a separate table. Provides simple lookup, but
means a potentially slow join when listing entities. This join would happen
multiple times each time we need to list all entities, because of paged access -
compare how JsonDumpGenerator works.

5) Put redirects into a special table but not into epp. This means fast/simple
listing of entities, but requires a not-so-nice try logic when looking up
entities: if no entry is found in the epp table, we then need to go on and try
the entity-redirect table, to see whether the id is redirected or 
unknown/deleted.
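To make the lookup logic option 5 requires a bit more concrete, here is a rough
sketch (the redirect table and its columns are invented names, and error
handling is omitted):

  // Returns an array with either a 'pageId' or a 'redirectTarget' key,
  // or null if the entity is unknown/deleted.
  function lookupEntityPage( DatabaseBase $db, EntityId $id ) {
      $row = $db->selectRow(
          'wb_entity_per_page',
          'epp_page_id',
          array(
              'epp_entity_type' => $id->getEntityType(),
              'epp_entity_id' => $id->getNumericId(),
          ),
          __METHOD__
      );

      if ( $row !== false ) {
          return array( 'pageId' => (int)$row->epp_page_id );
      }

      // Not in epp, so it is either a redirect or unknown/deleted.
      // "wb_entity_redirect" and its columns are hypothetical names.
      $redirect = $db->selectRow(
          'wb_entity_redirect',
          'er_target_id',
          array( 'er_entity_id' => $id->getNumericId() ),
          __METHOD__
      );

      if ( $redirect !== false ) {
          return array( 'redirectTarget' => (int)$redirect->er_target_id );
      }

      return null; // unknown or deleted
  }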


Assessment:

1) is nasty in terms of cross-wiki configuration. It's the simplest solution on
the code and database levels, but seems brittle.

2) adds complexity to everything that lists entities. Big performance impact in
cases where entity blobs would otherwise not have been loaded, but are loaded
now to check whether they contain redirects.

3) is somewhat wasteful on the database level, and needs a schema change
deployment on a large table. Don't know how bad that would be, though.

4) may cause performance issues because it adds complexity to big queries on
large tables. Needs trivial schema change deployment (new table).

5) adds complexity to the code that reads entity blobs from the database,
impacts performance for the redirect and missing entity cases by adding a
database query. Could be acceptable if these cases are rare. Needs trivial
schema change deployment (new table).

-- daniel

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] reviews needed for pubsubhubbub extension

2014-07-10 Thread Daniel Kinzler
Am 09.07.2014 19:39, schrieb Dimitris Kontokostas:
 On Wed, Jul 9, 2014 at 6:13 PM, Daniel Kinzler daniel.kinz...@wikimedia.de wrote:
 
 Am 09.07.2014 08:14, schrieb Dimitris Kontokostas:
  Maybe I am biased with DBpedia but by doing some experiments on English
  Wikipedia we found that the ideal update with OAI-PMH time was every ~5
 minutes.
  OAI aggregates multiple revisions of a page to a single edit
  so when we ask: get me the items that changed the last 5 minutes we 
 skip the
  processing of many minor edits
  It looks like we lose this option with PubSubHubbub right?
 
 I'm not quite positive on this point, but I think with PuSH, this is done 
 by the
 hub. If the hub gets 20 notifications for the same resource in one 
 minute, it
 will only grab and distribute the latest version, not all 20.
 
 But perhaps someone from the PuSH development team could confirm this.
 
 
 It 'd be great if the dev team can confirm this. 
 Besides push notifications, is polling an option in PuSH? I briefed through 
 the
 spec but couldn't find this.

Yes. You can just poll the interface that the hub uses to fetch new data.

-- daniel

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] reviews needed for pubsubhubbub extension

2014-07-09 Thread Daniel Kinzler
Am 09.07.2014 08:14, schrieb Dimitris Kontokostas:
 Hi,
 
 Is it easy to brief the added value (or supported use cases) by switching
 to PubSubHubbub?

* It's easier to handle than OAI, because it uses the standard dump format.
* It's also push-based, avoiding constant polling on small wikis.
* The OAI extension has been deprecated for a long time now.

 The edit stream in Wikidata is so huge that I can hardly think of anyone 
 wanting
 to be in *real-time* sync with Wikidata
 With 20 p/s their infrastructure should be pretty scalable to not break.

The push aspect is probably most useful for small wikis. It's true, for large
wikis, you could just poll, since you would hardly ever poll in vain.

It would be very nice if the sync could be filtered by namespace, category, etc.
But PubSubHubbub (I'll use PuSH from now on) doesn't really support this, sadly.

 Maybe I am biased with DBpedia but by doing some experiments on English
 Wikipedia we found that the ideal update with OAI-PMH time was every ~5 
 minutes.
 OAI aggregates multiple revisions of a page to a single edit 
 so when we ask: get me the items that changed the last 5 minutes we skip the
 processing of many minor edits
 It looks like we lose this option with PubSubHubbub right?

I'm not quite positive on this point, but I think with PuSH, this is done by the
hub. If the hub gets 20 notifications for the same resource in one minute, it
will only grab and distribute the latest version, not all 20.

But perhaps someone from the PuSH development team could confirm this.

 As we already asked before, does PubSubHubbub supports mirroring a wikidata
 clone? The OAI-PMH extension has this option

Yes, there is a client extension for PuSH, allowing for seamless replication of
one wiki into another, including creation and deletion (I don't know about
moves/renames).

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] PubSubHubbub what is it all about ? - Re: Wikidata-tech Digest, Vol 15, Issue 3

2014-07-09 Thread Daniel Kinzler
Hi Param!

Am 09.07.2014 17:13, schrieb Param:
 Hi,
 
 I am new member to wikidata and would like to know all about “*PubSubHubbub*”
 the new project. 

PubSubHubbub (PuSH for short) is a push-based notification mechanism. See
https://en.wikipedia.org/wiki/PubSubHubbub. We plan to implement it for
wikidata.org. The code is at
https://git.wikimedia.org/tree/mediawiki%2Fextensions%2FPubSubHubbub

-- daniel


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Resolving Redirects

2014-06-23 Thread Daniel Kinzler
Hi all. I'm writing to get input on a conceptual issue regarding the resolution
of redirects.

I'm currently in the process of implementing redirects for Wikibase Items
(bugzilla 66067). My present task is to add support for redirect resolution to
the EntityLookup service interface (and possibly the related
EntityRevisionLookup service interface; bugzilla 66075).

Currently, the two interfaces in question look like this (with some irrelevant
stuff omitted):

  interface EntityLookup {
public function getEntity( EntityId $entityId, $revision = 0 );
public function hasEntity( EntityId $entityId );
  }

  interface EntityRevisionLookup extends EntityLookup {
public function getEntityRevision( EntityId $entityId, $revisionId = 0 );
public function getLatestRevisionId( EntityId $entityId );
  }

Note that getEntityRevision returns an EntityRevision object (an Entity with
some revision meta data), while getEntity just returns an Entity object.

Also note that the $revision parameter in EntityLookup::getEntity is deprecated
and being removed (see patch Iafdcb5b38), while $revision in
EntityRevisionLookup::getEntityRevision is supposed to stay.

Presently, the attempt to look up an Entity via an ID that has been turned into
a redirect will result in an exception being thrown. To implement redirect
resolution, the original intention was to leave the EntityRevisionLookup as is, and
change EntityLookup like this:

  interface EntityLookup {
public function getEntity( EntityId $entityId, $resolveRedirects = 1 );
public function hasEntity( EntityId $entityId, $resolveRedirects = 1 );
  }

...with the $resolveRedirects parameter indicating how many levels of redirects
should be resolved before giving up.

This gives us a convenient way to get the current revision of an entity,
following redirects; and it keeps the interface for requesting a specific, or
the latest, version of an Entity, with meta info attached.

However, it means we have to implement the logic for redirect resolution in
every implementation class, generally using the same code over and over (there
are currently three implementations of EntityRevisionLookup: the actual lookup,
a caching wrapper, and an in-memory fake).

Also, it does not give us a straight-forward way to get the meta-data of the
current revision while following redirects. For that, we'd have to modify
EntityRevisionLookup::getEntityRevision:

public function getEntityRevision(
EntityId $entityId,
$revisionId = 0,
$resolveRedirects = 0
);

This is ugly, and annoying since we'll want to *either* resolve redirects *or*
specify a revision. We could use a special value for $revisionId to indicate
that we not only want the current revision (indicated by 0), but also want to
have redirects resolved (indicated by follow or -1 or whatever):

public function getEntityRevision(
EntityId $entityId,
$revisionIdOrRedirects = 0,
);

That's concise, but somewhat magical. Or we could add another method:

public function getEntityRevisionAfterFollowingAnyRedirects(
EntityId $entityId,
$resolveRedirects = 1,
);

That's not quite obvious, and the awkward name indicates that this isn't really
what we want either.


Perhaps we can get around all this mess by making redirect resolution something
the interface doesn't know about? An implementation detail? The logic for
resolving redirects could be implemented in a Proxy/Wrapper that would implement
EntityRevisionLookup (and thus also EntityLookup). The logic would have to be
implemented only once, in one implementation class, that could be wrapped around
any other implementation.

From the implementation's point of view, this is a lot more elegant, and removes
all the issues of how to fit the flag for redirect resolution into the method
signatures.
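For illustration, such a wrapper could look roughly like this (the exception
class and its accessor are assumptions, standing in for however the inner
lookup reports that it hit a redirect):

  class RedirectResolvingEntityRevisionLookup implements EntityRevisionLookup {

      private $lookup;
      private $maxResolutionDepth;

      public function __construct( EntityRevisionLookup $lookup, $maxResolutionDepth = 1 ) {
          $this->lookup = $lookup;
          $this->maxResolutionDepth = $maxResolutionDepth;
      }

      public function getEntityRevision( EntityId $entityId, $revisionId = 0 ) {
          $levels = $this->maxResolutionDepth;

          while ( true ) {
              try {
                  return $this->lookup->getEntityRevision( $entityId, $revisionId );
              } catch ( UnresolvedRedirectException $ex ) {
                  if ( $levels-- <= 0 ) {
                      throw $ex; // too many levels of redirects
                  }

                  // follow the redirect and try again
                  $entityId = $ex->getRedirectTargetId();
              }
          }
      }

      public function getLatestRevisionId( EntityId $entityId ) {
          return $this->lookup->getLatestRevisionId( $entityId );
      }

      // getEntity() and hasEntity() from EntityLookup are omitted for brevity;
      // they would simply delegate in the same way.
  }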

However, this means that the caller does not have control over whether redirects
are resolved or not. It would then be the responsibility of bootstrap code to
provide an instance that does, or doesn't, do redirect resolution to the
appropriate places. That's impractical, since the decision whether redirects
should be resolved may be dynamic (e.g. depend on a parameter in a web API
call), or the caller may wish to handle redirects explicitly, by first looking
up without redirect, and then with redirect resolution, after some special
treatment.

So, it seems that the ugly variant with an extra parameter in
getEntityRevision() is the most practical, even though it's not the most elegant
from an OO design perspective.

What's your take on this? Got any better ideas?

-- daniel


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Constructing Entities from their serialization

2014-06-06 Thread Daniel Kinzler
Am 06.06.2014 06:07, schrieb Jeroen De Dauw:
 $item = new Item( array(
 
 ) );

Some tests I touched recently use this, and I didn't change it, just moved
things around.

I agree that knowing about a specific serialization format in tests is bad.

On the other hand, it's nice to be able to construct an entity in a single
statement, instead of building it iteratively.

Also, some test cases take the array data as input, and only construct the
entity when running the test. This is convenient in cases when the data provider
does not know the concrete type of entity under test. I guess that's why this
was introduced. I'm moving tests away from that, though.

-- daniel


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Using the canonical JSON model in dumps.

2014-04-14 Thread Daniel Kinzler
Hi all!

Context: We plan to change the XML dumps (and Special:Export) to use the same
JSON serialization that is used by the API, instead of the terse but brittle
internal format. This is about the mechanism we plan to use for the 
conversion.

So, I just went and checked my assertion that WikiExporter will use the Content
object's serialize method to generate output. I WAS WRONG. It doesn't. It'll use
the text from the database, as-is (for reference, find the call to
Revision::getRevisionText in Export.php).

In order to force a conversion to the new format, we'll need to patch core to a)
inject a hook here to override the default behavior or b) make it always use a
Content object (unless, perhaps, told explicitly not to).
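A rough sketch of what option b) would mean at the point where Export.php
currently calls Revision::getRevisionText (not an actual patch, just the idea):

  $rev = Revision::newFromRow( $row );
  $content = $rev->getContent( Revision::RAW );

  if ( $content === null ) {
      $text = '';
  } else {
      // For Wikibase entities this delegates to the entity's ContentHandler,
      // which is where the canonical JSON serialization would come from; for
      // plain wikitext pages it yields the same text as before.
      $text = $content->serialize( $content->getDefaultFormat() );
  }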

This is not hard to code, but doing it Right (tm) may need some discussion, and
getting it merged may need some time.

Sorry for not checking this earlier.
Daniel

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Managing dependencies when extending Wikibase

2014-03-13 Thread Daniel Kinzler
I have discussed the dependency issue with Jeroen, so here is what I took away
from the conversation:

* factoring the service interfaces out of the Wikibase extension would be nice
to have, but is not necessary to resolve the present issue.

* An extension/plugin (in this case, the property suggester) will indeed
typically have a dependency on the application/framework it was written for (in
this case, wikibase).

* When installing a plugin stand-alone, the application (here: wikibase) would
be installed somewhere in the vendor directory. This is fine for runnign unit
tests against the plugin (the property suggester), but of course doesn't make
much sense when we want to use wikibase and the suggester as mediawiki
extensions (especially not if MediaWiki itself was pulled in as a dependency).

* In order to install extensions for an application in a way that the extensions
are installed under the application, even though they depend on the app, and
not vice versa, a local build can be used:

* We would create a composer manifest that defines the app (wikibase) and the
extensions (the suggester, etc) as dependencies, and then use composer to
install that. This will cause wikibase and the suggester to be installed
together, side by side, rather than putting wikibase under the suggester.

* In fact, we already do something like this with the Wikidata extension,
which is just a build of Wikibase with all the dependencies and additions we 
want.

HTH
-- daniel

Am 06.03.2014 16:03, schrieb Daniel Kinzler:
 The folks of the Wikidata.lib project at the Hasso Plattner Institut have
 developed an extension to Wikibase that allows us to suggest properties to add
 to items, based on the properties already present (a very cool project, btw).
 
 This is, conceptually, an extension to the Wikibase extension. This raises
 problems for managing dependencies:
 
 * conceptually, the extension (property suggester) depends *on* wikibase.
 * practically, we want to install the property suggester as an optional
 dependency (feature/plugin/extension) *of* wikibase.
 
 So, how do we best express this? How can composer handle this?
 
 I think the most obvious/important thing to do is to have a separate module 
 for
 the interface wikibase exposes to plugins/extensions. This would include the
 interfaces of the service objects, and some factory/registry classes for
 accessing these.
 
 What's the best practice for exposing such a set of interfaces? How is this 
 best
 expressed in terms of composer manifests? What are the next steps to resolve 
 the
 circular/inverse dependencies we currently have?
 
 -- daniel
 
 PS: Below is an email in which Moritz Finke listed the dependencies the 
 property
 suggester currently has:
 
 
 Original Message 
 Subject: PropertySuggester Dependencies
 Date: Thu, 6 Mar 2014 11:07:56 +
 From: Finke, Moritz moritz.fi...@student.hpi.uni-potsdam.de
 To: Daniel Kinzler daniel.kinz...@wikimedia.de
 
 Hi,
 
 below are the dependencies of the PropertySuggester, sorted by class...
 
 Regards, Moritz
 
 PropertySuggester dependencies:
 
 GetSuggestions:
 
 use Wikibase\DataModel\Entity\ItemId;
 use Wikibase\DataModel\Entity\Property;
 use Wikibase\DataModel\Entity\PropertyId;
 use Wikibase\EntityLookup;
 use Wikibase\Repo\WikibaseRepo;
 use Wikibase\StoreFactory;
 use Wikibase\Utils;
 
 use ApiBase;
 use ApiMain;
 use DerivativeRequest;
 
 WikibaseRepo::getDefaultInstance()->getEntityContentFactory();
 StoreFactory::getStore( 'sqlstore' )->getEntityLookup();
 StoreFactory::getStore()->getTermIndex()->getTermsOfEntities( $ids, 'property',
 $language );
 Utils::getLanguageCodes()
 'type' => Property::ENTITY_TYPE )
 
 
 SuggesterEngine:
 
 use Wikibase\DataModel\Entity\Item;
 use Wikibase\DataModel\Entity\PropertyId;
 
 
 Suggestion:
 
 use Wikibase\DataModel\Entity\PropertyId;
 
 
 SimplePHPSuggester:
 
 use Wikibase\DataModel\Entity\Item;
 use Wikibase\DataModel\Entity\PropertyId;
 
 use DatabaseBase;
 use InvalidArgumentException;
 
 
 GetSuggestionsTest:
 
 use Wikibase\Test\Api\WikibaseApiTestCase;
 
 
 SimplePHPSuggesterTest:
 
 use Wikibase\DataModel\Entity\PropertyId;
 use Wikibase\DataModel\Entity\Item;
 use Wikibase\DataModel\Claim\Statement;
 use Wikibase\DataModel\Snak\PropertySomeValueSnak;
 
 use DatabaseBase;
 use MediaWikiTestCase;
 
 JavaScript:
 
 wikibase.entityselector
 wbEntityId
 
 
 


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Managing dependencies when extending Wikibase

2014-03-06 Thread Daniel Kinzler
The folks of the Wikidata.lib project at the Hasso Plattner Institut have
developed an extension to Wikibase that allows us to suggest properties to add
to items, based on the properties already present (a very cool project, btw).

This is, conceptually, an extension to the Wikibase extension. This raises
problems for managing dependencies:

* conceptually, the extension (property suggester) depends *on* wikibase.
* practically, we want to install the property suggester as an optional
dependency (feature/plugin/extension) *of* wikibase.

So, how do we best express this? How can composer handle this?

I think the most obvious/important thing to do is to have a separate module for
the interface wikibase exposes to plugins/extensions. This would include the
interfaces of the service objects, and some factory/registry classes for
accessing these.

What's the best practice for exposing such a set of interfaces? How is this best
expressed in terms of composer manifests? What are the next steps to resolve the
circular/inverse dependencies we currently have?

-- daniel

PS: Below is an email in which Moritz Finke listed the dependencies the property
suggester currently has:


 Original Message 
Subject: PropertySuggester Dependencies
Date: Thu, 6 Mar 2014 11:07:56 +
From: Finke, Moritz moritz.fi...@student.hpi.uni-potsdam.de
To: Daniel Kinzler daniel.kinz...@wikimedia.de

Hi,

below are the dependencies of the PropertySuggester, sorted by class...

Regards, Moritz

PropertySuggester dependencies:

GetSuggestions:

use Wikibase\DataModel\Entity\ItemId;
use Wikibase\DataModel\Entity\Property;
use Wikibase\DataModel\Entity\PropertyId;
use Wikibase\EntityLookup;
use Wikibase\Repo\WikibaseRepo;
use Wikibase\StoreFactory;
use Wikibase\Utils;

use ApiBase;
use ApiMain;
use DerivativeRequest;

WikibaseRepo::getDefaultInstance()->getEntityContentFactory();
StoreFactory::getStore( 'sqlstore' )->getEntityLookup();
StoreFactory::getStore()->getTermIndex()->getTermsOfEntities( $ids, 'property',
$language );
Utils::getLanguageCodes()
'type' => Property::ENTITY_TYPE )


SuggesterEngine:

use Wikibase\DataModel\Entity\Item;
use Wikibase\DataModel\Entity\PropertyId;


Suggestion:

use Wikibase\DataModel\Entity\PropertyId;


SimplePHPSuggester:

use Wikibase\DataModel\Entity\Item;
use Wikibase\DataModel\Entity\PropertyId;

use DatabaseBase;
use InvalidArgumentException;


GetSuggestionsTest:

use Wikibase\Test\Api\WikibaseApiTestCase;


SimplePHPSuggesterTest:

use Wikibase\DataModel\Entity\PropertyId;
use Wikibase\DataModel\Entity\Item;
use Wikibase\DataModel\Claim\Statement;
use Wikibase\DataModel\Snak\PropertySomeValueSnak;

use DatabaseBase;
use MediaWikiTestCase;

JavaScript:

wikibase.entityselector
wbEntityId



attachment: winmail.dat
___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] wbsetclaim

2014-02-26 Thread Daniel Kinzler
Am 26.02.2014 18:41, schrieb Jeroen De Dauw:
 Uh, didn't we fix this a long time ago? Client-Supplied GUIDs are evil :(
 
 This has come up at some point, and as far as I recall, we dropped the
 requirement to provide the GUID. So I suspect one can provide a claim without 
 a
 GUID, else something went wrong somewhere.

I have filed https://bugzilla.wikimedia.org/show_bug.cgi?id=61950 now.

-- daniel

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Badges

2014-02-05 Thread Daniel Kinzler
Am 05.02.2014 22:40, schrieb Bene*:
 Am 05.02.2014 18:00, schrieb Bene*:
 Interesting, so in your opinion the actual display of items should happen via
 the common.css? I think this can work though I don't know if we should leave
 this implementation detail to the local wikis. At least, it would prevent
 another config to be added to the client which is very recommened from my
 side. Also the wiki could rank the badges easier. (New css properties 
 override
 old ones.) Thus I support your idea leaving this to the client wikis.

I think that it's up to the local wiki to decide which badges to show, and how.
Being able to manage this on-wiki seems like a good idea.

 Another question, however, is which tooltip title should be added to the
 badges sitelink. We could use the description of the wikidata item but I am
 not sure if we can access it easily from client. However, it would provide an
 easy way to translate the tooltip without some hacky mediawiki messages.

 Best regards,
 Bene*
 In addition to the previous message, we still have to decide on one badge if 
 we
 want to add a tooltip title. However, I don't think it makes sense to add a
 config variable only for the tooltip. Do you have any idea how to fix this?

A bit of JS code could set the tooltip based on the css classes. Access to the
label associated with the badge would be possible by querying wikidata, but it
would be nicer if we could somehow cache that info along with the page content.
Otherwise, it would have to be fetched for every page view on wikipedia... not 
good.

-- daniel


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Report from the Architecture Summit

2014-01-31 Thread Daniel Kinzler
(reposting, accidentally posted this to the internal list at first)

Hey. Here's a brief summary of what I talked to folks in SF about, what the
result was, or who we should contact to move forward.

* At the architecture summit, there seemed to be wide agreement that we need to
improve modularity in core. The TitleValue proposal was viewed as going too far
to the dark side of javafication, but it was generally seen to be moving in
the right direction. I will update the change soon to address some comments.

* Furthermore, we (the core developers) should seek out service interfaces that
can and should be factored out of existing classes, starting with pathological
cases like EditPage, Title, or User. Several people agreed to look into that
(and at the same time watch out to avoid javafication), and Nik Everett
volunteered to lead the discussion.

* Gabriel Wicke has interesting plans for factoring out storage services (both
low level blob storage as well as higher level revision storage) into separate
HTTP/REST services.

* Yurik is working on a library/extension for JSON based configuration storage
for extensions. Needs review/feedback, I'm looking into that.

* I asked Aaron to provide a JobSpecification interface, so jobs can be
scheduled without having to instantiate the class that will be used to execute
the job. This makes it easier to post jobs from one wiki to another. Aaron has
already implemented this now, yay!

* Yurik wants us to rework the Wikibase API to be compatible with the core API's
query infrastructure. This would allow us to use item lists generated by one
module as the input for another module. See
https://www.mediawiki.org/wiki/Requests_for_comment/Wikidata_API

* After talking to Chad, I'm now pretty sure we should go for ElasticSearch for
implementing queries right away. It just seems a lot simpler than using MySQL
for the baseline implementation. This however makes ElasticSearch a dependency
of WikibaseQuery, making it harder for third parties to set up queries (though
setting up Elastic seems pretty simple).

* Brion would like to be in the loop on the PubsubHubbub project. For the
operations side, and the question whether WMF would want to run their own hub,
he pointed me to Ori and Mark Bergsma.

* I didn't make progress wrt the JSON dumps. Need to get hold of Ariel, he
wasn't around. We need to find out what makes the dumps so slow. Aaron Schulz
agreed to help with that. One problematic aspect of the current implementation
is that it tries to retrieve all entity IDs with a single DB query. We might
need to chunk that.

* For the future use of composer, we should be in touch with Markus Glaser and
Hexmode (Mark Hershberger), as well as with Hashar.

* Hashar is quite interested in switching to composer and perhaps also Travis.
He was happy to hear that travis is Berlin based and sympathetic. The WMF might
even be ready to invest a bit into making Travis work with our workflow. Hashar
may come and visit us, poke him about it!

* For access to the new log stash service, we should talk to Ken Snider

* For shell access we should talk to Quim.

* I discussed allowing queries on page_props by property value with Tim as well
as Roan. Tim suggested to add a pp_sortkey column to page_props (a float, but
nullable), and index by pp_propname+pp_sortkey. That should cover most use cases
nicely, without big schema changes.


So, lots to follow up on!

Cheers
Daniel

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Adding configuration to WikibaseLib

2014-01-30 Thread Daniel Kinzler
Am 30.01.2014 17:15, schrieb Jeroen De Dauw:
 Hey,
 
 It has long since been clear it is harmful to add configuration into
 WikibaseLib. It is a library, not an application, and its users might well 
 want
 to use it with different config.
 
 This means that no additional entries should be added to
 WikibaseLib.default.php, and that commits that do should not be merged.

I see your point (library code should not access Settings objects, but use
explicit parameters), but this will make it difficult to manage settings shared
by repo and client in a single place. Having these in one place makes sure they
are consistent, which is especially important when running both repo and client
on the same wiki.

Do you have a suggestion how to solve this? We have had different settings for
the same thing in repo and client before; I would like to avoid that in the
future.
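
Just to make the question concrete, one conceivable shape would be a shared
defaults file that both repo and client pull into their own defaults (purely a
sketch - file and setting names here are made up for illustration):

<?php
// Wikibase.shared.default.php (hypothetical): plain data, no Settings object,
// no library-level state.
return array(
	'siteGlobalID'         => 'enwiki',
	'sharedCacheKeyPrefix' => 'wikibase_shared',
);

The repo and client default files could then merge this array into their own
defaults, so both read the same values in a single place while WikibaseLib
itself stays configuration-free.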

-- daniel


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Wikibase data exchange format

2014-01-16 Thread Daniel Kinzler
Am 16.01.2014 17:18, schrieb Brueckner, Sebastian:
 Hey everyone,
 
 As we have just introduced ourselves on wikidata-l, we are currently working 
 on
 a PubSubHubbub [1] extension [2] to MediaWiki. Currently, the extension only
 works on MediaWiki articles, not on Wikibase objects. For those articles we 
 are
 using the wiki markup as exchange format (using URLs with action=raw), but
 currently there is no equivalent in Wikibase.

Jeroen already explained the canonical JSON format. In this context I would
like to add some information about the URI scheme we use for our linked data
interface, which should also be used for PuSH, I think:

The canonical URI of the *description* of a Wikidata item has the form
http://www.wikidata.org/wiki/Special:EntityData/Q64. This URI is
format-agnostic, content negotiation is used to redirect to the appropriate
concrete URL (in a web browser, the redirect will typically take you to the
item's normal page). A format can also be specified directly by giving a file
extension, e.g. http://www.wikidata.org/wiki/Special:EntityData/Q64.json

In contrast, the canonical URI of the *concept* described by a Wikidata item
follows the form https://www.wikidata.org/entity/Q64.

I suggest to use format-agnostic canonical *description* URI for PuSH 
notifications.
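
For anyone wiring this up, a quick illustration of the two access styles
described above (plain PHP, no MediaWiki needed; error handling omitted):

<?php
// 1) Explicit format: no content negotiation involved.
$json = file_get_contents(
	'http://www.wikidata.org/wiki/Special:EntityData/Q64.json'
);
$data = json_decode( $json, true );

// 2) Format-agnostic description URI: ask for JSON via the Accept header
// and let the redirect take us to the concrete URL.
$context = stream_context_create( array(
	'http' => array(
		'header'          => "Accept: application/json\r\n",
		'follow_location' => 1,
	),
) );
$json = file_get_contents(
	'http://www.wikidata.org/wiki/Special:EntityData/Q64', false, $context
);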

The URI scheme is described at
https://meta.wikimedia.org/wiki/Wikidata/Notes/URI_scheme, but please note
that that document was a working draft, and some aspects may be outdated.

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Wikibase data exchange format

2014-01-16 Thread Daniel Kinzler
Am 16.01.2014 17:18, schrieb Brueckner, Sebastian:
 For those articles we are
 using the wiki markup as exchange format (using URLs with action=raw), but
 currently there is no equivalent in Wikibase.

I'm actually not sure action=raw is a good choice for wikitext - it's an old,
deprecated interface, and has several shortcomings. I'd suggest to use a
canonical document URI - such as the plain article URL. The URI just identifies
what was changed; the client may have (and may even need) additional knowledge to
retrieve the updated content in the desired form. At least that is my
understanding of how PuSH works - if this is not the case, I see no good way to
support multiple content formats.

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Wikibase data exchange format

2014-01-16 Thread Daniel Kinzler
Am 17.01.2014 08:39, schrieb Daniel Kinzler:
 I suggest to use format-agnostic canonical *description* URI for PuSH 
 notifications.

I just realized that this will not work well, since the hub will retrieve that
data, and all clients would then receive the data in the format the hub (not the
clients/subscribers) prefers.

To avoid this, a format-specific description URL can be used, e.g.
http://www.wikidata.org/wiki/Special:EntityData/Q64.json
-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] locally run lua scripts

2014-01-05 Thread Daniel Kinzler
Am 05.01.2014 15:02, schrieb Liangent:
 On Sun, Jan 5, 2014 at 9:34 PM, Voß, Jakob jakob.v...@gbv.de
 mailto:jakob.v...@gbv.de wrote:
 
  If what you're executing is not something huge, doesn't require (m)any
  external dependencies, and doesn't have user interaction, you can try
  to (ab)use Scribunto's console AJAX interface:
 
 Thanks, I used your example to set up a git repository with notes. I
 planned to clone the full module-namespace with git, 
 
 
 Huh this makes me think of a git-mediawiki tool (compare with git-svn).
 
 There's already an (inactive) wikipediafs http://wikipediafs.sourceforge.net/

There's also the (inactive) levitation project:
https://github.com/scy/levitation - a project to convert Wikipedia database
dumps into Git
repositories. It doesn't scale for Wikipedia, but should work fine for smaller
dumps.

-- daniel


-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Improving leaky test cases

2013-12-19 Thread Daniel Kinzler
I'm about to sign off for the holidays, until January 6th, so here's a quick
heads up:

For investigating sporadic failures of test cases, I have created a branch of
wikibase on github, which has travis set up for testing:

https://github.com/wmde/Wikibase/tree/fixtravis
https://travis-ci.org/wmde/Wikibase

This branch contains quite a few fixes/improvements to test cases. It would be
good to have them on gerrit soon.

The following tests were identified as (probably) using hard-coded entity IDs in
an unhealthy way, but I didn't get around to fixing them yet:

repo/tests/phpunit/includes/api/SetClaimTest.php
repo/tests/phpunit/includes/api/SetQualifierTest.php
repo/tests/phpunit/includes/api/SetReferenceTest.php
repo/tests/phpunit/includes/api/SetSiteLinkTest.php

They should probably be fixed along the same lines as MergeItemsTest, using the
new EntityTestHelper::injectIds method to inject real ids for placeholders in
the data the providers return.
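
For reference, the pattern looks roughly like this (from memory, the exact
helper signature may differ - check MergeItemsTest):

<?php
// Provider data uses symbolic handles instead of hard-coded entity IDs.
$params = array(
	'id'        => '%Berlin%',   // placeholder handle, not a real ID
	'linksite'  => 'dewiki',
	'linktitle' => 'Berlin',
);

// Once the fixture entities exist, replace the handles with the IDs they
// actually got in this test run (assumed call, see EntityTestHelper).
EntityTestHelper::injectIds( $params );

// $params['id'] is now something like 'Q123' - whatever the "Berlin"
// fixture item was assigned - so the test no longer depends on a fixed ID.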

Cheers!
Daniel

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] [Wikitech-l] Help needed with ParserCache::getKey() and ParserCache::getOptionsKey()

2013-12-11 Thread Daniel Kinzler
Am 10.12.2013 22:38, schrieb Brad Jorsch (Anomie):
 Looking at the code, ParserCache::getOptionsKey() is used to get the
 memc key which has a list of parser option names actually used when
 parsing the page. So for example, if a page uses only math and
 thumbsize while being parsed, the value would be array( 'math',
 'thumbsize' ).

Am 11.12.2013 02:35, schrieb Tim Starling:
 No, the set of options which fragment the cache is the same for all
 users. So if the user language is included in that set of options,
 then users with different languages will get different parser cache
 objects.

Ah, right, thanks! Got myself confused there.

The thing is: we are changing what's in the list of relevant options. Before the
deployment, there was nothing in it, while with the new code, the user language
should be there. I suppose that means we need to purge these pointers.

Would bumping wgCacheEpoch be sufficient for that? Note that we don't care much
about purging the actual parser cache entries; we want to purge the pointer
entries in the cache.

 We just tried to enable the use of the parser cache for wikidata, and it 
 failed,
 resulting in page content being shown in random languages.
 
 That's probably because you incorrectly used $wgLang or
 RequestContext::getLanguage(). The user language for the parser is the
 one you get from ParserOptions::getUserLangObj().

Oh, thanks for that hint! It seems our code is inconsistent about this, using the
language from the parser options in some places and the one from the context in
others. Need to fix that!
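
For the record, the difference boils down to this (a sketch; $parserOptions
stands for whatever ParserOptions object the rendering code has at hand):

<?php
// Wrong for cacheable output: the request context language (or $wgLang)
// bypasses the parser cache bookkeeping, so cached HTML leaks across
// user languages.
$lang = RequestContext::getMain()->getLanguage();

// Right: asking the ParserOptions records "userlang" as a used option
// (via the access callback), so the cache gets fragmented per language.
$lang = $parserOptions->getUserLangObj();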

 It's not necessary to call ParserOutput::recordOption().
 ParserOptions::getUserLangObj() will call it for you (via
 onAccessCallback).

Oh great, magic hidden information flow :)

Thanks for the info, I'll get hacking on it!

-- daniel


___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Help needed with ParserCache::getKey() and ParserCache::getOptionsKey()

2013-12-10 Thread Daniel Kinzler
Hi.

I (rather urgently) need some input from someone who understands how parser
caching works. (Rob: please forward as appropriate).

tl;dr:

what is the intention behind the current implementation of
ParserCache::getOptionsKey()? It's based on the page ID only, not taking into
account any options. This seems to imply that all users share the same parser
cache key, ignoring all options that may impact cached content. Is that
correct/intended? If so, why all the trouble with ParserOutput::recordOption, 
etc?


Background:

We just tried to enable the use of the parser cache for wikidata, and it failed,
resulting in page content being shown in random languages.

I tried to split the parser cache by user language using
ParserOutput::recordOption to include userlang in the cache key. When tested
locally, and also on our test system, that seemed to work fine (which seems
strange now, looking at the code of getOptionsKey()).

On the live site, however, it failed.

Judging by its name, getOptionsKey should generate a key that includes all
options relevant to caching page content in the parser cache. But it seems it
forces the same parser cache entry for all users. Is this intended?


Possible fix:

ParserCache::getOptionsKey could delegate to ContentHandler::getOptionsKey,
which could then be used to override the default behavior. Would that be a
sensible approach?

And if so, would it be feasible to push out such a change before the holidays?

Thanks,
Daniel

-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] future of the entity suggester

2013-11-25 Thread Daniel Kinzler
Hello Nilesh!

Good to hear from you. I was off for a couple of days, and asked Lydia to make
introductions. Thanks Lydia!

A quick heads up:

The architecture we have discussed with the team at the HPI is a bit different
from what we designed for the GSoC project. The idea is to have a MediaWiki
extension that relies directly on the data in a MySQL table, and generates
suggestions from that. It does not care where the data comes from, so the
database table(s) serve as an interface between the front (MediaWiki) part
and the back (data analysis) part. This has two advantages: 1) front and back
are decoupled and only have to agree on the structure and interpretation of the
data in the database (this is the current TODO). 2) No new services need to be
deployed in the public-facing subnet.
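
To make the decoupling concrete, the front end would do little more than
something like this (table and column names are invented here - agreeing on the
real ones is exactly the open TODO):

<?php
// Hypothetical suggestions table: (existing_property, suggested_property, score).
$dbr = wfGetDB( DB_SLAVE );

$res = $dbr->select(
	'wbs_suggestions',                      // invented table name
	array( 'suggested_property', 'score' ),
	array( 'existing_property' => 31 ),     // e.g. suggest based on P31
	__METHOD__,
	array( 'ORDER BY' => 'score DESC', 'LIMIT' => 10 )
);

foreach ( $res as $row ) {
	// The front end only sees rows; how the back end computed the scores
	// (Hadoop, Myrrix, plain SQL aggregation, ...) is irrelevant here.
	print "P{$row->suggested_property} ({$row->score})\n";
}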

I think your expertise with data ingestion could help the folks at the HPI quite
a bit. Also, the modular architecture allows for data analysis components to be
swapped out easily, and we would like to try and compare different approaches
for data analysis. One based on Hadoop and/or Myrrix could well be an option -
though I'm not sure whether Myrrix would be very useful, since the actual
generation of suggestions from the pre-processed data would already be covered.

This is just an idea; I think you can best figure things out among yourselves.

Cheers,
Daniel

Am 25.11.2013 17:01, schrieb Lydia Pintscher:
 Hey everyone,
 
 I have the feeling it would be good to make an official introduction.
 Nilesh has been working on the Wikidata entity suggester. There is now
 a team of students who are working on the entity suggester to get it
 finished and ready for production as part of their bachelor project.
 It would be good if you could work together and coordinate on the
 public wikidata-tech list. I'm sure with you all working together we
 can provide the Wikidata community with the great entity suggester
 they are waiting for.
 Virginia and co: Are you still having issues with the data import?
 Maybe Nilesh can help you with that as a first good step.
 
 
 Cheers
 Lydia
 


___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] RFC: TitleValue

2013-09-25 Thread Daniel Kinzler
Hi all!

As discussed at the MediaWiki Architecture session at Wikimania, I have created
an RFC for the TitleValue class, which could be used to replace the heavy-weight
Title class in many places. The idea is to showcase the advantages (and
difficulties) of using true value objects as opposed to active records - the
idea being that hair should not know how to cut itself.

You can find the proposal here:
https://www.mediawiki.org/wiki/Requests_for_comment/TitleValue

Any feedback would be greatly appreciated.

-- daniel


PS: I have included some parts of the proposal below, to give a quick
impression.

--

== Motivation ==

The old Title class is huge and has many dependencies. It relies on global state
for things like namespace resolution and permission checks. It requires a
database connection for caching.

This makes it hard to use Title objects in a different context, such as unit
tests. Which in turn makes it quite difficult to write any clean unit tests (not
using any global state) for MediaWiki since Title objects are required as
parameters by many classes.

In a more fundamental sense, the fact that Title has so many dependencies, and
everything that uses a Title object inherits all of these dependencies, means
that the MediaWiki codebase as a whole has highly tangled dependencies, and it
is very hard to use individual classes separately.

Instead of trying to refactor and redefine the Title class, this proposal
suggests introducing an alternative class that can be used instead of a Title
object to represent the title of a wiki page. The implementation of the old
Title class should be changed to rely on the new code where possible, but its
interface and behavior should not change.

== Architecture ==

The proposed architecture consists of three parts, initially:

# The TitleValue class itself. As a value object, this has no knowledge about
namespaces, permissions, etc. It does not support normalization either, since
that would require knowledge about the local configuration.

# A TitleParser service that has configuration knowledge about namespaces and
normalization rules. Any class that needs to turn a string into a TitleValue
should require a TitleParser service as a constructor argument (dependency
injection). Should that not be possible, a TitleParser can be obtained from a
global registry.

# A TitleFormatter service that has configuration knowledge about namespaces and
normalization rules. Any class that needs to turn a TitleValue into a string
should require a TitleFormatter service as a constructor argument (dependency
injection). Should that not be possible, a TitleFormatter can be obtained from a
global registry.
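
To give an impression of the intended usage pattern (a sketch only - the method
names here are illustrative, not the final interface):

<?php
// A consumer gets the formatter injected and never touches global state.
class PageLister {

	protected $titleFormatter;

	public function __construct( TitleFormatter $titleFormatter ) {
		$this->titleFormatter = $titleFormatter;
	}

	/**
	 * @param TitleValue[] $titleValues
	 * @return string[] prefixed title texts
	 */
	public function listTitles( array $titleValues ) {
		$out = array();
		foreach ( $titleValues as $titleValue ) {
			// A TitleValue knows only its namespace ID and text; turning it
			// into a prefixed string is the formatter's job.
			$out[] = $this->titleFormatter->getPrefixedText( $titleValue );
		}
		return $out;
	}
}

In a unit test, the formatter can simply be a mock, which is the whole point of
the exercise.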



___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] [Wikidata-l] [Pywikipedia-l] wbsearchentities()

2013-09-13 Thread Daniel Kinzler
Hi all!

We have to impose a fixed limit on search results, since they cannot be ordered
by a unique ID, which makes paging expensive.

The default for this limit is 50, but it SHOULD be 500 for bots. But the higher
limit for bots is currently not applied by the wbsearchentities module - that's
a bug, see https://bugzilla.wikimedia.org/show_bug.cgi?id=54096.

We should be able to fix this soon. Please poke us again if nothing happens for
a couple of weeks.

-- daniel

Am 12.09.2013 12:12, schrieb Merlijn van Deen:
 On 11 September 2013 20:31, Chinmay Naik chin.nai...@gmail.com wrote:
 
 Can i retreive more than 100 items using this? I notice the
 'search-continue' returned by the search result disappears after 50 items.
 for ex
 https://www.wikidata.org/wiki/Special:ApiSandbox#action=wbsearchentitiesformat=jsonsearch=abclanguage=entype=itemlimit=10continue=50


 The api docs at https://www.wikidata.org/w/api.php explicitly state the
 highest value for 'continue' is 50:
 
   limit   - Maximal number of results
 The value must be between 0 and 50
 Default: 7
   continue- Offset where to continue a search
 The value must be between 0 and 50
 Default: 0
 
 which indeed suggests there is a hard limit of 100 entries. Maybe someone
 in the Wikidata dev team can explain the reason behind this?
 
 Merlijn
 
 
 
 ___
 Wikidata-l mailing list
 wikidat...@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l
 


___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] IRI, URL, sameAs and the identifier mess

2013-09-13 Thread Daniel Kinzler
Am 03.09.2013 21:43, schrieb David Cuenca:
 A couple of months ago there was a conversation about what to do with the
 identifiers that should be owl:sameAs [1]

It's unclear to me where owl:sameAs would be used... it should definitely NOT
be used to point to descriptions of the same thing in other repositories.

See
https://www.wikidata.org/w/index.php?title=Wikidata%3AProject_chat%2FArchive%2F2013%2F07diff=70181630oldid=66375829.

 Then there is another discussion about using a formatter URL property to
 use any catalogue/db as an id instead of creating a property [2]

That seems fine to me.

 Now there is another property proposal to implement sameAs as a property
 taking a url. [3]

Ick! That's just utterly wrong! I'll leave a message.

 And this is all related to the recent thread in this mailing list about
 IRI-value or string-value for URLs.

That is a misunderstanding. That was purely about the internal representation of
these values in code. It has nothing to do with whether or not the data type
itself will support URI values or just strict URIs or URLs.

The URL data type should support any URL you can use in wikitext (there are some
known issues with non-ascii domains right now, but that's a bug and being worked
on).

 So, in the end, what is the preferred approach?

I can't tell you what the Wikidata community currently views as the best way.
Personally, I would use separate properties for different identifiers, and
document how each such identifier maps to a URI/URL.

The url data type can be used for URLs, URIs, IRIs, etc. It's just a question
of convention and of how you interpret the respective properties.

-- daniel


___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] [Wikidata-l] [Pywikipedia-l] wbsearchentities()

2013-09-13 Thread Daniel Kinzler
Am 13.09.2013 18:24, schrieb Benjamin Good:
 Daniel,
 
 Even 500 seems like a very low limit for this system unless I'm
 misunderstanding something.  Unless there is another way to execute queries
 that return more rows than that, this would negate the possibility of a
 huge number of applications - all of ours in particular.  If we want to
 say, request something like all human genes (about 20,000 items), how
 would we do that?

You are looking for actual *query* support, not just a search by name. This is
on the road map, and I hope we will be able to deploy it by the end of the year.
But it's not possible yet.

Supporting queries like "all people born in Hamburg" or "all cities in Europe"
is an obvious goal for Wikidata. And we are working on it, but it's not trivial
to make this scale to the number of entries, queries and different properties we
are dealing with.

 Within Wikipedia, we do this via the mediawiki API based on
 contains-template or category queries without any issue.  Certainly
 wikidata will be more useful for queries than raw mediawiki???

See above.

 I'm certain I am missing something, please clarify.
 
 This is currently standing in the way of our GSoC student completing his
 summer project - due next week.  A little disappointing for him..

Sorry, but we have never hidden the fact that our query interface is not ready
yet. wbsearchentities is a label lookup designed for find-as-you-type
suggestions. It's not a query interface, and was never supposed to be.

I understand the disappointment, but there is little we can do about this now.

All I can suggest is working from a dump right now (and sadly, we only have
mediawiki's raw json-in-xml dumps at the moment. I'm working on native JSON and
RDF dumps, but they are not ready).

-- daniel


___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] BREAKING CHANGE: Wikidata API changing to upper-case IDs.

2013-09-10 Thread Daniel Kinzler
Hi all.


With today's deployment, the Wikibase API modules used on wikidata.org will
change from using lower-case IDs (q12345) to upper-case IDs (Q12345). This is
done for consistency with the way IDs are shown in the UI and used in URLs.

The API will continue to accept entity IDs in lower-case as well as upper-case.
Any bot or other client that has no property or item IDs hardcoded or configured
in lower case should be fine.

If however your code looks for some specific item or property in the output
returned from the API, and it's using a lower-case ID to do so, it may now fail
to match the respective ID.
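
A defensive client-side pattern is to normalize before comparing, so the code
works with both the old and the new form:

<?php
$expectedId = 'q12345';   // e.g. from an old config file
$idFromApi  = 'Q12345';   // as returned by the API after the change

if ( strtoupper( $idFromApi ) === strtoupper( $expectedId ) ) {
	// match, regardless of which casing either side uses
}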

There is potential for similar problems with Lua code, depending on how the data
structure is processed by Lua. We are working to minimize the impact there.

Sorry for the short notice.

Please test your code against test.wikidata.org and let us know if you find any
issues.


Thanks,
Daniel


PS: issue report on bugzilla: 
https://bugzilla.wikimedia.org/show_bug.cgi?id=53894

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] IRI-value or string-value for URLs?

2013-09-03 Thread Daniel Kinzler

Am 03.09.2013 11:50, schrieb Lydia Pintscher:

On Mon, Sep 2, 2013 at 11:56 AM, Denny Vrandečić
denny.vrande...@wikimedia.de wrote:

OK, based on the discussion so far, we will add the data type to the snak in
the external export, and keep the string data value for the URL data type.
That should satisfy all use cases that have been brought up.


Just so I know what's coming: Is this doable for the deployment in a week?


If we push back something else, yes. But I think this is mainly useful in JSON 
dumps - which we don't have yet. Not hard to do, but won't happen in a week.


-- daniel


___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] IRI-value or string-value for URLs?

2013-09-01 Thread Daniel Kinzler

Am 30.08.2013 17:21, schrieb Denny Vrandečić:

I do see an advantage of stating the property datatype in a snak in the
external JSON representation, and am trying to understand what prevents us
from doing so.


Not much, the SnakSerializer would need access to the PropertyDataTypeLookup 
service, injected via the SerializerFactory. SnakSerializer already has:

// TODO: we might want to include the data type of the property here as well

-- daniel

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Is assert() allowed?

2013-07-31 Thread Daniel Kinzler

Am 31.07.2013 13:42, schrieb Tim Starling:

We could have a library of PHPUnit-style assertion functions which
throw exceptions and don't act like eval(), I would be fine with that.
Maybe MWAssert::greaterThan( $foo, $bar ) or something.


I like that! It should support an error message as an optional parameter. I
suppose we could just steal the method signatures from PHPUnit.
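
A rough sketch of what such a helper could look like (name and details
obviously up for discussion):

<?php
class MWAssertionError extends Exception {
}

class MWAssert {

	/**
	 * @throws MWAssertionError if $value is not greater than $bound
	 */
	public static function greaterThan( $value, $bound, $message = '' ) {
		if ( $value <= $bound ) {
			throw new MWAssertionError(
				( $message !== '' ? "$message: " : '' ) .
				"expected a value greater than $bound, got $value"
			);
		}
	}
}

// Usage: throws a real exception instead of eval()-ing a string like assert().
$batch = array( 1, 2, 3 );
MWAssert::greaterThan( count( $batch ), 0, 'batch must not be empty' );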


-- daniel


___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Jenkins failing for no good reason.

2013-07-16 Thread Daniel Kinzler
There seems to be an issue with Jenkins. It appears to use an old version of
other extensions under some circumstances.

It's like this:

If you submit change 33 for extension A, which needs change 44 in extension B
(which isn't merged yet), Jenkins will correctly fail.

BUT: When change 44 got merged into extension B, and you force Jenkins to re-run
(e.g. by rebasing change 33), it will *still* fail, apparently using an old
version of extension B.

It seems this is only the case for the testextensions-master job, not the
standalone repo and client jobs.

Here are some examples:

https://gerrit.wikimedia.org/r/#/c/72962/ fails for no good reason
https://integration.wikimedia.org/ci/job/mwext-Wikibase-testextensions-master/3099/console

https://gerrit.wikimedia.org/r/#/c/73772/ fails for no good reason
https://integration.wikimedia.org/ci/job/mwext-Wikibase-testextensions-master/3093/console

Please gather more evidence/insights if you come across this issue.

Thanks,
Daniel

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] WikiVOYAGE deployment plan

2013-06-28 Thread Daniel Kinzler
Am 28.06.2013 11:45, schrieb Denny Vrandečić:
 * Wikipedia will not automatically and suddenly display links to 
 Wikivoyage.
 The behavior on Wikipedia actually remains completely unchanged by this
 deployment.

Let's make sure we have thorough tests for this; I'm not 100% sure how this is
currently handled on the client.

-- daniel

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Representing invalid Snaks

2013-06-25 Thread Daniel Kinzler
A quick follow up to this morning's mail:

I discussed this issue with Denny for a while, and we came up with this:

* I'll explore the possibility of using a BadValue object instead of a BadSnak,
that is, model the error on the DataValue level. My initial impression was that
this would be more work, but I'm no longer sure, and will just try and see.

* We will represent the error as a string inside the BadValue/BadSnak object.
There seem to be no immediate benefits or obvious use cases for wrapping that
in an Error object. (This is in reply to an earlier discussion on Gerrit.)
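
For illustration, the DataValue-level variant could look something like this
(class and method names invented, just to make the idea concrete):

<?php
class BadValueSketch {

	protected $rawValue;
	protected $error;

	/**
	 * @param mixed  $rawValue the raw value that failed to parse or validate
	 * @param string $error    a plain error string, per the decision above
	 */
	public function __construct( $rawValue, $error ) {
		$this->rawValue = $rawValue;
		$this->error = $error;
	}

	public function getRawValue() {
		return $this->rawValue;
	}

	public function getError() {
		return $this->error;
	}
}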

-- daniel



___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Getting the property (data) type of a PropertyValueSnak.

2013-06-13 Thread Daniel Kinzler
Am 13.06.2013 06:38, schrieb Jeroen De Dauw:
 Hey,
 
 Putting the DataType id in PropertyValueSnaks at this point seems like a bad
 idea for several reasons. Doing so would cost us quite some work and end up with
 a more complicated system as foundation. 

Changing it now would be hard.

But I think it would have been simpler and cleaner if we had gone that route
from the start.

Why would it be a bad idea? To me, it's just a self-contained data structure
that knows its own type, as it should.

 If you have a use case for which the
 current code is not well suited, I suggest writing new code for that specific
 use case. I strongly suspect this will both be simpler and less work.

Any code I can write for this now will involve injecting knowledge about
properties into the snaks post-hoc. That's going to suck.

 I remember a lengthy discussion about this, but I don't recall the 
 outcome (yes,
 we really need to write this stuff down).
 
 
 There was no decision at any point to change this, though it indeed has been
 brought up before.

Well, at some point, the decision was made, right? Was it discussed? Were the
implications of each approach compared? Is this documented somewhere?

I recall a lengthy Skype call with Markus and Denny about this, and I *seem* to
recall that we decided to store the type in the snaks - but as too often, I
don't think this is documented anywhere.

So, what's the point of whining now, since it's too late anyway? I'd like to
understand the rationale for going with the current system. And I would like to
make the case for more communication and documentation about design decisions
like this. Especially since anything concerning the internal data structure that
goes into the DB is very hard to change later.

-- daniel


___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Getting the property (data) type of a PropertyValueSnak.

2013-06-13 Thread Daniel Kinzler
Am 13.06.2013 03:22, schrieb Daniel Werner:
  -1
 Had to deal with this in the frontend as well and don't think this is
 inconvenient.  It seems like the cleanest approach. Polluting the Snaks with
 information like this for performance or convenience reasons will probably 
 cause
 more trouble in the end than keeping it as simple and pure as possible.

You think that giving a data structure information about its type is polluting
it? Why so? This seems pretty basic and straightforward to me.

-- daniel


___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech